lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre. (Long is the time, but the true comes to pass.)

ZAHRADA.

2022-03-31 03:13:31 | Science News



“One thought fills immensity.”




□ ptdalgorithms: Graph-based algorithms for phase-type distributions

>> https://www.biorxiv.org/content/10.1101/2022.03.12.484077v1.full.pdf

ptdalgorithms implements graph-based algorithms for constructing and transforming unrewarded and rewarded continuous and discrete phase-type distributions, and for computing their moments and distribution functions.

Through generalized iterative state-space construction, ptdalgorithms allows the computation of moments for huge state spaces, as well as the state probability vector of the underlying Markov chains of both time-homogeneous and time-inhomogeneous phase-type distributions.
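For orientation, the moments being computed have a closed form in the classical (non-graph) parameterization: for a continuous phase-type distribution PH(α, S) with initial vector α and sub-intensity matrix S, E[T^k] = (−1)^k k! α S^(−k) 1. A minimal numpy sketch of that formula (not the ptdalgorithms API, which works on the graph representation rather than dense matrices):

```python
import math
import numpy as np

def phase_type_moment(alpha, S, k=1):
    """k-th moment of a continuous phase-type distribution PH(alpha, S):
    E[T^k] = (-1)^k * k! * alpha @ S^(-k) @ 1."""
    S_inv_k = np.linalg.matrix_power(np.linalg.inv(S), k)
    ones = np.ones(S.shape[0])
    return (-1) ** k * math.factorial(k) * alpha @ S_inv_k @ ones

# toy Erlang(2) example: two sequential states, each left at rate 1
alpha = np.array([1.0, 0.0])
S = np.array([[-1.0,  1.0],
              [ 0.0, -1.0]])
print(phase_type_moment(alpha, S, k=1))  # mean = 2.0
print(phase_type_moment(alpha, S, k=2))  # second moment = 6.0
```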





□ SIEVE: joint inference of single-nucleotide variants and cell phylogeny from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.03.24.485657v1.full.pdf

Previous methods do not operate within a statistical phylogenetic framework and, in particular, do not infer branch lengths of the tree. Moreover, they largely follow the infinite-sites assumption (ISA).

SIEVE (SIngle-cell EVolution Explorer) exploits raw read counts for all nucleotides from scDNA-seq to reconstruct the cell phylogeny and call variants based on the inferred phylogenetic relations. SIEVE employs a statistical phylogenetic model following a finite-sites assumption.





□ Sobolev Alignment: Identifying commonalities between cell lines and tumors at the single cell level using Sobolev Alignment of deep generative models

>> https://www.biorxiv.org/content/10.1101/2022.03.08.483431v1.full.pdf

Sobolev Alignment, a computational framework which uses deep generative models to capture non-linear processes in single-cell RNA sequencing data and kernel methods to align and interpret these processes.

Recent works have shown theoretical connections, demonstrating, for instance, the equivalence between the Laplacian kernel and the so-called Neural Tangent Kernel.

The interpretation scheme relies on the decomposition of the Gaussian kernel, which we extended to the Laplacian kernel by exploiting connections between the feature spaces of Gaussian and Laplacian kernels.

Mapping towards the latent factors uses Falkon-trained kernel machines, which makes it possible to calculate the contribution of each gene to each latent factor. A consensus space is constructed by interpolation between matched Sobolev Principal Vectors, onto which all data can be projected.
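For intuition on the kernel side: the Gaussian (RBF) kernel decays with squared Euclidean distance while the Laplacian kernel decays with L1 distance. scikit-learn exposes both; the contrast below is a generic sketch (the toy data and gamma are assumptions, not part of Sobolev Alignment):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, laplacian_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # e.g. 100 cells x 20 latent features (toy)
Y = rng.normal(size=(80, 20))

K_gauss = rbf_kernel(X, Y, gamma=0.1)        # exp(-gamma * ||x - y||_2^2)
K_lap   = laplacian_kernel(X, Y, gamma=0.1)  # exp(-gamma * ||x - y||_1)
print(K_gauss.shape, K_lap.shape)            # (100, 80) (100, 80)
```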





□ scAllele: a versatile tool for the detection and analysis of variants in scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486330v1.full.pdf

scAllele, a versatile tool that performs both variant calling and functional analysis of the variants in alternative splicing using scRNA-seq. As a variant caller, scAllele reliably identifies SNVs and microindels (less than 20 bases) with low coverage.

scAllele calls nucleotide variants via local reassembly. scAllele enables read-level allelic linkage analysis. It refines read alignments and possible misalignments, and enhances variant detection accuracy per read. scAllele uses a GLM model to detect high confidence variants.





□ The complexity of the Structure and Classification of Dynamical Systems

>> https://arxiv.org/pdf/2203.10655v1.pdf

A survey of the complexity of structure, anti-structure, classification and anti-classification results in dynamical systems, focusing primarily on ergodic theory, with excursions into topological dynamical systems, and suggesting methods and problems in related areas.

Every perfect Polish space contains a non-Borel analytic set. Moreover, the analytic sets are closed under countable intersections and unions. Hence the co-analytic sets are also closed under unions and intersections.

Are there complete numerical invariants for orientation preserving diffeomorphisms of the circle up to conjugation by orientation preserving diffeomorphisms?





□ A glimpse of the toposophic landscape: Exploring mathematical objects from custom-tailored mathematical universes

>> https://arxiv.org/pdf/2204.00948.pdf

There are toposes in which the axiom of choice and the intermediate value theorem from undergraduate calculus fail, toposes in which any function R → R is continuous and toposes in which infinitesimal numbers exist.

In the semantic view, the effective topos is an alternative universe which contains its own version of the natural numbers. “There are infinitely many primes in Eff” is equivalent to the statement “for any number n, there effectively exists a prime number p > n”.





□ ALFATClust: Clustering biological sequences with dynamic sequence similarity threshold

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04643-9

ALFATClust exploits rapid pairwise alignment-free sequence distance calculations and community detection. Although ALFATClust computes a full Mash distance matrix for its graph clustering, the matrix can be significantly reduced using a divide-and-conquer approach.

ALFATClust is conceptually similar to hierarchical agglomerative clustering since its algorithm begins with each sequence (vertex) as a singleton graph cluster, and the graph clusters are gradually merged through iterations with decreasing resolution parameter γ.
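As a toy illustration of that agglomerative idea (not ALFATClust's actual community-detection implementation, which operates on a Mash-distance graph with a decreasing resolution parameter γ): start from singleton clusters and merge pairs whose distance falls under a threshold that is loosened each round.

```python
import numpy as np

def agglomerate(dist, thresholds):
    """Toy single-linkage-style merging: clusters start as singletons and any
    pair closer than the current (gradually loosened) threshold is merged."""
    clusters = [{i} for i in range(dist.shape[0])]
    for t in thresholds:                      # analogous to relaxing the resolution
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = min(dist[a, b] for a in clusters[i] for b in clusters[j])
                    if d <= t:
                        clusters[i] |= clusters.pop(j)
                        merged = True
                        break
                if merged:
                    break
    return clusters

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))])
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(agglomerate(dist, thresholds=[0.2, 0.5, 1.0]))  # expected: two clusters of five
```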





□ The Graphical R2D2 Estimator for the Precision Matrices

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485374v1.full.pdf

Graphical R2D2 (R2-induced Dirichlet Decomposition) draws Monte Carlo samples from the posterior distribution based on the graphical R2D2 prior, to estimate the precision matrix for multivariate Gaussian data.

The GR2D2 estimator has attractive properties in estimating the precision matrices, such as greater concentration near the origin and heavier tails than current shrinkage priors.

When the true precision matrix is sparse and of high dimension, the graphical R2D2 hierarchical model provides estimates close to the true distribution in Kullback-Leibler divergence and with the smallest bias for nonzero elements.
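For reference, the Kullback-Leibler divergence used for such comparisons has a closed form for zero-mean Gaussians parameterized by precision matrices: KL = ½[tr(Θ̂ Θ⁻¹) − d + ln det Θ − ln det Θ̂]. A small numpy sketch of that formula (generic, not the GR2D2 code):

```python
import numpy as np

def kl_gaussian_precision(theta_true, theta_est):
    """KL( N(0, theta_true^-1) || N(0, theta_est^-1) ) for two precision matrices."""
    d = theta_true.shape[0]
    trace_term = np.trace(theta_est @ np.linalg.inv(theta_true))
    _, logdet_true = np.linalg.slogdet(theta_true)
    _, logdet_est = np.linalg.slogdet(theta_est)
    return 0.5 * (trace_term - d + logdet_true - logdet_est)

theta_true = np.array([[2.0, 0.5], [0.5, 1.0]])
theta_est  = np.array([[1.8, 0.3], [0.3, 1.1]])
print(kl_gaussian_precision(theta_true, theta_true))  # 0.0
print(kl_gaussian_precision(theta_true, theta_est))   # small positive value
```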





□ PORTIA: Fast and accurate inference of Gene Regulatory Networks through robust precision matrix estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac178/6553011

The possible cell transcriptional states are determined by the underlying Gene Regulatory Network (GRN), and reliably inferring such network would be invaluable to understand biological processes and disease progression.

PORTIA, a novel algorithm for GRN inference based on power transforms and covariance matrix inversion. A key aspect of GRN inference is the need to disentangle direct from indirect correlations. PORTIA has thus been conceptually inspired by Direct Coupling Analysis methods.
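The covariance-inversion step that removes indirect correlations can be illustrated with partial correlations, ρ_ij = −Ω_ij / √(Ω_ii Ω_jj), where Ω is the precision (inverse covariance) matrix. The sketch below is a generic illustration of that principle, not PORTIA's full pipeline (which adds power transforms and further corrections):

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation matrix from samples X (n_samples x n_genes):
    rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj), Omega = inverse covariance."""
    omega = np.linalg.pinv(np.cov(X, rowvar=False))   # pseudo-inverse for stability
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# toy cascade gene0 -> gene1 -> gene2: gene0 and gene2 are only indirectly linked
rng = np.random.default_rng(0)
g0 = rng.normal(size=2000)
g1 = g0 + 0.5 * rng.normal(size=2000)
g2 = g1 + 0.5 * rng.normal(size=2000)
X = np.column_stack([g0, g1, g2])
print(np.round(np.corrcoef(X, rowvar=False), 2))  # g0-g2 show a marginal correlation
print(np.round(partial_correlations(X), 2))       # g0-g2 partial correlation ~ 0
```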





□ CAISC: A software to integrate copy number variations and single nucleotide mutations for genetic heterogeneity profiling and subclone detection by single-cell RNA sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04625-x

Clonal Architecture with Integration of SNV and CNV (CAISC), an R package for scRNA-seq data analysis that clusters single cells into distinct subclones by integrating CNV and SNV genotype matrices using an entropy weighted approach.

Entropy measures the structural complexity of a network, thus its concept can be utilized to integrate multiple weighted graphs or networks, or in this case, to integrate the cell–cell distance matrices generated by the DENDRO and infercnv analyses.





□ Haplotype-resolved assembly of diploid genomes without parental data

>> https://www.nature.com/articles/s41587-022-01261-x

An algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents.

The algorithm consistently outperforms existing single-sample assembly pipelines and generates assemblies of similar quality to the best pedigree-based assemblies.

It reduces unitig bipartition to a graph max-cut problem and finds a near-optimal solution with a stochastic algorithm based on the principle of simulated annealing, and it also considers the topology of the assembly graph to reduce the chance of local optima.
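A toy sketch of solving max-cut by simulated annealing on a small weighted graph (the actual unitig bipartition uses Hi-C-derived weights and the assembly-graph topology; the graph and cooling schedule below are made up):

```python
import math, random

def sa_maxcut(weights, n_iter=20000, t0=2.0):
    """Simulated annealing for max-cut: maximize the total weight of edges
    crossing a {-1, +1} bipartition of the nodes."""
    n = len(weights)
    side = [random.choice([-1, 1]) for _ in range(n)]
    cut = sum(weights[i][j] for i in range(n) for j in range(i + 1, n)
              if side[i] != side[j])
    for it in range(n_iter):
        t = t0 * (1 - it / n_iter) + 1e-6            # linear cooling schedule
        v = random.randrange(n)
        # change in cut weight if node v switches side
        delta = sum(weights[v][u] * (1 if side[v] == side[u] else -1)
                    for u in range(n) if u != v)
        if delta > 0 or random.random() < math.exp(delta / t):
            side[v] = -side[v]
            cut += delta
    return side, cut

random.seed(0)
w = [[0, 3, 1, 0], [3, 0, 0, 2], [1, 0, 0, 3], [0, 2, 3, 0]]
print(sa_maxcut(w))  # best bipartition is {0, 3} vs {1, 2} with cut weight 9
```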





□ Gfastats: conversion, evaluation and manipulation of genome sequences using assembly graphs

>> https://www.biorxiv.org/content/10.1101/2022.03.24.485682v1.full.pdf

Gfastats is a standalone tool to compute assembly summary statistics and manipulate assembly sequences in fasta, fastq, or gfa [.gz] format. Gfastats stores assembly sequences internally in a gfa-like format.

Gfastats builds a bidirected graph representation of the assembly using adjacency lists, where each node is a segment and each edge is a gap. Walking the graph makes it possible to generate different kinds of outputs, including manipulated assemblies and feature coordinates.





□ SEACells: Inference of transcriptional and epigenomic cellular states from single-cell genomics data

>> https://www.biorxiv.org/content/10.1101/2022.04.02.486748v1.full.pdf

SEACells outperforms existing algorithms in identifying accurate, compact, and well-separated metacells in both RNA and ATAC modalities across datasets with discrete cell types and continuous trajectories.

SEACells improves gene-peak associations, computes ATAC gene scores and measures gene accessibility. Using a count matrix as input, it provides per-cell weights for each metacell, per-cell hard assignments to each metacell, and the aggregated counts for each metacell as output.





□ Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing

>> https://www.nature.com/articles/s41587-022-01221-5

An approach for ultra-rapid nanopore WGS that combines an optimized sample preparation protocol, distributing sequencing over 48 flow cells, near real-time base calling and alignment, accelerated variant calling and fast variant filtration.

This cloud-based pipeline scales compute-intensive base calling and alignment across 16 instances with 4× Tesla V100 GPUs each and runs them concurrently. It aims for maximum resource utilization, with base calling using Guppy running on the GPUs while alignment with Minimap2 runs on the CPUs.





□ PEER: Transcriptome diversity is a systematic source of variation in RNA-sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009939

Probabilistic estimation of expression residuals (PEER), which infers broad variance components in gene expression measurements, has been used to account for some systematic effects, but it has remained challenging to interpret these PEER factors.

PEER “hidden” covariates encode transcriptome diversity – a simple metric based on Shannon entropy – which explains a large portion of variability in gene expression and is the strongest known factor encoded in the PEER factors.
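Transcriptome diversity here is simply the Shannon entropy of each sample's expression proportions; a minimal sketch for a genes-by-samples count matrix (a generic implementation, not the paper's code):

```python
import numpy as np

def transcriptome_diversity(counts):
    """Shannon entropy (bits) of each sample's expression profile.
    counts: genes x samples matrix of read counts."""
    p = counts / counts.sum(axis=0, keepdims=True)   # per-sample gene proportions
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    return -plogp.sum(axis=0)

rng = np.random.default_rng(0)
counts = rng.poisson(lam=5, size=(1000, 4))          # toy: 1000 genes, 4 samples
print(transcriptome_diversity(counts))
```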





□ DeepAcr: Predicting Anti-CRISPR with Deep Learning

>> https://www.biorxiv.org/content/10.1101/2022.04.02.486820v1.full.pdf

DeepAcr compiles a large protein sequence database to obtain secondary structure, relative solvent accessibility, evolutionary features, and Transformer features with RaptorX.

DeepAcr applies a Hidden Markov Model and uses it as a baseline for Acr classification comparison, outperforming it on macro-average metrics; thus, DeepAcr is an unbiased predictor. DeepAcr captures the evolutionarily conserved patterns and the interactions involving anti-CRISPR proteins.





□ RecGen: Prediction of designer-recombinases for DNA editing with generative deep learning

>> https://www.biorxiv.org/content/10.1101/2022.04.01.486669v1.full.pdf

RecGen, an algorithm for the intelligent generation of designer-recombinases. RecGen is trained with 89 evolved recombinase libraries and their respective target sites, and captures the affinities between the recombinase sequences and their respective DNA binding sequences.

RecGen uses a CVAE (Conditional Variational Autoencoder) architecture for recombinase prediction. The latent space is designed to resemble a multivariate normal distribution: for each latent-space dimension, a mean and a standard deviation are learned for normal-distribution sampling.
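That per-dimension sampling is the standard VAE reparameterization trick, z = μ + σ·ε with ε ~ N(0, I). A framework-free numpy sketch (the batch and latent sizes are assumptions; RecGen itself is built on a deep-learning framework rather than raw numpy):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I), so the
    sampled latent code stays differentiable w.r.t. the learned mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros((8, 16))            # batch of 8 sequences, 16 latent dims (assumed)
log_var = np.full((8, 16), -1.0)
z = sample_latent(mu, log_var, rng)
print(z.shape)                    # (8, 16)
```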





□ BiTSC2: Bayesian inference of tumor clonal tree by joint analysis of single-cell SNV and CNA data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac092/6562684

BiTSC2 takes raw reads from scDNA-seq as input, accounts for the overlapping of CNA and SNV, models allelic dropout rate, sequencing errors and missing rate, as well as assigns single cells into subclones.

By applying Markov Chain Monte Carlo sampling, BiTSC2 can simultaneously estimate the subclonal scCNA and scSNV genotype matrices. BiTSC2 shows high accuracy in genotype recovery, subclonal assignment and tree reconstruction.





□ LSMMD-MA: Scaling multimodal data integration for single-cell genomics data analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.23.485536v1.full.pdf

MMD-MA is a method for analyzing multimodal data that relies on mapping the observed cell samples to embeddings, using functions belonging to a Reproducing Kernel Hilbert Space.

LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. It reformulates the MMD-MA optimization problem using linear algebra and solves it with KeOps, a CUDA framework for symbolic matrix computation.
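For reference, the core quantity is the maximum mean discrepancy; below is a biased estimate of the squared MMD under an RBF kernel in plain numpy (LSMMD-MA instead expresses these kernel sums with KeOps symbolic matrices; the toy data and gamma are assumptions):

```python
import numpy as np

def rbf(a, b, gamma):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel."""
    return rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean() - 2 * rbf(x, y, gamma).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 5))
y = rng.normal(0.5, 1.0, size=(200, 5))
print(mmd2(x, x[::-1]))  # 0: identical samples
print(mmd2(x, y))        # > 0: shifted distribution
```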





□ CNETML: Maximum likelihood inference of phylogeny from copy number profiles of spatio-temporal samples

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484889v1.full.pdf

CNETML, a new maximum likelihood method based on a novel evolutionary model of copy number alterations (CNAs) to infer phylogenies from spatio-temporal samples taken within a single patient.

CNETML is the first program to jointly infer the tree topology, node ages, and mutation rates from total copy numbers when samples were taken at different time points. The change of copy number at each site follows a continuous-time non-reversible Markov chain.
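Since the CNA model is a continuous-time Markov chain, transition probabilities along a branch of length t are obtained from the matrix exponential, P(t) = exp(Qt). A generic scipy sketch with a made-up three-state rate matrix (not CNETML's actual copy-number model):

```python
import numpy as np
from scipy.linalg import expm

# toy rate matrix over copy-number states {1, 2, 3}; rows sum to zero
Q = np.array([[-0.2,  0.2,  0.0],
              [ 0.1, -0.3,  0.2],
              [ 0.0,  0.3, -0.3]])

for t in (0.1, 1.0, 10.0):
    P = expm(Q * t)                          # transition probabilities over branch length t
    print(t, np.round(P, 3), P.sum(axis=1))  # each row sums to 1
```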





□ BISER: Fast characterization of segmental duplication structure in multiple genome assemblies

>> https://almob.biomedcentral.com/articles/10.1186/s13015-022-00210-2

BISER (Brisk Inference of Segmental duplication Evolutionary stRucture) is a fast tool for detecting and decomposing segmental duplications in genome assemblies. BISER infers elementary and core duplicons and enable an evolutionary analysis of all SDs in a given set of genomes.

BISER uses a two-tiered local chaining algorithm from SEDEF based on a seed-and-extend approach and an efficient O(n log n) chaining method, followed by a SIMD-parallelized sparse dynamic programming algorithm to calculate the boundaries of the final SD regions and their alignments.





□ NIFA: Non-negative Independent Factor Analysis disentangles discrete and continuous sources of variation in scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac136/6550501

NIFA (Non-negative Independent Factor Analysis), a new probabilistic single-cell factor analysis model that incorporates different interpretability inducing assumptions into a single modeling framework.

NIFA models uni- and multi-modal latent factors, and isolates discrete cell-type identity and continuous pathway activity into separate components. NIFA-derived factors outperform results from ICA, PCA, NMF and scCoGAPS in terms of disentangling biological sources of variation.





□ Coverage-preserving sparsification of overlap graphs for long-read assembly

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484715v1.full.pdf

Accordingly, problem formulations for genome assembly which seek a single genome reconstruction, e.g., by finding a Hamiltonian cycle in an overlap graph, or computing the shortest common superstring of input reads, are not used in practice.

A novel theoretical framework that computes a directed multi-graph structure which is also a sub-graph of overlap graph, and it is guaranteed to be coverage-preserving.

The safe graph sparsification rules for vertex and edge removal from the overlap graph Ok(R), k ≤ l2, guarantee that all circular strings ∈ C(R, l1, l2, φ) can be spelled in the sparse graph.





□ Quantum algorithmic randomness

>> https://arxiv.org/pdf/2008.03584.pdf

Quantum Martin-Löf randomness (q-MLR) for infinite qubit sequences was introduced. Defining a notion of quantum Solovay randomness which is equivalent to q-MLR. The proof of this goes through a purely linear algebraic result about approximating density matrices by subspaces.

Quantum-K is intended to be a quantum version of K, the prefix-free Kolmogorov complexity. Weak Solovay random states have a characterization in terms of the incompressibility of their initial segments: ρ is weak Solovay random ⟺ ∀ε > 0, lim_n QKε(ρ↾n) − n = ∞.





□ mm2-ax: Accelerating Minimap2 for accurate long read alignment on GPUs

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483575v1.full.pdf

Chaining in mm2 identifies optimal collinear ordered subsets of anchors from the input sorted list of anchors. mm2 does a sequential pass over all the predecessors and does sequential score comparisons to identify the best scoring predecessor for every anchor.

mm2-ax (minimap2-accelerated), a heterogeneous software-hardware co-design for accelerating the chaining step of minimap2. It extracts better intra-read parallelism from chaining without losing mapping accuracy by forward-transforming minimap2's chaining algorithm.

mm2-ax demonstrates a 12.6-5X Speedup and 9.44-3.77X Speedup:Costup over SIMD-vectorized mm2-fast baseline. mm2-ax converts a sparse vector which defines the chaining workload to a dense one in order to optimize for better arithmetic intensity.
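The chaining step being accelerated resembles the classic anchor-chaining dynamic program: for each anchor, scan its predecessors sequentially and keep the best-scoring one. A simplified sketch with a toy gap penalty (not minimap2's actual scoring function):

```python
def chain(anchors, max_dist=5000):
    """anchors: list of (ref_pos, query_pos, length) sorted by ref_pos.
    Returns the best chain score and predecessor for every anchor."""
    n = len(anchors)
    score = [a[2] for a in anchors]           # a chain may start at any anchor
    parent = [-1] * n
    for i in range(n):
        ri, qi, li = anchors[i]
        for j in range(i):                    # sequential pass over predecessors
            rj, qj, _ = anchors[j]
            if 0 < ri - rj <= max_dist and 0 < qi - qj <= max_dist:
                gap = abs((ri - rj) - (qi - qj))
                s = score[j] + li - gap       # toy gap penalty
                if s > score[i]:
                    score[i], parent[i] = s, j
    return score, parent

anchors = [(100, 50, 15), (180, 130, 15), (300, 250, 15), (5000, 220, 15)]
print(chain(anchors))  # collinear anchors chain together; the last one does not
```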





□ scINSIGHT for interpreting single-cell gene expression from biologically heterogeneous data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02649-3

Based on a novel matrix factorization model, scINSIGHT learns coordinated gene expression patterns that are common among or specific to different biological conditions, offering a unique chance to jointly identify heterogeneous biological processes and diverse cell types.

scINSIGHT achieves a sparse, interpretable, and biologically meaningful decomposition. scINSIGHT simultaneously identifies common and condition-specific gene modules and quantifies their expression levels in each sample in a lower-dimensional space.





□ Gradient-k: Improving the performance of K-Means using the density gradient

>> https://www.biorxiv.org/content/10.1101/2022.03.30.486343v1.full.pdf

Gradient-k reduces the number of iterations required for convergence. This is achieved by correcting the distance used in the k-means algorithm by a factor based on the angle between the density gradient and the direction to the cluster center.

Gradient-k uses auxiliary information about how the data is distributed in space, enabling it to detect clusters regardless of their density, shape, and size. Gradient-k allows non-linear splits, can find clusters of non-Gaussian shapes, and has a reduced tessellation behavior.





□ Multigrate: single-cell multi-omic data integration

>> https://www.biorxiv.org/content/10.1101/2022.03.16.484643v1.full.pdf

Multigrate equipped with transfer learning enables mapping a query multimodal dataset into an existing reference atlas.

Multigrate learns a joint latent space combining information from multiple modalities from paired and unpaired measurements while accounting for technical biases within each modality.





□ Gapless provides combined scaffolding, gap filling and assembly correction with long reads

>> https://www.biorxiv.org/content/10.1101/2022.03.08.483466v1.full.pdf

The included assembly correction can remove errors in the initial assembly that are highlighted by the long-reads. The necessary mapping and consensus calling are performed with minimap2 and racon, but this can be quickly changed in the short accompanying bash script.

The scaffold module is the core of gapless. It requires the split assembly to extract the names and length of existing scaffolds, the alignment of the split assembly to itself to detect repeats and the alignment of the long reads to the split assembly.

The long read alignments are initially filtered, requiring a minimum mapping quality and alignment length, and in case of PacBio, only one subread per fragment is kept to avoid giving large weight to short DNA fragments that are repeatedly sequenced multiple times.





□ DiSCERN - Deep Single Cell Expression ReconstructioN for improved cell clustering and cell subtype and state detection

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483600v1.full.pdf

DISCERN is based on a modified Wasserstein Autoencoder. DISCERN allows for the realistic reconstruction of gene expression information by transferring the style of hq data onto lq data, in latent and gene space.

DISCERN transfers the “style” of hq onto lq data to reconstruct missing gene expression, operating in a lower-dimensional representation. DISCERN models gene expression values realistically while retaining prior and vital biological information of the lq dataset after reconstruction.





□ DNA co-methylation has a stable structure and is related to specific aspects of genome regulation

>> https://www.biorxiv.org/content/10.1101/2022.03.16.484648v1.full.pdf

Highly correlated DNAm sites in close proximity are highly heritable, influenced by nearby genetic variants (cis mQTLs), and are enriched for transcription factor binding sites related to regulation of short RNAs essential for cellular function transcribed by RNA polymerase III.

DNA co-methylation of distant sites may be related to long-range cooperative TF interactions. Highly correlated sites that are either distant, or on different chromosomes, are driven by unique environmental factors, and methylation is less likely to be driven by genotype.





Element Biosciences

>> https://www.elementbiosciences.com/products/aviti

High data quality and throughput enable whole genome sequencing for rare disease. Our study with UCSD is the first of its kind to demonstrate the clinical potential of #AVITI System on previously unsolved cases.
#NGS #AviditySequencing

Comparative analysis shows Loopseq has the lowest error rate of all commercially available long read sequencing technologies.

>> https://www.elementbiosciences.com/news/element-launches-the-aviti-system-to-democratize-access-to-genomics


Jim Tananbaum

I'm excited to support the team at @ElemBio as they unveil their benchtop sequencer AVITI. I believe sequencing will touch all our lives. To enable it, we need high quality, inexpensive sequencing.



Svatyně.

2022-03-31 03:13:17 | Science News




□ sc-CGconv: A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009600

sc-CGconv, a stepwise robust unsupervised feature extraction and clustering approach that formulates and aggregates cell–cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach.

sc-CGconv formulates a cell-cell graph using Ccor that is learned by a graph-based artificial intelligence model, graph convolution network. sc-CGconv provides a topology-preserving embedding of cells in low dimensional space.





□ RegScaf: a Regression Approach to Scaffolding

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac174/6554191

RegScaf examines the distribution of distances between contigs from read alignment by the kernel density. When multiple modes are shown in a density, orientation-supported links are grouped into clusters, each of which defines a linking distance corresponding to a mode.

The linear model parameterizes contigs by their positions on the genome; then each linking distance between a pair of contigs is taken as an observation on the difference of their positions.

The parameters are estimated by minimizing a global loss function, which is a version of trimmed sum of squares. The least trimmed squares estimate has such a high breakdown value that it can automatically remove the mistaken linking distances.

RegScaf outperforms other scaffolders, especially in the accuracy of gap estimates by substantially reducing extremely abnormal errors. Its strength in resolving repeat regions is exemplified by a real case. Its adaptability to large genomes and TGS long reads is validated as well.





□ DCATS: differential composition analysis for complex single-cell experimental designs

>> https://www.biorxiv.org/content/10.1101/2022.03.21.485232v1.full.pdf

DCATS improves composition analysis by accounting for uncertainty in the classification of cell types in differential abundance analysis. DCATS detects differential abundance using a beta-binomial generalized linear model (GLM), which returns the estimated coefficients.

DCATS has the capability to account for covariates or to test multiple covariates jointly in their association with composition abundance for each cell type. DCATS corrects the misclassification bias based on the similarity matrix, so the estimation of this matrix is an important step.





□ L-GIREMI uncovers RNA editing sites in long-read RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.03.23.485515v1.full.pdf

L-GIREMI (Long-read GIREMI), effectively handles sequencing errors and biases in the reads, and uses a model-based approach to score RNA editing sites. Applied to PacBio long-read RNA-seq data, L-GIREMI affords a high accuracy in RNA editing identification.

L-GIREMI examines the linkage patterns between sequence variants in the same reads, complemented by a model-driven approach. The performance of L-GIREMI is robust across a wide range of total read coverage.





□ ggtranscript: an R package for the visualization and interpretation of transcript isoforms using ggplot2

>> https://www.biorxiv.org/content/10.1101/2022.03.28.486050v1.full.pdf

As a ggplot2 extension, ggtranscript inherits a vast amount of flexibility when determining the plot aesthetics, as well as interoperability with existing ggplot2 geoms and ggplot2 extensions.

ggtranscript enables a fast and simplified way to visualize, explore and interpret transcript isoforms. It allows users to combine data from both long-read and short-read RNA-sequencing technologies, making systematic assessment of transcript support easier.





□ CoLoRd: compressing long reads

>> https://www.nature.com/articles/s41592-022-01432-3

CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.

Equipped with an overlap-based algorithm for compressing the DNA stream and a lossy processing of the quality information, it allows even tenfold space reduction compared to gzip, without affecting down-stream analyses like variant calling or consensus generation.





□ scChromHMM: Characterizing cellular heterogeneity in chromatin state with scCUT&Tag-pro

>> https://www.nature.com/articles/s41587-022-01250-0

single-cell (sc)CUT&Tag-pro, a multimodal assay for profiling protein–DNA interactions coupled with the abundance of surface proteins in single cells.

single-cell ChromHMM integrates data from multiple experiments to infer and annotate chromatin states based on combinatorial histone modification patterns.





□ scMAGS: Marker gene selection from scRNA-seq data for spatial transcriptomics studies

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485261v1.full.pdf

scMAGS uses a filtering step in which the candidate genes are extracted prior to the marker gene selection step. For the selection of marker genes, cluster validity indices, Silhouette index or Calinski-Harabasz index (for large datasets) are utilized.

scMAGS calculates the expression rates of all genes in all cell types. The count matrix should be normalized to reduce the bias. The number of reads for a gene in each cell is expected to be proportional to the gene-specific expression level and cell-specific scaling factors.
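Both cluster-validity indices that scMAGS relies on are available in scikit-learn; a minimal usage sketch on toy marker-gene data (the data and labels below are made up, not scMAGS itself):

```python
import numpy as np
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# toy expression of one candidate marker gene across two cell types
X = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)]).reshape(-1, 1)
labels = np.array([0] * 500 + [1] * 500)

print(silhouette_score(X, labels))           # close to 1 for a clean marker
print(calinski_harabasz_score(X, labels))    # large for well-separated clusters
```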





□ SMetABF: A rapid algorithm for Bayesian GWAS meta-analysis with a large number of studies included

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009948

SMetABF, a method based on the Markov chain Monte Carlo (MCMC) method and its extension named shotgun stochastic search (SSS) to speed the process of subset selection. SSS is proved to be superior in speed, accuracy, and stability through simulation.

The SSS algorithm can reach the maximum ABF in a short time with a small number of iterations. By contrast, the MCMC algorithm can hardly find the maximum ABF even with much longer runtimes. Large-scale multi-phenotype meta-analyses become possible through SMetABF.





□ CIAlign: A highly customisable command line tool to clean, interpret and visualise multiple sequence alignments

>> https://peerj.com/articles/12983/

CIAlign is particularly targeted towards users working with complex or highly divergent alignments, partial sequences and problematic assemblies, and towards those developing complex pipelines requiring fine-tuning of parameters to meet specific criteria.

When running CIAlign with all core functions and for fixed gap proportions, the runtime scales quadratically with the size of the MSA, i.e. with n as the number of sequences and m the length of the MSA, the worst case time complexity is O((nm)2).





□ scPipeline: Multi-level cellular and functional annotation of single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2022.03.13.484162v1.full.pdf

scPipeline is a modular collection of Rmarkdown scripts. The modular framework permits flexible usage and facilitates QC & preprocessing, integration, cluster optimization, cell annotation, gene expression and association analyses, and gene program discovery.

Scale-free Shared Nearest neighbor network (SSN) analysis as an approach to identify and functionally annotate gene sets in an unsupervised manner, providing an additional layer of functional characterization of scRNA-seq data.





□ ScanExitronLR: characterization and quantification of exitron splicing events in long-read RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485864v1.full.pdf

ScanExitronLR, an application for the characterization and quantification of exitron splicing events in long-reads. From a BAM alignment file, reference genome and reference gene annotation, ScanExitronLR outputs exitron events at the transcript level.

ScanExitronLR executes calling and filtering processes for each chromosome in parallel. For every exitron that passes filtering, It examines whether reads aligning to the exitron's position which were not called in the previous step could have harbored misaligned exitrons.





□ TLVar: Exploiting deep transfer learning for the prediction of functional noncoding variants using genomic sequence

>> https://www.biorxiv.org/content/10.1101/2022.03.19.484983v1.full.pdf

The validated variants are rare due to technical difficulty and financial cost. The small sample size of validated variants makes it less reliable to develop a supervised machine learning model for achieving a whole genome-wide prediction of noncoding causal variants.

TLVar, a deep transfer learning model, which consists of pretrained layers trained by large-scale generic functional noncoding variants, and retrained layers by context-specific functional noncoding variants with the pretrained layers frozen.





□ LANTSA: Landmark-based transferable subspace analysis for single-cell and spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.03.13.484116v1.full.pdf

LANTSA constructs a representation graph of samples for clustering and visualization based on a novel subspace model, which can learn a more accurate representation and is theoretically proven to be linearly proportional to data size in terms of the time consumption.

LANTSA approximates the whole representation graph (i.e., sample-by-sample relationship) by representing each landmark sample as a linear combination of all samples based on a novel subspace model which preserves local structures.

LANTSA uses a dimensionality reduction as an integrative method to extract the discriminants underlying the representation structure, which enables label transfer from one learning dataset to the other prediction datasets, thus solving the massive-volume / cross-platform problem.





□ scGDC: Learning deep features and topological structure of cells for clustering of scRNA-sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac068/6549863

scGDC extends auto-encoder by introducing a self-representation layer to extract deep features of cells, and learns affinity graph of cells, which provide a better and more comprehensive strategy to characterize structure of cell types.

scGDC projects cells of various types onto different subspaces, where types, particularly rare cell types, are well discriminated by utilizing generative adversarial learning.

scGDC joins deep feature extraction, structural learning and cell type discovery, where features of cells are extracted under the guidance of cell types, thereby improving performance of algorithms.





□ DeepREAL: A Deep Learning Powered Multi-scale Modeling Framework for Predicting Out-of-distribution Ligand-induced GPCR Activity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac154/6547052

DeepREAL utilizes self-supervised learning on tens of millions of protein sequences and pre-trained binary interaction classification to solve the data distribution shift and data scarcity problems.

DeepREAL is based on a new multi-stage deep transfer learning architecture that combines binary DTI pretraining and embedding with a three-way receptor activity fine-tuning to address OOD challenges using sparse receptor activity data.





□ GraphGONet: a self-explaining neural network encapsulating the Gene Ontology graph for phenotype prediction on gene expression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac147/6546279

The production of accurate and intelligible predictions can benefit from the inclusion of domain knowledge. Therefore, knowledge-based deep learning models appear to be a promising solution.

GraphGONet, where the Gene Ontology is encapsulated in the hidden layers of a new self-explaining neural network. Each neuron in the layers represents a biological concept, combining the gene expression profile of a patient, and the information from its neighboring neurons.





□ Statistical and machine learning methods for spatially resolved transcriptomics data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02653-7

Graph convolutional networks can aggregate features from each spatial location’s neighbors through convolutional layers and utilize the learned representation to perform node classification, community detection, and link prediction.
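A single graph-convolution propagation step in the common normalized-adjacency form, H′ = σ(D^(−1/2) Â D^(−1/2) H W), sketched in numpy (a generic GCN layer on a toy spot graph, not any specific spatial-transcriptomics package):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: aggregate each node's (spot's) neighbors using
    the symmetrically normalized adjacency matrix with self-loops, then apply ReLU."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                        # symmetric, no self-loops
H = rng.normal(size=(6, 8))                           # 6 spots x 8 input features
W = rng.normal(size=(8, 4))                           # weight matrix (random here)
print(gcn_layer(A, H, W).shape)                       # (6, 4)
```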

scHOT is a computational approach designed to identify changes in higher-order interactions among genes in cells along a continuous trajectory or across space. This method has also been demonstrated to be effective in spatial transcriptomics data.





□ Variomes: a high recall search engine to support the curation of genomic variants

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac146/6547047

The system can be used as a literature triage system in the same way as LitVar. It can also be used to prioritize variants to facilitate the identification of clinically actionable variants.

Variomes enables searching the biomedical literature. The collections are pre-processed with a set of medical terminologies. User queries are automatically processed to map keywords to the terminologies and expand genetic variants using a dedicated variant expansion system.





□ Generating minimum set of gRNA to cover multiple targets in multiple genomes with MINORg

>> https://www.biorxiv.org/content/10.1101/2022.03.10.481891v1.full.pdf

MINORg is an offline gRNA design tool that generates the smallest possible combination of gRNA capable of covering all desired targets in multiple non-reference genomes.

MINORg aims to lessen this workload by capitalising on sequence homology to favour multi-target gRNA while simultaneously screening multiple genetic backgrounds in order to generate reusable gRNA panels.





□ CNV-espresso: Accurate in silico confirmation of rare copy number variant calls from exome sequencing data using transfer learning

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483665v1.full.pdf

CNV-espresso encodes candidate CNV regions from exome sequencing data as images and uses convolutional neural networks to classify the image into different copy numbers.

Assuming the CNVs detected from WGS data as a proxy of ground truth, CNV-espresso significantly improves precision while keeping recall almost intact, especially for CNVs that span a small number of exons in exome data.





□ UniFuncNet: a flexible network annotation framework

>> https://www.biorxiv.org/content/10.1101/2022.03.15.484380v1.full.pdf

UniFuncNet, a network annotation framework that dynamically integrates data from multiple biological databases. If UniFuncNet finds searchable information for the other databases (in this case MetaCyc and HMDB), it will also collect data from those databases.

The output from UniFuncNet can be represented as a multipartite graph, where the central layers correspond to the entity types (e.g., proteins), and the outer layers to the annotations.





□ OTUP-workflow: Target specific optimization of the transmit k-space trajectory for flexible universal parallel transmit RF pulse design

>> https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/nbm.4728

Transmit k-space trajectories (stack-of-spirals and SPINS) were optimized to best match different excitation targets using the parameters of the analytical equations of spirals and SPINS.

The OTUP-workflow (Optimization of transmit k-space Trajectories and Universal Pulse calculation) was tested on three test target excitation patterns. It emphasized the importance of a well-suited trajectory for pTx RF pulse design.





□ SavvyCNV: Genome-wide CNV calling from off-target reads

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009940

SavvyCNV finds the greatest number of true positive CNVs in all data sets. SavvyCNV calls CNVs by looking at read depth over the genome. The genome is split into bins and each bin is assessed for statistical divergence from normal copy number.

Read counts per bin are first divided by the mean read depth of the sample across all genomic locations, and then by the mean read depth of the genomic location across all samples. SavvyCNV then uses singular value decomposition (SVD) to reduce noise.
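The SVD step can be illustrated generically: treat normalized read depths as a samples-by-bins matrix and subtract the leading singular components, which capture systematic structure shared across samples. A simplified numpy sketch (not SavvyCNV's actual implementation; the data and the number of removed components are made up):

```python
import numpy as np

def remove_top_components(depth, k=1):
    """Subtract the top-k singular components from a samples x bins matrix of
    normalized read depths, removing noise shared across samples."""
    U, s, Vt = np.linalg.svd(depth, full_matrices=False)
    systematic = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return depth - systematic

rng = np.random.default_rng(0)
shared = np.outer(rng.normal(size=20), rng.normal(size=300))   # shared systematic structure
depth = 0.05 * rng.normal(size=(20, 300)) + shared             # 20 samples x 300 bins
cleaned = remove_top_components(depth, k=1)
print(depth.std(), cleaned.std())   # spread drops once the shared component is removed
```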





□ Adversarial attacks and adversarial robustness in computational pathology

>> https://www.biorxiv.org/content/10.1101/2022.03.15.484515v1.full.pdf

Vision transformers (ViTs) perform equally well compared to CNNs at baseline and are orders of magnitude more robust to different types of white-box and black-box attacks. This is associated with a more robust latent representation of clinically relevant categories.

ViTs are robust learners in computational pathology. This implies that large-scale rollout of AI models in computational pathology should rely on ViTs rather than CNN-based classifiers to provide inherent protection against adversaries.





□ ChromDMM: A Dirichlet-Multinomial Mixture Model For Clustering Heterogeneous Epigenetic Data

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485838v1.full.pdf

ChromDMM, a product Dirichlet-multinomial mixture model for clustering genomic regions that are characterised by multiple chromatin features.

ChromDMM extends the mixture model framework by profile shifting and flipping that can probabilistically account for inaccuracies in the position and strand-orientation. ChromDMM regularises the smoothness of the epigenetic profiles across the consecutive genomic regions.





□ Phenotype to genotype mapping using supervised and unsupervised learning

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484826v1.full.pdf

This pipeline is capable of relating distinct vacuole morphologies to genetic perturbations. A mixed supervised-unsupervised learning methodology with the aim of reducing the annotation burden and the inherent bias due to the human annotation task.






□ Syrah: a Slide-seqV2 pipeline augmentation

>> https://www.biorxiv.org/content/10.1101/2022.03.20.485023v1.full.pdf

Syrah was built as an augmentation to the original Slide-seqV2 pipeline, such that it takes as input the output from the original pipeline and creates a corrected version of the data, facilitating comparison with the original pipeline’s results.

Syrah aligns the known linker sequence to each read and uses the beginning and end points of that alignment to determine where to extract the barcode and UMI segments.





□ EDClust: An EM-MM hybrid method for cell clustering in multiple-subject single-cell RNA sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac168/6551990

EDClust adopts a Dirichlet-multinomial mixture model and explicitly accounts for cell type heterogeneity, subject heterogeneity, and clustering uncertainty.

An EM-MM hybrid algorithm is derived for maximizing the data likelihood and clustering the cells. EDClust offers functions for predicting cell type labels, estimating parameters of effects from different sources, and posterior probabilities for cells being in each cluster.





□ DCLEAR: Single cell lineage reconstruction using distance-based algorithms

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04633-x

This method consists of two steps: Distance matrix estimation and the tree reconstruction from the distance matrix. Two of the more sophisticated distance methods display a substantially improved level of performance compared to the traditional Hamming distance method.

The algorithm used to compute the k-mer replacement distance (KRD) method first uses the prominence of mutations in the character arrays to estimate the summary statistics used for the generation of the tree to be reconstructed.





□ Parallel sequence tagging for concept recognition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04511-y

A paradigm for biomedical concept recognition where named entity recognition (NER) and normalisation (NEN) are tackled in parallel. In a traditional NER+NEN pipeline, the NEN module is restricted to predict concept labels (IDs) for the spans identified by the NER tagger.

The system consistently achieves better scores than the baseline, which is a pipeline with a CRF-based span tagger and a BiLSTM-based concept classifier that were also trained on the CRAFT corpus alone.





□ Ontology-Aware Biomedical Relation Extraction

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485304v1.full.pdf

Extending a Recurrent Neural Network (RNN) with a Convolutional Neural Network (CNN) to process three sets of features, namely, tokens, types, and graphs.

Entity type and ontology graph structure provide better representations than simple token-based representations for RE.





□ BarWare: efficient software tools for barcoded single-cell genomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04620-2

BarWare provides a comprehensive set of tools which lowers the barrier to entry of Cell Hashing workflows for small laboratories in the field of single-cell sequencing, and should be useful for core facilities that can use cell hashing to mix and overload samples.





□ vcferr: Development, Validation, and Application of a SNP Genotyping Error Simulation Framework

>> https://www.biorxiv.org/content/10.1101/2022.03.28.485853v1.full.pdf

vcferr, a novel framework for probabilistically simulating genotyping error and missingness in VCF files. The processing runs iteratively for every site in the input VCF, with the output streamed or optionally written to a new output VCF file.

vcferr checks each genotype, and randomly draws from a list of possible genotypes (heterozygous, homozygous for the alternate allele, homozygous for the reference allele, missing) with each element weighted by error rates.
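The per-site sampling step amounts to a weighted random draw over possible output genotypes; a simplified illustration with made-up error rates (not vcferr's actual parameterization or VCF handling):

```python
import random

def perturb_genotype(gt, error_rates):
    """Replace a genotype with one of (het, hom-alt, hom-ref, missing) drawn with
    per-class error-rate weights; otherwise keep the original genotype."""
    candidates = ["0/1", "1/1", "0/0", "./."]
    weights = [error_rates.get(c, 0.0) for c in candidates]
    keep = max(0.0, 1.0 - sum(weights))
    return random.choices(candidates + [gt], weights=weights + [keep])[0]

random.seed(0)
rates = {"0/1": 0.01, "1/1": 0.005, "0/0": 0.005, "./.": 0.02}  # assumed rates
print([perturb_genotype("0/0", rates) for _ in range(10)])
```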





□ SHOOT: phylogenetic gene search and ortholog inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02652-8

The phylogenetic tree returned by SHOOT provides the evolutionary relationships between genes inferred from multiple sequence alignment and maximum likelihood tree inference allowing orthologs and paralogs to be identified.

SHOOT also automatically identifies orthologs and colors the genes in the tree according to whether they are orthologs or paralogs, as identified using the species overlap method, which has been shown to be an accurate method for automated orthology inference.















Nebe.

2022-03-31 03:13:03 | Science News






□ MAECI: A Pipeline For Generating Consensus Sequence With Nanopore Sequencing Long-read Assembly and Error Correction

>> https://www.biorxiv.org/content/10.1101/2022.04.04.487014v1.full.pdf

The assemblies can be corrected using nanopore sequencing data and then polished with NGS data. Both approaches can mitigate some of these problems and improve the accuracy of the assemblies, but assembly errors cannot be completely avoided.

MAECI enables the assembly for nanopore long-read sequencing data. It takes nanopore sequencing data as input, uses multiple assembly algorithms to generate a single consensus sequence, and then uses nanopore sequencing data to perform self-error correction.





□ DPI: Single-cell multimodal modeling with deep parametric inference

>> https://www.biorxiv.org/content/10.1101/2022.04.04.486878v1.full.pdf

DPI, a deep parameter inference model that integrates CITE-seq/REAP-seq data. With DPI, the cellular heterogeneity embedded in the single-cell multimodal omics can be comprehensively understood from multiple views.

DPI describes the state of all cells in the sample in terms of the multimodal latent space. The multimodal latent space generated by DPI is continuous, which means that perturbing the genes/proteins of cells in the sample can find the cell state closest to it in this space.





□ MOSS: Multi-omic integration with Sparse Singular Value Decomposition

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac179/6553658

MOSS performs a Sparse Singular Value Decomposition (sSVD) on the integrated omic blocks to obtain latent dimensions as sparse factors (i.e., with zeroed out elements), representing variability across subjects and features.

MOSS can fit supervised analyses via partial least squares, linear discriminant analysis, and low-rank regressions. Sparsity is imposed via Elastic Net on the sSVD solutions. MOSS allows an automatic tuning of the number of elements different from zero.
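The sparse-factor idea can be sketched generically by soft-thresholding singular vectors during power iterations so that weak loadings are zeroed out (a rough rank-1 illustration of sparse SVD, not MOSS's Elastic-Net-based procedure):

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink entries toward zero; entries with |v_i| <= lam become exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_rank1(X, lam=0.5, n_iter=50):
    """Rank-1 sparse SVD via alternating power iterations with soft-thresholding."""
    v = np.linalg.svd(X, full_matrices=False)[2][0]      # dense SVD initialization
    for _ in range(n_iter):
        u = soft_threshold(X @ v, lam);  u /= (np.linalg.norm(u) + 1e-12)
        v = soft_threshold(X.T @ u, lam); v /= (np.linalg.norm(v) + 1e-12)
    return u, v

rng = np.random.default_rng(0)
signal = np.outer(rng.normal(size=50), np.r_[np.ones(5), np.zeros(45)])
X = signal + 0.1 * rng.normal(size=(50, 50))
u, v = sparse_rank1(X)
print(np.count_nonzero(v))   # ~5: only the true signal features load on the factor
```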




□ GPS-seq: The DNA-based global positioning system—a theoretical framework for large-scale spatial genomics

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485380v1.full.pdf

GPS-seq, a theoretical framework that enables massively scalable, optics-free spatial transcriptomics. GPS-seq combines data from high-throughput sequencing with manifold learning to obtain the spatial transcriptomic landscape of a given tissue section without optical microscopy.

In this framework, similar to technologies like Slide-seq and 10X Visium, tissue samples are stamped on a surface of randomly-distributed DNA-barcoded spots (or beads). The transcriptomic sequences of proximal cells are fused to DNA barcodes.

The barcode spots serve as “anchors” which also capture spatially diffused “satellite” barcodes, and therefore allow computational reconstruction of spot positions without optical sequencing or depositing barcodes to pre-specified positions.

The general framework of GPS-seq is also compatible with standard single-cell (or single-nucleus) capture methods, and any modality of single-cell genomics, such as sci-ATAC-seq, could be transformed into spatial genomics in this strategy.





□ MEDUSA: A Pipeline for Sensitive Taxonomic Classification and Flexible Functional Annotation of Metagenomic Shotgun Sequences

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.814437/full

MEDUSA performs preprocessing, assembly, alignment, taxonomic classification, and functional annotation on shotgun data, supporting user-built dictionaries to transfer annotations to any functional identifier.

MEDUSA includes several tools, such as fastp, Bowtie2, DIAMOND, Kaiju, and MEGAHIT, plus a novel tool implemented in Python to transfer annotations to BLAST/DIAMOND alignment results.





□ NAb-seq: an accurate, rapid and cost-effective method for antibody long-read sequencing in hybridoma cell lines and single B cells

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485728v1.full.pdf

When compared to Sanger sequencing of two hybridoma cell lines, long-read ONT sequencing was highly accurate, reliable, and amenable to high throughput.

NAb-seq, a three-day, species-independent, and cost-effective workflow to characterize paired full-length immunoglobulin light and heavy chain genes from hybridoma cell lines.





□ SimSCSnTree: a simulator of single-cell DNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac169/6551250

SimSCSnTree, a new single-cell DNA sequence simulator which generates an evolutionary tree of cells and evolves single nucleotide variants (SNVs) and copy number aberrations (CNAs) along its branches.

Data generated by the simulator can be used to benchmark tools for single-cell genomic analyses, particularly in cancer where SNVs and CNAs are ubiquitous.





□ Dynamic Mantis: An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using the Bentley-Saxe Transformation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac142/6553005

An efficient algorithm for merging two Mantis indexes, tackling several scalability and efficiency obstacles along the way. The proposed algorithm targets Minimum Spanning Tree-based Mantis.

MST-based Mantis is ≈ 10× faster to construct, requires ≈ 10× less construction memory, results in ≈ 2.5× smaller indexes, and performs bulk queries ≈ 74× faster and with ≈ 100× less query memory than Bifrost.





□ Triku: a feature selection method based on nearest neighbors for single-cell data

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac017/6547682

Triku is a feature selection method that favors genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the k-NN graph. The expression of these genes is higher than the expected expression if the k cells were chosen at random.

The Wasserstein distance between the observed and the expected distributions is computed, and genes are ranked according to that distance. Higher distances imply that the gene is locally expressed in a subset of transcriptomically similar cells.
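The per-gene statistic is a one-dimensional Wasserstein distance between observed and expected expression distributions, which scipy provides directly; a generic usage sketch with made-up distributions (not Triku's code):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
expected = rng.poisson(1.0, size=5000)        # expected expression for random cells
unlocalized = rng.poisson(1.0, size=5000)     # gene with no local structure
localized = np.concatenate([rng.poisson(0.2, 4500),
                            rng.poisson(8.0, 500)])  # gene expressed in one neighborhood

print(wasserstein_distance(unlocalized, expected))  # small distance, low rank
print(wasserstein_distance(localized, expected))    # larger distance, ranked higher
```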





□ RF4Del: A Random Forest approach for accurate deletion detection

>> https://www.biorxiv.org/content/10.1101/2022.03.10.483419v1.full.pdf

The model consists of 13 features extracted from a mapping file. RF4Del outperforms established SV callers (DELLY, Pindel) with higher overall performance (F1-score > 0.75; 6x-12x sequence coverage) and is less affected by low sequencing coverage and deletion size variations.

RF4Del could learn from a compilation of sequence patterns linked to a given SV. Such models can then be combined to form a learning system able to detect all types of SVs in a given genome.





□ GRAPE: Genomic Relatedness Detection Pipeline

>> https://www.biorxiv.org/content/10.1101/2022.03.11.483988v1.full.pdf

GRAPE: Genomic RelAtedness detection PipelinE. It combines data preprocessing, identity-by-descent (IBD) segment detection, and accurate relationship estimation.

GRAPE has a modular architecture that allows switching between tools and adjusting tool parameters for better control of precision and recall levels. The pipeline also contains a simulation workflow with an in-depth evaluation of pipeline accuracy using simulated and reference data.





□ ClusterFoldSimilarity: A single-cell clusters similarity measure for different batches, datasets, and samples

>> https://www.biorxiv.org/content/10.1101/2022.03.14.483731v1.full.pdf

ClusterFoldSimilarity calculates a measure of similarity between clusters from different datasets/batches, without the need to correct for batch effects or to normalize and merge the data, thus avoiding artifacts and the loss of information derived from these kinds of techniques.

The similarity metric is based on the average vector module and sign of the product of logarithmic fold-changes. ClusterFoldSimilarity compares every single pair of clusters from any number of different samples/datasets, including different number of clusters for each sample.





□ HCLC-FC: a novel statistical method for phenome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2022.03.14.484203v1.full.pdf

HCLC-FC (Hierarchical Clustering Linear Combination with False discovery rate Control), to test the association between a genetic variant with multiple phenotypes for each phenotypic category in phenome-wide association studies (PheWAS).

HCLC-FC clusters phenotypes within each phenotypic category, which reduces the degrees of freedom of the association tests and has the potential to increase statistical power. HCLC-FC has an asymptotic distribution which avoids the computational burden of simulation.





□ CONGAS: A Bayesian method to cluster single-cell RNA sequencing data using Copy Number Alterations

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac143/6550058

CONGAS jointly identifies clusters of single cells with subclonal copy number alterations, and differences in RNA expression.

CONGAS builds statistical priors leveraging bulk DNA sequencing data, does not require a normal reference and scales fast thanks to a GPU backend and variational inference.





□ OMAMO: orthology-based alternative model organism selection

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac163/6550503

The only unicellular organisms considered in these databases are fission and budding yeast, whilst the abundance of unicellular species in nature and their unique features make it difficult to find other non-complex model organisms for a biological process of interest.

OMAMO (Orthologous Matrix and Alternative Model Organisms), a software and a web service that provide the user with the best non-complex organism for research into a biological process of interest based on orthologous relationships between human and the species.





□ DENVIS: scalable and high-throughput virtual screening using graph neural networks with atomic and surface protein pocket features

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484710v1.full.pdf

DENVIS, a purely machine learning-based, high-throughput, end-to-end-strategy for SBVS using GNNs for binding affinity prediction. DENVIS exhibits several orders of magnitude faster screening times (i.e., higher throughput) than both docking-based and hybrid models.

The atom-level model consists of a modified version of the graph isomorphism network (GIN). The surface-level approach utilises a mixture model network (MoNet), a specialised GNN with a convolution operation that respects the geometry of the input manifold.





□ Wochenende - modular and flexible alignment-based shotgun metagenome analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484377v1.full.pdf

Wochenende runs alignment of short reads (e.g. Illumina) or long reads (e.g. Oxford Nanopore) against a reference sequence. It is relevant for genomics and metagenomics. Wochenende is simple (a Python script), portable, and easy to configure with a central config file.

Wochenende has the ability to find and filter alignments to all kingdoms of life using both short and long reads with high sensitivity and specificity, and provides the user with multiple normalization techniques and configurable and transparent filtering steps.





□ GBScleanR: Robust genotyping error correction using hidden Markov model with error pattern recognition.

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484886v1.full.pdf

GBScleanR implements a novel HMM-based error correction algorithm. This algorithm estimates the allele read bias and mismap rate per marker and incorporates these into the HMM as parameters to capture the skewed probabilities in read acquisitions.

GBScleanR provides functions for data visualization, filtering, and loading/writing a VCF file. The algorithm of GBScleanR is based on the HMM and treats the observed allele read counts for each SNP marker along a chromosome as outputs from a sequence of latent true genotypes.
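A toy forward pass over such an HMM, with three latent genotypes, sticky transitions, and binomial emissions whose success probability encodes a reference-allele bias; the numbers are illustrative, not GBScleanR's estimated parameters:

```python
import numpy as np
from scipy.stats import binom

states = ["AA", "AB", "BB"]                 # latent true genotypes
p_ref = np.array([0.98, 0.55, 0.02])        # P(ref read | genotype); 0.55 encodes
                                            # a reference-allele read bias in hets
trans = np.full((3, 3), 0.01) + np.eye(3) * 0.97   # sticky transitions along the chromosome
init = np.array([0.25, 0.5, 0.25])

ref = np.array([5, 3, 0, 1, 6])             # reference allele read counts per marker
alt = np.array([0, 2, 4, 5, 0])             # alternative allele read counts per marker

alpha = init * binom.pmf(ref[0], ref[0] + alt[0], p_ref)
alpha /= alpha.sum()
for t in range(1, len(ref)):
    emit = binom.pmf(ref[t], ref[t] + alt[t], p_ref)
    alpha = (alpha @ trans) * emit
    alpha /= alpha.sum()                    # scaled forward (filtering) probabilities

print(dict(zip(states, np.round(alpha, 3))))  # filtered genotype probabilities at the last marker
```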





□ 3GOLD: optimized Levenshtein distance for clustering third-generation sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04637-7

3GOLD offers a novel way of determining error type and frequency by interpreting the unweighted Sequence-Levenshtein distance (SLD) value and its position in the matrix, and comparing it to the unweighted Levenshtein distance (LD) value. 3GOLD combines the discriminatory benefits of weighted LD with the permissive benefits of SLD.

This approach is appropriate for datasets of unknown cluster centroids, such as those generated with unique molecular identifiers as well as known centroids such as barcoded datasets. It has high accuracy in resolving small clusters and mitigating the number of singletons.
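For orientation, the unweighted Levenshtein distance that both variants build on, as a standard dynamic-programming table (the weighted LD and SLD variants change the costs and how the final matrix is read out):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic (unweighted) Levenshtein distance via dynamic programming.
    Shown only as the baseline 3GOLD builds on."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

print(levenshtein("ACGTTGCA", "ACGATGCA"))  # 1
```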





□ The role of cell geometry and cell-cell communication in gradient sensing

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009552

Generalizing existing mathematical models to investigate how short- and long-range cellular communication can improve gradient sensing in two-dimensional models of epithelial tissues.

With long-range communication, the gradient sensing ability improves for tissues with more disordered geometries; on the other hand, an ordered structure with mostly hexagonal cells is advantageous with nearest neighbour communication.





□ Crimp: fast and scalable cluster relabeling based on impurity minimization

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485309v1.full.pdf

CRIMP, a lightweight command-line tool, which offers a relatively fast and scalable heuristic to align clusters across multiple replicate clusterings consisting of the same number of clusters.

CRIMP rearranges a set of membership matrices of identical shape in order to minimize differences caused by label switching. The remaining differences should be attributable either to noise or to truly different clusterings of the data, referred to as ‘genuine multimodality’.
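A pairwise illustration of what undoing label switching means, here using the Hungarian algorithm rather than Crimp's impurity-minimizing heuristic over many replicates:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_columns(ref, q):
    """Permute the columns (cluster labels) of membership matrix `q` so they
    best match the reference matrix `ref`. Pairwise sketch only; Crimp aligns
    many replicate clusterings at once with its own heuristic."""
    cost = -ref.T @ q                      # negative agreement between label columns
    _, perm = linear_sum_assignment(cost)  # optimal column-to-column matching
    return q[:, perm]

ref = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
q   = np.array([[0.2, 0.8], [0.3, 0.7], [0.85, 0.15]])  # same clustering, labels swapped
print(align_columns(ref, q))                             # columns back in ref's order
```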





□ RabbitV: fast detection of viruses and microorganisms in sequencing data on multi-core architectures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac187/6554196

RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization, and fast data parsing.

RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively.
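The underlying idea, indexing canonical k-mers of a reference and scoring reads by how many of their k-mers hit the index, fits in a few lines; RabbitV adds multi-threading, vectorization, and fast parsing on top of this, and the sequences below are made up:

```python
def canonical_kmers(seq: str, k: int):
    """Yield the canonical form (lexicographic min of k-mer and its reverse
    complement) of every k-mer in `seq`."""
    comp = str.maketrans("ACGT", "TGCA")
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(comp)[::-1]
        yield min(kmer, rc)

reference = "ACGTACGGTTAGCCATGCA"
index = set(canonical_kmers(reference, k=7))   # unique k-mers of the reference

read = "CGTACGGTTAGC"
hits = sum(1 for kmer in canonical_kmers(read, k=7) if kmer in index)
print(hits, "of", len(read) - 7 + 1, "k-mers match")
```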





□ q2-fondue: Reproducible acquisition, management, and meta-analysis of nucleotide sequence (meta)data

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485322v1.full.pdf

q2-fondue (Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere) to expedite the initial acquisition of data from the SRA, while offering complete provenance tracking.

q2-fondue simplifies retrieval of sequencing data and accompanying metadata in a validated and standardized format interoperable with the QIIME 2 ecosystem.





□ MASI: Fast model-free standardization and integration of single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2022.03.28.486110v1.full.pdf

MASI (Marker-Assisted Standardization and Integration) can run integrative annotation on a personal laptop for approximately one million cells, providing a cheap computational alternative for the single-cell data analysis community.

MASI will not be able to annotate cell types in query data that have not been seen in reference data.

However, it is still worth asking whether a cell-type score matrix constructed from the reference data can preserve cell-type structure in the query data, even when the query contains unseen cell types.
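A toy sketch of such a marker-based cell-type score matrix, with invented marker lists and data; MASI's actual scoring and integration steps go further than this:

```python
import numpy as np

genes = ["CD3D", "CD3E", "MS4A1", "CD79A", "LYZ", "CD14"]
markers = {"T cell": ["CD3D", "CD3E"],
           "B cell": ["MS4A1", "CD79A"],
           "Monocyte": ["LYZ", "CD14"]}

rng = np.random.default_rng(0)
expr = rng.poisson(1.0, size=(5, len(genes))).astype(float)   # cells x genes (toy counts)
z = (expr - expr.mean(axis=0)) / (expr.std(axis=0) + 1e-8)    # z-score each gene

# score each cell against each cell type by averaging its marker-gene z-scores
gene_idx = {g: i for i, g in enumerate(genes)}
score = np.column_stack([
    z[:, [gene_idx[g] for g in gene_set]].mean(axis=1)
    for gene_set in markers.values()
])                                                            # cells x cell types
print(dict(zip(markers, np.round(score[0], 2))))              # scores for cell 0
```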





□ The Codon Statistics Database: a Database of Codon Usage Bias

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486291v1.full.pdf

the Codon Statistics Database, an online database that contains codon usage statistics for all the species with reference or representative genomes in RefSeq.

If a species is selected, the user is directed to a table that lists, for each codon, the encoded amino acid, the total count in the genome, the relative synonymous codon usage (RSCU), and whether the codon is preferred or unpreferred.
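RSCU itself is simple to compute: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. A small leucine-only example with made-up counts:

```python
from collections import Counter

LEU_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

def rscu(codon_counts, synonymous_codons):
    """RSCU = observed count / expected count under equal synonymous usage."""
    total = sum(codon_counts.get(c, 0) for c in synonymous_codons)
    expected = total / len(synonymous_codons)
    return {c: codon_counts.get(c, 0) / expected for c in synonymous_codons}

counts = Counter({"CTG": 40, "CTC": 20, "TTG": 15, "CTT": 10, "TTA": 10, "CTA": 5})
print({c: round(v, 2) for c, v in rscu(counts, LEU_CODONS).items()})
# codons with RSCU > 1 are "preferred", those with RSCU < 1 are "unpreferred"
```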





□ Boquila: NGS read simulator to eliminate read nucleotide bias in sequence analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486262v1.full.pdf

Boquila generates sequences that mimic the nucleotide profile of true reads, which can be used to correct the nucleotide-based bias of genome-wide distribution of NGS reads.

Boquila can be configured to generate reads from only specified regions of the reference genome. It also allows the use of input DNA sequencing to correct the bias due to the copy number variations in the genome.
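A toy rejection-sampling sketch of profile-matched simulation: tabulate the positional nucleotide profile of real reads, then draw candidate reads from the reference and accept them in proportion to that profile. Boquila's actual sampling scheme may differ, and the genome and reads here are made up:

```python
import random

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(5000))
real_reads = ["ACGTA", "ACGTT", "ACGAA", "TCGTC"]     # made-up "true" reads
read_len = 5

# per-position nucleotide frequencies of the real reads
profile = [{b: sum(r[i] == b for r in real_reads) / len(real_reads)
            for b in "ACGT"} for i in range(read_len)]

simulated = []
for _ in range(200000):                                # bounded rejection sampling
    if len(simulated) == 3:
        break
    start = random.randrange(len(genome) - read_len)
    cand = genome[start:start + read_len]
    # accept each candidate with probability given by the profile at every position
    if all(random.random() < profile[i][cand[i]] for i in range(read_len)):
        simulated.append(cand)
print(simulated)
```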





□ SprayNPray: user-friendly taxonomic profiling of genome and metagenome contigs

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08382-2

SprayNPray offers a quick and user-friendly, semi-automated approach, allowing users to separate contigs by taxonomy of interest. SprayNPray can be used for broad-level overviews, preliminary analyses, or as a supplement to other taxonomic classification or binning software.

SprayNPray profiles contigs using multiple metrics, including closest homologs from a user-specified reference database, gene density, read coverage, GC content, tetranucleotide frequency, and codon-usage bias.
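Two of those per-contig metrics, GC content and tetranucleotide frequency, computed on a toy contig:

```python
from collections import Counter
from itertools import product

def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)

def tetranucleotide_freq(seq: str) -> dict:
    """Frequency of each of the 256 possible tetramers in the sequence."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts.values())
    return {"".join(t): counts["".join(t)] / total
            for t in product("ACGT", repeat=4)}

contig = "ATGCGCGTATAGCGCGCATATGCGCGCTATAGCGC"
print(round(gc_content(contig), 3))
tnf = tetranucleotide_freq(contig)
print(sorted(tnf.items(), key=lambda kv: -kv[1])[:3])   # three most frequent tetramers
```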





□ LPMX: a pure rootless composable container system

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04649-3

LPMX accelerates science by letting researchers compose existing containers and containerize tools/pipelines that are difficult to package/containerize using Conda or Singularity, thereby saving researchers’ precious time.

LPMX can minimize the overhead of splitting a large pipeline into smaller containerized components or tools to avoid conflicts between the components.

A caveat is that, compared to Singularity, the LPMX approach may put a larger burden on a central shared file system, so Singularity may scale better beyond a certain number of nodes.





□ StORF-Reporter: Finding Genes between Genes

>> https://www.biorxiv.org/content/10.1101/2022.03.31.486628v1.full.pdf

StORF-Reporter, a tool that takes an annotated genome as input and returns missed CDS genes from its unannotated regions. Stop-ORFs (StORFs), Open Reading Frames delimited by stop codons, are identified in these unannotated regions.

StORF-Reporter recovers complete coding sequences (with or without similarity to known genes) that were missing from both canonical and novel genome annotations.
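A minimal sketch of stop-to-stop ORF finding on the forward strand; StORF-Reporter itself also handles the reverse strand, GFF input/output, and filtering:

```python
STOPS = {"TAA", "TAG", "TGA"}

def stop_to_stop_orfs(seq: str, min_len: int = 30):
    """Report regions delimited by in-frame stop codons (no start codon
    required), as (start, end, frame) tuples on the forward strand."""
    orfs = []
    for frame in range(3):
        last_stop = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - last_stop >= min_len:
                    orfs.append((last_stop, i + 3, frame))
                last_stop = i + 3
    return orfs

seq = "ATGAAACCCGGGTTTAAACCCGGGAAATTTCCCGGGTAAATGCCC"
print(stop_to_stop_orfs(seq, min_len=15))
```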





□ Prime-seq, efficient and powerful bulk RNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02660-8

Prime-seq, a bulk RNA-seq protocol that is as powerful and accurate as TruSeq in quantifying gene expression levels, while being more sensitive and far more cost-efficient.

The prime-seq protocol is based on the SCRB-seq and the optimized derivative mcSCRB-seq. It uses the principles of poly(A) priming, template switching, early barcoding, and UMIs to generate 3′ tagged RNA-seq libraries.





□ Communication-Efficient Cluster Scalable Genomics Data Processing Using Apache Arrow Flight

>> https://www.biorxiv.org/content/10.1101/2022.04.01.486780v1.full.pdf

This solution achieves performance similar to MPI-based HPC solutions, with the added advantages of easy programmability and transparent big-data scalability. It outperforms existing Apache Spark based solutions in terms of both computation time (2x faster) and communication overhead.

QUARTIC (QUick pArallel algoRithms for high-Throughput sequencIng data proCessing) is implemented using MPI. Although this implementation performs I/O between pre-processing stages, it still outperforms other Apache Spark based frameworks.





□ epiAneufinder: identifying copy number variations from single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2022.04.03.485795v1.full.pdf

epiAneufinder, a novel algorithm that exploits the read count information in scATAC-seq data to extract genome-wide copy number variations (CNVs) for individual cells, allowing the CNV heterogeneity present in a sample to be explored at the single-cell level.

epiAneufinder extracts single-cell copy number variations from scATAC-seq data alone, or alternatively from single-cell multiome data, without the need to supplement the data with other data modalities.
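A toy sketch of the kind of signal involved, binned per-cell read counts normalized and thresholded into gains and losses; epiAneufinder's actual algorithm segments the per-cell signal rather than thresholding individual bins, and all numbers here are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_bins = 3, 100
counts = rng.poisson(30, size=(n_cells, n_bins)).astype(float)  # binned scATAC counts
counts[1, 40:60] *= 1.5      # simulated gain in cell 1
counts[2, 10:30] *= 0.5      # simulated loss in cell 2

norm = counts / counts.mean(axis=1, keepdims=True)   # per-cell normalization
state = np.zeros_like(norm, dtype=int)
state[norm < 0.75] = -1      # loss
state[norm > 1.25] = +1      # gain
for c in range(n_cells):
    print(f"cell {c}: {(state[c] == 1).sum()} gain bins, {(state[c] == -1).sum()} loss bins")
```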





□ BIODICA: a computational environment for Independent Component Analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac204/6564219

BIODICA, an integrated computational environment for applying Independent Component Analysis (ICA) to bulk and single-cell molecular profiles, interpreting the results in terms of biological functions, and correlating them with metadata.

BIODICA automates the deconvolution of large omics datasets with optimization of deconvolution parameters, and compares the results of deconvolving independent datasets in order to distinguish reproducible signals, whether universal or specific to a particular disease, data type, or subtype.
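The core operation, ICA of an expression matrix, can be illustrated with scikit-learn's FastICA on synthetic data; BIODICA wraps this step with parameter optimization, stability analysis, and downstream interpretation:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n_genes, n_samples, k = 200, 30, 3
sources = rng.laplace(size=(n_genes, k))          # independent gene "programs"
mixing = rng.normal(size=(k, n_samples))
X = sources @ mixing + 0.1 * rng.normal(size=(n_genes, n_samples))  # genes x samples

ica = FastICA(n_components=k, random_state=0)
S = ica.fit_transform(X)      # genes x components: independent components (metagenes)
A = ica.mixing_               # samples x components: weight of each component per sample
print(S.shape, A.shape)
```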





□ acorde unravels functionally interpretable networks of isoform co-usage from single cell data

>> https://www.nature.com/articles/s41467-022-29497-w

acorde, a pipeline that successfully leverages bulk long reads and single-cell data to confidently detect alternative isoform co-expression relationships.

acorde uses a strategy to obtain noise-robust correlation estimates in scRNA-seq data, and a semi-automated clustering approach to detect modules of co-expressed isoforms across cell types.

Percentile-summarized Pearson correlations outperform both classic and single-cell-specific correlation strategies, including the proportionality methods recently proposed as among the best alternatives for measuring co-expression in single-cell data.
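A sketch of percentile-summarized correlation on synthetic data; the choice of percentiles and the toy expression model below are my own, not necessarily the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
cell_types = {"A": 40, "B": 40, "C": 40}   # cells per type
percentiles = [10, 25, 50, 75, 90]

def summarize(expr_by_type):
    """Concatenate per-cell-type percentile summaries of one isoform's expression."""
    return np.concatenate([np.percentile(x, percentiles) for x in expr_by_type])

# two isoforms co-expressed across cell types, one anti-correlated
base = {"A": 1.0, "B": 5.0, "C": 10.0}
iso1 = {ct: rng.poisson(base[ct], n) for ct, n in cell_types.items()}
iso2 = {ct: rng.poisson(base[ct], n) for ct, n in cell_types.items()}
iso3 = {ct: rng.poisson(11.0 - base[ct], n) for ct, n in cell_types.items()}

s1, s2, s3 = (summarize(list(iso.values())) for iso in (iso1, iso2, iso3))
print(round(np.corrcoef(s1, s2)[0, 1], 2))   # high positive correlation
print(round(np.corrcoef(s1, s3)[0, 1], 2))   # negative correlation
```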