lens, align.

Long is the time, but what is true comes to pass.

ZAHRADA.

2022-03-31 03:13:31 | Science News



“One thought fills immensity.”




□ ptdalgorithms: Graph-based algorithms for phase-type distributions

>> https://www.biorxiv.org/content/10.1101/2022.03.12.484077v1.full.pdf

ptdalgorithms implements graph-based algorithms for constructing and transforming unrewarded and rewarded continuous and discrete phase-type distributions, and for computing their moments and distribution functions.

Through generalized iterative state-space construction, ptdalgorithms allows the computation of moments for huge state spaces, as well as of the state probability vector of the underlying Markov chains of both time-homogeneous and time-inhomogeneous phase-type distributions.





□ SIEVE: joint inference of single-nucleotide variants and cell phylogeny from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.03.24.485657v1.full.pdf

Previous methods do not operate within a statistical phylogenetic framework; in particular, they do not infer the branch lengths of the tree. Moreover, some of them fully follow the infinite-sites assumption (ISA).

SIEVE (SIngle-cell EVolution Explorer) exploits raw read counts for all nucleotides from scDNA-seq to reconstruct the cell phylogeny and call variants based on the inferred phylogenetic relations. SIEVE employs a statistical phylogenetic model that follows the finite-sites assumption.





□ Sobolev Alignment: Identifying commonalities between cell lines and tumors at the single cell level using Sobolev Alignment of deep generative models

>> https://www.biorxiv.org/content/10.1101/2022.03.08.483431v1.full.pdf

Sobolev Alignment is a computational framework that uses deep generative models to capture non-linear processes in single-cell RNA sequencing data and kernel methods to align and interpret these processes.

Recent works have shown theoretical connections, demonstrating, for instance, the equivalence between the Laplacian kernel and the so-called Neural Tangent Kernel.

The interpretation scheme relies on the decomposition of the Gaussian kernel, which we extended to the Laplacian kernel by exploiting connections between the feature spaces of Gaussian and Laplacian kernels.

The mapping towards the latent factors is approximated using Falkon-trained kernel machines, which makes it possible to calculate the contribution of each gene to each latent factor. A consensus space is then constructed by interpolation between matched Sobolev Principal Vectors, onto which all data can be projected.





□ scAllele: a versatile tool for the detection and analysis of variants in scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486330v1.full.pdf

scAllele is a versatile tool that performs both variant calling and functional analysis of the variants in alternative splicing using scRNA-seq. As a variant caller, scAllele reliably identifies SNVs and microindels (less than 20 bases) at low coverage.

scAllele calls nucleotide variants via local reassembly and enables read-level allelic linkage analysis. It refines read alignments and possible misalignments, and enhances variant detection accuracy per read. scAllele uses a generalized linear model (GLM) to detect high-confidence variants.





□ The complexity of the Structure and Classification of Dynamical Systems

>> https://arxiv.org/pdf/2203.10655v1.pdf

A survey of the complexity of structure, anti-structure, classification and anti-classification results in dynamical systems. It focuses primarily on ergodic theory, with excursions into topological dynamical systems, and suggests methods and problems in related areas.

Every perfect Polish space contains a non-Borel analytic set. Moreover, the analytic sets are closed under countable intersections and unions. Hence the co-analytic sets are also closed under unions and intersections.

Are there complete numerical invariants for orientation preserving diffeomorphisms of the circle up to conjugation by orientation preserving diffeomorphisms?





□ A glimpse of the toposophic landscape: Exploring mathematical objects from custom-tailored mathematical universes

>> https://arxiv.org/pdf/2204.00948.pdf

There are toposes in which the axiom of choice and the intermediate value theorem from undergraduate calculus fail, toposes in which any function R → R is continuous and toposes in which infinitesimal numbers exist.

In the semantic view, the effective topos is an alternative universe which contains its own version of the natural numbers. “There are infinitely many primes in Eff” is equivalent to the statement “for any number n, there effectively exists a prime number p > n”.
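
One informal way to read "effectively exists" is that some program, given n, outputs a prime p > n. The Python sketch below is only an illustration of such a witness; the names `is_prime` and `next_prime` are made up here and trial division is used purely for brevity.

```python
def is_prime(m: int) -> bool:
    """Trial-division primality test (adequate for small m)."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def next_prime(n: int) -> int:
    """Return the smallest prime p with p > n.

    The loop terminates for every n because there are infinitely many primes.
    """
    p = max(n + 1, 2)
    while not is_prime(p):
        p += 1
    return p
```

For example, `next_prime(10)` returns 11 and `next_prime(13)` returns 17.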





□ ALFATClust: Clustering biological sequences with dynamic sequence similarity threshold

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04643-9

ALFATClust exploits rapid pairwise alignment-free sequence distance calculations and community detection. Although ALFATClust computes a full Mash distance matrix for its graph clustering, the matrix can be significantly reduced using a divide-and-conquer approach.
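
To make the distance matrix concrete, here is a minimal sketch of the Mash distance, D = -ln(2j / (1 + j)) / k, where j is the Jaccard index of the two k-mer sets. It assumes exact k-mer sets instead of MinHash sketches, a toy k = 4, and hypothetical helper names; it is not ALFATClust's code and omits the divide-and-conquer reduction.

```python
import math


def kmers(seq: str, k: int) -> set:
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(a: str, b: str, k: int = 4) -> float:
    """Mash distance computed from the exact Jaccard index of k-mer sets.

    Assumption: exact sets stand in for MinHash sketches, and k = 4 is a
    toy value suited to short example strings.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        raise ValueError("both sequences are shorter than k")
    j = len(ka & kb) / len(union)
    if j == 0.0:
        return 1.0  # by convention, no shared k-mers gives distance 1
    # ln((1 + j) / (2j)) / k is the same as -ln(2j / (1 + j)) / k
    return math.log((1 + j) / (2 * j)) / k


def distance_matrix(seqs, k: int = 4):
    """Full pairwise distance matrix as a list of lists."""
    n = len(seqs)
    return [[mash_distance(seqs[r], seqs[c], k) for c in range(n)]
            for r in range(n)]
```

Identical sequences get distance 0 and sequences with no shared k-mers get distance 1; everything else falls in between.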

ALFATClust is conceptually similar to hierarchical agglomerative clustering since its algorithm begins with each sequence (vertex) as a singleton graph cluster, and the graph clusters are gradually merged through iterations with decreasing resolution parameter γ.





□ The Graphical R2D2 Estimator for the Precision Matrices

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485374v1.full.pdf

Graphical R2D2 (R2-induced Dirichlet Decomposition) draws Monte Carlo samples from the posterior distribution based on the graphical R2D2 prior, to estimate the precision matrix for multivariate Gaussian data.

The GR2D2 estimator has attractive properties for estimating precision matrices, such as greater concentration near the origin and heavier tails than current shrinkage priors.

When the true precision matrix is sparse and of high dimension, the graphical R2D2 hierarchical model provides estimates close to the true distribution in Kullback-Leibler divergence and with the smallest bias for nonzero elements.





□ PORTIA: Fast and accurate inference of Gene Regulatory Networks through robust precision matrix estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac178/6553011

The possible cell transcriptional states are determined by the underlying Gene Regulatory Network (GRN), and reliably inferring such a network would be invaluable for understanding biological processes and disease progression.

PORTIA is a novel algorithm for GRN inference based on power transforms and covariance matrix inversion. A key aspect of GRN inference is the need to disentangle direct from indirect correlations. PORTIA has thus been conceptually inspired by Direct Coupling Analysis methods.





□ CAISC: A software to integrate copy number variations and single nucleotide mutations for genetic heterogeneity profiling and subclone detection by single-cell RNA sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04625-x

Clonal Architecture with Integration of SNV and CNV (CAISC) is an R package for scRNA-seq data analysis that clusters single cells into distinct subclones by integrating CNV and SNV genotype matrices using an entropy-weighted approach.

Entropy measures the structural complexity of a network, thus its concept can be utilized to integrate multiple weighted graphs or networks, or in this case, to integrate the cell–cell distance matrices generated by the DENDRO and infercnv analyses.





□ Haplotype-resolved assembly of diploid genomes without parental data

>> https://www.nature.com/articles/s41587-022-01261-x

An algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents.

The algorithm consistently outperforms existing single-sample assembly pipelines and generates assemblies of similar quality to the best pedigree-based assemblies.

It reduces unitig bipartition to a graph max-cut problem and finds a near-optimal solution with a stochastic algorithm in the spirit of simulated annealing. It also considers the topology of the assembly graph to reduce the chance of getting stuck in local optima.
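
The max-cut formulation with simulated annealing can be illustrated on a toy weighted graph. The sketch below is a generic annealer with single-node flips; the function names, the cooling schedule, and all parameter values are assumptions for illustration and do not reproduce the published algorithm or its use of assembly graph topology.

```python
import math
import random


def cut_weight(edges, side):
    """Total weight of edges whose endpoints lie on different sides."""
    return sum(w for u, v, w in edges if side[u] != side[v])


def anneal_max_cut(nodes, edges, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Approximate max-cut by simulated annealing with single-node flips.

    nodes: list of node names; edges: list of (u, v, weight).
    Returns (best_side, best_weight), where best_side maps node -> 0 or 1.
    """
    rng = random.Random(seed)
    side = {n: rng.randint(0, 1) for n in nodes}
    current = cut_weight(edges, side)
    best_side, best = dict(side), current
    temp = t0
    for _ in range(steps):
        n = rng.choice(nodes)
        side[n] ^= 1                      # propose moving one node across
        proposed = cut_weight(edges, side)
        delta = proposed - current
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = proposed            # accept the move
            if current > best:
                best_side, best = dict(side), current
        else:
            side[n] ^= 1                  # reject: undo the flip
        temp = max(temp * cooling, 1e-9)  # geometric cooling
    return best_side, best


# Hypothetical toy graph: four "unitigs" linked in a cycle.
nodes = ["u1", "u2", "u3", "u4"]
edges = [("u1", "u2", 1.0), ("u2", "u3", 1.0),
         ("u3", "u4", 1.0), ("u4", "u1", 1.0)]
partition, weight = anneal_max_cut(nodes, edges)
```

Accepting some worsening moves while the temperature is high is what lets the search leave local optima; the best partition seen so far is kept separately so that late exploratory moves cannot lose it.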





□ Gfastats: conversion, evaluation and manipulation of genome sequences using assembly graphs

>> https://www.biorxiv.org/content/10.1101/2022.03.24.485682v1.full.pdf

Gfastats is a standalone tool to compute assembly summary statistics and manipulate assembly sequences in fasta, fastq, or gfa [.gz] format. Gfastats stores assembly sequences internally in a gfa-like format.

Gfastats builds a bidirected graph representation of the assembly using adjacency lists, where each node is a segment and each edge is a gap. Walking the graph makes it possible to generate different kinds of outputs, including manipulated assemblies and feature coordinates.





□ SEACells: Inference of transcriptional and epigenomic cellular states from single-cell genomics data

>> https://www.biorxiv.org/content/10.1101/2022.04.02.486748v1.full.pdf

SEACells outperforms existing algorithms in identifying accurate, compact, and well-separated metacells in both RNA and ATAC modalities across datasets with discrete cell types and continuous trajectories.

SEACells improves gene-peak associations, computes ATAC gene scores and measures gene accessibility. Using a count matrix as input, it provides per-cell weights for each metacell, per-cell hard assignments to each metacell, and the aggregated counts for each metacell as output.





□ Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing

>> https://www.nature.com/articles/s41587-022-01221-5

An approach for ultra-rapid nanopore WGS that combines an optimized sample preparation protocol, distributing sequencing over 48 flow cells, near real-time base calling and alignment, accelerated variant calling and fast variant filtration.

This cloud-based pipeline scales compute-intensive base calling and alignment across 16 instances with 4× Tesla V100 GPUs each, running concurrently. It aims for maximum resource utilization, with base calling using Guppy running on GPU and alignment using Minimap2 running on CPU.





□ PEER: Transcriptome diversity is a systematic source of variation in RNA-sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009939

Probabilistic estimation of expression residuals (PEER), which infers broad variance components in gene expression measurements, has been used to account for some systematic effects, but it has remained challenging to interpret these PEER factors.

PEER “hidden” covariates encode transcriptome diversity, a simple metric based on Shannon entropy. Transcriptome diversity explains a large portion of the variability in gene expression and is the strongest known factor encoded in PEER factors.
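
As a rough illustration of an entropy-based diversity metric, the sketch below computes the Shannon entropy of one sample's expression proportions. The function name, the use of raw counts, and the natural logarithm are assumptions; the paper's exact normalization may differ.

```python
import math


def transcriptome_diversity(counts):
    """Shannon entropy (in nats) of the expression proportions of one sample.

    counts: non-negative expression values, one per gene.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no counts")
    entropy = 0.0
    for c in counts:
        if c > 0:                 # zero-count genes contribute nothing
            p = c / total
            entropy -= p * math.log(p)
    return entropy
```

A sample whose reads are spread evenly over four genes has entropy ln 4, while a sample dominated by a single gene has entropy 0.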





□ DeepAcr: Predicting Anti-CRISPR with Deep Learning

>> https://www.biorxiv.org/content/10.1101/2022.04.02.486820v1.full.pdf

DeepAcr compiles a large protein sequence database to obtain secondary structure, relative solvent accessibility, evolutionary features, and Transformer features with RaptorX.

DeepAcr applies a Hidden Markov Model and uses it as a baseline for Acr classification comparison. It outperforms the baseline on macro-average metrics. Thus, DeepAcr is an unbiased predictor. DeepAcr captures the evolutionarily conserved pattern and the interaction between anti-CRISPR.





□ RecGen: Prediction of designer-recombinases for DNA editing with generative deep learning

>> https://www.biorxiv.org/content/10.1101/2022.04.01.486669v1.full.pdf

RecGen is an algorithm for the intelligent generation of designer-recombinases. Trained with 89 evolved recombinase libraries and their respective target sites, RecGen captures the affinities between the recombinase sequences and their respective DNA binding sequences.

RecGen uses a CVAE (Conditional Variational Autoencoder) architecture for recombinase prediction. The latent space is designed to resemble a multivariate normal distribution. For each latent space dimension, a mean and a standard deviation are learned for normal distribution sampling.





□ BiTSC2: Bayesian inference of tumor clonal tree by joint analysis of single-cell SNV and CNA data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac092/6562684

BiTSC2 takes raw reads from scDNA-seq as input, accounts for the overlap of CNA and SNV, models the allelic dropout rate, sequencing errors and missing rate, and assigns single cells to subclones.

By applying Markov Chain Monte Carlo sampling, BiTSC2 can simultaneously estimate the subclonal scCNA and scSNV genotype matrices. BiTSC2 shows high accuracy in genotype recovery, subclonal assignment and tree reconstruction.





□ LSMMD-MA: Scaling multimodal data integration for single-cell genomics data analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.23.485536v1.full.pdf

MMD-MA is a method for analyzing multimodal data that relies on mapping the observed cell samples to embeddings, using functions belonging to a Reproducing Kernel Hilbert Space.

LSMMD-MA is a large-scale Python implementation of the MMD-MA method for multimodal data integration. It reformulates the MMD-MA optimization problem using linear algebra and solves it with KeOps, a CUDA framework for symbolic matrix computation.
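
The quantity at the heart of MMD-MA is the maximum mean discrepancy between two sets of embeddings. A minimal pure-Python sketch of the biased squared-MMD estimate with a Gaussian kernel follows; the function names and the bandwidth `sigma` are illustrative assumptions, and this is neither the KeOps-based implementation nor the full MMD-MA objective.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D embeddings of cells from two modalities.
modality_a = [(0.0, 0.0), (1.0, 0.0)]
modality_b = [(5.0, 5.0), (6.0, 5.0)]
far = mmd_squared(modality_a, modality_b)
same = mmd_squared(modality_a, modality_a)
```

The estimate is zero when both samples coincide and grows as the two embedded distributions move apart, which is why minimizing it pulls the modalities together in the shared space.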





□ CNETML: Maximum likelihood inference of phylogeny from copy number profiles of spatio-temporal samples

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484889v1.full.pdf

CNETML is a new maximum likelihood method based on a novel evolutionary model of copy number alterations (CNAs) to infer phylogenies from spatio-temporal samples taken within a single patient.

CNETML is the first program to jointly infer the tree topology, node ages, and mutation rates from total copy numbers when samples were taken at different time points. The change of copy number at each site follows a continuous-time non-reversible Markov chain.





□ BISER: Fast characterization of segmental duplication structure in multiple genome assemblies

>> https://almob.biomedcentral.com/articles/10.1186/s13015-022-00210-2

BISER (Brisk Inference of Segmental duplication Evolutionary stRucture) is a fast tool for detecting and decomposing segmental duplications in genome assemblies. BISER infers elementary and core duplicons and enables an evolutionary analysis of all SDs in a given set of genomes.

BISER uses a two-tiered local chaining algorithm from SEDEF, based on a seed-and-extend approach and an efficient O(n log n) chaining method, followed by a SIMD-parallelized sparse dynamic programming algorithm to calculate the boundaries of the final SD regions and their alignments.





□ NIFA: Non-negative Independent Factor Analysis disentangles discrete and continuous sources of variation in scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac136/6550501

NIFA (Non-negative Independent Factor Analysis) is a new probabilistic single-cell factor analysis model that incorporates different interpretability-inducing assumptions into a single modeling framework.

NIFA models uni- and multi-modal latent factors, and isolates discrete cell-type identity and continuous pathway activity into separate components. NIFA-derived factors outperform results from ICA, PCA, NMF and scCoGAPS in terms of disentangling biological sources of variation.





□ Coverage-preserving sparsification of overlap graphs for long-read assembly

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484715v1.full.pdf

Accordingly, problem formulations for genome assembly which seek a single genome reconstruction, e.g., by finding a Hamiltonian cycle in an overlap graph, or computing the shortest common superstring of input reads, are not used in practice.

The paper presents a novel theoretical framework that computes a directed multi-graph structure which is a sub-graph of the overlap graph and is guaranteed to be coverage-preserving.

The safe graph sparsification rules for vertex and edge removal from the overlap graph O_k(R), k ≤ l_2, guarantee that all circular strings ∈ C(R, l_1, l_2, φ) can be spelled in the sparse graph.





□ Quantum algorithmic randomness

>> https://arxiv.org/pdf/2008.03584.pdf

Quantum Martin-Löf randomness (q-MLR) for infinite qubit sequences has previously been introduced. The authors define a notion of quantum Solovay randomness which is equivalent to q-MLR. The proof of this goes through a purely linear algebraic result about approximating density matrices by subspaces.

Quantum-K is intended to be a quantum version of K, the prefix-free Kolmogorov complexity. Weak Solovay random states have a characterization in terms of the incompressibility of their initial segments: ρ is weak Solovay random ⇔ ∀ε > 0, lim_n QK^ε(ρ_n) − n = ∞.





□ mm2-ax: Accelerating Minimap2 for accurate long read alignment on GPUs

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483575v1.full.pdf

Chaining in mm2 identifies optimal collinear ordered subsets of anchors from the input sorted list of anchors. mm2 does a sequential pass over all the predecessors and does sequential score comparisons to identify the best scoring predecessor for every anchor.
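
The sequential predecessor scan can be sketched as a small dynamic program. The example below uses a simplified gap cost (a linear penalty on the difference between reference and query gaps) and a hypothetical `max_pred` look-back window; it illustrates the idea and is not minimap2's actual scoring function.

```python
def chain_anchors(anchors, max_pred=50, gap_penalty=0.5):
    """Score chains of collinear anchors with a sequential predecessor scan.

    anchors: list of (ref_pos, query_pos, weight), sorted by ref_pos.
    Returns (best_score, chain), where chain lists anchor indices in order.
    """
    n = len(anchors)
    if n == 0:
        return 0.0, []
    score = [0.0] * n
    pred = [-1] * n
    for i, (xi, yi, wi) in enumerate(anchors):
        score[i] = float(wi)                 # chain made of anchor i alone
        # Sequential pass over a bounded window of earlier anchors.
        for j in range(max(0, i - max_pred), i):
            xj, yj, _ = anchors[j]
            dx, dy = xi - xj, yi - yj
            if dx <= 0 or dy <= 0:
                continue                     # j cannot precede i collinearly
            cand = score[j] + wi - gap_penalty * abs(dx - dy)
            if cand > score[i]:
                score[i], pred[i] = cand, j  # best predecessor so far
    end = max(range(n), key=score.__getitem__)
    chain = []
    k = end
    while k != -1:                           # backtrack through predecessors
        chain.append(k)
        k = pred[k]
    return score[end], chain[::-1]


# Hypothetical anchors; the third one is off-diagonal and should be skipped.
anchors = [(10, 10, 15), (30, 30, 15), (40, 90, 15), (50, 50, 15)]
best_score, best_chain = chain_anchors(anchors)
```

Here the best chain is anchors 0, 1 and 3 with a score of 45. Each anchor's inner loop depends on the scores of earlier anchors, which is the dependency that makes the step hard to parallelize as written.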

mm2-ax (minimap2-accelerated) is a heterogeneous software-hardware co-design for accelerating the chaining step of minimap2. It extracts better intra-read parallelism from chaining without losing mapping accuracy by forward transforming Minimap2’s chaining algorithm.

mm2-ax demonstrates a 12.6-5× speedup and a 9.44-3.77× speedup:costup over the SIMD-vectorized mm2-fast baseline. mm2-ax converts the sparse vector that defines the chaining workload into a dense one in order to achieve better arithmetic intensity.





□ scINSIGHT for interpreting single-cell gene expression from biologically heterogeneous data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02649-3

Based on a novel matrix factorization model, scINSIGHT learns coordinated gene expression patterns that are common among or specific to different biological conditions, offering a unique chance to jointly identify heterogeneous biological processes and diverse cell types.

scINSIGHT achieves sparse, interpretable, and biologically meaningful decomposition. scINSIGHT simultaneously identifies common and condition-specific gene modules and quantifies their expression levels in each sample in a lower-dimensional space.





□ Gradient-k: Improving the performance of K-Means using the density gradient

>> https://www.biorxiv.org/content/10.1101/2022.03.30.486343v1.full.pdf

Gradient-k reduces the number of iterations required for convergence. This is achieved by correcting the distance used in the k-means algorithm by a factor based on the angle between the density gradient and the direction to the cluster center.

Gradient-k uses auxiliary information about how the data is distributed in space, enabling it to detect clusters regardless of their density, shape, and size. Gradient-k allows non-linear splits, can find clusters of non-Gaussian shapes, and has a reduced tessellation behavior.





□ Multigrate: single-cell multi-omic data integration

>> https://www.biorxiv.org/content/10.1101/2022.03.16.484643v1.full.pdf

Multigrate equipped with transfer learning enables mapping a query multimodal dataset into an existing reference atlas.

Multigrate learns a joint latent space combining information from multiple modalities from paired and unpaired measurements while accounting for technical biases within each modality.





□ Gapless provides combined scaffolding, gap filling and assembly correction with long reads

>> https://www.biorxiv.org/content/10.1101/2022.03.08.483466v1.full.pdf

The included assembly correction can remove errors in the initial assembly that are highlighted by the long-reads. The necessary mapping and consensus calling are performed with minimap2 and racon, but this can be quickly changed in the short accompanying bash script.

The scaffold module is the core of gapless. It requires the split assembly to extract the names and length of existing scaffolds, the alignment of the split assembly to itself to detect repeats and the alignment of the long reads to the split assembly.

The long read alignments are initially filtered, requiring a minimum mapping quality and alignment length. In the case of PacBio, only one subread per fragment is kept to avoid giving large weight to short DNA fragments that are sequenced multiple times.





□ DiSCERN - Deep Single Cell Expression ReconstructioN for improved cell clustering and cell subtype and state detection

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483600v1.full.pdf

DISCERN is based on a modified Wasserstein Autoencoder. It allows for the realistic reconstruction of gene expression information by transferring the style of high-quality (hq) data onto low-quality (lq) data, in latent and gene space.

DISCERN transfers the “style” of hq data onto lq data to reconstruct missing gene expression, operating in a lower-dimensional representation. DISCERN models gene expression (GE) values realistically while retaining prior and vital biological information of the lq dataset after reconstruction.





□ DNA co-methylation has a stable structure and is related to specific aspects of genome regulation

>> https://www.biorxiv.org/content/10.1101/2022.03.16.484648v1.full.pdf

Highly correlated DNAm sites in close proximity are highly heritable, influenced by nearby genetic variants (cis mQTLs), and are enriched for transcription factor binding sites related to regulation of short RNAs essential for cellular function transcribed by RNA polymerase III.

DNA co-methylation of distant sites may be related to long-range cooperative TF interactions. Highly correlated sites that are either distant, or on different chromosomes, are driven by unique environmental factors, and methylation is less likely to be driven by genotype.





Element Biosciences

>> https://www.elementbiosciences.com/products/aviti

High data quality and throughput enable whole genome sequencing for rare disease. Our study with UCSD is the first of its kind to demonstrate the clinical potential of #AVITI System on previously unsolved cases.
#NGS #AviditySequencing

Comparative analysis shows Loopseq has the lowest error rate of all commercially available long read sequencing technologies.

>> https://www.elementbiosciences.com/news/element-launches-the-aviti-system-to-democratize-access-to-genomics


Jim Tananbaum

I'm excited to support the team at @ElemBio as they unveil their benchtop sequencer AVITI. I believe sequencing will touch all our lives. To enable it, we need high quality, inexpensive sequencing.



Svatyně.

2022-03-31 03:13:17 | Science News




□ sc-CGconv: A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009600

sc-CGconv is a stepwise, robust, unsupervised feature extraction and clustering approach that formulates and aggregates cell–cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach.

sc-CGconv formulates a cell-cell graph using Ccor that is learned by a graph-based artificial intelligence model, graph convolution network. sc-CGconv provides a topology-preserving embedding of cells in low dimensional space.





□ RegScaf: a Regression Approach to Scaffolding

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac174/6554191

RegScaf examines the distribution of distances between contigs from read alignment by the kernel density. When multiple modes are shown in a density, orientation-supported links are grouped into clusters, each of which defines a linking distance corresponding to a mode.

The linear model parameterizes contigs by their positions on the genome; then each linking distance between a pair of contigs is taken as an observation on the difference of their positions.

The parameters are estimated by minimizing a global loss function, which is a version of trimmed sum of squares. The least trimmed squares estimate has such a high breakdown value that it can automatically remove the mistaken linking distances.
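
The effect of trimming is easiest to see in one dimension. The sketch below gives the trimmed sum of squares and an exact least trimmed squares (LTS) location estimate for a single parameter; RegScaf itself fits a multi-parameter linear model over contig positions, so the function names and the toy data are illustrative assumptions only.

```python
def trimmed_sum_of_squares(residuals, h):
    """Sum of the h smallest squared residuals."""
    return sum(sorted(r * r for r in residuals)[:h])


def lts_location(values, h):
    """Exact univariate LTS location estimate.

    The optimal h-subset is contiguous in sorted order, so every window of
    h consecutive order statistics is tried; the window with the smallest
    sum of squared deviations from its own mean wins.
    Returns (estimate, trimmed_loss).
    """
    if not 1 <= h <= len(values):
        raise ValueError("h must be between 1 and len(values)")
    xs = sorted(values)
    best_mean, best_loss = None, float("inf")
    for start in range(len(xs) - h + 1):
        window = xs[start:start + h]
        mean = sum(window) / h
        loss = sum((x - mean) ** 2 for x in window)
        if loss < best_loss:
            best_mean, best_loss = mean, loss
    return best_mean, best_loss


# Hypothetical linking distances between two contigs; 900 is a mistaken link.
distances = [100, 102, 98, 101, 99, 900]
estimate, loss = lts_location(distances, h=5)
```

With h = 5 the estimate is 100.0 with a trimmed loss of 10.0, whereas the ordinary mean of all six values would be pulled far away by the single bad link.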

RegScaf outperforms other scaffolders, especially in the accuracy of gap estimates, by substantially reducing extremely abnormal errors. Its strength in resolving repeat regions is exemplified by a real case. Its adaptability to large genomes and TGS long reads is validated as well.





□ DCATS: differential composition analysis for complex single-cell experimental designs

>> https://www.biorxiv.org/content/10.1101/2022.03.21.485232v1.full.pdf

DCATS improves composition analysis by accounting for uncertainty in the classification of cell types in differential abundance analysis. DCATS detects differential abundance using a beta-binomial generalized linear model (GLM), which returns the estimated coefficients.

DCATS can account for covariates or test multiple covariates jointly in their association with the composition abundance of each cell type. DCATS corrects the misclassification bias based on the similarity matrix, so the estimation of this matrix is an important step.





□ L-GIREMI uncovers RNA editing sites in long-read RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.03.23.485515v1.full.pdf

L-GIREMI (Long-read GIREMI) effectively handles sequencing errors and biases in the reads, and uses a model-based approach to score RNA editing sites. Applied to PacBio long-read RNA-seq data, L-GIREMI affords high accuracy in RNA editing identification.

L-GIREMI examines the linkage patterns between sequence variants in the same reads, complemented by a model-driven approach. The performance of L-GIREMI is robust across a wide range of total read coverage.





□ ggtranscript: an R package for the visualization and interpretation of transcript isoforms using ggplot2

>> https://www.biorxiv.org/content/10.1101/2022.03.28.486050v1.full.pdf

As a ggplot2 extension, ggtranscript inherits a vast amount of flexibility when determining the plot aesthetics, as well as interoperability with existing ggplot2 geoms and ggplot2 extensions.

ggtranscript enables a fast and simplified way to visualize, explore and interpret transcript isoforms. It allows users to combine data from both long-read and short-read RNA-sequencing technologies, making systematic assessment of transcript support easier.





□ CoLoRd: compressing long reads

>> https://www.nature.com/articles/s41592-022-01432-3

CoLoRd is an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.

Equipped with an overlap-based algorithm for compressing the DNA stream and a lossy processing of the quality information, it allows even tenfold space reduction compared to gzip, without affecting down-stream analyses like variant calling or consensus generation.





□ scChromHMM: Characterizing cellular heterogeneity in chromatin state with scCUT&Tag-pro

>> https://www.nature.com/articles/s41587-022-01250-0

Single-cell (sc)CUT&Tag-pro is a multimodal assay for profiling protein–DNA interactions coupled with the abundance of surface proteins in single cells.

Single-cell ChromHMM integrates data from multiple experiments to infer and annotate chromatin states based on combinatorial histone modification patterns.





□ scMAGS: Marker gene selection from scRNA-seq data for spatial transcriptomics studies

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485261v1.full.pdf

scMAGS uses a filtering step in which the candidate genes are extracted prior to the marker gene selection step. For the selection of marker genes, cluster validity indices, Silhouette index or Calinski-Harabasz index (for large datasets) are utilized.
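
For reference, the Silhouette index mentioned above can be computed as the mean of (b - a) / max(a, b) over all points, where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. The sketch below is a plain implementation with Euclidean distance; the function name and the toy data are assumptions, and how scMAGS applies the index to candidate genes is not reproduced here.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette coefficient with Euclidean distance.

    points: list of coordinate tuples; labels: cluster label per point.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette defined as 0
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 1-D expression values for four cells in two cell types.
cells = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_index(cells, [0, 0, 1, 1])
bad = silhouette_index(cells, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters, so the labelling that groups nearby cells scores much higher than the one that mixes them.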

scMAGS calculates the expression rates of all genes in all cell types. The count matrix should be normalized to reduce the bias. The number of reads for a gene in each cell is expected to be proportional to the gene-specific expression level and cell-specific scaling factors.





□ SMetABF: A rapid algorithm for Bayesian GWAS meta-analysis with a large number of studies included

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009948

SMetABF is a method based on Markov chain Monte Carlo (MCMC) and its extension, shotgun stochastic search (SSS), to speed up the process of subset selection. Simulations show SSS to be superior in speed, accuracy, and stability.

The SSS algorithm can reach the maximum ABF in a short time with a small number of iterations. In contrast, the MCMC algorithm can hardly find the maximum ABF even in a longer time. Large-scale multi-phenotypic meta-analyses will be possible through SMetABF.





□ CIAlign: A highly customisable command line tool to clean, interpret and visualise multiple sequence alignments

>> https://peerj.com/articles/12983/

CIAlign is particularly targeted towards users working with complex or highly divergent alignments, partial sequences and problematic assemblies, and towards those developing complex pipelines requiring fine-tuning of parameters to meet specific criteria.

When running CIAlign with all core functions and for fixed gap proportions, the runtime scales quadratically with the size of the MSA, i.e. with n as the number of sequences and m the length of the MSA, the worst-case time complexity is O((nm)^2).





□ scPipeline: Multi-level cellular and functional annotation of single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2022.03.13.484162v1.full.pdf

scPipeline is a modular collection of Rmarkdown scripts. The modular framework permits flexible usage and facilitates QC & preprocessing, integration, cluster optimization, cell annotation, gene expression and association analyses, and gene program discovery.

Scale-free Shared Nearest neighbor network (SSN) analysis is introduced as an approach to identify and functionally annotate gene sets in an unsupervised manner, providing an additional layer of functional characterization of scRNA-seq data.





□ ScanExitronLR: characterization and quantification of exitron splicing events in long-read RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485864v1.full.pdf

ScanExitronLR is an application for the characterization and quantification of exitron splicing events in long reads. From a BAM alignment file, a reference genome and a reference gene annotation, ScanExitronLR outputs exitron events at the transcript level.

ScanExitronLR executes calling and filtering processes for each chromosome in parallel. For every exitron that passes filtering, it examines whether reads aligning to the exitron's position which were not called in the previous step could have harbored misaligned exitrons.





□ TLVar: Exploiting deep transfer learning for the prediction of functional noncoding variants using genomic sequence

>> https://www.biorxiv.org/content/10.1101/2022.03.19.484983v1.full.pdf

The validated variants are rare due to technical difficulty and financial cost. The small sample size of validated variants makes it less reliable to develop a supervised machine learning model for achieving a whole genome-wide prediction of noncoding causal variants.

TLVar is a deep transfer learning model that consists of pretrained layers, trained on large-scale generic functional noncoding variants, and retrained layers, trained on context-specific functional noncoding variants with the pretrained layers frozen.





□ LANTSA: Landmark-based transferable subspace analysis for single-cell and spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.03.13.484116v1.full.pdf

LANTSA constructs a representation graph of samples for clustering and visualization based on a novel subspace model, which can learn a more accurate representation and whose time consumption is theoretically proven to scale linearly with data size.

LANTSA approximates the whole representation graph (i.e., sample-by-sample relationship) by representing each landmark sample as a linear combination of all samples based on a novel subspace model which preserves local structures.

LANTSA uses a dimensionality reduction as an integrative method to extract the discriminants underlying the representation structure, which enables label transfer from one learning dataset to the other prediction datasets, thus solving the massive-volume / cross-platform problem.





□ scGDC: Learning deep features and topological structure of cells for clustering of scRNA-sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac068/6549863

scGDC extends the auto-encoder by introducing a self-representation layer to extract deep features of cells, and learns an affinity graph of cells, which provides a better and more comprehensive strategy to characterize the structure of cell types.

scGDC projects cells of various types onto different subspaces, where types, particularly rare cell types, are well discriminated by utilizing generative adversarial learning.

scGDC joins deep feature extraction, structural learning and cell type discovery, where features of cells are extracted under the guidance of cell types, thereby improving performance of algorithms.





□ DeepREAL: A Deep Learning Powered Multi-scale Modeling Framework for Predicting Out-of-distribution Ligand-induced GPCR Activity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac154/6547052

DeepREAL utilizes self-supervised learning on tens of millions of protein sequences and pre-trained binary interaction classification to solve the data distribution shift and data scarcity problems.

DeepREAL is based on a new multi-stage deep transfer learning architecture that combines binary DTI pretraining and embedding with a three-way receptor activity fine-tuning to address OOD challenges using sparse receptor activity data.





□ GraphGONet: a self-explaining neural network encapsulating the Gene Ontology graph for phenotype prediction on gene expression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac147/6546279

The production of accurate and intelligible predictions can benefit from the inclusion of domain knowledge. Therefore, knowledge-based deep learning models appear to be a promising solution.

GraphGONet, where the Gene Ontology is encapsulated in the hidden layers of a new self-explaining neural network. Each neuron in the layers represents a biological concept, combining the gene expression profile of a patient, and the information from its neighboring neurons.





□ Statistical and machine learning methods for spatially resolved transcriptomics data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02653-7

Graph convolutional networks can aggregate features from each spatial location’s neighbors through convolutional layers and utilize the learned representation to perform node classification, community detection, and link prediction.

scHOT is a computational approach designed to identify changes in higher-order interactions among genes in cells along a continuous trajectory or across space. This method has also been demonstrated to be effective in spatial transcriptomics data.





□ Variomes: a high recall search engine to support the curation of genomic variants

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac146/6547047

The system can be used as a literature triage system in the same way as LitVar. It can also be used to prioritize variants to facilitate the identification of clinically actionable variants.

Variomes enables searching the biomedical literature. The collections are pre-processed with a set of medical terminologies. User queries are automatically processed to map keywords to the terminologies and expand genetic variants using a dedicated variant expansion system.





□ Generating minimum set of gRNA to cover multiple targets in multiple genomes with MINORg

>> https://www.biorxiv.org/content/10.1101/2022.03.10.481891v1.full.pdf

MINORg is an offline gRNA design tool that generates the smallest possible combination of gRNA capable of covering all desired targets in multiple non-reference genomes.

MINORg aims to lessen this workload by capitalising on sequence homology to favour multi-target gRNA while simultaneously screening multiple genetic backgrounds in order to generate reusable gRNA panels.
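Choosing a small gRNA panel that covers every target is a set-cover problem. The sketch below uses the classic greedy heuristic as an illustration; it is an assumption, not MINORg's documented algorithm, and greedy selection does not guarantee the true minimum. The function name and the gRNA/target identifiers are hypothetical.

```python
def greedy_grna_cover(grna_targets):
    """Pick gRNAs greedily until every target is covered.

    grna_targets maps a gRNA name to the set of target IDs it can hit
    (hypothetical input format, for illustration only).
    """
    uncovered = set().union(*grna_targets.values()) if grna_targets else set()
    chosen = []
    while uncovered:
        # Take the gRNA covering the most still-uncovered targets;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(grna_targets),
                   key=lambda g: len(grna_targets[g] & uncovered))
        chosen.append(best)
        uncovered -= grna_targets[best]
    return chosen


candidates = {
    "gRNA_A": {"t1", "t2"},
    "gRNA_B": {"t2", "t3", "t4"},
    "gRNA_C": {"t1"},
    "gRNA_D": {"t4"},
}
panel = greedy_grna_cover(candidates)  # gRNA_B first, then gRNA_A
```

A multi-target gRNA such as gRNA_B is picked first because it removes the most targets from the uncovered set in one step.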





□ CNV-espresso: Accurate in silico confirmation of rare copy number variant calls from exome sequencing data using transfer learning

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483665v1.full.pdf

CNV-espresso encodes candidate CNV regions from exome sequencing data as images and uses convolutional neural networks to classify the image into different copy numbers.

Taking the CNVs detected from WGS data as a proxy for ground truth, CNV-espresso significantly improves precision while keeping recall almost intact, especially for CNVs that span a small number of exons in exome data.





□ UniFuncNet: a flexible network annotation framework

>> https://www.biorxiv.org/content/10.1101/2022.03.15.484380v1.full.pdf

UniFuncNet, a network annotation framework that dynamically integrates data from multiple biological databases. If UniFuncNet finds searchable information for the other databases (in this case metacyc and hmdb) then it will also collect data from those databases.

The output from UniFuncNet can be represented as a multipartite graph, where the central layers correspond to the entity types (e.g., proteins), and the outer layers to the annotations.





□ OTUP-workflow: Target specific optimization of the transmit k-space trajectory for flexible universal parallel transmit RF pulse design

>> https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/nbm.4728

Transmit k-space trajectories (stack-of-spirals and SPINS) were optimized to best match different excitation targets using the parameters of the analytical equations of spirals and SPINS.

The OTUP-workflow (Optimization of transmit k-space Trajectories and Universal Pulse calculation) was tested on three test target excitation patterns. It emphasized the importance of a well-suited trajectory for pTx RF pulse design.





□ SavvyCNV: Genome-wide CNV calling from off-target reads

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009940

SavvyCNV finds the greatest number of true positive CNVs in all data sets. SavvyCNV calls CNVs by looking at read depth over the genome. The genome is split into bins and each bin is assessed for statistical divergence from normal copy number.

Read depth is normalised by first dividing each bin's read count by the mean read depth of the sample across all genomic locations, and then dividing by the mean read depth of the genomic location across all samples. SavvyCNV then uses singular value decomposition (SVD) to reduce noise.
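The two divisions in this normalisation can be sketched in a few lines. This is a minimal illustration under assumptions: the input layout (`counts[s][b]`), the function name, and the choice to compute the per-bin mean on the already sample-normalised values are all hypothetical, and the SVD noise-reduction step is not shown.

```python
def normalise_read_depth(counts):
    """counts[s][b] is the read count of sample s in genomic bin b.

    Assumes every sample and every bin has at least one read.
    """
    n_samples, n_bins = len(counts), len(counts[0])
    # Step 1: divide each count by its sample's mean depth over all bins.
    by_sample = [[c / (sum(row) / n_bins) for c in row] for row in counts]
    # Step 2: divide by the bin's mean depth over all samples.
    bin_means = [sum(by_sample[s][b] for s in range(n_samples)) / n_samples
                 for b in range(n_bins)]
    return [[by_sample[s][b] / bin_means[b] for b in range(n_bins)]
            for s in range(n_samples)]


counts = [[10, 20, 30],
          [20, 40, 60],
          [10, 20, 15]]   # third sample is low in the last bin
normalised = normalise_read_depth(counts)
```

After both steps, a value near 1 means the bin looks like the cohort average for that location, while the third sample's last bin drops well below 1, which is the kind of divergence a caller would then assess statistically.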





□ Adversarial attacks and adversarial robustness in computational pathology

>> https://www.biorxiv.org/content/10.1101/2022.03.15.484515v1.full.pdf

Vision transformers (ViTs) perform equally well compared to CNNs at baseline and are orders of magnitude more robust to different types of white-box and black-box attacks. This is associated with a more robust latent representation of clinically relevant categories.

ViTs are robust learners in computational pathology. This implies that large-scale rollout of AI models in computational pathology should rely on ViTs rather than CNN-based classifiers to provide inherent protection against adversaries.





□ ChromDMM: A Dirichlet-Multinomial Mixture Model For Clustering Heterogeneous Epigenetic Data

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485838v1.full.pdf

ChromDMM, a product Dirichlet-multinomial mixture model for clustering genomic regions that are characterised by multiple chromatin features.

ChromDMM extends the mixture model framework by profile shifting and flipping that can probabilistically account for inaccuracies in the position and strand-orientation. ChromDMM regularises the smoothness of the epigenetic profiles across the consecutive genomic regions.





□ Phenotype to genotype mapping using supervised and unsupervised learning

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484826v1.full.pdf

This pipeline is capable of relating distinct vacuole morphologies to genetic perturbations. A mixed supervised-unsupervised learning methodology with the aim of reducing the annotation burden and the inherent bias due to the human annotation task.






□ Syrah: a Slide-seqV2 pipeline augmentation

>> https://www.biorxiv.org/content/10.1101/2022.03.20.485023v1.full.pdf

Syrah was built as an augmentation to the original Slide-seqV2 pipeline, such that it takes as input the output from the original pipeline and creates a corrected version of the data, facilitating comparison with the original pipeline’s results.

Syrah aligns the known linker sequence to each read and uses the beginning and end points of that alignment to determine where to extract the barcode and UMI segments.
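The idea of anchoring barcode and UMI extraction on the linker position, rather than on fixed read offsets, can be sketched as follows. This is a simplified illustration, not Syrah's code: it uses an exact substring search in place of a real alignment, and the linker sequence, segment lengths, and read layout (barcode part, linker, barcode part, UMI) are assumptions.

```python
def extract_barcode_umi(read, linker, bc1_len=8, bc2_len=6, umi_len=9):
    """Locate the linker and cut barcode/UMI segments relative to it.

    Returns (barcode, umi), or None if the linker is absent or the read
    is too short. All lengths are hypothetical defaults.
    """
    start = read.find(linker)          # exact match stands in for alignment
    if start < bc1_len:                # covers "not found" (-1) as well
        return None
    end = start + len(linker)
    if len(read) < end + bc2_len + umi_len:
        return None
    barcode = read[start - bc1_len:start] + read[end:end + bc2_len]
    umi = read[end + bc2_len:end + bc2_len + umi_len]
    return barcode, umi


LINKER = "TCTTCAGCGTTCCCGAGA"          # illustrative linker sequence
clean = "AAAACCCC" + LINKER + "GGGGTT" + "ACGTACGTA"
shifted = "N" + clean                  # one extra base at the read start
```

Both `clean` and `shifted` yield the same barcode and UMI, whereas cutting at fixed positions would be off by one base for the shifted read.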





□ EDClust: An EM-MM hybrid method for cell clustering in multiple-subject single-cell RNA sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac168/6551990

EDClust adopts a Dirichlet-multinomial mixture model and explicitly accounts for cell type heterogeneity, subject heterogeneity, and clustering uncertainty.

An EM-MM hybrid algorithm is derived for maximizing the data likelihood and clustering the cells. EDClust offers functions for predicting cell type labels, estimating parameters of effects from different sources, and posterior probabilities for cells being in each cluster.





□ DCLEAR: Single cell lineage reconstruction using distance-based algorithms

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04633-x

This method consists of two steps: Distance matrix estimation and the tree reconstruction from the distance matrix. Two of the more sophisticated distance methods display a substantially improved level of performance compared to the traditional Hamming distance method.

The algorithm used to compute the k-mer replacement distance (KRD) method first uses the prominence of mutations in the character arrays to estimate the summary statistics used for the generation of the tree to be reconstructed.





□ Parallel sequence tagging for concept recognition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04511-y

A paradigm for biomedical concept recognition where named entity recognition (NER) and normalisation (NEN) are tackled in parallel. In a traditional NER+NEN pipeline, the NEN module is restricted to predict concept labels (IDs) for the spans identified by the NER tagger.

The system consistently achieves better scores than the baseline, which is a pipeline with a CRF-based span tagger and a BiLSTM-based concept classifier that were also trained on the CRAFT corpus alone.





□ Ontology-Aware Biomedical Relation Extraction

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485304v1.full.pdf

Extending a Recurrent Neural Network (RNN) with a Convolutional Neural Network (CNN) to process three sets of features, namely, tokens, types, and graphs.

Entity type and ontology graph structure provide better representations than simple token-based representations for RE.





□ BarWare: efficient software tools for barcoded single-cell genomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04620-2

BarWare provides a comprehensive set of tools which lowers the barrier to entry of Cell Hashing workflows for small laboratories in the field of single-cell sequencing, and should be useful for core facilities that can use cell hashing to mix and overload samples.





□ vcferr: Development, Validation, and Application of a SNP Genotyping Error Simulation Framework

>> https://www.biorxiv.org/content/10.1101/2022.03.28.485853v1.full.pdf

vcferr, a novel framework for probabilistically simulating genotyping error and missingness in VCF files. The processing runs iteratively for every site in the input VCF, with the output streamed or optionally written to a new output VCF file.

vcferr checks each genotype, and randomly draws from a list of possible genotypes (heterozygous, homozygous for the alternate allele, homozygous for the reference allele, missing) with each element weighted by error rates.
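The weighted draw described above maps directly onto `random.choices` from the Python standard library. The sketch below is a minimal illustration and not vcferr's implementation; the genotype strings, the `error_rates` layout, and the example probabilities are assumptions.

```python
import random

# Possible output genotypes: hom-ref, het, hom-alt, missing.
GENOTYPES = ["0/0", "0/1", "1/1", "./."]


def simulate_genotype(true_gt, error_rates, rng=random):
    """Draw an observed genotype for one true genotype.

    error_rates[true_gt] maps each possible output genotype to its
    probability weight (hypothetical structure, for illustration).
    """
    weights = [error_rates[true_gt].get(gt, 0.0) for gt in GENOTYPES]
    return rng.choices(GENOTYPES, weights=weights, k=1)[0]


# Made-up rates: a heterozygous call is kept 95% of the time.
rates = {"0/1": {"0/1": 0.95, "0/0": 0.02, "1/1": 0.02, "./.": 0.01}}
observed = simulate_genotype("0/1", rates, rng=random.Random(42))
```

Passing a seeded `random.Random` instance keeps a simulation reproducible, which matters when the same input VCF is processed site by site.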





□ SHOOT: phylogenetic gene search and ortholog inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02652-8

The phylogenetic tree returned by SHOOT provides the evolutionary relationships between genes inferred from multiple sequence alignment and maximum likelihood tree inference allowing orthologs and paralogs to be identified.

SHOOT also automatically identifies orthologs and colors the genes in the tree according to whether they are orthologs or paralogs, as identified using the species overlap method, which has been shown to be an accurate method for automated orthology inference.
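The species overlap rule can be illustrated on a small rooted binary gene tree: a node whose two child clades share no species is treated as a speciation, and gene pairs split by it are orthologs; otherwise the node is treated as a duplication and the pairs are paralogs. The sketch below is an assumption-laden illustration rather than SHOOT's code; the nested-tuple tree encoding and the gene names are hypothetical.

```python
from itertools import product


def leaves(node):
    """Return the (gene, species) leaves under a node."""
    if isinstance(node[0], str):       # a leaf is a (gene, species) pair
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def ortholog_pairs(node):
    """Collect gene pairs whose last common ancestor is a speciation node."""
    if isinstance(node[0], str):
        return set()
    left, right = node
    pairs = ortholog_pairs(left) | ortholog_pairs(right)
    l, r = leaves(left), leaves(right)
    if not ({s for _, s in l} & {s for _, s in r}):
        # No shared species between the child clades: speciation node.
        pairs |= {frozenset((a, b)) for (a, _), (b, _) in product(l, r)}
    return pairs


tree = ((("h1", "human"), ("m1", "mouse")),
        (("h2", "human"), ("m2", "mouse")))
orthologs = ortholog_pairs(tree)
```

Here the root is a duplication (both sides contain human and mouse), so h1/m1 and h2/m2 are orthologs while h1/m2 and h1/h2 are paralogs.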















Nebe.

2022-03-31 03:13:03 | Science News






□ MAECI: A Pipeline For Generating Consensus Sequence With Nanopore Sequencing Long-read Assembly and Error Correction

>> https://www.biorxiv.org/content/10.1101/2022.04.04.487014v1.full.pdf

The assemblies can be corrected using nanopore sequencing data and then polished with NGS data. Both approaches can mitigate some of these problems and improve the accuracy of the assemblies, but assembly errors cannot be completely avoided.

MAECI enables the assembly for nanopore long-read sequencing data. It takes nanopore sequencing data as input, uses multiple assembly algorithms to generate a single consensus sequence, and then uses nanopore sequencing data to perform self-error correction.





□ DPI: Single-cell multimodal modeling with deep parametric inference

>> https://www.biorxiv.org/content/10.1101/2022.04.04.486878v1.full.pdf

DPI, a deep parameter inference model that integrates CITE-seq/REAP-seq data. With DPI, the cellular heterogeneity embedded in the single-cell multimodal omics can be comprehensively understood from multiple views.

DPI describes the state of all cells in the sample in terms of the multimodal latent space. The multimodal latent space generated by DPI is continuous, which means that perturbing the genes/proteins of cells in the sample can find the cell state closest to it in this space.





□ MOSS: Multi-omic integration with Sparse Value Decomposition

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac179/6553658

MOSS performs a Sparse Singular Value Decomposition (sSVD) on the integrated omic blocks to obtain latent dimensions as sparse factors (i.e., with zeroed out elements), representing variability across subjects and features.

MOSS can fit supervised analyses via partial least squares, linear discriminant analysis, and low-rank regressions. Sparsity is imposed via Elastic Net on the sSVD solutions. MOSS allows an automatic tuning of the number of elements different from zero.




□ GPS-seq: The DNA-based global positioning system—a theoretical framework for large-scale spatial genomics

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485380v1.full.pdf

GPS-seq, a theoretical framework that enables massively scalable, optics-free spatial transcriptomics. GPS-seq combines data from high-throughput sequencing with manifold learning to obtain the spatial transcriptomic landscape of a given tissue section without optical microscopy.

In this framework, similar to technologies like Slide-seq and 10X Visium, tissue samples are stamped on a surface of randomly-distributed DNA-barcoded spots (or beads). The transcriptomic sequences of proximal cells are fused to DNA barcodes.

The barcode spots serve as “anchors” which also capture spatially diffused “satellite” barcodes, and therefore allow computational reconstruction of spot positions without optical sequencing or depositing barcodes to pre-specified positions.

The general framework of GPS-seq is also compatible with standard single-cell (or single-nucleus) capture methods, and any modality of single-cell genomics, such as sci-ATAC-seq, could be transformed into spatial genomics in this strategy.





□ MEDUSA: A Pipeline for Sensitive Taxonomic Classification and Flexible Functional Annotation of Metagenomic Shotgun Sequences

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.814437/full

MEDUSA performs preprocessing, assembly, alignment, taxonomic classification, and functional annotation on shotgun data, supporting user-built dictionaries to transfer annotations to any functional identifier.

MEDUSA includes several tools, such as fastp, Bowtie2, DIAMOND, Kaiju, and MEGAHIT, as well as a novel tool implemented in Python to transfer annotations to BLAST/DIAMOND alignment results.





□ NAb-seq: an accurate, rapid and cost-effective method for antibody long-read sequencing in hybridoma cell lines and single B cells

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485728v1.full.pdf

When compared to Sanger sequencing of two hybridoma cell lines, long-read ONT sequencing was highly accurate, reliable, and amenable to high throughput.

NAb-seq, a three-day, species-independent, and cost-effective workflow to characterize paired full-length immunoglobulin light and heavy chain genes from hybridoma cell lines.





□ SimSCSnTree: a simulator of single-cell DNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac169/6551250

SimSCSnTree, a new single-cell DNA sequence simulator which generates an evolutionary tree of cells and evolves single nucleotide variants (SNVs) and copy number aberrations (CNAs) along its branches.

Data generated by the simulator can be used to benchmark tools for single-cell genomic analyses, particularly in cancer where SNVs and CNAs are ubiquitous.





□ Dynamic Mantis: An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using the Bentley-Saxe Transformation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac142/6553005

An efficient algorithm for merging two Mantis indexes, tackling several scalability and efficiency obstacles along the way. The proposed algorithm targets Minimum Spanning Tree-based Mantis.

MST-based Mantis is ≈ 10× faster to construct, requires ≈ 10× less construction memory, results in ≈ 2.5× smaller indexes, and performs bulk queries ≈ 74× faster and with ≈ 100× less query memory than Bifrost.
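The Bentley-Saxe transformation named in the title turns a static structure with a merge operation into an incrementally updatable one by keeping static pieces of sizes 1, 2, 4, ... and merging equal-sized pieces on insert, like carrying in binary addition. The sketch below illustrates only that general technique: sorted Python lists stand in for static indexes, and the class name is hypothetical. It says nothing about how Mantis indexes are actually stored or merged.

```python
from bisect import bisect_left
from heapq import merge


class BentleySaxeSet:
    """Dynamic membership structure built from static sorted lists."""

    def __init__(self):
        self.levels = []   # levels[i] is None or a sorted list of 2**i items

    def insert(self, item):
        carry = [item]
        for i, level in enumerate(self.levels):
            if level is None:
                self.levels[i] = carry      # free slot: store and stop
                return
            # Slot occupied: merge two equal-sized pieces and carry upward.
            carry = list(merge(carry, level))
            self.levels[i] = None
        self.levels.append(carry)

    def __contains__(self, item):
        # A query must consult every non-empty level.
        for level in self.levels:
            if level:
                pos = bisect_left(level, item)
                if pos < len(level) and level[pos] == item:
                    return True
        return False


index = BentleySaxeSet()
for sample_id in [7, 3, 9, 1, 5]:
    index.insert(sample_id)
```

After five inserts the occupied levels hold 1 and 4 items (binary 101), so each item has been merged only a logarithmic number of times, at the cost of queries visiting several levels.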





□ Triku: a feature selection method based on nearest neighbors for single-cell data

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac017/6547682

Triku is a feature selection method that favors genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the k-NN graph. The expression of these genes is higher than the expected expression if the k cells were chosen at random.

The Wasserstein distance between the observed and the expected distributions is computed and genes are ranked according to that distance. Higher distances imply that the gene is locally expressed in a subset of transcriptomically similar cells.
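The ranking step can be sketched with a one-dimensional Wasserstein distance. For two samples of equal size, the first-order distance is the mean absolute difference between their sorted values. The sketch assumes that the observed and expected neighbourhood counts are already available as equal-sized samples; how Triku derives the expected distribution is not shown, and the function names and toy values are hypothetical.

```python
def wasserstein_1d(observed, expected):
    """First-order Wasserstein distance between two equal-sized 1-D samples."""
    if len(observed) != len(expected):
        raise ValueError("samples must have the same size")
    pairs = zip(sorted(observed), sorted(expected))
    return sum(abs(a - b) for a, b in pairs) / len(observed)


def rank_genes(knn_counts):
    """knn_counts maps gene -> (observed, expected) neighbourhood counts."""
    scores = {g: wasserstein_1d(o, e) for g, (o, e) in knn_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)


toy = {
    "marker":       ([0, 0, 9, 9], [4, 5, 4, 5]),   # concentrated locally
    "housekeeping": ([4, 5, 5, 4], [4, 5, 4, 5]),   # matches expectation
}
ranking = rank_genes(toy)
```

The "marker" gene ranks first because its neighbourhood counts are split between zero and high values, far from what random neighbourhoods would give.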





□ RF4Del: A Random Forest approach for accurate deletion detection

>> https://www.biorxiv.org/content/10.1101/2022.03.10.483419v1.full.pdf

The model consists of 13 features extracted from a mapping file. RF4Del outperforms established SV callers (DELLY, Pindel) with higher overall performance (F1-score > 0.75; 6x-12x sequence coverage) and is less affected by low sequencing coverage and deletion size variations.

RF4Del could learn from a compilation of sequence patterns linked to a given SV. Such models can then be combined to form a learning system able to detect all types of SVs in a given genome.





□ GRAPE: Genomic Relatedness Detection Pipeline

>> https://www.biorxiv.org/content/10.1101/2022.03.11.483988v1.full.pdf

GRAPE: Genomic RelAtedness detection PipelinE. It combines data preprocessing, identity-by-descent (IBD) segments detection, and accurate relationship estimation.

GRAPE has a modular architecture that allows switching between tools and adjusting tool parameters for better control of precision and recall levels. The pipeline also contains a simulation workflow with an in-depth evaluation of pipeline accuracy using simulated and reference data.





□ ClusterFoldSimilarity: A single-cell clusters similarity measure for different batches, datasets, and samples

>> https://www.biorxiv.org/content/10.1101/2022.03.14.483731v1.full.pdf

ClusterFoldSimilarity calculates a measure of similarity between clusters from different datasets/batches, without the need to correct for batch effects or to normalize and merge the data, thus avoiding artifacts and the loss of information derived from these kinds of techniques.

The similarity metric is based on the average vector module and sign of the product of logarithmic fold-changes. ClusterFoldSimilarity compares every single pair of clusters from any number of different samples/datasets, including different number of clusters for each sample.





□ HCLC-FC: a novel statistical method for phenome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2022.03.14.484203v1.full.pdf

HCLC-FC (Hierarchical Clustering Linear Combination with False discovery rate Control), a method to test the association between a genetic variant and multiple phenotypes for each phenotypic category in phenome-wide association studies (PheWAS).

HCLC-FC clusters phenotypes within each phenotypic category, which reduces the degrees of freedom of the association tests and has the potential to increase statistical power. HCLC-FC has an asymptotic distribution which avoids the computational burden of simulation.





□ CONGAS: A Bayesian method to cluster single-cell RNA sequencing data using Copy Number Alterations

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac143/6550058

CONGAS jointly identifies clusters of single cells with subclonal copy number alterations, and differences in RNA expression.

CONGAS builds statistical priors leveraging bulk DNA sequencing data, does not require a normal reference and scales fast thanks to a GPU backend and variational inference.





□ OMAMO: orthology-based alternative model organism selection

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac163/6550503

The only unicellular organisms considered in these databases are fission and budding yeast, whilst abundance of unicellular species in nature and their unique features make it difficult to find other non-complex model organisms for a biological process of interest.

OMAMO (Orthologous Matrix and Alternative Model Organisms), a software and a web service that provide the user with the best non-complex organism for research into a biological process of interest based on orthologous relationships between human and the species.





□ DENVIS: scalable and high-throughput virtual screening using graph neural networks with atomic and surface protein pocket features

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484710v1.full.pdf

DENVIS, a purely machine learning-based, high-throughput, end-to-end-strategy for SBVS using GNNs for binding affinity prediction. DENVIS exhibits several orders of magnitude faster screening times (i.e., higher throughput) than both docking-based and hybrid models.

The atom-level model consists of a modified version of the graph isomorphism network (GIN). The surface-level approach utilises a mixture model network (MoNet), a specialised GNN with a convolution operation that respects the geometry of the input manifold.





□ Wochenende - modular and flexible alignment-based shotgun metagenome analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484377v1.full.pdf

Wochenende runs alignment of short reads (e.g., Illumina) or long reads (e.g., Oxford Nanopore) against a reference sequence. It is relevant for genomics and metagenomics. Wochenende is simple (a Python script), portable, and easy to configure with a central config file.

Wochenende has the ability to find and filter alignments to all kingdoms of life using both short and long reads with high sensitivity and specificity, and provides the user with multiple normalization techniques and configurable and transparent filtering steps.





□ GBScleanR: Robust genotyping error correction using hidden Markov model with error pattern recognition.

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484886v1.full.pdf

GBScleanR implements a novel HMM-based error correction algorithm. This algorithm estimates the allele read bias and mismap rate per marker and incorporates these into the HMM as parameters to capture the skewed probabilities in read acquisitions.

GBScleanR provides functions for data visualization, filtering, and loading/writing a VCF file. The algorithm of GBScleanR is based on the HMM and treats the observed allele read counts for each SNP marker along a chromosome as outputs from a sequence of latent true genotypes.





□ 3GOLD: optimized Levenshtein distance for clustering third-generation sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04637-7

3GOLD offers a novel way of determining error type and frequency by interpreting the unweighted SLD value and position on the matrix by comparing it to the unweighted LD value. 3GOLD combines the discriminatory benefits of weighted LD and the permissive benefits of SLD.

This approach is appropriate for datasets of unknown cluster centroids, such as those generated with unique molecular identifiers as well as known centroids such as barcoded datasets. It has high accuracy in resolving small clusters and mitigating the number of singletons.
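For reference, the unweighted Levenshtein distance (LD) that these variants build on can be computed with a short dynamic program. The sketch below also exposes substitution and indel costs to show what a weighted LD looks like; those parameter names and values are assumptions, and the sequence-Levenshtein distance (SLD) and 3GOLD's own combination of the two are not implemented here.

```python
def levenshtein(a, b, sub_cost=1, indel_cost=1):
    """Edit distance between strings a and b (row-by-row dynamic program)."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                         # delete ca
                curr[j - 1] + indel_cost,                     # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match/substitute
            ))
        prev = curr
    return prev[-1]


deletion = levenshtein("ACGT", "AGT")        # one deleted base
substitution = levenshtein("ACGT", "ACCT")   # one substituted base
```

With unit costs both error types score 1, so the plain distance cannot tell them apart; changing the costs is one simple way to make a clustering step treat indels and substitutions differently.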





□ The role of cell geometry and cell-cell communication in gradient sensing

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009552

Generalizing the existing mathematical models to investigate how short- and long-range cellular communication can increase gradient sensing in two-dimensional models of epithelial tissues.

With long-range communication, the gradient sensing ability improves for tissues with more disordered geometries; on the other hand, an ordered structure with mostly hexagonal cells is advantageous with nearest neighbour communication.





□ Crimp: fast and scalable cluster relabeling based on impurity minimization

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485309v1.full.pdf

CRIMP, a lightweight command-line tool, which offers a relatively fast and scalable heuristic to align clusters across multiple replicate clusterings consisting of the same number of clusters.

CRIMP rearranges a number of membership matrices of identical shape in order to minimize differences caused by label switching. The remaining differences should be attributable to either noise or truly different ways of clustering the data, referred to as ‘genuine multimodality’.
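Label switching means that two replicate clusterings can describe the same partition with permuted cluster columns. The sketch below realigns one membership matrix to a reference by trying every column permutation and keeping the one with the smallest total absolute difference. This exhaustive search is only an illustration for small numbers of clusters; it is not Crimp's impurity-based heuristic, and the function name is hypothetical.

```python
from itertools import permutations


def align_columns(reference, replicate):
    """Permute the columns of `replicate` to best match `reference`.

    Both arguments are lists of rows (one row per sample, one column
    per cluster) with identical shape.
    """
    k = len(reference[0])

    def cost(perm):
        return sum(abs(ref_row[j] - rep_row[perm[j]])
                   for ref_row, rep_row in zip(reference, replicate)
                   for j in range(k))

    best = min(permutations(range(k)), key=cost)
    return [[row[p] for p in best] for row in replicate]


reference = [[0.9, 0.1],
             [0.2, 0.8]]
replicate = [[0.1, 0.9],     # same structure, cluster labels swapped
             [0.7, 0.3]]
aligned = align_columns(reference, replicate)
```

After alignment only the small numeric differences in the second row remain, which is the residual one would attribute to noise or to a genuinely different solution.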





□ RabbitV: fast detection of viruses and microorganisms in sequencing data on multi-core architectures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac187/6554196

RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization, and fast data parsing.

RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively.
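The two stages mentioned here, generating k-mers unique to a target genome and then looking for them in reads, can be illustrated with plain Python sets. This is a toy sketch under assumptions: it ignores reverse complements, uses no multi-threading or vectorization, and the function names and sequences are hypothetical rather than taken from RabbitV or fastv.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(target, others, k):
    """k-mers present in the target genome but in none of the others."""
    shared = set()
    for genome in others:
        shared |= kmers(genome, k)
    return kmers(target, k) - shared


def hit_fraction(reads, uniq, k):
    """Fraction of the unique k-mers observed in at least one read."""
    seen = set()
    for read in reads:
        seen |= kmers(read, k) & uniq
    return len(seen) / len(uniq) if uniq else 0.0


uniq = unique_kmers("ACGTACGGA", ["ACGTACCTT"], k=4)
coverage = hit_fraction(["TACGG"], uniq, k=4)
```

A high fraction of unique k-mers recovered from the reads is evidence that the target organism is present, since those k-mers cannot come from the other reference genomes.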





□ q2-fondue: Reproducible acquisition, management, and meta-analysis of nucleotide sequence (meta)data

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485322v1.full.pdf

q2-fondue (Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere) to expedite the initial acquisition of data from the SRA, while offering complete provenance tracking.

q2-fondue simplifies retrieval of sequencing data and accompanying metadata in a validated and standardized format interoperable with the QIIME 2 ecosystem.





□ MASI: Fast model-free standardization and integration of single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2022.03.28.486110v1.full.pdf

MASI (Marker-Assisted Standardization and Integration) can run integrative annotation on a personal laptop for approximately one million cells, providing a cheap computational alternative for the single-cell data analysis community.

MASI will not be able to annotate cell types in query data that have not been seen in reference data.

However, it is still worth answering if a cell-type score matrix constructed using the reference data can preserve cell-type structure for query data, even though query data contains unseen cell types.





□ The Codon Statistics Database: a Database of Codon Usage Bias

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486291v1.full.pdf

The Codon Statistics Database is an online database that contains codon usage statistics for all the species with reference or representative genomes in RefSeq.

If a species is selected, the user is directed to a table that lists, for each codon, the encoded amino acid, the total count in the genome, the RSCU, and whether the codon is preferred or unpreferred.
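The relative synonymous codon usage (RSCU) of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below computes it from coding sequences using the standard genetic code; the function name and input format are hypothetical, and it makes no claim about how the database itself computes or thresholds its values.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences):
    """Return {codon: RSCU} for every amino acid observed in the input."""
    counts = Counter()
    for cds in cds_sequences:
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:      # skip codons with ambiguous bases
                counts[codon] += 1
    result = {}
    for aa in set(AMINO):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total:
            for c in family:
                # observed / (total / number of synonymous codons)
                result[c] = counts[c] * len(family) / total
    return result


values = rscu(["AAAAAAAAAAAG"])   # lysine: AAA x3, AAG x1
```

In this toy input AAA gets an RSCU of 1.5 and AAG gets 0.5; values above 1 indicate a codon used more often than expected under equal usage.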





□ Boquila: NGS read simulator to eliminate read nucleotide bias in sequence analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486262v1.full.pdf

Boquila generates sequences that mimic the nucleotide profile of true reads, which can be used to correct the nucleotide-based bias of genome-wide distribution of NGS reads.

Boquila can be configured to generate reads from only specified regions of the reference genome. It also allows the use of input DNA sequencing to correct the bias due to the copy number variations in the genome.





□ SprayNPray: user-friendly taxonomic profiling of genome and metagenome contigs

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08382-2

SprayNPray offers a quick and user-friendly, semi-automated approach, allowing users to separate contigs by taxonomy of interest. SprayNPray can be used for broad-level overviews, preliminary analyses, or as a supplement to other taxonomic classification or binning software.

SprayNPray profiles contigs using multiple metrics, including closest homologs from a user-specified reference database, gene density, read coverage, GC content, tetranucleotide frequency, and codon-usage bias.





□ LPMX: a pure rootless composable container system

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04649-3

LPMX accelerates science by letting researchers compose existing containers and containerize tools/pipelines that are difficult to package/containerize using Conda or Singularity, thereby saving researchers’ precious time.

LPMX can minimize the overhead of splitting a large pipeline into smaller containerized components or tools to avoid conflicts between the components.

A caveat is that compared to Singularity, the LPMX approach might put a larger burden on a central shared file system, so Singularity might scale better beyond a certain large number of nodes.





□ StORF-Reporter: Finding Genes between Genes

>> https://www.biorxiv.org/content/10.1101/2022.03.31.486628v1.full.pdf

StORF-Reporter, a tool that takes as input an annotated genome and returns missed CDS genes from the unannotated regions. Stop-ORFs (StORFs) are identified in these unannotated regions. StORFs are Open Reading Frames that are delimited by stop codons.

StORFs recover complete coding sequences (with or without similarity to known genes) that were missing from both canonical and novel genome annotations.
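
The stop-to-stop definition can be sketched as a scan for consecutive in-frame stop codons. This is a simplified illustration, not StORF-Reporter's implementation. It covers the forward strand only, and the name `find_storfs`, the coordinate convention and the `min_len` filter are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_storfs(seq: str, min_len: int = 30):
    """Return (start, end, frame) for stop-to-stop ORFs on the forward strand.

    start is the index of the first base after the upstream stop codon and
    end is the index just past the downstream stop codon (half-open interval).
    """
    seq = seq.upper()
    storfs = []
    for frame in range(3):
        prev_stop_end = None  # end index of the most recent in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if prev_stop_end is not None and i + 3 - prev_stop_end >= min_len:
                    storfs.append((prev_stop_end, i + 3, frame))
                prev_stop_end = i + 3
    return storfs
```

A real run would also scan the reverse complement and restrict the search to the unannotated regions.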





□ Prime-seq, efficient and powerful bulk RNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02660-8

Prime-seq, a bulk RNA-seq protocol, is as powerful and accurate as TruSeq in quantifying gene expression levels, but more sensitive and much more cost-efficient.

The prime-seq protocol is based on the SCRB-seq and the optimized derivative mcSCRB-seq. It uses the principles of poly(A) priming, template switching, early barcoding, and UMIs to generate 3′ tagged RNA-seq libraries.





□ Communication-Efficient Cluster Scalable Genomics Data Processing Using Apache Arrow Flight

>> https://www.biorxiv.org/content/10.1101/2022.04.01.486780v1.full.pdf

This solution has similar performance to MPI-based HPC solutions, with the added advantage of easy programmability and transparent big data scalability. It outperforms existing Apache Spark based solutions in terms of both computation time (2x) and communication overhead.

QUARTIC (QUick pArallel algoRithms for high-Throughput sequencIng data proCessing) is implemented using MPI. Though this implementation uses I/Os between pre-processing stages, it still performs better than other Apache Spark based frameworks.





□ epiAneufinder: identifying copy number variations from single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2022.04.03.485795v1.full.pdf

epiAneufinder, a novel algorithm that exploits the read count information from scATAC-seq data to extract genome-wide copy number variations (CNVs) for individual cells, making it possible to explore the CNV heterogeneity present in a sample at the single-cell level.

epiAneufinder extracts single-cell copy number variations from scATAC-seq data alone, or alternatively from single-cell multiome data, without the need to supplement the data with other data modalities.





□ BIODICA: a computational environment for Independent Component Analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac204/6564219

BIODICA, an integrated computational environment for application of Independent Component Analysis (ICA) to bulk and single-cell molecular profiles, interpretation of the results in terms of biological functions and correlation with metadata.

BIODICA automates deconvolution of large omics datasets with optimization of deconvolution parameters, and compares the results of deconvolution of independent datasets for distinguishing reproducible signals, universal and specific for a particular disease/data type or subtype.





□ acorde unravels functionally interpretable networks of isoform co-usage from single cell data

>> https://www.nature.com/articles/s41467-022-29497-w

acorde, a pipeline that successfully leverages bulk long reads and single-cell data to confidently detect alternative isoform co-expression relationships.

acorde uses a strategy to obtain noise-robust correlation estimates in scRNA-seq data, and a semi-automated clustering approach to detect modules of co-expressed isoforms across cell types.

Percentile-summarized Pearson correlations outperform both classic and single-cell specific correlation strategies, including proportionality methods that were recently proposed as one of the best alternatives to measure co-expression in single-cell data.
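
The idea behind a percentile-summarized correlation can be sketched as follows: each isoform's expression is first summarized by percentile cut points within each cell type, and Pearson correlation is then computed on the summaries rather than on noisy per-cell values. This is a minimal sketch under that reading. The function names, the use of deciles and the interpolation method are assumptions, not acorde's actual code.

```python
import statistics
from math import sqrt


def percentile_summary(expr, cell_types, n=10):
    """Concatenate per-cell-type percentile cut points of one isoform's expression."""
    summary = []
    for ct in sorted(set(cell_types)):
        values = [x for x, c in zip(expr, cell_types) if c == ct]
        # n=10 yields the 9 decile cut points; needs >= 2 cells per type.
        summary.extend(statistics.quantiles(values, n=n, method="inclusive"))
    return summary


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def percentile_correlation(expr_a, expr_b, cell_types, n=10):
    """Pearson correlation between two isoforms' percentile summaries."""
    return pearson(
        percentile_summary(expr_a, cell_types, n),
        percentile_summary(expr_b, cell_types, n),
    )
```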







KYїV.

2022-03-30 03:03:03 | Society & Economy

(Photo by irwiiiiiish)




□ Thomas Bergersen - Lament for our Children (Feat. Kate St. Pierre)




□ Oleksandra Matviichuk RT

>> https://twitter.com/avalaina/status/1506556139252633611?s=21

Russians dropped air bombs on the road bridge across the Desna which connected Chernihiv with Kyiv.

“Chernihiv has no electricity, water, heat and almost no gas, and all infrastructure has been destroyed. Medical institutions were also targeted by Russians”- the city mayor said




□ Hanna Liubakova RT

>> https://twitter.com/hannaliubakova/status/1505972610177421316?s=21

The head of the Ukrainian Railways Alexander Kamyshin confirmed that there is no railway connection between #Ukraine and #Belarus "thanks to Belarusian railway workers". They've indeed launched what they called "a railway war" with many acts of sabotage to stop Russian equipment




□ Lesia Vasylenko RT

>> https://twitter.com/lesiavasylenko/status/1506363180418838531?s=21

#Ukraine national anthem sung by children in #Kharkiv metro shelter





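Tsurunoyu

2022-03-29 03:51:02 | Hotels


□ Nyuto Onsen-kyo Tsurunoyu

>> http://www.tsurunoyu.com/

One of Japan's foremost secluded hot springs, with a history of 400 years since it opened.
It is also famous for its large, milky-white, mixed-gender open-air bath.
The lodgings retain an Edo-period feel; they are simple but full of atmosphere, and bustle with young couples staying the night.
Late at night, you can look up at the starry sky from the open-air bath, which is superb.
The imoni hot pot was delicious too •̥  ̫ •̥ ♡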














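Wabizakura

2022-03-28 02:36:49 | Hotels


Kakunodate Sanso Wabizakura

>> http://wabizakura.com/

A luxury ryokan deep in the mountains of Akita Prefecture, where every room has its own open-air bath fed directly from the hot spring source. The 97-square-meter Japanese-Western room allows in-room dining, and the hospitality is superb. After the silky, highly alkaline spring water (pH 9.5) has warmed your whole body, the pleasure of cooling off on a deck terrace facing the darkness of the forest late at night is beyond words… ֊ ̫ ֊♡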











Jóhann Jóhannsson / “DRONE MASS”

2022-03-18 22:42:42 | art music

□ Jóhann Jóhannsson / “DRONE MASS”

>> https://www.deutschegrammophon.com/en/catalogue/products/drone-mass-johannsson-12620

Track List

One Is True
Two Is Apocryphal
Triptych In Mass
To Fold & Remain Dormant
Divine Objects
The Low Drone Of Circulating Blood, Diminishes With Time
Moral Vacuums
Take The Night Air
The Mountain View, The Majesty Of The Snow-Clad Peaks, From A Place Of Contemplation And Reflection


Release Date: 18/03/2022
Jóhann Jóhannsson · Theatre of Voices · Paul Hillier · American Contemporary Music Ensemble


Drone Mass, a contemporary oratorio based on “Coptic Gospel of the Egyptians”. Both the enigmatic nature of these Gnostic writings and the sheer beauty of the vocalise-style writing add to the spiritual quality of the work as a whole.

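An "oratorio" by a supreme contemporary composer. It is based on New Testament apocrypha written in Coptic and discovered in Egypt (the Nag Hammadi codices). Vast, hazy ambient noise and mystical polyphonic choral singing perform a "Gnostic sacrament".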



□ Jóhannsson: Triptych in Mass



□ Jóhannsson: Take the Night Air






Enchanted.

2022-03-18 22:41:35 | delerium


Delerium / “Enchanted”


From the album “Karma”


Featured Vocalist: Delerium
Producer: Delerium
Engineer: Greg Reely
Mixer, Recorder: Greg Reely
Performance: Kristy Thirsk
Mastering Engineer: Ted Jensen
Masterer: Ted Jensen
Composer: Bill Leeb
Writer: Kristy Lee Thirsk
Composer: Kristy Thirsk
Composer: Rhys Fulber
Writer: Rhys Nowell Fulber
Writer: Wilhelm Anton Leeb




Ark.

2022-03-03 03:03:03 | Science News

(“Supercube” owned by Pak)




□ SVDSS: Improved structural variant discovery in hard-to-call regions using sample-specific string detection from accurate long reads

>> https://www.biorxiv.org/content/10.1101/2022.02.12.480198v1.full.pdf

SVDSS is a novel method for discovery of structural variants in accurate long reads using sample-specific strings (SFS). SVDSS utilizes SFS for coarse-grained identification (anchoring) of potential SV sites and performs local partial-order-assembly (POA) of clusters of SFS.

SVDSS combines the advantages of mapping-based, mapping-free, and assembly-based approaches for predicting SVs. The SFS assembly procedure effectively merges all the SFS belonging to the same variant into a single long superstring.





□ Odysseia: Genetic Regulatory Feature Analysis with Interpretable Classification Machine Learning Models

>> https://www.biorxiv.org/content/10.1101/2022.02.17.480852v1.full.pdf

Odysseia, a single-cell gene expression profile (scGEP) analysis system based on interpretable machine learning classifiers, assesses the importance of genetic regulatory features (GFs) in differentiating cell states (CSs).

Odysseia does not require any background expression database; it searches for potential key GFs in converting one CS to another, with only expression profiles labeled with binary CS categories as input.

Odysseia enhances the feature extraction capability. It segments scGEPs under the same CS category into subsets of constant size to generate pseudo-cGEPs.





□ scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac141/6544581

scGate purifies a cell population of interest using a set of markers organized in a hierarchical structure, akin to gating strategies employed in flow cytometry. scGate outperforms state-of-the-art single-cell classifiers and it can be applied to multiple modalities of single-cell data.

scGate evaluates the strength of signature marker expression in each cell using the rank-based method UCell, and then performs k-nearest neighbor (kNN) smoothing by calculating the mean UCell score across neighboring cells.
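
The kNN smoothing step can be pictured with a small sketch: each cell's signature score is replaced by the mean over its nearest neighbors. This is illustrative only and not scGate's code. Brute-force Euclidean neighbors, including the cell itself in the average, and the name `knn_smooth` are all assumptions.

```python
from math import dist


def knn_smooth(scores, coords, k=2):
    """Replace each cell's score by the mean over itself and its k nearest neighbors.

    scores: one signature score per cell (e.g. a UCell-style score).
    coords: one coordinate tuple per cell (e.g. a low-dimensional embedding).
    """
    smoothed = []
    for i, p in enumerate(coords):
        # Rank the other cells by Euclidean distance to cell i and keep k of them.
        neighbors = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: dist(p, coords[j]),
        )[:k]
        group = [scores[i]] + [scores[j] for j in neighbors]
        smoothed.append(sum(group) / len(group))
    return smoothed
```

Smoothing of this kind trades single-cell resolution for robustness against dropout in the marker genes.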





□ LJA: Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads

>> https://www.nature.com/articles/s41587-022-01220-6

La Jolla Assembler (LJA), a fast algorithm using the Bloom filter, sparse de Bruijn graphs and disjointig generation. LJA reduces the error rate in HiFi reads, constructs the de Bruijn graph for large genomes / large k-mer sizes and transforms it into a multiplex de Bruijn graph.

La Jolla Assembler (LJA) includes three modules addressing all three challenges in assembling long and accurate reads: jumboDBG (constructing large de Bruijn graphs), mowerDBG (error-correcting reads), and multiplexDBG (utilizing the entire read-length for resolving repeats).
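
For readers unfamiliar with the underlying structure, here is a toy construction of a plain de Bruijn graph from reads. It is only a sketch of the basic data structure. It does not reproduce jumboDBG's Bloom-filter and sparse-graph techniques or the multiplex transformation, and the function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```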





□ ESNN: Uncertainty Quantification in Variable Selection for Genetic Fine-Mapping using Bayesian Neural Networks

>> https://www.biorxiv.org/content/10.1101/2022.02.23.481675v1.full.pdf

Ensemble of Single-effect Neural Networks (ESNN) generalizes the “sum of single-effects” regression framework by both accounting for nonlinear structure in genotypic data (e.g., dominance effects) and having the capability to model discrete phenotypes.

ESNN provides posterior inclusion probabilities and credible sets. ESNN uses an iterative Bayesian stepwise selection (IBSS) procedure where it trains L models by first fitting one model with a coordinate ascent algorithm and then regressing out that model to compute residuals.





□ High-dimension to high-dimension screening for detecting genome-wide epigenetic regulators of gene expression

>> https://www.biorxiv.org/content/10.1101/2022.02.21.481160v1.full.pdf

A novel screening method based on robust partial correlation to detect epigenetic regulators of gene expression over the whole genome, a problem that includes both high-dimensional predictors and high-dimensional responses.

Data-driven procedures are developed to determine the conditional set and the optimal screening threshold, and an iterative algorithm is implemented that is computationally feasible with hundreds of thousands of predictors and responses.

This method is conceptually innovative in that it can reduce the dimension of both predictor and response, and screens out both irrelevant nodes and edges. The tail-robustified partial correlation is used to protect against non-normality and heavy-tailed distributions.





□ scDVF: Data-driven Single-cell Transcriptomic Deep Velocity Field Learning with Neural Ordinary Differential Equations

>> https://www.biorxiv.org/content/10.1101/2022.02.15.480564v1.full.pdf

The scDVF framework allows hypothetical cells to evolve according to the dynamics learned from existing cells in the data, which makes it possible to simulate future gene expression trajectories.

scDVF uses a new metric called the CCI, analogous to the “kinetic energy” of Waddington landscapes. Single-cell dynamical systems may exhibit properties similar to chaotic systems. scDVF learns the variance of the velocity vectors.





□ sccomp: Robust differential composition and variability analysis for multisample cell omics

>> https://www.biorxiv.org/content/10.1101/2022.03.04.482758v1.full.pdf

sccomp, a generalised method for differential composition and variability analyses able to jointly model data count distribution, compositionality, group-specific variability and proportion mean-variability association, with awareness against outliers.

sccomp allows realistic data simulation and cross-study knowledge transfer. Mean-variability association is ubiquitous across technologies, showing the inadequacy of Dirichlet-multinomial modelling and providing mandatory principles for differential variability analysis.





□ BWA-MEME: BWA-MEM emulated with a machine learning approach

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac137/6543607

BWA-MEME, the first full-fledged short read alignment software that leverages learned indices for solving the exact match search problem for efficient seeding.

BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase.

BWA-MEME achieves up to 3.45x speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x, memory accesses by 8.77x, and LLC misses by 2.21x, while ensuring the identical SAM output to BWA-MEM2.





□ nPoRe: n-Polymer Realigner for improved pileup variant calling

>> https://www.biorxiv.org/content/10.1101/2022.02.15.480561v1.full.pdf

nPoRe uses a read realignment algorithm that starts from the initial mapping of each read. Each read and its corresponding section of the reference genome are realigned, and a new traceback (alignment path) is computed.

Read phasing and realignment can recover a significant portion of INDELs lost during this stage. nPoRe defines an n-polymer to consist of at least 3 exact repeats of the same repeated sequence, where the repeat unit is of length 1 to 6 bases.

The worst-case time complexity for computing the reference annotations is O(|R| · n_max^2 · l_max). Since the n-polymer score matrix has a fixed size of (6, 100, 100), the time complexity reduces to O(|R|), so the time required for reference annotation is insignificant. The annotations require O(|R| · n_max) space.
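
The n-polymer definition (at least 3 exact repeats of a unit of 1 to 6 bases) can be sketched as a simple tandem-repeat scan. This is an illustration of the definition only, not nPoRe's annotation code. The function name and the choice to skip units that are themselves repeats of a shorter unit are assumptions.

```python
def find_n_polymers(seq, max_unit=6, min_repeats=3):
    """Return (start, unit, repeats) for exact tandem repeats of short units."""
    hits = []
    for n in range(1, max_unit + 1):
        i = 0
        while i + n * min_repeats <= len(seq):
            unit = seq[i:i + n]
            # Skip non-primitive units such as "AA", which the n=1 pass already covers.
            if any(unit == unit[:d] * (n // d) for d in range(1, n) if n % d == 0):
                i += 1
                continue
            repeats = 1
            while seq[i + repeats * n:i + (repeats + 1) * n] == unit:
                repeats += 1
            if repeats >= min_repeats:
                hits.append((i, unit, repeats))
                i += repeats * n  # jump past this run
            else:
                i += 1
    return hits
```

Because the unit length is bounded by a constant, the scan stays linear in the sequence length up to a constant factor, in line with the O(|R|) remark above.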





□ Disentanglement of Entropy and Coevolution using Spectral Regularization

>> https://www.biorxiv.org/content/10.1101/2022.03.04.483009v1.full.pdf

The authors investigate the origins of the entropy signal and use a spectral regularizer that penalizes the largest eigenmode of the pairwise parameters of the Markov random field (MRF) during training.

GREMLIN, a Markov Random Field or Potts model, allows for the inference of a sparse contact map without loss in precision, meanwhile improving interpretability, and resolving overfitting issues important for sequence evaluation and design.





□ Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis

>> https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-022-01181-4

Feature selection of multi-omics data analysis remains challenging owing to the size of omics datasets, comprising 10^2-10^5 features. Appropriate methods to weight individual omics datasets are unclear, and the approach adopted has substantial consequences for feature selection.

Extending the kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE) method to integrate multi-omics datasets obtained from common samples in a weight-free manner.





□ scDSC: Deep structural clustering for single-cell RNA-seq data jointly through autoencoder and graph neural network

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac018/6529282

Previous studies have shown that the distribution of UMI counts is not zero-inflated, and that the NB distribution is suitable for UMI-based data. It is necessary to explore the characteristics of data obtained by different scRNA-seq technologies and assume a suitable data distribution.

scDSC formulates and aggregates cell-cell relationships with graph neural networks and learns embedded gene expression patterns. scDSC is mainly composed of ZINB model-based autoencoder module (ZAE), GNN module and multiple Mutual Supervision Module.





□ Parametrised Presentability over Orbital Categories

>> https://arxiv.org/pdf/2202.02594v1.pdf

The notion of presentability in the parametrised homotopy theory framework over orbital categories. Such a theory is of interest, for example, in equivariant homotopy theory, and is used to construct the category of parametrised noncommutative motives for equivariant algebraic K-theory.

Translating the theory of presentable ∞-categories to the parametrised setting and understanding the relationship between the notion of parametrised presentability and its unparametrised analogue. The paper also gives a complete parametrised analogue of presentable ∞-categories.





□ A Semantic Hierarchy for Intuitionistic Logic

>> https://escholarship.org/uc/item/2vp2x4rx

Nuclear semantics has one foot in the world of posets and another foot in the world of algebras. It is therefore natural to ask whether the nucleus in a nuclear frame can be replaced by some more concrete data.

Any complete Heyting algebra can be realized as an algebra of fixpoints arising from a nuclear frame. The Kripke-style semantics is as general as Dragalin semantics and hence algebraic semantics based on complete Heyting algebras.





□ MAPLE: A Hybrid Framework for Multi-Sample Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2022.02.28.482296v1.full.pdf

MAPLE: a hybrid deep learning and Bayesian modeling framework for detection of spatially informed cell sub-populations, uncertainty quantification, and inference of group effects in multi-sample HST experiments.

MAPLE is designed to be used within standard Seurat workflows, and the user may specify to use principal components (PCs), highly variable genes (HVGs), spatially variable genes (SVGs), or custom cell/cell-spot embeddings such as those generated by RESEPT.

MAPLE accompanies cell sub-population labels with uncertainty measures defined in terms of posterior probabilities from the Bayesian finite mixture model, which can be used to characterize ambiguous cell sub-population boundaries and discern between high and low confidence assignments.




□ CeSpGRN: Inferring cell-specific gene regulatory networks from single cell gene expression data

>> https://www.biorxiv.org/content/10.1101/2022.03.03.482887v1.full.pdf

CeSpGRN uses a Gaussian weighted kernel which allows the GRN of a given cell to be learned from the gene expression profile of this cell and cells that are upstream and downstream of this cell in the developmental process.

CeSpGRN is not limited to gene expression data which are binary or Gaussian-distributed; and through the use of the high-dimensional weighted kernel, CeSpGRN can infer one GRN for each cell in datasets where cells can form any trajectory or cluster structures.
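
The role of a Gaussian kernel here is to make nearby cells contribute more than distant ones when estimating one cell's network. The sketch below shows only that weighting idea. The function name, the `bandwidth` parameter and the use of plain Euclidean distance between expression profiles are assumptions, not CeSpGRN's actual kernel.

```python
from math import dist, exp


def gaussian_kernel_weights(cells, i, bandwidth=1.0):
    """Normalized Gaussian weights of all cells relative to cell i.

    cells: one expression (or embedding) vector per cell.
    Returns weights that sum to 1, with the largest weight on the closest cells.
    """
    sq_dists = [dist(cells[i], c) ** 2 for c in cells]
    weights = [exp(-d / (2 * bandwidth ** 2)) for d in sq_dists]
    total = sum(weights)  # never zero: cell i itself has weight exp(0) = 1
    return [w / total for w in weights]
```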





□ Deep Learning in Spatial Transcriptomics: A Survey of Deep Learning Methods for Spatially-Resolved Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.02.28.482392v1.full.pdf

DestVI employs a conditional deep generative model. DestVI defines two latent variable models (LVMs) for each data modality: an LVM for modeling scRNAseq data and one that aims to model the ST data.

HMRF, a Hidden-Markov Random Field models the spatial dependency of GE using both the sequencing and imaging-based transcriptomic technologies. BayesSpace employs a Bayesian formulation of HMRF, and uses the Markov chain Monte Carlo algorithm to estimate the model parameters.





□ Integrating temporal single-cell gene expression modalities for trajectory inference and disease prediction

>> https://www.biorxiv.org/content/10.1101/2022.03.01.482381v1.full.pdf

This is the first task-oriented benchmarking study that investigates integration of temporal sequencing modalities for dynamic cell state prediction.

Motivated by identifying a new more biologically-meaningful set of features underlying cellular dynamics, they investigate integration of gene expression modalities at three distinct temporal stages of gene regulation: unspliced, spliced, and RNA velocity.





□ StabMap: Mosaic single cell data integration using non-overlapping features

>> https://www.biorxiv.org/content/10.1101/2022.02.24.481823v1.full.pdf

Data integration aims to place cells, captured with different techniques, onto a common embedding to facilitate downstream analytics. Current horizontal data integration techniques use a set of common features, thereby ignoring non-overlapping features and losing information.

StabMap embeds single cell data from multiple technology sources into the same low dimensional coordinate space. StabMap infers a mosaic data topology, then projects all cells onto supervised or unsupervised reference coordinates by traversing shortest paths along the topology.
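
The mosaic data topology can be pictured as a graph whose nodes are datasets, with an edge wherever two datasets share features. A dataset with no features in common with the reference is reached through intermediate datasets. The sketch below shows only this graph traversal. The function names and the rule "share at least one feature" are assumptions, and the projection step itself is not reproduced.

```python
from collections import deque


def mosaic_topology(features):
    """Connect two datasets whenever they share at least one feature.

    features maps a dataset name to its set of feature names.
    """
    names = list(features)
    return {
        a: [b for b in names if b != a and features[a] & features[b]]
        for a in names
    }


def shortest_path(topology, source, target):
    """Breadth-first search for a shortest chain of datasets from source to target."""
    queue, parent = deque([source]), {source: None}
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in topology[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # the datasets are not connected through shared features
```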





□ mm2-fast: Accelerating minimap2 for long-read sequencing applications on modern CPUs

>> https://www.nature.com/articles/s43588-022-00201-8

Multiple optimizations, using SIMD parallelization and a learned index data structure, accelerate the three main computational modules of minimap2: seeding, chaining and pairwise sequence alignment. These optimizations result in an up to 1.8-fold reduction of end-to-end mapping time.

Acceleration of the anchor chaining step was achieved by designing a SIMD-parallel co-linear chaining algorithm which uses vector processing units. All the modules are optimized using AVX-512 and AVX2 vectorization.





□ Dictionary learning for integrative, multimodal, and scalable single-cell analysis

>> https://www.biorxiv.org/content/10.1101/2022.02.24.481684v1.full.pdf

Demonstrating how dictionary learning can be combined with sketching techniques to substantially improve computational scalability, and harmonize 8.6 million human immune cell profiles from sequencing and mass cytometry experiments.

Atomic sketch integration maps the scATAC-seq dataset onto the Azimuth reference, computes the graph Laplacian for the multi-omic dataset, and calculates an eigendecomposition, thereby reducing the dimensionality from the number of atoms to the number of selected eigenvectors.





□ UPP2: Fast and Accurate Alignment Estimation of Datasets with Fragmentary Sequences

>> https://www.biorxiv.org/content/10.1101/2022.02.26.482099v1.full.pdf

UPP (Ultra-large multiple sequence alignment using Phylogeny-aware Profiles) builds eHMM: an ensemble of Hidden Markov Models to represent an estimated alignment on the full length sequences, and adds the remaining sequences into the alignment using selected HMMs in the ensemble.

UPP2, a direct improvement on UPP. On several highly fragmentary model conditions, UPP2 was statistically significantly more accurate than MAGUS.





□ ACTIVA: realistic single-cell RNA-seq generation with automatic cell-type identification using introspective variational autoencoders

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac095/6531957

ACTIVA (Automated Cell-Type-informed Introspective Variational Autoencoder): a novel framework for generating realistic synthetic data using a single-stream adversarial variational autoencoder conditioned with cell-type information.

ACTIVA generates cells that are more realistic and harder for classifiers to identify as synthetic, and that have better pairwise correlation between genes. ACTIVA can generate specific subpopulations on demand, as opposed to two separate models such as scGAN and cscGAN.





□ DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes

>> https://www.biorxiv.org/content/10.1101/2022.02.17.480870v1.full.pdf

DeepMinimizer framework employs a twin network architecture. PriorityNet generates valid minimizers, but has no guarantee on density. In contrast, TemplateNet generates low-density templates that might not correspond to valid minimizers.

Coupling these networks leads to a fully differentiable proxy objective that can effectively leverage gradient-based learning techniques. The solution space of the re-parameterization is only restricted by the modelling capacity encoded by the architecture weight space.
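
To make "valid minimizer" and "density" concrete, here is a small sketch of a classic (w, k) minimizer scheme driven by a k-mer priority function, which is the object that a learned priority would plug into. It is not DeepMinimizer's code. The function names and the lexicographic default priority are assumptions.

```python
def minimizer_positions(seq, k, w, priority=None):
    """Positions selected by a (w, k) minimizer scheme under a k-mer priority."""
    priority = priority or (lambda kmer: kmer)  # assumed default: lexicographic
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # The k-mer with the smallest priority wins; ties go to the leftmost.
        selected.add(min(window, key=lambda i: (priority(seq[i:i + k]), i)))
    return sorted(selected)


def density(seq, k, w, priority=None):
    """Fraction of k-mer positions that are selected as minimizers."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, priority)) / n_kmers
```

Any priority function yields a valid scheme, since every window always selects one k-mer. What varies with the priority is the density, which is why the ordering is what gets optimized for a specific sequence.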





□ RNA velocity unraveled

>> https://www.biorxiv.org/content/10.1101/2022.02.12.480214v1.full.pdf

An assessment of the impact of hyper-parameterized, heuristic data pre-processing and visualization in current RNA velocity workflows is useful for developing more reliable analyses.

The count processing and inference steps, which comprise the model estimation procedure, serve to identify parameters for a transcription model under some fairly strong assumptions, such as constitutive production and approximately Gaussian noise.

The literature contains numerous assertions that a meaningful Markovian transition probability matrix can be defined on observed cell states. However, the constructed Markov chains have not been demonstrated to possess any particular relationship to an actual biological process.





□ scISR: A novel method for single-cell data imputation using subspace regression

>> https://www.nature.com/articles/s41598-022-06500-4

scISR (single-cell Imputation via Subspace Regression) identifies the true dropout values using a hypergeometric testing approach. Based on the result of the hypergeometric testing, the original dataset is split into two parts: training data and imputable data.

scISR determines zero-valued entries that are most likely affected by dropout events and then estimates the dropout values using a subspace regression model. The underlying hypothesis is that dropout events happen randomly for a gene affected by this phenomenon.





□ GraphMB: Metagenomic binning with assembly graph embeddings

>> https://www.biorxiv.org/content/10.1101/2022.02.25.481923v1.full.pdf

GraphMB, a binner developed for long-read metagenomic data, incorporates the assembly graph into the contig feature learning process, taking full advantage of its potential by training a neural network to give more importance to higher-coverage edges.

GraphMB requires an assembly consisting of a set of contig sequences in FASTA format and an assembly graph in GFA format.

The authors intend to adapt Graph Attention Networks to deal with more complex graphs. This type of algorithm learns an attention mechanism to decide which neighbors of a node should have more weight when computing its embedding.


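# Fragment of a model's layer-wise inference method (needs `import torch` and `import dgl`).
for il, layer in enumerate(self.layers):
    # Hidden size for intermediate layers, number of classes for the final layer.
    is_last = il == len(self.layers) - 1
    y = torch.zeros(g.num_nodes(), self.n_classes if is_last else self.n_hidden)

    # One-hop sampler that keeps every neighbor of each node.
    sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
    dataloader = dgl.dataloading.NodeDataLoader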





□ ICI-Kt: Information-Content-Informed Kendall-tau Correlation: Utilizing Missing Values

>> https://www.biorxiv.org/content/10.1101/2022.02.24.481854v1.full.pdf

ICI-Kt, an information-content-informed Kendall-tau correlation coefficient that allows missing values to carry explicit information in the determination of concordant and discordant pairs.

ICI-Kt allows for the inclusion of missing data values as interpretable information. Moreover, the implementation of ICI-Kt uses a mergesort-like algorithm that provides O(n log(n)) computational performance.
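
One way to let missing values take part in pair counting is to treat them as ranking below every observed value, for instance as falling under a detection limit. The sketch below makes that simplifying assumption and is not the ICI-Kt implementation. It computes a naive O(n^2) Kendall tau-a rather than using the mergesort-like algorithm, and the function name is hypothetical.

```python
def kendall_tau_with_missing(x, y):
    """Naive O(n^2) Kendall tau-a in which None ranks below every observed value."""

    def rank_key(v):
        # (0, 0) for a missing value sorts before (1, value) for any observed value.
        return (0, 0) if v is None else (1, v)

    def sign(a, b):
        ka, kb = rank_key(a), rank_key(b)
        return (ka > kb) - (ka < kb)

    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = sign(x[i], x[j]) * sign(y[i], y[j])
            concordant += s > 0
            discordant += s < 0
    pairs = n * (n - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0
```

Pairs in which both samples are missing for the same variable count as ties here and contribute to neither total.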





□ HyperChIP: identification of hypervariable signals across ChIP-seq or ATAC-seq samples

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02627-9

HyperChIP uses scaled variances that account for the mean-variance dependence to rank genomic regions, and it increases the statistical power by diminishing the influence of true hypervariable regions on model fitting.

Given a matrix of normalized signal intensities, HyperChIP accounts for the associated mean-variability relationship by applying a gamma family regression method to observed mean-variance pairs.





□ abc4pwm: affinity based clustering for position weight matrices in applications of DNA sequence analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04615-z

Affinity Based Clustering for Position Weight Matrices (abc4pwm) efficiently clustered PWMs from multiple sources with or without using DNA-Binding Domain (DBD) information, generated a representative motif for each cluster, and evaluated the clustering quality automatically.

Abc4pwm has functions for visualization of PWMs clusters, and for searching a given PWM against known PWMs by reporting the top matched ones. It also has format conversion function for conversion between various formats e.g., TRANSFAC, JASPAR, and BayesPI.





□ STRIDE: accurately decomposing and integrating spatial transcriptomics using single-cell RNA sequencing

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac150/6543547

Spatial TRanscrIptomics DEconvolution by topic modeling (STRIDE), is a computational method to decompose cell types from spatial mixtures by leveraging topic profiles trained from single-cell transcriptomics.

Besides the cell-type composition deconvolution, STRIDE provides several downstream analysis functions, incl. signature detection, spatial clustering and domain identification based on neighborhood cell populations and reconstruction of three-dimensional architecture.





Cubiculum.

2022-03-03 03:01:03 | Science News
(designed by Pak)






□ TraSig: inferring cell-cell interactions from pseudotime ordering of scRNA-Seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02629-7

TraSig (Trajectory-based Signalling genes inference) takes the pseudo-time ordering for each group and the expression of genes along the trajectory as input and then outputs an interaction score and p-value for each possible ligand-receptor pair.

TraSig uses the Continuous-State Hidden Markov Model (CSHMM), which learns a generative model on the expression data using transition states and emission probabilities. CSHMM assumes a tree structure for the trajectory and assigns cells to specific locations on its edges.





□ The Inferelator 3.0: High performance single-cell gene regulatory network inference at scale

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac117/6533443

The Inferelator 3.0 pipeline for single-cell GRN inference, based on regularized regression. This pipeline calculates TF activity using a prior knowledge network and regresses scRNAseq expression data against that activity estimate to learn new regulatory edges.

The Inferelator 3.0 uses TF motif position-weight matrices to score TF binding within gene regulatory regions and build sparse prior networks. It is able to distribute work across multiple computational nodes, allowing networks to be rapidly learned from over 10^5 cells.





□ Flow-GTED: The Effect of Genome Graph Expressiveness on the Discrepancy Between Genome Graph Distance and String Set Distance

>> https://www.biorxiv.org/content/10.1101/2022.02.18.481102v1.full.pdf

The authors extend a genome graph distance metric, Graph Traversal Edit Distance (GTED), to FGTED to model the distance between heterogeneous string sets, and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets.

FGTED always produces a distance that is larger than or equal to GTED and less than or equal to the EMED between the true sets of strings.

Define the collection of strings that can be represented by the genome graph as its string set universe, and genome graph expressiveness as the diameter of its string set universe (SUD), which is the maximum EMED between two string sets that can be represented by the graph.

Flow-GTED denotes the distance computed using the alignment graph after removing all infinity cost edges that forbid aligning the sink with any nodes other than the source node.





□ Tensor decomposition- and principal component analysis-based unsupervised feature extraction to select more reasonable differentially expressed genes: Optimization of standard deviation versus state-of-art methods

>> https://www.biorxiv.org/content/10.1101/2022.02.18.481115v1.full.pdf

Optimizing the standard deviation such that the histogram of P-values is as much as possible coincident with the null hypothesis results in an increase in the number and biological reliability of the selected genes.

One striking feature is that DEGs with lower expression are less likely to be recognized, even with the same LFC, if the genes are selected by TD- and PCA-based unsupervised FE with optimized SD.





□ seqgra: Principled Selection of Neural Network Architectures for Genomics Prediction Tasks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac101/6534325

seqgra, a deep learning pipeline that incorporates the rule-based simulation of biological sequence data and the training and evaluation of models, whose decision boundaries mirror the rules from the simulation process.

seqgra creates models based on a precise description of their architecture, loss, optimizer, and training process, and evaluates the trained models using conventional test set metrics as well as an array of feature attribution methods.





□ TrieDedup: A fast trie-based deduplication algorithm to handle ambiguous bases in high-throughput sequencing

>> https://www.biorxiv.org/content/10.1101/2022.02.20.481170v1.full.pdf

Suppose there are n input sequences, and each sequence has m bases. For the preprocessing steps, the time complexity of counting 'N's is O(m×n), and sorting n sequences can be O(n×log(n)) for quick sort, or O(n) for bucket sort.

TrieDedup uses a trie (prefix tree) structure to compare and store sequences. It can handle the ambiguous base 'N' and efficiently deduplicates at the level of raw sequences.
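
The trie idea can be illustrated with a minimal Python sketch. This is not TrieDedup's own code: the function names are hypothetical, all sequences are assumed to have the same length, and processing sequences with the fewest 'N's first is an assumption inferred from the preprocessing steps mentioned above. An 'N' is treated as compatible with any base.

```python
from typing import Dict, List


class TrieNode:
    """One node of the prefix tree; children are keyed by base (A, C, G, T or N)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}


def _matches(node: TrieNode, seq: str, i: int) -> bool:
    """Return True if seq[i:] is compatible with some stored path below node."""
    if i == len(seq):
        # Equal-length assumption: reaching the end of the query is a full match.
        return True
    base = seq[i]
    for stored, child in node.children.items():
        # An 'N' on either side is compatible with any base.
        if base == "N" or stored == "N" or stored == base:
            if _matches(child, seq, i + 1):
                return True
    return False


def _insert(root: TrieNode, seq: str) -> None:
    """Store seq in the trie, creating nodes as needed."""
    node = root
    for base in seq:
        node = node.children.setdefault(base, TrieNode())


def deduplicate(seqs: List[str]) -> List[str]:
    """Keep one representative per group of mutually compatible sequences."""
    root = TrieNode()
    unique: List[str] = []
    # Assumption: sequences with fewer 'N's are processed (and kept) first.
    for seq in sorted(seqs, key=lambda s: s.count("N")):
        if not _matches(root, seq, 0):
            _insert(root, seq)
            unique.append(seq)
    return unique
```

For example, `deduplicate(["ACGT", "ACNT", "TTTT", "NTTT"])` keeps only the two sequences without 'N', because each ambiguous read is compatible with one of them.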





□ SCRIP: Single-cell Gene Regulation Network Inference by Large-scale Data Integration

>> https://www.biorxiv.org/content/10.1101/2022.02.19.481131v1.full.pdf

SCRIP, an integrative method to infer single-cell TR activities and targets based on the integration of scATAC-seq and public bulk ChIP-seq datasets.

SCRIP takes the scATAC-seq peak count matrix or bin count matrix as input. SCRIP allows identifying the targets of different TRs in diverse cell types and constructing GRNs of multiple TRs in the same cell.





□ GMAT: An Improved Linear Mixed Model for Multivariate Genome-Wide Association Studies

>> https://www.biorxiv.org/content/10.1101/2022.02.21.481252v1.full.pdf

GMAT can handle incomplete multivariate data with missing records and reduce the time complexity to O(n) per SNP.

GMAT has increased the statistical power with a proper control of false positivity for association studies compared to the conventional linear mixed model (LMM) that removes individuals with incomplete records.





□ Distance correlation application to gene co-expression network analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04609-x

The study compares distance correlation, a correlation metric integrating both linear and non-linear dependence, with three other typical metrics (Pearson’s correlation, Spearman’s correlation, and maximal information coefficient) on four different array and RNA-seq datasets.

The authors incorporated distance correlation into WGCNA to construct a distance correlation-based WGCNA (DC-WGCNA) algorithm for gene co-expression analysis.

In DC-WGCNA, the correlation coefficients between gene expression profiles are calculated by distance correlation; the rest of the procedure is identical to traditional WGCNA.
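
For readers unfamiliar with the metric, the sample distance correlation can be sketched in a few lines of NumPy. This is the standard textbook estimator based on double-centered distance matrices, not code from DC-WGCNA; the function name is hypothetical.

```python
import numpy as np


def distance_correlation(x, y) -> float:
    """Sample distance correlation between two samples of equal size."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must contain the same number of observations")

    # Pairwise Euclidean distance matrices.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)

    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2 = (A * B).mean()    # squared distance covariance
    dvar2_x = (A * A).mean()  # squared distance variance of x
    dvar2_y = (B * B).mean()  # squared distance variance of y

    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))
```

With `x = np.linspace(-1, 1, 101)` and `y = x ** 2`, Pearson's correlation is essentially zero while the distance correlation is clearly positive, which is the kind of non-linear dependence the metric is meant to capture.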





□ POIBM: Batch correction of heterogeneous RNA-seq datasets through latent sample matching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac124/6535232

POIsson Batch correction through sample Matching (POIBM) is based on the idea of inferring virtual reference samples from the data. Consequently, special experimental designs or design factors are not required, since POIBM automatically learns these from the data.

POIBM utilizes only two expression matrices of read counts, a target matrix and a source matrix. POIBM is designed to be optimal for RNA-seq count data, similar to ComBat-seq, which has been shown to outperform the Gaussian alternatives on RNA-seq data.





□ Regulatory network-based imputation of dropouts in single-cell RNA sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009849

The simple explanation is that the average over all cells is a much more robust estimator of the true mean than an average over only a small set of similar cells, especially when the gene was detected in only a few cells and/or the gene’s expression does not vary much across cells.

This imputes missing states of genes in cases where the respective gene was not detected in any cell or in only extremely few cells. This approach rests on the assumption that the network describes the true regulatory relationships in the cells at hand with sufficient accuracy.





□ SEQUIN: rapid and reproducible analysis of RNA-seq data in R/Shiny

>> https://www.biorxiv.org/content/10.1101/2022.02.23.481646v1.full.pdf

SEQUIN is guided by the NIH principles of scientific data management (findability, accessibility, interoperability, reusability). SEQUIN is an R/Shiny app for real-time analysis and visualization of bulk and scRNA-seq raw count and metadata.

SEQUIN empowers users with different backgrounds to perform customizable analysis of bulk and single-cell RNA-seq in real-time and in one location.





□ BamToCov: an efficient toolkit for sequence coverage calculations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac125/6535233

BamToCov performs coverage calculations using an optimized implementation of the algorithm of Covtobed with new features to support interval targets, new output formats, coverage statistics and multiple BAM files, while retaining the ability to read input streams.

BamToCov uses a streaming approach that takes full advantage of sorted input alignments. Furthermore, its memory usage depends only on the maximum coverage and not on the reference size. BamToCov proves to be a suitable alternative for gene panels and long-read datasets.
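
Why memory can depend only on the maximum coverage is easiest to see in a small sketch. The following Python generator is an illustration of the general streaming idea, not BamToCov's implementation: it assumes alignments on a single reference, given as half-open (start, end) pairs sorted by start, and keeps only the end positions of the alignments that currently overlap the sweep position in a min-heap. The function name is hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Segment = Tuple[int, int, int]  # (start, end, depth) for the half-open range [start, end)


def coverage_segments(alignments: Iterable[Tuple[int, int]]) -> Iterator[Segment]:
    """Yield covered segments with their depth from start-sorted alignments."""
    active_ends: List[int] = []  # min-heap; its size never exceeds the maximum depth
    pos = 0                      # left edge of the segment currently being built
    for start, end in alignments:
        if end <= start:
            continue  # ignore empty intervals
        # Close every alignment that ends at or before the new start.
        while active_ends and active_ends[0] <= start:
            finished = heapq.heappop(active_ends)
            if finished > pos:
                yield (pos, finished, len(active_ends) + 1)
                pos = finished
        # Emit the stretch up to the new start if something still covers it.
        if active_ends and start > pos:
            yield (pos, start, len(active_ends))
        pos = start
        heapq.heappush(active_ends, end)
    # Flush the alignments that are still open at the end of the stream.
    while active_ends:
        finished = heapq.heappop(active_ends)
        if finished > pos:
            yield (pos, finished, len(active_ends) + 1)
            pos = finished
```

Uncovered gaps are simply skipped, and adjacent segments with equal depth are not merged in this sketch.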





□ monaLisa: an R/Bioconductor package for identifying regulatory motifs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac102/6535228

monaLisa: MOtif aNAlysis with Lisa was inspired by her father Homer to look for enriched motifs in sets (bins) of genomic regions, compared to all other regions ("binned motif enrichment analysis").

The regions are for example promoters or accessible regions, which are grouped into bins according to a numerical value assigned to each region, such as change of expression or accessibility.





□ Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity

>> https://www.biorxiv.org/content/10.1101/2022.02.21.481353v1.full.pdf

The authors present Truvari, an SV comparison, annotation and analysis toolkit, and demonstrate the effect of SV comparison choices by building population-level VCFs from 36 haplotype-resolved long-read assemblies.

When SV comparison is too lenient, over-merging occurs, distinct alleles are lost, and metrics such as allele frequency are inflated. Truvari’s core functionality involves building a matrix of pairs of SVs and ordering the pairs to determine how each should be handled.





□ xcore: an R package for inference of gene expression regulators

>> https://www.biorxiv.org/content/10.1101/2022.02.23.481130v1.full.pdf

xcore takes a promoter or gene expression counts matrix as input. The data are then filtered for lowly expressed features, normalized for library size and transformed into counts per million (CPM) using edgeR.

Using ridge regression, xcore models changes in expression as a linear combination of molecular signatures in an attempt to find their unknown activities.
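
The two steps described above can be sketched with plain NumPy. xcore itself is an R package built on edgeR, so the code below is only an illustration of the arithmetic, not its API: the function names are hypothetical, the CPM computation is the simple unadjusted version (no normalization factors), and the ridge estimate uses the closed-form solution with a fixed penalty `alpha` and no intercept.

```python
import numpy as np


def counts_per_million(counts) -> np.ndarray:
    """Scale a features x samples count matrix so that each sample sums to 1e6."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0, keepdims=True)  # assumed non-zero per sample
    return counts / library_sizes * 1e6


def ridge_fit(signatures, response, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge estimate: solve (X^T X + alpha * I) w = X^T y.

    signatures: features x signatures matrix (X)
    response:   per-feature change in expression (y)
    Returns one activity estimate per signature.
    """
    X = np.asarray(signatures, dtype=float)
    y = np.asarray(response, dtype=float)
    n_signatures = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_signatures), X.T @ y)
```

A larger `alpha` shrinks the estimated activities toward zero, which stabilizes the fit when signatures overlap heavily; in practice the penalty would be chosen by cross-validation.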





□ EagleImp-Web: A Fast and Secure Genotype Phasing and Imputation Web Service using Field-Programmable Gate Arrays

>> https://www.biorxiv.org/content/10.1101/2022.02.24.481790v1.full.pdf

EagleImp-Web uses technical improvements in phasing and imputation algorithms and a field-programmable gate array (FPGA) accelerator design to reduce computation time without loss of phasing and imputation quality.

The main advantage of EagleImp over the classical two-step approach with Eagle2 and PBWT is a 2- to 10-fold increase in computation speed, with phasing and imputation quality maintained or even improved.




□ DRDNet: A statistical framework for recovering pseudo-dynamic networks from static data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac038/6537533

DRDNet incorporates a varying coefficient model with multiple ordinary differential equations to learn a series of networks.

DRDNet follows a philosophy of prediction, in which interaction effects from each node are assumed to be unknown and are modeled nonparametrically.





□ Contamination detection in genomic data: more is not enough

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02619-9

The algorithms can be divided into two main categories, depending on whether they are database-free or, conversely, rely on a reference database. The second category contains two different types of tools: genome-wide approaches and estimators based on single-copy gene markers.

The most frequent rationale for using multiple approaches is to increase the sensitivity and catch more contaminated genomes by considering the union of the methods. This is especially useful in large genomic projects where the loss of individual genomes is not too important.




□ Hanna Liubakova RT

#Ukraine
Residents of Energodar took to the streets to prevent Russian troops. In this city, the largest nuclear power plant in Europe - the Zaporizhzhia Nuclear Power Station - is located. Any shelling or explosion can be deadly here. I hope the Kremlin understands it

>> https://twitter.com/hannaliubakova/status/1498951257783947267?s=21




□ stavridisj RT

History being made in so many ways right in front of our eyes. As Supreme Allied Commander of NATO for 4 years, I never considered use of these “war reserve” equipment.

>> https://www.armytimes.com/flashpoints/2022/03/01/army-activates-prepositioned-stocks-for-first-time-in-wake-of-ukraine-invasion/

>> https://twitter.com/stavridisj/status/1499013500630413321?s=21




□ Victore Kovalenko RT

During the 5th day of war, the #Ukrainian air defense is actively engaging, and functional. In this video you can see how it intercepts the Russian missile in the sky between #Melipotol city and Vasilyevka settlement on the south. pic.twitter.com/iC5yCBtvGU #Ukraine

>> https://twitter.com/mrkovalenko/status/1498374524819189764?s=21




□ GEORGIA RT

>> https://twitter.com/tbilisime/status/1498439504696328192?s=21

Stay Strong Ukraine!🇺🇦 We pray for you!
#staystrongUkraine
All of Georgia has united in support of Ukraine.
People here have been coming out every day since this hell began.
Video🎥 Spitfire Media





□ The SETI Institute RT

>> https://twitter.com/setiinstitute/status/1498409435714174976?s=21

The U.S. and Russia have cooperated extensively in building and operating the @Space_Station since 1993. @esa, @JAXA_en, and @csa_asc have played major roles, but that deep cooperation is failing. Will Western sanctions end joint programs? buff.ly/3M24frb @NExSSManyWorlds




□ NFDI-de RT

>> https://twitter.com/nfdi_de/status/1498670920545849347?s=21

As a large network of research institutions in Germany, #NFDI is collecting links, contacts and services that can help scientists from Ukraine affected by the war. We hope that we can show our solidarity this way. #ScienceForUkraine @Sci_for_Ukraine

https://www.nfdi.de/important-links-for-scientists-from-ukraine/?lang=en





□ GraphBio: a shiny web app to easily perform popular visualization analysis for omics data

>> https://www.biorxiv.org/content/10.1101/2022.02.28.482106v1.full.pdf

GraphBio provides 15 modules, incl. heatmap, volcano plots, MA plots, network plots, dot plots, chord plots, pie plots, four quadrant diagrams, venn diagrams, cumulative distribution curves, PCA, survival analysis, ROC analysis, correlation analysis and text cluster analysis.





□ Mini-IsoQLR: a pipeline for isoform quantification using long-reads sequencing data for single locus analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.01.482488v1.full.pdf

Mini-IsoQLR was developed to detect and quantify isoforms from the expression of minigenes, whose cDNA was sequenced using Oxford Nanopore Technologies (ONT).

This protocol uses the GMAP aligner, which aligns cDNA sequences to a genome, with the parameter --format=2 to generate a GFF3 file containing the exon coordinates of all reads. Using this information, Mini-IsoQLR.R classifies the mapped reads into isoforms.





□ kana: Single-cell data analysis in the browser

>> https://www.biorxiv.org/content/10.1101/2022.03.02.482701v1.full.pdf

kana provides a streamlined one-click workflow for all steps in a typical scRNA-seq analysis, starting from a count matrix and finishing with marker detection.

Users can interactively explore the low-dimensional embeddings, clusterings and marker genes in an intuitive graphical interface that encourages iterative re-analysis.





□ Nanopore quality score resolution can be reduced with little effect on downstream analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.03.482048v1.full.pdf

The experiments on various usage scenarios for nanopore sequencing data, including different applications and coverage levels, show that the precision that is currently used for quality scores is unnecessarily high.

All these results were obtained with applications as they are provided, with no special tuning or training for quantized quality scores.

Although such specific tuning may improve the performance of these applications (for example through neural network retraining), the fact is that excellent results are obtained with no software adjustment.

The quantization of quality scores results in large storage space savings, even using a general purpose compressor such as gzip.
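
As an illustration of what reducing quality score resolution means, the sketch below maps Phred+33 quality characters onto a small set of representative values and measures the gzip-compressed size. The four bins in `BINS` are hypothetical and chosen only for the example; they are not the quantization schemes evaluated in the paper.

```python
import gzip
from typing import Sequence, Tuple

# Hypothetical bins: (lowest score, one past the highest score, representative score).
BINS: Sequence[Tuple[int, int, int]] = (
    (0, 7, 5),
    (7, 14, 10),
    (14, 26, 20),
    (26, 94, 30),
)


def quantize_quality(qual: str, bins: Sequence[Tuple[int, int, int]] = BINS, offset: int = 33) -> str:
    """Replace each Phred+33 quality character by the representative of its bin."""
    out = []
    for ch in qual:
        score = ord(ch) - offset
        for low, high, representative in bins:
            if low <= score < high:
                out.append(chr(representative + offset))
                break
        else:
            raise ValueError(f"quality score {score} is outside all bins")
    return "".join(out)


def gzip_size(text: str) -> int:
    """Size in bytes of the gzip-compressed ASCII text."""
    return len(gzip.compress(text.encode("ascii")))
```

Comparing `gzip_size(qual)` with `gzip_size(quantize_quality(qual))` on the quality lines of a real FASTQ file gives a quick feel for the storage effect of a smaller alphabet.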





□ CNVind: an open source cloud-based pipeline for rare CNVs detection in whole exome sequencing data based on the depth of coverage

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04617-x

CNVind performs n independent depth-of-coverage normalizations. Before each normalization, the application selects the k most correlated sequencing regions, using Pearson’s correlation of the depth of coverage as the distance metric.

The resulting subgroup of k+1 sequencing regions is then normalized, and the results of all n independent normalizations are combined. Finally, segmentation and CNV calling are performed on the resulting dataset.
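
The selection step can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not CNVind's code: `depth` is taken to be a regions x samples matrix of depth-of-coverage values with no constant rows, and the function name is hypothetical.

```python
import numpy as np


def most_correlated_regions(depth, target: int, k: int) -> np.ndarray:
    """Return the indices of the k regions most Pearson-correlated with `target`.

    depth: regions x samples matrix of depth-of-coverage values.
    Together with `target` itself, the result forms the k+1 regions to normalize.
    """
    depth = np.asarray(depth, dtype=float)
    # Row-wise Pearson correlation between the target region and every region.
    corr = np.corrcoef(depth)[target].copy()
    corr[target] = -np.inf  # never select the target as its own neighbor
    order = np.argsort(-corr, kind="stable")  # highest correlation first
    return order[:k]
```

Repeating this for every region as `target` gives one subgroup per normalization run.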





□ supCPM: Supervised Capacity Preserving Mapping: A Clustering Guided Visualization Method for scRNAseq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac131/6543606

supCPM, a robust supervised visualization method, which separates different clusters, preserves the global structure and tracks the cluster variance.

Continuous scRNAseq data often exhibit trajectories where functional overlaps occur. This real-world challenge could limit the effectiveness of supCPM, because the second optimization part separates different clusters far apart.

A future direction is how to process datasets containing a mixture of discrete and continuous cell types. supCPM outperforms other methods in preserving the global geometric structure and data variance.





□ JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac140/6543609

JIND is a framework for automated cell-type identification based on neural networks. It directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined.

JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need to retrain the model whenever a new dataset becomes available.

The NN used by JIND consists of two subnetworks, an encoder and a classifier. First, the encoder network maps the input gene expression vector onto a 256-dimensional latent space via a one-layer NN.





□ GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species

>> https://www.biorxiv.org/content/10.1101/2022.03.04.482637v1.full.pdf

GenErode aims to produce comparable estimates of genomic diversity indices from temporally sampled datasets that can be used to quantify genomic erosion through time.

GenErode requires only a reference genome assembly and whole-genome re-sequencing data. GenErode offers two complementary methods to estimate mutational load, a proxy for genetic load, from the genomic data of the samples analyzed.





□ vissE: A versatile tool to identify and visualise higher-order molecular phenotypes from functional enrichment analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.06.483195v1.full.pdf

vissE, a flexible network-based analysis method that summarises redundancies into biological themes and provides various analytical modules to characterise and visualise them with respect to the underlying data, thus providing a comprehensive view of the biological system.

The vissE method tackles gene-set redundancy by condensing information from all significant gene-sets into higher-order biological processes, thus hierarchically structuring the results in an easily browsable manner.





□ iPheGWAS: an intelligent computational framework to integrate and visualise genome-phenome wide association studies

>> https://www.biorxiv.org/content/10.1101/2022.03.05.483121v1.full.pdf

Since iPheGWAS provides an ordered or clustered visualisation of multiple traits that are genetically similar, an easy visual appreciation of the overall genome-wide landscape provides initial clues about shared genetic effects across multiple phenotypes.

iPheGWAS assists the process of selecting traits for a multi-trait analysis of genome-wide association studies (MTAG) to improve power for detecting genetic variants contributing to disease risk.





□ NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac149/6546285

ZINB-WaVE uses a zero inflated negative binomial model to find biologically meaningful latent factors. Optionally, the model can remove batch effects and other confounding variables, leading to a low-dimensional representation that focuses on biological differences among cells.

NewWave allows users to massively parallelize computations using PSOCK clusters. NewWave achieves the same or even better performance than ZINB-WaVE at a fraction of the computational time and memory usage, reducing the runtime by 90% with respect to ZINB-WaVE.