Blog posts for January 31, 2022 - lens, align.

elementum.

2022-01-31 13:31:13 | Science News

□ Dynamic inference of cell developmental complex energy landscape from time series single-cell transcriptomic data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009821

GraphFP is a model and dynamic inference framework based on a nonlinear Fokker-Planck equation (FPE) on a graph. It aims to reconstruct the complex potential energy landscape of cell state transitions from time-series single-cell transcriptomic data.

The discrete Wasserstein distance is introduced to transform the probability simplex into a Riemannian manifold, called discrete Wasserstein manifold. The FPE is proven to be the gradient flow of the free energy on the discrete Wasserstein manifold.

GraphFP learns the complex geometry of the data and provides a novel way to quantify cell-cell interactions during cell development. It models the cell developmental process as stochastic dynamics of the cell state/type frequencies on the probability simplex in continuous time.

□ End-to-end Learning Of Evolutionary Models To Find Coding Regions In Genome Alignments

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac028/6513381

ClaMSA classifies multiple sequence alignments using a phylogenetic model. It builds on TensorFlow and a custom layer for Continuous-Time Markov Chains (CTMC) and trains a set of rate matrices for a classification task.

This model is the standard general time-reversible (GTR) CTMC, and it allows gradients of the tree likelihood to be computed under the almost universally used continuous-time Markov chain model.

□ DIDL: A deep learning approach to predict inter-omics interactions in multi-layer networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04569-2

DIDL is a novel autoencoder architecture that is capable of learning a joint representation of both first-order and second-order proximities. DIDL offers several advantages like automatic feature extraction from raw data, end-to-end training, and robustness to network sparsity.

DIDL is a combination of a multilayer perceptron (MLP) and tensor factorization. The predictor and encoder parameters can be jointly optimized. The DIDL encoder clusters omics elements through latent feature extraction.

□ LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac058/6519151

LongPhase, an ultra-fast algorithm which can simultaneously phase single nucleotide polymorphisms (SNPs) and SVs of a human genome in ~10-20 minutes, 10x faster than the state-of-the-art WhatsHap and Margin.

LongPhase combined with Nanopore is a cost-effective approach for providing chromosome-scale phasing without the need for additional trios, chromosome-conformation, and single-cell strand-seq data.

□ Syllable-PBWT for space-efficient haplotype long-match query

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478234v1.full.pdf

Syllable-PBWT, a space-efficient variation of the positional Burrows-Wheeler transform which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function.

The Syllable-Query algorithm finds long matches between a query haplotype and the panel. Syllable-Query is significantly faster than the full memory algorithm. After reading in the query haplotype in O(N) time, these sequences require O(nβ log M) time to compute.
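The syllable and rolling-hash idea described above can be illustrated with a small Python sketch. This is not the Syllable-PBWT implementation: the function names, the syllable length `b`, and the constants `BASE` and `MOD` are all assumptions chosen for illustration.

```python
from typing import List

MOD = (1 << 61) - 1   # assumed modulus (illustrative choice)
BASE = 1_000_003      # assumed hash base (illustrative choice)


def syllabify(haplotype: str, b: int) -> List[int]:
    """Split a 0/1 haplotype string into blocks of b sites, one integer per block."""
    # Pad the tail with zeros so the last syllable has full length.
    padded = haplotype + "0" * (-len(haplotype) % b)
    return [int(padded[i:i + b], 2) for i in range(0, len(padded), b)]


def prefix_hashes(syllables: List[int]) -> List[int]:
    """Polynomial rolling hash of every prefix of the syllable sequence."""
    hashes = [0]
    for s in syllables:
        # The +1 keeps an all-zero syllable from contributing nothing to the hash.
        hashes.append((hashes[-1] * BASE + s + 1) % MOD)
    return hashes


def range_hash(hashes: List[int], lo: int, hi: int) -> int:
    """Hash of syllables[lo:hi], derived from the prefix hashes."""
    return (hashes[hi] - hashes[lo] * pow(BASE, hi - lo, MOD)) % MOD
```

Two haplotypes that agree on a run of syllables get equal range hashes over that run, so a candidate long match can be checked by comparing integers instead of site-by-site strings.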

□ VeChat: Correcting errors in long reads using variation graphs

>> https://www.biorxiv.org/content/10.1101/2022.01.30.478352v1.full.pdf

Unlike single consensus sequences, on which current approaches generally center, variation graphs are able to represent the genetic diversity across multiple, evolutionarily or environmentally coherent genomes.

VeChat distinguishes errors from haplotype-specific true variants based on variation graphs, which reflect a popular type of data structure for pangenome reference systems.

□ ETNA: Joint embedding of biological networks for cross-species functional alignment

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476697v1.full.pdf

ETNA (Embeddings to Network Alignment) generates individual network embeddings based on network topological structures and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence orthologs.

ETNA uses an autoencoder framework to generate lower-dimensional latent embeddings that preserve both local / global network topology while capturing the non-linear relationships. ETNA can be used to transfer genetic interactions across species and identify phenotypic alignments.

□ maxATAC: genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks

>> https://www.biorxiv.org/content/10.1101/2022.01.28.478235v1.full.pdf

The maxATAC models were specifically designed to improve prediction of TFBS from rare cell types and in vivo settings, where limited sample material or cell sorting strategies would preclude experimental TFBS measurement.

maxATAC predictions for all three TFs outperformed TF motif scanning in ATAC-seq peaks. maxATAC is capable of high resolution TFBS prediction using information-sharing between proximal sequence and accessibility signals.

□ lv89: C implementation of the Landau-Vishkin algorithm

>> https://github.com/lh3/lv89

This repo implements the Landau-Vishkin algorithm to compute the edit distance between two strings. This is a fast method for highly similar strings.

The actual implementation follows a simplified wavefront alignment (WFA) formulation rather than the original formulation. It also borrows a performance trick from WFA.
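To make the idea concrete, here is a minimal Python sketch of the furthest-reaching-diagonal scheme that underlies Landau-Vishkin and WFA for unit-cost edit distance. It is not the lv89 code and omits its performance trick; the function name and the dictionary-based wavefront are assumptions made for readability.

```python
def edit_distance_lv(a: str, b: str) -> int:
    """Unit-cost edit distance via furthest-reaching points per diagonal."""
    n, m = len(a), len(b)

    def extend(i: int, k: int) -> int:
        # Slide along diagonal k (j - i = k) while characters match.
        j = i + k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    # front[k] = furthest row i reachable on diagonal k with d edits.
    front = {0: extend(0, 0)}
    d = 0
    target = m - n  # diagonal that contains the end cell (n, m)
    while front.get(target, -1) < n:
        d += 1
        new_front = {}
        for k in range(max(-d, -n), min(d, m) + 1):
            best = -1
            if k in front:          # substitution: stay on the diagonal
                best = max(best, front[k] + 1)
            if k - 1 in front:      # insertion: consume a character of b
                best = max(best, front[k - 1])
            if k + 1 in front:      # deletion: consume a character of a
                best = max(best, front[k + 1] + 1)
            if best < 0:
                continue
            # Clip to the matrix boundary before extending matches.
            i = min(best, n, m - k)
            new_front[k] = extend(i, k)
        front = new_front
    return d
```

For example, `edit_distance_lv("kitten", "sitting")` returns 3. The work done grows with the number of edits rather than with the full product of the string lengths, which is why the method suits highly similar strings.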

□ Causal-net category

>> https://arxiv.org/pdf/2201.08963v1.pdf

A causal-net is a finite acyclic directed graph. The paper studies a category, denoted Cau, whose objects are causal-nets and whose morphisms are functors between the path categories of causal-nets.

It is called the causal-net category and is in fact the Kleisli category of the "free category on a causal-net" monad. Cau characterizes interesting causal-net relations, such as coarse-graining, immersion-minor, topological minor, etc., and the authors prove several useful decomposition theorems.

□ Stone Duality for Topological Convexity Spaces

>> https://arxiv.org/pdf/2201.09819v1.pdf

A convexity space is a set X with a chosen family of subsets (called convex subsets) that is closed under arbitrary intersections and directed unions. There is a lot of interest in spaces that have both a convexity space and a topological space structure.

The authors study the category of topological convexity spaces and extend the Stone duality between coframes and topological spaces to an adjunction between topological convexity spaces and sup-lattices.

An alternative approach to modelling the category of T0 topological spaces is via strictly zero dimensional biframes. For topological convexity spaces, this construction does not generate any new spaces to improve the properties of the category of spaces.

□ AGAMEMNON: an Accurate metaGenomics And MEtatranscriptoMics quaNtificatiON analysis suite

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02610-4

AGAMEMNON, a time and space-efficient in silico framework for the analysis of metagenomic/metatranscriptomic samples providing highly accurate microbial abundance estimates at genus, species, and strain resolution.

AGAMEMNON uses an EM algorithm to probabilistically resolve the origin of reads. AGAMEMNON also takes the sparsity of single-cell approaches into account in the differential abundance analyses: it offers methods shown to be robust in such settings, such as edgeR-LRT and edgeR-QLF.
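As a rough illustration of how an EM algorithm can resolve reads that are compatible with several references, here is a generic toy sketch in Python. It is not AGAMEMNON's model: the function name, the input layout (one list of compatible reference indices per read, each non-empty), and the uniform starting point are assumptions.

```python
from typing import List


def em_abundance(compat: List[List[int]], n_refs: int, n_iter: int = 100) -> List[float]:
    """Estimate relative abundances from read-to-reference compatibility lists."""
    theta = [1.0 / n_refs] * n_refs  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_refs
        # E-step: split each read across its compatible references
        # in proportion to the current abundance estimates.
        for refs in compat:
            z = sum(theta[r] for r in refs)
            for r in refs:
                counts[r] += theta[r] / z
        # M-step: renormalize the expected counts into abundances.
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta
```

With three reads unique to reference 0, one unique to reference 1, and one shared read, the shared read is gradually attributed mostly to the more abundant reference, and the estimates settle at 0.75 and 0.25.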

□ Cross-Dependent Graph Neural Networks for Molecular Property Prediction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac039/6517516

The multi-view graph neural network (MVGNN) forms a novel parallel framework which considers both atoms and bonds equally important when learning molecular representations.

CD-MVGNN adds a cross-dependent message passing scheme to enhance information communication between the different views. The paper theoretically justifies the expressiveness of the proposed model in terms of distinguishing non-isomorphic graphs.

□ Discovering adaptation-capable biological network structures using control-theoretic approaches

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009769

Since adaptation is a stable (convergent) response, according to the Hartman–Grobman theorem, the conditions obtained for adaptation using linear time-invariant (LTI) systems theory serve as sufficient conditions for the same even in non-linear systems.

The entire algorithm remains agnostic to the particularities of the reaction kinetics.

The network structures for adaptation ipso facto reduce peak time because of the infinite precision (zero-gain) requirement. The control-theoretic approach addresses the question of non-zero sensitivity along with the infinite precision requirement for perfect adaptation.

□ DeepGOZero: Improving protein function prediction from sequence and zero-shot learning based on ontology axioms

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476325v1.full.pdf

DeepGOZero combines a model-theoretic approach for learning ontology embeddings with protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions.

DeepGOZero uses a model-theoretic approach for embedding ontologies into a distributed space, ELEmbeddings. ELEmbeddings uses normalized GO axioms as constraints and projects each GO class into an n-ball and each relation as a transformation within n-dimensional space.

DeepGOZero computes the binary cross-entropy loss between the predictions and the labels, and optimizes it together with four normal form losses for ontology axioms from ELEmbeddings.

□ EagleImp: Fast and Accurate Genome-wide Phasing and Imputation in a Single Tool

>> https://www.biorxiv.org/content/10.1101/2022.01.11.475810v1.full.pdf

EagleImp is 2 to 10 times faster (depending on the single or multiprocessor configuration selected) than Eagle2/Position-based Burrows-Wheeler Transform (PBWT), with the same or better phasing and imputation quality.

EagleImp uses multiple threads for genotype imputation. It converts genotypes and haplotypes into a compact representation in integer registers and makes extensive use of Boolean and bit-masking operations as well as processor directives for bit operations.

□ Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04547-0

Lerna automates the configuration of k-mer-based error correction (EC) tools. Lerna first creates a language model (LM) of the uncorrected genomic reads, and then, based on this LM, calculates a metric called the perplexity metric to evaluate the corrected reads for different parameter choices.

Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome. The perplexity computation in Lerna is linear in the length of the input (number of reads × read length). Lerna is 80x to 275x faster than Bowtie2.

Lerna relies on a Simulated Annealing (SA)-based search. In Lerna, the Transformer LM uses the perplexity metric, which is derived as the exponential of the cross-entropy loss. Lerna maximizes the similarity between the ground truth and the predictions.
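The relationship between perplexity and cross-entropy mentioned above can be shown in a few lines of Python. This is a generic sketch, not Lerna's code; the function name and the input (the probability the model assigned to each observed token) are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity as the exponential of the mean cross-entropy (in nats)."""
    # Cross-entropy: average negative log-probability of the observed tokens.
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)
```

A model that assigns probability 0.25 to every observed base has a perplexity of 4, i.e. it is as uncertain as a uniform guess over four nucleotides. Lower perplexity means the model finds the reads more predictable.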

□ GraphChainer: Co-linear Chaining for Accurate Alignment of Long Reads to Variation Graphs

>> https://www.biorxiv.org/content/10.1101/2022.01.07.475257v1.full.pdf

GraphChainer solves co-linear chaining on a DAG when allowing one-node suffix-prefix overlaps between anchor paths. This solution is an extension of the O(k(|V| + |E|) log |V| + kN log N) time solution. GraphChainer significantly improves the alignments of GraphAligner.

GraphChainer divides the running time of the algorithm into O(k^3|V| + k|E|) for pre-processing the graph, and O(kN log kN) for solving co-linear chaining. That is, for constant-width graphs, this solution takes linear time to preprocess the graph plus O(N log N) time to solve co-linear chaining.

□ MOJITOO: a fast and universal method for integration of multimodal single cell data

>> https://www.biorxiv.org/content/10.1101/2022.01.19.476907v1.full.pdf

MOJITOO uses canonical correlation analysis for a fast detection of a shared representation of cells from multimodal scdata. Moreover, estimated canonical components can be used for interpretation, i.e. association of modality specific molecular features with the latent space.

MOJITOO does not require the definition of parameters such as the rank of the matrix. Furthermore, it provides an approach to estimate the size of the latent space after a single execution of CCA.

□ GAGAM: a genomic annotation-based enrichment of scATAC-seq data for Gene Activity Matrix

>> https://www.biorxiv.org/content/10.1101/2022.01.24.477458v1.full.pdf

Using genes as features solves the problem of the feature dataset dependency allowing for the link of gene accessibility and expression. The latter is crucial for gene regulation understanding and fundamental for the increasing impact of multi-omics data.

The Genomic Annotated GAM (GAGAM) leverages accessibility data and information from genomic annotations of regulatory regions to weigh the gene activity with the annotated functional significance of accessible regulatory elements linked to the genes.

□ scAR: Probabilistic modeling of ambient noise in single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476312v1.full.pdf

Single cell Ambient Remover (scAR) uses probabilistic deep learning to deconvolute the observed signals into native and ambient composition. scAR provides an efficient and universal solution to count denoising for multiple types of single-cell omics data.

This hypothesis suggests that ambient RNAs may not be completely random but deterministic signals to a certain extent.

scAR outputs a probability matrix representing the probability that raw observed counts contain native signals. scAR simultaneously infers the noise ratio (ε) and native expression frequencies (β) using the VAE framework.

□ sc-REnF: An entropy guided robust feature selection for single-cell RNA-seq data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab517/6509050

sc-REnF [robust entropy based feature (gene) selection method], aiming to leverage the advantages of Rényi and Tsallis entropies in gene selection for single cell clustering.

sc-REnF formulates an objective function that minimizes the conditional entropy between the selected features and maximizes the conditional entropy between the class label and the features.

When sc-REnF is applied multiple times with varying numbers of features, the resulting ARI scores show minimal deviation for both Rényi and Tsallis entropy.

□ scMVP: A deep generative model for multi-view profiling of single-cell RNA-seq and ATAC-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02595-6

scMVP (the single-cell Multi-View Profiler), a multi-modal deep generative model, which is designed for handling sequencing data that simultaneously measure gene expression / chromatin accessibility in the same cell, incl. SNARE-seq, sci-CAR, Paired-seq, SHARE-seq, and 10X Multiome.

scMVP takes raw count of scRNA-seq and term frequency–inverse document frequency (TF-IDF) transformed scATAC-seq as input.
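For readers unfamiliar with TF-IDF on accessibility data, the sketch below shows one common variant in plain Python. Several TF-IDF definitions are in use for scATAC-seq, and the exact one used by scMVP is not stated here, so the formula (term frequency times `log(1 + n_cells / cells_with_peak)`) and the function name are assumptions.

```python
import math
from typing import List


def tfidf(counts: List[List[float]]) -> List[List[float]]:
    """TF-IDF transform of a cells x peaks count matrix (one common variant)."""
    n_cells = len(counts)
    n_peaks = len(counts[0])
    # Document frequency: number of cells in which each peak is accessible.
    df = [sum(1 for cell in counts if cell[j] > 0) for j in range(n_peaks)]
    idf = [math.log(1 + n_cells / d) if d else 0.0 for d in df]
    out = []
    for cell in counts:
        total = sum(cell)  # per-cell depth used to normalize term frequency
        out.append([(c / total) * idf[j] if total else 0.0
                    for j, c in enumerate(cell)])
    return out
```

The transform down-weights peaks that are open in almost every cell and normalizes for per-cell sequencing depth, which is why it is a popular preprocessing step for sparse accessibility counts.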

scMVP automatically learns the common latent representation for scRNA-seq and scATAC-seq data through a clustering consistency-constrained multi-view VAE, and imputes each single layer data from the common latent embedding of the multi-omic data.

scMVP uses a cycle-GAN-like auxiliary network. scMVP introduces a multi-head self-attention module to capture the local long-distance correlation from the sparse and high-dimensional scATAC profile of the joint dataset, and mask attention to focus on the local semantic region.

□ GPSA: Alignment of spatial genomics and histology data using deep Gaussian processes

>> https://www.biorxiv.org/content/10.1101/2022.01.10.475692v1.full.pdf

Gaussian process spatial alignment (GPSA), a probabilistic model that aligns a set of spatially-resolved genomics and histology slices onto a known or unknown common coordinate system into which the samples are aligned both spatially and in terms of the phenotypic readouts.

GPSA uses two stacked Gaussian processes to align spatial slices across technologies and samples in a two-dimensional, three-dimensional, or potentially spatiotemporal coordinate system. GPSA allows for imputation of missing data and creation of dense spatial readouts.

□ SENIES: DNA Shape Enhanced Two-layer Deep Learning Predictor for the Identification of Enhancers and Their Strength

>> https://ieeexplore.ieee.org/document/9678035/

SENIES is a deep learning based two-layer predictor for enhancing the identification of enhancers and their strength by utilizing DNA shape information beyond two common sequence-derived features, namely kmer and one-hot.

Since there are 7 nucleotide / 6 base pair-step shape parameters used, the length of the concatenated shape feature vector can be formulated as 7×(N−4) + 6×(N−3). Given N=200 in this case, an input DNA sequence can be finally encoded with a DNA shape vector of 2554 dimensions.
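The arithmetic above can be checked with a short Python sketch. The function and parameter names are hypothetical; only the formula 7×(N−4) + 6×(N−3) comes from the text.

```python
def shape_vector_length(n: int, nucleotide_params: int = 7, step_params: int = 6) -> int:
    """Length of the concatenated DNA shape vector for a sequence of n bases."""
    # Nucleotide-level parameters cover n - 4 positions,
    # base pair-step parameters cover n - 3 positions.
    return nucleotide_params * (n - 4) + step_params * (n - 3)


# For N = 200: 7 * 196 + 6 * 197 = 1372 + 1182 = 2554
print(shape_vector_length(200))
```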

□ LuxRep: a technical replicate-aware method for bisulfite sequencing data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04546-1

LuxRep, a probabilistic method that implements a general linear model and simultaneously accounts for technical replicates (libraries from the same biological sample) from different bisulfite-converted DNA libraries.

LuxRep retains the general linear model with matrix normal distribution used by LuxGLM to handle covariates wherein matrix normal distribution is a generalisation of multivariate normal distribution to matrix-valued random variables.

□ OptiDiff: structural variation detection from single optical mapping reads

>> https://www.biorxiv.org/content/10.1101/2022.01.08.475501v1.full.pdf

OptiDiff uses a single molecule segment-matching approach to the reference map to detect and classify SV sites at coverages as low as 20x. OptiDiff uses a reference molecule set to obtain background mapping levels in all genomic regions on the reference.

OptiDiff calculates the ratio between this background mapping rate and the SV candidate molecules’ mapping rate to detect SV sites. Based on this segment-match information, OptiDiff then applies a simple rule tree to classify the type of structural variation.

□ Illumina

>> https://www.illumina.com/science/genomics-research/articles/infinity-high-performance-long-read-assay.html

The Infinity technology platform combines highly accurate Illumina SBS chemistry, the latest advancements in our data analysis portfolio and a novel proprietary assay to generate long contiguous data to address the most challenging regions of the genome.

Infinity also enables 10x greater throughput with 90% less DNA input than legacy long reads. We anticipate an early access launch for Infinity technology in the second half of the year.

□ Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres

>> https://www.biorxiv.org/content/10.1101/2022.01.11.475254v1.full.pdf

Telomeres, which are represented by (TTAGGG)n and (CCCTAA)n repeats in many organisms, were frequently miscalled (~40-50% of reads) as (TTAAAA)n, or as (CTTCTT)n and (CCCTGG)n repeats respectively, in a strand-specific manner during nanopore sequencing.

This miscalling is likely caused by the high similarity of current profiles between telomeric repeats and these repeat artefacts, leading to mis-assignment of electrical current profiles during basecalling.

The authors propose an overall strategy to re-basecall telomeric reads using a tuned nanopore basecaller. Selective application of the tuned models to telomeric reads led to improved recovery and analysis of telomeric regions, with little detected negative impact on basecalling of other genomic regions.

□ CCPLS reveals cell-type-specific spatial dependence of transcriptomes in single cells

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476034v1.full.pdf

CCPLS (Cell-Cell communications analysis by Partial Least Square regression modeling), which is a statistical framework for identifying cell-cell communications as the effects of multiple neighboring cell types on cell-to-cell expression variability of HVGs.

CCPLS performs PLS regression modeling and reports coefficients as the quantitative index of the cell-cell communications. CCPLS realizes a multiple regression approach for estimating the MIMO (multiple-input and multiple-output) system.

□ The effects of sequencing depth on the assembly of coding and noncoding transcripts in the human genome

>> https://www.biorxiv.org/content/10.1101/2022.01.30.478357v1.full.pdf

The effect of the sequencing depth varied based on cell or tissue type, the type of read considered and the nature and expression levels of the transcripts.

The detection of coding transcripts saturated rapidly for both short-read and long-reads. There was no sign of saturation for noncoding transcripts at any sequencing depth. Increasing long-read sequencing depth specifically benefited transcripts containing transposable elements.

□ HiC-LDNet: A general and robust deep learning framework for accurate chromatin loop detection in genome-wide contact maps

>> https://www.biorxiv.org/content/10.1101/2022.01.30.478367v1.full.pdf

HiC-LDNet can give relatively more accurate predictions in multiple tissue types and contact technologies. HiC-LDNet recovers a higher number of loop calls in multiple experimental platforms, and achieves higher confidence scores in multiple cell types.

HiC-LDNet shows strong robustness when scanning through the extremely sparse scHi-C data, and can recover the majority of the labeled loops. Considering the time complexity, HiC-LDNet could finish its prediction at an average 25s/Mbp across the entire genome at 10kb resolution.

□ NetMix2: Unifying network propagation and altered subnetworks

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478575v1.full.pdf

NetMix2 is an algorithm for identifying altered subnetworks from a wide range of subnetwork families, including the propagation family which approximates the subnetworks ranked highly by network propagation.

E Pluribus Unum.

2022-01-31 13:13:31 | Science News

□ SELINA: Single-cell Assignment using Multiple-Adversarial Domain Adaptation Network with Large-scale References

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476306v1.full.pdf

SELINA (single cELl identity NAvigator) optimizes the annotation for minority cell types by synthetic minority over-sampling, removes batch effects using a multiple-adversarial domain adaptation network (MADA), and fits the query data with reference data using an autoencoder.

SELINA affords a comprehensive and uniform reference atlas with 1.7 million cells covering 230 major human cell types.

To create a synthetic cell, SELINA multiplies the gene expression vector of each cell in a pair by a random weight and then sums the pair of weighted vectors, which places the synthetic cell at a random point on the line connecting the pair of cells. SELINA freezes the decoder and turns to updating the parameters of the encoder.
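The over-sampling step described above amounts to a convex combination of two expression vectors. Below is a minimal Python sketch, assuming the two weights are `w` and `1 - w` with `w` drawn uniformly from [0, 1); the function name and that weighting scheme are assumptions, not SELINA's published code.

```python
import random
from typing import List


def synthetic_cell(x: List[float], y: List[float], rng: random.Random) -> List[float]:
    """Return a random point on the line segment between two expression vectors."""
    w = rng.random()  # assumed: weights w and (1 - w) sum to one
    return [w * a + (1.0 - w) * b for a, b in zip(x, y)]


# Usage: interpolate between two cells of the same minority cell type.
rng = random.Random(0)
cell = synthetic_cell([0.0, 2.0, 5.0], [4.0, 2.0, 1.0], rng)
```

Every coordinate of the synthetic cell lies between the corresponding coordinates of the two parent cells, so the new sample stays inside the region already occupied by the minority type.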

□ MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04544-3

Multi-Graph count (MGcount) assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap in the same genomic position.

MGcount outputs a transcriptomic count matrix compatible with RNA-sequencing downstream analysis pipelines, with both bulk and single-cell resolution, and the graphs that model repeated transcript structures for different biotypes.

MGcount aggregates RNA products with similar sequences where reads systematically multi-map using a graph-based approach. The map equation formulates the theoretical limit to compress the description of an infinite random walk trajectory.

□ LDA: Supervised dimensionality reduction for exploration of single-cell data by Hybrid Subset Selection - Linear Discriminant Analysis

>> https://www.biorxiv.org/content/10.1101/2022.01.06.475279v1.full.pdf

LDA (linear discriminant analysis) identifies linear combinations of predictors that optimally separate a priori classes, enabling users to tailor visualizations to separate specific aspects of cellular heterogeneity.

Hybrid-Subset-Selection - LDA performs feature selection to enhance dimensionality reduction and visualization of single-cell data by maximizing class separation via a stepwise feature selection approach, selecting the final model based on a separation metric.

□ NetSeekR: a network analysis pipeline for RNA-Seq time series data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04554-1

The pipeline integrates one of the best-performing spliced aligners, STAR, with a pseudo-aligner, Kallisto, as well as two differential gene expression analysis tools (edgeR and Sleuth) that use different statistical models, together with data analysis and visualization methods.

NetSeekR, an RNA-Seq data analysis R package aimed at analyzing the transcriptome dynamics for inferring networks of differentially expressed genes associated with experimental treatments measured at multiple time points.

□ Uncovering hidden assembly artifacts: when unitigs are not safe and bidirected graphs are not helpful

>> https://www.biorxiv.org/content/10.1101/2022.01.20.477068v1.full.pdf

Under-assembly issues due to the palindrome artifact are rare in real genomes and, moreover, can be trivially fixed by forcing the unitigs to “push their way through” lonely inverted loops.

A theoretical and empirical study to validate two hypotheses about common algorithm-driven sources of mis- and under-assemblies. First, despite widespread belief to the contrary, we show that even on error-free data, unitigs do not always appear in the sequenced genome (i.e. they are unsafe).

There is a bijection between maximal unitigs in the doubled and bidirected dBGs, except that palindromic unitigs in the doubled dBG are split in half in the bidirected dBG. Naively using the bidirected graph actually contributes to under-assembly compared to the doubled graph.

□ PSSs: Using syncmers improves long-read mapping

>> https://www.biorxiv.org/content/10.1101/2022.01.10.475696v1.full.pdf

Parameterized Syncmer Schemes (PSSs) provide a theoretical analysis for multiple arbitrary s-minimizer positions. It is possible to retain properties of syncmers such as minimum and most frequent distances between selected positions by choosing the correct parameters and downsampling rate.

The authors incorporate PSSs into the long-read mappers minimap2 and Winnowmap2. These syncmer-based mappers outperformed minimap2 and Winnowmap2 and succeeded in mapping more long reads across a range of different compression values.

□ Phiclust: a clusterability measure for single-cell transcriptomics reveals phenotypic subpopulations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02590-x

phiclust (ϕclust), a clusterability measure derived from random matrix theory that can be used to identify cell clusters with non-random substructure, testably leading to the discovery of previously overlooked phenotypes.

Universal properties of the underlying theory make it possible to apply phiclust to arbitrary noise distributions, and the noise can be additive or multiplicative.

If the number of non-zero singular values is small compared to the dimensions of the matrix, low-rank perturbation theory is applicable. This theory allows us to calculate the singular values of the measured gene expression matrix from the singular values of the signal matrix.

□ JAFFAL: detecting fusion genes with long-read transcriptome sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02588-5

JAFFAL, a new method which is built on the concepts developed in JAFFA and overcomes the high error rate in long-read transcriptome data by using alignment methods and filtering heuristics which are designed to handle noisy long reads.

JAFFAL employs a strategy which anchors transcript breakpoints to exon boundaries. It uses the end position of reference genome alignments to determine fusion breakpoints. JAFFAL is a transcript-centric approach rather than a genome-centric approach like other fusion finders.

□ OLOGRAM-MODL: mining enriched n-wise combinations of genomic features with Monte Carlo and dictionary learning

>> https://academic.oup.com/nargab/article/3/4/lqab114/6478886

OLOGRAM-MODL considers overlaps between n ≥ 2 sets of genomic regions, and computes their statistical mutual enrichment by Monte Carlo fitting of a Negative Binomial distribution, resulting in more resolutive P-values.

OLOGRAM-MODL combines an optional itemset mining algorithm with a statistical model to determine the enrichment of the relevant combinations, asserting whether this combination occurs in the real data across more base pairs than would be expected by chance.

□ ORTHOSKIM: in silico sequence capture from genomic and transcriptomic libraries for phylogenomic and barcoding applications

>> https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13584

ORTHOSKIM, which performs in silico capture of targeted sequences from genomic and transcriptomic libraries without assembling whole organelle genomes.

ORTHOSKIM proceeds in three steps: 1) global sequence assembly, 2) mapping against reference sequences, and 3) target sequence extraction. ORTHOSKIM recovered with high success rates cpDNA, mtDNA and rDNA sequences.

□ CaiNet: Periodic synchronization of isolated network elements facilitates simulating and inferring gene regulatory networks including stochastic molecular kinetics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04541-6

By considering a deterministic time evolution within each time interval for all elements, this method approaches the solution of the system of deterministic differential equations associated with the GRN.

CaiNet is able to recover the network topology and the network parameters well. CaiNet is able to reproduce noise-induced bi-stability and oscillations in dynamically complex GRNs. This modular approach further allows for a simple consideration of deterministic delays.

□ PPS: Path-level interpretation of Gaussian graphical models using the pair-path subscore

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04542-5

The pair-path subscore (PPS), a method for interpreting Gaussian graphical models at the level of individual network paths. The scoring is based on the relative importance of such paths in determining the Pearson correlation between their terminal nodes.

The PPS can be used to probe network structure on a finer scale by investigating which paths in a potentially intricate topology contribute most substantially to marginal behavior.

□ FAVSeq: Machine learning-assisted identification of factors contributing to the technical variability between bulk and single-cell RNA-seq experiments

>> https://www.biorxiv.org/content/10.1101/2022.01.06.474932v1.full.pdf

The FAVSeq (Factors Affecting Variability in Sequencing data) pipeline analyzes multimodal RNA sequencing data, allowing the identification of factors affecting quantitative differences in gene expression measurements as well as the presence of dropouts.

The FAVSeq module supports both non-parametric and parametric imputation strategies, including k-Nearest Neighbors. FAVSeq optimizes the hyper-parameters of models through a 5-fold cross-validated (CV) grid search.

□ scDIOR: single cell RNA-seq data IO software

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04528-3

scDIOR accommodates a variety of data types across programming languages and platforms in an ultrafast way, including single-cell RNA-seq and spatially resolved transcriptomics data, using only a few lines of code in an IDE or a command-line interface.

scDIOR can perform spatial omics data IO between Seurat and Scanpy. scDIOR creates 8 HDF5 groups to store core single-cell information, including data, layers, obs, var, dimR, graphs, uns and spatial.

□ scDALI: modeling allelic heterogeneity in single cells reveals context-specific genetic regulation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02593-8

scDALI, a versatile computational framework that integrates information on cellular states with allelic quantifications of single-cell sequencing data to characterize cell-state-specific genetic effects.

scDALI enables the estimation of allelic imbalance from sparse sequencing data in individual cells, thereby facilitating the visualization and downstream interpretation of allelic regulation.

□ GCRNN: graph convolutional recurrent neural network for compound–protein interaction prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04560-x

Graph Convolutional Recurrent Neural Network (GCRNN) uses protein analysis based on a CNN after a max-pooling layer followed by a bidirectional LSTM layer. A gated recurrent unit (GRU) is also used for protein sequence vectorization.

GCRNN uses a 3-layer GNN with a radius r of 2 to represent molecules as vectors. The CNN takes the original amino acid sequence and passes it through a 3-layer structure with 320 convolutional kernels and a window size of 30, with random initialization based on a similar model.

□ ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04556-z

ChromoMap’s flexibility allows for concurrent visualization of genomic data in each strand of a given chromosome, or of more than one homologous chromosome, allowing the comparison of multi-omic data between genotypes or between homologous chromosomes of phased diploid/polyploid genomes.

ChromoMap takes tab-delimited files (BED like) or alternatively R objects to specify the genomic co-ordinates of the chromosomes and elements to annotate. ChromoMap renders chromosomes as a continuous composition of windows, to surmount this restriction.

□ Bookend: Precise Transcript Reconstruction with End-Guided Assembly

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476004v1.full.pdf

Bookend uses end information to guide transcript assembly for identifying RNA ends in sequencing data and using the information to assemble transcript isoforms as paths through a network accounting for splice sites, transcription start sites (TSS) and polyadenylation sites (PAS).

Bookend enables the automated annotation of promoter architecture. Bookend takes RNA-seq reads from any method as input and after alignment to a reference genome, reads are stored in a lightweight end-labeled read (ELR) file format that records all RNA boundary features.

□ OKseqHMM: a genome-wide replication fork directionality analysis toolkit

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476022v1.full.pdf

OKseqHMM directly measures the genome-wide replication fork directionality (RFD) as well as replication initiation and termination from data obtained by Okazaki fragment sequencing (OK-Seq) and related techniques.

OKseqHMM allows accurate detection of replication initiation/termination zones with an HMM algorithm. OKseqHMM can be applied to analyze data obtained by both kinds of techniques, i.e., eSPAN and TrAEL-seq.

□ Telogator: a method for reporting chromosome-specific telomere lengths from long reads

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac005/6505201

While a majority of methods for measuring telomere length will report average lengths across all chromosomes, it is known that aberrations in specific chromosome arms are biomarkers for certain diseases.

Telogator detects chromosome-specific telomere length in simulated data across a range of read lengths and error rates. The authors also investigate common subtelomere rearrangements and identify the minimum read length required to anchor telomere/subtelomere boundaries.

□ c-TSNE: Explainable t-SNE for single-cell RNA-seq data analysis

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476084v1.full.pdf

c-TSNE (cell-driven t-SNE), an explainable t-SNE that demonstrates robustness to dropout and noise in dimension reduction and clustering. It provides a novel and practical way to investigate the interpretability of t-SNE in scRNA-seq data analysis.

c-TSNE uses appropriate and explainable distance metrics, including Yule, L-Chebyshev, and fractional distance metrics. The cell-driven distance metrics map more relevant samples as each other's closest neighbors in the low-dimensional embedding space.

□ baredSC: Bayesian approach to retrieve expression distribution of single-cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04507-8

baredSC infers the intrinsic expression distribution using a Gaussian mixture model. baredSC can be used to obtain the distribution in one dimension for individual genes and in two dimensions for pairs of genes, in particular to estimate the correlation between the two genes.

baredSC precisely retrieves multi-modal expression distributions even when the modes are not distinguishable in the input data due to sampling noise. It is also able to uncover the expression distribution used to simulate the data, even in multi-modal cases with very sparse data.

□ A semi-supervised Bayesian mixture modelling approach for joint batch correction and classification

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476352v1.full.pdf

This model allows observations to be probabilistically assigned to classes in a way that incorporates uncertainty arising from batch effects.

The MVN mixture model exhibited good behaviour, except when misspecified as in the MVT generated data. The MVT mixture model’s estimate tended to be either centred on the true value.

□ LYRUS: a machine learning model for predicting the pathogenicity of missense variants

>> https://academic.oup.com/bioinformaticsadvances/article-abstract/2/1/vbab045/6483096

LYRUS is a machine learning method that uses an XGBoost classifier to predict the pathogenicity of single amino acid variations (SAVs). LYRUS incorporates five sequence-based, six structure-based and four dynamics-based features.

LYRUS includes a newly proposed sequence co-evolution feature called the variation number. Variation numbers employed in the model are scaled using min-max normalization for each amino acid sequence.
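Min-max normalization maps the values for one sequence onto the range [0, 1]. The following is a minimal sketch, assuming the variation numbers arrive as a plain list of numbers; the function name and the example values are hypothetical and are not taken from the LYRUS code.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to scale.
        # Returning zeros is a convention chosen for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical variation numbers for the positions of one amino acid sequence.
variation_numbers = [2, 5, 11, 8]
scaled = min_max_scale(variation_numbers)
```

Scaling per sequence, as the summary describes, makes the feature comparable across proteins whose raw variation numbers span different ranges.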

□ scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04574-5

scAnnotatR is a novel R package that provides a complete framework to classify cells in scRNA-seq datasets using pre-trained classifiers. It supports both Seurat and Bioconductor’s SingleCellExperiment and is thereby compatible with the vast majority of R-based analysis workflows.

scAnnotatR uses hierarchically organised SVMs to distinguish a specific cell type versus all others. It shows comparable or even superior accuracy, sensitivity and specificity compared to existing tools while being able to leave unknown cell types unclassified.

□ Bulk2Space: Spatially resolved single-cell deconvolution of bulk RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476472v1.full.pdf

Bulk2Space, a spatial deconvolution method based on deep learning frameworks, which converts bulk transcriptomes into spatially resolved single-cell expression profiles using existing high-quality scRNA-seq data and spatial transcriptomics as references.

Bulk2Space first generates single-cell transcriptomic data within the clustering space to find a set of cells whose aggregated data are close to the bulk data. Next, the generated single cells are allocated to optimal spatial locations using a spatial transcriptome reference.

□ A novel gene functional similarity calculation model by utilizing the specificity of terms and relationships in gene ontology

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04557-6

The proposed method mainly contains three steps. Firstly, a novel computing model is put forward to compute the information content (IC) of terms. This model has the ability to exploit the specific structural information of GO terms.

Secondly, the IC of term sets is computed by capturing the genetic structure between the terms contained in the set.

Thirdly, gene functional similarity is measured according to the IC overlap ratio of the corresponding annotated gene sets. The proposed method accurately measures the IC of not only GO terms but also the annotated term sets by leveraging the specificity of edges in the GO graph.

□ sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02562-1

sPLINK, a hybrid federated and user-friendly tool, which performs privacy-aware GWAS on distributed datasets while preserving the accuracy of the results.

sPLINK is robust against heterogeneous distributions of data across cohorts while meta-analysis considerably loses accuracy in such scenarios. sPLINK achieves practical runtime and acceptable network usage for chi-square and linear/logistic regression tests.

□ AIME: Autoencoder-based integrative multi-omics data embedding that allows for confounder adjustments

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009826

AIME can detect nonlinear associations between the data matrices. It finds data embedding from the input data matrix that best preserves its relation with the output data matrix.

AIME can be seen as a nonlinear equivalent to CCA, with the added capability to adjust for confounder variables. AIME is even more effective than traditional linear methods such as CCA, PLS, jSVD, iCluster2 and MOFA2 in extracting linear relationships.

□ LmTag: functional-enrichment and imputation-aware tag SNP selection for population-specific genotyping arrays

>> https://www.biorxiv.org/content/10.1101/2022.01.28.478108v1.full.pdf

LmTag, a novel method for tag SNP selection that not only improves imputation performance but also prioritizes highly functional SNP markers.

LmTag uses robust statistical modeling to systematically integrate LD information, minor allele frequency (MAF), and physical distance of SNPs into the imputation accuracy score to improve tagging efficiency.

LmTag adapts the beam search framework to prioritize both variant imputation scores and functional scores to solve the tag SNP selection problem. LmTag improves both imputation performance and prioritization of functional variants.

The tagging efficiency of tag SNP sets selected by LmTag is substantially higher than that of existing genotyping arrays, indicating potential improvements for future genotyping platforms.

□ STORM: spectral sparsification helps restore the spatial structure at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2022.01.25.477389v1.full.pdf

STORM reconstructs the single-cell resolution quasi-structure from the spatial transcriptome with diminished pseudo affinities.

STORM first curates the representative single-cell profiles for each spatial spot from a candidate library, then reduces the pseudo affinities in the intercellular affinity matrix by partial correlation, spectral graph sparsification, and spatial coordinates refinement.

STORM embeds the estimated interactions into a low-dimensional space with the cross-entropy objective to restore the intercellular quasi-structures, which facilitates the discovery of dominant ligand-receptor pairs between neighboring cells at single-cell resolution.

□ Diagnostic Evidence GAuge of Single cells (DEGAS): a flexible deep transfer learning framework for prioritizing cells in relation to disease

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01012-2

DEGAS is a deep transfer learning framework that integrates scRNA-seq and patient-level transcriptomic data in order to infer the transferable “impressions” between patient characteristics in single cells and cellular characteristics in patients.

DEGAS models are trained using both single-cell and patient disease attributes using a multitask learning neural network that learns latent representation reducing the differences between patients and single cells at the final hidden layer using Maximum Mean Discrepancy.
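To make the Maximum Mean Discrepancy (MMD) term concrete, here is a minimal sketch of the biased squared-MMD estimate with a Gaussian kernel. The kernel choice, the bandwidth and the toy 2-D "hidden layer" vectors are assumptions made for illustration; they are not taken from the DEGAS implementation.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples of vectors."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical latent representations of single cells and of patients.
cells = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0]]
patients = [[1.0, 1.1], [0.9, 1.2], [1.1, 0.8]]

gap = mmd_squared(cells, patients)   # clearly positive: the two samples differ
no_gap = mmd_squared(cells, cells)   # zero up to rounding: same sample
```

Used as a loss term, a quantity like `gap` shrinks as the two sets of latent vectors become harder to tell apart, which is the alignment effect the summary describes.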

□ ReadBouncer: Precise and Scalable Adaptive Sampling for Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.02.01.478636v1.full.pdf

ReadBouncer, a new approach for nanopore adaptive sampling that combines fast CPU and GPU base calling with read classification based on Interleaved Bloom Filters (IBF).

ReadBouncer uses Oxford Nanopore's Read Until functionality to unblock reads that match to a given reference sequence database. Signals are basecalled in real-time with Guppy or DeepNano-blitz.
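The classification idea can be sketched with an ordinary Bloom filter over reference k-mers. This is a simplification made for illustration: it is a single plain Bloom filter, not an Interleaved Bloom Filter, and the class, the k-mer size, the match threshold and the toy reference are all hypothetical rather than ReadBouncer's own code or defaults.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (a stand-in for an IBF in this sketch)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several hash positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def matches_reference(read_prefix, bloom, k=5, min_fraction=0.8):
    """Call a read a match if enough of its k-mers are found in the filter."""
    read_kmers = kmers(read_prefix, k)
    if not read_kmers:
        return False
    found = sum(1 for km in read_kmers if km in bloom)
    return found / len(read_kmers) >= min_fraction


# Toy reference; a real database would hold whole genomes.
reference = "ACGTTGCATGGCCATTAGCAGGTACCTGA"
bloom = BloomFilter()
for km in kmers(reference, 5):
    bloom.add(km)

# A basecalled read prefix taken from the reference matches, so it is unblocked.
read_prefix = reference[4:20]
decision = "unblock" if matches_reference(read_prefix, bloom) else "keep sequencing"
```

A Bloom filter never misses a k-mer that was inserted, so reads drawn from the reference are always flagged; unrelated reads can occasionally be flagged through false positives, which is why a fraction threshold is used rather than a single hit.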

Perplexium.

2022-01-31 13:13:13 | Science News

□ Aristotle: stratified causal discovery for omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04521-w

Aristotle is a multi-phase algorithm that tackles the above challenges by using a novel divide-and-conquer scheme that utilizes biclustering for finding the promising strata and candidate causes and QED to identify the stratum-specific causes.

Aristotle detects the hidden strata using SUBSTRA. SUBSTRA learns feature weights, and uses these weights when computing the strata. Aristotle needs to evaluate the causality of the association between each of the candidate features and each of the positive strata.

□ Improving the time and space complexity of the WFA algorithm and generalizing its scoring

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476087v1.full.pdf

The time complexity of the Wavefront Algorithm (WFA) is O(sN), taking N = min{M,N} without loss of generality. It may need to perform O(N) character comparisons over the course of the algorithm. The algorithm requires O(s^2) additional space over and above the O(M + N) space.

The suffix tree-based algorithm required significantly more time than the direct comparison algorithm. This contrasts with the suffix tree algorithm’s favorable asymptotic time complexity: these sequences are insufficiently divergent for the asymptotic behavior to set in.

The paper presents refinements of the WFA alignment algorithm with better complexity. These WFA variants improve its asymptotic memory use from O(s^2) to O(s^(3/2)) and its asymptotic run time from O(sN) to O(s^2 + N).
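For intuition about where the O(sN) bound comes from, here is a minimal sketch of the wavefront idea for plain unit-cost edit distance. It is the basic scheme only, not the refined variants from the paper, and it ignores gap-affine scoring; the function name and structure are illustrative assumptions.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via furthest-reaching points per diagonal."""
    m, n = len(a), len(b)

    def extend(i, k):
        # Slide along diagonal k (k = i - j) while characters match for free.
        j = i - k
        while i < m and j < n and a[i] == b[j]:
            i += 1
            j += 1
        return i

    target_diagonal = m - n
    # front[k] = furthest position i in `a` reachable on diagonal k at score s.
    front = {0: extend(0, 0)}
    s = 0
    while front.get(target_diagonal, -1) < m:
        s += 1
        new_front = {}
        for k in range(max(-n, -s), min(m, s) + 1):
            best = max(
                front.get(k - 1, -2) + 1,  # delete a character of `a`
                front.get(k, -2) + 1,      # substitute (mismatch)
                front.get(k + 1, -2),      # insert a character of `b`
            )
            if best < 0:
                continue
            # Clip to the matrix boundary, then take free matches.
            best = min(best, m, n + k)
            new_front[k] = extend(best, k)
        front = new_front
    return s


distance = wavefront_edit_distance("kitten", "sitting")
```

Each score s touches at most 2s + 1 diagonals, and the match extension does the character comparisons, so similar sequences (small s) are aligned quickly. Keeping only the previous wavefront, as here, is enough to obtain the score; recovering the alignment itself requires storing the earlier wavefronts.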

□ The minimizer Jaccard estimator is biased and inconsistent

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476226v1.full.pdf

The minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e., the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow.

The paper derives an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. It shows both theoretically and empirically that there are families of sequences where the bias can be substantial, e.g. the true Jaccard can be more than double the estimate.
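The two quantities being compared can be sketched as follows: the Jaccard similarity of the full k-mer sets, and the Jaccard similarity of the minimizer sets only. The hash function (CRC32), the window definition and the toy sequences are assumptions chosen for illustration, not the paper's experimental setup.

```python
import zlib


def kmer_list(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard(items_a, items_b):
    a, b = set(items_a), set(items_b)
    union = a | b
    # Two empty sets are treated as identical (a convention for this sketch).
    return len(a & b) / len(union) if union else 1.0


def minimizers(seq, k, w):
    """Smallest-hash k-mer from every window of w consecutive k-mers."""
    kms = kmer_list(seq, k)
    chosen = set()
    for start in range(len(kms) - w + 1):
        window = kms[start:start + w]
        # Ties on the hash value are broken by the k-mer string itself.
        chosen.add(min(window, key=lambda km: (zlib.crc32(km.encode()), km)))
    return chosen


def minimizer_jaccard(s1, s2, k, w):
    return jaccard(minimizers(s1, k, w), minimizers(s2, k, w))


# Toy sequences differing by one substitution.
seq_a = "ACGTTGCATGGCCATTAGCAGGTACCTGAACGT"
seq_b = "ACGTTGCATGGCCATTAGCTGGTACCTGAACGT"

true_jaccard = jaccard(kmer_list(seq_a, 5), kmer_list(seq_b, 5))
estimate = minimizer_jaccard(seq_a, seq_b, k=5, w=4)
```

Running both functions over families of sequences with different layouts of shared k-mers is one way to observe the gap between `true_jaccard` and `estimate` that the paper analyzes.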

□ Power analysis for spatial omics

>> https://www.biorxiv.org/content/10.1101/2022.01.26.477748v1.full.pdf

An in silico tissue framework to enable spatial power analysis and assist with experimental design. ISTs can be directly used for method development and benchmarking of existing or novel spatial analysis methods.

In silico tissues were generated by first constructing a tissue scaffold - a blank tissue with no cell information assigned - then assigning cell type labels to the scaffold.

A beta-binomial model predicts how many single cells need to be measured to observe a cell type of interest at a certain probability, and a gamma-Poisson model predicts how many FOVs are required to observe a cell type of interest at a certain probability.
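As a worked illustration of the beta-binomial part, the probability of seeing at least one cell of a given type among n sampled cells is 1 - B(alpha, beta + n) / B(alpha, beta), where B is the beta function. The sketch below searches for the smallest n reaching a target probability. The parameter values are hypothetical, and "at least one cell" is an assumed detection criterion; the paper's exact formulation may differ.

```python
import math


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def prob_at_least_one(n_cells, alpha, beta):
    """P(X >= 1) for X ~ BetaBinomial(n_cells, alpha, beta)."""
    log_p_zero = log_beta(alpha, beta + n_cells) - log_beta(alpha, beta)
    return 1.0 - math.exp(log_p_zero)


def cells_needed(target_prob, alpha, beta, max_cells=1_000_000):
    """Smallest number of cells whose detection probability reaches the target."""
    for n in range(1, max_cells + 1):
        if prob_at_least_one(n, alpha, beta) >= target_prob:
            return n
    raise ValueError("target probability not reached within max_cells")


# Hypothetical rare cell type: mean abundance alpha / (alpha + beta) = 2%.
n_required = cells_needed(0.95, alpha=2.0, beta=98.0)
```

The beta prior lets the cell-type proportion vary between samples, so the required number of cells is larger than a plain binomial calculation with a fixed 2% abundance would suggest.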

□ NGSEP 4: Efficient and Accurate Identification of Orthogroups and Whole Genome Alignment

>> https://www.biorxiv.org/content/10.1101/2022.01.27.478091v1.full.pdf

NGSEP implements functionalities for identification of clusters of homologous genes, synteny analysis and whole genome alignment, and visualization. Clustering is performed from the graph running Markov Clustering on the connected components.

If genome assemblies are provided as input, synteny relationships are identified for each pair of genomes implementing an adapted version of the HalSynteny algorithm.

A synteny block is identified by making a single traversal and calculating, for each vertex, the total length of the longest path that finishes at that vertex. The vertex with the longest global path length is chosen as the last vertex of the synteny path, and the path is reconstructed by following predecessors.
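The traversal described above is the standard longest-path computation on a directed acyclic graph. A minimal sketch follows, assuming vertices are already listed in topological order and each carries a length; the data layout and names are hypothetical and not taken from NGSEP.

```python
def longest_path(vertex_lengths, edges):
    """Longest path in a DAG whose vertices 0..n-1 are in topological order.

    vertex_lengths: list with the length contributed by each vertex.
    edges: list of (u, v) pairs with u < v.
    Returns (total_length, path_as_list_of_vertices).
    """
    n = len(vertex_lengths)
    if n == 0:
        return 0, []

    predecessors = [[] for _ in range(n)]
    for u, v in edges:
        if not u < v:
            raise ValueError("edges must go from an earlier to a later vertex")
        predecessors[v].append(u)

    best = [0] * n       # best[v]: length of the longest path finishing at v
    back = [None] * n    # back[v]: previous vertex on that path
    for v in range(n):   # single traversal in topological order
        best[v] = vertex_lengths[v]
        for u in predecessors[v]:
            if best[u] + vertex_lengths[v] > best[v]:
                best[v] = best[u] + vertex_lengths[v]
                back[v] = u

    # The vertex with the longest global path is the last vertex of the block.
    end = max(range(n), key=lambda v: best[v])
    path = []
    v = end
    while v is not None:
        path.append(v)
        v = back[v]
    path.reverse()
    return best[end], path


# Toy graph: five aligned segments with their lengths in base pairs.
lengths = [100, 50, 200, 80, 120]
links = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
total, block = longest_path(lengths, links)
```

Here the route through vertex 2 beats the route through vertex 1 because it carries more aligned length, so the reconstructed block is 0, 2, 3, 4.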

□ SNP calling for the Illumina Infinium Omni5-4 SNP BeadChip kit using the butterfly method

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476594v1.full.pdf

The “butterfly method” performs SNP calling with the Illumina Infinium Omni5-4 BeadChip kit without the use of Illumina GenomeStudio software. The method is a within-sample method and uses neither other samples nor population frequencies to call SNPs.

By lowering the a posteriori probability threshold for no-calls, we can get a higher call rate than GenomeStudio, and by using a higher a posteriori probability threshold, we can achieve a higher concordance with the WGS data.

□ SLAG: A Program for Seeded Local Assembly of Genes in Complex Genomes

>> https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13580

SLAG (Seeded Local Assembly of Genes) fulfills this need by performing iterative local assembly based on cycles of matching-read retrieval with blast and assembly with CAP3, phrap, SPAdes, canu, or Unicycler.

Read fragmentation allows SLAG to use phrap or CAP3 to assemble long reads at lower coverage (e.g., 5x) than is possible with canu or Unicycler.

A SLAG assembly can cover a whole chromosome, but in complex genomes the growth of target-matching contigs is limited as additional reads are consumed by consensus contigs consisting of repetitive elements.

□ scCorr: A novel graph-based k-partitioning approach improves the detection of gene-gene correlations by single-cell RNA sequencing

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08235-4

scCorr uses a graph-based algorithm to recover the missing gene-gene correlation in scRNA-seq data that enables the reliable acquisition of cluster-based gene-gene correlations in three independent scRNA-seq datasets.

The scCorr algorithm generates a graph or topological structure, partitions the graph into k min-clusters using the Louvain algorithm, and averages the expression values, including zero values.

□ DENTIST-using long reads for closing assembly gaps at high accuracy

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giab100/6514926

DENTIST determines repetitive assembly regions to identify reliable and unambiguous alignments of long reads to the correct loci, integrates a consensus sequence computation to obtain a high base accuracy for the inserted sequence, and validates the accuracy of closed gaps.

DENTIST improves the contiguity and completeness of fragmented assemblies with long reads. DENTIST uses the first 3 repeat annotations as a soft mask and aligns all input long reads to the assembly using damapper, which outputs chains of local alignments arising from read artefacts.

□ ECCsplorer: a pipeline to detect extrachromosomal circular DNA (eccDNA) from next-generation sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04545-2

Following Illumina-sequencing of amplified circular DNA (circSeq), the ECCsplorer enables an easy and automated discovery of eccDNA candidates.

The ECCsplorer pipeline provides a framework for the automated detection of eccDNA candidates using well established tools including data transfer between tools, data summarization and assessment.

□ Using dual-network-analyser for communities detecting in dual networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04564-7

Dual-Network-Analyser is based on the identification of communities that induce optimal modular subgraphs in the conceptual network and connected subgraphs in the physical one. It includes the Louvain algorithm applied to the considered case.

The Dual-Network-Analyser algorithm receives two networks as input. The networks are initially merged into a single Weighted Alignment Graph. The Louvain algorithm is used for finding modular communities, while in the case of DCS the Charikar algorithm is used.

□ Acidbio: Assessing and assuring interoperability of a genomics file format

>> https://www.biorxiv.org/content/10.1101/2022.01.07.475366v1.full.pdf

Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, and only rarely do the creators of these tools robustly test them for correct handling of input and output.

Acidbio provides a test system for software that parses the BED format as input. Acidbio unifies correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format.

□ oCEM: Automatic detection and analysis of overlapping co-expressed gene modules

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08072-5

Overlapping CoExpressed gene Module (oCEM) extracts non-Gaussian signatures by ICA: the fastICA algorithm was configured using the parallel extraction method and the default measure of non-Gaussianity, the logcosh approximation of negentropy with α = 1.

optimizeCOM specifies in advance the optimal number of principal components required by the decomposition methods. The processed data are then passed to the function overlapCEM, which renders co-expressed gene modules (i.e., Signatures with their own kurtosis ≥ 3) and Patterns.

□ Transitivity scores to account for triadic edge weight similarity in undirected weighted graphs

>> https://www.biorxiv.org/content/10.1101/2022.01.11.475816v1.full.pdf

The graph transitivity is usually computed for dichotomized networks, therefore focusing on whether triangular relationships are closed or open. But when the connections vary in strength, focusing on whether the closing ties exist or not can be reductive.

The proposed scores weight transitivity according to the similarity between the weights of the three possible links in each triad. They correctly diagnosed excesses of balanced or imbalanced triangles, e.g. strong triplets closed by weak links.

□ Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances

>> https://www.biorxiv.org/content/10.1101/2022.01.11.475870v1.full.pdf

While there is ample computational evidence for the superiority of FracMinHash when compared to the classic MinHash, particularly when comparing sets of different sizes, no theoretical characterization about the accuracy of the FracMinHash approach has yet been given.

FracMinHash can estimate the true containment index better when the sizes of two sets are dissimilar. One particularly attractive feature of FracMinHash is its analytical tractability.
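A FracMinHash sketch keeps every hash value that falls below a fixed fraction of the hash space, so the sketch size grows with the set instead of being capped as in classic MinHash. The sketch below shows the naive containment estimate only; it does not include the debiasing correction or the confidence intervals derived in the paper, and the hash function and names are assumptions made for illustration.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the 64-bit hash space used here


def hash64(item):
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_sketch(items, scale):
    """Keep the hashes that fall in the lowest `scale` fraction of hash space."""
    threshold = int(scale * MAX_HASH)
    return {h for h in map(hash64, items) if h < threshold}


def containment_estimate(sketch_a, sketch_b):
    """Naive estimate of |A ∩ B| / |A| from two sketches."""
    if not sketch_a:
        return 0.0  # convention for an empty sketch in this example
    return len(sketch_a & sketch_b) / len(sketch_a)


# Toy sets of very different sizes, with the small set fully inside the large one.
small_set = [f"kmer{i}" for i in range(50)]
large_set = [f"kmer{i}" for i in range(200)]

# With scale = 1.0 nothing is discarded, so the estimate equals the true value.
exact = containment_estimate(frac_sketch(small_set, 1.0), frac_sketch(large_set, 1.0))

# A smaller scale keeps roughly that fraction of the hashes.
approx = containment_estimate(frac_sketch(small_set, 0.5), frac_sketch(large_set, 0.5))
```

Because both sketches use the same threshold, a hash kept for the small set is also kept for the large set whenever the element is shared, which is what makes the estimate well behaved for sets of dissimilar size.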

□ An accurate method for identifying recent recombinants from unaligned sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac012/6506517

An algorithm to detect recent recombinant sequences from a dataset without a full multiple alignment. This algorithm can handle thousands of gene-length sequences without the need for a reference panel.

This framework builds on the partial alignment results from a jumping hidden Markov model (JHMM): the results are divided into multiple equal-length triples, and a new distance-based procedure identifies the recombinant in each triple.

□ Slinker: Visualising novel splicing events in RNA-Seq data

>> https://f1000research.com/articles/10-1255

Slinker, a bioinformatics pipeline written in Python and Bpipe that uses a data-driven approach to assemble sample-specific superTranscripts.

Slinker uses Stringtie2 to assemble transcripts with any sequence across any gene. This assembly is merged with reference transcripts, converted to a superTranscript, of which rich visualisations are made through Plotly with associated annotation and coverage information.

□ MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476464v1.full.pdf

MeShClust v3.0 is based on the mean shift algorithm, which is an instance of unsupervised learning. The scaled-up MeShClust v3.0 is also an instance of out-of-core learning, in which the learning algorithm is trained on separate batches of the training data consecutively.

MeShClust v3.0 utilizes the k-means clustering algorithm with a k value of 2. To determine the maximum center-member identity score, MeShClust v3.0 reads 10,000 sequences. It calculates all-versus-all identity scores on these sequences using Identity.

□ ONTdeCIPHER: An amplicon-based nanopore sequencing pipeline for tracking pathogen variants

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac043/6515611

ONTdeCIPHER integrates 13 bioinformatics tools, including Seqkit, ARTIC bioinformatics tool, PycoQC, MultiQC, Minimap2, Medaka, Nanopolish, Pangolin (with the model database pangoLEARN), Deeptools (PlotCoverage, BamCoverage), Sniffles, MAFFT, RaxML and snpEff.

While building on the main features of the ARTIC pipeline, the ONTdeCIPHER pipeline incorporates additional useful features such as variant calling, variant annotation, lineage inference, multiple alignments and phylogenetic tree construction.

□ conST: an interpretable multi-modal contrastive learning framework for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476408v1.full.pdf

conST learns low-dimensional embeddings by effectively integrating multi-modal SRT data, i.e. gene expression, spatial information, and morphology.

The GNNExplainer explains which neighboring spots contribute to the prediction that conST makes, which is also biologically consistent with the interaction of the L-R pair identified in CCI.

□ NIMAA: an R/CRAN package to accomplish NomInal data Mining AnAlysis

>> https://www.biorxiv.org/content/10.1101/2022.01.13.475835v1.full.pdf

NIMAA can select a larger sub-matrix with no missing values in a matrix containing missing data, and then use the matrix to generate a bipartite graph and cluster on two projections.

NIMAA provides functions for constructing weighted and unweighted bipartite graphs, analysing the similarity of labels in nominal variables, clustering labels or categories to super-labels, validating clustering results, and predicting bipartite edges by missing weight imputation.

□ Varia: a tool for prediction, analysis and visualisation of variable genes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04573-6

Varia predicts near full-length gene sequences and domain compositions of query genes from database genes sharing short sequence tags. Varia generates output through two complementary pipelines.

Varia_VIP returns all putative gene sequences and domain compositions of the query gene from any partial sequence provided, thereby enabling experimental validation of specific genes of interest and detailed assessment of their putative domain structure.

□ plotsr: Visualising structural similarities and rearrangements between multiple genomes

>> https://www.biorxiv.org/content/10.1101/2022.01.24.477489v1.full.pdf

plotsr generates high-quality visualisation of synteny and structural rearrangements between multiple genomes. For this it uses the genomic structural annotations between multiple chromosome-level assemblies.

plotsr can be used to compare multiple haploid genomes as well as different haplotypes of individual polyploid genomes. In addition, plotsr can mark specific loci as well as plot histogram tracks to show distributions of genomic features along the chromosomes.

□ BioInfograph: An Online Tool to Design and Display Multi-Panel Scientific Figure Interactively

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.784531/full

bioInfograph, a web-based tool that allows users to interactively arrange high-resolution images in diversified formats, mainly Scalable Vector Graphics (SVG), to produce one multi-panel publication-quality composite figure.

bioInfograph solves stylesheet conflicts of coexisting SVG plots, integrates a rich-text editor, and allows creative design by providing advanced functionalities like image transparency, controlled vertical stacking of plots, versatile image formats, and layout templates.

□ Nanopore adaptive sampling: a tool for enrichment of low abundance species in metagenomic samples

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02582-x

A mathematical model which can predict the enrichment levels possible in a metagenomic community given a known relative abundance and read length distribution.

Using a synthetic mock community, the predictions of the model correlate well with observed behaviour and quantify the negative effect on flow cell yields caused by employing adaptive sampling.

The use of adaptive sampling provides us with the benefits of library-based enrichment, without complex protocols or the bias that these may introduce. The repeated ejection of molecules from the pores had less effect on pore stability than has been previously reported.

□ FMSClusterFinder: A new tool for detection and identification of clusters of sequential motifs with varying characteristics inside genomic sequences

>> https://www.biorxiv.org/content/10.1101/2022.01.23.474238v1.full.pdf

FMSClusterFinder, a new algorithm for identification and detection of clusters of sequential blocks inside the DNA and RNA subject sequences. Gene expression and genomic groups' performance is under the control of functional elements cooperating with each other as clusters.

The functional blocks are often comparably short, degenerate and are located within varying distances from each other. Since functional motifs mostly act in relation to each other as clusters, finding such clusters of blocks is to identify functional groups and their structure.

□ An exactly valid and distribution-free statistical significance test for correlations between time series

>> https://www.biorxiv.org/content/10.1101/2022.01.25.477698v1.full.pdf

The truncated time-shift (TTS), a statistical hypothesis test of dependence between two time series which can be used with any correlation function and which is valid as long as one of the time series is stationary.

This is a minimally restrictive requirement among exactly valid nonparametric tests of dependence between time series. This test was able to verify the previously observed dependences between obliquity and deglaciation timing.

□ MM4LMM: Efficient ReML inference in variance component mixed models using a Min-Max algorithm

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009659

A Min-Max (MM) algorithm for ReML inference in Gaussian Variance Component (VC) mixed models. The MM algorithm can be combined with the classical tricks used to accelerate the inference process (e.g. simultaneous orthogonalization or squared iterative acceleration methods).

A limitation for such further developments is the fact that MM methods require the derivation of a specific surrogate function for each class of mixed model to be considered, making the extension of the inference procedure to e.g. auto-regressive or factor analytic models not straightforward.

□ Detecting gene–gene interactions from GWAS using diffusion kernel principal components

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04580-7

This approach employs kernel PCA on a “sandwich” kernel matrix which contains a diffusion kernel as “filling”. The dimensions of the “sandwich” kernel are determined by the available number of individuals in the study.

Interaction information between SNPs allocated to the same gene is used to compute diffusion kernels and graphical within-gene network structures. Data reduction via kernel PCA gives gene summaries that are submitted to an epistasis detection model of choice.

□ Tricycle: Universal prediction of cell-cycle position using transfer learning

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02581-y

Tricycle predicts a cell-specific position in the cell cycle based on the data projection. Tricycle generalizes across datasets and is highly scalable and applicable to atlas-level single-cell RNA-seq data.

Tricycle is a locked-down prediction procedure. There are no tuning parameters, neither explicitly set nor implicitly set through the use of cross-validation or alternatives.
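As a rough illustration of what a position "based on the data projection" can look like, the sketch below takes a cell's two coordinates in a fixed 2D reference embedding and reports the polar angle around the origin, wrapped to the range 0 to 2π. The function name and the example coordinates are hypothetical, and this is not the tricycle package's own interface.

```python
import math

def cell_cycle_angle(pc1: float, pc2: float) -> float:
    """Polar angle of a projected cell around the origin, wrapped to 0..2*pi."""
    return math.atan2(pc2, pc1) % (2 * math.pi)

# Hypothetical projected coordinates for three cells.
cells = [(1.0, 0.0), (0.0, 2.0), (-1.5, -1.5)]
angles = [cell_cycle_angle(x, y) for x, y in cells]
```

Because the angle depends only on the fixed projection, there is nothing to tune per dataset, which matches the "locked-down" description.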

□ MUON: multimodal omics analysis framework

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02577-8

MUON comes with interfaces to multi-omics analysis methods that jointly process multiple modalities, including multi-omics factor analysis (MOFA) to obtain lower-dimensional representations, and weighted nearest neighbours (WNN) to calculate multimodal neighbours.

At the core of MUON is MuData (multimodal data)—an open data structure for multimodal datasets. MuData handles multimodal datasets as containers of unimodal data. MuData provides a coherent structure for storing associated metadata and other side information.

□ ClustAssess: tools for assessing the robustness of single-cell clustering

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478592v1.full.pdf

ClustAssess provides fine-grained information enabling (a) the detection of optimal number of clusters, (b) identification of regions of similarity (and divergence) across methods, (c) a data driven assessment of optimal parameter ranges.

ClustAssess comprises functions for evaluating clustering stability with regard to the number of clusters using the proportion of ambiguous clusterings, and functions for quantifying per-observation agreement between two or more clusterings using element-centric clustering comparison.

□ SeqWho: Reliable, Rapid Determination of Sequence File Identity using k-mer Frequencies in Random Forest Classifiers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac050/6520802

SeqWho, a program designed to assess heuristically the quality of sequencing files and reliably classify the organism and protocol type by using Random Forest classifiers trained on biases native in k-mer frequencies and repeat sequence identities.

While there are some errors in the heuristic assessment of quality, SeqWho remains able to very accurately characterize the file’s quality substantially faster than FASTQC.

□ mSigHdp: hierarchical Dirichlet process mixture modeling for mutational signature discovery

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478587v1.full.pdf

The hierarchical Dirichlet process (HDP) mixture model’s estimate of the number of signatures is influenced by the prior gamma distributions of the Dirichlet-process concentration parameters.

mSigHdp and SigProfilerExtractor had different strengths, with mSigHdp less susceptible to false negatives and SigProfilerExtractor less susceptible to false positives.

Sanctum.

2022-01-31 13:13:03 | Science News

□ Trees, graphs and aggregates: a categorical perspective on combinatorial surface topology, geometry, and algebra

>> https://arxiv.org/pdf/2201.10537v1.pdf

The graph morphisms of Borisov-Manin are adapted to capture all relevant aspects. Their level of sophistication allows the automorphisms to be computed correctly and formalizes the operations of contracting, grafting and merging.

It realizes these graph morphisms as the two–morphisms of a double category in which horizontal composition is graph insertion, while vertical composition is the usual composition restricted to aggregates, where throughout the text an aggregate is a disjoint union of corollas.

□ NCMF: Neural Collective Matrix Factorization for Integrated Analysis of Heterogeneous Biomedical Data

>> https://www.biorxiv.org/content/10.1101/2022.01.20.477057v1.full.pdf

NCMF has a novel architecture that is dynamically constructed based on the number of entities and matrices in the input collection. Through the use of VAE where the decoded output is modeled using Zero-Inflated distributions, NCMF effectively models sparse and noisy inputs.

NCMF has 3 subnetworks: |Q| autoencoders that learn entity representations in each matrix; a Fusion Subnetwork of |E| feedforward networks that fuse multiple encodings; and a Matrix Completion Subnetwork of |X| feedforward networks that reconstruct the input matrices, with |Q| ≤ 2M, |E| ≤ N and |X| = M.

□ CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02583-w

CellPhy, a probabilistic model for the phylogenetic analysis of single-cell diploid genotypes inferred from scDNA-seq experiments. The CellPhy tree shows very high bootstrap values, highlighting the quality of this dataset, which has a strong phylogenetic signal.

CellPhy leverages a finite-site Markov genotype model with all 16 possible phased DNA genotypes—but can work with both phased and unphased data—and can also account for their uncertainty. CellPhy was the most accurate method, under infinite- and finite-site mutation models.
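The state space mentioned above can be enumerated directly. The following is a minimal sketch of the 16 phased genotypes and how they collapse when phase is unknown; the variable and function names are illustrative, and this covers only the states, not CellPhy's likelihood model.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Phased genotypes record which allele sits on which haplotype,
# so A|C and C|A are distinct states: 4 * 4 = 16.
phased = ["".join(pair) for pair in product(NUCLEOTIDES, repeat=2)]

# Unphased genotypes ignore the order, so A/C and C/A collapse: 10 states.
unphased = sorted({"".join(sorted(g)) for g in phased})

def compatible_phased(unphased_genotype: str) -> list[str]:
    """Phased states that an unphased observation could correspond to."""
    key = "".join(sorted(unphased_genotype))
    return [g for g in phased if "".join(sorted(g)) == key]
```

An unphased heterozygous call such as C/A is compatible with two phased states, while a homozygous call maps to exactly one.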

□ Echtvar: Really, truly rapid variant annotation and filtering

>> https://github.com/brentp/echtvar

Echtvar enables rapid annotation of variants with huge population datasets and it supports filtering on those values. It chunks the genome into 1<

□ SSHash: Sparse and Skew Hashing of K-Mers

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476199v1.full.pdf

A compressed and associative dictionary for k-mers, supporting fast Lookup, Access, and streaming queries: a data structure where strings are represented in compact form and each of them is associated to a unique integer identifier in the range [0,n).

SSHash exploits the sparseness and the skew distribution of k-mer minimizers to achieve compact space, while allowing fast lookup queries. SSHash is a read-only data structure, its queries are amenable to parallelism.

The dictionary space is 2N + 5M + z⌈log2(N)⌉ + M⌈log2(z/M)⌉ + p⌈log2(N/p)⌉ + 2p + o(p) + o(M) bits. Instead of paying Θ(k − m + 1) time and O(1) space to compute each minimizer, it is possible to spend O(1) amortized time per minimizer with a global working space of O(k − m + 1).
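The amortized-time trade-off for minimizers can be sketched with a monotone double-ended queue. This is a generic sliding-window minimum, not SSHash's own code: it assumes plain lexicographic order on m-mers (a real implementation would typically order them by a hash), and it counts one m-mer comparison as one unit of work.

```python
from collections import deque

def minimizers_naive(seq: str, k: int, m: int) -> list[str]:
    """Minimizer of every k-mer, rescanning all k - m + 1 m-mers each time."""
    w = k - m + 1
    return [min(seq[j:j + m] for j in range(i, i + w))
            for i in range(len(seq) - k + 1)]

def minimizers_deque(seq: str, k: int, m: int) -> list[str]:
    """Same output, but each m-mer enters and leaves the queue at most once."""
    w = k - m + 1
    window = deque()  # start positions; their m-mers are non-decreasing front to back
    out = []
    for j in range(len(seq) - m + 1):
        mmer = seq[j:j + m]
        # Candidates larger than the new m-mer can never be a minimizer again.
        while window and seq[window[-1]:window[-1] + m] > mmer:
            window.pop()
        window.append(j)
        # Drop the front candidate once it falls outside the current k-mer.
        if window[0] <= j - w:
            window.popleft()
        if j >= w - 1:
            out.append(seq[window[0]:window[0] + m])
    return out

example = "ACGTTGCATGACCGTA"
assert minimizers_deque(example, 7, 3) == minimizers_naive(example, 7, 3)
```

The queue never holds more than k − m + 1 positions, which is the global working space quoted above.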

□ sc-spectrum: Spectral clustering of single-cell multi-omics data on multilayer graphs

>> https://www.biorxiv.org/content/10.1101/2022.01.24.477443v1.full.pdf

Single-Cell Spectral analysis Using Multilayer graphs (sc-spectrum) is a package for clustering cells in multi-omic single-cell sequencing datasets. The package provides an implementation of the Spectral Clustering on Multilayer graphs (SCML) algorithm.

The weighted locally linear (WLL) method is a rigorous multilayer spectral graph theoretic reformulation: a unifying mathematical framework that represents each layer using a Hamiltonian operator and a mixture of its eigenstates to integrate the multiple graph layers.

□ scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476407v1.full.pdf

scSampler is a Python package for fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. For diversity-preserving sampling, scSampler implements the maximin distance design, which makes cells in the subsample as well separated as possible.

scSampler outperforms existing subsampling methods in minimizing the Hausdorff distance between the subsample and the original sample. Moreover, scSampler is fast and scalable for million-level data.
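To make the Hausdorff criterion concrete, here is a small sketch that scores a subsample by the largest distance from any original point to its nearest subsample member. The subsampler shown is a plain greedy farthest-point heuristic swapped in for illustration; it is not scSampler's algorithm, and the toy points are made up.

```python
import math

def hausdorff_to_subsample(points, subsample):
    """Largest distance from any point to its nearest subsample member."""
    return max(min(math.dist(p, s) for s in subsample) for p in points)

def farthest_point_subsample(points, n):
    """Greedy heuristic: repeatedly add the point farthest from those chosen."""
    chosen = [0]  # start from the first point (arbitrary choice)
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < n:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return [points[i] for i in chosen]

# Toy data: a dense group near 0, a small group near 5, one outlier at 10.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0), (5.1, 0.0), (10.0, 0.0)]
diverse = farthest_point_subsample(points, 3)
clumped = points[:3]
```

On this toy set the spread-out subsample leaves every point close to a representative, while the clumped one leaves the outlier far away, so its Hausdorff distance is much larger.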

□ The adapted Activity-By-Contact-model for enhancer-gene assignment and its combination with transcription factor affinities in single cell data

>> https://www.biorxiv.org/content/10.1101/2022.01.28.478202v1.full.pdf

STARE was designed under the assumption that cell type specificity is mainly driven by enhancer activity. It would be sufficient to define candidate enhancers and measure their activity in individual cells, or summarising activity over clusters of cells or cell types.

STARE combines enhancer-gene links called by the ABC-score with a non hit-based TF annotation. STARE is adapted to run on multiple cell types with the same candidate enhancers but varying activity, represented by activity columns.

□ MERIT: controlling Monte-Carlo error rate in large-scale Monte-Carlo hypothesis testing

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476485v1.full.pdf

MERIT (Monte-Carlo Error Rate control In large-scale MC hypothesis Testing), a method for large-scale MC hypothesis testing that also controls the MCER but is more statistically efficient than the GH method.

MERIT aims to maximize detection efficiency by minimizing the number of “undecided” hypotheses at a given MC sample size or by making conclusive decisions for all hypotheses with fewer MC replicates.

□ multipleANOM: Hidden multiplicity in the analysis of variance (ANOVA): multiple contrast tests as an alternative

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476452v1.full.pdf

There is no question that adjusting against hidden multiplicity reveals a conservative behavior relative to standard ANOVA. However, in the mostly non-a priori powered studies, some conservatism is preferable to a massive false positive rate.

multipleANOM allows interpreting not only global factor effects but also local effects between factor levels, as adjusted p-values or simultaneous confidence intervals for selected effect measures in generalized linear models.

□ IPM: Inverse Potts model improves accuracy of phylogenetic profiling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac034/6513380

Ipm is a program for calculating direct information based on the inverse Potts model using the persistent contrastive divergence method.

The IPM is applied to phylogenetic profiling to accurately predict gene functions. The authors use direct information (DI) calculated based on the IPM as the global metric.

□ ATLIGATOR: Editing protein interactions with an atlas-based approach

>> https://www.biorxiv.org/content/10.1101/2022.01.19.476980v1.full.pdf

ATLIGATOR – a computational method to support the analysis and design of a protein’s interaction with a single side chain. It enables the building of interaction atlases based on structures from the PDB.

The ATLIGATOR tool also incorporates association rule learning in the form of frequent itemset mining to extract frequent groups of pairwise interactions based on single ligand residues from the atlas.

□ FILER: a framework for harmonizing and querying large-scale functional genomics knowledge

>> https://academic.oup.com/nargab/article/4/1/lqab123/6507423

FILER (FunctIonaL gEnomics Repository) is a framework for querying large-scale genomics knowledge with a large, curated integrated catalog of harmonized functional genomic and annotation data coupled with a scalable genomic search and querying interface.

FILER already integrates a broad range of genomic data types and biological conditions/tissues/cell types. FILER is highly scalable, with a sub-linear 32-fold increase in querying time when increasing the number of queries 1000-fold from 1000 to 1 000 000 intervals.
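The "sub-linear" claim can be read as a scaling exponent. The arithmetic below assumes, purely for illustration, that querying time grows as a power of the number of intervals; only the 32-fold and 1000-fold figures come from the text.

```python
import math

queries_fold = 1_000_000 / 1_000   # 1000-fold more query intervals
time_fold = 32                     # reported increase in querying time

# If time grows like (number of queries) ** alpha, then
# alpha = log(time_fold) / log(queries_fold); alpha < 1 means sub-linear.
alpha = math.log(time_fold) / math.log(queries_fold)
```

Under that assumption alpha comes out at roughly 0.5, i.e. querying time grows about as the square root of the number of intervals.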

□ WGSUniFrac: Using the UniFrac metric on Whole Genome Shotgun data

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476629v1.full.pdf

WGSUniFrac is a method to overcome the intrinsic difference between WGS and 16S data and compute the UniFrac metric on WGS data by assigning branch lengths to the taxonomic tree obtained from input taxonomic profiles.

A series of experiments demonstrates that this WGSUniFrac method is comparably robust to traditional 16S UniFrac and is not highly sensitive to branch length assignments, be they data-derived or model-prescribed.
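A toy version of the idea: once every branch of a taxonomic tree has a length, a weighted UniFrac-style distance is the sum over branches of length times the difference in abundance sitting below that branch. The taxonomy, the uniform branch length of 1.0 and the two profiles are all hypothetical, and the function is a sketch rather than the WGSUniFrac implementation.

```python
def weighted_unifrac(parent, branch_length, profile_a, profile_b):
    """Sum over branches of length * |abundance difference below the branch|."""
    def push_up(profile):
        mass = {}
        for leaf, abundance in profile.items():
            node = leaf
            while node in parent:  # walk up until the root (which has no parent)
                mass[node] = mass.get(node, 0.0) + abundance
                node = parent[node]
        return mass

    mass_a, mass_b = push_up(profile_a), push_up(profile_b)
    return sum(branch_length[n] * abs(mass_a.get(n, 0.0) - mass_b.get(n, 0.0))
               for n in set(mass_a) | set(mass_b))

# Hypothetical two-level taxonomy: child -> parent.
parent = {"sp1": "phylumA", "sp2": "phylumA", "sp3": "phylumB",
          "phylumA": "root", "phylumB": "root"}
# Assumed assignment: every branch gets length 1.0.
branch_length = {node: 1.0 for node in parent}

sample_a = {"sp1": 0.5, "sp2": 0.5}
sample_b = {"sp1": 0.5, "sp3": 0.5}
distance = weighted_unifrac(parent, branch_length, sample_a, sample_b)
```

Here half of the abundance moves from sp2 to sp3 across four unit-length branches, so the distance is 2.0; changing the assigned lengths rescales the contributions branch by branch.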

□ Sequencing of individual barcoded cDNAs on Pacific Biosciences and Oxford Nanopore reveals platform-specific error patterns

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476636v1.full.pdf

PacBio reads are significantly more accurate and typically capture slightly longer transcript portions than ONT reads. While ONT and PacBio reads from RT pairs often agree on splicing structure, inconsistencies mostly arise from inexact ONT alignments.

The single-reverse-transcription event approach provides a powerful instrument for platform comparisons. In contrast to the comparisons of distinct molecules, this method offers tertium-non-datur reasoning, where disagreements are known to be caused by errors of one of the platforms.

□ SERM: a self-consistent deep learning solution for rapid and accurate gene expression recovery

>> https://www.biorxiv.org/content/10.1101/2022.01.18.476789v1.full.pdf

SERM (self-consistent expression recovery machine), a broadly applicable data-driven gene expression recovery framework to impute the missing gene expression. SERM first learns from a subset of the noisy gene expression data to estimate the underlying data distribution.

SERM then recovers the overall gene expression data by imposing a self-consistency on the gene expression matrix, thus ensuring that the expression levels are similarly distributed in different parts of the matrix.

□ Symbolic Kinetic Models in Python (SKiMpy): Intuitive modeling of large-scale biological kinetic models

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476618v1.full.pdf

SKiMpy, the first open-source implementation of the ORACLE framework to efficiently generate steady-state consistent parameter sets.

SKiMpy enables the user to reconstruct kinetic models for large-scale biochemical reaction systems. SKiMpy represents a method development platform to analyze cell dynamics and physiology on a large scale.

□ ScrepYard: an online resource for disulfide-stabilised tandem repeat peptides

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476686v1.full.pdf

ScrepYard is designed to assist researchers in identification of SCREP sequences of interest and to aid in characterizing this emerging class of biomolecules.

ScrepYard reveals two-domain tandem repeats constitute the most abundant SCREP domain architecture, while the interdomain “linker” regions connecting the ordered domains are found to be abundant in amino acids with short or polar sidechains.

□ scSeqComm: Identify, quantify and characterize cellular communication from single cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac036/6511439

scSeqComm, a computational method to identify and quantify the evidence of ongoing intercellular and intracellular signaling from scRNA-seq data, and at the same time providing a functional characterization of the inferred cellular communication.

The possibility to quantify the evidence of ongoing communication assists the prioritization of the results, while the combined evidence of both intercellular and intracellular signaling increases the reliability of inferred communication.

□ scFeatures: Multi-view representations of single-cell and spatial data for disease outcome prediction

>> https://www.biorxiv.org/content/10.1101/2022.01.20.476845v1.full.pdf

scFeatures, a tool that generates a large collection of interpretable molecular representations for individual samples in single-cell omics data, which can be readily used by any machine learning algorithms to perform disease outcome prediction and drive biological discovery.

The feature vector generated by scFeatures can be used for a broader set of downstream applications and is not limited to the ones illustrated in the case studies. The feature vector can be subjected to latent class analysis, which has typically been applied at the single-cell level for exploring cellular diversity.

□ FISH: Fine-grained Hashing with Double Filtering

>> https://ieeexplore.ieee.org/document/9695302/

The double-filtering mechanism consists of two modules, i.e., the Space Filtering module and the Feature Filtering module, which address the fine-grained feature extraction and feature refinement issues, respectively.

The proxy-based loss is adopted to train the model by preserving similarity relationships between data instances and proxy-vectors of each class rather than other data instances, further making FISH more efficient and effective.

The Space Filtering module is designed to highlight the critical regions in images and help the model to capture more subtle and discriminative details; the Feature Filtering module is the key of FISH and aims to further refine extracted features by supervised re-weighting.

□ Simon Barnett

>> https://patentimages.storage.googleapis.com/e5/1a/be/635c1b98feac24/WO2021168155A1.pdf

PacBio recently has been hinting at multi-chip instruments and new "core technology". The company's recent '631 patent features more breadcrumbs about what this may look like.

□ NIC: Network-based integrative analysis of single-cell transcriptomic and epigenomic data for cell types

>> https://pubmed.ncbi.nlm.nih.gov/35043143/

NIC automatically learns the cell–cell similarity graphs, which transforms the fusion of multi-omics data into the analysis of multiple networks.

NIC employs joint non-negative matrix factorization to learn the shared features of cells by exploiting the structure of learned cell–cell similarity networks, providing a better way to characterize the features of cells.

□ how_are_we_stranded_here: quick determination of RNA-Seq strandedness

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04572-7

how_are_we_stranded_here runs a series of commands to determine read orientation. A kallisto index of the organism's transcriptome is created using transcript fasta sequences, and a GTF which contains the locations and strands for the corresponding transcript sequences.

Next, input fastq files are sampled to a default of 200,000 reads. These reads are then mapped to the transcriptome and, using kallisto's --genomebam argument, are pseudoaligned into a genome-sorted BAM file.

Finally, RSeQC’s infer_experiment.py is used to determine the direction of reads from the first and second pairs relative to the mapped transcript, and estimate the number of reads explained by each of the two layouts (FR or RF), and those unable to be explained by either.
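The final decision step can be sketched as a small function that takes the two "explained by" fractions and returns a call. The function name and the 0.9 / 0.6 cut-offs are illustrative assumptions chosen for this sketch, not values stated above, and the input numbers are made up.

```python
def call_strandedness(frac_fr: float, frac_rf: float,
                      stranded_cutoff: float = 0.9,
                      unstranded_cutoff: float = 0.6) -> str:
    """Classify a library from the fractions of reads explained by FR and RF."""
    explained = frac_fr + frac_rf
    if explained == 0:
        return "undetermined"
    # Ignore reads that neither layout explains when comparing the two.
    fr_share = frac_fr / explained
    rf_share = frac_rf / explained
    if fr_share > stranded_cutoff:
        return "FR"
    if rf_share > stranded_cutoff:
        return "RF"
    if max(fr_share, rf_share) < unstranded_cutoff:
        return "unstranded"
    return "undetermined"

# Hypothetical fractions for three libraries.
calls = [call_strandedness(0.02, 0.95),
         call_strandedness(0.48, 0.47),
         call_strandedness(0.70, 0.25)]
```

A library where one layout explains nearly all reads is called stranded in that direction, a near even split is called unstranded, and anything in between is left undetermined.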

□ JAX-CNV: A Whole Genome Sequencing-based Algorithm for Copy Number Detection at Clinical Grade Level

>> https://www.sciencedirect.com/science/article/pii/S1672022922000055

JAX-CNV, a newly developed WGS-based CNV calling algorithm. An evaluation of its performance was performed on WGS data from 31 patient samples and compared to callsets of the clinically validated CMA at the Jackson Laboratory for Genomic Medicine (JAX-GM).

JAX-CNV has high sensitivity (100%) necessary for diagnostic decisions and a low false discovery rate (4%). This algorithm could serve as a basis for the use of WGS, as a replacement for array-based clinical genetic testing.

□ Optimus: a general purpose adaptive optimisation engine in R

>> https://www.biorxiv.org/content/10.1101/2022.01.18.476810v1.full.pdf

Optimus recovers the rate constants for a system of coupled ordinary differential equations (ODEs) modelling a biological pathway.

Optimus features acceptance ratio simulated annealing, acceptance ratio replica exchange, and adaptive thermoregulation, thus driving a Monte Carlo optimisation process through constrained acceptance frequency but unconstrained adaptive pseudo temperature regimens.

□ SuperAtomicCharge: Out-of-the-box deep learning prediction of quantum-mechanical partial charges by graph representation and transfer learning

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab597/6513729

SuperAtomicCharge, a data-driven deep graph learning framework, was proposed to predict three important types of partial charges (i.e. RESP, DDEC4 and DDEC78) derived from high-level QM calculations based on the structures of molecules.

SuperAtomicCharge was designed to simultaneously exploit the 2D/3D structural information of molecules. A simple transfer learning strategy and a multitask learning strategy based on self-supervised descriptors were also employed to further improve the prediction accuracy.

□ GMQN: A Reference-Based Method for Correcting Batch Effects and Probe Bias in HumanMethylation BeadChip

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.810985/full

GMQN removes unwanted technical variations at signal intensity level between samples for 450K / 850K DNA methylation arrays. It can also be easily combined with Subset-quantile Within Array Normalization (SWAN) or Beta-Mixture Quantile (BMIQ) Normalisation to remove probe design bias.

GMQN fits a two-state Gaussian mixture model to the input Infinium I probe signal intensity. It then transforms the probability of Infinium I probes from each component of the input data to quantiles using the inverse of the cumulative Gaussian distribution.
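The probability-to-quantile step can be sketched with the standard library's normal distribution. The sketch assumes the mixture has already been fitted and that each probe is assigned to a single component; the component parameters below are invented, and mapping onto a reference component is how this example interprets "reference-based".

```python
from statistics import NormalDist

def map_to_reference(value: float,
                     input_component: NormalDist,
                     reference_component: NormalDist) -> float:
    """Probability under the input component, then quantile under the reference."""
    p = input_component.cdf(value)
    # inv_cdf raises StatisticsError if p is exactly 0 or 1 (extreme intensities).
    return reference_component.inv_cdf(p)

# Hypothetical high-intensity components (mean, standard deviation).
input_high = NormalDist(1200, 300)
reference_high = NormalDist(1000, 250)

mapped = map_to_reference(1500, input_high, reference_high)
```

An intensity one standard deviation above the input component's mean lands one standard deviation above the reference component's mean, which is 1250 in this made-up example.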

After reversing the batch effect, GMQN can also normalize Infinium II probes on the basis of Infinium I probes in combination with BMIQ and SWAN, the two well-known normalization methods on β-values of DNA methylation.

□ Iam hiQ—a novel pair of accuracy indices for imputed genotypes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04568-3

Iam hiQ, an independent pair of accuracy measures that can be applied to dosage files, the output of all imputation software. Iam (imputation accuracy measure) quantifies the average amount of individual-specific versus population-specific genotype information in a linear manner.

Both measures can be used to identify markers or regions in which population-specific genetic information conceals individual-specific information and which are therefore less informative for e.g. association testing.

□ Statistics or biology: the zero-inflation controversy about scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02601-5

Zero measurements in scRNA-seq data have two sources: biological and non-biological. While biological zeros carry meaningful information about cell states, non-biological zeros represent missing values artificially introduced during the generation of scRNA-seq data.

Non-biological zeros include technical zeros, which occur during the preparation of biological samples for sequencing, and sampling zeros, which arise due to limited sequencing depths.

□ DisEnrich: Database of Enriched Regions in Human Dark Proteome

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac051/6517502

DisEnrich is a database of human proteome IDRs that are significantly enriched in particular amino acids. Each human protein is described using gene ontology (GO) function terms and disorder prediction for the full-length sequence.

Analysis of IDP distribution in broad functional categories based on DisEnrich disordered consensus revealed that disorder is closely related to regulation and signaling, rather than metabolic and enzymatic activities.

□ CBMOS: a GPU-enabled Python framework for the numerical study of center-based models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04575-4

CBMOS, a framework designed explicitly for the numerical study of center-based models in two and three space dimensions.

Its additional computational cost due to requiring a linear solve remains too high even when approximating the Jacobian and using as few iterations as possible.

The CBMOS code is event-driven, meaning that cell events are queued according to their execution time and the mechanical equations for the center positions are solved in between the execution of cell events.
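The event-driven structure can be sketched in a few lines: a priority queue of cell events ordered by execution time, with the mechanical equations integrated between consecutive events. Everything concrete here is a simplifying assumption for illustration (one space dimension, a linear repulsive spring, forward Euler, a fixed cycle time) and none of it is CBMOS's API.

```python
import heapq

REST_LENGTH = 1.0   # hypothetical rest distance between cell centers
SPRING = 1.0        # hypothetical linear spring constant

def forces(x):
    """Overdamped 1D forces: centers closer than REST_LENGTH repel each other."""
    f = [0.0] * len(x)
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            gap = x[j] - x[i]
            overlap = REST_LENGTH - abs(gap)
            if overlap > 0:
                direction = 1.0 if gap >= 0 else -1.0
                f[i] -= SPRING * overlap * direction
                f[j] += SPRING * overlap * direction
    return f

def relax(x, t_start, t_end, dt=0.01):
    """Forward Euler on dx/dt = F(x) from t_start to t_end."""
    t = t_start
    while t < t_end - 1e-12:
        step = min(dt, t_end - t)
        x = [xi + step * fi for xi, fi in zip(x, forces(x))]
        t += step
    return x

def simulate(t_final, cycle_time=1.0):
    x = [0.0]                              # start from a single cell
    events = [(cycle_time, 0)]             # (execution time, dividing cell index)
    t = 0.0
    while events and events[0][0] <= t_final:
        t_event, cell = heapq.heappop(events)
        x = relax(x, t, t_event)           # mechanics between events
        t = t_event
        x.append(x[cell] + 0.1)            # daughter placed next to the mother
        heapq.heappush(events, (t + cycle_time, cell))
        heapq.heappush(events, (t + cycle_time, len(x) - 1))
    return relax(x, t, t_final)            # mechanics after the last event

positions = simulate(2.5)
```

With a cycle time of 1.0, the run to t = 2.5 executes one division at t = 1 and two at t = 2, and the centers drift apart in between.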

□ Flexible seed size enables ultra-fast and accurate read alignment

>> https://www.biorxiv.org/content/10.1101/2021.06.18.449070v3.full.pdf

A novel seeding approach for constructing dynamic-sized fuzzy seeds. Syncmers and strobemers can be combined in what becomes a high-speed indexing method, roughly corresponding to the speed of computing minimizers.

This technique is based on first subsampling k-mers from the reference sequences by computing canonical open syncmers, then producing strobemers formed from linking together syncmers occurring close-by on the reference using the randstrobe method.
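The first subsampling step can be illustrated with open syncmers: a k-mer is kept when its smallest s-mer starts at a fixed offset t. The sketch below uses plain lexicographic order instead of a hash, does not handle canonical (reverse-complement) k-mers, and leaves out the randstrobe linking step; the parameter values are arbitrary.

```python
def open_syncmers(seq: str, k: int, s: int, t: int = 0) -> list[tuple[int, str]]:
    """(position, k-mer) pairs whose smallest s-mer starts at offset t."""
    hits = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # index() returns the leftmost minimum when there are ties.
        if smers.index(min(smers)) == t:
            hits.append((i, kmer))
    return hits

selected = open_syncmers("ACGTAC", k=4, s=2, t=0)
```

For the toy sequence, ACGT and CGTA are selected because their smallest 2-mer sits at the start, while GTAC is skipped because its smallest 2-mer (AC) is at the end.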

□ MCRWR: a new method to measure the similarity of documents based on semantic network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04578-1

Besides Boolean retrieval with medical subject headings (MeSH), PubMed provides users with an alternative way called “Related Articles” to access and collect relevant documents based on semantic similarity.

The MeSH-concept random walk with restart algorithm (MCRWR) performs better in constructing an article semantic similarity network. Semantic similarity between two articles was computed according to the feature vectors generated from the MeSH-concept similarity network by the RWR algorithm.
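Random walk with restart itself is a short iteration. The sketch below runs it on a tiny made-up concept graph and compares the resulting vectors by cosine similarity; the graph, the restart probability of 0.3 and the use of cosine similarity are assumptions for illustration, not details of MCRWR.

```python
import math

def random_walk_with_restart(adjacency, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p = (1 - restart) * W p + restart * e until p stops changing."""
    n = len(adjacency)
    # Column-normalise: column j holds the transition probabilities out of node j.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0
          for j in range(n)] for i in range(n)]
    e = [1.0 if i == seed else 0.0 for i in range(n)]
    p = e[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * e[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical concept graph: a path 0 - 1 - 2.
graph = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
from_0 = random_walk_with_restart(graph, seed=0)
from_2 = random_walk_with_restart(graph, seed=2)
similarity = cosine(from_0, from_2)
```

Each vector is a probability distribution over concepts that concentrates near its seed, so two seeds that are far apart in the graph yield less similar vectors than two seeds that are close together.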

□ MetaLogo: a heterogeneity-aware sequence logo generator and aligner

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab591/6519790

MetaLogo can automatically cluster the input sequences after multiple sequence alignment and phylogenetic tree construction, and then output sequence logos for multiple groups and align them in one figure.

MetaLogo can perform pairwise and global sequence logos alignment to highlight the sequence pattern dynamics across different sequence groups. MetaLogo provides basic statistical analysis to additionally reveal the sequence convergences and divergences.
