Blog posts for February 14, 2022 - lens, align.

Intimacy.

2022-02-14 22:12:24 | Science News

□ Numbat: Haplotype-enhanced inference of somatic copy number profiles from single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2022.02.07.479314v1.full.pdf

Numbat integrates expression, allele, and haplotype information derived from population-based phasing to comprehensively characterize the CNV landscape. A hidden Markov model integrates expression deviation and haplotype imbalance signals to detect CNVs in cell population pseudobulks.

Numbat employs an iterative approach to reconstruct the subclonal phylogeny and single-cell copy number profile. Numbat identifies distinct subclonal lineages that harbor haplotype-specific alterations. It does not require sample-matched DNA data or a priori genotyping.

□ ClonoCluster: a method for using clonal origin to inform transcriptome clustering

>> https://www.biorxiv.org/content/10.1101/2022.02.11.480077v1.full.pdf

ClonoCluster is a computational method that combines clone and transcriptome information to create hybrid clusters, weighting the two kinds of data with a tunable parameter called the Warp Factor.

Warp Factor incorporates clonality information into the dimensionality reduction step prior to the commonly-used UMAP algorithm for visualizing high dimensional datasets. Individual clone clusters formed distinct spatial clusters in UMAP space.

□ Optimal Evaluation of Symmetry-Adapted n-Correlations Via Recursive Contraction of Sparse Symmetric Tensors

>> https://arxiv.org/pdf/2202.04140v1.pdf

A comprehensive analysis of an algorithm for evaluating high-dimensional polynomials. The key bottleneck is the contraction of a high-dimensional symmetric and sparse tensor with a specific sparsity pattern that is directly related to the symmetries imposed on the polynomial.

The key step is to understand the insertion of so-called “auxiliary nodes” into this graph, which represent intermediate computational steps. The authors give an explicit construction of a recursive evaluation strategy and show that it is optimal in the limit of infinite polynomial degree.

□ n-Best Kernel Approximation in Reproducing Kernel Hilbert Spaces

>> https://arxiv.org/pdf/2201.07228v1.pdf

Using the maximum modulus principle for holomorphic functions, the authors prove the existence of an n-best kernel approximation for a wide class of reproducing kernel Hilbert spaces of holomorphic functions in the unit disc.

The proof is concise and applies to a large class of reproducing kernel Hilbert spaces, one that is strictly larger than the class of weighted Bergman spaces and includes all the weighted Hardy spaces.

□ LSCON: Fast and accurate gene regulatory network inference by normalized least squares regression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac103/6530276

LSCON (Least Squares Cut-Off with Normalization) extends the LSCO algorithm by regularization to avoid hyper-connected genes and thereby reduce false positives.

LSCON performed similarly to the LASSO algorithm in correctness, while outperforming LSCO, RidgeCO, and Genie3 on data with infinitesimal fold change values. LSCON was found to be about 1000 times faster than Genie3.

□ polishCLR: a Nextflow workflow for polishing PacBio CLR genome assemblies

>> https://www.biorxiv.org/content/10.1101/2022.02.10.480011v1.full.pdf

polishCLR is a reproducible Nextflow workflow that implements best practices for polishing assemblies made from Continuous Long Reads (CLR) data.

polishCLR provides re-entry points throughout several key processes, including identifying duplicate haplotypes with purge_dups, allowing a break for scaffolding if data are available, and running multiple rounds of polishing and evaluation with Arrow and FreeBayes.

□ NEN: Single-cell RNA sequencing data analysis based on non-uniform ε-neighborhood network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac114/6533440

The non-uniform ε-neighborhood network (NEN) combines the advantages of both k-nearest neighbors (KNN) and the ε-neighborhood (EN) to represent the manifold on which data points reside in gene space.

From such a network, NEN then uses the layout, the community structure and the shortest paths to perform scRNA-seq data visualization, clustering and trajectory inference.

□ Anc2vec: embedding gene ontology terms by preserving ancestors relationships

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac003/6523148

Significant performance improvements have been observed when the vector representations are used on diverse downstream tasks, such as the measurement of semantic similarity. However, existing embeddings of GO terms still fail to capture crucial structural features.

anc2vec is a novel protocol based on neural networks for constructing vector representations of Gene Ontology terms. These embeddings are built to preserve three structural features: the ontological uniqueness of terms, their ancestor relationships and the sub-ontology to which they belong.

□ scvi-tools: A Python library for probabilistic analysis of single-cell omics data

>> https://www.nature.com/articles/s41587-021-01206-w

scvi-tools offers standardized access to methods for many single-cell data analysis tasks, such as integration (scVI, scArches), annotation (CellAssign, scANVI), deconvolution of bulk spatial transcriptomics (Stereoscope), doublet detection (Solo) and multi-modal analysis (totalVI).

scvi-tools also offers a set of building blocks for efficient model development. Those elements are organized into a class that inherits from the abstract class BaseModuleClass. The reimplementation of Stereoscope demonstrates a substantial reduction in code complexity.

□ MegaGate: A toxin-less gateway molecular cloning tool

>> https://star-protocols.cell.com/protocols/1120

MegaGate is an enabling technology for use in cDNA screening and cell engineering for mammalian systems. MegaGate eliminates the ccdB toxin used in Gateway recombinase cloning and instead utilizes meganuclease-mediated digestion to eliminate background vectors during cloning.

MegaDestination vectors can optionally feature unique DNA barcodes that can be captured through gDNA sequencing. If a plasmid does not contain a gene of interest, it retains the meganuclease recognition cassette, which is digested by the meganucleases in the MegaGate reaction mix.

□ Syllable-PBWT for space-efficient haplotype long-match query

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478234v1.full.pdf

Syllable-PBWT is a space-efficient variation of the positional Burrows-Wheeler transform that divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages a polynomial rolling hash function.

The Syllable-Query algorithm finds long matches between a query haplotype and the panel. Syllable-Query is significantly faster than the full memory algorithm. After reading in the query haplotype in O(N) time, these sequences require O(nβ log M) time to compute.
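
The syllable and rolling-hash idea can be sketched as follows. This is only an illustration of the general technique, not the Syllable-PBWT implementation: the modulus, the base, the syllable width and all function names are assumptions made for the example.

```python
# Minimal sketch: split a haplotype into syllables and hash syllable ranges.
# MOD, BASE and the function names are illustrative assumptions,
# not values or identifiers taken from the paper.

MOD = (1 << 61) - 1   # large Mersenne prime modulus
BASE = 1_000_003      # arbitrary hash base


def to_syllables(haplotype: str, width: int) -> list[int]:
    """Pack each run of `width` biallelic sites ('0'/'1') into one integer.

    A short final syllable is right-padded with '0'.
    """
    return [
        int(haplotype[i:i + width].ljust(width, "0"), 2)
        for i in range(0, len(haplotype), width)
    ]


def prefix_hashes(syllables: list[int]) -> list[int]:
    """h[k] is the polynomial hash of the first k syllables."""
    h = [0]
    for s in syllables:
        # +1 so that a syllable of all zeros still contributes to the hash
        h.append((h[-1] * BASE + s + 1) % MOD)
    return h


def range_hash(h: list[int], start: int, end: int) -> int:
    """Hash of syllables[start:end], derived from the prefix hashes."""
    return (h[end] - h[start] * pow(BASE, end - start, MOD)) % MOD


if __name__ == "__main__":
    a = prefix_hashes(to_syllables("011010100011", 4))
    b = prefix_hashes(to_syllables("111110100011", 4))
    # The two haplotypes differ in syllable 0 but agree on syllables 1 and 2.
    print(range_hash(a, 1, 3) == range_hash(b, 1, 3))
```

Comparing range hashes lets a long stretch of sites be tested for equality in constant time once the prefix hashes exist, at the cost of a small collision probability.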

□ mcPBWT: Space-efficient Multi-column PBWT Scanning Algorithm for Composite Haplotype Matching

>> https://www.biorxiv.org/content/10.1101/2022.02.02.478879v1.full.pdf

mcPBWT (multi-column PBWT) uses multiple synchronized runs of PBWT at different variant sites providing a “look-ahead” information of matches at those variant sites. Such “look-ahead” information allows us to analyze multiple contiguous matching pairs in a single pass.

This supports triangulating the genealogical relationship among individuals carrying these matching segments. Double-PBWT finds combinations of two matching pairs that are representative of phasing errors, while triple-PBWT finds combinations of three matching pairs that are representative of gene-conversion tracts.

□ gwfa: Proof-of-concept implementation of GWFA for sequence-to-graph alignment

>> https://github.com/lh3/gwfa

GWFA (Graph WaveFront Alignment) is an algorithm to align a sequence against a sequence graph. It adapts the WFA algorithm for graphs. The repository is a proof-of-concept implementation of GWFA that computes the edit distance between a graph and a sequence without backtracing.

The GWFA algorithm assumes the start of the sequence to be aligned with the start of the first segment in the graph and requires the query sequence to be fully aligned. GWFA is optimized for graphs consisting of long segments.
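
To make the wavefront idea concrete, here is a minimal sketch of the plain WFA recurrence for the edit distance between two linear sequences. It is not the gwfa code and does not handle graphs; the function name and data layout are assumptions for illustration. Like the implementation described above, it returns only the distance and does no backtracing.

```python
def wfa_edit_distance(a: str, b: str) -> int:
    """Global edit distance between a and b using wavefronts.

    For each score s, the wavefront stores, per diagonal k = j - i,
    the furthest offset i in `a` reachable with exactly s edits.
    """
    n, m = len(a), len(b)

    def extend(k: int, i: int) -> int:
        # Slide along diagonal k for free while characters match.
        j = i + k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    wavefront = {0: extend(0, 0)}
    score = 0
    # Done when the final diagonal (m - n) has consumed all of `a`.
    while wavefront.get(m - n, -1) < n:
        score += 1
        nxt = {}
        for k in range(min(wavefront) - 1, max(wavefront) + 2):
            candidates = []
            if k in wavefront:          # mismatch: consume a[i] and b[j]
                candidates.append(wavefront[k] + 1)
            if k + 1 in wavefront:      # deletion: consume a[i] only
                candidates.append(wavefront[k + 1] + 1)
            if k - 1 in wavefront:      # insertion: consume b[j] only
                candidates.append(wavefront[k - 1])
            # Keep only cells that stay inside the alignment matrix.
            valid = [i for i in candidates if 0 <= i <= n and 0 <= i + k <= m]
            if valid:
                nxt[k] = extend(k, max(valid))
        wavefront = nxt
    return score


if __name__ == "__main__":
    print(wfa_edit_distance("kitten", "sitting"))
```

The work done grows with the edit distance rather than with the full size of the dynamic programming matrix, which is why the approach suits highly similar sequences.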

□ Dynamo: Mapping transcriptomic vector fields of single cells

>> https://www.cell.com/cell/fulltext/S0092-8674(21)01577-4

Dynamo infers absolute RNA velocity, reconstructs continuous vector fields that predict cell fates, employs differential geometry to extract underlying regulations, and ultimately predicts optimal reprogramming paths.

Dynamo calculates RNA acceleration, curvature, divergence and RNA Jacobian. Dynamo uses Least Action Paths (LAPs) and in silico perturbation. Dynamo makes it possible to use single-cell data to directly explore governing regulatory mechanisms and even recover kinetic parameters.

□ Transformation of alignment files improves performance of variant callers for long-read RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.02.08.479579v1.full.pdf

DeepVariant, Clair3 and NanoCaller use a deep learning approach in which variants are detected by analysis of read-alignment images; Clair3 uses a pileup model to call most variants, and a more computationally-intensive full-alignment model to handle more complex variants.

Clair3-mix and SNCTR+flagCorrection+DeepVariant are among the best-performing pipelines to call indels, the former having higher recall and the latter higher precision.

□ SeCNV: Resolving single-cell copy number profiling for large datasets

>> https://www.biorxiv.org/content/10.1101/2022.02.09.479672v1.full.pdf

SeCNV is a novel method that leverages structural entropy to profile copy numbers. SeCNV adopts a local Gaussian kernel to construct a matrix, the depth congruent map, which captures the similarities between any two bins along the genome.

SeCNV partitions the genome into segments by minimizing the structural entropy from the depth congruent map. With the partition, SeCNV estimates the copy numbers within each segment for cells.

□ WAFNRLTG: A Novel Model for Predicting LncRNA Target Genes Based on Weighted Average Fusion Network Representation Learning Method

>> https://www.frontiersin.org/articles/10.3389/fcell.2021.820342/full

WAFNRLTG constructs a heterogeneous network that integrates two similarity networks and three interaction networks. Next, a network representation learning method is used to obtain the representation vectors of lncRNA and mRNA nodes.

The representation vectors of lncRNAs and mRNAs are merged to form lncRNA-gene pairs, and an XGBoost classifier is built on the merged representations of these lncRNA-gene pairs.

□ CELESTA: Identification of cell types in multiplexed in situ images by combining protein expression and spatial information

>> https://www.biorxiv.org/content/10.1101/2022.02.02.478888v1.full.pdf

CELESTA (CELl typE identification with SpaTiAl information) incorporates both a cell’s protein expression profile and its spatial information, with minimal to no user-dependence, to produce relatively fast cell type assignments.

CELESTA defines an energy function using the Potts model to leverage the cell type information of each cell's N spatially nearest neighbors in a probabilistic manner. CELESTA represents each index cell as a node in an undirected graph, with edges connecting it to its N spatially nearest neighbors.

CELESTA associates each node with a hidden state, where the hidden state is the cell type to be inferred, and assumes that the joint distribution of the hidden states satisfies a discrete Markov random field.
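
As a rough illustration of a Potts-style energy on a neighbor graph, consider the sketch below. It covers only the generic spatial term; how CELESTA combines it with protein expression is not described here, and the function names, the `beta` coupling parameter and the toy labels are all assumptions.

```python
import math
from collections import Counter


def potts_energy(labels, edges, beta=1.0):
    """Generic Potts energy: each edge joining same-type cells lowers the energy."""
    return -beta * sum(1 for i, j in edges if labels[i] == labels[j])


def neighbor_conditional(cell, labels, neighbors, cell_types, beta=1.0):
    """P(type of `cell` | neighbor types) under the Potts energy above.

    The conditional is proportional to exp(beta * number of neighbors
    that already carry the candidate type).
    """
    votes = Counter(labels[j] for j in neighbors[cell])
    weights = {t: math.exp(beta * votes[t]) for t in cell_types}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}


if __name__ == "__main__":
    labels = ["T", "T", "B", "B"]              # hypothetical cell types
    edges = [(0, 1), (1, 2), (2, 3)]           # a chain of four cells
    neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(potts_energy(labels, edges))
    print(neighbor_conditional(0, labels, neighbors, ["T", "B"]))
```

A larger `beta` makes a cell's assignment depend more strongly on its neighbors, while `beta = 0` ignores spatial context entirely.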

□ Vivarium: an interface and engine for integrative multiscale modeling in computational biology

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac049/6522109

Vivarium can apply to any type of dynamic model – ordinary differential equations (ODEs), stochastic processes, Boolean networks, spatial models, and more – and allows users to plug these models together in integrative, multiscale representations.

Vivarium's modular interface makes individual simulation tools into modules that can be wired together in composite multi-scale models, parallelized across multiple CPUs, and run with Vivarium's discrete-event simulation engine.

□ RaPID-Query for Fast Identity by Descent Search and Genealogical Analysis

>> https://www.biorxiv.org/content/10.1101/2022.02.03.478907v1.full.pdf

The RaPID-Query (Random Projection-based Identical-by-descent Detection Query) method identifies IBD segments between a query haplotype and a panel of haplotypes. RaPID-Query locates IBD segments quickly with a given cutoff length while allowing mismatched sites in IBD segments.

RaPID-Query uses x-PBWT-Query, an extended PBWT query algorithm based on a single-sweep long match query algorithm. It eliminates the redundant steps of evaluating the divergence values of the haplotypes if they are already in the set-maximal match block.

□ Dysgu: efficient structural variant calling using short or long reads

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac039/6517943

dysgu can rapidly call SVs from PE or LR data, across all size categories. Conceptually, dysgu identifies SVs from alignment CIGAR information as well as discordant and split-read mappings.

dysgu employs a fast consensus sequence algorithm, inspired by the positional de Bruijn graph, followed by remapping of anomalous sequences to discover additional small SVs.

□ The SAMBA tool uses long reads to improve the contiguity of genome assemblies

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009860

Several previously developed tools also allow for scaffolding with long reads. These include AHA, which is part of the SMRT software analysis suite, SSPACE-LongRead, and LINKS; however, none of these tools utilize the consensus of the long reads to fill gaps in the scaffolds.

SAMBA (Scaffolding Assemblies with Multiple Big Alignments) is designed to scaffold and gap-fill genome assemblies with long-read data, resulting in substantially greater contiguity. SAMBA fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs.

□ sgcocaller and comapr: personalised haplotype assembly and comparative crossover map analysis using single gametes

>> https://www.biorxiv.org/content/10.1101/2022.02.10.479822v1.full.pdf

An efficient software toolset, written in modern programming languages, for the common tasks of haplotyping haploid gamete genomes and calling crossovers (sgcocaller), and for constructing and visualising individualised crossover landscapes (comapr) from single gametes.

sgcocaller xo implements a two-state Hidden Markov Model and adopts binomial distributions for modelling the emission probabilities of the observed allele read counts. The Viterbi algorithm is applied to infer the most probable hidden state sequence for the list of hetSNPs.
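
A two-state HMM with binomial emissions decoded by Viterbi can be sketched as below. This is not sgcocaller's code: the emission parameters `theta`, the switch probability `p_switch`, the uniform start distribution and the input layout of (ref, alt) read counts per hetSNP are all assumptions chosen for the example.

```python
import math


def log_binom(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k successes in n trials."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)


def viterbi_two_state(counts, theta=(0.1, 0.9), p_switch=0.01):
    """Most probable haplotype state (0 or 1) for each hetSNP.

    counts   : list of (ref_reads, alt_reads) per hetSNP, in genomic order
    theta    : assumed probability of an alt read in state 0 and in state 1
    p_switch : assumed probability of a state change between adjacent SNPs
    """
    if not counts:
        return []
    log_stay, log_switch = math.log(1 - p_switch), math.log(p_switch)

    ref, alt = counts[0]
    score = [math.log(0.5) + log_binom(alt, ref + alt, theta[s]) for s in (0, 1)]
    backpointers = []

    for ref, alt in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cand = [score[p] + (log_stay if p == s else log_switch) for p in (0, 1)]
            best = 0 if cand[0] >= cand[1] else 1
            new_score.append(cand[best] + log_binom(alt, ref + alt, theta[s]))
            pointers.append(best)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    reads = [(3, 0), (2, 0), (4, 0), (0, 3), (0, 2), (0, 4)]
    print(viterbi_two_state(reads))
```

A position where the decoded state changes would be read as a candidate crossover; the small switch probability keeps a single noisy SNP from producing one.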

□ UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization

>> https://www.nature.com/articles/s41467-022-28431-4

UINMF can integrate data types such as scRNA-seq and snATAC-seq using both gene-centric features and intergenic information. UINMF fully utilizes the available data when estimating metagenes and matrix factors, significantly improving sensitivity for resolving cellular distinctions.

UINMF can integrate targeted spatial transcriptomic data with simultaneous single-cell RNA and chromatin accessibility measurements using both unshared epigenomic information and unshared genes.

The UINMF optimization algorithm has a reduced computational complexity per iteration compared to iNMF algorithm on a dataset of the same size. UINMF, as well as iNMF, requires random initializations, and is nondeterministic in nature.

□ SmMIP-tools: a computational toolset for processing and analysis of single-molecule molecular inversion probes derived data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac081/6527628

SmMIP-tools, a toolset for single-molecule molecular inversion probes (smMIPs), is specifically tailored to address the high error rates associated with amplicon-based sequencing and to support the implementation of cost-effective, molecular inversion probe-based NGS.

By linking each sequence read to its probe-of-origin, SmMIP-tools can identify and filter error-prone reads, such as chimeric reads or those derived from self-annealing probes, that are uniquely associated with smMIP-based sequencing.

□ INTEGRATE: Model-based multi-omics data integration to characterize multi-level metabolic regulation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009337

INTEGRATE is a computational pipeline that integrates metabolomics and transcriptomics data, using constraint-based stoichiometric metabolic models as a scaffold.

INTEGRATE takes as input a generic metabolic network model, including GPRs, cross-sectional transcriptomics data, cross-sectional intracellular metabolomics data and steady-state extracellular flux data.

□ Genion: an accurate tool to detect gene fusion from long transcriptomics reads

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08339-5

Genion is an accurate gene fusion discovery tool that uses a combination of dynamic programming and statistical filtering. Genion accurately identifies the gene fusions and its clustering accuracy for detecting fusion reads is better than LongGF.

From the mapping of transcriptomic long reads to a reference genome, Genion first identifies chains of exons. Reads with chains that contain exons from several genes provide an initial set of reads supporting potential gene fusions.

Genion clusters the reads that indicate potential gene fusions to define fusion candidates using a statistical method based on the analysis of background expression patterns for the involved genes and on the co-occurrence of the fusion candidates in other potential fusion events.

□ uniPort: a unified computational framework for single-cell data integration with optimal transport

>> https://www.biorxiv.org/content/10.1101/2022.02.14.480323v1.full.pdf

uniPort is a unified single-cell data integration framework that combines a coupled-VAE and Minibatch Unbalanced Optimal Transport. It leverages both highly variable common and dataset-specific genes for integration and is scalable to large-scale and partially overlapping datasets.

uniPort can further construct a reference atlas for online prediction across datasets. Meanwhile, uniPort provides a flexible label transfer framework to deconvolute spatial heterogeneous data using optimal transport space, instead of embedding latent space.

□ scGAC: a graph attentional architecture for clustering single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac099/6530275

scGAC (single-cell Graph Attentional Clustering) is a clustering method for scRNA-seq data. scGAC first constructs a cell graph and refines it by network denoising. It then adopts a self-optimizing method to obtain the cell clusters.

scGAC learns clustering-friendly representation of cells through a graph attentional autoencoder, which propagates information across cells with different weights and captures latent relationship among cells.

□ scITD: Tensor decomposition reveals coordinated multicellular patterns of transcriptional variation that distinguish and stratify disease individuals

>> https://www.biorxiv.org/content/10.1101/2022.02.16.480703v1.full.pdf

A joint decomposition would more naturally describe scenarios where different cell types respond specifically to the same external signals. It would also improve the ability to infer dependencies between transcriptional programs across cell types.

Single-cell Interpretable Tensor Decomposition (scITD) extracts “multicellular GE patterns” that vary across different biological samples. The multicellular patterns inferred by scITD can be linked with various clinical annotations, technical batch effects, and other metadata.

□ AIscEA: Unsupervised Integration of Single-cell Gene Expression and Chromatin Accessibility via Their Biological Consistency

>> https://www.biorxiv.org/content/10.1101/2022.02.17.480279v1.full.pdf

AIscEA first defines a ranked similarity score to quantify the biological consistency between cell types across measurements. AIscEA then uses the ranked similarity score and a novel permutation test to identify the cell-type alignment across measurements.

AIscEA further utilizes graph alignment to align the cells across measurements. The graph alignment method uses the symmetric k-nearest neighbor graph to characterize the low-dimensional manifold.

□ EMOGEA: ERROR MODELLED GENE EXPRESSION ANALYSIS PROVIDES A SUPERIOR OVERVIEW OF TIME COURSE RNA-SEQ MEASUREMENTS AND LOW COUNT GENE EXPRESSION

>> https://www.biorxiv.org/content/10.1101/2022.02.18.481000v1.full.pdf

EMOGEA is a principled framework for analyzing RNA-seq data that incorporates measurement uncertainty in the analysis, while introducing a special formulation for modelling data that are acquired as a function of time or another continuous variable.

EMOGEA yields gene expression profiles that represent groups of genes with similar modulations in their expression during embryogenesis. EMOGEA profiles highlight with clarity how the expression of different genes is modulated over time with a tractable biological interpretation.

□ scAnnotate: an automated cell type annotation tool for single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.02.19.481159v1.full.pdf

scAnnotate uses a marginal mixture model to describe both the dropout proportion and the non-dropout expression level distribution of a gene.

A marginal model based ensemble learning approach is developed to avoid having to specify and estimate a high-dimensional joint distribution for all genes.

To address the curse of high dimensionality, they use every gene to make a classifier and consider it as a ‘weak’ learner, and then use a combiner function to ensemble ‘weak’ learners built from all genes into a single ‘strong’ learner for making the final decision.

Cross.

2022-02-14 22:12:12 | Science News

□ PΨFinder: a practical tool for the identification and visualization of novel pseudogenes in DNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04583-4

PΨFinder can automatically identify novel PΨgs from DNA sequencing data and determine their location in the genome with high sensitivity. Insert positions of the pseudogene candidates are recorded by linking the pseudogene candidate with chimeric reads and chimeric pairs.

The resulting analysis with PΨFinder determined that predictions obtained from samples with a sequencing depth of 5 M reads, an average coverage of at least 144X and including both CPs and CRs, can be deemed true positive PΨg-insertion sites.

□ A scalable and unbiased discordance metric with H+

>> https://www.biorxiv.org/content/10.1101/2022.02.03.479015v1.full.pdf

H+ is a modification of G+ that retains the scale-agnostic discordance quantification while addressing problems with G+. Explicitly, H+ is an unbiased estimator for P(dij) > P(dkl).

The authors also describe an estimate of H+ based on bootstrap resampling from the original observations that does not require the full dissimilarity matrices to be calculated. H+ provides an additional means to consider termination of a clustering algorithm in a distance-agnostic manner.
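
The resampling idea can be illustrated with a small sketch. It assumes that the discordance of interest is the probability that a within-cluster dissimilarity exceeds a between-cluster dissimilarity, and it estimates that probability from randomly drawn pairs instead of the full dissimilarity matrix. The sampling scheme, function name and parameters are assumptions and may differ from the paper's estimator.

```python
import random


def discordance_by_resampling(points, labels, dist, n_draws=2000, seed=0):
    """Monte Carlo estimate of P(within-cluster d > between-cluster d).

    Only 2 * n_draws dissimilarities are evaluated, so the full
    pairwise matrix is never built.
    """
    n = len(points)
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    if len(sizes) < 2 or max(sizes.values()) < 2:
        raise ValueError("need at least two clusters and one cluster with two points")

    rng = random.Random(seed)

    def draw_pair(same_cluster: bool):
        # Rejection sampling: uniform over pairs of the requested kind.
        while True:
            i, j = rng.sample(range(n), 2)
            if (labels[i] == labels[j]) == same_cluster:
                return i, j

    hits = 0
    for _ in range(n_draws):
        i, j = draw_pair(True)    # within-cluster pair
        k, l = draw_pair(False)   # between-cluster pair
        if dist(points[i], points[j]) > dist(points[k], points[l]):
            hits += 1
    return hits / n_draws


if __name__ == "__main__":
    pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
    labs = ["a", "a", "a", "b", "b", "b"]
    print(discordance_by_resampling(pts, labs, lambda x, y: abs(x - y)))
```

Values near 0 indicate that within-cluster dissimilarities are almost always smaller than between-cluster ones; values approaching 0.5 suggest the labels carry little structure.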

□ Omics-informed CNV calls reduce false positive rate and improve power for CNV-trait associations

>> https://www.biorxiv.org/content/10.1101/2022.02.07.479374v1.full.pdf

A method to improve the detection of false positive CNV calls amongst PennCNV output by discriminating between high quality (true) and low quality (false) CNV regions based on multi-omics data.

The authors build a predictor of CNV quality inferred from WGS, transcriptomics and methylomics, solely based on PennCNV software output parameters in these samples assayed by multiple omics technologies.

□ scHFC: a hybrid fuzzy clustering method for single-cell RNA-seq data optimized by natural computation

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab588/6523126

scHFC is a hybrid fuzzy clustering method optimized by natural computation, based on the Fuzzy C-Means (FCM) and Gath-Geva (GG) algorithms. Specifically, principal component analysis is used to reduce the dimensions of the scRNA-seq data after preprocessing.

The FCM algorithm, optimized by simulated annealing and a genetic algorithm, is applied to cluster the data and output a membership matrix, which represents the initial clustering result and is taken as the input for the GG algorithm to get the final clustering results.

The authors also propose a cluster number estimation method called multi-index comprehensive estimation, which can estimate the cluster numbers well by combining four clustering effectiveness indexes.

□ expiMap: Biologically informed deep learning to infer gene program activity in single cells

>> https://www.biorxiv.org/content/10.1101/2022.02.05.479217v1.full.pdf

The key concept is the substitution of the uninterpretable nodes in an autoencoder’s bottleneck by labeled nodes mapping to interpretable lists of genes, such as gene ontologies, or biological pathways, for which activities are learned as constraints during reconstruction.

expiMap, the “explainable programmable mapper”, consists of an interpretable CVAE that allows the incorporation of domain knowledge by “architecture programming”, i.e., constraining the network architecture to ensure that each latent dimension captures the variability of known GPs.

□ Changes in chromatin accessibility are not concordant with transcriptional changes for single-factor perturbations

>> https://www.biorxiv.org/content/10.1101/2022.02.03.478981v1.full.pdf

The study integrates tandem, genome-wide chromatin accessibility and transcriptomic data to characterize the extent of concordance between them in response to inductive signals.

While certain genes have a high degree of concordance of change between expression and accessibility changes, there is also a large group of differentially expressed genes whose local chromatin remains unchanged.

□ StructuralVariantAnnotation: a R/Bioconductor foundation for a caller-agnostic structural variant software ecosystem

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac042/6522107

StructuralVariantAnnotation provides the caller-agnostic foundation needed for an R/Bioconductor ecosystem of structural variant annotation, classification, and interpretation tools able to handle both simple and complex genomic rearrangements.

StructuralVariantAnnotation can match equivalent variants reported as insertion and duplication and can identify transitive breakpoints. Such features are important as they are common when comparing short and long read call sets.

□ SAMAR: Assembly-free rapid differential gene expression analysis in non-model organisms using DNA-protein alignment

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08278-7

SAMAR (Speedy, Assembly-free Method to Analyze RNA-seq expression data) -- a quick-and-easy way to perform differential expression (DE) analysis in non-model organisms.

SAMAR uses LAST to learn the alignment scoring parameters suitable for the input and to estimate the fragment size distribution of paired-end reads, and directly aligns RNA-seq reads to the high-confidence proteome that would otherwise have been used for annotation.

□ Fast and compact matching statistics analytics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac064/6522115

A parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize.

A lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants.

Efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on the compact representations, and that allow computing key local statistics about the similarity between two strings.
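
For readers unfamiliar with the term, the matching statistics of a query with respect to a text record, for every query position, the length of the longest substring starting there that also occurs somewhere in the text. The sketch below computes them naively to make the definition concrete; it is quadratic or worse and has nothing to do with the paper's parallel or compressed algorithms. The function name is an assumption.

```python
def matching_statistics(query: str, text: str) -> list[int]:
    """MS[i] = length of the longest prefix of query[i:] that occurs in text.

    Naive reference implementation for small inputs only.
    """
    ms = []
    for i in range(len(query)):
        length = 0
        # Grow the match while the extended substring still occurs in text.
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("nana", "banana"))
    print(matching_statistics("xan", "banana"))
```

Range-maximum and range-sum queries over this array then summarize how similar a region of the query is to the text.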

□ FUNKI: Interactive functional footprint-based analysis of omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac055/6522117

FUNKI is a FUNctional toolKIt for footprint analysis. It provides a user-friendly interface for an easy and fast analysis of transcriptomics, phosphoproteomics and metabolomics data, either from bulk or single-cell experiments.

FUNKI provides a user interface to upload omics data, and then run DoRothEA, PROGENy, KinAct, CARNIVAL and COSMOS to estimate the activity of pathways, transcription factors, and kinases. The results are visualized in diverse forms.

□ CRFalign: A Sequence-structure alignment of proteins based on a combination of HMM-HMM comparison and conditional random fields

>> https://www.biorxiv.org/content/10.1101/2022.02.03.478675v1.full.pdf

CRFalign improves upon a reduced three-state or five-state scheme of HMM-HMM profile alignment model by means of conditional random fields with nonlinear scoring on sequence and structural features implemented with boosted regression trees.

CRFalign extracts complex nonlinear relationships among sequence profiles and structural features, including secondary structures, solvent accessibilities and environment-dependent properties, that give rise to position-dependent as well as environment-dependent match scores and gap penalties.

□ Beware to ignore the rare: how imputing zero-values can improve the quality of 16S rRNA gene studies results

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04587-0

A collection of normalization and zero-imputation approaches is tested for 16S rRNA-gene sequencing data preprocessing. This makes it possible to compare an updated list of normalization tools that takes recent publications into account, and to evaluate the effect of introducing a zero-imputation step.

Bray–Curtis dissimilarity was used to build a distance matrix on which Non-metric Multidimensional Scaling (NMDS) dimensionality reduction was performed to assess spatial distribution of samples, whereas Whittaker dissimilarity values were graphically represented using heatmaps.
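
For reference, the Bray-Curtis dissimilarity between two abundance profiles is the sum of absolute differences divided by the total abundance of both samples. The sketch below computes it and builds a small distance matrix; the function names and the toy counts are assumptions, and the NMDS step is not shown.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    if len(u) != len(v):
        raise ValueError("samples must have the same number of taxa")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_matrix(samples):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(a, b) for b in samples] for a in samples]


if __name__ == "__main__":
    counts = [
        [6, 7, 4],    # hypothetical taxon counts, sample 1
        [10, 0, 6],   # sample 2
        [0, 0, 12],   # sample 3
    ]
    for row in distance_matrix(counts):
        print([round(x, 3) for x in row])
```

The value ranges from 0 for identical profiles to 1 for samples that share no taxa, which is why zeros and how they are imputed can noticeably change the resulting ordination.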

□ Belayer: Modeling discrete and continuous spatial variation in gene expression from spatially resolved transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.02.05.479261v1.full.pdf

In the simplest case of an axis-aligned tissue structure, Belayer infers the maximum likelihood expression function using a dynamic programming algorithm that is related to the classical problems of changepoint detection and segmented regression.

Belayer models the expression of each gene with a piecewise linear expression function. It analyzes spatially resolved transcriptomics data using a global model of tissue organization and an explicit definition of GE that combines both discrete and continuous variation in space.
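
The classical segmented regression problem mentioned above can be sketched with a small dynamic program that fits a fixed number of line segments by least squares. This is a textbook formulation for one variable, not Belayer's algorithm or likelihood; the function names, the `min_len` parameter and the toy data are assumptions.

```python
def line_fit_sse(xs, ys):
    """Sum of squared residuals of the least-squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return sum((y - (my + slope * (x - mx))) ** 2 for x, y in zip(xs, ys))


def segmented_regression(xs, ys, n_segments, min_len=2):
    """Best split of the ordered points into n_segments linear pieces.

    cost[p][j] is the minimum total squared error when the first j points
    are covered by p segments. Returns (list of (start, end) index pairs,
    total squared error).
    """
    n = len(xs)
    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    cost[0][0] = 0.0

    for p in range(1, n_segments + 1):
        for j in range(min_len, n + 1):
            for i in range(0, j - min_len + 1):
                if cost[p - 1][i] == inf:
                    continue
                c = cost[p - 1][i] + line_fit_sse(xs[i:j], ys[i:j])
                if c < cost[p][j]:
                    cost[p][j] = c
                    back[p][j] = i

    if cost[n_segments][n] == inf:
        raise ValueError("not enough points for the requested segments")

    # Walk the back-pointers to recover the segment boundaries.
    bounds, j = [], n
    for p in range(n_segments, 0, -1):
        i = back[p][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1], cost[n_segments][n]


if __name__ == "__main__":
    xs = list(range(10))
    ys = [0, 1, 2, 3, 4, 20, 18, 16, 14, 12]   # two linear regimes
    print(segmented_regression(xs, ys, n_segments=2))
```

The breakpoints play the role of discrete layer boundaries, while the slope within each segment captures continuous variation along the axis.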

□ SpatialCorr: Identifying Gene Sets with Spatially Varying Correlation Structure

>> https://www.biorxiv.org/content/10.1101/2022.02.04.479191v1.full.pdf

SpatialCorr, a semiparametric approach for identifying spatial changes in the correlation structure of a group of genes. SpatialCorr estimates spot-specific correlation matrices using a Gaussian kernel; region-specific correlations are estimated using all spots in a region.

SpatialCorr tests for spatially varying correlation within each tissue region using a multivariate normal (MVN) likelihood ratio test statistic that compares an MVN with spot-specific correlation estimates to an MVN with constant correlation estimated from all spots in the region.
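
The kernel idea can be sketched for a single gene pair as follows. This is an illustrative toy and not SpatialCorr's code: the bandwidth, the coordinates, the expression values, and the function names are all assumptions, and the likelihood ratio test itself is not shown.

```python
import math


def gaussian_weights(coords, center, bandwidth):
    """Gaussian kernel weight of every spot relative to a center spot."""
    return [
        math.exp(-((x - center[0]) ** 2 + (y - center[1]) ** 2) / (2 * bandwidth ** 2))
        for x, y in coords
    ]


def weighted_correlation(a, b, w):
    """Pearson correlation of a and b under non-negative weights w."""
    total = sum(w)
    ma = sum(wi * ai for wi, ai in zip(w, a)) / total
    mb = sum(wi * bi for wi, bi in zip(w, b)) / total
    cov = sum(wi * (ai - ma) * (bi - mb) for wi, ai, bi in zip(w, a, b)) / total
    va = sum(wi * (ai - ma) ** 2 for wi, ai in zip(w, a)) / total
    vb = sum(wi * (bi - mb) ** 2 for wi, bi in zip(w, b)) / total
    return cov / math.sqrt(va * vb)


# Toy region: two clusters of spots whose gene pair is correlated in opposite ways.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
gene_a = [1, 2, 3, 1, 2, 3]
gene_b = [1, 2, 3, 3, 2, 1]

left = weighted_correlation(gene_a, gene_b, gaussian_weights(coords, (1, 0), 1.0))
right = weighted_correlation(gene_a, gene_b, gaussian_weights(coords, (11, 0), 1.0))
pooled = weighted_correlation(gene_a, gene_b, [1.0] * len(coords))
```

In this toy the pooled, region-wide correlation is zero while the spot-specific estimates are close to +1 and -1, which is the kind of discrepancy the test is designed to pick up.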

□ CRSP: Comparative RNA-seq pipeline for species lacking both of sequenced genomes and reference transcripts

>> https://www.biorxiv.org/content/10.1101/2022.02.04.479193v1.full.pdf

CRSP integrates a set of computational strategies, such as mapping de novo transcriptomic assembly contigs to a protein database, using an Expectation-Maximization (EM) algorithm to handle read-mapping uncertainty, and integrative statistics to quantify gene expression values.

Gene expression values estimated by CRSP are highly correlated with those estimated by directly mapping to a reference genome.

10 to 20 million single-end reads are sufficient to achieve reasonable gene expression quantification accuracy, while a precompiled de novo transcript assembly from deep sequencing can dramatically decrease the minimum number of reads required for the rest of the RNA-seq experiments.
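
The EM step for ambiguous reads can be illustrated with the textbook version of the idea: reads that hit several genes are split fractionally according to the current abundance estimates, and the abundances are then re-estimated from those fractions. This is a generic sketch under that assumption, not CRSP's actual code; the function name and the toy reads are hypothetical.

```python
def em_read_assignment(read_hits, n_iter=100):
    """Estimate relative gene abundances from reads with ambiguous hits.

    read_hits: list of tuples, each holding the distinct genes a read maps to.
    """
    genes = sorted({g for hits in read_hits for g in hits})
    theta = {g: 1.0 / len(genes) for g in genes}
    for _ in range(n_iter):
        expected = {g: 0.0 for g in genes}
        # E-step: split each read across its hits in proportion to theta.
        for hits in read_hits:
            norm = sum(theta[g] for g in hits)
            for g in hits:
                expected[g] += theta[g] / norm
        # M-step: abundances are the expected read fractions.
        theta = {g: expected[g] / len(read_hits) for g in genes}
    return theta


# Toy input: 3 reads unique to gene A, 1 unique to gene B, 2 ambiguous.
reads = [("A",)] * 3 + [("B",)] + [("A", "B")] * 2
theta = em_read_assignment(reads)
```

The ambiguous reads end up being shared roughly 3:1 in favour of gene A, following the evidence from the uniquely mapped reads.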

□ LRLoop: Feedback loops as a design principle of cell-cell communication

>> https://www.biorxiv.org/content/10.1101/2022.02.04.479174v1.full.pdf

Currently available techniques for predicting ligand-receptor interactions are one-directional from sender to receiver cells.

LRLoop is a new method for analyzing cell-cell communication based on bi-directional ligand-receptor interactions, in which two ligand-receptor pairs are identified that are responsive to each other and thereby form a closed feedback loop.
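
The closed-loop condition can be written down as a simple search. The sketch below is only a schematic reading of the description above, not LRLoop's algorithm: the data structures, the function name, and the placeholder molecule names (L1, R1, ...) are all hypothetical.

```python
def find_feedback_loops(a_to_b_pairs, b_to_a_pairs, targets_in_b, targets_in_a):
    """Find ligand-receptor pairs that respond to each other.

    a_to_b_pairs: (ligand, receptor) pairs, ligand from cell type A, receptor on B.
    b_to_a_pairs: (ligand, receptor) pairs in the opposite direction.
    targets_in_b: receptor on B -> set of ligands it induces in B.
    targets_in_a: receptor on A -> set of ligands it induces in A.
    """
    loops = []
    for l1, r1 in a_to_b_pairs:
        for l2, r2 in b_to_a_pairs:
            forward = l2 in targets_in_b.get(r1, set())   # R1 signalling in B induces L2
            backward = l1 in targets_in_a.get(r2, set())  # R2 signalling in A induces L1
            if forward and backward:
                loops.append(((l1, r1), (l2, r2)))
    return loops


# Placeholder example with made-up names.
a_to_b = [("L1", "R1"), ("L3", "R3")]
b_to_a = [("L2", "R2")]
targets_in_b = {"R1": {"L2"}, "R3": set()}
targets_in_a = {"R2": {"L1"}}
loops = find_feedback_loops(a_to_b, b_to_a, targets_in_b, targets_in_a)
```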

□ Thirdkind: displaying phylogenetic encounters beyond 2-level reconciliation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac062/6525213

No simple generic tool is available to visualise reconciliation results. Moreover, there is no tool to visualise 3-level reconciliations, i.e. two nested reconciliations, as for example in a host/symbiont/gene complex.

Thirdkind is a lightweight command-line tool that lets the user generate an SVG from recPhyloXML files with a large choice of options (orientation, font size, branch length, multiple trees, redundant transfer handling, etc.) and handles the visualisation of two nested reconciliations.

□ SPRI: Spatial Pattern Recognition using Information based method for spatial gene expression data

>> https://www.biorxiv.org/content/10.1101/2022.02.09.479510v1.full.pdf

SPRI directly models raw spatial transcriptome count data without model assumptions, transforming the problem of spatial expression pattern recognition into the detection of dependencies between spatial coordinate pairs, with gene read counts as the observed frequencies.

SPRI converts the spatial gene pattern problem into an association detection problem between coordinate values with observed raw count data, and then estimates associations using an information-based method, TIC, which calculates the total mutual information with all possible grids.
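
To see why read counts can serve as observed frequencies, consider the mutual information between the x and y coordinates on a single fixed grid. The sketch below shows only that one ingredient; TIC aggregates over many grids, which is not reproduced here, and the function name and toy spots are assumptions.

```python
import math
from collections import defaultdict


def coordinate_mutual_information(spots):
    """Mutual information (bits) between x and y bins, weighted by read count.

    spots: iterable of (x_bin, y_bin, count) for one gene.
    """
    spots = list(spots)
    total = sum(c for _, _, c in spots)
    if total <= 0:
        raise ValueError("gene has no reads")
    joint = defaultdict(float)
    px = defaultdict(float)
    py = defaultdict(float)
    for x, y, c in spots:
        p = c / total
        joint[(x, y)] += p
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# A gene expressed uniformly carries no spatial information ...
uniform = [(0, 0, 5), (0, 1, 5), (1, 0, 5), (1, 1, 5)]
# ... while a gene confined to the diagonal ties x and y together.
diagonal = [(0, 0, 10), (0, 1, 0), (1, 0, 0), (1, 1, 10)]
mi_uniform = coordinate_mutual_information(uniform)
mi_diagonal = coordinate_mutual_information(diagonal)
```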

□ blitzGSEA: Efficient computation of Gene Set Enrichment Analysis through Gamma distribution approximation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac076/6526383

blitzGSEA is an algorithm based on the same running sum statistic as GSEA; instead of performing permutations, it approximates the enrichment score probabilities using Gamma distributions.

blitzGSEA calculates a background distribution analytically for the weighted Kolmogorov-Smirnov statistic described in GSEA-P and fGSEA using the gene set shuffling methodology.
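
For reference, the running sum statistic itself is short to write down. The sketch below follows the standard weighted Kolmogorov-Smirnov definition of the enrichment score; it does not include the Gamma approximation, and the function name and toy ranking are made up for the example.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """GSEA-style running sum: signed maximum deviation from zero.

    ranked_genes and scores must already be sorted by decreasing score.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for s, h in zip(scores, hits):
        # Step up (weighted by the score) on a hit, step down uniformly on a miss.
        running += abs(s) ** weight / hit_total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


genes = ["a", "b", "c", "d", "e", "f"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, scores, {"a", "b"})
es_bottom = enrichment_score(genes, scores, {"e", "f"})
```

A permutation test would recompute this score for many shuffled gene sets; the point of the approximation is to avoid that loop.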

□ SCAR: Recombination-aware Phylogeographic Inference Using the Structured Coalescent with Ancestral Recombination

>> https://www.biorxiv.org/content/10.1101/2022.02.08.479599v1.full.pdf

The Structured Coalescent with Ancestral Recombination (SCAR) model builds on recent approximations to the structured coalescent by incorporating recombination into the ancestry of sampled individuals.

The SCAR model allows us to infer how the migration history of sampled individuals varies across the genome from ARGs, and improves estimation of key population genetic parameters. SCAR explores the potential and limitations of phylogeographic inference using full ARGs.

□ iDESC: Identifying differential expression in single-cell RNA sequencing data with multiple subjects

>> https://www.biorxiv.org/content/10.1101/2022.02.07.479293v1.full.pdf

A zero-inflated negative binomial mixed model is used to consider both subject effect and dropouts. iDESC models dropout events as inflated zeros and non-dropout events using a negative binomial distribution.

In the negative binomial component, a random effect is used to separate the subject effect from the group effect. A Wald statistic is used to assess the significance of the group effect.
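
The shape of such a model can be sketched with a zero-inflated negative binomial probability mass function and a log-linear mean that includes a per-subject term. The parameterization (mean and size), the form of the mean function, and every name below are assumptions for illustration; iDESC's exact specification may differ.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated NB: extra mass pi_zero at zero models dropout events."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base


def mean_count(intercept, group_effect, group, subject_effect, library_size):
    """Hypothetical log-linear mean: fixed group effect plus a subject random effect."""
    return library_size * math.exp(intercept + group_effect * group + subject_effect)


p_zero = zinb_pmf(0, mu=2.0, size=1.0, pi_zero=0.3)
total_mass = sum(zinb_pmf(k, 2.0, 1.0, 0.3) for k in range(300))
```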

□ Efficient Privacy-Preserving Whole Genome Variant Queries

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac070/6527622

This project provides a method that uses secure multi-party computation (MPC) to query genomic databases in a privacy-protected manner.

The proposed solution privately outsources genomic data from arbitrarily many sources to two non-colluding proxies and allows genomic databases to be safely stored in a semi-honest cloud. It provides data privacy, query privacy, and output privacy by using XOR-based sharing.

It is possible to query a genomic database with 3,000,000 variants with five genomic query predicates in under 400 ms. Querying 1,048,576 genomes, each containing 1,000,000 variants, for the presence of five different query variants can be achieved in approximately 6 minutes.
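
XOR-based sharing itself is a small building block. The sketch below shows how a value is split into two shares for two proxies and why XOR can be evaluated share-wise; it is only the basic primitive, not the paper's query protocol, and the function names and bit width are assumptions.

```python
import secrets


def xor_share(value, n_bits=64):
    """Split value into two shares; each share alone looks uniformly random."""
    if value < 0 or value >= 1 << n_bits:
        raise ValueError("value does not fit in n_bits")
    mask = secrets.randbits(n_bits)
    return mask, value ^ mask


def xor_reconstruct(share_1, share_2):
    """Combine the two shares to recover the original value."""
    return share_1 ^ share_2


# Two bit vectors (e.g. variant presence flags), each shared between two proxies.
a, b = 0b1011, 0b0110
a1, a2 = xor_share(a)
b1, b2 = xor_share(b)

# Each proxy XORs its own shares locally; neither proxy ever sees a or b.
proxy_1_result = a1 ^ b1
proxy_2_result = a2 ^ b2
combined = xor_reconstruct(proxy_1_result, proxy_2_result)
```

Operations other than XOR (AND, comparisons) need interaction between the proxies, which is where the real cost of an MPC protocol lies.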

□ PAC: Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees

>> https://www.biorxiv.org/content/10.1101/2022.02.11.480089v1.full.pdf

PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size.

Using inverted indexes and a novel data structure dubbed aggregative Bloom filters, a PAC query can require only a single random access and be performed in constant time in favorable instances.

As in SBTs, these trees’ inner nodes are unions of Bloom filters, but for PAC, they are organized in a binary left-comb tree and another binary Bloom right-comb tree. Each aggregative Bloom comb tree indexes all k-mers sharing a given minimizer.
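
Two of the ingredients, minimizer partitioning and Bloom filters that can be merged by union, can be sketched in a few lines. This toy does not implement the aggregative Bloom filters or the comb trees; the lexicographic minimizer, the hash scheme, the filter size, and all names are assumptions for illustration.

```python
import hashlib


def minimizer(kmer, m):
    """Lexicographically smallest m-mer of a k-mer (real tools often hash instead)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bloom_positions(item, n_bits, n_hashes):
    """Bit positions for an item, derived from salted SHA-256 digests."""
    positions = []
    for seed in range(n_hashes):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % n_bits)
    return positions


def bloom_add(bits, item, n_bits=1024, n_hashes=3):
    """Return the bitmask (a Python int) with the item's bits set."""
    for p in bloom_positions(item, n_bits, n_hashes):
        bits |= 1 << p
    return bits


def bloom_contains(bits, item, n_bits=1024, n_hashes=3):
    """True if all of the item's bits are set (false positives are possible)."""
    return all((bits >> p) & 1 for p in bloom_positions(item, n_bits, n_hashes))


def index_by_minimizer(datasets, k, m):
    """Map each minimizer to one Bloom filter per dataset."""
    index = {}
    for d, seq in enumerate(datasets):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            filters = index.setdefault(minimizer(kmer, m), [0] * len(datasets))
            filters[d] = bloom_add(filters[d], kmer)
    return index


datasets = ["ACGTACGTGA", "TTGACCAGTA"]
index = index_by_minimizer(datasets, k=5, m=3)
# An inner tree node would hold the union of its children, i.e. a bitwise OR.
union_filter = index["ACG"][0] | index["ACG"][1]
```

A query k-mer only needs the filters stored under its own minimizer, which is what keeps the number of random accesses low.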

□ Degeneracy measures in biologically plausible random Boolean networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04601-5

Although degeneracy is a feature of network topologies and seems to be implicated in a wide variety of biological processes, research on degeneracy in biological networks is mostly limited to weighted networks.

This work gives an information-theoretic definition of degeneracy on random Boolean networks. Random Boolean networks are discrete dynamical systems with binary connectivity, and are therefore well suited for tracing information flow and causal effects.
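
For readers unfamiliar with the substrate, a random Boolean network takes only a few lines to simulate. The sketch below builds a network with K random inputs per node, updates it synchronously, and measures the length of the attractor a trajectory falls into; it does not compute the degeneracy measure, and the function names are illustrative.

```python
import random


def make_rbn(n_nodes, k_inputs, seed=0):
    """Random wiring and random truth tables for an N-K Boolean network."""
    rng = random.Random(seed)
    inputs = [rng.sample(range(n_nodes), k_inputs) for _ in range(n_nodes)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k_inputs)] for _ in range(n_nodes)]
    return inputs, tables


def step(state, inputs, tables):
    """Synchronous update: every node reads its inputs from the old state."""
    new_state = []
    for node_inputs, table in zip(inputs, tables):
        index = 0
        for i in node_inputs:
            index = (index << 1) | state[i]
        new_state.append(table[index])
    return tuple(new_state)


def attractor_length(state, inputs, tables):
    """Iterate until a state repeats and return the length of the cycle."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, inputs, tables)
        t += 1
    return t - seen[state]


# Hand-wired two-node example: node 0 negates node 1, node 1 copies node 0.
toy_inputs = [[1], [0]]
toy_tables = [[1, 0], [0, 1]]
toy_cycle = attractor_length((0, 0), toy_inputs, toy_tables)

rbn_inputs, rbn_tables = make_rbn(n_nodes=5, k_inputs=2, seed=1)
rbn_cycle = attractor_length((0,) * 5, rbn_inputs, rbn_tables)
```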

□ BinSPreader: refine binning results for fuller MAG reconstruction

>> https://www.biorxiv.org/content/10.1101/2022.02.14.480326v1.full.pdf

BinSPreader is a novel binning refinement tool that exploits the assembly graph topology and other connectivity information to refine the existing binning, correct binning errors, propagate binning from longer contigs to shorter contigs, and infer contigs belonging to multiple bins.

BinSPreader can split input reads in accordance with the resulting binning, predicting reads potentially belonging to multiple MAGs.

BinSPreader uses a special working mode of the binning refining algorithm for sparse binnings, where the total length of initially binned contigs is significantly lower than the total assembly length.

□ scShapes: A statistical framework for identifying distribution shapes in single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.02.13.480299v1.full.pdf

While most methods for differential gene expression analysis aim to detect a shift in the mean of expressed values, single cell data are driven by over-dispersion and dropouts requiring statistical distributions that can handle the excess zeros.

scShapes quantifies cell-to-cell variability by testing for differences in the expression distribution while flexibly adjusting for covariates. scShapes identifies subtle variations that are independent of altered mean expression and detects biologically-relevant genes.

□ gcaPDA: a haplotype-resolved diploid assembler

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04591-4

gcaPDA (gamete cells assisted Phased Diploid Assembler) can generate chromosome-scale phased diploid assemblies for highly heterozygous and repetitive genomes using PacBio HiFi data, Hi-C data and gamete cell WGS data.

Both of the reconstructed haplotype assemblies generated using gcaPDA have excellent collinearity with their corresponding reference assemblies.

gcaPDA uses all the HiFi reads to construct assembly graphs, with haplotype-specific k-mers derived from gamete cell reads assisting in resolving the graph, and generates both haplotype assemblies simultaneously.

□ Efficient Bayesian inference for mechanistic modelling with high-throughput data

>> https://www.biorxiv.org/content/10.1101/2022.02.14.480336v1.full.pdf

Inspired by the method of stochastic gradient descent (SGD) in machine learning, a minibatch approach addresses the cost of inference with high-throughput data: for each comparison between simulated and observed data, it uses a stochastically sampled subset (minibatch) of the data.

Choosing a large enough minibatch ensures that the relevant signatures in the observed data can be accurately estimated, while avoiding unnecessary comparisons that slow down inference.
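
The minibatch comparison can be illustrated with a toy distance between a simulation and a random subset of the observations. The Gaussian simulator, the mean as summary statistic, the batch size, and the function names are all assumptions made for this sketch; the paper's actual inference scheme is not reproduced here.

```python
import random
import statistics


def simulate(mu, n, rng):
    """Stand-in mechanistic model: n draws from a unit-variance Gaussian."""
    return [rng.gauss(mu, 1.0) for _ in range(n)]


def minibatch_distance(simulated, observed, batch_size, rng):
    """Compare a simulation against a random minibatch of the observations."""
    batch = rng.sample(observed, batch_size)
    return abs(statistics.fmean(simulated) - statistics.fmean(batch))


rng = random.Random(0)
observed = simulate(5.0, 10000, rng)  # pretend this is the large dataset

# Each comparison touches only 500 of the 10,000 observations.
d_good = minibatch_distance(simulate(5.0, 500, rng), observed, 500, rng)
d_bad = minibatch_distance(simulate(8.0, 500, rng), observed, 500, rng)
```

A batch that is too small would make the distance noisy enough to blur good and bad parameter values, which is the trade-off described above.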

□ GMMchi: Gene Expression Clustering Using Gaussian Mixture Modeling

>> https://www.biorxiv.org/content/10.1101/2022.02.14.480329v1.full.pdf

Since the iterative process within GMMchi is based on the chi-square goodness of fit test as the main criterion for measuring the fit of the mixed normal distribution model, it is important for the validity of the chi-square test to have at least 5 measurements in each bin.

This is achieved by an algorithm called dynamic binning, which automatically combines bins while applying the least manipulation to the histogram, to ensure optimal results from the underlying chi-square test within GMMchi.
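
One simple way to enforce the minimum-count rule is a greedy left-to-right merge of adjacent bins. The sketch below is a guess at the general idea, not GMMchi's dynamic binning algorithm, which may choose merges differently; the function name and the toy histogram are hypothetical.

```python
def merge_sparse_bins(edges, counts, min_count=5):
    """Greedily merge adjacent histogram bins until each holds >= min_count.

    edges has len(counts) + 1 entries. Returns (new_edges, new_counts).
    """
    new_edges = [edges[0]]
    new_counts = []
    running = 0
    pending = False  # True while an unfinished merged bin is being accumulated
    for right_edge, c in zip(edges[1:], counts):
        running += c
        pending = True
        if running >= min_count:
            new_edges.append(right_edge)
            new_counts.append(running)
            running = 0
            pending = False
    if pending:
        if new_counts:
            # Leftover tail with too few counts: fold it into the last bin.
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:
            # The whole histogram has fewer than min_count observations.
            new_counts.append(running)
            new_edges.append(edges[-1])
    return new_edges, new_counts


edges = [0, 1, 2, 3, 4, 5, 6]
counts = [1, 2, 7, 3, 3, 1]
merged_edges, merged_counts = merge_sparse_bins(edges, counts)
```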

□ CNGPLD: Case-control copy-number analysis using Gaussian process latent difference

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac096/6530274

CNGPLD is a new tool for performing case-control somatic copy-number analysis that facilitates the discovery of differentially amplified or deleted copy-number aberrations in a case group of cancers compared with a control group of cancers.

This tool uses a Gaussian process statistical framework in order to account for the covariance structure of copy-number data along genomic coordinates and to control the false discovery rate at the region level.

□ DeepMNE: Deep Multi-network Embedding for lncRNA-Disease Association prediction

>> https://ieeexplore.ieee.org/document/9716828/

DeepMNE discovers potential lncRNA-disease associations, especially for novel diseases and lncRNAs. DeepMNE extracts multi-omics data to describe diseases and lncRNAs, and proposes a network fusion method based on deep learning to integrate multi-source information.

DeepMNE complements the sparse association network and uses kernel neighborhood similarity to construct disease similarity and lncRNA similarity networks. DeepMNE also achieves considerable predictive performance on perturbed datasets.

□ Penguin: A Tool for Predicting Pseudouridine Sites in Direct RNA Nanopore Sequencing Data

>> https://www.sciencedirect.com/science/article/abs/pii/S1046202322000354

Penguin integrates several machine learning models (i.e., predictors) to identify RNA Ψ sites in Nanopore direct RNA sequencing reads. Penguin extracts a set of features from the raw signal measured by Oxford Nanopore sequencing and the corresponding basecalled k-mer.

Penguin automates the data preprocessing, including Nanopore direct RNA read alignment using Minimap2 and signal extraction using Nanopolish, as well as feature extraction from the raw Nanopore signal for the integrated ML predictors and the prediction of RNA Ψ sites with those predictors.

□ SWIF(r): Enabling interpretable machine learning for biological data with reliability scores

>> https://www.biorxiv.org/content/10.1101/2022.02.18.481082v1.full.pdf

SWIF(r) (SWeep Inference Framework (controlling for correlation)) is a supervised machine learning algorithm that has been applied to the problem of identifying genomic sites under selection in population genetic data. SWIF(r) learns the individual and joint distributions of attributes.

SWIF(r)’s algorithm classifies testing data according to these distributions along with user-provided priors on the relative frequencies of the classes. However, the probabilities reported by SWIF(r) do not offer insight into the degree of "trustworthiness" of a particular classified instance.

Gravity Of Love.

2022-02-14 22:10:10 | Enigma

□ Enigma - Gravity Of Love


Push The Limits.

2022-02-14 22:09:08 | Enigma

□ Enigma - Push The Limits


