lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

Vexillum.

2021-12-31 22:17:37 | Science News


“When the theorem is proved from the right axioms, the axioms can be proved from the theorem.”

—Harvey Friedman [Fri74]



□ Reverse mathematics of rings

>> https://arxiv.org/pdf/2109.02037v1.pdf

A fine-grained analysis of four different definitions of Noetherian in the weak base system RCA₀ + IΣ₂.

The most obvious way is to construct a computable non-UFD in which every enumeration of a nonprincipal ideal computes ∅′, or respectively a computable non-Σ₁-PID in which every enumeration of a nonprincipal prime ideal computes ∅′.

Consider an ω-dimensional vector space over Q with basis {xₙ : n ∉ A}; the a′ᵢ are a linearly independent sequence in I. Let f(n) be the largest variable appearing in a′₀, ..., a′ₙ₊₁. Then f(n) must be greater than the nth element of the complement of A. Hence f dominates μ_∅′, and so a′₀, a′₁, ... computes ∅′.





□ Con-AAE: Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472268v1.full.pdf

Contrastive Cycle adversarial Autoencoders (Con-AAE) can efficiently map sparse, noisy single-cell multi-omics data from different spaces to a low-dimensional manifold in a unified space, making the downstream alignment and integration straightforward.

Con-AAE uses two autoencoders to map the data of the two modalities into two low-dimensional manifolds, forcing the two spaces to be as unified as possible with an adversarial loss and a latent cycle-consistency loss.





□ SpaceX: Gene Co-expression Network Estimation for Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474059v1.full.pdf

SpaceX (spatially dependent gene co-expression network) employs a Bayesian model to infer spatially varying co-expression networks via incorporation of spatial information in determining network topology.

SpaceX uses an over-dispersed spatial Poisson model coupled with a high-dimensional factor model to infer the shared and cluster-specific co-expression networks. The probabilistic model is able to quantify uncertainty and is based on a coherent dimension reduction.





□ AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication

>> https://www.pnas.org/content/119/1/e2113075119

AnchorWave (Anchored Wavefront alignment) implements a genome-duplication-informed longest path algorithm to identify collinear regions and performs base pair–resolved, end-to-end alignment for collinear blocks using an efficient two-piece affine gap cost strategy.

AnchorWave improves the alignment under a number of scenarios: genomes with high similarity, large genomes with high transposable element activity, genomes with many inversions, and alignments between species with deeper evolutionary divergence and different whole-genome duplication histories.
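
To make the "two-piece affine gap cost" idea concrete, here is a minimal sketch of the general form of such a penalty: the cost of a gap is the cheaper of two affine functions, one suited to short gaps and one to long gaps. The function name and all parameter values are illustrative assumptions, not AnchorWave's actual defaults.

```python
def two_piece_affine_gap_cost(length, open1=4, extend1=2, open2=24, extend2=1):
    """Penalty for a gap of `length` bases under a two-piece affine model.

    All parameter values are illustrative assumptions, not AnchorWave defaults.
    """
    if length <= 0:
        return 0
    short_piece = open1 + extend1 * length   # cheap to open, costly to extend
    long_piece = open2 + extend2 * length    # costly to open, cheap to extend
    return min(short_piece, long_piece)
```

With these assumed values, short gaps are scored by the first piece and long gaps (such as large indels from transposable elements) by the second, so long gaps are not penalized as harshly as under a single affine function.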





□ Grandline: Network-guided supervised learning on gene expression using a graph convolutional neural network

>> https://www.biorxiv.org/content/10.1101/2021.12.27.474240v1.full.pdf

Grandline transforms the PPI network into the spectral domain, which enables convolution over neighbouring genes and pinpointing of high-impact subnetworks, allowing better interpretability of deep learning models.

Grandline integrates the PPI network by considering the network as an undirected graph and gene expression values as node signals. Similar to standard convolutional neural network models, the model consists of multiple blocks of convolution and pooling layers.

Grandline can identify subnetworks that are important for phenotype prediction using the Grad-CAM technique. Grandline defines a spectral graph convolution in the Fourier domain and then defines a convolutional filter based on Chebyshev polynomials.
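
A Chebyshev-polynomial graph filter can be sketched in a few lines of plain Python. This is a generic illustration of the technique under stated assumptions, not Grandline's implementation: the function names are hypothetical, the graph is given as a dense adjacency matrix, and the largest Laplacian eigenvalue is approximated by 2.

```python
import math


def scaled_laplacian(adj):
    """Return L~ = L - I for the symmetric normalized Laplacian L.

    Assumes the largest eigenvalue of L is approximated by 2, so that
    L~ = 2L / lambda_max - I simplifies to -D^{-1/2} A D^{-1/2}.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if adj[i][j] and deg[i] > 0 and deg[j] > 0:
                lap[i][j] = -adj[i][j] / math.sqrt(deg[i] * deg[j])
    return lap


def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def chebyshev_filter(adj, signal, theta):
    """Apply sum_k theta[k] * T_k(L~) to a node signal (theta must be non-empty)."""
    lap = scaled_laplacian(adj)
    t_prev = list(signal)                       # T_0(L~) x = x
    out = [theta[0] * t for t in t_prev]
    if len(theta) == 1:
        return out
    t_curr = matvec(lap, signal)                # T_1(L~) x = L~ x
    out = [o + theta[1] * t for o, t in zip(out, t_curr)]
    for k in range(2, len(theta)):
        # Chebyshev recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
        t_next = [2 * a - b for a, b in zip(matvec(lap, t_curr), t_prev)]
        out = [o + theta[k] * t for o, t in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out
```

A filter with K coefficients only mixes signals from nodes at most K-1 hops apart, which is what makes the learned filters localized on the network.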





□ Clair3: Symphonizing pileup and full-alignment for deep learning-based long-read variant calling

>> https://www.biorxiv.org/content/10.1101/2021.12.29.474431v1.full.pdf

Clair3 is the 3rd generation of Clair and Clairvoyante. The Clair3 method is not restricted to a certain sequencing technology. It should work particularly well in terms of both runtime and performance on noisy data.

Clair3 integrates both pileup model and full-alignment model for variant calling. While a pileup model determines the result of a majority of variant candidates, candidates with uncertain results are further processed with a more intensive haplotype-resolved full-alignment model.





□ scGET: Predicting Cell Fate Transition During Early Embryonic Development by Single-cell Graph Entropy

>> https://www.sciencedirect.com/science/article/pii/S1672022921002539

scGET accurately predicts all the impending cell fate transitions. scGET provides a new way to analyze the scRNA-seq data and helps to track the dynamics of biological systems from the perspectives of network entropy.

The Single-Cell Graph Entropy (SGE) value quantitatively characterizes the stability and criticality of gene regulatory networks among cell populations and thus can be employed to detect the critical signal of cell fate or lineage commitment at the single-cell level.





□ GLRP: Stability of feature selection utilizing Graph Convolutional Neural Network and Layer-wise Relevance Propagation

>> https://www.biorxiv.org/content/10.1101/2021.12.26.474194v1.full.pdf

A graph convolutional layer of the GCNN is implemented as a Keras layer so that the SHAP (SHapley Additive exPlanation) explanation method can also be applied to a Keras version of a GCNN model.

GCNN+LRP shows the highest stability among the feature selection methods compared, including GCNN+SHAP. A GLRP subnetwork of an individual patient is on average substantially more connected (and interpretable) than a GCNN+SHAP subnetwork, which consists mainly of single vertices.





□ isoformant: A visual toolkit for reference-free long-read isoform analysis at single-read resolution

>> https://www.biorxiv.org/content/10.1101/2021.12.17.457386v1.full.pdf

isoformant is an alternative approach that derives isoforms by generating consensus sequences from long reads clustered on k-mer density, without the requirement for a reference genome or prior annotations.

isoformant was developed based on the concept that an individual long-read isoform can be uniquely identified by its constituent k-mer composition. For an appropriate length k, each unique read in a mixture can be represented by a correspondingly unique k-mer frequency vector.
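
The k-mer frequency representation described above can be illustrated with a short sketch. The function name is hypothetical and this shows only the counting step, not isoformant's clustering or consensus generation.

```python
from collections import Counter


def kmer_counts(read, k):
    """Count every overlapping k-mer in a read (empty if the read is shorter than k)."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


# Two reads that differ in sequence give different k-mer count vectors.
counts_a = kmer_counts("ACGTAC", 2)
counts_b = kmer_counts("ACGGAC", 2)
```

Comparing such count vectors (for example with a distance measure of one's choice) is one way reads could be grouped without aligning them to a reference.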





□ contrastiveVI: Isolating salient variations of interest in single-cell transcriptomic data with contrastiveVI

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473757v1.full.pdf

contrastiveVI learns latent representations that recover known subgroups of target data points better than previous methods and finds differentially expressed genes that agree with known ground truths.

contrastiveVI encodes each cell as the parameters of a distribution in a low-dimensional latent space. Only target data points are given salient latent variable values; background data points are instead assigned a zero vector for these variables to represent their absence.





□ scRAE: Deterministic Regularized Autoencoders with Flexible Priors for Clustering Single-cell Gene Expression Data

>> https://arxiv.org/pdf/2107.07709.pdf

There is a bias-variance trade-off with the imposition of any prior on the latent space in the finite data regime.

scRAE is a generative AE for single-cell RNA sequencing data, which can potentially operate at different points of the bias-variance curve.

scRAE consists of a deterministic AE with a flexibly learnable prior generator network, which is jointly trained with the AE. This allows scRAE to trade off bias and variance better in the latent space.





□ scIAE: an integrative autoencoder-based ensemble classification framework for single-cell RNA-seq data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab508/6463428

scIAE is an integrative autoencoder-based ensemble classification framework that first performs multiple random projections and applies integrative and devisable autoencoders (integrating stacked, denoising and sparse autoencoders) to obtain compressed representations.

Then base classifiers are built on the lower-dimensional representations and the predictions from all base models are integrated. The comparison of scIAE and common feature extraction methods shows that scIAE is effective and robust, independent of the choice of dimension, which is beneficial to subsequent cell classification.





□ PyLiger: Scalable single-cell multi-omic data integration in Python

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474131v1.full.pdf

LIGER is a widely used R package for single-cell multi-omic data integration. However, many users prefer to analyze their single-cell datasets in Python, which offers an attractive syntax and highly optimized scientific computing libraries for increased efficiency.

PyLiger offers faster performance than the previous R implementation (2-5× speedup), interoperability with AnnData format, flexible on-disk or in-memory analysis capability, and new functionality for gene ontology enrichment analysis.





□ Dynamic Suffix Array with Polylogarithmic Queries and Updates

>> https://arxiv.org/pdf/2201.01285.pdf

The first data structure that supports both suffix array queries and text updates in O(polylog n) time, achieving O(log^4 n) and O(log^{3+o(1)} n) time, respectively.

The structure is complemented by a hardness result: unless the Online Matrix-Vector Multiplication (OMv) Conjecture fails, no data structure with O(polylog n)-time suffix array queries can support the “copy-paste” operation in O(n^{1−ε}) time for any ε > 0.
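
To illustrate what "suffix array queries" and "text updates" mean, here is a deliberately naive baseline. It is not the data structure from the paper: it rebuilds the whole array after every edit, which is exactly the cost that a dynamic suffix array avoids. Function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of `text` in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def insert_char(text, position, char):
    """Naive update: edit the text, then rebuild the whole suffix array."""
    new_text = text[:position] + char + text[position:]
    return new_text, build_suffix_array(new_text)


text = "banana"
sa = build_suffix_array(text)          # sa[i] answers "which suffix has rank i?"
text, sa = insert_char(text, 0, "a")   # text becomes "abanana"
```

Each rebuild here takes time that grows at least linearly with the text length, whereas the paper's structure handles both operations in polylogarithmic time.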





□ SHAHER: A novel framework for analysis of the shared genetic background of correlated traits

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472525v1.full.pdf

SHAHER is versatile and applicable to summary statistics from GWASs with arbitrary sample sizes and sample overlaps, allows incorporation of different GWAS models (Cox, linear and logistic) and is computationally fast.

SHAHER is based on the construction of a linear combination of traits by maximizing the proportion of its genetic variance explained by the shared genetic factors. SHAHER requires only full GWAS summary statistics and matrices of genetic and phenotypic correlations.





□ Stacked-SGL: Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab848/6462433

Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a subset of sparse feature groups and introducing sparsity within each group.

Stacked SGL satisfies the criteria of prediction, stability and selection based on the sparse group lasso penalty by stacking. Stacked SGL weakens feature selection, because it selects a feature if and only if the meta learner selects the base learner that selects that feature.
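
The role of the mixing parameter can be seen in a small sketch of the sparse group lasso penalty in its commonly used form, where `alpha = 1` gives the plain lasso and `alpha = 0` the plain group lasso. The function name and the exact parametrization are assumptions for illustration; the paper's notation may differ.

```python
import math


def sparse_group_lasso_penalty(groups, alpha, lam=1.0):
    """Penalty lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2).

    `groups` is a list of coefficient lists, one list per feature group.
    """
    l1_term = sum(abs(b) for group in groups for b in group)
    group_term = sum(
        math.sqrt(len(group)) * math.sqrt(sum(b * b for b in group))
        for group in groups
    )
    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)


coefficients = [[3.0, 4.0], [0.0]]   # two groups; the second is entirely zero
lasso_only = sparse_group_lasso_penalty(coefficients, alpha=1.0)
group_only = sparse_group_lasso_penalty(coefficients, alpha=0.0)
```

Intermediate values of `alpha` blend the two terms, which is the compromise between group-level and within-group sparsity.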





□ MultiVelo: Single-cell multi-omic velocity infers dynamic and decoupled gene regulation

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472472v1.full.pdf

MultiVelo uses a probabilistic latent variable model to estimate the switch time and rate parameters of gene regulation, providing a quantitative summary of the temporal relationship between epigenomic and transcriptomic changes.

MultiVelo accurately recovers cell lineages and quantifies the length of priming and decoupling intervals in which chromatin accessibility and gene expression are temporarily out of sync.





□ LocCSN: Constructing local cell-specific networks from single-cell data

>> https://www.pnas.org/content/118/51/e2113178118

locCSN estimates cell-specific networks (CSNs) for each cell, preserving information about cellular heterogeneity that is lost with other approaches.

LocCSN is based on a nonparametric investigation of the joint distribution of gene expression; hence it can readily detect nonlinear correlations, and it is more robust to distributional challenges.





□ CTSV: Identification of Cell-Type-Specific Spatially Variable Genes Accounting for Excess Zeros

>> https://www.biorxiv.org/content/10.1101/2021.12.27.474316v1.full.pdf

CTSV can achieve more power than SPARK-X in detecting cell-type-specific SV genes and also outperforms other methods at the aggregated level.

CTSV directly models spatial raw count data and considers zero-inflation as well as overdispersion using a zero-inflated negative binomial distribution. It then incorporates cell-type proportions and spatial effect functions in the zero-inflated negative binomial regression framework.
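
The zero-inflated negative binomial distribution mentioned above can be written down directly. The sketch below uses the mean/dispersion parametrization and hypothetical function names; it shows only the probability mass function, not CTSV's regression on cell-type proportions and spatial effect functions.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial probability of count y, given mean mu > 0 and dispersion size > 0."""
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, size, pi_zero):
    """Zero-inflated NB: with probability pi_zero the count is a structural zero."""
    nb_part = (1.0 - pi_zero) * nb_pmf(y, mu, size)
    return pi_zero + nb_part if y == 0 else nb_part
```

The extra mass at zero (`pi_zero`) accounts for excess zeros, while the dispersion parameter `size` accounts for variance larger than the mean.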





□ TSSN: A New Method for Recognizing Protein Complexes Based on Protein Interaction Networks and GO Terms

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.792265/full

Topology and Semantic Similarity Network (TSSN) can filter the noise of PPI data. TSSN uses a new algorithm, called Neighbor Nodes of Proteins (NNP), for recognizing protein complexes by considering their topology information.

TSSN computes the edge aggregation coefficient as the topology characteristics of N, makes use of the GO annotation as the biological characteristics of N, and then constructs a weighted network. NNP identifies protein complexes based on this weighted network.





□ Thresholding Approach for Low-Rank Correlation Matrix based on MM algorithm

>> https://www.biorxiv.org/content/10.1101/2021.12.28.474401v1.full.pdf

Low-rank approximation is a very useful approach for interpreting the features of a correlation matrix; however, a low-rank approximation may result in an estimate far from zero even if the corresponding original value was close to zero.

The method estimates a sparse low-rank correlation matrix based on threshold values combined with cross-validation. The MM algorithm was used to estimate the sparse low-rank correlation matrix, and a grid search was performed to select the threshold values related to sparse estimation.





□ Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab870/6493233

Pairs is a block-compressed text file format for storing paired genomic coordinates from Hi-C data, and Pairix is a stand-alone C program written on top of tabix as a tool for the 4DN-standard pairs file format describing Hi-C data.

However, Pairix can be used as a generic tool for indexing and querying any bgzipped text file containing genomic coordinates, for either 2D- or 1D- indexing and querying.





□ ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs

>> https://www.biorxiv.org/content/10.1101/2022.01.02.473666v1.full.pdf

ClusTrast is a de novo transcript isoform assembler that clusters a set of guiding contigs by similarity, aligns short reads to the guiding contigs, and assembles each clustered set of short reads individually.

ClusTrast combines two assembly methods: Trans-ABySS and Shannon, and incorporates a novel approach to clustering and cluster-wise assembly of short reads. The final step of ClusTrast is to merge the cluster-wise assemblies with the primary assembly by concatenation.





□ TIPars: Robust expansion of phylogeny for fast-growing genome sequence data

>> https://www.biorxiv.org/content/10.1101/2021.12.30.474610v1.full.pdf

TIPars is an algorithm that inserts sequences into a reference phylogeny based on the parsimony criterion with the aid of a full multiple sequence alignment of taxa and pre-computed ancestral sequences.

TIPars searches for the insertion position by calculating the triplet-based minimal substitution score for the query sequence on all branches. TIPars showed promising taxa placement and insertion accuracy in phylogenies with homogeneous and divergent sequences.





□ Clustering Deviation Index (CDI): A robust and accurate unsupervised measure for evaluating scRNA-seq data clustering

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474840v1.full.pdf

The Clustering Deviation Index (CDI) measures the deviation of any clustering label set from the observed single-cell data. CDI is an unsupervised evaluation index whose calculation does not rely on the actual unobserved label set.

CDI calculates the negative penalized maximum log-likelihood of the selected feature genes based on the candidate label set. CDI also informs the optimal tuning parameters for any given clustering method and the correct number of cluster components.





□ Cobolt: integrative analysis of multimodal single-cell sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02556-z

Cobolt is a novel method that not only allows for analyzing the data from joint-modality platforms, but also provides a coherent framework for the integration of multiple datasets measured on different modalities.

Cobolt’s generative model for a single modality i starts by assuming that the counts measured on a cell are the mixture of the counts from different latent categories.

Cobolt estimates this joint representation via a novel application of Multimodal Variational Autoencoder (MVAE) to a hierarchical generative model. Cobolt results in an estimate of the latent variable for each cell, which is a vector in a K-dimensional space.





□ STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac001/6497782

In order to exploit the information contained in KGs through machine learning algorithms, numerous KG embedding models have been developed to encode the entities and relations of KGs in a higher dimensional vector space while attempting to retain their structural properties.

STonKGs uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature assembled by Integrated Network and Dynamical Reasoning Assembler (INDRA) to learn joint representations in a shared embedding space.





□ am: Implementation of a practical Markov chain Monte Carlo sampling algorithm in PyBioNetFit

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac004/6497784

This work describes the implementation of a practical MCMC method in the open-source software package PyBioNetFit (PyBNF), which is designed to support parameterization of mathematical models for biological systems.

am is the new MCMC method, which incorporates an adaptive move proposal distribution. Sampling can be initiated at a specified location in parameter space and with a multivariate Gaussian proposal distribution defined initially by a specified covariance matrix.





□ Hierarchical shared transfer learning for biomedical named entity recognition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04551-4

Hierarchical shared transfer learning combines multi-task learning and fine-tuning, and realizes multi-level information fusion between the underlying entity features and the upper data features.

The model uses XLNet based on Self-Attention PLM to replace BERT as encoder, avoiding the problem of input noise from autoencoding language model. When fine-tuning the BioNER task, it decodes the output of the XLNet model with Conditional Random Field decoder.





□ endoR: Interpreting tree ensemble machine learning models

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474763v1.full.pdf

endoR simplifies the fitted model into a decision ensemble from which it then extracts information on the importance of individual features and their pairwise interactions and also visualizes these data as an interpretable network.

endoR infers true associations with accuracy comparable to that of other commonly used approaches while easing and enhancing model interpretation. Adjustable regularization and bootstrapping help reduce the complexity and ensure that only essential parts of the model are retained.





□ Nm-Nano: Predicting 2′-O-methylation (Nm) Sites in Nanopore RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.01.03.473214v1.full.pdf

The Nm-Nano framework integrates two supervised machine learning models for predicting Nm sites in Nanopore sequencing data, namely XGBoost and Random Forest (RF).

Each model is trained with a set of features that are extracted from the raw signal generated by the Oxford Nanopore MinION device, as well as the corresponding basecalled k-mer resulting from inferring the RNA sequence reads from the generated Nanopore signals.





□ Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9

A comprehensive benchmarking and analysis of 36 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets.

Between-sample normalization has the biggest impact, with counts adjusted by size factors producing networks that most accurately recapitulate known tissue-naive and tissue-aware gene functional relationships.





□ SCOT: Single-Cell Multiomics Integration

>> https://www.liebertpub.com/doi/full/10.1089/cmb.2021.0477

Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that uses optimal transport to align single-cell multiomics data.

The Gromov-Wasserstein distance in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available.

SCOT finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection.
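
The final projection step can be sketched as follows, assuming the coupling matrix has already been computed by an optimal transport solver (not shown). The function name is hypothetical, and every row of the coupling is assumed to carry positive mass.

```python
def barycentric_projection(coupling, target):
    """Map each source cell to the coupling-weighted average of target cells.

    coupling[i][j] is the transport mass between source cell i and target cell j;
    target[j] is the coordinate vector of target cell j.
    Assumes every row of `coupling` has positive total mass.
    """
    dims = len(target[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        projected.append([
            sum(w * y[d] for w, y in zip(row, target)) / mass
            for d in range(dims)
        ])
    return projected


coupling = [[0.5, 0.0], [0.25, 0.25]]
target = [[0.0, 0.0], [2.0, 4.0]]
aligned = barycentric_projection(coupling, target)
```

In this toy example the first source cell lands exactly on the first target cell, and the second lands midway between the two target cells.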





□ ABRIDGE: An ultra-compression software for SAM alignment files

>> https://www.biorxiv.org/content/10.1101/2022.01.04.474935v1.full.pdf

ABRIDGE, an ultra-compressor for SAM files offering users both lossless and lossy compression options. This reference-based file compressor achieves the best compression ratio among all compression software ensuring lower space demand and faster file transmission.

ABRIDGE accepts a single SAM file as input and returns a compressed file that occupies less space than its BAM or CRAM counterpart. ABRIDGE compresses alignments after retaining only non-redundant information.

ABRIDGE accumulates all reads that are mapped onto the same nucleotide on a reference. ABRIDGE modifies the traditional CIGAR string to store soft-clips, mismatches, insertions, deletions, and quality scores thereby removing the need to store the MD string.




Lagrange Point.

2021-12-31 22:17:36 | Science News




□ DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab859/6482742

DeepSVP significantly improves the success rate of finding causative variants over StrVCTVRE and CADD-SV. DeepSVP uses as input an annotated VCF file of an individual and clinical phenotypes encoded using the Human Phenotype Ontology.

DeepSVP overcomes the limitation of missing phenotypes by incorporating information related to genes through ontologies, mainly the functions of gene products, gene expression in individual cell types, and anatomical sites of expression, and by systematically relating them to their phenotypic consequences.





□ MultiMAP: dimensionality reduction and integration of multimodal data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02565-y

MultiMAP is based on a framework of Riemannian geometry and algebraic topology and generalizes the UMAP framework to the setting of multiple datasets each with different dimensionality.

MultiMAP takes as input any number of datasets of potentially differing dimensions and recovers geodesic distances on a single latent manifold on which all of the data is uniformly distributed.





□ MSRCall: A Multi-scale Deep Neural Network to Basecall Oxford Nanopore Sequences

>> https://www.biorxiv.org/content/10.1101/2021.12.20.471615v1.full.pdf

MSRCall first uses convolutional layers to manipulate multi-scale downsampling. These back-to-back convolutional layers aim to capture features with receptive fields at different levels of complexity.

MSRCall simultaneously utilizes multi-scale convolutional and bidirectional LSTM layers to capture semantic information. MSRCall disentangles the relationship between raw signal data and nucleotide labels.





□ cLoops2: a full-stack comprehensive analytical tool for chromatin interactions

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab1233/6470683

cLoops2 consists of core modules for peak-calling, loop-calling, differentially enriched loops calling and loops annotation. cLoops2 addresses the practical analysis requirements, especially for loop-centric analysis with preferential design for Hi-TrAC/TrAC-looping data.

cLoops2 directly analyzes the paired-end tags to find candidate peaks and loops. It estimates the statistical significance for the peak/loop features with a permuted local background, eliminating the bias introduced by tuning third-party peak-calling parameters when calling loops.





□ CMIA: Gene regulation network inference using k-nearest neighbor-based mutual information estimation- Revisiting an old DREAM

>> https://www.biorxiv.org/content/10.1101/2021.12.20.473242v1.full.pdf

The MI-based kNN Kraskov-Stögbauer-Grassberger (KSG) algorithm leads to a significant improvement in GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR).

CMIA (Conditional Mutual Information Augmentation) is a novel inference algorithm inspired by Synergy-Augmented CLR. Looking forward, the goal of complete reconstruction of GRNs may require new inference algorithms and probably mutual information (MI) in more than three dimensions.





□ CoRE-ATAC: A deep learning model for the functional classification of regulatory elements from single cell and bulk ATAC-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009670

CoRE-ATAC can infer regulatory functions in diverse cell types, capture activity differences modulated by genetic mutations, and can be applied to single cell ATAC-seq data to study rare cell populations.

CoRE-ATAC integrates DNA sequence data with chromatin accessibility data using a novel ATAC-seq data encoder that is designed to be able to integrate an individual’s genotype with the chromatin accessibility maps by inferring the genotype from ATAC-seq read alignments.





□ CosNeti: ComplexOme-Structural Network Interpreter used to study spatial enrichment in metazoan ribosomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04510-z

CosNeti translates experimentally determined structures into graphs, with nodes representing proteins and edges the spatial proximity between them. CosNeti considers rProteins and ignores rRNA and other objects.

Spatial regions are defined using a random walk with restart methodology, followed by a procedure to obtain a minimum set of regions that cover all proteins in the complex.

Structural coherence is achieved by applying weights to the edges reflecting the physical proximity between purportedly contacting proteins. The weighting probabilistically guides the random-walk path trajectory.
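
A random walk with restart on a weighted graph can be sketched with a simple fixed-point iteration. This is a generic illustration with a hypothetical function name and an assumed restart probability; it does not reproduce how CosNeti derives edge weights from structures or selects the minimum set of regions.

```python
def random_walk_with_restart(weights, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probabilities of a walker restarting at `seed`.

    weights[i][j] is the (symmetric) proximity weight of the edge between i and j.
    """
    n = len(weights)
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    prob = list(start)
    for _ in range(max_iter):
        new_prob = []
        for i in range(n):
            # Probability flowing into node i, proportional to edge weights.
            flow = sum(
                weights[i][j] / col_sums[j] * prob[j]
                for j in range(n) if col_sums[j] > 0
            )
            new_prob.append((1.0 - restart) * flow + restart * start[i])
        converged = max(abs(a - b) for a, b in zip(new_prob, prob)) < tol
        prob = new_prob
        if converged:
            break
    return prob


# Path graph 0 - 1 - 2 with unit weights, walker seeded at node 0.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
visit = random_walk_with_restart(path, seed=0)
```

Nodes closer to the seed (in the weighted sense) receive higher probability, so thresholding the result gives a spatially coherent neighbourhood around the seed protein.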





□ 2FAST2Q: A general-purpose sequence search and counting program for FASTQ files

>> https://www.biorxiv.org/content/10.1101/2021.12.17.473121v1.full.pdf

2FAST2Q is a versatile and intuitive standalone program capable of extracting and counting feature occurrences in FASTQ files.

2FAST2Q can be used in any experimental setup that requires feature extraction from raw reads, being able to quickly handle mismatch alignments, nucleotide wise Phred score filtering, custom read trimming, and sequence searching within a single program.
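
The kind of task described here, extracting a feature from each read, filtering by per-nucleotide Phred score and counting occurrences, can be sketched as below. This is not 2FAST2Q's code or interface: the function name, the fixed feature position and the Phred+33 quality encoding are assumptions made for illustration.

```python
from collections import Counter


def count_features(fastq_lines, start, length, min_phred=30):
    """Count features at a fixed read position, skipping low-quality ones.

    fastq_lines is a list of FASTQ lines (4 lines per record).
    Assumes Phred+33 quality encoding.
    """
    counts = Counter()
    for i in range(0, len(fastq_lines) - 3, 4):
        sequence = fastq_lines[i + 1].strip()
        quality = fastq_lines[i + 3].strip()
        feature = sequence[start:start + length]
        feature_quality = quality[start:start + length]
        if len(feature) < length:
            continue  # read too short to contain the feature
        if all(ord(q) - 33 >= min_phred for q in feature_quality):
            counts[feature] += 1
    return counts


records = [
    "@read1", "ACGTAC", "+", "IIIIII",
    "@read2", "ACGTAC", "+", "II!!II",   # low quality inside the feature
    "@read3", "TTGTTT", "+", "IIIIII",
]
feature_counts = count_features(records, start=2, length=2)
```

Here the second read is discarded because the bases inside the feature window fall below the quality threshold, even though the rest of the read is of high quality.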





□ Integration of public DNA methylation and expression networks via eQTMs improves prediction of functional gene-gene associations

>> https://www.biorxiv.org/content/10.1101/2021.12.17.473125v1.full.pdf

MethylationNetwork can identify experimentally validated interacting pairs of genes that could not be identified in the RNA-seq datasets.

An integration pipeline based on kernel cross-correlation matrix decomposition is proposed. Using this pipeline, the authors integrated GeneNetwork and MethylationNetwork and used the integrated results to predict functional gene–gene correlations that are collected in the STRING database.





□ FineMAV: Prioritising positively selected variants in whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04506-9

Fine-Mapping of Adaptation Variation (FineMAV) is a statistical method that prioritizes functional SNP candidates under selection and depends upon population differentiation.

The stand-alone application performs FineMAV calculations on whole-genome sequencing data and can output bigWig files, which can be used to graphically visualise the scores on genome browsers.





□ GraphOmics: an interactive platform to explore and integrate multi-omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04500-1

GraphOmics provides an interactive platform that integrates data to Reactome pathways emphasising interactivity and biological contexts. This avoids the presentation of the integrated omics data as a large network graph or as numerous static tables.

GraphOmics offers a way to perform pathway analysis separately on each omics, and integrate the results at the end. The separate pathway analysis results run on different omics datasets can be combined with an AND operator in the Query Builder.





□ anndata: Annotated data

>> https://www.biorxiv.org/content/10.1101/2021.12.16.473007v1.full.pdf

AnnData makes a particular choice for data organization that has been left unaddressed by packages like scikit-learn or PyTorch, which model input and output of model transformations as unstructured sets of tensors.

The AnnData object is a collection of arrays aligned to the common dimensions of observations (obs) and variables (var).

Storing low-dimensional manifold structure within a desired reduced representation is achieved through a k-nearest neighbor graph in the form of a sparse adjacency matrix: a matrix of pairwise relationships of observations.
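
The idea of a k-nearest neighbor graph stored sparsely can be shown without any library. The sketch below is plain Python with a hypothetical function name; it keeps only the non-zero entries as a dictionary keyed by (observation, neighbor), and is not how anndata itself stores the graph internally.

```python
import math


def knn_adjacency(points, k):
    """Return {(i, j): distance} for each observation i and its k nearest neighbours j."""
    entries = {}
    for i, p in enumerate(points):
        distances = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for distance, j in distances[:k]:
            entries[(i, j)] = distance
    return entries


# Three observations embedded in a 2-dimensional reduced representation.
embedding = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
graph = knn_adjacency(embedding, k=1)
```

Only n × k entries are stored instead of n × n, which is why a sparse matrix of pairwise relationships scales to large numbers of observations.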





□ Class similarity network for coding and long non-coding RNA classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04517-6

Class Similarity Network considers more relationships among input samples in a direct way. It focuses on exploring the potential relationships between input samples and samples from both the same class and the different classes.

Class Similarity Network trains the parameters specific to each class to obtain the high-level features. The Fully Connected module learns parameters from different dense branches to integrate similarity information. The Decision module concatenates the nodes to make the prediction.





□ FCLQC: fast and concurrent lossless quality scores compressor

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04516-7

FCLQC achieves a comparable compression rate while being much faster than the baseline algorithms. FCLQC uses concurrent programming to achieve fast compression and decompression.

Concurrent programming executes parts of a program independently, not necessarily simultaneously, which is different from error-prone parallel computing. FCLQC shows at least a 31x improvement in compression speed, while the degradation in compression ratio is at most 13.58%.





□ ADClust: A Parameter-free Deep Embedded Clustering Method for Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.12.19.473334v1.full.pdf

ADClust first obtains a low-dimensional representation through a pre-trained autoencoder, and uses the representations to cluster cells into initial micro-clusters.

The micro-clusters are then compared with one another using a statistical test for unimodality, the Dip-test, to detect similar micro-clusters. Similar micro-clusters are merged by jointly optimizing the carefully designed clustering and autoencoder loss functions.





□ fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

>> https://www.biorxiv.org/content/10.1101/2021.12.20.473431v1.full.pdf

The fastMSA framework, consisting of query sequence encoder and context sequences encoder, can improve the scalability and speed of multiple sequence alignment significantly.

fastMSA uses the query sequences to search UniRef90 with JackHMMER v3.3 and takes the resulting MSAs as ground truth. By filtering out unrelated sequences in the low-dimensional space before performing MSA, fastMSA can accelerate the process 35-fold.
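
The filtering step can be pictured as a nearest-neighbor search over embeddings. The sketch below assumes the embeddings already exist and uses cosine similarity with a top-k cutoff; the toy vectors, the similarity measure and the function names are assumptions, not fastMSA's actual encoders or retrieval code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, database, k):
    """Return the IDs of the k database entries most similar to the query."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [seq_id for seq_id, _ in ranked[:k]]


# Toy embeddings; in practice these would come from the sequence encoders.
database = {
    "seq_a": [0.9, 0.1, 0.0],
    "seq_b": [0.0, 1.0, 0.0],
    "seq_c": [0.8, 0.2, 0.1],
    "seq_d": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]
shortlist = retrieve_top_k(query, database, k=2)
print(shortlist)  # only these candidates would go on to the alignment step
```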





□ XAE4Exp: Explainable autoencoder-based representation learning for gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473742v1.full.pdf

XAE4Exp (eXplainable AutoEncoder for Expression data) integrates an autoencoder (AE) with SHapley Additive exPlanations (SHAP), a flagship technique in the field of eXplainable AI (XAI).

XAE4Exp quantitatively evaluates the contributions of each gene to the hidden structure learned by an AE, substantially improving the explainability of AE outcomes.





□ DeepLOF: A deep learning framework for predicting human essential genes from population and functional genomic data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473690v1.full.pdf

DeepLOF is an evolution-based deep learning model for predicting human genes intolerant to loss-of-function (LOF) mutations. DeepLOF can integrate genomic features and population genomic data to predict LOF-intolerant genes without human-labeled training data.

DeepLOF combines the neural network-based beta prior distribution with the population genetics-based likelihood function to obtain a posterior distribution of η, which represents the belief about LOF intolerance after integrating genomic features and population genomic data.





□ CSNet: Estimating cell-type-specific gene co-expression networks from bulk gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473558v1.full.pdf

For finite sample cases, it may be desirable to ensure the positive definiteness of the final estimator. One strategy is to solve a constrained optimization problem to find the nearest correlation matrix in Frobenius norm.

CSNet is a sparse estimator with a SCAD penalty. The authors derive the non-asymptotic convergence rate of CSNet in spectral norm and establish variable selection consistency, ensuring that the edges in the cell-type-specific networks can be correctly identified with probability tending to 1.
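
For reference, the SCAD penalty itself is a simple piecewise function. The sketch below implements the standard form due to Fan and Li, with their commonly used default a = 3.7; how CSNet applies and tunes the penalty is not shown here, and the function name is an assumption.

```python
def scad_penalty(t, lam, a=3.7):
    """SCAD penalty evaluated at coefficient t (requires lam > 0 and a > 2).

    Linear like the lasso near zero, quadratic in a transition zone,
    then constant, so large coefficients are not shrunk further.
    """
    x = abs(t)
    if x <= lam:
        return lam * x
    if x <= a * lam:
        return (2 * a * lam * x - x * x - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


for value in (0.5, 1.0, 2.0, 3.7, 10.0):
    print(value, scad_penalty(value, lam=1.0))
```

The constant tail is what separates SCAD from an L1 penalty: strong correlations incur the same penalty regardless of size, which reduces bias on the edges that are kept.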





□ NanoGeneNet: Using Deep Learning for Gene Detection and Classification in Raw Nanopore Signals

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473143v1.full.pdf

NanoGeneNet is a neural network-based method capable of detecting and classifying specific genomic regions directly in raw nanopore signals (squiggles).

The basecalling process can therefore be omitted entirely, because the raw signals of significant genes or intergenic regions can be analysed directly. If the nucleotide sequences are required, the identified squiggles can be basecalled in preference to others.





□ binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473795v1.full.pdf

binny is a binning tool that produces high-quality metagenome-assembled genomes from both contiguous and highly fragmented genomes.

binny uses k-mer-composition and coverage by metagenomic reads for iterative, non-linear dimension reduction of genomic signatures as well as subsequent automated contig clustering with cluster assessment using lineage-specific marker gene sets.





□ Baltica: integrated splice junction usage analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473966v1.full.pdf

Baltica is a framework that provides workflows for quality control, de novo transcriptome assembly with StringTie2, and currently four differential junction usage (DJU) methods: rMATS, JunctionSeq, Majiq, and LeafCutter.

Baltica is demonstrated on two datasets: the first uses Spike-in RNA Variant Control Mixes (SIRVs), and the second consists of paired Illumina and Oxford Nanopore Technologies data. Integration through Baltica allows the authors to compare the performance of different DJU methods and test the usability of a meta-classifier.





□ bulkAnalyseR: An accessible, interactive pipeline for analysing and sharing bulk sequencing results

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473982v1.full.pdf

Critically, neither VIPER nor BioJupies offers support for more complex differential expression (DE) tasks beyond simple pairwise comparisons. This limits the biological interpretation of more complex experimental designs.

bulkAnalyseR provides an accessible yet flexible framework for the analysis of bulk sequencing data without relying on prior programming expertise. Users can create a shareable Shiny app in two lines of code, from an expression matrix and a metadata table.





□ ePat: extended PROVEAN annotation tool

>> https://www.biorxiv.org/content/10.1101/2021.12.21.468911v1.full.pdf

ePat extends the conventional PROVEAN in two ways, covering cases for which the conventional PROVEAN could not calculate pathogenicity.

ePat is able to calculate the pathogenicity of variants near the splice junction, frameshift, stop gain, and start lost. In addition, batch processing is used to calculate the pathogenicity of all variants in a VCF file in a single step.





□ A guide to trajectory inference and RNA velocity

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473434v1.full.pdf

Whereas traditional trajectory inference methods reconstruct cellular dynamics given a population of cells of varying maturity, RNA velocity relies on a dynamical model describing splicing dynamics.

However, pseudotime is based solely on transcriptional information, so it cannot be interpreted as an estimator of the true time since initial differentiation.

Rather, it is a high-resolution estimate of cell state, which is likely to be monotonically related to the true chronological time, but there is no guarantee that equivalent changes in transcriptional profiles follow a similar chronological time.
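
The splicing dynamics that RNA velocity relies on are commonly written as a two-species kinetic model: unspliced RNA u is transcribed at rate alpha and spliced at rate beta, while spliced RNA s is degraded at rate gamma. The sketch below assumes that model with constant rates and integrates it with a plain Euler scheme; the parameter values and function names are illustrative and not taken from the guide.

```python
def simulate_splicing(alpha, beta, gamma, u0=0.0, s0=0.0, dt=0.01, steps=5000):
    """Euler integration of du/dt = alpha - beta*u and ds/dt = beta*u - gamma*s."""
    u, s = u0, s0
    for _ in range(steps):
        du = alpha - beta * u
        ds = beta * u - gamma * s
        u += dt * du
        s += dt * ds
    return u, s


def velocity(u, s, beta, gamma):
    """RNA velocity: the time derivative of spliced abundance."""
    return beta * u - gamma * s


# Start from an "off" state and let the gene switch on.
u_end, s_end = simulate_splicing(alpha=2.0, beta=1.0, gamma=0.5)
print(u_end, s_end, velocity(u_end, s_end, beta=1.0, gamma=0.5))
```

Under this model the system relaxes towards u = alpha/beta and s = alpha/gamma, where the velocity is zero. A positive velocity therefore indicates a gene being induced, and a negative one a gene being repressed.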





□ GeneTonic: an R/Bioconductor package for streamlining the interpretation of RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04461-5

GeneTonic serves as a comprehensive toolkit for streamlining the interpretation of functional enrichment analyses, by fully leveraging the information of expression values in a differential expression context.

GeneTonic is not structured as an end-to-end workflow including quantification, preprocessing, exploratory data analysis, and DE modeling—all operations that are also time consuming, but in many scenarios need to be carried out only once.





□ The impact of low input DNA on the reliability of DNA methylation as measured by the Illumina Infinium MethylationEPIC BeadChip

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473840v1.full.pdf

This study demonstrates that although as little as 40ng is sufficient to produce Illumina Infinium MethylationEPIC BeadChip DNAm data that passes standard QC checks, data quality and reliability diminish as DNA input decreases.

The authors recommend caution and the use of sensitivity analyses when working with less than 200ng of DNA on the Illumina Infinium MethylationEPIC BeadChip.





□ AMC: accurate mutation clustering from single-cell DNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab857/6482741

AMC first employs principal component analysis followed by K-means clustering to find mutation clusters, then infers the maximum likelihood estimates of the genotypes of each cluster.

The inferred genotypes can subsequently be used to reconstruct the phylogenetic tree with high efficiency. AMC uses BIC to jointly determine the best number of mutation clusters and the corresponding genotypes.
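
Model selection by BIC can be sketched generically as below. The log-likelihoods and parameter counts are made-up numbers for illustration; how AMC computes its likelihood and counts parameters is not reproduced here, and the function names are assumptions.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: n_params * ln(n_obs) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_by_bic(fits, n_obs):
    """fits maps k -> (maximized log-likelihood, number of free parameters)."""
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in fits.items()}
    best_k = min(scores, key=scores.get)  # lower BIC is better
    return best_k, scores


# Hypothetical fits for k = 2, 3, 4 mutation clusters.
fits = {2: (-520.0, 20), 3: (-450.0, 30), 4: (-445.0, 40)}
best_k, scores = select_by_bic(fits, n_obs=200)
print(best_k, scores)
```

In this toy example, going from three to four clusters improves the likelihood only slightly, so the parameter penalty outweighs the gain and the smaller model wins.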





□ LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474111v1.full.pdf

LotuS2 uses only truncated, high-quality reads for sequence clustering (except ITS amplicons), while the read backmapping and seed extension steps restore some of the discarded sequence data.

LotuS2 often reported the fewest ASVs/OTUs, while including more sequence reads in abundance tables. This indicates that LotuS2 has a more efficient usage of input data while covering a larger sequence space per ASV/OTU.




□ EdClust: A heuristic sequence clustering method with higher sensitivity

>> https://www.worldscientific.com/doi/abs/10.1142/S0219720021500360

Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from overestimation of inferred clusters and low clustering sensitivity.

The new method, edClust, was tested on three large-scale sequence databases and compared with several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH.





□ cDNA-detector: detection and removal of cDNA contamination in DNA sequencing libraries

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04529-2

cDNA-detector provides the option to remove contaminant reads from the alignment to reduce the risk of spurious coverage peaks and variant calls in downstream analysis.

When using cDNA-detector on genomic sequence data, the authors recommend suppressing the “retrocopy” output, such that only potential vector cDNA candidates are reported. With this strategy, contaminants can be removed from alignments, revealing true signal previously obscured.





□ Artificial intelligence “sees” split electrons

>> https://www.science.org/doi/10.1126/science.abm2445

Chemical bonds between atoms are stabilized by the exchange-correlation (xc) energy, a quantum-mechanical effect in which “social distancing” by electrons lowers their electrostatic repulsion energy.

Kohn-Sham density functional theory (DFT) states that the electron density determines this xc energy, but the density functional must be approximated.

Two exact constraints—the ensemble-based piecewise linear variation of the total energy with respect to fractional electron number and fractional electron z-component of spin — require hard-to-control nonlocality.
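
The first of these constraints, piecewise linearity in electron number, says that the exact energy at a fractional electron count N + δ is the straight-line interpolation between the energies at the neighbouring integers: E(N + δ) = (1 − δ)·E(N) + δ·E(N + 1). The sketch below evaluates that condition for the hydrogen atom, using its exact non-relativistic energies in Hartree; the function name is an assumption, and the fractional-spin constraint is not covered.

```python
import math


def piecewise_linear_energy(n_electrons, integer_energies):
    """Exact-constraint total energy at a (possibly fractional) electron number.

    integer_energies maps integer electron counts to total energies.
    """
    lower = math.floor(n_electrons)
    delta = n_electrons - lower
    if delta == 0:
        return integer_energies[lower]
    return (1 - delta) * integer_energies[lower] + delta * integer_energies[lower + 1]


# Hydrogen: bare proton (0 electrons) and neutral atom (1 electron), in Hartree.
hydrogen = {0: 0.0, 1: -0.5}
half = piecewise_linear_energy(0.5, hydrogen)
print(half)
```

An approximate functional that gives "half a hydrogen atom" an energy below this line is showing the delocalization error that the exact constraint rules out.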




□ RAxML Grove: An empirical Phylogenetic Tree Database

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab863/6486526

When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shapes.

RAxML Grove currently comprises more than 60,000 inferred trees and respective model parameter estimates from fully anonymized empirical data sets that were analyzed using RAxML and RAxML-NG on two web servers.





□ ifCNV: a novel isolation-forest-based package to detect copy number variations from NGS datasets

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474771v1.full.pdf

About 1500 CNV regions have already been discovered in the human population, accounting for ~12–16% of the entire human genome, making CNVs one of the most common types of genetic variation, although the biological impact of the majority of these CNVs remains uncertain.

ifCNV is a CNV detection tool based on read-depth distribution. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples.





□ DICAST: Alternative splicing analysis benchmark

>> https://www.biorxiv.org/content/10.1101/2022.01.05.475067v1.full.pdf

DICAST offers a modular and extensible framework for the analysis of AS integrating 11 splice-aware mapping and eight event detection tools. DICAST allows researchers to employ a consensus approach to consider the most successful tools jointly for robust event detection.

While DICAST introduces a unifying standard for AS event reporting, AS event detection tools utilize inherently different approaches and lead to inconsistent results.
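
A consensus over inconsistent tool outputs can be as simple as keeping the events that enough tools agree on. The sketch below assumes events have already been converted to a common, comparable representation; the tuple layout, tool names and support threshold are hypothetical, and this is not DICAST's own code.

```python
from collections import Counter


def consensus_events(tool_calls, min_support):
    """Keep the events reported by at least min_support tools."""
    counts = Counter(
        event for events in tool_calls.values() for event in set(events)
    )
    return {event for event, n in counts.items() if n >= min_support}


# Hypothetical unified events: (gene, event type, start, end).
tool_calls = {
    "tool_1": {("geneA", "exon_skipping", 100, 200), ("geneB", "intron_retention", 50, 80)},
    "tool_2": {("geneA", "exon_skipping", 100, 200)},
    "tool_3": {("geneA", "exon_skipping", 100, 200), ("geneC", "alt_3prime", 10, 40)},
}
robust = consensus_events(tool_calls, min_support=2)
print(robust)
```

A sketch like this only works once events share a common format, since identical events reported in different formats would otherwise never be counted as matches.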





□ scNAME: Neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac011/6499267

scNAME incorporates a mask estimation task for gene pertinence mining and a neighborhood contrastive learning framework for cell intrinsic structure exploitation.

scNAME uses a neighborhood contrastive paradigm with an offline memory bank, global in scope, which encourages discriminative feature representations and achieves intra-cluster compactness together with inter-cluster separation.





lens, align. Awards 2021.

2021-12-31 21:12:36 | Music20

(Comet Leonard, C/2021 A1: Photo by Michael Jäger)


Introducing my personal best tracks of 2021.



□ Ludovico Einaudi / “Twice (Reimagined by Mercan Dede)”

Dark Arabtronica that fuses Sufism with club music. It presents the phrases of Einaudi's original in an interpretation from an entirely different dimension.



□ Thomas Bergersen / “Made of Fire” (from the Album “Chapter IV”)

One ultimate form of electronica built around a mixed chorus. A track I would urge anyone to hear who likes this kind of work by Enigma, eRa, or Hans Zimmer.


□ Porter Robinson / “Get Your Wish” (from the Album “Nurture”)

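The track this DJ, a Future Bass prodigy who became the darling of his era, produced after emerging from a years-long hiatus overflows with melodies as simple and fresh as a patch of sunlight.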



□ Miloš Karadaglić - Einaudi: Full Moon (Arr. Lewin for Guitar)