lens, align.

Long is the time, but what is true comes to pass.

Vexillum.

2021-12-31 22:17:37 | Science News


“When the theorem is proved from the right axioms, the axioms can be proved from the theorem.”

—Harvey Friedman [Fri74]



□ Reverse mathematics of rings

>> https://arxiv.org/pdf/2109.02037v1.pdf

The paper gives a fine-grained analysis of four different definitions of Noetherian in the weak base system RCA0 + IΣ2.

The most obvious way is to construct a computable non-UFD in which every enumeration of a nonprincipal ideal computes ∅′, and, respectively, a computable non-Σ1-PID in which every enumeration of a nonprincipal prime ideal computes ∅′.

In an ω-dimensional vector space over Q w/ basis {x_n : n ∉ A}, the a′_i form a linearly independent sequence in I. Let f(n) be the largest variable index appearing in a′_0, ..., a′_{n+1}; then f(n) must be greater than the nth element of the complement of A. Hence f dominates μ∅′, the modulus function of ∅′, and so a′_0, a′_1, ... computes ∅′.





□ Con-AAE: Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472268v1.full.pdf

Contrastive Cycle adversarial Autoencoders (Con-AAE) can efficiently map such highly sparse and noisy data from different spaces to a low-dimensional manifold in a unified space, making the downstream alignment and integration straightforward.

Con-AAE uses two autoencoders to map the two modalities into two low-dimensional manifolds, forcing the two latent spaces to be as unified as possible with an adversarial loss and a latent cycle-consistency loss.
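The latent cycle-consistency idea can be sketched in a few lines; the toy one-layer linear "encoder"/"decoder" maps and the `cycle_loss` helper below are illustrative stand-ins for Con-AAE's neural networks, not its implementation.

```python
# Minimal sketch of a latent cycle-consistency loss between two modalities.
# Toy linear maps stand in for the paper's encoder/decoder networks.

def mse(a, b):
    # mean squared error between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def linear(w, v):
    # apply a weight matrix (list of rows) to a vector
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def cycle_loss(z_a, enc_b, dec_b):
    # decode a modality-A latent code into modality-B space, re-encode it,
    # and penalise the distance back to the original latent code
    return mse(z_a, linear(enc_b, linear(dec_b, z_a)))
```

With perfectly inverse maps the loss is zero, which is exactly the state the training objective pushes the two autoencoders toward.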





□ SpaceX: Gene Co-expression Network Estimation for Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474059v1.full.pdf

SpaceX (spatially dependent gene co-expression network) employs a Bayesian model to infer spatially varying co-expression networks via incorporation of spatial information in determining network topology.

SpaceX uses an over-dispersed spatial Poisson model coupled with a high-dimensional factor model to infer the shared and cluster-specific co-expression networks. The probabilistic model quantifies uncertainty and is based on a coherent dimension reduction.





□ AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication

>> https://www.pnas.org/content/119/1/e2113075119

AnchorWave (Anchored Wavefront alignment) implements a genome-duplication-informed longest path algorithm to identify collinear regions and performs base pair-resolved, end-to-end alignment for collinear blocks using an efficient two-piece affine gap cost strategy.

AnchorWave improves the alignment under a number of scenarios: genomes w/ high similarity, large genomes w/ high transposable element activity, genomes w/ many inversions, and alignments b/n species w/ deeper evolutionary divergence / different whole-genome duplication histories.
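The two-piece affine gap cost can be written as a small function: a gap of length k pays the cheaper of two affine penalties, so short gaps are scored with one (open, extend) pair and long gaps (e.g. transposable-element insertions) with another. The parameter values below are illustrative, not AnchorWave's defaults.

```python
# Sketch of a two-piece affine gap cost: min over two affine functions,
# one tuned for short gaps (cheap open, costly extend) and one for long
# gaps (costly open, cheap extend). Values are illustrative assumptions.

def two_piece_gap_cost(k, open1=8, ext1=2, open2=40, ext2=0.5):
    if k == 0:
        return 0
    return min(open1 + ext1 * k, open2 + ext2 * k)
```

Short gaps fall on the first piece and very long gaps on the second, which keeps large structural insertions from being penalised out of the alignment.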





□ Grandline: Network-guided supervised learning on gene expression using a graph convolutional neural network

>> https://www.biorxiv.org/content/10.1101/2021.12.27.474240v1.full.pdf

Grandline transforms the PPI network into the spectral domain, enabling convolution over neighbouring genes and the pinpointing of high-impact subnetworks, which allows better interpretability of deep learning models.

Grandline integrates the PPI network by treating it as an undirected graph with gene expression values as node signals. Similar to standard convolutional neural network models, the model consists of multiple blocks of convolution and pooling layers.

Grandline can identify subnetworks that are important for phenotype prediction using the Grad-CAM technique. Grandline defines a spectral graph convolution in the Fourier domain and then a convolutional filter based on Chebyshev polynomials.
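A Chebyshev-polynomial spectral filter computes y = Σ_k θ_k T_k(L̃)x with the recurrence T_0 = I, T_1 = L̃, T_k = 2·L̃·T_{k-1} − T_{k-2}, where L̃ is the scaled graph Laplacian. The sketch below is a generic ChebNet-style filter, not Grandline's code; the scaled-Laplacian input and the θ coefficients are assumptions.

```python
# Generic Chebyshev spectral graph filter over a scaled Laplacian L_scaled
# (a list-of-rows matrix). theta holds the K filter coefficients.

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def cheb_filter(L_scaled, x, theta):
    t_prev = x[:]                                  # T_0(L) x = x
    y = [theta[0] * t for t in t_prev]
    if len(theta) == 1:
        return y
    t_curr = matvec(L_scaled, x)                   # T_1(L) x = L x
    y = [yi + theta[1] * t for yi, t in zip(y, t_curr)]
    for k in range(2, len(theta)):                 # T_k = 2 L T_{k-1} - T_{k-2}
        t_next = [2 * a - b for a, b in zip(matvec(L_scaled, t_curr), t_prev)]
        y = [yi + theta[k] * t for yi, t in zip(y, t_next)]
        t_prev, t_curr = t_curr, t_next
    return y
```

A K-term filter aggregates signal from each node's K-hop neighbourhood, which is what lets the convolution respect the PPI topology.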





□ Clair3: Symphonizing pileup and full-alignment for deep learning-based long-read variant calling

>> https://www.biorxiv.org/content/10.1101/2021.12.29.474431v1.full.pdf

Clair3 is the 3rd generation of Clair and Clairvoyante. The Clair3 method is not restricted to a certain sequencing technology. It should work particularly well in terms of both runtime and performance on noisy data.

Clair3 integrates both pileup model and full-alignment model for variant calling. While a pileup model determines the result of a majority of variant candidates, candidates with uncertain results are further processed with a more intensive haplotype-resolved full-alignment model.
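The pileup/full-alignment division of labour can be sketched as a simple dispatcher; both model callables and the confidence cutoff below are hypothetical stand-ins, not Clair3's API.

```python
# Sketch of a two-stage caller: a cheap pileup model handles most candidates,
# and only low-confidence calls are escalated to a costlier full-alignment
# model. Models and the cutoff are hypothetical stand-ins.

def call_variants(candidates, pileup_model, full_alignment_model, cutoff=0.9):
    calls = []
    for cand in candidates:
        genotype, confidence = pileup_model(cand)
        if confidence < cutoff:            # uncertain: re-examine carefully
            genotype, confidence = full_alignment_model(cand)
        calls.append((cand, genotype, confidence))
    return calls
```

Because only a minority of candidates are escalated, the expensive haplotype-resolved model runs on a small fraction of sites, which is where the runtime savings come from.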





□ scGET: Predicting Cell Fate Transition During Early Embryonic Development by Single-cell Graph Entropy

>> https://www.sciencedirect.com/science/article/pii/S1672022921002539

scGET accurately predicts all the impending cell fate transitions. scGET provides a new way to analyze the scRNA-seq data and helps to track the dynamics of biological systems from the perspectives of network entropy.

The Single-Cell Graph Entropy (SGE) value quantitatively characterizes the stability and criticality of gene regulatory networks among cell populations and thus can be employed to detect the critical signal of cell fate or lineage commitment at the single-cell level.





□ GLRP: Stability of feature selection utilizing Graph Convolutional Neural Network and Layer-wise Relevance Propagation

>> https://www.biorxiv.org/content/10.1101/2021.12.26.474194v1.full.pdf

GLRP implements a graph convolutional layer of the GCNN as a Keras layer so that the SHAP (SHapley Additive exPlanations) explanation method can also be applied to a Keras version of a GCNN model.

GCNN+LRP shows the highest stability among the compared feature selection methods, including GCNN+SHAP. A GLRP subnetwork of an individual patient is on average substantially more connected (and interpretable) than a GCNN+SHAP subnetwork, which consists mainly of single vertices.





□ isoformant: A visual toolkit for reference-free long-read isoform analysis at single-read resolution

>> https://www.biorxiv.org/content/10.1101/2021.12.17.457386v1.full.pdf

isoformant, an alternative approach that derives isoforms by generating consensus sequences from long reads clustered on k-mer density without the requirement for a reference genome or prior annotations.

isoformant was developed based on the concept that an individual long-read isoform can be uniquely identified by its constituent k-mer composition. For an appropriate length k, each unique read in a mixture can be represented by a correspondingly unique k-mer frequency vector.
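The k-mer representation can be sketched directly: each read becomes a vector of k-mer frequencies, so reads from the same isoform land near each other before clustering. The choice of k and the frequency normalisation here are assumptions for illustration.

```python
# Sketch: represent a long read by its k-mer frequency vector.

from collections import Counter

def kmer_frequency_vector(read, k=5):
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    # normalise counts to frequencies so reads of different lengths compare
    return {kmer: n / total for kmer, n in counts.items()}
```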





□ contrastiveVI: Isolating salient variations of interest in single-cell transcriptomic data with contrastiveVI

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473757v1.full.pdf

contrastiveVI learns latent representations that recover known subgroups of target data points better than previous methods and finds differentially expressed genes that agree with known ground truths.

contrastiveVI encodes each cell as the parameters of a distribution in a low-dimensional latent space. Only target data points are given salient latent variable values; background data points are instead assigned a zero vector for these variables to represent their absence.





□ scRAE: Deterministic Regularized Autoencoders with Flexible Priors for Clustering Single-cell Gene Expression Data

>> https://arxiv.org/pdf/2107.07709.pdf

There is a bias-variance trade-off with the imposition of any prior on the latent space in the finite data regime.

scRAE is a generative AE for single-cell RNA sequencing data, which can potentially operate at different points of the bias-variance curve.

scRAE consists of a deterministic AE with a flexibly learnable prior-generator network, which is jointly trained with the AE. This allows scRAE to trade off better between bias and variance in the latent space.





□ scIAE: an integrative autoencoder-based ensemble classification framework for single-cell RNA-seq data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab508/6463428

scIAE, an integrative autoencoder-based ensemble classification framework, first performs multiple random projections and applies integrative and devisable autoencoders (integrating stacked, denoising and sparse autoencoders) to obtain compressed representations.

Then base classifiers are built on the lower-dimensional representations and the predictions from all base models are integrated. The comparison of scIAE and common feature extraction methods shows that scIAE is effective and robust, independent of the choice of dimension, which is beneficial to subsequent cell classification.





□ PyLiger: Scalable single-cell multi-omic data integration in Python

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474131v1.full.pdf

LIGER is a widely-used R package for single-cell multi-omic data integration. However, many users prefer to analyze their single-cell datasets in Python, which offers an attractive syntax and highly-optimized scientific computing libraries for increased efficiency.

PyLiger offers faster performance than the previous R implementation (2-5× speedup), interoperability with AnnData format, flexible on-disk or in-memory analysis capability, and new functionality for gene ontology enrichment analysis.





□ Dynamic Suffix Array with Polylogarithmic Queries and Updates

>> https://arxiv.org/pdf/2201.01285.pdf

the first data structure that supports both suffix array queries and text updates in O(polylog n) time, achieving O(log^4 n) time for queries and O(log^{3+o(1)} n) time for updates.

The structure is complemented by a hardness result: unless the Online Matrix-Vector Multiplication (OMv) Conjecture fails, no data structure with O(polylog n)-time suffix array queries can support the "copy-paste" operation in O(n^{1−ε}) time for any ε > 0.





□ SHAHER: A novel framework for analysis of the shared genetic background of correlated traits

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472525v1.full.pdf

SHAHER is versatile and applicable to summary statistics from GWASs with arbitrary sample sizes and sample overlaps, allows incorporation of different GWAS models (Cox, linear and logistic) and is computationally fast.

SHAHER is based on the construction of a linear combination of traits by maximizing the proportion of its genetic variance explained by the shared genetic factors. SHAHER requires only full GWAS summary statistics and matrices of genetic and phenotypic correlations.





□ Stacked-SGL: Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab848/6462433

Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a subset of sparse feature groups and introducing sparsity within each group.

Stacked SGL satisfies the criteria of prediction, stability and selection based on the sparse group lasso penalty by stacking. Stacked SGL weakens feature selection, because it selects a feature if and only if the meta-learner selects the base learner that selects that feature.
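The sparse group lasso penalty the entry refers to can be written out as a function: a mixing parameter α interpolates between the lasso term (within-group sparsity) and the group-lasso term (whole groups in or out). The √(group size) weights follow the common SGL convention, and all values in the example are illustrative.

```python
# Sketch of the sparse group lasso penalty:
#   lam * ( alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(|g|) * ||beta_g||_2 )

import math

def sgl_penalty(beta, groups, alpha, lam):
    lasso = sum(abs(b) for b in beta)                       # within-group sparsity
    group = sum(math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
                for g in groups)                            # group-level sparsity
    return lam * (alpha * lasso + (1 - alpha) * group)
```

At α = 1 this reduces to the lasso and at α = 0 to the group lasso, which is the trade-off the stacking approach tunes per dataset.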





□ MultiVelo: Single-cell multi-omic velocity infers dynamic and decoupled gene regulation

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472472v1.full.pdf

MultiVelo uses a probabilistic latent variable model to estimate the switch time and rate parameters of gene regulation, providing a quantitative summary of the temporal relationship between epigenomic and transcriptomic changes.

MultiVelo accurately recovers cell lineages and quantifies the length of priming and decoupling intervals in which chromatin accessibility and gene expression are temporarily out of sync.





□ LocCSN: Constructing local cell-specific networks from single-cell data

>> https://www.pnas.org/content/118/51/e2113178118

locCSN estimates cell-specific networks (CSNs) for each cell, preserving information about cellular heterogeneity that is lost with other approaches.

LocCSN is based on a nonparametric investigation of the joint distribution of gene expression; hence it can readily detect nonlinear correlations, and it is more robust to distributional challenges.





□ CTSV: Identification of Cell-Type-Specific Spatially Variable Genes Accounting for Excess Zeros

>> https://www.biorxiv.org/content/10.1101/2021.12.27.474316v1.full.pdf

CTSV can achieve more power than SPARK-X in detecting cell-type-specific SV genes and also outperforms other methods at the aggregated level.

CTSV directly models spatial raw count data and considers zero-inflation as well as overdispersion using a zero-inflated negative binomial distribution. It then incorporates cell-type proportions and spatial effect functions in the zero-inflated negative binomial regression framework.





□ TSSN: A New Method for Recognizing Protein Complexes Based on Protein Interaction Networks and GO Terms

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.792265/full

Topology and Semantic Similarity Network (TSSN) can filter the noise of PPI data. TSSN uses a new algorithm, called Neighbor Nodes of Proteins (NNP), for recognizing protein complexes by considering their topology information.

TSSN computes the edge aggregation coefficient as the topology characteristics of N, makes use of the GO annotation as the biological characteristics of N, and then constructs a weighted network. NNP identifies protein complexes based on this weighted network.





□ Thresholding Approach for Low-Rank Correlation Matrix based on MM algorithm

>> https://www.biorxiv.org/content/10.1101/2021.12.28.474401v1.full.pdf

Low-rank approximation is a very useful approach for interpreting the features of a correlation matrix; however, a low-rank approximation may result in an estimate far from zero even if the corresponding original value was close to zero.

The method estimates a sparse low-rank correlation matrix based on threshold values combined with cross-validation. The MM algorithm is used to estimate the sparse low-rank correlation matrix, and a grid search selects the threshold values related to the sparse estimation.
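The thresholding step itself is simple to sketch: entries of the low-rank correlation estimate whose absolute value falls below a threshold are set to zero, with the diagonal kept at 1. Choosing the threshold by cross-validation and the MM-based low-rank estimation are the paper's actual contributions and are not shown here.

```python
# Sketch: hard-threshold the off-diagonal entries of a correlation matrix.

def threshold_correlation(R, t):
    n = len(R)
    return [[1.0 if i == j else (R[i][j] if abs(R[i][j]) >= t else 0.0)
             for j in range(n)] for i in range(n)]
```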





□ Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab870/6493233

Pairs is a block-compressed text file format for storing paired genomic coordinates from Hi-C data, and Pairix is a stand-alone C program written on top of tabix as a tool for the 4DN-standard pairs file format describing Hi-C data.

However, Pairix can be used as a generic tool for indexing and querying any bgzipped text file containing genomic coordinates, for either 2D- or 1D- indexing and querying.





□ ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs

>> https://www.biorxiv.org/content/10.1101/2022.01.02.473666v1.full.pdf

ClusTrast, a de novo transcript isoform assembler that clusters a set of guiding contigs by similarity, aligns short reads to the guiding contigs, and assembles each clustered set of short reads individually.

ClusTrast combines two assembly methods, Trans-ABySS and Shannon, and incorporates a novel approach to clustering and cluster-wise assembly of short reads. The final step of ClusTrast is to merge the cluster-wise assemblies with the primary assembly by concatenation.





□ TIPars: Robust expansion of phylogeny for fast-growing genome sequence data

>> https://www.biorxiv.org/content/10.1101/2021.12.30.474610v1.full.pdf

TIPars, an algorithm that inserts sequences into a reference phylogeny based on a parsimony criterion, with the aid of a full multiple sequence alignment of taxa and pre-computed ancestral sequences.

TIPars searches for the insertion position by calculating the triplet-based minimal substitution score for the query sequence on all branches. TIPars showed promising taxa placement and insertion accuracy in phylogenies with homogeneous and divergent sequences.





□ Clustering Deviation Index (CDI): A robust and accurate unsupervised measure for evaluating scRNA-seq data clustering

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474840v1.full.pdf

Clustering Deviation Index (CDI), which measures the deviation of any clustering label set from the observed single-cell data. CDI is an unsupervised evaluation index whose calculation does not rely on the actual unobserved label set.

CDI calculates the negative penalized maximum log-likelihood of the selected feature genes based on the candidate label set. CDI also informs the optimal tuning parameters for any given clustering method and the correct number of cluster components.





□ Cobolt: integrative analysis of multimodal single-cell sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02556-z

Cobolt, a novel method that not only allows for analyzing the data from joint-modality platforms, but provides a coherent framework for the integration of multiple datasets measured on different modalities.

Cobolt’s generative model for a single modality i starts by assuming that the counts measured on a cell are the mixture of the counts from different latent categories.

Cobolt estimates this joint representation via a novel application of Multimodal Variational Autoencoder (MVAE) to a hierarchical generative model. Cobolt results in an estimate of the latent variable for each cell, which is a vector in a K-dimensional space.





□ STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac001/6497782

In order to exploit the information contained in KGs through machine learning algorithms, numerous KG embedding models have been developed to encode the entities and relations of KGs in a higher dimensional vector space while attempting to retain their structural properties.

STonKGs uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature assembled by Integrated Network and Dynamical Reasoning Assembler (INDRA) to learn joint representations in a shared embedding space.





□ am: Implementation of a practical Markov chain Monte Carlo sampling algorithm in PyBioNetFit

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac004/6497784

the implementation of a practical MCMC method in the open-source software package PyBioNetFit (PyBNF), which is designed to support parameterization of mathematical models for biological systems.

am, the new MCMC method that incorporates an adaptive move proposal distribution. Sampling can be initiated at a specified location in parameter space and with a multivariate Gaussian proposal distribution defined initially by a specified covariance matrix.
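A toy 1-D Metropolis sampler with an adaptive proposal scale is sketched below. The "am" method uses a multivariate Gaussian proposal with an adapted covariance matrix, so this scalar widen-on-accept/narrow-on-reject rule is only an illustrative assumption, not PyBNF's algorithm.

```python
# Toy Metropolis sampler whose proposal scale adapts as the chain runs.
# The adaptation rule here is a simple illustrative heuristic.

import math, random

def adaptive_metropolis(log_post, x0, n_steps):
    x, scale, samples = x0, 1.0, []
    for _ in range(n_steps):
        prop = x + random.gauss(0.0, scale)
        # standard Metropolis accept/reject on the log-posterior
        if math.log(random.random()) < log_post(prop) - log_post(x):
            x = prop
            scale *= 1.01          # accepted: widen the proposal slightly
        else:
            scale *= 0.99          # rejected: narrow it
        samples.append(x)
    return samples
```

Adapting the proposal during sampling avoids the hand-tuning of step sizes that plain Metropolis requires, which is the practical point of the method.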





□ Hierarchical shared transfer learning for biomedical named entity recognition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04551-4

hierarchical shared transfer learning, which combines multi-task learning and fine-tuning and realizes multi-level information fusion between the underlying entity features and the upper-level data features.

The model replaces BERT with XLNet, a self-attention-based permutation language model, as the encoder, avoiding the input-noise problem of autoencoding language models. When fine-tuning on the BioNER task, it decodes the output of the XLNet model with a Conditional Random Field decoder.





□ endoR: Interpreting tree ensemble machine learning models

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474763v1.full.pdf

endoR simplifies the fitted model into a decision ensemble from which it then extracts information on the importance of individual features and their pairwise interactions and also visualizes these data as an interpretable network.

endoR infers true associations with accuracy comparable to other commonly used approaches while easing and enhancing model interpretation. Adjustable regularization and bootstrapping help reduce complexity and ensure that only essential parts of the model are retained.





□ Nm-Nano: Predicting 2′-O-methylation (Nm) Sites in Nanopore RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.01.03.473214v1.full.pdf

The Nm-Nano framework integrates two supervised machine learning models for predicting Nm sites in Nanopore sequencing data, namely XGBoost and Random Forest (RF).

Each model is trained with a set of features that are extracted from the raw signal generated by the Oxford Nanopore MinION device, as well as the corresponding basecalled k-mer resulting from inferring the RNA sequence reads from the generated Nanopore signals.





□ Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9

a comprehensive benchmarking and analysis of 36 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets.

Between-sample normalization has the biggest impact, with counts adjusted by size factors producing networks that most accurately recapitulate known tissue-naive and tissue-aware gene functional relationships.
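One common way to compute the size factors mentioned above is DESeq-style median-of-ratios; the benchmark compares several schemes, and this pure-Python sketch shows just that one, over a genes × samples count table.

```python
# Sketch of median-of-ratios size factors: each sample's factor is the median,
# over genes, of that sample's count divided by the gene's geometric mean.

import math
from statistics import median

def size_factors(counts):
    # keep genes with no zero counts, since log(0) is undefined
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    n_samples = len(counts[0])
    return [median(math.exp(math.log(row[j]) - lgm)
                   for row, lgm in zip(usable, log_geo_means))
            for j in range(n_samples)]
```

Dividing each sample's counts by its size factor puts samples on a common scale before the coexpression network is built.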





□ SCOT: Single-Cell Multiomics Integration

>> https://www.liebertpub.com/doi/full/10.1089/cmb.2021.0477

Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that uses optimal transport to align single-cell multiomics data.

the Gromov-Wasserstein distance in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available.

SCOT finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection.
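Barycentric projection from a coupling matrix can be sketched in a few lines: each cell in domain X is mapped to the coupling-weighted average of the coordinates of domain Y. Here T is any nonnegative matrix with positive row sums; in SCOT it comes from Gromov-Wasserstein optimal transport.

```python
# Sketch of barycentric projection: project X cells into Y's space using a
# coupling matrix T (rows = X cells, columns = Y cells).

def barycentric_projection(T, Y):
    proj = []
    for row in T:
        s = sum(row)  # normalise each row so weights sum to one
        proj.append([sum(t * y[d] for t, y in zip(row, Y)) / s
                     for d in range(len(Y[0]))])
    return proj
```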





□ ABRIDGE: An ultra-compression software for SAM alignment files

>> https://www.biorxiv.org/content/10.1101/2022.01.04.474935v1.full.pdf

ABRIDGE, an ultra-compressor for SAM files offering users both lossless and lossy compression options. This reference-based file compressor achieves the best compression ratio among the compared compression software, ensuring lower space demand and faster file transmission.

ABRIDGE accepts a single SAM file as input and returns a compressed file that occupies less space than its BAM or CRAM counterpart. ABRIDGE compresses alignments after retaining only non-redundant information.

ABRIDGE accumulates all reads that are mapped onto the same nucleotide on a reference. ABRIDGE modifies the traditional CIGAR string to store soft-clips, mismatches, insertions, deletions, and quality scores thereby removing the need to store the MD string.




Lagrange Point.

2021-12-31 22:17:36 | Science News




□ DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab859/6482742

DeepSVP significantly improves the success rate of finding causative variants over StrVCTVRE and CADD-SV. DeepSVP uses as input an annotated VCF file of an individual and clinical phenotypes encoded using the Human Phenotype Ontology.

DeepSVP overcomes the limitation of missing phenotypes by incorporating information related to genes through ontologies, mainly the functions of gene products, gene expression in individual cell types, and anatomical sites of expression, and systematically relating them to their phenotypic consequences through ontologies.





□ MultiMAP: dimensionality reduction and integration of multimodal data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02565-y

MultiMAP is based on a framework of Riemannian geometry and algebraic topology and generalizes the UMAP framework to the setting of multiple datasets each with different dimensionality.

MultiMAP takes as input any number of datasets of potentially differing dimensions and recovers geodesic distances on a single latent manifold on which all of the data is uniformly distributed.





□ MSRCall: A Multi-scale Deep Neural Network to Basecall Oxford Nanopore Sequences

>> https://www.biorxiv.org/content/10.1101/2021.12.20.471615v1.full.pdf

MSRCall first uses convolutional layers to perform multi-scale downsampling. These back-to-back convolutional layers aim to capture features with receptive fields at different levels of complexity.

MSRCall simultaneously utilizes multi-scale convolutional and bidirectional LSTM layers to capture semantic information. MSRCall disentangles the relationship between raw signal data and nucleotide labels.





□ cLoops2: a full-stack comprehensive analytical tool for chromatin interactions

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab1233/6470683

cLoops2 consists of core modules for peak-calling, loop-calling, differentially enriched loops calling and loops annotation. cLoops2 addresses the practical analysis requirements, especially for loop-centric analysis with preferential design for Hi-TrAC/TrAC-looping data.

cLoops2 directly analyzes the paired-end tags to find candidate peaks and loops. It estimates the statistical significance of the peak/loop features with a permuted local background, eliminating the bias introduced by tuning third-party peak-calling parameters for calling loops.





□ CMIA: Gene regulation network inference using k-nearest neighbor-based mutual information estimation- Revisiting an old DREAM

>> https://www.biorxiv.org/content/10.1101/2021.12.20.473242v1.full.pdf

the MI-based kNN Kraskov-Stögbauer-Grassberger (KSG) algorithm leads to a significant improvement in GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR).

CMIA (Conditional Mutual Information Augmentation), a novel inference algorithm inspired by Synergy-Augmented CLR. Looking forward, the goal of complete reconstruction of GRNs may require new inference algorithms and probably mutual information (MI) estimation in more than three dimensions.





□ CoRE-ATAC: A deep learning model for the functional classification of regulatory elements from single cell and bulk ATAC-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009670

CoRE-ATAC can infer regulatory functions in diverse cell types, capture activity differences modulated by genetic mutations, and can be applied to single cell ATAC-seq data to study rare cell populations.

CoRE-ATAC integrates DNA sequence data with chromatin accessibility data using a novel ATAC-seq data encoder that is designed to be able to integrate an individual’s genotype with the chromatin accessibility maps by inferring the genotype from ATAC-seq read alignments.





□ CosNeti: ComplexOme-Structural Network Interpreter used to study spatial enrichment in metazoan ribosomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04510-z

CosNeti translates experimentally determined structures into graphs, with nodes representing proteins and edges the spatial proximity between them. CosNeti considers rProteins and ignores rRNA and other objects.

Spatial regions are defined using a random walk with restart methodology, followed by a procedure to obtain a minimum set of regions that cover all proteins in the complex.

Structural coherence is achieved by applying weights to the edges reflecting the physical proximity between purportedly contacting proteins. The weighting probabilistically guides the random-walk path trajectory.
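The random walk with restart used to define spatial regions can be sketched as the fixed-point iteration p ← (1−r)·W·p + r·e over a column-normalised weighted adjacency matrix W, with e restarting the walk at a seed protein. The restart probability and iteration count below are illustrative, not CosNeti's settings.

```python
# Sketch of random walk with restart (RWR) on a column-normalised weighted
# adjacency matrix W. Returns the stationary visiting probabilities.

def random_walk_with_restart(W, seed, r=0.3, n_iter=100):
    n = len(W)
    e = [1.0 if i == seed else 0.0 for i in range(n)]  # restart vector
    p = e[:]
    for _ in range(n_iter):
        p = [(1 - r) * sum(W[i][j] * p[j] for j in range(n)) + r * e[i]
             for i in range(n)]
    return p
```

Nodes with high stationary probability are those well connected to the seed, which is what makes RWR a natural way to carve out spatial regions around a protein.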





□ 2FAST2Q: A general-purpose sequence search and counting program for FASTQ files

>> https://www.biorxiv.org/content/10.1101/2021.12.17.473121v1.full.pdf

2FAST2Q, a versatile and intuitive standalone program capable of extracting and counting feature occurrences in FASTQ files.

2FAST2Q can be used in any experimental setup that requires feature extraction from raw reads, being able to quickly handle mismatch alignments, nucleotide-wise Phred score filtering, custom read trimming, and sequence searching within a single program.





□ Integration of public DNA methylation and expression networks via eQTMs improves prediction of functional gene-gene associations

>> https://www.biorxiv.org/content/10.1101/2021.12.17.473125v1.full.pdf

MethylationNetwork can identify experimentally validated interacting pairs of genes that could not be identified in the RNA-seq datasets.

an integration pipeline based on kernel cross-correlation matrix decomposition. Using this pipeline, the authors integrated GeneNetwork and MethylationNetwork and used the integrated results to predict functional gene–gene correlations that are collected in the STRING database.





□ FineMAV: Prioritising positively selected variants in whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04506-9

Fine-Mapping of Adaptation Variation (FineMAV) is a statistical method that prioritizes functional SNP candidates under selection and depends upon population differentiation.

A stand-alone application that can perform FineMAV calculations on whole-genome sequencing data and can output bigWig files which can be used to graphically visualise the scores on genome browsers.





□ GraphOmics: an interactive platform to explore and integrate multi-omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04500-1

GraphOmics provides an interactive platform that integrates data to Reactome pathways emphasising interactivity and biological contexts. This avoids the presentation of the integrated omics data as a large network graph or as numerous static tables.

GraphOmics offers a way to perform pathway analysis separately on each omics, and integrate the results at the end. The separate pathway analysis results run on different omics datasets can be combined with an AND operator in the Query Builder.





□ anndata: Annotated data

>> https://www.biorxiv.org/content/10.1101/2021.12.16.473007v1.full.pdf

AnnData makes a particular choice for data organization that has been left unaddressed by packages like scikit-learn or PyTorch, which model input and output of model transformations as unstructured sets of tensors.

The AnnData object is a collection of arrays aligned to the common dimensions of observations (obs) and variables (var).

Storing low-dimensional manifold structure within a desired reduced representation is achieved through a k-nearest neighbor graph in form of a sparse adjacency matrix: a matrix of pairwise relationships of observations.
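The aligned-dimensions convention can be illustrated with a toy container (not the real anndata API): a data matrix X of shape (n_obs, n_var) whose annotations must all share the observation and variable dimensions.

```python
# Toy illustration of AnnData's central convention: every array is aligned
# to the observation (obs) and variable (var) dimensions of X.

class ToyAnnData:
    def __init__(self, X, obs_names, var_names):
        # enforce the alignment invariant at construction time
        assert len(X) == len(obs_names)
        assert all(len(row) == len(var_names) for row in X)
        self.X, self.obs_names, self.var_names = X, obs_names, var_names

    @property
    def shape(self):
        return (len(self.obs_names), len(self.var_names))
```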





□ Class similarity network for coding and long non-coding RNA classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04517-6

Class Similarity Network considers more relationships among input samples in a direct way. It focuses on exploring the potential relationships between input samples and samples from both the same class and the different classes.

Class Similarity Network trains parameters specific to each class to obtain high-level features. The Fully Connected module learns parameters from different dense branches to integrate similarity information. The Decision module concatenates the nodes to make the prediction.





□ FCLQC: fast and concurrent lossless quality scores compressor

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04516-7

FCLQC achieves a comparable compression rate while being much faster than the baseline algorithms. FCLQC uses concurrent programming to achieve fast compression and decompression.

Concurrent programming executes program components independently, not necessarily simultaneously, which is different from error-prone parallel computing. FCLQC shows at least a 31x compression speed improvement, with a compression-ratio degradation of at most 13.58%.





□ ADClust: A Parameter-free Deep Embedded Clustering Method for Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.12.19.473334v1.full.pdf

ADClust first obtains low-dimensional representations through a pre-trained autoencoder, and uses the representations to cluster cells into initial micro-clusters.

The micro-clusters are then compared with one another through a statistical test for unimodality called the Dip-test to detect similar micro-clusters, and similar micro-clusters are merged by jointly optimizing the carefully designed clustering and autoencoder loss functions.





□ fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

>> https://www.biorxiv.org/content/10.1101/2021.12.20.473431v1.full.pdf

The fastMSA framework, consisting of query sequence encoder and context sequences encoder, can improve the scalability and speed of multiple sequence alignment significantly.

fastMSA utilizes the query sequences to search UniRef90 using JackHMMER v3.3 and builds the resulting MSAs as ground truth. By filtering out the unrelated sequences in the low-dimensional space before performing MSA, fastMSA can accelerate the process 35-fold.





□ XAE4Exp: Explainable autoencoder-based representation learning for gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473742v1.full.pdf

XAE4Exp (eXplainable AutoEncoder for Expression data) integrates AE and SHapley Additive exPlanations (SHAP), a flagship technique in the field of eXplainable AI (XAI).

XAE4Exp quantitatively evaluates the contributions of each gene to the hidden structure learned by an AE, substantially improving the explainability of AE outcomes.
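The gene-contribution idea can be illustrated with a crude occlusion score in place of SHAP (a hypothetical toy with a linear one-unit "encoder"; XAE4Exp uses a trained AE and proper SHAP values).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))               # 200 samples x 4 genes
W = np.array([[2.0], [0.5], [0.0], [0.0]])  # toy 1-D "latent" encoder weights

def latent(X):
    return X @ W

# Occlusion-style attribution: how much does the latent variance drop
# when one gene is replaced by its mean? (A crude stand-in for SHAP.)
base = latent(X).var()
scores = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = Xp[:, j].mean()
    scores.append(base - latent(Xp).var())

ranking = np.argsort(scores)[::-1]
print(ranking[0])  # gene 0, with the largest weight, contributes most
```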





□ DeepLOF: A deep learning framework for predicting human essential genes from population and functional genomic data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473690v1.full.pdf

DeepLOF, an evolution-based deep learning model for predicting human genes intolerant to LOF mutations. DeepLOF can integrate genomic features and population genomic data to predict LOF-intolerant genes without human-labeled training data.

DeepLOF combines the neural network-based beta prior distribution with the population genetics-based likelihood function to obtain a posterior distribution of η, which represents their belief about LOF intolerance after integrating genomic features and population genomic data.





□ CSNet: Estimating cell-type-specific gene co-expression networks from bulk gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473558v1.full.pdf

For finite sample cases, it may be desirable to ensure the positive definiteness of the final estimator. One strategy is to solve a constrained optimization problem to find the nearest correlation matrix in Frobenius norm.

CSNet, a sparse estimator w/ a SCAD penalty. The authors derive the non-asymptotic convergence rate of CSNet in spectral norm and establish variable selection consistency, ensuring that the edges in the cell-type-specific networks can be correctly identified w/ probability tending to 1.





□ NanoGeneNet: Using Deep Learning for Gene Detection and Classification in Raw Nanopore Signals

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473143v1.full.pdf

NanoGeneNet, a neural network-based method capable of detecting and classifying specific genomic regions already in raw nanopore signals – squiggles.

Therefore, the basecalling process can be omitted entirely, as the raw signals of significant genes or intergenic regions can be analysed directly; if nucleotide sequences are required, the identified squiggles can be basecalled in preference to others.





□ binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473795v1.full.pdf

binny, a binning tool that produces high-quality metagenome-assembled genomes from both contiguous and highly fragmented genomes.

binny uses k-mer-composition and coverage by metagenomic reads for iterative, non-linear dimension reduction of genomic signatures as well as subsequent automated contig clustering with cluster assessment using lineage-specific marker gene sets.





□ Baltica: integrated splice junction usage analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473966v1.full.pdf

Baltica, a framework that provides workflows for quality control, de novo transcriptome assembly with StringTie2, and currently 4 DJU methods: rMATS, JunctionSeq, Majiq, and LeafCutter.

Baltica uses 2 datasets: the first uses Spike-in RNA Variant Control Mixes (SIRVs), and the second consists of paired Illumina and Oxford Nanopore Technologies sequencing. Baltica's integration allows comparison of the performance of different DJU methods and tests the usability of a meta-classifier.





□ bulkAnalyseR: An accessible, interactive pipeline for analysing and sharing bulk sequencing results

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473982v1.full.pdf

Critically, neither VIPER, nor BioJupies offer support for more complex differential expression (DE) tasks, beyond simple pair-wise comparisons. This limits the biological interpretations from more complex experimental designs.

bulkAnalyseR provides an accessible, yet flexible framework for the analysis of bulk sequencing data without relying on prior programming expertise. The users can create a shareable shiny app in two lines of code, from an expression matrix and a metadata table.





□ ePat: extended PROVEAN annotation tool

>> https://www.biorxiv.org/content/10.1101/2021.12.21.468911v1.full.pdf

The 'ePat' extends the conventional PROVEAN to enable the following two things, which the conventional PROVEAN could not do.

ePat is able to calculate the pathogenicity of variants near the splice junction, frameshift, stop gain, and start lost. In addition, batch processing is used to calculate the pathogenicity of all variants in a VCF file in a single step.





□ A guide to trajectory inference and RNA velocity

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473434v1.full.pdf

Whereas traditional trajectory inference methods reconstruct cellular dynamics given a population of cells of varying maturity, RNA velocity relies on a dynamical model describing splicing dynamics.

However, pseudotime is based solely on transcriptional information, so it cannot be interpreted as an estimator of the true time since initial differentiation.

Rather, it is a high-resolution estimate of cell state, which is likely to be monotonically related to the true chronological time, but there is no guarantee that equivalent changes in transcriptional profiles follow a similar chronological time.





□ GeneTonic: an R/Bioconductor package for streamlining the interpretation of RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04461-5

GeneTonic serves as a comprehensive toolkit for streamlining the interpretation of functional enrichment analyses, by fully leveraging the information of expression values in a differential expression context.

GeneTonic is not structured as an end-to-end workflow including quantification, preprocessing, exploratory data analysis, and DE modeling—all operations that are also time consuming, but in many scenarios need to be carried out only once.





□ The impact of low input DNA on the reliability of DNA methylation as measured by the Illumina Infinium MethylationEPIC BeadChip

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473840v1.full.pdf

This study demonstrates that although as little as 40 ng is sufficient to produce Illumina Infinium MethylationEPIC BeadChip DNAm data that passes standard QC checks, data quality and reliability diminish as DNA input decreases.

They recommend caution and use of sensitivity analyses when working with less than 200 ng DNA on the Illumina Infinium MethylationEPIC BeadChip.





□ AMC: accurate mutation clustering from single-cell DNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab857/6482741

AMC first employs principal component analysis followed by K-means clustering to find mutation clusters, then infers the maximum likelihood estimates of the genotypes of each cluster.

The inferred genotypes can subsequently be used to reconstruct the phylogenetic tree with high efficiency. AMC uses BIC to jointly determine the best number of mutation clusters and the corresponding genotypes.
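The PCA-then-k-means-then-genotype pipeline can be sketched in NumPy (a toy with two well-separated cell clusters; AMC's maximum-likelihood genotype estimation and BIC-based model selection are replaced here by a rounded cluster mean and a fixed k).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy genotype matrix: 60 cells x 6 mutations, two clear cell clusters
A = np.vstack([rng.normal(0, 0.1, (30, 6)), rng.normal(1, 0.1, (30, 6))])

# Step 1: PCA via SVD, keeping the top 2 components
Z = A - A.mean(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Z @ Vt[:2].T

# Step 2: Lloyd's k-means on the projected cells (k = 2)
C = P[rng.choice(len(P), 2, replace=False)]
for _ in range(50):
    lab = np.argmin(((P[:, None] - C[None]) ** 2).sum(-1), axis=1)
    C = np.array([P[lab == j].mean(axis=0) for j in range(2)])

# Step 3: estimate each cluster's genotype from its members
# (AMC uses maximum likelihood; a simple rounded mean stands in here)
genotypes = np.array([A[lab == j].mean(axis=0).round() for j in range(2)])
print(sorted(genotypes.sum(axis=1).tolist()))  # [0.0, 6.0]: all-0 and all-1
```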





□ LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474111v1.full.pdf

LotuS2 uses only truncated, high-quality reads for sequence clustering (except ITS amplicons), while the read backmapping and seed extension steps restore some of the discarded sequence data.

LotuS2 often reported the fewest ASVs/OTUs, while including more sequence reads in abundance tables. This indicates that LotuS2 has a more efficient usage of input data while covering a larger sequence space per ASV/OTU.




□ EdClust: A heuristic sequence clustering method with higher sensitivity

>> https://www.worldscientific.com/doi/abs/10.1142/S0219720021500360

Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from overestimation of inferred clusters and low clustering sensitivity.

The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH.





□ cDNA-detector: detection and removal of cDNA contamination in DNA sequencing libraries

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04529-2

cDNA-detector provides the option to remove contaminant reads from the alignment to reduce the risk of spurious coverage peaks and variant calls in downstream analysis.

When using cDNA-detector on genomic sequence data, they recommend suppressing the “retrocopy” output, such that only potential vector cDNA candidates are reported. With this strategy, contaminants can be removed from alignments, revealing true signal previously obscured.





□ Artificial intelligence “sees” split electrons

>> https://www.science.org/doi/10.1126/science.abm2445

Chemical bonds between atoms are stabilized by the exchange-correlation (xc) energy, a quantum-mechanical effect in which “social distancing” by electrons lowers their electrostatic repulsion energy.

Kohn-Sham density functional theory (DFT) states that the electron density determines this xc energy, but the density functional must be approximated.

Two exact constraints, the ensemble-based piecewise linear variation of the total energy with respect to fractional electron number and with respect to the fractional electron z-component of spin, require hard-to-control nonlocality.




□ RAxML Grove: An empirical Phylogenetic Tree Database

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab863/6486526

When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shapes.

RAxML Grove currently comprises more than 60,000 inferred trees and respective model parameter estimates from fully anonymized empirical data sets that were analyzed using RAxML and RAxML-NG on two web servers.





□ ifCNV: a novel isolation-forest-based package to detect copy number variations from NGS datasets

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474771v1.full.pdf

About 1500 CNV regions have already been discovered in the human population, accounting for ~12–16% of the entire human genome and making CNV one of the most common types of genetic variation, although the biological impact of the majority of these CNVs remains uncertain.

ifCNV is a CNV detection tool based on read-depth distribution. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples.





□ DICAST: Alternative splicing analysis benchmark

>> https://www.biorxiv.org/content/10.1101/2022.01.05.475067v1.full.pdf

DICAST offers a modular and extensible framework for the analysis of AS integrating 11 splice-aware mapping and eight event detection tools. DICAST allows researchers to employ a consensus approach to consider the most successful tools jointly for robust event detection.

While DICAST introduces a unifying standard for AS event reporting, AS event detection tools utilize inherently different approaches and lead to inconsistent results.





□ scNAME: Neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac011/6499267

scNAME incorporates a mask estimation task for gene pertinence mining and a neighborhood contrastive learning framework for cell intrinsic structure exploitation.

scNAME adopts a neighborhood contrastive paradigm with an offline memory bank, global in scope, which encourages discriminative feature representations and achieves intra-cluster compactness together with inter-cluster separation.





lens, align. Awards 2021.

2021-12-31 21:12:36 | Music20

(Comet Leonard, C/2021 A1: Photo by Michael Jäger)


Introducing my personal best tracks of 2021.



□ Ludovico Einaudi / “Twice” (Reimagined by Mercan Dede)

Dark Arabtronica fusing Sufism and club music. Mercan Dede lets us hear the phrases of Einaudi's original through an interpretation on an entirely different plane.



□ Thomas Bergersen / “Made of Fire” (from the Album “Chapter IV”)

One ultimate form of electronica built around a mixed-voice choir. A track I would urge anyone who loves this kind of work by Enigma, eRa, or Hans Zimmer to hear.


□ Porter Robinson / “Get Your Wish” (from the Album “Nurture”)

The track this DJ, once the darling of the era as a Future Bass prodigy, produced after emerging from a long hiatus overflows with melodies as simple and fresh as a pool of sunlight.



□ Miloš Karadaglić - Einaudi: Full Moon (Arr. Lewin for Guitar)







Don’t Look Up

2021-12-30 22:11:12 | Film


『Don’t Look Up』(Netflix)

>> https://www.netflix.com/jp/title/81252357

A satire that comically depicts people facing the destruction of Earth by a giant comet. It captures the stereotypical modern person, whose humanity is buried under the flood of media information, from a perspective that is comical yet affectionate.

A film that makes you feel that believing in humanity, rather than harboring punitive feelings, is our duty.




□ Ariana Grande & Kid Cudi - Just Look Up (From 'Don’t Look Up')

The official lyric video for a song from the film Don't Look Up.

A melancholic, epic ballad for a film about a comet impact, and yet on Apple Music the profanity sets off a barrage of bleeps, which is hilarious. [An Explicit version is of course also available.]

Though it is a black comedy, the film's SFX carry the full punch of a big-budget disaster movie, which is a highlight. A chubby DiCaprio is also quite charming.



Adam McKay ... (directed by)

Adam McKay ... (screenplay by)

Adam McKay ... (story by) &
David Sirota ... (story by)

Cast

Leonardo DiCaprio ... Dr. Randall Mindy
Jennifer Lawrence ... Kate Dibiasky
Meryl Streep ... President Orlean
Cate Blanchett ... Brie Evantee
Rob Morgan ... Dr. Teddy Oglethorpe






Provenance.

2021-12-13 22:13:17 | Science News




□ STELLAR: Annotation of Spatially Resolved Single-cell Data

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469947v1.full.pdf

STELLAR (SpaTial cELl LeARning), a geometric deep learning tool for cell-type discovery and identification in spatially resolved single-cell datasets. STELLAR uses a graph convolutional encoder to learn low-dimensional cell embeddings that capture cell topology.

STELLAR learns latent low-dimensional cell representations that jointly capture spatial and molecular similarities of cells that are transferable across different biological contexts.

STELLAR automatically assigns cells to cell types included in the reference set and also identifies cells with unique properties as belonging to a novel type that is not part of the reference set.

The encoder network in STELLAR consists of one fully-connected layer with ReLU activation and a graph convolutional layer with a hidden dimension of 128 in all layers. It uses the Adam optimizer with an initial learning rate of 10⁻³ and weight decay 0.
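A single graph-convolutional layer of this kind can be sketched in NumPy (a schematic stand-in with an illustrative toy adjacency matrix, not STELLAR's actual implementation).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy spatial graph: 4 cells, edges between spatial neighbours
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))          # 4 cells x 8 expression features
W = rng.normal(size=(8, 128)) * 0.1  # hidden dimension 128, as in STELLAR

# One graph-convolutional layer: add self-loops, symmetrically normalise
# the adjacency, then apply a linear map followed by ReLU
A_hat = A + np.eye(4)
d = A_hat.sum(axis=1)
A_norm = A_hat / np.sqrt(np.outer(d, d))
H = np.maximum(A_norm @ X @ W, 0.0)

print(H.shape)  # (4, 128): one 128-dimensional embedding per cell
```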





□ Sparse: Rapid, Reference-Free Human Genotype Imputation with Denoising Autoencoders

>> https://www.biorxiv.org/content/10.1101/2021.12.01.470739v1.full.pdf

Sparse, de-noising autoencoders spanning all bi-allelic SNPs observed in the Haplotype Reference Consortium were developed and optimized.

It offers a generalized approach to unphased human genotype imputation using sparse, denoising autoencoders capable of highly accurate genotype imputation at genotype masking levels (98+%) appropriate for array-based genotyping and low-pass sequencing-based population genetics initiatives.

After merging the results from all genomic segments, the whole chromosome accuracy of autoencoder-based imputation remained superior to all HMM-based imputation tools, across all independent test datasets, and all genotyping array marker sets.

Inference time scales only with the number of variants to be imputed, whereas HMM-based inference time depends on both reference panel and the number of variants to be imputed.





□ Parity and time reversal elucidate both decision-making in empirical models and attractor scaling in critical Boolean networks

>> https://www.science.org/doi/10.1126/sciadv.abf8124

New applications of parity inversion and time reversal to the emergence of complex behavior from simple dynamical rules in stochastic discrete models. These applications underpin a novel attractor identification algorithm implemented for Boolean networks under stochastic dynamics.

Its speed enables resolving a long-standing open question of how attractor count in critical random Boolean networks scales with network size and whether the scaling matches biological observations.

The parity-based encoding of causal relationships and time-reversal construction efficiently reveal discrete analogs of stable and unstable manifolds.

The time reversal of stochastically asynchronous Boolean systems identifies subsets of the state space that cannot be reached from outside. Using parity and time-reversal transformations in tandem, the algorithm efficiently identifies all attractors of large-scale Boolean systems.





□ EXMA: A Genomics Accelerator for Exact-Matching

>> https://arxiv.org/pdf/2101.05314.pdf

EXMA enhances FM-Index search throughput. EXMA first creates a novel table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation.

The EXMA accelerator connects to four DRAM channels, improves search throughput by 4.9×, and enhances search throughput per Watt by 4.8×. EXMA adopts the state-of-the-art Tangram neural network accelerator as its inference engine.





□ MIRA: Joint regulatory modeling of multimodal expression and chromatin accessibility in single cells

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471401v1.full.pdf

MIRA: Probabilistic Multimodal Models for Integrated Regulatory Analysis, a comprehensive methodology that systematically contrasts transcription and accessibility to determine the regulatory circuitry driving cells along developmental continuums.

MIRA leverages joint topic modeling of cell states and regulatory potential modeling of individual gene loci.

MIRA represents cell states in an interpretable latent space, infers high fidelity lineage trees, determines key regulators of fate decisions at branch points, and exposes the variable influence of local accessibility on transcription at distinct loci.





□ scGTM: Single-cell generalized trend model: a flexible and interpretable model of gene expression trend along cell pseudotime

>> https://www.biorxiv.org/content/10.1101/2021.11.25.470059v1.full.pdf

scGTM can provide more informative and interpretable gene expression trends than the GAM and GLM when the count outcome comes from the Poisson, ZIP, NB or ZINB distributions.

scGTM robustly captures the hill-shaped trends for the four distributions and consistently estimates the change time around 0.75, which is where the MAOA gene reaches its expected maximum expression.

The scGTM parameters are estimated by the constrained maximum likelihood estimation via particle swarm optimization (PSO) metaheuristic algorithms.

scGTM is only applicable to a single pseudotime trajectory. A natural extension is to split a multiple-lineage cell trajectory into single lineages and fit the scGTM to each lineage separately. There is a need to develop a variant of PSO or other metaheuristic algorithms.
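Generic PSO of the kind scGTM employs can be sketched as follows (minimizing a stand-in quadratic loss rather than scGTM's constrained likelihood; all parameter values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    # Stand-in for scGTM's negative log-likelihood: minimum at (1, -2)
    return (theta[..., 0] - 1.0) ** 2 + (theta[..., 1] + 2.0) ** 2

n, dim, iters = 30, 2, 200
pos = rng.uniform(-5, 5, (n, dim))   # particle positions
vel = np.zeros((n, dim))             # particle velocities
pbest = pos.copy()                   # per-particle best positions
gbest = pos[np.argmin(loss(pos))].copy()  # global best position

for _ in range(iters):
    r1, r2 = rng.random((2, n, dim))
    # Inertia plus cognitive (pbest) and social (gbest) attraction terms
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    better = loss(pos) < loss(pbest)
    pbest[better] = pos[better]
    gbest = pbest[np.argmin(loss(pbest))].copy()

print(np.round(gbest, 2))  # converges close to [1, -2]
```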





□ ECLIPSER: identifying causal cell types and genes for complex traits through single cell enrichment of e/sQTL-mapped genes in GWAS loci

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469720v1.full.pdf

ECLIPSER (Enrichment of Causal Loci and Identification of Pathogenic cells in Single Cell Expression and Regulation data) maps genes to GWAS loci for a given trait using s/eQTL data and other functional information.

ECLIPSER prioritizes causal genes in GWAS loci driving the enrichment signal in the specific cell types for experimental follow-up.

ECLIPSER is a computational framework that can be applied to single cell or single nucleus (sc/sn)RNA-seq data from multiple tissues and to multiple complex diseases and traits with discovered GWAS associations, and does not require genotype data from the e/sQTL.





□ Heron: Dynamic Pooling Improves Nanopore Base Calling Accuracy

>> https://ieeexplore.ieee.org/document/9616376/

Heron - a high-accuracy GPU nanopore basecaller. Heron uses a dynamic pooling approach that is continuous and differentiable almost everywhere.

Heron time-warps the signal using fractional distances in the pooling space.

• feature vector: f_i = f(x_i) ∈ (0,1)^C
• point importance: w_i = w(x_i) ∈ (0,1)
• length factor: m_i = m(x_i) ∈ (0,1)

Another intriguing goal is to extend dynamic pooling to multiple dimensions.





□ scCODA: a Bayesian model for compositional single-cell data analysis

>> https://www.nature.com/articles/s41467-021-27150-6

scCODA allows for identification of compositional changes in high-throughput sequencing count data, especially cell compositions from scRNA-seq. It also provides a framework for integration of cell-type annotated data directly from scanpy and other sources.

scCODA framework models cell-type counts with a hierarchical Dirichlet-Multinomial distribution that accounts for the uncertainty in cell-type proportions and the negative correlative bias via joint modeling of all measured cell-type proportions instead of individual ones.
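The hierarchical Dirichlet-Multinomial generative model can be illustrated with NumPy (a toy sketch, not scCODA's implementation); note how the fixed per-sample total induces the negative correlative bias.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([20.0, 10.0, 5.0, 1.0])  # concentration per cell type
n_cells = 1000

# Hierarchical Dirichlet-Multinomial: sample proportions, then counts
samples = np.array([
    rng.multinomial(n_cells, rng.dirichlet(alpha)) for _ in range(500)
])

# Rows sum to a fixed total, so a count increase in one cell type
# must be compensated by decreases in others (negative correlative bias)
corr = np.corrcoef(samples.T)
print(corr[0, 1] < 0)  # True: the two largest types are negatively correlated
```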





□ Hubness reduction improves clustering and trajectory inference in single-cell transcriptomic data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab795/6433673

Considering a collection of datasets from the ARCHS4 repository, they constructed the k-NN graphs with or without hubness reduction, then ran the Louvain algorithm and calculated the modularity of the resulting clustering.

The Reverse-Coverage approach, a method based on the size of the respective in-coming neighborhoods, retrieves hubs in a more robust way. Hubness reduction can be used instead of dimensionality reduction, in order to compensate for certain manifestations of the dimensionality curse.





□ DeepSNEM: Deep Signaling Network Embeddings for compound mechanism of action identification

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470365v1.full.pdf

deepSNEM, a novel unsupervised graph deep learning pipeline to encode the information in the compound-induced signaling networks in fixed-length high-dimensional representations.

The core of deepSNEM is a graph transformer network, trained to maximize the mutual information between whole-graph and sub-graph representations that belong to similar perturbations. The 256-dimensional deepSNEM-GT-MI embeddings were clustered using the k-means algorithm.





□ IReNA: integrated regulatory network analysis of single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469628v1.full.pdf

IReNA integrates both bulk and single-cell RNA-seq data with bulk ATAC-seq data to reconstruct modular regulatory networks which provide key transcription factors and intermodular regulations.

IReNA uses Monocle to construct the trajectory and calculate the pseudotime of single cells. IReNA calculates smoothed expression profiles based on pseudotime and divides DEGs into different modules using K-means clustering of the smoothed expression profiles.

IReNA calculates the expression correlation (Pearson's correlation) for each pair of DEGs and selects highly correlated gene pairs that contain at least one transcription factor from the TRANSFAC database as potential regulatory relationships.






□ UNIFAN: Unsupervised cell functional annotation for single-cell RNA-Seq

>> https://www.biorxiv.org/content/10.1101/2021.11.20.469410v1.full.pdf

UNIFAN (Unsupervised Single-cell Functional Annotation) simultaneously clusters and annotates cells with known biological processes, including pathways.

UNIFAN uses an autoencoder that outputs a low-dimensional representation learned from the expression of all genes. UNIFAN combines both the low-dimensional representation and the gene set activity scores to determine the cluster for each cell.





□ Meta-NanoSim: Characterization and simulation of metagenomic nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.19.469328v1.full.pdf

Meta-NanoSim characterizes read length distributions, error profiles, and alignment ratio models. It also detects chimeric read artifacts and quantifies an abundance profile. Meta-NanoSim calculates the deviation between expected and estimated abundance levels.

Meta-NanoSim significantly reduced the length of the unaligned regions. Meta-NanoSim uses kernel density estimation learnt from empirical reads.

Meta-NanoSim records the aligned bases for each sub-alignment towards their source genome, and then uses EM algorithm to assign multi-aligned segments proportionally to their putative source genomes iteratively.
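The EM reassignment of multi-aligned segments can be sketched as follows (a toy illustration with a hypothetical compatibility matrix, not Meta-NanoSim's code).

```python
import numpy as np

# Toy compatibility matrix: 6 reads x 2 candidate source genomes,
# 1 where the read aligns to that genome (the last two reads are ambiguous)
compat = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [1, 1], [1, 1]], float)

abund = np.full(2, 0.5)  # initial abundance estimate
for _ in range(100):
    # E-step: fractionally assign each read in proportion to abundance
    resp = compat * abund
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate abundances from the soft assignments
    abund = resp.sum(axis=0) / resp.sum()

print(np.round(abund, 2))  # converges to [0.75, 0.25]
```

The fixed point splits the two ambiguous reads 3:1, matching the ratio of uniquely assigned reads.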





□ KCOSS: an ultra-fast k-mer counter for assembled genome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab797/6443080

KCOSS fulfills k-mer counting mainly for assembled genomes with segmented Bloom filter, lock-free queue, lock-free thread pool, and cuckoo hash table.

KCOSS optimizes running time and memory consumption by recycling memory blocks, merging multiple consecutive first-occurrence k-mers into C-read, and writing a set of C-reads to disk asynchronously.
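A minimal single-filter sketch of the Bloom-filter strategy (KCOSS itself uses a segmented filter together with a cuckoo hash table and lock-free structures; this Python toy only shows how a membership filter lets first-occurrence k-mers bypass the exact hash table).

```python
from hashlib import blake2b

class Bloom:
    def __init__(self, m=1 << 16, k=3):
        self.bits, self.m, self.k = bytearray(m // 8), m, k
    def _idx(self, item):
        # k salted hashes mapped onto the bit array
        for i in range(self.k):
            h = blake2b(item.encode(), digest_size=8, salt=bytes([i]))
            yield int.from_bytes(h.digest(), "big") % self.m
    def add(self, item):
        seen = True
        for p in self._idx(item):
            if not self.bits[p // 8] >> (p % 8) & 1:
                seen = False
                self.bits[p // 8] |= 1 << (p % 8)
        return seen  # True if the item was (probably) seen before

seq, k = "ACGTACGTTTACGT", 4
bloom, counts = Bloom(), {}
for i in range(len(seq) - k + 1):
    kmer = seq[i:i + k]
    # Only k-mers seen at least twice enter the (smaller) exact hash table;
    # get(kmer, 1) accounts for the first copy held only in the filter
    if bloom.add(kmer):
        counts[kmer] = counts.get(kmer, 1) + 1

print(counts)  # {'ACGT': 3, 'TACG': 2}: only repeated k-mers reach the table
```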





□ On Hilbert evolution algebras of a graph

>> https://arxiv.org/pdf/2111.07399v1.pdf

Hilbert evolution algebras generalize the concept through a framework of Hilbert spaces. This allows one to deal with a wide class of infinite-dimensional spaces.

They define the Hilbert evolution algebra associated to a given graph and the Hilbert evolution algebra associated to the symmetric random walk on a graph. These definitions extend to graphs with infinitely many vertices a theory developed for evolution algebras associated to finite graphs.





□ Higher rank graphs from cube complexes and their spectral theory

>> https://arxiv.org/pdf/2111.09120v1.pdf

There is a strong connection between geometry of CW-complexes, groups and semigroup actions, higher rank graphs and the theory of C∗-algebras.

The difficulty is that there are many ways to associate C∗-algebras to groups, semigroups and CW-complexes, and this can lead to both isomorphic and non-isomorphic C∗-algebras.

They give a generalisation of the Cuntz-Krieger algebras from topological Markov shifts, and a combinatorial definition of a finite k-graph Λ which is decoupled from geometrical realisations.

The existence of an infinite family of combinatorial k-graphs constructed from k-cube complexes. Aperiodicity of a higher rank graph is an important property, because together with cofinality it implies pure infiniteness if every vertex can be reached from a loop with an entrance.





□ Theory of local k-mer selection with applications to long-read alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab790/6432031

os-minimap2: minimap2 with open syncmer capabilities. The authors investigate how different parameterizations lead to runtime and alignment quality trade-offs for ONT cDNA mapping.

The k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment.

They derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers.
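Open syncmer selection itself is simple to state (a toy sketch using the lexicographically smallest s-mer; real implementations typically hash s-mers and may use an offset other than 0).

```python
def open_syncmers(seq, k=7, s=3, t=0):
    """Select k-mers whose minimal s-mer sits at offset t (open syncmers)."""
    picked = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # index of the lexicographically smallest s-mer within the k-mer
        if smers.index(min(smers)) == t:
            picked.append((i, kmer))
    return picked

print(open_syncmers("ACGTACGTACGT"))  # [(0, 'ACGTACG'), (4, 'ACGTACG')]
```

Because selection depends only on the k-mer's own content, two reads sharing a k-mer select it consistently, which is the conservation property the paper analyses.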





□ CellVGAE: an unsupervised scRNA-seq analysis workflow with graph attention networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab804/6448212

CellVGAE leverages the connectivity between cells as an inductive bias to perform convolutions on a non-Euclidean structure, thus subscribing to the geometric deep learning paradigm.

CellVGAE can intrinsically capture information such as pseudotime and NF-κB activation dynamics, the latter being a property that is not generally shared by existing neural alternatives. CellVGAE learns to reconstruct the original graph from the lower-dimensional latent space.





□ Portal: Adversarial domain translation networks enable fast and accurate large-scale atlas-level single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468892v1.full.pdf

Portal, a unified framework of adversarial domain translation to learn harmonized representations of datasets. Portal preserves biological variation during integration, while having significantly reduced running time and memory, achieving integration of millions of cells.

Portal can accurately align cells from complex tissues profiled by scRNA-seq and single-nucleus RNA sequencing (snRNA-seq), and also perform cross-species alignment of the gradient of cells.

Portal can focus only on merging cells of high probability to be of domain-shared cell types, while it remains inactive on cells of domain-unique cell types.

Portal leverages three regularizers to help it find correct and consistent correspondence across domains, including the autoencoder regularizer, the latent alignment regularizer and the cosine similarity regularizer.





□ Polarbear: Semi-supervised single-cell cross-modality translation

>> https://www.biorxiv.org/content/10.1101/2021.11.18.467517v1.full.pdf

Polarbear uses single-assay and co-assay data to train an autoencoder for each modality and then uses just the co-assay data to train a translator between the embedded representations learned by the autoencoders.

Polarbear is able to translate between modalities with improved accuracy relative to BABEL. Polarbear trains one VAE for each type of data, while taking into consideration sequencing depth and batch factors.





□ sc-SynO: Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04469-x

sc-SynO is based on the LoRAS (Localized Random Affine Shadowsampling) algorithm applied to single-cell data. The algorithm corrects for the overall imbalance ratio between the minority and majority classes.

The LoRAS algorithm generates synthetic samples from convex combinations of multiple shadowsamples generated from the rare cell types. The shadowsamples are obtained by adding Gaussian noise to features representing the rare cells.
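The shadowsample-plus-convex-combination step can be sketched as follows (a simplified, hypothetical variant: LoRAS draws shadowsamples from a k-NN neighbourhood of each rare cell rather than from a single parent as here).

```python
import numpy as np

rng = np.random.default_rng(0)
rare = rng.normal(0, 1, (10, 5))  # 10 rare cells x 5 features

def loras_like(rare, n_new=20, n_shadow=5, sigma=0.05):
    synth = []
    for _ in range(n_new):
        parent = rare[rng.choice(len(rare))]
        # Shadowsamples: the parent cell plus small Gaussian noise
        shadows = parent + rng.normal(0, sigma, (n_shadow, rare.shape[1]))
        # Random convex combination of the shadowsamples
        w = rng.dirichlet(np.ones(n_shadow))
        synth.append(w @ shadows)
    return np.array(synth)

synthetic = loras_like(rare)
print(synthetic.shape)  # (20, 5): 20 synthetic rare cells
```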





□ Graph-sc: GNN-based embedding for clustering scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab787/6432030

Graph-sc, a method modeling scRNA-seq data as a graph, processed with a graph autoencoder network to create representations (embeddings) for each cell. The resulting embeddings are clustered with a general clustering algorithm to produce cell class assignments.

Graph-sc is stable across consecutive runs, robust to input down-sampling, generally insensitive to changes in the network architecture or training parameters and more computationally efficient than other competing methods based on neural networks.





□ Asc-Seurat: analytical single-cell Seurat-based web application

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04472-2

Asc-Seurat provides: quality control, via the exclusion of low-quality cells and potential doublets; data normalization, incl. log normalization and SCTransform; dimension reduction; and clustering of the cell populations, incl. selection or exclusion of clusters and re-clustering.

Asc-Seurat is built on three analytical cores. Using Seurat, users explore scRNA-seq data to identify cell types, markers, and DEGs. Dynverse allows the evaluation and visualization of developmental trajectories and identifies DEGs on these trajectories.





□ sc-CGconv: A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468695v1.full.pdf

sc-CGconv introduces a robust-equitable copula correlation (Ccor) measure for constructing the cell-cell graph, leveraging the scale-invariant property of copulas while reducing the computational cost of processing large datasets through structure-aware locality-sensitive hashing (LSH).

sc-CGconv preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through the copula correlation measure, and provides a topology-preserving embedding of cells in a low-dimensional space.





□ PHONI: Streamed Matching Statistics with Multi-Genome References

>> https://ieeexplore.ieee.org/document/9418770/

PHONI (Practical Heuristic ON Incremental matching statistics computation) uses longest-common-extension (LCE) queries to compute the len values at the same time as it computes the pos values.

The matching statistics MS of a pattern P [0..m − 1] with respect to a text T [0..n − 1] are an array of (position, length)-pairs MS[0..m − 1] such that

• P[i..i + MS[i].len − 1] = T[MS[i].pos..MS[i].pos + MS[i].len − 1],
• P[i..i + MS[i].len] does not occur in T.
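The definition can be sanity-checked with a naive quadratic-time reference implementation (a sketch for illustration only; PHONI itself achieves this with an O(r)-space index):

```python
def matching_statistics(P, T):
    """Naive O(m*n) matching statistics MS[0..m-1].
    MS[i] = (pos, len): the longest prefix of P[i:] that occurs in T,
    together with one starting position of such an occurrence in T."""
    m, n = len(P), len(T)
    MS = []
    for i in range(m):
        best_len, best_pos = 0, 0
        for j in range(n):
            # extend the match of P[i:] against T[j:] as far as possible
            l = 0
            while i + l < m and j + l < n and P[i + l] == T[j + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, j
        MS.append((best_pos, best_len))
    return MS
```

For example, with P = "nab" and T = "banana", MS is [(2, 2), (1, 1), (0, 1)]: "na" occurs at T[2..3] but "nab" occurs nowhere in T.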

PHONI is a two-pass algorithm for quickly computing MS, using only an O(r)-space data structure during the first pass, which scans the pattern from right to left in O(m log log n) time.

• φ−1(p) = SA[ISA[p] + 1] (or NULL if ISA[p] = n − 1),
• PLCP[p] = LCP[ISA[p]] (or 0 if ISA[p] = 0),

where SA, ISA, LCP and PLCP are the suffix array, inverse suffix array, longest-common-prefix array and permuted longest-common-prefix array.

PHONI uses Rossi et al.’s construction algorithm for MONI to build the RLBWT and the SLP. PHONI’s query times become faster as the number of reducible positions increases, making the time-expensive LCE queries less frequent.





□ UNBOUNDED ALGEBRAIC DERIVATORS

>> https://arxiv.org/pdf/2111.05918v1.pdf

Proving that the derived category of a Grothendieck category with enough projective objects is the base category of a derivator. Therefore all such categories possess all co/limits and can be organized into a representable derivator.

This derivator is the base for constructing the derivator associated to the derived category by deriving the relevant functors. The framework provides a more general (arbitrary base ring, complexes as coefficients) and simpler approach to some basic theorems of group cohomology.





□ Duesselpore: a full-stack local web server for rapid and simple analysis of Oxford Nanopore Sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468670v1.full.pdf

Duesselpore, a deep sequencing workflow that runs as a local web server and allows the analysis of ONT data anywhere, without requiring additional bioinformatic tools or an internet connection.

Duesselpore performs differential gene expression (DGE) analysis. It also conducts gene set enrichment analyses (GSEA), enrichment analysis based on DisGeNET, and pathway-based data integration and visualization focusing on KEGG.





□ discover: Optimization algorithm for omic data subspace clustering

>> https://www.biorxiv.org/content/10.1101/2021.11.12.468415v1.full.pdf

The ground-truth subspace is rarely the most compact one, and other subspaces may provide biologically relevant information.

discover, an optimization algorithm performing bottom-up subspace clustering on tabular high-dimensional data. It identifies the corresponding sample clusters such that the partitioning of the subspace has the maximal internal clustering score of feature subspaces.





□ REMD-LSTM: A novel general-purpose hybrid model for time series forecasting

>> https://link.springer.com/article/10.1007/s10489-021-02442-y

Empirical Mode Decomposition (EMD) is a typical algorithm for decomposing data according to its time-scale characteristics: it decomposes complex signals into a finite number of Intrinsic Mode Functions (IMFs).

The REMD-LSTM algorithm can solve the problem of marginal effect and mode confusion in EMD. Decomposing time series data into multiple components through REMD can reveal the specific influence of hidden variables in time series data to a certain extent.





□ smBEVO: A computer vision approach to rapid baseline correction of single-molecule time series

>> https://www.biorxiv.org/content/10.1101/2021.11.12.468397v1.full.pdf

Current approaches for drift correction primarily involve either tedious manual assignment of the baseline or unsupervised frameworks such as infinite HMMs coupled with baseline nodes that are computationally expensive and unreliable.

smBEVO estimates the time-varying baseline drift that can in practice be difficult to eliminate in single-molecule experimental modalities. smBEVO provides visually and quantitatively compelling baseline estimation for simulated data w/ multiple types of mild to aggressive drift.




□ FMAlign: A novel fast multiple nucleotide sequence alignment method based on FM-index

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab519/6458932

FMAlign, a novel algorithm to improve the performance of multiple nucleotide sequence alignment. FMAlign uses the FM-index to extract long common segments at a low cost, rather than using a space-consuming hash table.





Rectangle.

2021-12-13 22:12:13 | Science News


"No problem is too small or too trivial if we can really do something about it."



□ BamToCov: an efficient toolkit for sequence coverage calculations

>> https://www.biorxiv.org/content/10.1101/2021.11.12.466787v1.full.pdf

BamToCov, a suite of tools for rapid coverage calculations relying on a memory efficient algorithm and designed for flexible integration in bespoke pipelines. BamToCov processes sorted BAM or CRAM, allowing to extract coverage information using different filtering approaches.

BamToCov uses a streaming approach that requires sorted alignments as input: coverage is computed starting from zero at the leftmost base in each contig and updated on the fly while reading alignments. In terms of speed, BamToCov is second only to MegaDepth.
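The streaming idea can be sketched with a difference array over half-open alignment intervals (a Python stand-in for illustration, not BamToCov's actual implementation):

```python
def streaming_coverage(intervals, contig_len):
    """Simplified per-contig coverage from sorted (start, end) alignment spans.
    A difference array records +1 at each alignment start and -1 at each end;
    one left-to-right pass, starting from zero at the leftmost base,
    accumulates the per-base depth on the fly."""
    diff = [0] * (contig_len + 1)
    for start, end in intervals:
        diff[start] += 1   # one more alignment begins covering here
        diff[end] -= 1     # ...and stops covering here
    coverage, depth = [], 0
    for base in range(contig_len):
        depth += diff[base]
        coverage.append(depth)
    return coverage

# three overlapping alignments on a 10 bp contig
cov = streaming_coverage([(0, 4), (2, 6), (5, 8)], 10)
```

Because the input is sorted, a real streaming implementation never needs the whole per-base array in memory at once; it can emit finished positions as soon as no pending alignment can reach them.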





□ Long-read sequencing settings for efficient structural variation detection based on comprehensive evaluation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04422-y

The authors generate a full range of simulated error-prone long-read datasets covering various sequencing settings and comprehensively evaluate the performance of SV calling with state-of-the-art long-read SV detection methods.

The overall F1 score and Matthews correlation coefficient (MCC) increase along with the coverage, read length, and accuracy rate.

Notably, it is sufficient for sensitive and accurate SV calling in practice when the long-read data reaches 20× coverage, 20 kbp average read length, and approximately 7.5–10% or below 1% error rates (i.e., approximately 90–92.5% or over 99% accuracy).





□ CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009631

CStone, a de Bruijn graph-based de novo assembler for RNA-Seq data that utilizes a classification system to describe graph complexity. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist.

The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism.





□ HAllA: High-sensitivity pattern discovery in large, paired multi-omic datasets

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468183v1.full.pdf

HAllA (Hierarchical All-against-All association testing) efficiently integrates hierarchical hypothesis testing with false discovery rate correction to reveal significant linear and non-linear block-wise relationships among continuous and/or categorical data.

HAllA is an end-to-end statistical method for Hierarchical All-against-All discovery of significant relationships among data features with high power. HAllA preserves statistical power in the presence of collinearity by testing coherent clusters of variables.





□ Meta-Transcriptome Detector (MTD): a novel pipeline for metatranscriptome analysis of bulk and single-cell RNAseq data

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468881v1.full.pdf

Meta-Transcriptome Detector (MTD) supports automatic generation of the count matrix of the microbiome by using raw data in the FASTQ format and the count matrix of host genes from two commonly used single-cell RNA-seq platforms, 10x Genomics and Drop-seq.

MTD has a decontamination step that blacklists the common contaminant microbes in the laboratory environment. Users can easily install and run MTD using only one command and without requiring root privileges.





□ NSB: Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

>> https://www.biorxiv.org/content/10.1101/2021.11.10.468111v1.full.pdf

NSB (No Strand Bias) distance estimator, an algorithm and a tool for computing phylogenetic distances on alignment-free data based on a time-reversible, no strand-bias, 4-parameter evolutionary model called TK4.

A general model like TK4 can offer more accurate distances than the Jukes-Cantor model, which is the simplest yet most dominantly used model in alignment-free phylogenetics. The improvements are most pronounced for larger distances and for higher levels of deviations.





□ Deep-BGCpred: A unified deep learning genome-mining framework for biosynthetic gene cluster prediction

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468547v1.full.pdf

Deep-BGCpred, a deep-learning method for Biosynthetic Gene Clusters (BGCs) identification within genomes. Deep-BGCpred effectively addresses the aforementioned customization challenges that arise in natural product genome mining.

Deep-BGCpred employs a stacked Bidirectional Long Short-Term Memory model to boost accuracy for BGC identifications. It integrates Sliding window strategy and dual-model serial screening, to reduce the number of false positive in BGC predictions.





□ sdcorGCN: Generating weighted and thresholded gene coexpression networks using signed distance correlation

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468627v1.full.pdf

sdcorGCN, a principled method to construct weighted gene coexpression networks using signed distance correlation. These networks contain weighted edges only between those pairs of genes whose correlation value is higher than a given threshold.

COGENT aids the selection of a robust network construction method without the need for any external validation data.

COGENT assists the selection of the optimal threshold value so that only pairs of genes for which the correlation value of their expression exceeds the threshold are connected in the network.
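The construction step can be sketched in a few lines of numpy. As an assumption for brevity, Pearson correlation stands in for the signed distance correlation used by sdcorGCN; only the correlation function would change:

```python
import numpy as np

def thresholded_network(expr, threshold):
    """Build a weighted, thresholded coexpression network.
    `expr` is a genes x samples matrix; edges are kept only where the
    absolute gene-gene correlation exceeds `threshold`."""
    corr = np.corrcoef(expr)        # gene-gene correlation matrix
    np.fill_diagonal(corr, 0.0)     # no self-loops
    adj = np.where(np.abs(corr) > threshold, corr, 0.0)
    return adj

# gene 1 is an affine copy of gene 0; gene 2 is unrelated (alternating signal)
expr = np.vstack([np.arange(20.0),
                  2.0 * np.arange(20.0) + 1.0,
                  np.tile([1.0, -1.0], 10)])
adj = thresholded_network(expr, 0.8)
```

In sdcorGCN, COGENT would be used to pick `threshold` so that the resulting network is robust rather than chosen ad hoc.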




□ GEDI: an R package for integration of transcriptomic data from multiple high-throughput platforms

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468093v1.full.pdf

Gene Expression Data Integration (GEDI) solves all the above-mentioned challenges by leveraging existing R packages to read, re-annotate and merge the transcriptomic datasets, after which the batch effect is removed and the integration is verified.

This results in one transcriptomic dataset annotated with Ensembl or Entrez gene IDs. The batch effect is removed by the BatchCorrection function and verified with a PCA plot and an RLE plot. VerifyGEDI verifies the data integration using a logistic regression model.




□ Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468292v1.full.pdf

DanQ is a recurrent CNN that has already been shown to be able to more accurately predict a number of genomic labels, including chromatin accessibility and DNA methylation, in the human genome than standard CNNs like DeepSEA.

By incorporating sequence data from multiple species, they not only increase the size of the training data set, a critical factor for deep learning models, but also reduce the amount of confounding neutral variation around functional motifs.

Model architectures that can effectively incorporate trans factors, such as chromatin-remodeling TFs on neighboring regulatory elements or small RNA silencing, will likely surpass current methods but their cross-species applicability remains an open question.





□ CLMB: deep contrastive learning for robust metagenomic binning

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468566v1.full.pdf

CLMB improves the performance of bin refinement, reconstructing 8-22 more high-quality genomes and 15-32 more middle-quality genomes than the second-best result.

Vamb is a metagenomic binner which feeds sequence composition information from a contig catalogue and co-abundance information from BAM files into a variational autoencoder and clusters the latent representation.

Impressively, in addition to being compatible with the binning refiner, single CLMB even recovers on average 15 more HQ genomes than the refiner of VAMB and Maxbin on the benchmarking datasets.





□ PheneBank: a literature-based database of phenotypes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab740/6426070

PheneBank is the first to perform concept identification of phenotypic abnormalities directly to 13K Human Phenotype Ontology terms. PheneBank brings API access to a NN model trained on complex sentences from full text articles for identifying concepts.

The PheneBank model exploits latent semantic embeddings to infer text-to-concept mappings in 8 ontologies that would often not be apparent to conventional string matching approaches.





□ SCYN: single cell CNV profiling method using dynamic programming

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-07941-3

SCYN adopts a dynamic programming approach to find optimal single-cell CNV profiles. SCYN manifested more precise copy number inference on scDNA data, with array comparative genomic hybridization results of purified bulk samples as ground truth validation.

SCYN integrates SCOPE, which partitions chromosomes into consecutive bins and computes the cell-by-bin read depth matrix, to process the input BAM files and get the raw and normalized read depth matrices.





□ Idéfix: identifying accidental sample mix-ups in biobanks using polygenic scores

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab783/6430970

Idéfix relies on the comparison of actual phenotypes to PGSs. Idéfix works by modelling the relationships between phenotypes and polygenic scores, and calculating the residuals of the provided samples and their permutations.

Idéfix estimates mix-up rates to select a subset of samples that adhere to a specified maximum mix-up rate.
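The residual-comparison idea can be sketched as a single-phenotype toy (a simplification for illustration; Idéfix models many phenotype-PGS pairs jointly, and the function below is hypothetical):

```python
import numpy as np

def mixup_scores(phenotype, pgs):
    """Fit phenotype ~ polygenic score, then compare each sample's residual
    under the provided pairing against the residuals it would have under all
    alternative pairings. A provided pairing that fits worse than most
    alternatives is a candidate sample mix-up."""
    phenotype, pgs = np.asarray(phenotype, float), np.asarray(pgs, float)
    slope, intercept = np.polyfit(pgs, phenotype, 1)
    predicted = slope * pgs + intercept
    own = np.abs(phenotype - predicted)                  # residual as provided
    # residual of every phenotype against every sample's genetic prediction
    all_resid = np.abs(phenotype[:, None] - predicted[None, :])
    # fraction of alternative pairings that fit better than the provided one
    return (all_resid < own[:, None]).mean(axis=1)

pgs = np.arange(20.0)
pheno = 2.0 * pgs
pheno[3], pheno[16] = pheno[16], pheno[3]   # simulate one sample swap
scores = mixup_scores(pheno, pgs)
```

The two swapped samples score far higher than the correctly paired ones, which is the signal Idéfix thresholds to meet a specified maximum mix-up rate.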





□ Approximate distance correlation for selecting highly interrelated genes across datasets

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009548

Approximate Distance Correlation (ADC) first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across two datasets.

ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. ADC can be applied to datasets ranging from thousands to millions of cells.
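The core statistic, distance correlation, is straightforward to compute; below is a sketch of the (biased) sample estimator for two 1-D vectors (illustrative only, not the ADC codebase):

```python
import numpy as np

def distance_correlation(x, y):
    """Biased sample distance correlation between two 1-D vectors.
    Zero iff (in the population limit) the variables are independent."""
    x = np.asarray(x, float)[:, None]
    y = np.asarray(y, float)[:, None]
    a = np.abs(x - x.T)                      # pairwise distance matrices
    b = np.abs(y - y.T)
    # double-center each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvarx, dvary = (A * A).mean(), (B * B).mean()
    if dvarx * dvary == 0:
        return 0.0
    return np.sqrt(dcov2 / np.sqrt(dvarx * dvary))

dc = distance_correlation(np.arange(10.0), 3.0 * np.arange(10.0) - 2.0)
```

For a perfectly linear relationship the statistic equals 1; unlike Pearson correlation it also detects non-linear, non-monotone dependence.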




□ UVC: Calling small variants using universality with Bayes-factor-adjusted odds ratios

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab458/6427501

Empirical laws to improve variant calling: allele fraction at high sequencing depth is inversely proportional to the cube root of the variant-calling error rate, and odds ratios adjusted with Bayes factors can model various sequencing biases.

UVC outperformed other unique molecular identifier (UMI)-aware variant callers on the datasets used for publishing these variant callers. The executable uvc1 in the bin directory takes one BAM file as input and generates one block-gzipped VCF file as output.





□ ProSolo: Accurate and scalable variant calling from single cell DNA sequencing data

>> https://www.nature.com/articles/s41467-021-26938-w

ProSolo is a variant caller for multiple displacement amplified DNA sequencing data from diploid single cells. It relies on a pair of samples, where one is from an MDA single cell and the other from a bulk sample of the same cell population.

ProSolo uses an extension of the novel latent variable model of Varlociraptor, which already integrates various levels of uncertainty. It adds a layer that accounts for amplification biases and errors of MDA, and makes it possible to properly assess the probability of having a variant.





□ PMD Uncovers Widespread Cell-State Erasure by scRNAseq Batch Correction Methods

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468733v1.full.pdf

Percent Maximum Difference (PMD), a new statistical metric that linearly quantifies batch similarity, together with simulations generating cells from mixtures of distinct gene expression programs.

PMD is provably invariant to the number of clusters found when relative overlap in cluster composition is preserved, operates linearly across the spectrum of batch similarity, and is unaffected by batch size differences or the overall number of cells.

PMD does not require that batches be similar, filling a crucial gap in the field for benchmarking scRNAseq batch correction methods.





□ CRAFT: a bioinformatics software for custom prediction of circular RNA functions

>> https://www.biorxiv.org/content/10.1101/2021.11.17.468947v1.full.pdf

circRNAs can be translated into CEPs, incl. circRNA-specific ones generated by translation of an ORF encompassing the backsplice junction, which are not present in linear transcripts, and circRNAs with a rolling ORF, lacking a stop codon and continuing along the ‘Mobius strip’.

CRAFT (CircRNA Function prediction Tool), allows investigating complex regulatory networks involving circRNAs acting in a concerted way, such as by decoying the same miRNAs or RBP, or miRNAs sharing target genes along with their coding potential.





□ Nonmetric ANOVA: a generic framework for analysis of variance on dissimilarity measures

>> https://www.biorxiv.org/content/10.1101/2021.11.19.469283v1.full.pdf

Based on the central limit theorem (CLT), Nonmetric ANOVA (nmA) is an extension of the cA and npA models in which the metric properties (identity, symmetry, and subadditivity) are relaxed.

nmA allows any dissimilarity measure to be defined between objects and tests the distinctiveness of a specific partitioning. The derivation accommodates an ANOVA-like framework of judgment, indicative of significant dispersion of the partitioned outputs in nonmetric space.





□ STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469113v1.full.pdf

STRling is a method to detect large STR expansions from short-read sequencing data. It is capable of detecting novel STR expansions, that is, expansions where there is no STR in the reference genome at that position.

STRling creates all possible rotations of each k-mer sequence and stores the minimum rotation. It then calculates the proportion of the read accounted for by each k-mer. STRling chooses the representative k-mer as the one that accounts for the greatest proportion of the read.

If multiple k-mers cover equal proportions, it chooses the smallest k-mer. If the representative k-mer exceeds a minimum threshold, STRling considers the read to have sufficient STR content to be informative for detecting STR expansions.
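The rotation-canonicalization and representative-k-mer steps can be sketched directly (an illustrative simplification, not STRling's implementation; here the proportion is computed over k-mer windows rather than bases):

```python
def canonical_rotation(kmer):
    """Minimum rotation: the canonical form shared by all rotations
    of a repeat unit, e.g. 'TA' and 'AT' both map to 'AT'."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))

def representative_kmer(read, k):
    """Count how much of the read each canonical k-mer accounts for, then
    pick the k-mer covering the greatest proportion; ties are broken by
    the lexicographically smallest k-mer."""
    counts = {}
    for i in range(len(read) - k + 1):
        km = canonical_rotation(read[i:i + k])
        counts[km] = counts.get(km, 0) + 1
    if not counts:
        return None, 0.0
    best = min(counts, key=lambda km: (-counts[km], km))  # max count, then smallest
    return best, counts[best] / len(counts_windows := range(len(read) - k + 1)) if False else (best, counts[best] / (len(read) - k + 1))[1]

best, prop = None, 0.0
# simpler call path: recompute cleanly
def representative(read, k):
    counts = {}
    n_windows = len(read) - k + 1
    for i in range(n_windows):
        km = canonical_rotation(read[i:i + k])
        counts[km] = counts.get(km, 0) + 1
    best = min(counts, key=lambda km: (-counts[km], km))
    return best, counts[best] / n_windows

best, prop = representative("ATATCGCGCG", 2)
```

A read passing a minimum threshold on this proportion is considered informative for STR expansion detection, as described above.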





□ Hapl-o-MatGUI: Graphical user interface for the haplotype frequency estimation software

>> https://www.sciencedirect.com/science/article/pii/S019888592100255X

Hapl-o-Mat, a versatile and effective tool for haplotype frequency estimation based on an EM algorithm. Hapl-o-Mat is able to process large sets of unphased genotype data at various typing resolutions.

Hapl-o-MatGUI acts as an optional additional module to the Hapl-o-Mat software without directly intervening in the program. It supports processing and resolving various forms of HLA genotype data.
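The EM idea behind haplotype frequency estimation can be shown on a two-locus toy (a stand-in for Hapl-o-Mat's multi-locus HLA setting; the function and genotype encoding are hypothetical simplifications):

```python
def em_haplotype_freqs(genotypes, n_iter=50):
    """EM for haplotype frequencies at two biallelic loci.
    Each genotype is ((a1, a2), (b1, b2)) of unordered alleles; the only
    phase-ambiguous case is the double heterozygote."""
    haps = [(i, j) for i in (0, 1) for j in (0, 1)]
    freqs = {h: 0.25 for h in haps}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for a, b in genotypes:
            # E-step: enumerate haplotype pairs consistent with the genotype
            pairs = [(h1, h2) for h1 in haps for h2 in haps
                     if sorted((h1[0], h2[0])) == sorted(a)
                     and sorted((h1[1], h2[1])) == sorted(b)]
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: renormalize expected haplotype counts into frequencies
        n = sum(counts.values())
        freqs = {h: c / n for h, c in counts.items()}
    return freqs

# two unambiguous homozygotes anchor the phase of one double heterozygote
genos = [((0, 0), (0, 0)), ((1, 1), (1, 1)), ((0, 1), (0, 1))]
freqs = em_haplotype_freqs(genos)
```

The homozygous genotypes pull nearly all of the double heterozygote's probability mass onto the (0,0)/(1,1) phasing, so the recombinant haplotypes converge toward frequency zero.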





□ pISA-tree - a data management framework for life science research projects using a standardised directory tree

>> https://www.biorxiv.org/content/10.1101/2021.11.18.468977v1.full.pdf

pISA-tree, a straightforward and flexible data management solution for organisation of life science project-associated research data and metadata.

pISA-tree enables on-the-fly creation of enriched directory tree structure (project/Investigation/Study/Assay) via a series of sequential batch files in a standardised manner based on the ISA metadata framework.





□ reComBat: Batch effect removal in large-scale, multi-source omics data integration

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469488v1.full.pdf

reComBat, a simple, yet effective, means of mitigating highly correlated experimental conditions through regularisation and compared various elastic net regularisation strengths.

The sources of biological variation are manifold and these can often only be encoded as categorical variables. Encoding these as one-hot categorical variables creates a sparse, high-dimensional feature vector and, when many such categorical features are considered, then m ≈ n.





□ Theoretical Guarantees for Phylogeny Inference from Single-Cell Lineage Tracing

>> https://www.biorxiv.org/content/10.1101/2021.11.21.469464v1.full.pdf

Theoretical guarantees for exact reconstruction of the underlying phylogenetic tree of a group of cells, showing that exact reconstruction can indeed be achieved with high probability given sufficient information capacity in the experimental parameters.

The lower bound assumption translates to a reasonable assumption over the minimal time until cell division. And extend this algorithm and bound to account for missing data, showing that the same bounds still hold assuming a constant probability of missing data.

The upper bound corresponds to an assumption on the maximum time until cell division, which can be evaluated in lineage-traced populations, as they by definition should not be post-mitotic.





□ HaplotypeTools: a toolkit for accurately identifying recombination and recombinant genotypes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04473-1

HaplotypeTools is a new toolset to phase variant sites using VCF and BAM files and to analyse phased VCFs. Phasing is achieved via the identification of reads overlapping ≥ 2 heterozygous positions and then extended by additional reads, a process that can be parallelized.

HaplotypeTools includes various utility scripts for downstream analysis including crossover detection and phylogenetic placement of haplotypes to other lineages or species. HaplotypeTools was assessed for accuracy against WhatsHap using simulated short and long reads.





□ trioPhaser: using Mendelian inheritance logic to improve genomic phasing of trios

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04470-4

trioPhaser uses gVCF files from an individual and their parents as initial input, and then outputs a phased VCF file. Input trio data are first phased using Mendelian inheritance logic.

Then, the positions that cannot be phased using inheritance information alone are phased by the SHAPEIT4 phasing algorithm.
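The Mendelian step can be sketched for a single biallelic site (an illustrative toy, not the trioPhaser code; genotypes are unordered allele pairs):

```python
def phase_by_mendelian(child, mother, father):
    """Phase one biallelic site by Mendelian inheritance logic.
    Returns the ordered (maternal, paternal) child alleles, or None when
    the site is ambiguous (e.g. all three genotypes heterozygous) and must
    be deferred to a statistical phaser such as SHAPEIT4."""
    child, mother, father = set(child), set(mother), set(father)
    if len(child) == 1:                 # homozygous child: trivially phased
        a = next(iter(child))
        return (a, a)
    # enumerate transmissions: allele m from mother, f from father
    candidates = {(m, f) for m in mother for f in father if {m, f} == child}
    return candidates.pop() if len(candidates) == 1 else None
```

For example, a 0/1 child with a 0/0 mother and 1/1 father is unambiguously phased as (0, 1), while a fully heterozygous trio returns None and falls through to SHAPEIT4.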





□ SBGNview: Towards Data Analysis, Integration and Visualization on All Pathways

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab793/6433671

SBGNview adopts Systems Biology Graphical Notation (SBGN) and greatly extends the Pathview project by supporting multiple major pathway databases beyond KEGG.

SBGNview substantially extends or exceeds current tools (Pathview) in both design and function: high-quality output graphics (SVG format) convenient for interpretation, and a flexible, open-ended workflow for iterative editing and interactive visualization (Highlighter module).





□ The systematic assessment of completeness of public metadata accompanying omics studies

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469640v1.full.pdf

A comprehensive analysis of the completeness of public metadata accompanying omics data in both the original publications and online repositories. The completeness of metadata from the original publications across the nine clinical phenotypes is 71.1%.

In contrast, the overall completeness of metadata information from the public repositories is 48.6%. The most completely reported phenotypes are disease condition and organism, and the least complete phenotype is mortality.





□ iEnhancer-MFGBDT: Identifying enhancers and their strength by fusing multiple features and gradient boosting decision tree

>> http://www.aimspress.com/article/doi/10.3934/mbe.2021434

iEnhancer-MFGBDT is developed to identify enhancers and their strength by fusing multiple features and a gradient boosting decision tree (GBDT).

Multiple features include k-mer and reverse complement k-mer nucleotide composition based on DNA sequence, and second-order moving average, normalized Moreau-Broto auto-cross correlation and Moran auto-cross correlation based on dinucleotide physical structural property matrix.
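The first feature family, k-mer and reverse-complement k-mer composition, is simple to compute (a sketch for illustration; function names here are hypothetical):

```python
from itertools import product

def kmer_composition(seq, k=2):
    """k-mer and reverse-complement k-mer composition features.
    Returns two frequency vectors over the 4**k possible k-mers: one counted
    on the sequence itself, one on its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    rc = seq.translate(comp)[::-1]                      # reverse complement
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def freqs(s):
        n = len(s) - k + 1
        counts = {km: 0 for km in kmers}
        for i in range(n):
            counts[s[i:i + k]] += 1
        return [counts[km] / n for km in kmers]          # normalized frequencies

    return freqs(seq), freqs(rc)

f, frc = kmer_composition("ACGT", 2)
```

The two vectors are concatenated (together with the physical-structural autocorrelation features) before being fed to the GBDT classifier.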





□ CNVpytor: a tool for copy number variation detection and analysis from read depth and allele imbalance in whole-genome sequencing

>> https://academic.oup.com/gigascience/article/10/11/giab074/6431715

CNVpytor uses B-allele frequency likelihood information from single-nucleotide polymorphisms and small indels data as additional evidence for CNVs/CNAs and as primary information for copy number-neutral losses of heterozygosity.

CNVpytor inherits the reimplemented core engine of its predecessor. CNVpytor is significantly faster than CNVnator, particularly for parsing alignment files (2-20 times faster), and has 20-50 times smaller intermediate files.




Heng Li

>> https://github.com/Illumina/DRAGMAP

Dragmap is a new mapper for Illumina reads. It is like a CPU-only implementation of the DRAGEN mapping algorithm. I met DRAGEN developers once. They are among the best I know in this field. Give it a try.





□ PIntMF: Penalized Integrative Matrix Factorization method for Multi-omics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab786/6443074

PIntMF (Penalized Integrative Matrix Factorization), an MF model with sparsity, positivity and equality constraints. To induce sparsity in the model, PIntMF uses a classical Lasso penalization on the variable and individual matrices.

PIntMF uses automatic tuning of the sparsity parameters using glmnet. The sparsity on the variable block helps the interpretation of patterns. Sparsity, non-negativity and equality constraints are added to the second matrix to improve the interpretability of the clustering.




□ GPA-Tree: Statistical Approach for Functional-Annotation-Tree-Guided Prioritization of GWAS Results

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab802/6443109

GPA-Tree is a statistical approach to integrate GWAS summary statistics and functional annotation information within a unified framework.

Specifically, by combining a decision tree algorithm with a hierarchical modeling framework, GPA-Tree simultaneously implements association mapping and identifies key combinations of functional annotations related to disease risk-associated SNPs.




□ DeepUTR: Computational modeling of mRNA degradation dynamics using deep neural networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab800/6443108

DeepUTR, a deep neural network to predict mRNA degradation dynamics; the networks were interpreted to identify regulatory elements in the 3’UTR and their positional effect. Using Integrated Gradients, these CNN models identified known and novel cis-regulatory sequence elements of mRNA degradation.





Mitus Lumen.

2021-12-13 22:10:12 | Science News


- emit language syntax. -


□ Fluctuation theorems with retrodiction rather than reverse processes

>> https://avs.scitation.org/doi/10.1116/5.0060893

The everyday meaning of (ir)reversibility in nature is captured by the perceived “arrow of time”: if the video of the evolution played backward makes sense, the process is reversible; if it does not make sense, it is irreversible.

The reverse process is generically not the video played backward: to cite an extreme example, nobody conceives of bombs that fly upward to their airplanes while cities are being built from rubble.

In the case of controlled protocols in the presence of an unchanging environment, the reverse process is implemented by reversing the protocol. If the environment were to change, the connection between the physical process and the associated reverse one becomes thinner.

The retrodiction channel of an erasure channel is the erasure channel that returns the reference prior—a result that can be easily extended to any alphabet dimension.

PROCESSES VERSUS INFERENCES: fluctuation relations are intimately related to statistical distances (“divergences”) and that Bayesian retrodiction arises from the requirement that the fluctuating variable can be computed locally.





□ The Metric Dimension of the Zero-Divisor Graph of a Matrix Semiring

>> https://arxiv.org/pdf/2111.07717v1.pdf

The metric dimensions of graphs corresponding to various algebraic structures: the metric dimension of a zero-divisor graph of a commutative ring, a total graph of a finite commutative ring, an annihilating-ideal graph of a finite ring, and a commuting graph of a dihedral group.

Antinegative semirings are also called antirings. The simplest example of an antinegative semiring is the binary Boolean semiring B, the set {0,1} in which addition and multiplication are the same as in Z except that 1 + 1 = 1.

For infinite entire antirings S, the metric dimension of Γ(Mn(S)) is infinite; the study therefore limits itself to finite semirings. For every Λ ⊆ Nn × Nn, at most one zero-divisor matrix with its pattern of zero and non-zero entries prescribed by Λ is not in W.





□ CONTEXT, JUDGEMENT, DEDUCTION

>> https://arxiv.org/pdf/2111.09438v1.pdf

An abstract definition of a type constructor featuring the usual formation, introduction, elimination and computation rules. In proof theory they offer a deep analysis of structural rules, demystifying some of their properties, and putting them into context.

Discussing the internal logic of a topos, a predicative topos, an elementary 2-topos et similia, and showing how these can be organized in judgemental theories.





□ Scasa: Isoform-level Quantification for Single-Cell RNA Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab807/6448218

Scasa, an isoform-level quantification method for high-throughput single-cell RNA sequencing by exploiting the concepts of transcription clusters and isoform paralogs.

Scasa compares well in simulations against competing approaches including Alevin, Cellranger, Kallisto, Salmon, Terminus and STARsolo at both isoform- and gene-level expression.

Scasa takes advantage of the efficient preprocessing provided by existing pseudoaligners such as Kallisto-bustools or Alevin to produce a read-count equivalent-class matrix. Scasa splits the equivalence class output by cell and applies the AEM algorithm to multiple cells.





□ corral: Single-cell RNA-seq dimension reduction, batch integration, and visualization with correspondence analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469874v1.full.pdf

Correspondence Analysis (CA) for dimension reduction of scRNAseq data is a performant alternative to PCA. Designed for use with counts, CA is based on decomposition of a chi-squared residual matrix and does not require log-transformation of scRNAseq counts.
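
The decomposition step can be sketched in a few lines of numpy; this illustrates plain CA on a count table, not corral's implementation:

```python
import numpy as np

def ca_embed(counts, k=2):
    """Sketch of correspondence analysis: SVD of the Pearson chi-squared
    residual matrix of a gene-by-cell count table (not the corral code)."""
    N = counts.sum()
    P = counts / N                      # correspondence matrix
    r = P.sum(axis=1, keepdims=True)    # row (gene) masses
    c = P.sum(axis=0, keepdims=True)    # column (cell) masses
    E = r @ c                           # expected proportions under independence
    R = (P - E) / np.sqrt(E)            # standardized (Pearson) residuals
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return Vt[:k].T * s[:k]             # k-dimensional cell embedding

rng = np.random.default_rng(0)
emb = ca_embed(rng.poisson(5.0, size=(50, 30)).astype(float), k=2)
print(emb.shape)  # (30, 2)
```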

CA using the Freeman-Tukey chi-squared residual was most performant overall in scRNAseq data. Variance stabilizing transformations applied in conjunction with standard CA and the use of “power deflation” smoothing both improve performance in downstream clustering tasks.

corralm, a CA-based method for multi-table batch integration of scRNAseq data in a shared latent space. The adaptation of correspondence analysis to the integration of multiple tables is similar to the single-table method, with additional matrix concatenation operations.

corralm employs indexed residuals, dividing the standardized residuals by the square root of the expected proportions to reduce the influence of columns with larger masses (library depth). It also applies CA-style processing to continuous data via the Hellinger distance adaptation.





□ Fuzzy set intersection based paired-end short-read alignment

>> https://www.biorxiv.org/content/10.1101/2021.11.23.469039v1.full.pdf

a new algorithm for aligning both reads in a pair simultaneously by fuzzily intersecting the sets of candidate alignment locations for each read. SNAP with the fuzzy set intersection algorithm dominates BWA and Bowtie, having both better performance and better concordance.

Fuzzy set intersection avoids doing expensive evaluations of many candidate alignments that would eventually be dismissed because they are too far from any plausible alignments for the other end of the pair.





□ ScLRTC: imputation for single-cell RNA-seq data via low-rank tensor completion

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08101-3

scLRTC imputes the dropout entries of a given scRNA-seq expression matrix. It initially exploits the similarity of single cells to build a third-order low-rank tensor and employs the tensor decomposition to denoise the data.

ScLRTC reconstructs the cell expression by adopting the low-rank tensor completion algorithm, which can restore the gene-to-gene and cell-to-cell correlations. scLRTC is demonstrated to be also effective in cell visualization and in inferring cell lineage trajectories.





□ FDJD: RNA-Seq Based Fusion Transcript Detection Using Jaccard Distance

>> https://www.biorxiv.org/content/10.1101/2021.11.17.469019v1.full.pdf

Converting the RNA categorical space into a compact binary array called binary fingerprints, which enables us to reduce the memory usage and increase efficiency. The search and detection of fusion candidates are done using the Jaccard distance.
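
The distance computation itself can be sketched on boolean arrays; names and layout here are illustrative, not FDJD's:

```python
import numpy as np

def jaccard_distance(a, b):
    """Jaccard distance between two binary fingerprints:
    1 - |intersection| / |union|. Illustrative sketch, not the FDJD code."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union

fp1 = np.array([1, 0, 1, 1, 0], dtype=bool)
fp2 = np.array([1, 1, 1, 0, 0], dtype=bool)
print(jaccard_distance(fp1, fp2))  # 0.5: 2 shared set bits out of 4 set overall
```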

FDJD (Fusion Detection using the Jaccard Distance) exhibits superior accuracy compared to popular alternative fusion detection methods. FDJD generates fusion candidates using both split reads and discordantly aligned pairs which are produced by the STAR alignment step.





□ Inspector: Accurate long-read de novo assembly evaluation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02527-4

Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions.

Inspector generates read-to-contig alignment and performs downstream assembly evaluation. Inspector can report the precise locations and sizes for structural and small-scale assembly errors and distinguish true assembly errors from genetic variants.





□ Characterizing Protein Conformational Spaces using Dimensionality Reduction and Algebraic Topology

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468545v1.full.pdf

Linear dimensionality reduction like PCA and its variants may not capture the complex, non-linear nature of protein conformational landscape. Dimensionality reduction techniques are broadly classified based on the solution space they generate, as convex and non-convex.

Even after the conformational space is sampled, it should be filtered and clustered to extract meaningful information.

The structures represented by these conformations are then analyzed by studying their high dimension topological properties to identify truly distinct conformations and holes in the conformational space that may represent high energy barriers.





□ scCODE: an R package for personalized differentially expressed gene detection on single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469072v1.full.pdf

DE methods together with gene filtering have profound impact on DE gene identification, and different datasets will benefit from personalized DE gene detection strategies.

scCODE (single cell Consensus Optimization of Differentially Expressed gene detection) produces consensus DE gene results.

scCODE summarizes the top (default: all) DE genes from each of the selected strategies. The principle of consensus optimization is that DE genes observed more frequently across different analysis strategies are more reliable.





□ HDMC: a novel deep learning based framework for removing batch effects in single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab821/6449435

This framework employs an adversarial training strategy to match the global distribution of different batches. This provides an improved foundation to further match the local distributions with a Maximum Mean Discrepancy based loss.

HDMC divides cells in each batch into clusters and uses a contrastive learning method to simultaneously align similar cluster pairs / keep noisy pairs apart from each other. It allows obtaining clusters w/ all cells of the same type, and avoids clusters w/ cells of different types.





□ COBREXA.jl: constraint-based reconstruction and exascale analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab782/6429269

COBREXA.jl provides a ‘batteries-included’ solution for scaling analyses to make efficient use of high-performance computing (HPC) facilities, allowing constraint-based analyses to be realistically applied to pre-exascale-sized models.

COBREXA formulates optimization problems and is compatible w/ JuMP solvers. Its building blocks are designed so that one can construct a workflow that explores flux variability in many model variants, execute it in a distributed fashion, and collect the many results in a multi-dimensional array.





□ Built on sand: the shaky foundations of simulating single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468676v1.full.pdf

Most simulators are unable to accommodate complex designs without introducing artificial effects; they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods.

By definition, simulations generate synthetic data. Conclusions drawn from simulation studies are frequently criticized, because simulations cannot completely mimic (real) experimental data.




□ DiagAF: A More Accurate and Efficient Pre-Alignment Filter for Sequence Alignment

>> https://ieeexplore.ieee.org/document/9614999/

DiagAF uses a new lower bound on edit distance based on shifted Hamming masks. The new lower bound uses fewer shifted Hamming masks compared with state-of-the-art algorithms such as SHD and MAGNET.
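
The shifted-Hamming-mask idea behind SHD-style filters can be sketched as follows; this illustrates the general principle, not DiagAF's specific bound:

```python
def shifted_hamming_masks(read, ref, e):
    """Sketch of the shifted-Hamming-mask idea behind SHD-style filters
    (not DiagAF's exact bound): for each diagonal shift in [-e, e], mark
    read positions where read and shifted ref agree; a position explained
    by NO shift must cost at least one edit."""
    n = len(read)
    combined = [1] * n  # 1 = mismatch under every shift considered so far
    for shift in range(-e, e + 1):
        for i in range(n):
            j = i + shift
            if 0 <= j < len(ref) and read[i] == ref[j]:
                combined[i] = 0  # explained by this diagonal
    return sum(combined)  # lower-bound-style mismatch count

# A read one insertion away from the reference: one unexplained position,
# which is <= e = 1, so the candidate survives the filter.
print(shifted_hamming_masks("ACGTACGT", "ACGACGT", 1))  # 1
```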

DiagAF has the following features: it is faster; it has a lower false positive rate and a zero false negative rate; it can deal with alignments of unequal lengths; and it can pre-align a string to multiple candidates in a single run. DiagAF can align sequences with early termination for true alignments.




□ Explainability methods for differential gene analysis of single cell RNA-seq clustering models

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468416v1.full.pdf

The absence of “ground truth” information about the DE genes makes evaluation on real-world datasets a complex task, usually requiring additional biological experiments for validation.

a comprehensive study comparing the performance of dedicated DE methods with that of explainability methods typically used in machine learning, both model-agnostic (SHAP, permutation importance) and model-specific (NN gradient-based methods).

The gradient method achieved the highest accuracy on scziDesk and scDeepCluster, while on contrastive-sc the results are comparable to the other top-performing methods.

contrastive-sc employs high levels of NN dropout as data augmentation and thus learns a sparse representation of the input data, penalizing by design the capacity to learn all relevant features.




□ MAGUS+eHMMs: Improved Multiple Sequence Alignment Accuracy for Fragmentary Sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab788/6430102

MAGUS is fairly robust to fragmentary sequences under many conditions; accuracy improves further with a two-stage approach, where MAGUS is used to align selected “backbone sequences” and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models.

MAGUS+eHMMs, matches or improves on both MAGUS and UPP, particularly when aligning datasets that evolved under high rates of evolution and that have large fractions of fragmentary sequences.




□ FastQTLmapping: an ultra-fast package for mQTL-like analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468610v1.full.pdf

FastQTLmapping is a computationally efficient, exact, and generic solver for exhaustive multiple regression analysis involving extraordinarily large numbers of dependent and explanatory variables with covariates.

FastQTLmapping can handle omics data containing tens of thousands of individuals and billions of molecular loci.

FastQTLmapping accepts input files in text format and in Plink binary format. The output file is in text format and contains all test statistics for all regressions, with the ability to control the volume of the output at preset significance thresholds.





□ ZARP: An automated workflow for processing of RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469017v1.full.pdf

ZARP (Zavolan-Lab Automated RNA-seq Pipeline) can run locally or in a cluster environment, generating extensive reports not only of the data but also of the options utilized.

ZARP requires two distinct input files: A tab-delimited file with sample-specific information, such as paths to the sequencing data (FASTQ), transcriptome annotation (GTF) and experiment protocol- and library-preparation specifications like adapter sequences or fragment size.

To provide a high-level topographical/functional annotation of which gene segments (e.g., CDS, 3’UTR, intergenic) and biotypes (e.g., protein coding genes, rRNA) are represented by the reads in a given sample, ZARP includes ALFA.





□ VIVID: a web application for variant interpretation and visualisation in multidimensional analyses

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468904v1.full.pdf

VIVID, a novel interactive and user-friendly platform that automates mapping of genotypic information and population genetic analysis from VCF files in 2D and 3D protein structural space.

VIVID is a unique ensemble user interface that enables users to explore and interpret the impact of genotypic variation on the phenotypes of secondary and tertiary protein structures.





□ Spliceator: multi-species splice site prediction using convolutional neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04471-3

Spliceator is based on the Convolutional Neural Networks technology and more importantly, is trained on an original high quality dataset containing genomic sequences from organisms ranging from human to protists.

Spliceator achieves overall high accuracy compared to other state-of-the-art programs, including the neural network-based NNSplice, MaxEntScan that models SS using the maximum entropy distribution, and two CNN-based methods: DSSP and SpliceFinder.






□ GSA: an independent development algorithm for calling copy number and detecting homologous recombination deficiency (HRD) from target capture sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04487-9

Genomic Scar Analysis (GSA) could effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.

Evaluating the rationality of segmentation and genotype identification by the GSA algorithm, and comparing it with two other algorithms, PureCN and ASCAT, showed that the segmentation results of the GSA algorithm were more logical.




□ A computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469509v1.full.pdf

The Clustering Linear Combination (CLC) method works particularly well with phenotypes that have natural groupings, but because the number of clusters for a given dataset is unknown, the final test statistic of the CLC method is the minimum p-value among all p-values of the CLC test statistics obtained from each possible number of clusters.

Computationally Efficient CLC (ceCLC) to test the association between multiple phenotypes and a genetic variant. ceCLC uses the Cauchy combination test to combine all p-values of the CLC test statistics obtained from each possible number of clusters.
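
The Cauchy combination step itself is short; a sketch of the standard Liu-Xie formula, not the ceCLC code:

```python
import math

def cauchy_combination(pvals, weights=None):
    """Cauchy combination test: map each p-value to a standard Cauchy
    variate, take a weighted average, and convert back to a p-value.
    Sketch of the combination step ceCLC uses, not the ceCLC code."""
    if weights is None:
        weights = [1.0 / len(pvals)] * len(pvals)
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvals))
    return 0.5 - math.atan(t) / math.pi

# Combine p-values from different cluster numbers without needing to
# know their correlation structure:
print(cauchy_combination([0.01, 0.20, 0.35]))
```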





□ Figbird: A probabilistic method for filling gaps in genome assemblies

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469861v1.full.pdf

Figbird, a probabilistic method for filling gaps in draft genome assemblies using second generation reads based on a generative model for sequencing that takes into account information on insert sizes of read pairs and sequencing errors.

Figbird uses an iterative approach based on the expectation-maximization (EM) algorithm. The method is based on a generative model for sequencing proposed in CGAL and subsequently used to develop a scaffolding tool SWALO.





□ TSEBRA: transcript selector for BRAKER

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04482-0

TSEBRA uses a set of arbitrarily many gene prediction files in GTF format together with a set of files of heterogeneous extrinsic evidence to produce a combined output.

TSEBRA uses extrinsic evidence in the form of intron regions or start/stop codon positions to evaluate and filter transcripts from gene predictions.





□ VG-Pedigree: A Complete Pedigree-Based Graph Workflow for Rare Candidate Variant Analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469912v1.full.pdf

VG-Pedigree, a pedigree-aware workflow based on the pangenome-mapping tool of Giraffe and the variant-calling tool DeepTrio using a specially-trained model for Giraffe-based alignments.

VG-Pedigree improves mapping and variant calling in both SNVs and INDEL variants over those produced by alignments created using BWA-MEM to a linear-reference and Giraffe mapping to a pangenome graph containing data from the 1000 Genomes Project.





□ Detecting fabrication in large-scale molecular omics data

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0260395

Just as has been previously shown in the financial sector, digit frequencies are a powerful data representation when used in combination with machine learning to predict the authenticity of data. Fraud detection methods must be updated for sophisticated computational fraud.

This work develops fabrication-detection methods for biomedical research and shows that machine learning can be used to detect fraud in large-scale omics experiments. The Benford-like digit-frequency method can be generalized to any tabular numeric data.
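
A sketch of the digit-frequency representation (a hypothetical feature extraction in the Benford spirit, not the paper's pipeline):

```python
import math

def first_digit(v):
    """Leading significant digit of a nonzero number."""
    return int(10 ** (math.log10(abs(v)) % 1))

def digit_freqs(values):
    """Frequencies of leading digits 1-9: the kind of feature vector a
    classifier could be trained on for Benford-style fabrication detection."""
    counts = [0] * 9
    vals = [v for v in values if v != 0]
    for v in vals:
        counts[first_digit(v) - 1] += 1
    return [c / len(vals) for c in counts]

# Expected frequencies under Benford's law, for comparison:
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
print(first_digit(0.00314))  # 3
```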





□ monaLisa: an R/Bioconductor package for identifying regulatory motifs

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470570v1.full.pdf

monaLisa (MOtif aNAlysis with Lisa), an R/Bioconductor package that implements approaches to identify relevant transcription factors from experimental data.

monaLisa uses randomized lasso stability selection. monaLisa further provides helpful functions for motif analyses, including functions to predict motif matches and calculate similarity between motifs.





□ BreakNet: detecting deletions using long reads and a deep learning approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04499-5

BreakNet first extracts feature matrices from long-read alignments. Second, it uses a time-distributed CNN to integrate and map the feature matrices to feature vectors.

BreakNet employs a BLSTM model to analyse the produced set of continuous feature vectors in both the forward and backward directions. A classification module then determines whether a region refers to a deletion.





□ Variance in Variants: Propagating Genome Sequence Uncertainty into Phylogenetic Lineage Assignment

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470642v1.full.pdf

a probabilistic matrix representation of individual sequences which incorporates base quality scores as a measure of uncertainty, naturally leading to resampling and replication as a framework for uncertainty propagation.

With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis.

This framework involves converting the uncertainty scores into a matrix of probabilities, and repeatedly sampling from this matrix and using the resultant samples in downstream analysis.
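
A minimal sketch of that conversion and resampling, under the simplifying assumption that the residual Phred error probability is split evenly among the three alternative bases:

```python
import random

BASES = "ACGT"

def base_prob_matrix(seq, quals):
    """Convert a called sequence plus Phred quality scores into a
    position-by-base probability matrix: the called base gets 1 - p_err,
    the other three bases share p_err equally (a simplifying assumption)."""
    rows = []
    for base, q in zip(seq, quals):
        p_err = 10 ** (-q / 10)  # Phred: Q = -10 * log10(p_err)
        rows.append([1 - p_err if b == base else p_err / 3 for b in BASES])
    return rows

def resample(matrix, rng):
    """Draw one plausible sequence from the probability matrix."""
    return "".join(rng.choices(BASES, weights=row)[0] for row in matrix)

rng = random.Random(0)
m = base_prob_matrix("ACGT", [40, 40, 5, 40])  # low confidence at position 3
samples = [resample(m, rng) for _ in range(1000)]  # bootstrap-like replicates
```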





□ Macarons: Uncovering complementary sets of variants for predicting quantitative phenotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab803/6448209

Macarons, a fast and simple algorithm, to select a small, complementary subset of variants by avoiding redundant pairs that are likely to be in linkage disequilibrium.

Macarons features two simple, interpretable parameters to control the time/performance trade-off: the number of SNPs to be selected (k), and maximum intra-chromosomal distance (D, in base pairs) to reduce the search space for redundant SNPs.





□ Detecting Spatially Co-expressed Gene Clusters with Functional Coherence by Graph-regularized Convolutional Neural Network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab812/6448221

The graph-regularized CNN models the expressions of a gene over spatial locations as an image of a gene activity map, and naturally utilizes the spatial localization information by performing convolution operation to capture the nearby tissue textures.

The model further exploits prior knowledge of gene relationships encoded in PPI network as a regularization by graph Laplacian of the network to enhance biological interpretation of the detected gene clusters.





□ deepMNN: Deep Learning-Based Single-Cell RNA Sequencing Data Batch Correction Using Mutual Nearest Neighbors

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.708981/full

deepMNN identifies mutual nearest neighbor (MNN) pairs across different batches in a PCA subspace. A residual-based batch correction network was then constructed and employed to remove batch effects based on these MNN pairs.

The overall loss of deepMNN was designed as the sum of a batch loss and a weighted regularization loss. The batch loss was used to compute the distance between cells in MNN pairs in the PCA subspace, while the regularization loss was to make the output of the network similar to the input.
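
A numpy sketch of that objective; the distance choice and weighting are illustrative, not deepMNN's exact formulation:

```python
import numpy as np

def deepmnn_loss(corrected, original, mnn_pairs, reg_weight=0.1):
    """Sketch of a deepMNN-style objective: the batch loss is the distance
    between corrected embeddings of cells in MNN pairs; the regularization
    loss keeps the network output close to its input."""
    i, j = zip(*mnn_pairs)
    batch_loss = np.mean(np.sum((corrected[list(i)] - corrected[list(j)]) ** 2, axis=1))
    reg_loss = np.mean(np.sum((corrected - original) ** 2, axis=1))
    return batch_loss + reg_weight * reg_loss

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 3))  # toy embeddings for 6 cells
loss = deepmnn_loss(emb, emb, [(0, 1), (2, 3)])
```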





Desolation.

2021-12-13 22:07:13 | Science News




□ Adjoining colimits

>> https://arxiv.org/abs/2111.12117v1

a theory of colimit sketches ‘with constructions’ in higher category theory, formalising the input to the ubiquitous procedure of adjoining specified ‘constructible’ colimits to a category such that specified ‘relation’ colimits are enforced.

Morel-Voevodsky’s category of motivic spaces, resp. Robalo’s category of non-commutative motives are universal among categories under Sch, resp. ncSch, admitting all colimits such that Nisnevich descent is preserved and A1-localisation is enforced.

This language makes explicit the rôle colimit diagrams play as presentations of objects of ∞-categories, expressing how they are put together from objects of a dense subcategory. It may be useful to theory builders embarking on a construction of their own ‘designer’ ∞-category.





□ SAT: Efficient iterative Hi-C scaffolder based on N-best neighbors

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04453-5

Hi-C based scaffolding tool, pin_hic, which takes advantage of contact information from Hi-C reads to construct a scaffolding graph iteratively based on N-best neighbors of contigs. It identifies potential misjoins and breaks them to keep the scaffolding accuracy.

SAT, a new format inspired by GFA and extended to keep scaffolding information. In each iteration, if the SAT file is used as an input, the paths are constructed first and each original contig in the draft assembly keeps a record of its corresponding scaffold.





□ EnGRaiN: A Supervised Ensemble Learning Method for Recovery of Large-scale Gene Regulatory Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab829/6458321

EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs.

EnGRaiN integrates interaction/co-expression predictions from multiple gene network inference methods to generate a comprehensive ensemble network of gene interactions. EnGRaiN leverages the ground truth to learn optimal distribution over its various features.





□ SCRIP: an accurate simulator for single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab824/6454945

SCRIP provides a flexible Gamma-Poisson mixture and a Beta-Gamma-Poisson mixture framework to simulate scRNA-seq data. The SCRIP package was built on the framework of splatter. Both Gamma-Poisson and Beta-Poisson distributions model the overdispersion of scRNA-seq data.

Specifically, Beta-Poisson model was used to model bursting effect. The dispersion was accurately simulated by fitting the mean-BCV dependency using Generalized Additive Model.

SCRIP models other key characteristics of scRNA-seq data including library size, zero inflation and outliers. SCRIP enables various applications for different experimental designs and goals, including DE analysis, clustering analysis, trajectory-based analysis and bursting analysis.
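
A Gamma-Poisson (negative binomial) simulator in the spirit of SCRIP/splatter can be sketched as follows; parameter names and values are illustrative, not SCRIP's:

```python
import numpy as np

rng = np.random.default_rng(1)

def gamma_poisson_counts(n_genes, n_cells, mean_shape=0.6, mean_rate=0.3, bcv=0.4):
    """Sketch of a Gamma-Poisson simulator: gene means are Gamma-distributed,
    per-entry rates add Gamma noise controlled by a BCV-like dispersion,
    and counts are Poisson given the rates."""
    gene_means = rng.gamma(mean_shape, 1.0 / mean_rate, size=n_genes)
    # Gamma noise with shape 1/bcv^2 and scale bcv^2 has mean 1 and CV = bcv.
    noise = rng.gamma(1.0 / bcv**2, bcv**2, size=(n_genes, n_cells))
    lam = gene_means[:, None] * noise
    return rng.poisson(lam)

counts = gamma_poisson_counts(200, 50)
print(counts.shape)  # (200, 50)
```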





□ schist: Nested Stochastic Block Models applied to the analysis of single cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04489-7

schist is a convenient wrapper to the graph-tool python library, designed to be used with scanpy. The most prominent function is schist.inference.nested_model(), which takes an AnnData object as input and fits a nested Stochastic Block Model on the kNN graph built with scanpy.

The Bayesian formulation of Stochastic Block Models provides the possibility to perform inference on a graph for any partition configuration, thus allowing reliable model selection using an interpretable measure, entropy.





□ scShaper: an ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab831/6458323

scShaper, a new trajectory inference method that enables accurate linear trajectory inference. The ensemble approach of scShaper generates a continuous smooth pseudotime based on a set of discrete pseudotimes.

scShaper is a fast method with few hyperparameters, making it a promising alternative to the principal curves method for linear pseudotemporal ordering.

scShaper is based on graph theory and solves the shortest Hamiltonian path of a clustering, utilizing a greedy algorithm to permute clusterings computed using the k-means method to obtain a set of discrete pseudotimes.





□ GNNImpute: An efficient scRNA-seq dropout imputation method using graph attention network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04493-x

GNNImpute, an autoencoder structure network that uses graph attention convolution to aggregate multi-level similar cell information and implements convolution operations on non-Euclidean space.

GNNImpute compensates for the low expression intensity of some genes by aggregating the feature information of similar cells. It can recover the dropout events in the scRNA-seq data and retain the specificity between cells to avoid excessive smoothing of expression.

GNNImpute can accurately and effectively impute the dropout and reduce dropout noise. GNNImpute enables the expression of the cells in the same tissue area to be embedded in low-dimensional vectors.





□ scBERT: a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.12.05.471261v1.full.pdf

scBERT (single-cell Bidirectional Encoder Representations from Transformers) follows the state-of-the-art paradigm of pre-train and fine-tune in the deep learning field.

scBERT formulates the expression profile of each single cell into embeddings for genes. scBERT computes the probability for the provided cell to be any cell type labelled in the reference dataset.

scBERT keeps the full gene-level interpretation, abandons the use of HVGs and dimensionality reduction, and lets discriminative genes and useful interaction come to the surface by themselves.

scBERT allows for the discovery of gene expression patterns that account for cell type annotation in an unbiased data-driven manner. scBERT pioneered the application of Transformer architectures in scRNA-seq data analysis with innovatively designed embeddings for genes.





□ GINCCo: Unsupervised construction of computational graphs for gene expression data with explicit structural inductive biases

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab830/6458322

GINCCo (Gene Interaction Network Constrained Construction), an unsupervised method for automated construction of computational graph models for gene expression data that are structurally constrained by prior knowledge of gene interaction networks.

Each of the entities in the GINCCo computational graph represent biological entities such as genes, candidate protein complexes and phenotypes instead of arbitrary hidden nodes of a neural network.

GINCCo performs the model construction in a completely automated and deterministic manner; this can be seen as a preprocessing step, allowing GINCCo to scale immensely and study factor graphs without task-specific optimization dictating the shape of the models.





□ sciCAN: Single-cell chromatin accessibility and gene expression data integration via Cycle-consistent Adversarial Network

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470677v1.full.pdf

sciCAN removes modality differences while keeping true biological variation. The model architecture of sciCAN contains two major components: representation learning and modality alignment.

sciCAN doesn’t require cell anchors and thus, it can be applied to most non-joint profiled single-cell data. sciCAN enabled us to co-embed and co-cluster RNA-seq and ATAC-seq data. sciCAN reduces each dataset into 128-dimension spaces.





□ propeller: testing for differences in cell type proportions in single cell data

>> https://www.biorxiv.org/content/10.1101/2021.11.28.470236v1.full.pdf

propeller, a robust and flexible method that leverages biological replication to find statistically significant differences in cell type proportions between groups.

Propeller leverages biological replication to estimate the high sample-to-sample variability in cell type counts often observed in real single cell data.

The minimal annotation information that propeller requires for each cell is cluster/cell type, sample and group/condition, which can be automatically extracted from Seurat and SingleCellExperiment class objects.

The propeller function calculates cell type proportions for each biological replicate, performs a variance stabilising transformation on the matrix of proportions and fits a linear model for each cell type or cluster using the limma framework.
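
The first two steps can be sketched as follows; the arcsin-square-root is one variance-stabilising transformation for proportions, and the limma model fit is omitted:

```python
import numpy as np

def arcsin_sqrt_props(cell_type_counts):
    """Sketch of propeller's first steps (not the limma fit): per-sample
    cell-type proportions, then the variance-stabilising
    arcsin-square-root transformation."""
    counts = np.asarray(cell_type_counts, dtype=float)  # samples x cell types
    props = counts / counts.sum(axis=1, keepdims=True)
    return np.arcsin(np.sqrt(props))

# Two biological replicates per group, three cell types:
counts = [[120, 60, 20], [100, 70, 30], [40, 110, 50], [55, 95, 50]]
transformed = arcsin_sqrt_props(counts)
print(transformed.shape)  # (4, 3)
```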




□ AlphaFill: enriching the AlphaFold models with ligands and co-factors

>> https://www.biorxiv.org/content/10.1101/2021.11.26.470110v1.full.pdf

AlphaFill, an algorithm based on sequence and structure similarity, to “transplant” such “missing” small molecules and ions from experimentally determined structures. AlphaFill should be complemented by structure-based transfer algorithms.

The sequence of the AlphaFold model is BLASTed against the sequence file of the LAHMA webserver, which contains all sequences present in the PDB-REDO databank. The hits are sorted by E-value and a maximum of 250 hits, as is the default for BLAST, is returned.

The selection of hits is then structurally aligned, based on the Cα-atoms of the residues matched in the BLAST alignment. The root-mean-square deviation (RMSD) of this global alignment is stored in the AlphaFill metadata.





□ HiCArch: A Deep Learning-based Hi-C Data Predictor

>> https://www.biorxiv.org/content/10.1101/2021.11.26.470146v1.full.pdf

HiCArch, a transformer-based model architecture for Hi-C contact matrices prediction based on the 11 types of K562 epigenomic features, consisting of chromatin binding factors and histone modifications.

HiCArch processes the sequential input and generates the 2D Hi-C matrix via two main modules: sequence-to-sequence (seqToSeq, or STS) module, sequence-to-matrix (seqToMat, or STM) module.




□ propeller: testing for differences in cell type proportions in single cell data

>> https://www.biorxiv.org/content/10.1101/2021.11.28.470236v1.full.pdf

propeller, a robust and flexible method that leverages biological replication to find statistically significant differences in cell type proportions between groups.

Propeller leverages biological replication to estimate the high sample- to-sample variability in cell type counts often observed in real single cell data. The minimal annotation information that propeller requires for each cell is cluster/cell type, sample and group/condition, which can be automatically extracted from Seurat and SingleCellExperiment class objects.

The propeller function calculates cell type proportions for each biological replicate, performs a variance stabilising transformation on the matrix of proportions and fits a linear model for each cell type or cluster using the limma framework.
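
The first two steps can be sketched as follows; the arcsin-square-root transform shown is one of the variance-stabilising transforms propeller offers, and the limma model-fitting step is omitted:

```python
import math

def propeller_transform(counts_by_sample):
    """Per-replicate cell-type proportions, then an arcsin-square-root
    variance-stabilising transform."""
    out = {}
    for sample, counts in counts_by_sample.items():
        total = sum(counts.values())
        out[sample] = {ct: math.asin(math.sqrt(n / total))
                       for ct, n in counts.items()}
    return out

transformed = propeller_transform({"s1": {"T": 80, "B": 20},
                                   "s2": {"T": 60, "B": 40}})
```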





□ Predicting environmentally responsive transgenerational differential DNA methylated regions (epimutations) in the genome using a hybrid deep-machine learning approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04491-z

a hybrid DL-ML approach that uses a deep neural network for extracting molecular features and a non-DL classifier to predict environmentally responsive transgenerational differential DNA methylated regions (DMRs), termed epimutations, based on the extracted DL-based features.

The process of generating features is supervised. A 1000 bp input DNA sequence is one-hot encoded as a 5 × 1000 binary matrix. Each convolutional layer is followed by a batch-normalization layer and a ReLU layer.
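
The one-hot encoding step can be sketched as follows (the A/C/G/T/N row order is an assumption):

```python
ALPHABET = "ACGTN"  # assumed row order for the 5 x 1000 binary matrix

def one_hot(seq, length=1000):
    """One-hot encode a DNA sequence into a 5 x length binary matrix
    (list of lists), truncating or zero-padding to the fixed length."""
    m = [[0] * length for _ in ALPHABET]
    for i, base in enumerate(seq[:length].upper()):
        m[ALPHABET.index(base)][i] = 1
    return m

x = one_hot("ACGT")
```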





□ Navigating the pitfalls of applying machine learning in genomics

>> https://www.nature.com/articles/s41576-021-00434-9

Jacob Schreiber:
Although this high-level explanation covers our main point, we describe five specific (related) pitfalls that one can encounter in this space through the lens of train/test/prediction sets to drive home how common it is to make a mistake in an evaluation setting.

Importantly: CROSS-FOLD VALIDATION IS NOT THE SOLUTION. In fact, blindly applying cross-fold validation to biological data without thinking about your anticipated use case (the prediction set) can give you a false sense of security in the face of complexity.




□ Codex DNA increases productivity & efficiency of mRNA synthesis, launching BioXP kits with CleanCap Reagent AG

Automated platform accelerates development of mRNA-based #vaccines & therapies

>> https://codexdna.com/products/bioxp-kits/mrna-synthesis/




□ KaKs_Calculator 3.0: calculating selective pressure on coding and non-coding sequences

>> https://www.biorxiv.org/content/10.1101/2021.11.25.469998v1.full.pdf

Similar to the nonsynonymous/synonymous substitution rate ratio for coding sequences, selection on non-coding sequences can be quantified as non-coding nucleotide substitution rate normalized by synonymous substitution rate of adjacent coding sequences.
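
The ratio itself is simple; a sketch (the function name is illustrative, not KaKs_Calculator's API):

```python
def selection_ratio(noncoding_rate, ks):
    """Analogue of Ka/Ks for non-coding sequence: the non-coding
    substitution rate normalised by the synonymous rate (Ks) of the
    adjacent coding sequence. >1 suggests positive selection,
    <1 purifying selection, ~1 neutrality."""
    if ks == 0:
        raise ValueError("Ks must be non-zero")
    return noncoding_rate / ks
```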

KaKs_Calculator detects the mode of selection operated on molecular sequences, accordingly demonstrating its great potential to achieve genome-wide scan of natural selection on diverse sequences and identification of potentially functional elements at whole genome scale.





□ Systematic evaluation of cell-type deconvolution pipelines for sequencing-based bulk DNA methylomes

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470374v1.full.pdf

All compared sequencing-based methods consist of two common steps, informative region selection and cell-type composition estimation.

In the informative region selection step, the sequencing-based cell-type deconvolution methods filter out CpGs where the methylation patterns do not clearly demonstrate cell-type heterogeneity.

Whereas selecting similar genomic regions to DMRs generally contributed to increasing the performance in bi-component mixtures, the uniformity of cell-type distribution showed a high correlation with the performance in five cell-type bulk analyses.





□ GraphPrompt: Biomedical Entity Normalization Using Graph-based Prompt Templates

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470486v1.full.pdf

OBO-syn encompasses 70 biomedical entity types and 2 million entity-synonym pairs. OBO-syn shows little overlap with existing datasets and poses more challenging entity-synonym predictions.

GraphPrompt, a prompt-based learning method for entity normalization with the consideration of graph structures. GraphPrompt solves a masked-language model task. GraphPrompt has obtained superior performance to the other approaches on both few-shot and zero-shot settings.





□ CLA: Automated identification of cell-type–specific genes and alternative promoters

>> https://www.biorxiv.org/content/10.1101/2021.12.01.470587v1.full.pdf

Cell Lineage Analysis (CLA), a computational method which identifies transcriptional features with expression patterns that discriminate cell types, incorporating Cell Ontology knowledge on the relationship between different cell types.

CLA uses random forest classification with a stratified bootstrap to increase the accuracy of binary classifiers when cell types have different numbers of samples.

CLA runs multiple instances of regularized random forest and reports the transcriptional features consistently selected. CLA not only discriminates individual cell types but can also discriminate lineages of cell types related in the developmental hierarchy.





□ CSmiR: Exploring cell-specific miRNA regulation with single-cell miRNA-mRNA co-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04498-6

CSmiR (Cell-Specific miRNA regulation) to combine single-cell miRNA-mRNA co-sequencing data and putative miRNA-mRNA binding information to identify miRNA regulatory networks at the resolution of individual cells.

CSmiR is effective in predicting cell-specific miRNA targets. Finally, through exploring cell–cell similarity matrix characterized by cell-specific miRNA regulation, CSmiR provides a novel strategy for clustering single-cells and helps to understand cell–cell crosstalk.





□ CombSAFE: Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab815/6448225

CombSAFE allows analyzing the whole genome, by clustering patterns of regions with similar functional elements and through enrichment analyses to discover ontological terms significantly associated with them.

CombSAFE allows comparing functional states of a specific genomic region to analyze their different behavior throughout the various semantic annotations.





□ KAGE: Fast alignment-free graph-based genotyping of SNPs and short indels

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471074v1.full.pdf

Since traditional reference genomes do not include genetic variation, traditional genotypers suffer from reference bias and poor accuracy in variation-rich regions where reads cannot accurately be mapped.

These methods work by representing genetic variants by their surrounding kmers (sequences with length k covering each variant) and looking for support for these kmers in the sequenced reads.
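
A toy version of that support counting (a tiny k for readability; KAGE's actual implementation is far more engineered):

```python
def kmer_support(read, allele_kmers, k=31):
    """Count how many k-mers characterising each allele occur in a read:
    the alignment-free support described above."""
    read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return {allele: sum(km in read_kmers for km in kmers)
            for allele, kmers in allele_kmers.items()}
```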

KAGE, a genotyper for SNPs and short indels that is inspired by recent developments within graph-based genome representations and alignment-free genotyping.





□ FastMLST: A Multi-core Tool for Multilocus Sequence Typing of Draft Genome Assemblies

>> https://journals.sagepub.com/doi/10.1177/11779322211059238

FastMLST, a tool that is designed to perform PubMLST searches using BLASTn and a divide-and-conquer approach that processes each genome assembly in parallel.

The output offered by FastMLST includes a table with the ST, allelic profile, and clonal complex or clade (when available), detected for a query, as well as a multi-FASTA file or a series of FASTA files with the concatenated or single allele sequences detected.

FastMLST assigns STs to thousands of genomes in minutes with 100% concordance in genomes without suspected contamination in a wide variety of species with different genome lengths, %GC, and assembly fragmentation levels.





□ TRAWLING: a Transcriptome Reference Aware of spLIciNG events.

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471115v1.full.pdf

TRAWLING simplifies and speeds up the identification of splicing events from RNA-seq data, while leveraging the suite of tools developed for alignment-free methods. It allows the aggregation of read counts based on the donor and acceptor splice motifs.

TRAWLING was evaluated using three different RNA sequencing datasets: whole transcriptome sequencing, single cell RNA sequencing and Digital RNA w/ pertUrbation of Genes. Because TRAWLING did not misalign or lose reads, it can be used by default w/o loss of generality for gene-level quantification.





□ DARTS: an Algorithm for Domain-Associated RetroTransposon Search in Genome Assemblies

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471067v1.full.pdf

DARTS identifies long terminal repeat retrotransposons (LTR-RTs) with radically higher sensitivity than the widely accepted LTRharvest tool.

DARTS returns a set of structurally annotated nucleotide and amino acid sequences which can be readily used in subsequent comparative and phylogenetic analyses.




□ pystablemotifs: Python library for attractor identification and control in Boolean networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab825/6454946

pystablemotifs is a Python 3 library for analyzing Boolean networks. Its non-heuristic and exhaustive attractor identification algorithm was previously presented in (Rozum et al. 2021).

The authors illustrate its performance improvements over similar methods and discuss how it uses outputs of the attractor identification process to drive a system to one of its attractors from any initial state.





□ CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471436v1.full.pdf

Combining a modified MinHash technique (ArgMinHash) and a data structure called a k-mer ternary search tree (KTST), which allows Jaccard and containment indices to be computed at multiple k-mer sizes efficiently and simultaneously.

This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient.
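
The truncation trick can be sketched directly on plain k-mer sets (CMash itself works on ternary-search-tree sketches, not full sets):

```python
def truncate_kmers(kmers, k_small):
    """Derive a smaller-k k-mer set by truncating each stored k-mer:
    the trick that lets the index answer queries at several k sizes
    without being rebuilt."""
    return {km[:k_small] for km in kmers}

def containment(a, b):
    """Containment index: fraction of A's k-mers present in B."""
    return len(a & b) / len(a) if a else 0.0
```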

The CMash estimates of the Jaccard and containment indices do not deviate significantly from the ground truth, indicating that this approach can give fast and reliable results with minimal bias.





□ Genovo: A method to build extended sequence context models of point mutations and indels

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471476v1.full.pdf

Genovo is a new method that addresses the sparsity of extended sequence contexts by grouping similar k-mers using IUPAC patterns. It calculates a table with the number of times each possible k-mer is observed with the central base mutated and unmutated.

Genovo predicts the expected number of synonymous, missense, and other functional mutation types for each gene. the created mutation rate models increase the statistical power to detect genes containing disease-causing variants and to identify genes under strong constraint.





□ DALI (Diversity AnaLysis Interface): a novel tool for the integrated analysis of multimodal single cell RNAseq data and immune receptor profiling.

>> https://www.biorxiv.org/content/10.1101/2021.12.07.471549v1.full.pdf

Diversity AnaLysis Interface (DALI) interacts with the Seurat R package and aims to support the advanced bioinformatician with a set of novel methods and easier integration of existing tools for BCR and TCR analysis in their single cell workflow.





□ LEXAS: a web application for life science experiment search and suggestion

>> https://www.biorxiv.org/content/10.1101/2021.12.05.471323v1.full.pdf

LEXAS (Life-science EXperiment seArch and Suggestion) curates the description of biomedical experiments and suggests the experiments on genes that could be performed next.

LEXAS allows users to choose between two machine learning models that are used for the suggestion. One is a “reliable” model that uses seven major biomedical databases such as the BioGRID and four knowledgebases such as the Gene Ontology.





□ MCKAT: a multi-dimensional copy number variant kernel association test

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04494-w

MCKAT utilizes both multi-dimensional features of the CNVs & their heterogeneity effect. The MCKAT is not only capable of indicating stronger evidence in detecting significant associations b/n CNVs & disease-related traits, but it is applicable to both rare & common CNV datasets.





Ritardando.

2021-12-13 22:03:07 | Science News




□ Fugue: Scalable batch-correction method for integrating large-scale single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472307v1.full.pdf

Fugue extends the deep learning method at the heart of the authors' recently published Miscell approach. Miscell learns representations of single-cell expression profiles through contrastive learning and achieves high performance on canonical single-cell analysis tasks.

Fugue encodes the batch information of each cell as a trainable parameter added to its expression profile; a contrastive learning approach is used to learn the feature representation. Fugue can learn smooth embeddings for time-course trajectories and a joint embedding space.





□ FIN: Bayesian Factor Analysis for Inference on Interactions

>> https://www.tandfonline.com/doi/full/10.1080/01621459.2020.1745813

Current methods for quadratic regression are not ideal in these applications due to the level of correlation in the predictors, the fact that strong sparsity assumptions are not appropriate, and the need for uncertainty quantification.

FIN exploits the correlation structure of the predictors, and estimates interaction effects in high dimensional settings. FIN uses a latent factor joint model, which incl. shared factors in both the predictor and response components while assuming conditional independence.





□ Pint: A Fast Lasso-Based Method for Inferring Higher-Order Interactions

>> https://www.biorxiv.org/content/10.1101/2021.12.13.471844v1.full.pdf

Pint performs square-root lasso regression on all pairwise interactions on a one thousand gene screen, using ten thousand siRNAs, in 15 seconds, and all three-way interactions on the same set in under ten minutes.

Pint is based on an existing fast algorithm, which it adapts for use on binary matrices. The three components of the algorithm, pruning, active set calculation, and solving the sub-problem, can all be done in parallel.
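
The pairwise interaction feature space can be sketched as follows (a toy construction; Pint itself uses a packed binary encoding and a parallel square-root-lasso solver):

```python
from itertools import combinations

def pairwise_interaction_matrix(X):
    """Expand a binary design matrix (rows: samples, cols: genes) with
    one column per gene pair, set when both genes are targeted: the
    feature space on which the lasso regression is run."""
    p = len(X[0])
    pairs = list(combinations(range(p), 2))
    Z = [[row[a] & row[b] for (a, b) in pairs] for row in X]
    return Z, pairs
```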





□ TopHap: Rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472454v1.full.pdf

TopHap determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods.

In the TopHap approach, bootstrap branch support for the inferred phylogeny of common haplotypes is calculated by resampling genomes to build bootstrap replicate datasets.

This procedure assesses the robustness of the inferred phylogeny to the inclusion/exclusion of haplotypes likely created by sequencing errors and convergent changes that are expected to have relatively low frequencies spatiotemporally.





□ swCAM: estimation of subtype-specific expressions in individual samples with unsupervised sample-wise deconvolution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab839/6460803

a sample-wise Convex Analysis of Mixtures (swCAM) can accurately estimate subtype-specific expressions of major subtypes in individual samples and successfully extract co-expression networks in particular subtypes that are otherwise unobtainable using bulk expression data.

Fundamental to the success of swCAM solution is the nuclear-norm and l2,1-norm regularized low-rank latent variable modeling.

Hyperparameter values are determined using cross-validation with random entry exclusion, and a swCAM solution is obtained using an efficient alternating direction method of multipliers.





□ Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

>> https://www.biorxiv.org/content/10.1101/2021.12.14.472718v1.full.pdf

The compacted de Bruijn graph forms a vertex-decomposition of the graph, while preserving the graph topology. However, for some applications, only the vertex-decomposition is sufficient, and preservation of the topology is redundant.

For applications such as performing presence-absence queries for k-mers or associating information to the constituent k-mers of the input, any set of strings that preserves the exact set of k-mers from the input sequences can be sufficient.

Relaxing the defining requirement of unitigs, that the paths be non-branching in the underlying graph, and seeking instead a set of maximal non-overlapping paths covering the de Bruijn graph, results in a more compact representation of the input data.

Cuttlefish 2 can seamlessly extract such maximal path covers by simply constraining the algorithm to operate on some specific subgraph(s) of the original graph.





□ Matchtigs: minimum plain text of kmer sets

>> https://www.biorxiv.org/content/10.1101/2021.12.15.472871v1.full.pdf

Matchtigs, a polynomial algorithm computing a minimum representation (which was previously posed as a potentially NP-hard open problem), as well as an efficient near-minimum greedy heuristic.

Matchtigs finds an SPSS (spectrum preserving string set) of minimum size (CL). The SPSS problem allowing repeated kmers is polynomially solvable, based on a many-to-many min-cost path query and a min-cost perfect matching approach.





□ AliSim: A Fast and Versatile Phylogenetic Sequence Simulator For the Genomic Era

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472905v1.full.pdf

AliSim integrates a wide range of evolutionary models, available in the IQ-TREE. AliSim can simulate MSAs that mimic the evolutionary processes underlying empirical alignments.

AliSim implements an adaptive approach that combines the commonly-used rate matrix and probability matrix approach. AliSim works by first generating a sequence at the root of the tree following the stationarity of the model.

AliSim then recursively traverses along the tree to generate sequences at each node of the tree based on the sequence of its ancestral node. AliSim completes this process once all the sequences at the tips are generated.
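
A minimal sketch of that root-to-tip recursion under a Jukes-Cantor model (the tuple-based tree format is an assumption for illustration; AliSim supports far richer models):

```python
import math
import random

BASES = "ACGT"

def jc_p_change(t):
    # Jukes-Cantor probability that a site differs after branch length t
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

def evolve(seq, t, rng):
    # mutate each site independently with the JC change probability
    p = jc_p_change(t)
    return "".join(rng.choice([x for x in BASES if x != b])
                   if rng.random() < p else b for b in seq)

def simulate(node, seq, rng, tips):
    # node = (name, [(branch_len, child_node), ...]); recurse root -> tips
    name, children = node
    if not children:
        tips[name] = seq
    for blen, child in children:
        simulate(child, evolve(seq, blen, rng), rng, tips)

# root sequence drawn from the (uniform) JC stationary distribution
rng = random.Random(0)
root_seq = "".join(rng.choice(BASES) for _ in range(100))
tips = {}
simulate(("root", [(0.1, ("A", [])), (0.1, ("B", []))]), root_seq, rng, tips)
```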





□ ortho2align: a sensitive approach for searching for orthologues of novel lncRNAs

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472946v1.full.pdf

lncRNAs exhibit low sequence conservation, so specific methods for enhancing the signal-to-noise ratio were developed. Nevertheless, current methods such as transcriptome comparison or searches for conserved secondary structures are not applicable to novel lncRNAs by design.

ortho2align — a synteny-based approach for finding orthologues of novel lncRNAs with a statistical assessment of sequence conservation. ortho2align allows control of the specificity of the search process and optional annotation of found orthologues.





□ EmptyNN: A neural network based on positive and unlabeled learning to remove cell-free droplets and recover lost cells in scRNA-seq data

>> https://www.cell.com/patterns/fulltext/S2666-3899(21)00154-9

EmptyNN accurately removed cell-free droplets while recovering lost cell clusters, and achieved areas under the receiver operating characteristic curve of 94.73% and 96.30%, respectively.

EmptyNN takes the raw count matrix as input, where rows represent barcodes and columns represent genes. The output is a list containing a Boolean vector indicating whether each droplet is cell-containing or cell-free, as well as the probability for each droplet.





□ AMAW: automated gene annotation for non-model eukaryotic genomes

>> https://www.biorxiv.org/content/10.1101/2021.12.07.471566v1.full.pdf

Iterative runs of MAKER2 must also be coordinated to aim for accurate predictions, which includes intermediary specific training of different gene predictor models.

AMAW (Automated MAKER2 Annotation Wrapper), a program devised to annotate non-model unicellular eukaryotic genomes by automating the acquisition of evidence data.




□ Pak RT

Merge supply is decreasing.
Watch.

>> https://etherscan.io/token/0x27d270b7d58d15d455c85c02286413075f3c8a31





□ HolistIC: leveraging Hi-C and whole genome shotgun sequencing for double minute chromosome discovery

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab816/6458320

HolistIC can enhance double minute chromosome predictions by predicting DMs with overlapping amplicon coordinates. HolistIC can uncover double minutes, even in the presence of DM segments with overlapping coordinates.

HolistIC is ideal for confirming the true association of amplicons to circular extrachromosomal DNA. it is modular in that the double minute prediction input can be from any program. This lends additional flexibility for future eccDNA discovery algorithms.





□ geneBasis: an iterative approach for unsupervised selection of targeted gene panels from scRNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02548-z

geneBasis, an iterative approach for selecting an optimal gene panel, where each newly added gene captures the maximum distance between the true manifold and the manifold constructed using the currently selected gene panel.

geneBasis allows recovery of local and global variability. geneBasis accounts for batch effect and handles unbalanced cell type composition.

geneBasis constructs k-NN graphs within each batch, thereby assigning nearest neighbors only from the same batch and mitigating technical effects. Minkowski distances per gene are calculated across all cells from every batch, thus resulting in a single scalar value for each gene.
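
A toy reading of that per-gene score (plain lists; batch handling and the k-NN imputation are assumed done upstream):

```python
def per_gene_minkowski(true_expr, knn_pred, p=3):
    """One scalar per gene: the Minkowski (order-p) distance between
    observed expression and its k-NN-imputed value, pooled across all
    cells. Inputs are cells x genes lists of lists."""
    n_genes = len(true_expr[0])
    scores = []
    for g in range(n_genes):
        s = sum(abs(row[g] - pred[g]) ** p
                for row, pred in zip(true_expr, knn_pred))
        scores.append(s ** (1.0 / p))
    return scores
```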





□ scMARK an 'MNIST' like benchmark to evaluate and optimize models for unifying scRNA data

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471773v1.full.pdf

scMARK uses unsupervised models to reduce the complete set of single-cell gene expression matrices into a unified cell-type embedding space. And trains a collection of supervised models to predict author labels from all but one held-out dataset in this unified cell-type space.

Using scMARK, the authors show that scVI is the only tested method that benefits from larger training datasets. Qualitative assessment of the unified cell-type space indicates that the scVI embedding is suitable for automatic cell-type labeling and discovery of new cell-types.





□ DISA tool: discriminative and informative subspace assessment with categorical and numerical outcomes

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471785v1.full.pdf

DISA (Discriminative & Informative Subspace Assessment) is proposed to assess patterns in the presence of numerical outcomes using well-established measures together w/ a novel principle able to statistically assess the correlation gain of the subspace against the overall space.

If DISA receives a numerical outcome, a range of values in which samples are valid is determined. DISA accomplishes this by approximating two probability density functions (e.g. Gaussians), one for all the observed targets and the other with targets of the target subspace.
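
A toy reading of the two-density construction (Gaussians evaluated on a grid; DISA's actual statistical assessment is more involved):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def informative_range(mu_all, sd_all, mu_sub, sd_sub, lo, hi, step=0.01):
    """Grid points where the subspace density exceeds the overall
    density, i.e. outcome values for which subspace membership is
    informative."""
    return [round(lo + i * step, 4)
            for i in range(int((hi - lo) / step))
            if gaussian_pdf(lo + i * step, mu_sub, sd_sub)
               > gaussian_pdf(lo + i * step, mu_all, sd_all)]

# overall targets ~ N(0, 1), subspace targets ~ N(2, 0.5)
valid = informative_range(0.0, 1.0, 2.0, 0.5, -5.0, 5.0)
```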





□ Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471868v1.full.pdf

a new release of StringTie which allows transcriptome assembly and quantification using a hybrid dataset containing both short and long reads.

Hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone.





□ scATAK: Efficient pre-processing of Single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471788v1.full.pdf

The scATAK track module generates group ATAC signal tracks (normalized by the mapped group read counts) from a cell barcode–cell group table and a sample pseudo-bulk alignment file.

The scATAK hic module utilizes a provided bulk HiC or HiChIP interactome map together with a single-cell accessible chromatin region matrix to infer potential chromatin looping events for individual cells and generate group HiC interaction tracks.





□ DeepPlnc: Discovering plant lncRNAs through multimodal deep learning on sequential data

>> https://www.biorxiv.org/content/10.1101/2021.12.10.472074v1.full.pdf

LncRNAs are thought to act as key modulators of various biological processes; their involvement in controlling transcription through enhancers and in providing regulatory binding sites is well reported.

DeepPlnc can accurately annotate even incomplete-length transcripts, which are very common in de novo assembled transcriptomes. It incorporates a bi-modal architecture of Convolutional Neural Nets to extract information from the sequences of nucleotides.




□ A mosaic bulk-solvent model improves density maps and the fit between model and data

>> https://www.biorxiv.org/content/10.1101/2021.12.09.471976v1

The mosaic bulk-solvent model considers solvent variation across the unit cell. The mosaic model is implemented in the computational crystallography toolbox and can be used in Phenix in most contexts where accounting for bulk-solvent is required.

Using the mosaic solvent model improves the overall fit of the model to the data and reduces artifacts in residual maps. The mosaic model algorithm was systematically exercised against a large subset of PDB entries to ensure its robustness and practical utility to improve maps.




□ Coalescent tree recording with selection for fast forward-in-time simulations

>> https://www.biorxiv.org/content/10.1101/2021.12.06.470918v1.full.pdf

The algorithm records the genetic history of a species, directly places the mutations on the tree and infers fitness of subsets of the genome from parental haplotypes. The algorithm explores the tree to reconstruct the genetic data at the recombining segment.

When reproducing, if a segment is transmitted without recombination, then the fitness contribution of this segment in the offspring individual is simply the fitness contribution of the parental segment multiplied by the effects of eventual new mutations.
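
That fitness bookkeeping is a one-liner (hypothetical function name, multiplicative effects assumed):

```python
def offspring_segment_fitness(parent_fitness, new_mutation_effects):
    """Fitness contribution of a segment transmitted without
    recombination: the parental contribution multiplied by the effects
    of any new mutations, as described above."""
    w = parent_fitness
    for effect in new_mutation_effects:
        w *= effect
    return w
```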





□ snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

>> https://f1000research.com/articles/10-567

snpQT: a scalable, stand-alone software pipeline using nextflow and BioContainers, for comprehensive, reproducible and interactive quality control of human genomic data.

snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of data before and after each quality control procedure.





□ High performance of a GPU-accelerated variant calling tool in genome data analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472266v1.full.pdf

Sequencing data were analyzed on the GPU server using BaseNumber, the variant calling outputs of which were compared to the reference VCF or the results generated by the Burrows-Wheeler Aligner (BWA) + Genome Analysis Toolkit (GATK) pipeline on a generic CPU server.

BaseNumber demonstrated high precision (99.32%) and recall (99.86%) rates in variant calls compared to the standard reference. The variant calling outputs of the BaseNumber and GATK pipelines were very similar, with a mean F1 of 99.69%.
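
The reported precision, recall, and F1 follow the standard definitions:

```python
def precision_recall_f1(tp, fp, fn):
    """Variant-calling metrics from true/false positives and false
    negatives, as used in the comparison above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```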




□ treedata.table: a wrapper for data.table that enables fast manipulation of large phylogenetic trees matched to data

>> https://peerj.com/articles/12450/

treedata.table, the first R package extending the functionality and syntax of data.table to explicitly deal with phylogenetic comparative datasets.

treedata.table significantly increases speed and reproducibility during the data manipulation involved in the phylogenetic comparative workflow. After an initial tree/data matching step, treedata.table continuously preserves the tree/data matching across data.table operations.





□ tRForest: a novel random forest-based algorithm for tRNA-derived fragment target prediction

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472430v1.full.pdf

A significant advantage of using random forests is that they avoid overfitting, a common limitation of machine learning algorithms in which they become tailored specifically to the dataset they were trained on and thus become less predictive in independent datasets.

tRForest, a tRF target prediction algorithm built using the random forest machine learning algorithm. This algorithm predicts targets for all tRFs, including tRF-1s, and includes a broad range of features to fully capture tRF-mRNA interactions.





□ Flimma: a federated and privacy-aware tool for differential gene expression analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02553-2

Flimma (Federated Limma Voom) preserves the privacy of the local data, since the expression profiles never leave the local execution sites.

In contrast to meta-analysis approaches, Flimma is particularly robust against heterogeneous distributions of data across the different cohorts, which makes it a powerful alternative for multi-center studies where patient privacy matters.





□ GREPore-seq: A Robust Workflow to Detect Changes after Gene Editing through Long-range PCR and Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472514v1.full.pdf

GREPore-seq captures the barcoded sequences by grepping reads of nanopore amplicon sequencing. GREPore-seq combines indel-correcting DNA barcodes with the sequencing of long amplicons on the ONT platforms.

GREPore-seq can detect NHEJ-mediated double-stranded oligodeoxynucleotide (dsODN) insertions with comparable accuracy to Illumina NGS. GREPore-seq also identifies HDR-mediated large gene knock-ins, which correlate well with FACS analysis data.





□ CellOT: Learning Single-Cell Perturbation Responses using Neural Optimal Transport

>> https://www.biorxiv.org/content/10.1101/2021.12.15.472775v1.full.pdf

Leveraging the theory of optimal transport and recent advances in convex neural architectures, the authors learn a coupling describing the response of cell populations upon perturbation, enabling prediction of state trajectories at the single-cell level.

CellOT, a novel approach to predict single-cell perturbation responses by uncovering couplings between control and perturbed cell states while accounting for heterogeneous subpopulation structures of molecular environments.





□ splatPop: simulating population scale single-cell RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02546-1

splatPop, a model for flexible, reproducible, and well-documented simulation of population-scale scRNA-seq data with known expression quantitative trait loci. splatPop can also be instructed to assign pairs of eGenes the same eSNP.

The splatPop model utilizes the flexible framework of Splatter, and can simulate complex batch, cell group, and conditional effects between individuals from different cohorts as well as genetically-driven co-expression.





□ Nfeature: A platform for computing features of nucleotide sequences

>> https://www.biorxiv.org/content/10.1101/2021.12.14.472723v1.full.pdf

Nfeature comprises of three major modules namely Composition, Correlation, and Binary profiles. Composition module allow to compute different type of compositions that includes mono-/di-tri-nucleotide composition, reverse complement composition, pseudo composition.

Correlation module allow to compute various type of correlations that includes auto-correlation, cross-correlation, pseudo-correlation. Similarly, binary profile is developed for computing binary profile based on nucleotides, di-nucleotides, di-/tri-nucleotide properties.

Nfeature also computes sequence entropy, repeats, and the distribution of nucleotides within sequences. In total, the tool computes 29,217 features for DNA and 14,385 features for RNA sequences.
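As a minimal sketch of two of these feature types (not Nfeature's own code), the snippet below computes overlapping dinucleotide composition and Shannon entropy for a DNA sequence:

```python
from collections import Counter
from itertools import product
from math import log2

def dinucleotide_composition(seq):
    # Fraction of each of the 16 dinucleotides among overlapping pairs
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    return {"".join(d): counts["".join(d)] / total
            for d in product("ACGT", repeat=2)}

def shannon_entropy(seq):
    # Entropy (in bits) of the mononucleotide distribution
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

comp = dinucleotide_composition("ACGTACGT")   # e.g. "AC" occurs 2 of 7 pairs
ent = shannon_entropy("ACGTACGT")             # uniform over ACGT -> 2 bits
```

Nfeature's thousands of features are largely systematic expansions of counts like these over k-mers, properties, and positional windows.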





□ GENPPI: standalone software for creating protein interaction networks from genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04501-0

GENPPI helps close the gap between the considerable number of novel genomes assembled each month and our ability to compute interaction networks that include noncore genes for all completed genome versions.

GENPPI transfers the question of topological annotation from centralized databases to the final user, the researcher, at the starting point of research. The amount of topological annotation GENPPI provides is directly proportional to the number of genomes used to create it.





□ Sim-it: A benchmark of structural variation detection by long reads through a realistic simulated model

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02551-4

Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it reveal the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms.

combiSV, a new method that combines the results of structural variation callers into a superior call set with increased recall and precision, a gain also observed on the latest structural variation benchmark set.
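combiSV's actual merging logic is more involved; a naive consensus by reciprocal overlap and caller support, assuming calls are given as (start, end) intervals on one chromosome and a 50% reciprocal-overlap threshold (both assumptions for illustration), can be sketched as:

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    # a, b: (start, end) intervals on the same chromosome
    ov = min(a[1], b[1]) - max(a[0], b[0])
    if ov <= 0:
        return False
    return ov / (a[1] - a[0]) >= min_frac and ov / (b[1] - b[0]) >= min_frac

def combine_calls(callsets, min_support=2):
    # Keep every call reciprocally overlapped by >= min_support callers
    # (one representative per caller; no interval merging is attempted)
    all_calls = [(i, c) for i, cs in enumerate(callsets) for c in cs]
    merged = []
    for i, call in all_calls:
        support = {i} | {j for j, other in all_calls
                         if j != i and reciprocal_overlap(call, other)}
        if len(support) >= min_support and call not in merged:
            merged.append(call)
    return merged

caller_a = [(1000, 1500), (5000, 5100)]
caller_b = [(1020, 1490)]                 # supports caller_a's first call
consensus = combine_calls([caller_a, caller_b])
```

The singleton call (5000, 5100) is dropped for lacking support, which is the intuition behind the recall/precision gain of a combined call set.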





□ seGMM: a new tool to infer sex from massively parallel sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472877v1.full.pdf

seGMM, a new sex-inference tool that determines the sex of a sample from called genotype data integrated with aligned reads, jointly considering information on the X and Y chromosomes across diverse genomic data, including TGS panel data.

seGMM applies Gaussian Mixture Model (GMM) clustering to classify the samples into different clusters. seGMM provides a reproducible framework to infer sex from massively parallel sequencing data and has great promise in clinical genetics.
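The clustering step can be illustrated with a minimal one-dimensional two-component EM fit (a toy stand-in for seGMM's GMM, applied here to invented normalized Y-chromosome coverage values, not seGMM's actual features or code):

```python
import numpy as np

def fit_gmm_1d(x, n_iter=100):
    # Minimal EM for a two-component 1-D Gaussian mixture
    mu = np.array([x.min(), x.max()], float)      # spread-apart init
    sigma = np.array([x.std(), x.std()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities under each component (shared constants cancel)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(1, keepdims=True)
        # M-step: update weights, means, standard deviations
        nk = r.sum(0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk) + 1e-6
    return mu, r.argmax(1)

# Toy normalized Y-chromosome coverage: ~0 for XX samples, ~0.5 for XY
rng = np.random.default_rng(2)
y_cov = np.concatenate([rng.normal(0.02, 0.01, 40), rng.normal(0.5, 0.05, 40)])
mu, labels = fit_gmm_1d(y_cov)
```

With well-separated clusters like these, the component assignment recovers the two sexes; seGMM additionally folds in X-chromosome heterozygosity and related features.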





□ FourierDist: HarmonicNet: Fully Automatic Cell Segmentation with Fourier Descriptors

>> https://www.biorxiv.org/content/10.1101/2021.12.17.472408v1.full.pdf

FourierDist, a network that modifies the popular StarDist and SplineDist architectures. FourierDist utilizes Fourier descriptors, predicting for every pixel of the image a coefficient vector that implicitly defines the resulting segmentation.

FourierDist is also capable of accurately segmenting objects that are not star-shaped, a case where StarDist performs suboptimally.
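Independent of the network itself, the Fourier-descriptor representation it predicts can be sketched by treating a closed contour as complex points and truncating its FFT (a generic construction for illustration, not the paper's code):

```python
import numpy as np

def fourier_descriptors(contour, n_coeffs=8):
    # Represent a closed 2-D contour as complex points; keep only the
    # low-frequency Fourier coefficients that describe its overall shape
    z = contour[:, 0] + 1j * contour[:, 1]
    F = np.fft.fft(z) / len(z)
    freqs = np.fft.fftfreq(len(z), 1 / len(z)).astype(int)
    return np.where(np.abs(freqs) <= n_coeffs, F, 0)

def reconstruct(F_trunc):
    # Inverse FFT recovers an approximation of the contour
    z = np.fft.ifft(F_trunc * len(F_trunc))
    return np.stack([z.real, z.imag], axis=1)

# A circle is exactly a single harmonic, so a low-order truncation is lossless;
# non-star-shaped contours are likewise representable, unlike with ray casting
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
approx = reconstruct(fourier_descriptors(circle, n_coeffs=4))
```

Because the descriptor encodes the contour globally rather than as radial distances from a center, shapes that are not star-shaped remain representable, which is the advantage over StarDist noted above.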





□ Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04247-9

Firstly, for entity identification and classification, they implemented two bidirectional Long Short-Term Memory (Bi-LSTM) layers with a CRF layer based on the NeuroNER model. The architecture of this model consists of a first Bi-LSTM layer for character embeddings.

In the second layer, they concatenate the output of the first layer with the word embeddings and sense-disambiguated embeddings for the second Bi-LSTM layer. Finally, the last layer uses a CRF to obtain the most suitable labels for each token.
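The CRF layer's decoding step can be sketched with plain Viterbi over emission and transition scores (the toy scores and three-label BIO scheme below are invented for illustration; real models learn these scores):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    # Viterbi decoding as used by a CRF output layer: find the label
    # sequence maximizing the sum of emission and transition scores
    n_steps, n_labels = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n_steps, n_labels), int)
    for t in range(1, n_steps):
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(0)
        score = cand.max(0)
    # Trace back the best-scoring path
    path = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Toy example: labels 0=O, 1=B-ENT, 2=I-ENT; I-ENT after O is penalized
emis = np.array([[2.0, 1.0, 0.0],
                 [0.0, 0.5, 2.0],
                 [1.5, 0.0, 1.0]])
trans = np.array([[0.0, 0.0, -10.0],   # O  -> {O, B, I}
                  [0.0, 0.0, 1.0],     # B  -> {O, B, I}
                  [0.0, 0.0, 0.5]])    # I  -> {O, B, I}
labels = viterbi_decode(emis, trans)   # globally consistent label sequence
```

The point of the CRF over per-token softmax is exactly this: the large negative O-to-I transition prevents locally attractive but structurally invalid label sequences.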




Devinity.

2021-12-12 22:12:13 | Science News


In describing the properties and behavior of a single system,
by relying solely on the plausibility of known mechanisms,
we must not forget the possibility that part or all of its structure
is being constantly overlooked.