lens, align.

Long is the time, but the true comes to pass.

Provenance.

2021-12-13 22:13:17 | Science News




□ STELLAR: Annotation of Spatially Resolved Single-cell Data

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469947v1.full.pdf

STELLAR (SpaTial cELl LeARning), a geometric deep learning tool for cell-type discovery and identification in spatially resolved single-cell datasets. STELLAR uses a graph convolutional encoder to learn low-dimensional cell embeddings that capture cell topology.

STELLAR learns latent low-dimensional cell representations that jointly capture spatial and molecular similarities of cells that are transferable across different biological contexts.

STELLAR automatically assigns cells to cell types included in the reference set and also identifies cells with unique properties as belonging to a novel type that is not part of the reference set.

The encoder network in STELLAR consists of one fully-connected layer with ReLU activation and a graph convolutional layer, with a hidden dimension of 128 in all layers. It uses the Adam optimizer with an initial learning rate of 10^-3 and weight decay 0.
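A minimal PyTorch sketch of an encoder in this spirit, assuming a dense, row-normalized adjacency matrix in place of a sparse graph library; the input size and training data are illustrative, not the authors' implementation:

```python
# Minimal sketch of a STELLAR-style graph encoder (pure PyTorch assumption).
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    def __init__(self, n_features, hidden=128):
        super().__init__()
        self.fc = nn.Linear(n_features, hidden)  # fully-connected layer (ReLU)
        self.gc = nn.Linear(hidden, hidden)      # weights of the graph conv layer

    def forward(self, x, adj_norm):
        # adj_norm: row-normalized cell-graph adjacency with self-loops
        h = torch.relu(self.fc(x))
        return adj_norm @ self.gc(h)             # one graph convolution

model = GraphEncoder(n_features=48)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
```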





□ Sparse: Rapid, Reference-Free Human Genotype Imputation with Denoising Autoencoders

>> https://www.biorxiv.org/content/10.1101/2021.12.01.470739v1.full.pdf

Sparse, de-noising autoencoders spanning all bi-allelic SNPs observed in the Haplotype Reference Consortium were developed and optimized.

A generalized approach to unphased human genotype imputation using sparse, denoising autoencoders capable of highly accurate genotype imputation at genotype masking levels (98%+) appropriate for array-based genotyping and low-pass sequencing-based population genetics initiatives.

After merging the results from all genomic segments, the whole chromosome accuracy of autoencoder-based imputation remained superior to all HMM-based imputation tools, across all independent test datasets, and all genotyping array marker sets.

Inference time scales only with the number of variants to be imputed, whereas HMM-based inference time depends on both reference panel and the number of variants to be imputed.
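A hypothetical sketch of the masking-and-reconstruction training idea, with genotypes coded 0/1/2 and the 98% masking level described above; the architecture and loss are illustrative stand-ins, not the paper's model:

```python
# Denoising-autoencoder imputation sketch (toy sizes).
import torch
import torch.nn as nn

n_snps, hidden = 1000, 256
model = nn.Sequential(nn.Linear(n_snps, hidden), nn.ReLU(),
                      nn.Linear(hidden, n_snps))
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

def train_step(x):
    # mask ~98% of genotypes, then learn to reconstruct the originals
    mask = torch.rand_like(x) < 0.98
    loss = loss_fn(model(x.masked_fill(mask, 0.0)), x)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

x = torch.randint(0, 3, (64, n_snps)).float()   # random 0/1/2 genotypes
print(train_step(x))
```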





□ Parity and time reversal elucidate both decision-making in empirical models and attractor scaling in critical Boolean networks

>> https://www.science.org/doi/10.1126/sciadv.abf8124

New applications of parity inversion and time reversal to the emergence of complex behavior from simple dynamical rules in stochastic discrete models. These applications underpin a novel attractor identification algorithm implemented for Boolean networks under stochastic dynamics.

Its speed enables resolving a long-standing open question of how attractor count in critical random Boolean networks scales with network size and whether the scaling matches biological observations.

The parity-based encoding of causal relationships and time-reversal construction efficiently reveal discrete analogs of stable and unstable manifolds.

The time reversal of stochastically asynchronous Boolean systems identifies subsets of the state space that cannot be reached from outside. Using parity and time-reversal transformations in tandem, this algorithm efficiently identifies all attractors of large-scale Boolean systems.





□ EXMA: A Genomics Accelerator for Exact-Matching

>> https://arxiv.org/pdf/2101.05314.pdf

EXMA enhances FM-Index search throughput. EXMA first creates a novel table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation.

The EXMA accelerator connects to four DRAM channels, improves search throughput by 4.9×, and enhances search throughput per Watt by 4.8×. EXMA adopts the state-of-the-art Tangram neural network accelerator as the inference engine.
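For orientation, the textbook FM-index backward search that EXMA accelerates, in plain Python; this processes one symbol per step, whereas the paper's learned table consumes several symbols per DRAM row activation:

```python
# Classic FM-index backward search over a BWT string (reference sketch).
def backward_search(bwt, pattern):
    alphabet = sorted(set(bwt))
    occ = {c: [0] for c in alphabet}          # occ[c][i]: count of c in bwt[:i]
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    C, total = {}, 0                          # C[c]: # of symbols < c in text
    for c in alphabet:
        C[c] = total
        total += occ[c][-1]
    lo, hi = 0, len(bwt)                      # current suffix-array interval
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo, hi = C[c] + occ[c][lo], C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo                            # number of occurrences

text = "banana$"
bwt = "".join(r[-1] for r in sorted(text[i:] + text[:i] for i in range(len(text))))
print(backward_search(bwt, "ana"))            # -> 2
```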





□ MIRA: Joint regulatory modeling of multimodal expression and chromatin accessibility in single cells

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471401v1.full.pdf

MIRA: Probabilistic Multimodal Models for Integrated Regulatory Analysis, a comprehensive methodology that systematically contrasts transcription and accessibility to determine the regulatory circuitry driving cells along developmental continuums.

MIRA leverages joint topic modeling of cell states and regulatory potential modeling of individual gene loci.

MIRA represents cell states in an interpretable latent space, infers high fidelity lineage trees, determines key regulators of fate decisions at branch points, and exposes the variable influence of local accessibility on transcription at distinct loci.





□ scGTM: Single-cell generalized trend model: a flexible and interpretable model of gene expression trend along cell pseudotime

>> https://www.biorxiv.org/content/10.1101/2021.11.25.470059v1.full.pdf

scGTM can provide more informative and interpretable gene expression trends than the GAM and GLM when the count outcome comes from the Poisson, ZIP, NB or ZINB distributions.

scGTM robustly captures the hill-shaped trends for the four distributions and consistently estimates the change time around 0.75, which is where the MAOA gene reaches its expected maximum expression.

The scGTM parameters are estimated by constrained maximum likelihood estimation via the particle swarm optimization (PSO) metaheuristic.

scGTM is only applicable to a single pseudotime trajectory. A natural extension is to split a multiple-lineage cell trajectory into single lineages and fit the scGTM to each lineage separately. A variant of PSO or other metaheuristic algorithms also remains to be developed.
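A toy numpy PSO run fitting the change time of a hill-shaped Poisson trend; the trend parameterization, bounds, and objective are simplified stand-ins for scGTM's constrained likelihood:

```python
# Particle swarm optimization sketch (illustrative objective, not scGTM's).
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 200)                                       # pseudotime
y = 20 * np.exp(-30 * (t - 0.75) ** 2) + rng.poisson(1, t.size)  # hill at 0.75

def neg_log_lik(p):                     # Poisson NLL of a hill-shaped mean
    peak, t0, w = p
    mu = 1 + peak * np.exp(-w * (t - t0) ** 2)
    return np.sum(mu - y * np.log(mu))

lo_b, hi_b = np.array([0.1, 0.0, 1.0]), np.array([50.0, 1.0, 100.0])
pos = rng.uniform(lo_b, hi_b, (30, 3))
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([neg_log_lik(p) for p in pos])
for _ in range(200):
    gbest = pbest[pbest_f.argmin()]
    vel = (0.7 * vel + 1.5 * rng.random((30, 1)) * (pbest - pos)
                     + 1.5 * rng.random((30, 1)) * (gbest - pos))
    pos = np.clip(pos + vel, lo_b, hi_b)      # enforce the box constraints
    f = np.array([neg_log_lik(p) for p in pos])
    better = f < pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
print("estimated change time:", round(pbest[pbest_f.argmin()][1], 2))  # ~0.75
```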





□ ECLIPSER: identifying causal cell types and genes for complex traits through single cell enrichment of e/sQTL-mapped genes in GWAS loci

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469720v1.full.pdf

ECLIPSER (Enrichment of Causal Loci and Identification of Pathogenic cells in Single Cell Expression and Regulation data) maps genes to GWAS loci for a given trait using s/eQTL data and other functional information.

ECLIPSER prioritizes causal genes in GWAS loci driving the enrichment signal in the specific cell types for experimental follow-up.

ECLIPSER is a computational framework that can be applied to single cell or single nucleus (sc/sn)RNA-seq data from multiple tissues and to multiple complex diseases and traits with discovered GWAS associations, and does not require genotype data from the e/sQTL.





□ Heron: Dynamic Pooling Improves Nanopore Base Calling Accuracy

>> https://ieeexplore.ieee.org/document/9616376/

Heron, a high-accuracy GPU nanopore basecaller. Heron uses a dynamic pooling approach that is continuous and differentiable almost everywhere.

Heron time-warps the signal using fractional distances in the pooling space.

• feature vector: f_i = f(x_i) ∈ (0, 1)^C
• point importance: w_i = w(x_i) ∈ (0, 1)
• length factor: m_i = m(x_i) ∈ (0, 1)
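A simplified 1-D numpy reading of these quantities (an assumed simplification: cumulative length factors m_i place each frame at a fractional pooled position, and frames are averaged per unit-width bin):

```python
# Dynamic-pooling sketch: time-warp a signal via per-frame length factors.
import numpy as np

def dynamic_pool(features, m):
    # features: (T, C) frames; m: (T,) length factors in (0, 1)
    pooled_pos = np.cumsum(m)                 # fractional position of each frame
    n_bins = int(np.ceil(pooled_pos[-1]))
    out = np.zeros((n_bins, features.shape[1]))
    counts = np.zeros(n_bins)
    bins = np.minimum(pooled_pos.astype(int), n_bins - 1)
    for b, f in zip(bins, features):          # average frames landing in a bin
        out[b] += f
        counts[b] += 1
    return out / np.maximum(counts, 1)[:, None]

signal = np.random.rand(1000, 8)
print(dynamic_pool(signal, np.full(1000, 0.3)).shape)   # -> (300, 8)
```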

Another intriguing goal is to extend dynamic pooling to multiple dimensions.





□ scCODA: a Bayesian model for compositional single-cell data analysis

>> https://www.nature.com/articles/s41467-021-27150-6

scCODA allows for identification of compositional changes in high-throughput sequencing count data, especially cell compositions from scRNA-seq. It also provides a framework for integration of cell-type annotated data directly from scanpy and other sources.

scCODA framework models cell-type counts with a hierarchical Dirichlet-Multinomial distribution that accounts for the uncertainty in cell-type proportions and the negative correlative bias via joint modeling of all measured cell-type proportions instead of individual ones.
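The Dirichlet-Multinomial log-likelihood at the core of such hierarchical models, as a scipy sketch (not the scCODA code; counts and concentrations are toy values):

```python
# Dirichlet-Multinomial log-pmf for cell-type counts (reference formula).
import numpy as np
from scipy.special import gammaln

def dirmult_logpmf(x, alpha):
    # x: cell-type counts for one sample; alpha: concentration parameters
    n, a0 = x.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()
            + gammaln(a0) - gammaln(n + a0)
            + (gammaln(x + alpha) - gammaln(alpha)).sum())

counts = np.array([120, 35, 8, 2])
alpha = np.array([10.0, 3.0, 1.0, 0.5])
print(dirmult_logpmf(counts, alpha))
```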





□ Hubness reduction improves clustering and trajectory inference in single-cell transcriptomic data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab795/6433673

Considering a collection of datasets from the ARCHS4 repository, the authors constructed k-NN graphs with or without hubness reduction, then ran the Louvain algorithm and calculated the modularity of the resulting clustering.

The Reverse-Coverage approach, a method based on the size of the respective incoming neighborhoods, retrieves hubs in a more robust way. Hubness reduction can be used instead of dimensionality reduction, in order to compensate for certain manifestations of the dimensionality curse.
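A quick way to surface hubs with nothing beyond scikit-learn: count how often each cell appears in other cells' k-NN lists (the in-degree), which for hub points far exceeds k:

```python
# In-degree of a k-NN graph as a simple hub detector (toy data).
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 2000)                  # cells x genes
k = 15
_, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
in_degree = np.bincount(idx[:, 1:].ravel(), minlength=X.shape[0])  # drop self
print("max in-degree:", in_degree.max(), "vs. expected", k)
```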





□ DeepSNEM: Deep Signaling Network Embeddings for compound mechanism of action identification

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470365v1.full.pdf

deepSNEM, a novel unsupervised graph deep learning pipeline to encode the information in the compound-induced signaling networks in fixed-length high-dimensional representations.

The core of deepSNEM is a graph transformer network, trained to maximize the mutual information between whole-graph and sub-graph representations that belong to similar perturbations. The 256-dimensional deepSNEM-GT-MI embeddings were clustered using the k-means algorithm.





□ IReNA: integrated regulatory network analysis of single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469628v1.full.pdf

IReNA integrates both bulk and single-cell RNA-seq data with bulk ATAC-seq data to reconstruct modular regulatory networks which provide key transcription factors and intermodular regulations.

IReNA uses Monocle to construct the trajectory and calculate the pseudotime of single cells. IReNA calculates smoothed expression profiles based on pseudotime and divides DEGs into different modules using K-means clustering of the smoothed expression profiles.

IReNA calculates expression correlation (Pearson's correlation) for each pair of DEGs and selects highly correlated gene pairs which contain at least one transcription factor from the TRANSFAC database as potential regulatory relationships.
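A schematic numpy version of that pair-selection step; gene names, the TF set, and the 0.9 cutoff are placeholders:

```python
# Candidate regulatory pairs: highly correlated DEG pairs containing a TF.
import numpy as np

expr = np.random.rand(100, 50)           # genes x pseudotime bins (smoothed)
genes = [f"g{i}" for i in range(100)]
tf_set = {"g3", "g7", "g42"}             # hypothetical TRANSFAC factors
corr = np.corrcoef(expr)                 # Pearson correlation, gene x gene
pairs = [(genes[i], genes[j], corr[i, j])
         for i in range(len(genes)) for j in range(i + 1, len(genes))
         if abs(corr[i, j]) > 0.9 and ({genes[i], genes[j]} & tf_set)]
print(len(pairs), "candidate regulatory pairs")
```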






□ UNIFAN: Unsupervised cell functional annotation for single-cell RNA-Seq

>> https://www.biorxiv.org/content/10.1101/2021.11.20.469410v1.full.pdf

UNIFAN (Unsupervised Single-cell Functional Annotation) simultaneously clusters and annotates cells with known biological processes, including pathways.

UNIFAN uses an autoencoder that outputs a low-dimensional representation learned from the expression of all genes. UNIFAN combines both the low-dimensional representation and the gene set activity scores to determine the cluster for each cell.





□ Meta-NanoSim: Characterization and simulation of metagenomic nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.19.469328v1.full.pdf

Meta-NanoSim characterizes read length distributions, error profiles, and alignment ratio models. It also detects chimeric read artifacts and quantifies an abundance profile. Meta-NanoSim calculates the deviation between expected and estimated abundance levels.

Meta-NanoSim significantly reduced the length of the unaligned regions. Meta-NanoSim uses kernel density estimation learnt from empirical reads.

Meta-NanoSim records the aligned bases for each sub-alignment towards their source genome, and then uses an EM algorithm to iteratively assign multi-aligned segments proportionally to their putative source genomes.
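A minimal EM sketch of that assignment step, with a toy read-genome compatibility matrix standing in for parsed alignments:

```python
# EM for proportional assignment of multi-aligned reads to source genomes.
import numpy as np

# compat[r, g] = 1 if read r aligns to genome g
compat = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]], float)
abund = np.full(compat.shape[1], 1 / compat.shape[1])
for _ in range(100):
    w = compat * abund                   # E-step: weight alignments by abundance
    w /= w.sum(axis=1, keepdims=True)    # fractional assignment per read
    abund = w.sum(axis=0) / w.sum()      # M-step: re-estimate abundances
print(abund.round(3))
```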





□ KCOSS: an ultra-fast k-mer counter for assembled genome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab797/6443080

KCOSS performs k-mer counting, mainly for assembled genomes, using a segmented Bloom filter, a lock-free queue, a lock-free thread pool, and a cuckoo hash table.

KCOSS optimizes running time and memory consumption by recycling memory blocks, merging multiple consecutive first-occurrence k-mers into C-read, and writing a set of C-reads to disk asynchronously.
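A toy Python rendering of the first-occurrence idea, with a set standing in for the segmented Bloom filter: k-mers seen only once never enter the hash table:

```python
# First-occurrence filter before the hash table (KCOSS-like, much simplified).
def count_kmers(seq, k):
    seen_once, counts = set(), {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
        elif kmer in seen_once:          # second occurrence: promote to table
            seen_once.remove(kmer)
            counts[kmer] = 2
        else:                            # first occurrence: filter only
            seen_once.add(kmer)
    counts.update({kmer: 1 for kmer in seen_once})
    return counts

print(count_kmers("ACGTACGTAC", 4))      # ACGT/CGTA/GTAC occur twice, TACG once
```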





□ On Hilbert evolution algebras of a graph

>> https://arxiv.org/pdf/2111.07399v1.pdf

Hilbert evolution algebras generalize the concept to the framework of Hilbert spaces, which makes it possible to deal with a wide class of infinite-dimensional spaces.

The paper defines the Hilbert evolution algebra associated to a given graph and the Hilbert evolution algebra associated to the symmetric random walk on a graph. These definitions extend to graphs with infinitely many vertices a theory developed for evolution algebras associated to finite graphs.
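For orientation, the standard finite-graph definitions being generalized, stated here from the general evolution-algebra literature (not from the paper): for a graph with adjacency matrix A = (a_ij) and a basis {e_i} indexed by vertices,

```latex
\[
  e_i \cdot e_j = 0 \quad (i \neq j), \qquad
  e_i \cdot e_i = \sum_{j} a_{ij}\, e_j ,
\]
% and, for the evolution algebra of the symmetric random walk on the graph,
\[
  e_i \cdot e_i = \sum_{j} \frac{a_{ij}}{\deg(i)}\, e_j .
\]
```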





□ Higher rank graphs from cube complexes and their spectral theory

>> https://arxiv.org/pdf/2111.09120v1.pdf

There is a strong connection between geometry of CW-complexes, groups and semigroup actions, higher rank graphs and the theory of C∗-algebras.

The difficulty is that there are many ways to associate C∗-algebras to groups, semigroups and CW-complexes, and this can lead to both isomorphic and non-isomorphic C∗-algebras.

A generalisation of the Cuntz-Krieger algebras from topological Markov shifts. A combinatorial definition of a finite k-graph Λ which is decoupled from geometrical realisations.

The existence of an infinite family of combinatorial k-graphs constructed from k-cube complexes. Aperiodicity of a higher rank graph is an important property, because together with cofinality it implies pure infiniteness if every vertex can be reached from a loop with an entrance.





□ Theory of local k-mer selection with applications to long-read alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab790/6432031

os-minimap2: minimap2 with open syncmer capabilities. Investigating how different parameterizations lead to runtime and alignment quality trade-offs for ONT cDNA mapping.

the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment.

Deriving an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers.
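An open-syncmer selector in plain Python, using the common convention that a k-mer is selected when its minimal s-mer sits at offset 0 (a process-local hash stands in for a random ordering of s-mers):

```python
# Open-syncmer selection over a sequence (definition sketch).
def open_syncmers(seq, k, s):
    picked = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # selected iff the minimal s-mer occurs at offset 0
        if min(range(len(smers)), key=lambda j: hash(smers[j])) == 0:
            picked.append((i, kmer))
    return picked

print(open_syncmers("ACGTACGTGGTACCA", k=7, s=3))
```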





□ CellVGAE: an unsupervised scRNA-seq analysis workflow with graph attention networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab804/6448212

CellVGAE leverages the connectivity between cells as an inductive bias to perform convolutions on a non-Euclidean structure, thus subscribing to the geometric deep learning paradigm.

CellVGAE can intrinsically capture information such as pseudotime and NF-κB activation dynamics, the latter being a property that is not generally shared by existing neural alternatives. CellVGAE learns to reconstruct the original graph from the lower-dimensional latent space.





□ Portal: Adversarial domain translation networks enable fast and accurate large-scale atlas-level single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468892v1.full.pdf

Portal, a unified framework of adversarial domain translation to learn harmonized representations of datasets. Portal preserves biological variation during integration, while having significantly reduced running time and memory, achieving integration of millions of cells.

Portal can accurately align cells from complex tissues profiled by scRNA-seq and single-nucleus RNA sequencing (snRNA-seq), and also perform cross-species alignment of the gradient of cells.

Portal can focus only on merging cells of high probability to be of domain-shared cell types, while it remains inactive on cells of domain-unique cell types.

Portal leverages three regularizers to help it find correct and consistent correspondence across domains, including the autoencoder regularizer, the latent alignment regularizer and the cosine similarity regularizer.





□ Polarbear: Semi-supervised single-cell cross-modality translation

>> https://www.biorxiv.org/content/10.1101/2021.11.18.467517v1.full.pdf

Polarbear uses single-assay and co-assay data to train an autoencoder for each modality and then uses just the co-assay data to train a translator between the embedded representations learned by the autoencoders.

Polarbear is able to translate between modalities with improved accuracy relative to BABEL. Polarbear trains one VAE for each type of data, while taking into consideration sequencing depth and batch factors.





□ sc-SynO: Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04469-x

sc-SynO is based on the LoRAS (Localized Random Affine Shadowsampling) algorithm applied to single-cell data. The algorithm corrects for the overall imbalance ratio of the minority and majority class.

The LoRAS algorithm generates synthetic samples from convex combinations of multiple shadowsamples generated from the rare cell types. The shadowsamples are obtained by adding Gaussian noise to features representing the rare cells.
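A compact numpy sketch of that procedure; the noise level, shadowsample count, and 3-point convex combinations are illustrative choices:

```python
# LoRAS-style synthetic oversampling of a rare cell type (toy sketch).
import numpy as np

def loras_samples(minority, n_new, n_shadow=5, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # shadowsamples: Gaussian-jittered copies of the rare-cell vectors
    shadows = np.concatenate(
        [minority + rng.normal(0, sigma, minority.shape) for _ in range(n_shadow)])
    out = []
    for _ in range(n_new):
        picks = shadows[rng.choice(len(shadows), size=3, replace=False)]
        w = rng.dirichlet(np.ones(3))        # convex-combination weights
        out.append(w @ picks)
    return np.array(out)

rare = np.random.rand(20, 10)                # 20 rare cells, 10 features
print(loras_samples(rare, n_new=100).shape)  # -> (100, 10)
```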





□ Graph-sc: GNN-based embedding for clustering scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab787/6432030

Graph-sc, a method modeling scRNA-seq data as a graph, processed with a graph autoencoder network to create representations (embeddings) for each cell. The resulting embeddings are clustered with a general clustering algorithm to produce cell class assignments.

Graph-sc is stable across consecutive runs, robust to input down-sampling, generally insensitive to changes in the network architecture or training parameters and more computationally efficient than other competing methods based on neural networks.





□ Asc-Seurat: analytical single-cell Seurat-based web application

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04472-2

Asc-Seurat provides: quality control, by the exclusion of low-quality cells & potential doublets; data normalization, incl. log normalization and SCTransform; dimension reduction; and clustering of the cell populations, incl. selection or exclusion of clusters and re-clustering.

Asc-Seurat is built on three analytical cores. Using Seurat, users explore scRNA-seq data to identify cell types, markers, and DEGs. Dynverse allows the evaluation and visualization of developmental trajectories and identifies DEGs on these trajectories.





□ sc-CGconv: A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468695v1.full.pdf

sc-CGconv, a new robust-equitable copula correlation (Ccor) measure for constructing a cell-cell graph, leveraging the scale-invariant property of the copula while reducing the computational cost of processing large datasets through structure-aware locality-sensitive hashing (LSH).

sc-CGconv preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through the copula correlation measure, and provides a topology-preserving embedding of cells in low-dimensional space.





□ PHONI: Streamed Matching Statistics with Multi-Genome References

>> https://ieeexplore.ieee.org/document/9418770/

PHONI (Practical Heuristic ON Incremental matching statistics computation) uses longest-common-extension (LCE) queries to compute the len values at the same time as it computes the pos values.

The matching statistics MS of a pattern P [0..m − 1] with respect to a text T [0..n − 1] are an array of (position, length)-pairs MS[0..m − 1] such that

• P[i..i+MS[i].len−1] = T[MS[i].pos..MS[i].pos+MS[i].len−1],
• P[i..i+MS[i].len] does not occur in T.
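A brute-force reference implementation of this definition, for intuition only (PHONI itself reaches O(m log log n) time with an O(r)-space structure):

```python
# O(n * m^2) matching statistics, directly from the definition above.
def matching_statistics(P, T):
    MS = []
    for i in range(len(P)):
        pos, length = 0, 0
        while i + length < len(P):
            hit = T.find(P[i:i + length + 1])
            if hit == -1:
                break
            pos, length = hit, length + 1    # extend the match by one symbol
        MS.append((pos, length))
    return MS

print(matching_statistics("banana", "bananarama"))
```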

Two-pass algorithm for quickly computing MS using only an O(r)-space data structure during the first pass, from right to left in O(m log log n) time.

• φ⁻¹(p) = SA[ISA[p] + 1] (or NULL if ISA[p] = n − 1),
• PLCP[p] = LCP[ISA[p]] (or 0 if ISA[p] = 0),

where SA, ISA, LCP, and PLCP are the suffix array, inverse suffix array, longest-common-prefix array, and permuted longest-common-prefix array, respectively.

PHONI uses Rossi et al.’s construction algorithm for MONI to build the RLBWT and the SLP. PHONI’s query times become faster as the number of reducible positions increases, making the time-expensive LCE queries less frequent.





□ Unbounded Algebraic Derivators

>> https://arxiv.org/pdf/2111.05918v1.pdf

Proving the derived category of a Grothendieck category with enough projective objects is the base category of a derivator. Therefore all such categories possess all co/limits and can be organized in a representable derivator.

This derivator is the base for constructing the derivator associated to the derived category by deriving the relevant functors. The framework provides a more general (arbitrary base ring, complexes as coefficients) and simpler approach to some basic theorems of group cohomology.





□ Duesselpore: a full-stack local web server for rapid and simple analysis of Oxford Nanopore Sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468670v1.full.pdf

Duesselpore, a deep sequencing workflow that runs as a local webserver and allows the analysis of ONT data everywhere without requiring additional bioinformatic tools or internet connection.

Duesselpore performs differential gene expression (DGE) analysis. Duesselpore also conducts gene set enrichment analysis (GSEA), enrichment analysis based on DisGeNET, and pathway-based data integration and visualization focusing on KEGG.





□ discover: Optimization algorithm for omic data subspace clustering

>> https://www.biorxiv.org/content/10.1101/2021.11.12.468415v1.full.pdf

the ground truth subspace is rarely the most compact one, and other subspaces may provide biologically relevant information.

discover, an optimization algorithm performing bottom-up subspace clustering on tabular high-dimensional data. It identifies the corresponding sample clusters such that the partitioning of the subspace has maximal internal clustering score across feature subspaces.





□ REMD-LSTM: A novel general-purpose hybrid model for time series forecasting

>> https://link.springer.com/article/10.1007/s10489-021-02442-y

Empirical Mode Decomposition (EMD) is a typical algorithm for decomposing data according to its time-scale characteristics. EMD can decompose complex signals into a finite number of Intrinsic Mode Functions.

The REMD-LSTM algorithm can solve the problem of marginal effect and mode confusion in EMD. Decomposing time series data into multiple components through REMD can reveal the specific influence of hidden variables in time series data to a certain extent.





□ smBEVO: A computer vision approach to rapid baseline correction of single-molecule time series

>> https://www.biorxiv.org/content/10.1101/2021.11.12.468397v1.full.pdf

Current approaches for drift correction primarily involve either tedious manual assignment of the baseline or unsupervised frameworks such as infinite HMMs coupled with baseline nodes that are computationally expensive and unreliable.

smBEVO estimates the time-varying baseline drift that can in practice be difficult to eliminate in single-molecule experimental modalities. smBEVO provides visually and quantitatively compelling baseline estimation for simulated data w/ multiple types of mild to aggressive drift.




□ FMAlign: A novel fast multiple nucleotide sequence alignment method based on FM-index

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab519/6458932

FMAlign, a novel algorithm to improve the performance of multiple nucleotide sequence alignment. FMAlign uses the FM-index to extract long common segments at a low cost rather than using a space-consuming hash table.





Rectangle.

2021-12-13 22:12:13 | Science News


"No problem is too small or too trivial if we can really do something about it."



□ BamToCov: an efficient toolkit for sequence coverage calculations

>> https://www.biorxiv.org/content/10.1101/2021.11.12.466787v1.full.pdf

BamToCov, a suite of tools for rapid coverage calculations relying on a memory-efficient algorithm and designed for flexible integration in bespoke pipelines. BamToCov processes sorted BAM or CRAM, allowing coverage information to be extracted using different filtering approaches.

BamToCov uses a streaming approach that requires sorted alignments as input: coverage is computed starting from zero at the leftmost base in each contig and updated on the fly while reading alignments. In terms of speed, BamToCov is second only to MegaDepth.
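A sweep-line sketch of the streaming idea over start-sorted intervals (plain tuples stand in for parsed BAM records):

```python
# Streaming coverage: +1 at alignment starts, -1 at ends, swept left to right.
import heapq

def coverage(intervals):                   # (start, end) pairs sorted by start
    ends, cov, prev = [], 0, 0
    for start, end in intervals:
        while ends and ends[0] <= start:   # close alignments ending before here
            e = heapq.heappop(ends)
            yield (prev, e, cov)
            prev, cov = e, cov - 1
        if start > prev:
            yield (prev, start, cov)
            prev = start
        heapq.heappush(ends, end)
        cov += 1
    while ends:
        e = heapq.heappop(ends)
        yield (prev, e, cov)
        prev, cov = e, cov - 1

print(list(coverage([(0, 5), (2, 8), (3, 4)])))
# -> [(0, 2, 1), (2, 3, 2), (3, 4, 3), (4, 5, 2), (5, 8, 1)]
```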





□ Long-read sequencing settings for efficient structural variation detection based on comprehensive evaluation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04422-y

Generating a full range of simulated error-prone long-read datasets containing various sequencing settings and comprehensively evaluating the performance of SV calling with state-of-the-art long-read SV detection methods.

The overall F1 score and Matthews correlation coefficient (MCC) increase along with the coverage, read length, and accuracy rate.

Notably, long-read data at 20× coverage, 20 kbp average read length, and error rates of approximately 10–7.5% or below 1% (i.e., accuracy of approximately 90–92.5% or over 99%) is sufficient for sensitive and accurate SV calling in practice.





□ CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009631

CStone, a de Bruijn graph-based de novo assembler for RNA-Seq data that utilizes a classification system to describe graph complexity. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist.

The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism.





□ HAllA: High-sensitivity pattern discovery in large, paired multi-omic datasets

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468183v1.full.pdf

HAllA (Hierarchical All-against-All association testing) efficiently integrates hierarchical hypothesis testing with false discovery rate correction to reveal significant linear and non-linear block-wise relationships among continuous and/or categorical data.

HAllA is an end-to-end statistical method for Hierarchical All-against-All discovery of significant relationships among data features with high power. HAllA preserves statistical power in the presence of collinearity by testing coherent clusters of variables.





□ Meta-Transcriptome Detector (MTD): a novel pipeline for metatranscriptome analysis of bulk and single-cell RNAseq data

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468881v1.full.pdf

Meta-Transcriptome Detector (MTD) supports automatic generation of the count matrix of the microbiome by using raw data in the FASTQ format, and of the count matrix of host genes, from two commonly used single-cell RNA-seq platforms, 10x Genomics and Drop-seq.

MTD has a decontamination step that blacklists the common contaminant microbes in the laboratory environment. Users can easily install and run MTD using only one command and without requiring root privileges.





□ NSB: Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

>> https://www.biorxiv.org/content/10.1101/2021.11.10.468111v1.full.pdf

NSB (No Strand Bias) distance estimator, an algorithm and a tool for computing phylogenetic distances on alignment-free data based on a time-reversible, no strand-bias, 4-parameter evolutionary model called TK4.

A general model like TK4 can offer more accurate distances than the Jukes-Cantor model, which is the simplest yet most dominantly used model in alignment-free phylogenetics. The improvements are most pronounced for larger distances and for higher levels of deviation.
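For comparison, the Jukes-Cantor distance inverts a single observed mismatch fraction p; TK4 generalizes this to four parameters without strand bias:

```python
# Jukes-Cantor distance from the observed mismatch fraction p.
import math

def jc_distance(p):
    if p >= 0.75:
        return math.inf            # saturated: formula undefined beyond 3/4
    return -0.75 * math.log(1 - 4 * p / 3)

print(round(jc_distance(0.10), 4))  # ~0.1073 substitutions per site
```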





□ Deep-BGCpred: A unified deep learning genome-mining framework for biosynthetic gene cluster prediction

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468547v1.full.pdf

Deep-BGCpred, a deep-learning method for Biosynthetic Gene Clusters (BGCs) identification within genomes. Deep-BGCpred effectively addresses the aforementioned customization challenges that arise in natural product genome mining.

Deep-BGCpred employs a stacked Bidirectional Long Short-Term Memory model to boost accuracy for BGC identification. It integrates a sliding-window strategy and dual-model serial screening to reduce the number of false positives in BGC predictions.





□ sdcorGCN: Generating weighted and thresholded gene coexpression networks using signed distance correlation

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468627v1.full.pdf

a principled method to construct weighted gene coexpression networks using signed distance correlation. These networks contain weighted edges only between those pairs of genes whose correlation value is higher than a given threshold.

COGENT aids the selection of a robust network construction method without the need for any external validation data.

COGENT assists the selection of the optimal threshold value so that only pairs of genes for which the correlation value of their expression exceeds the threshold are connected in the network.




□ GEDI: an R package for integration of transcriptomic data from multiple high-throughput platforms

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468093v1.full.pdf

Gene Expression Data Integration (GEDI) addresses the aforementioned challenges by combining existing R packages to read, re-annotate and merge the transcriptomic datasets, after which the batch effect is removed and the integration is verified.

This results in one transcriptomic dataset annotated with Ensembl or Entrez gene IDs. The batch effect is removed by the BatchCorrection function and verified with a PCA plot and an RLE plot. VerifyGEDI confirms the data integration using a logistic regression model.




□ Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468292v1.full.pdf

DanQ is a recurrent CNN that has been shown to predict a number of genomic labels in the human genome, including chromatin accessibility and DNA methylation, more accurately than standard CNNs like DeepSEA.
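A hypothetical Keras rendering of a DanQ-style recurrent CNN; the layer sizes loosely follow the published DanQ and should be treated as illustrative:

```python
# DanQ-style architecture sketch: motif-scanning convolution + BiLSTM grammar.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(1000, 4)),              # one-hot DNA, 1 kb window
    layers.Conv1D(320, 26, activation="relu"),  # motif scanner
    layers.MaxPooling1D(13),
    layers.Dropout(0.2),
    layers.Bidirectional(layers.LSTM(320, return_sequences=True)),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(925, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # chromatin-state label
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```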

By incorporating sequence data from multiple species, they not only increase the size of the training data set, a critical factor for deep learning models, but also reduce the amount of confounding neutral variation around functional motifs.

Model architectures that can effectively incorporate trans factors, such as chromatin-remodeling TFs on neighboring regulatory elements or small RNA silencing, will likely surpass current methods but their cross-species applicability remains an open question.





□ CLMB: deep contrastive learning for robust metagenomic binning

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468566v1.full.pdf

CLMB improves the performance of bin refinement, reconstructing 8-22 more high-quality genomes and 15-32 more middle-quality genomes than the second-best result.

Vamb is a metagenomic binner which feeds sequence composition information from a contig catalogue and co-abundance information from BAM files into a variational autoencoder and clusters the latent representation.

Impressively, in addition to being compatible with the binning refiner, CLMB alone recovers on average 15 more HQ genomes than the refiner of VAMB and MaxBin on the benchmarking datasets.





□ PheneBank: a literature-based database of phenotypes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab740/6426070

PheneBank is the first to perform concept identification of phenotypic abnormalities directly against 13K Human Phenotype Ontology terms. PheneBank brings API access to a neural-network model trained on complex sentences from full-text articles for identifying concepts.

The PheneBank model exploits latent semantic embeddings to infer text-to-concept mappings in 8 ontologies that would often not be apparent to conventional string matching approaches.





□ SCYN: single cell CNV profiling method using dynamic programming

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-07941-3

SCYN adopts a dynamic programming approach to find optimal single-cell CNV profiles. SCYN manifested more precise copy number inference on scDNA data, with array comparative genomic hybridization results of purified bulk samples as ground truth validation.

SCYN integrates SCOPE, which partitions chromosomes into consecutive bins and computes the cell-by-bin read depth matrix, to process the input BAM files and get the raw and normalized read depth matrices.





□ Idéfix: identifying accidental sample mix-ups in biobanks using polygenic scores

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab783/6430970

Idéfix relies on the comparison of actual phenotypes to PGSs. Idéfix works by modelling the relationships between phenotypes and polygenic scores, and calculating the residuals of the provided samples and their permutations.

Idéfix estimates mix-up rates to select a subset of samples that adhere to a specified maximum mix-up rate.





□ Approximate distance correlation for selecting highly interrelated genes across datasets

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009548

Approximate Distance Correlation (ADC) first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across two datasets.

ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. ADC can be applied to datasets ranging from thousands to millions of cells.
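The Benjamini-Hochberg step in plain numpy, for reference:

```python
# Benjamini-Hochberg FDR adjustment of a vector of p-values.
import numpy as np

def benjamini_hochberg(pvals):
    p = np.asarray(pvals, float)
    n = p.size
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)          # raw BH ratios
    adj = np.minimum.accumulate(adj[::-1])[::-1]      # enforce monotonicity
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out

print(benjamini_hochberg([0.001, 0.02, 0.03, 0.4]))   # -> [0.004 0.04 0.04 0.4]
```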




□ UVC: Calling small variants using universality with Bayes-factor-adjusted odds ratios

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab458/6427501

Empirical laws to improve variant calling: allele fraction at high sequencing depth is inversely proportional to the cubic root of variant-calling error rate, and odds ratios adjusted with Bayes factors can model various sequencing biases.

UVC outperformed other unique molecular identifier (UMI)-aware variant callers on the datasets used for publishing these variant callers. The executable uvc1 in the bin directory takes one BAM file as input and generates one block-gzipped VCF file as output.





□ ProSolo: Accurate and scalable variant calling from single cell DNA sequencing data

>> https://www.nature.com/articles/s41467-021-26938-w

ProSolo is a variant caller for multiple displacement amplified DNA sequencing data from diploid single cells. It relies on a pair of samples, where one is from an MDA single cell and the other from a bulk sample of the same cell population.

ProSolo uses an extension of the novel latent variable model of Varlociraptor, which already integrates various levels of uncertainty. It adds a layer that accounts for the amplification biases and errors of MDA, and makes it possible to properly assess the probability of having a variant.





□ PMD Uncovers Widespread Cell-State Erasure by scRNAseq Batch Correction Methods

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468733v1.full.pdf

Percent Maximum Difference (PMD), a new statistical metric that linearly quantifies batch similarity, and simulations generating cells from mixtures of distinct gene expression programs.

PMD is provably invariant to the number of clusters found when relative overlap in cluster composition is preserved, operates linearly across the spectrum of batch similarity, and is unaffected by batch size differences or the overall number of cells.

PMD does not require that batches be similar, filling a crucial gap in the field for benchmarking scRNA-seq batch correction.





□ CRAFT: a bioinformatics software for custom prediction of circular RNA functions

>> https://www.biorxiv.org/content/10.1101/2021.11.17.468947v1.full.pdf

circRNAs can be translated into circRNA-encoded peptides (CEPs), incl. circRNA-specific ones generated by translation of ORFs encompassing the backsplice junction, which are not present in linear transcripts, and circRNAs with a rolling ORF, lacking a stop codon and continuing along the 'Möbius strip'.

CRAFT (CircRNA Function prediction Tool), allows investigating complex regulatory networks involving circRNAs acting in a concerted way, such as by decoying the same miRNAs or RBP, or miRNAs sharing target genes along with their coding potential.





□ Nonmetric ANOVA: a generic framework for analysis of variance on dissimilarity measures

>> https://www.biorxiv.org/content/10.1101/2021.11.19.469283v1.full.pdf

Based on the central limit theorem (CLT), Nonmetric ANOVA (nmA) is introduced as an extension of the cA and npA models where the metric properties (identity, symmetry, and subadditivity) are relaxed.

nmA allows any dissimilarity measure to be defined between objects and tests the distinctiveness of a specific partitioning. This derivation accommodates an ANOVA-like framework of judgment, indicative of significant dispersion of the partitioned outputs in nonmetric space.





□ STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469113v1.full.pdf

STRling is a method to detect large STR expansions from short-read sequencing data. It is capable of detecting novel STR expansions, that is, expansions where there is no STR in the reference genome at that position.

STRling creates all possible rotations of each k-mer sequence and stores the minimum rotation. It then calculates the proportion of the read accounted for by each k-mer. STRling chooses the representative k-mer as the one that accounts for the greatest proportion of the read.

If multiple k-mers cover equal proportions, it chooses the smallest k-mer. If the representative k-mer exceeds a minimum threshold, STRling considers the read to have sufficient STR content to be informative for detecting STR expansions.
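The minimal-rotation canonicalization is a one-liner in Python:

```python
# Canonical repeat unit: lexicographically smallest rotation of a k-mer.
def min_rotation(kmer):
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))

assert min_rotation("GCA") == min_rotation("CAG") == "AGC"
```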





□ Hapl-o-MatGUI: Graphical user interface for the haplotype frequency estimation software

>> https://www.sciencedirect.com/science/article/pii/S019888592100255X

Hapl-o-Mat, a versatile and effective tool for haplotype frequency estimation based on an EM algorithm. Hapl-o-Mat is able to process large sets of unphased genotype data at various typing resolutions.

Hapl-o-MatGUI acts as optional additional module to the Hapl-o-Mat software without directly intervening in the program. It supports processing and resolving various forms of HLA genotype data.





□ pISA-tree - a data management framework for life science research projects using a standardised directory tree

>> https://www.biorxiv.org/content/10.1101/2021.11.18.468977v1.full.pdf

pISA-tree, a straightforward and flexible data management solution for organisation of life science project-associated research data and metadata.

pISA-tree enables on-the-fly creation of enriched directory tree structure (project/Investigation/Study/Assay) via a series of sequential batch files in a standardised manner based on the ISA metadata framework.





□ reComBat: Batch effect removal in large-scale, multi-source omics data integration

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469488v1.full.pdf

reComBat, a simple, yet effective, means of mitigating highly correlated experimental conditions through regularisation and compared various elastic net regularisation strengths.

The sources of biological variation are manifold and these can often only be encoded as categorical variables. Encoding these as one-hot categorical variables creates a sparse, high-dimensional feature vector and, when many such categorical features are considered, then m ≈ n.





□ Theoretical Guarantees for Phylogeny Inference from Single-Cell Lineage Tracing

>> https://www.biorxiv.org/content/10.1101/2021.11.21.469464v1.full.pdf

Theoretical guarantees for exact reconstruction of the underlying phylogenetic tree of a group of cells, showing that exact reconstruction can indeed be achieved with high probability given sufficient information capacity in the experimental parameters.

The lower-bound assumption translates to a reasonable assumption on the minimal time until cell division. The algorithm and bound are extended to account for missing data, showing that the same bounds still hold assuming a constant probability of missing data.

The upper bound corresponds to an assumption on the maximum time until cell division, which can be evaluated in lineage-traced populations, as they by definition should not be post-mitotic.





□ HaplotypeTools: a toolkit for accurately identifying recombination and recombinant genotypes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04473-1

HaplotypeTools is a new toolset to phase variant sites using VCF and BAM files and to analyse phased VCFs. Phasing is achieved via the identification of reads overlapping ≥ 2 heterozygous positions and then extended by additional reads, a process that can be parallelized.

HaplotypeTools includes various utility scripts for downstream analysis including crossover detection and phylogenetic placement of haplotypes to other lineages or species. HaplotypeTools was assessed for accuracy against WhatsHap using simulated short and long reads.





□ trioPhaser: using Mendelian inheritance logic to improve genomic phasing of trios

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04470-4

trioPhaser uses gVCF files from an individual and their parents as initial input, and then outputs a phased VCF file. Input trio data are first phased using Mendelian inheritance logic.

Then, the positions that cannot be phased using inheritance information alone are phased by the SHAPEIT4 phasing algorithm.
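A sketch of the Mendelian first pass for one biallelic site; the genotype encoding is an assumption, and ambiguous sites fall through to statistical phasing:

```python
# Phase a child's genotype from parental genotypes where Mendelian logic allows.
def phase_child(child, father, mother):
    # genotypes as unordered allele sets, e.g. {"A", "G"}
    if len(child) == 1:                      # homozygous: trivially phased
        a = next(iter(child))
        return (a, a)
    a, b = sorted(child)
    a_pat, b_pat = a in father, b in father
    a_mat, b_mat = a in mother, b in mother
    if a_pat and b_mat and not (b_pat and a_mat):
        return (a, b)                        # (paternal, maternal)
    if b_pat and a_mat and not (a_pat and b_mat):
        return (b, a)
    return None                              # ambiguous: leave to SHAPEIT4

print(phase_child({"A", "G"}, {"A"}, {"G"}))            # -> ('A', 'G')
print(phase_child({"A", "G"}, {"A", "G"}, {"A", "G"}))  # -> None
```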





□ SBGNview: Towards Data Analysis, Integration and Visualization on All Pathways

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab793/6433671

SBGNview adopts Systems Biology Graphical Notation (SBGN) and greatly extends the Pathview project by supporting multiple major pathway databases beyond KEGG.

SBGNview substantially extends or exceeds current tools (e.g., Pathview) in both design and function: high-quality output graphics (SVG format) convenient for interpretation, and a flexible, open-ended workflow for iterative editing and interactive visualization (Highlighter module).





□ The systematic assessment of completeness of public metadata accompanying omics studies

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469640v1.full.pdf

A comprehensive analysis of the completeness of public metadata accompanying omics data in both original publications and online repositories. The completeness of metadata from the original publication across the nine clinical phenotypes is 71.1%.

In contrast, the overall completeness of metadata from the public repositories is 48.6%. The most completely reported phenotypes are disease condition and organism, and the least complete phenotype is mortality.





□ iEnhancer-MFGBDT: Identifying enhancers and their strength by fusing multiple features and gradient boosting decision tree

>> http://www.aimspress.com/article/doi/10.3934/mbe.2021434

iEnhancer-MFGBDT is developed to identify enhancers and their strength by fusing multiple features and a gradient boosting decision tree (GBDT).

Multiple features include k-mer and reverse complement k-mer nucleotide composition based on DNA sequence, and second-order moving average, normalized Moreau-Broto auto-cross correlation and Moran auto-cross correlation based on dinucleotide physical structural property matrix.





□ CNVpytor: a tool for copy number variation detection and analysis from read depth and allele imbalance in whole-genome sequencing

>> https://academic.oup.com/gigascience/article/10/11/giab074/6431715

CNVpytor uses B-allele frequency likelihood information from single-nucleotide polymorphisms and small indels data as additional evidence for CNVs/CNAs and as primary information for copy number-neutral losses of heterozygosity.

CNVpytor inherits the reimplemented core engine of its predecessor. CNVpytor is significantly faster than CNVnator, particularly for parsing alignment files (2-20 times faster), and has 20-50 times smaller intermediate files.




Heng Li

>> https://github.com/Illumina/DRAGMAP

Dragmap is a new mapper for Illumina reads. It is like a CPU-only implementation of the DRAGEN mapping algorithm. I met DRAGEN developers once. They are among the best I know in this field. Give it a try.





□ PIntMF: Penalized Integrative Matrix Factorization method for Multi-omics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab786/6443074

PIntMF (Penalized Integrative Matrix Factorization), an MF model with sparsity, positivity and equality constraints. To induce sparsity in the model, PIntMF uses a classical Lasso penalization on the variable and individual matrices.

PIntMF uses automatic tuning of the sparsity parameters via glmnet. The sparsity on the variable block helps the interpretation of patterns. Sparsity, non-negativity & equality constraints are added to the 2nd matrix to improve the interpretability of the clustering.




□ GPA-Tree: Statistical Approach for Functional-Annotation-Tree-Guided Prioritization of GWAS Results

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab802/6443109

GPA-Tree is a statistical approach to integrate GWAS summary statistics and functional annotation information within a unified framework.

Specifically, by combining a decision tree algorithm with a hierarchical modeling framework, GPA-Tree simultaneously implements association mapping and identifies key combinations of functional annotations related to disease risk-associated SNPs.




□ DeepUTR: Computational modeling of mRNA degradation dynamics using deep neural networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab800/6443108

DeepUTR, a deep neural network to predict mRNA degradation dynamics; interpreting the networks identifies regulatory elements in the 3'UTR and their positional effect. Using Integrated Gradients, these CNN models identified known and novel cis-regulatory sequence elements of mRNA degradation.





Mitus Lumen.

2021-12-13 22:10:12 | Science News


- emit language syntax. -


□ Fluctuation theorems with retrodiction rather than reverse processes

>> https://avs.scitation.org/doi/10.1116/5.0060893

The everyday meaning of (ir)reversibility in nature is captured by the perceived “arrow of time”: if the video of the evolution played backward makes sense, the process is reversible; if it does not make sense, it is irreversible.

The reverse process is generically not the video played backward: to cite an extreme example, nobody conceives bombs that fly upward to their airplanes while cities are being built from rubble.

In the case of controlled protocols in the presence of an unchanging environment, the reverse process is implemented by reversing the protocol. If the environment were to change, the connection between the physical process and the associated reverse one becomes thinner.

The retrodiction channel of an erasure channel is the erasure channel that returns the reference prior—a result that can be easily extended to any alphabet dimension.

PROCESSES VERSUS INFERENCES: fluctuation relations are intimately related to statistical distances (“divergences”) and that Bayesian retrodiction arises from the requirement that the fluctuating variable can be computed locally.





□ The Metric Dimension of the Zero-Divisor Graph of a Matrix Semiring

>> https://arxiv.org/pdf/2111.07717v1.pdf

The metric dimensions of graphs corresponding to various algebraic structures: the metric dimension of a zero-divisor graph of a commutative ring, a total graph of a finite commutative ring, an annihilating-ideal graph of a finite ring, and a commuting graph of a dihedral group.

Antinegative semirings are also called antirings. The simplest example of an antinegative semiring is the binary Boolean semiring B, the set {0,1} in which addition and multiplication are the same as in Z except that 1 + 1 = 1.

For infinite entire antirings S, the metric dimension of Γ(Mn(S)) is infinite. Therefore, the authors limit themselves to studying finite semirings. For every Λ ⊆ N_n × N_n, at most one zero-divisor matrix with its pattern of zero and non-zero entries prescribed by Λ is not in W.





□ Context, Judgement, Deduction

>> https://arxiv.org/pdf/2111.09438v1.pdf

An abstract definition of type constructors featuring the usual formation, introduction, elimination and computation rules. In proof theory, these offer a deep analysis of structural rules, demystifying some of their properties and putting them into context.

Discussing the internal logic of a topos, a predicative topos, an elementary 2-topos and the like, and showing how these can be organized in judgemental theories.





□ Scasa: Isoform-level Quantification for Single-Cell RNA Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab807/6448218

Scasa, an isoform-level quantification method for high-throughput single-cell RNA sequencing by exploiting the concepts of transcription clusters and isoform paralogs.

Scasa compares well in simulations against competing approaches including Alevin, Cellranger, Kallisto, Salmon, Terminus and STARsolo at both isoform- and gene-level expression.

Scasa takes advantage of the efficient preprocessing provided by existing pseudoaligners such as Kallisto-bustools or Alevin to produce a read-count equivalent-class matrix. Scasa splits the equivalence class output by cell and applies the AEM algorithm to multiple cells.





□ corral: Single-cell RNA-seq dimension reduction, batch integration, and visualization with correspondence analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469874v1.full.pdf

Correspondence Analysis (CA) for dimension reduction of scRNAseq data, which is a performant alternative to PCA. Designed for use with counts, CA is based on decomposition of a chi-squared residual matrix and does not require log-transformation of scRNAseq counts.

CA using the Freeman-Tukey chi-squared residual was most performant overall in scRNAseq data. Variance stabilizing transformations applied in conjunction with standard CA and the use of “power deflation” smoothing both improve performance in downstream clustering tasks.

corralm, a CA-based method for multi-table batch integration of scRNAseq data in a shared latent space. The adaptation of correspondence analysis to the integration of multiple tables is similar to the method for single tables, with additional matrix concatenation operations.

corralm employs indexed residuals, dividing the standardized residuals by the square root of the expected proportion to reduce the influence of columns with larger masses (library depth). It applies CA-style processing to continuous data with the Hellinger distance adaptation.
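A bare-bones numpy CA on a count matrix, using standard Pearson chi-squared residuals (the paper's preferred Freeman-Tukey variant swaps the residual definition):

```python
# Correspondence analysis sketch: SVD of the chi-squared residual matrix.
import numpy as np

def ca_embedding(X, n_dim=2):
    X = np.asarray(X, float)
    X = X[:, X.sum(axis=0) > 0]             # drop all-zero genes
    P = X / X.sum()
    r = P.sum(axis=1, keepdims=True)        # row (cell) masses
    c = P.sum(axis=0, keepdims=True)        # column (gene) masses
    expected = r @ c
    S = (P - expected) / np.sqrt(expected)  # Pearson chi-squared residuals
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_dim] * d[:n_dim]) / np.sqrt(r)   # row coordinates

counts = np.random.poisson(2, (100, 300))   # cells x genes
print(ca_embedding(counts).shape)           # -> (100, 2)
```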





□ Fuzzy set intersection based paired-end short-read alignment

>> https://www.biorxiv.org/content/10.1101/2021.11.23.469039v1.full.pdf

a new algorithm for aligning both reads in a pair simultaneously by fuzzily intersecting the sets of candidate alignment locations for each read. SNAP with the fuzzy set intersection algorithm dominates BWA and Bowtie, having both better performance and better concordance.

Fuzzy set intersection avoids doing expensive evaluations of many candidate alignments that would eventually be dismissed because they are too far from any plausible alignments for the other end of the pair.





□ ScLRTC: imputation for single-cell RNA-seq data via low-rank tensor completion

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08101-3

scLRTC imputes the dropout entries of a given scRNA-seq expression. It initially exploits the similarity of single cells to build a third-order low-rank tensor and employs the tensor decomposition to denoise the data.

scLRTC reconstructs the cell expression by adopting the low-rank tensor completion algorithm, which can restore gene-to-gene and cell-to-cell correlations. scLRTC is also effective in cell visualization and in inferring cell lineage trajectories.





□ FDJD: RNA-Seq Based Fusion Transcript Detection Using Jaccard Distance

>> https://www.biorxiv.org/content/10.1101/2021.11.17.469019v1.full.pdf

Converting the RNA categorical space into a compact binary array called binary fingerprints, which enables us to reduce the memory usage and increase efficiency. The search and detection of fusion candidates are done using the Jaccard distance.

FDJD (Fusion Detection using the Jaccard Distance) exhibits superior accuracy compared to popular alternative fusion detection methods. FDJD generates fusion candidates using both split reads and discordantly aligned pairs which are produced by the STAR alignment step.
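Jaccard distance over binary fingerprints in numpy, the elementary comparison behind the candidate search:

```python
# Jaccard distance between two binary fingerprints.
import numpy as np

def jaccard_distance(a, b):                 # a, b: boolean arrays
    union = np.count_nonzero(a | b)
    if union == 0:
        return 0.0
    return 1.0 - np.count_nonzero(a & b) / union

fp1 = np.random.rand(4096) < 0.1            # toy fingerprints
fp2 = np.random.rand(4096) < 0.1
print(jaccard_distance(fp1, fp2))
```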





□ Inspector: Accurate long-read de novo assembly evaluation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02527-4

Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions.

Inspector generates read-to-contig alignment and performs downstream assembly evaluation. Inspector can report the precise locations and sizes for structural and small-scale assembly errors and distinguish true assembly errors from genetic variants.





□ Characterizing Protein Conformational Spaces using Dimensionality Reduction and Algebraic Topology

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468545v1.full.pdf

Linear dimensionality reduction like PCA and its variants may not capture the complex, non-linear nature of the protein conformational landscape. Dimensionality reduction techniques are broadly classified, based on the solution space they generate, as convex and non-convex.

Even after the conformational space is sampled, it should be filtered and clustered to extract meaningful information.

The structures represented by these conformations are then analyzed by studying their high dimension topological properties to identify truly distinct conformations and holes in the conformational space that may represent high energy barriers.





□ scCODE: an R package for personalized differentially expressed gene detection on single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469072v1.full.pdf

DE methods together with gene filtering have profound impact on DE gene identification, and different datasets will benefit from personalized DE gene detection strategies.

scCODE (single cell Consensus Optimization of Differentially Expressed gene detection) produces consensus DE gene results.

scCODE summarizes the top (default: all) DE genes from each of the selected strategies. The principle of consensus optimization is that DE genes observed more frequently across different analysis strategies are more reliable.





□ HDMC: a novel deep learning based framework for removing batch effects in single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab821/6449435

This framework employs an adversarial training strategy to match the global distribution of different batches. This provides an improved foundation to further match the local distributions with a Maximum Mean Discrepancy based loss.

HDMC divides cells in each batch into clusters and uses a contrastive learning method to simultaneously align similar cluster pairs and keep noisy pairs apart from each other. This yields clusters w/ all cells of the same type and avoids clusters w/ cells of different types.





□ COBREXA.jl: constraint-based reconstruction and exascale analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab782/6429269

COBREXA.jl provides a 'batteries-included' solution for scaling analyses to make efficient use of high-performance computing (HPC) facilities, allowing it to be realistically applied to pre-exascale-sized models.

COBREXA formulates optimization problems and is compatible w/ JuMP solvers. Its building blocks are designed so that workflows can be constructed that explore flux variability in many model variants, execute in a distributed fashion, and collect the many results in a multi-dimensional array.





□ Built on sand: the shaky foundations of simulating single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468676v1.full.pdf

Most simulators are unable to accommodate complex designs without introducing artificial effects; they yield over-optimistic performance of integration methods and potentially unreliable rankings of clustering methods; and how faithfully they mimic real data is generally unknown.

By definition, simulations generate synthetic data. Conclusions drawn from simulation studies are frequently criticized because simulations cannot completely mimic (real) experimental data.




□ DiagAF: A More Accurate and Efficient Pre-Alignment Filter for Sequence Alignment

>> https://ieeexplore.ieee.org/document/9614999/

DiagAF uses a new lower bound of the edit distance based on shift Hamming masks. The new lower bound uses fewer shift Hamming masks compared with state-of-the-art algorithms such as SHD and MAGNET.

DiagAF is faster, has a lower false positive rate and a zero false negative rate, can deal with alignments of unequal lengths, and can pre-align a string to multiple candidates in a single run. DiagAF can also terminate early for true alignments.
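
A simplified sketch of the shift-Hamming-mask idea behind SHD-style pre-alignment filters (the general technique, not DiagAF's refined bound):

def shifted_hamming_lower_bound(read, ref, e):
    # Build Hamming masks of `read` against `ref` shifted by -e..+e.
    # A column mismatching under every shift cannot be explained by up
    # to e indels, so counting runs of such columns gives a crude lower
    # bound on the edit distance.
    n = len(read)
    masks = []
    for s in range(-e, e + 1):
        mask = []
        for i in range(n):
            j = i + s
            mask.append(0 if 0 <= j < len(ref) and read[i] == ref[j] else 1)
        masks.append(mask)
    combined = [all(col) for col in zip(*masks)]  # mismatch under all shifts
    bound, in_run = 0, False
    for bit in combined:
        if bit and not in_run:
            bound, in_run = bound + 1, True
        elif not bit:
            in_run = False
    return bound

# reject a candidate location if shifted_hamming_lower_bound(read, ref, e) > e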




□ Explainability methods for differential gene analysis of single cell RNA-seq clustering models

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468416v1.full.pdf

The absence of “ground truth” information about the DE genes makes evaluation on real-world datasets a complex task, usually requiring additional biological experiments for validation.

a comprehensive study comparing the performance of dedicated DE methods with that of explainability methods typically used in machine learning, both model-agnostic (SHAP, permutation importance) and model-specific (NN gradient-based methods).

The gradient method achieved the highest accuracy on scziDesk and scDeepCluster, while on contrastive-sc the results are comparable to the other top-performing methods.

contrastive-sc employs high levels of NN dropout as data augmentation and thus learns a sparse representation of the input data, penalizing by design the capacity to learn all relevant features.




□ MAGUS+eHMMs: Improved Multiple Sequence Alignment Accuracy for Fragmentary Sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab788/6430102

MAGUS is fairly robust to fragmentary sequences under many conditions. A two-stage approach, in which MAGUS aligns selected “backbone sequences” and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models, improves on this robustness.

MAGUS+eHMMs, matches or improves on both MAGUS and UPP, particularly when aligning datasets that evolved under high rates of evolution and that have large fractions of fragmentary sequences.




□ FastQTLmapping: an ultra-fast package for mQTL-like analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468610v1.full.pdf

FastQTLmapping is a computationally efficient, exact, and generic solver for exhaustive multiple regression analysis involving extraordinarily large numbers of dependent and explanatory variables with covariates.

FastQTLmapping can handle omics data containing tens of thousands of individuals and billions of molecular loci.

FastQTLmapping accepts input files in text format and in Plink binary format. The output file is in text format and contains all test statistics for all regressions, with the ability to control the volume of the output at preset significance thresholds.





□ ZARP: An automated workflow for processing of RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469017v1.full.pdf

ZARP (Zavolan-Lab Automated RNA-seq Pipeline) can run locally or in a cluster environment, generating extensive reports not only of the data but also of the options utilized.

ZARP requires two distinct input files, one of which is a tab-delimited file with sample-specific information, such as paths to the sequencing data (FASTQ), the transcriptome annotation (GTF), and experiment protocol and library-preparation specifications like adapter sequences or fragment size.

To provide a high-level topographical/functional annotation of which gene segments (e.g., CDS, 3’UTR, intergenic) and biotypes (e.g., protein coding genes, rRNA) are represented by the reads in a given sample, ZARP includes ALFA.





□ VIVID: a web application for variant interpretation and visualisation in multidimensional analyses

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468904v1.full.pdf

VIVID, a novel interactive and user-friendly platform that automates mapping of genotypic information and population genetic analysis from VCF files in 2D and 3D protein structural space.

VIVID is a unique ensemble user interface that enables users to explore and interpret the impact of genotypic variation on the phenotypes of secondary and tertiary protein structures.





□ Spliceator: multi-species splice site prediction using convolutional neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04471-3

Spliceator is based on the Convolutional Neural Networks technology and more importantly, is trained on an original high quality dataset containing genomic sequences from organisms ranging from human to protists.

Spliceator achieves overall high accuracy compared to other state-of-the-art programs, including the neural network-based NNSplice, MaxEntScan that models SS using the maximum entropy distribution, and two CNN-based methods: DSSP and SpliceFinder.






□ GSA: an independent development algorithm for calling copy number and detecting homologous recombination deficiency (HRD) from target capture sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04487-9

Genomic Scar Analysis (GSA) could effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.

Evaluating the segmentation and genotype identification of the GSA algorithm against two other algorithms, PureCN and ASCAT, showed that GSA produced more coherent segmentation results.




□ A computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469509v1.full.pdf

The Clustering Linear Combination (CLC) method works particularly well with phenotypes that have natural groupings; but because the number of clusters for a given dataset is unknown, the final test statistic of the CLC method is the minimum p-value among all p-values of the CLC test statistics obtained from each possible number of clusters.

Computationally Efficient CLC (ceCLC) tests the association between multiple phenotypes and a genetic variant. ceCLC uses the Cauchy combination test to combine all p-values of the CLC test statistics obtained from each possible number of clusters.
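
A small sketch of the Cauchy combination test that ceCLC relies on (the standard Liu & Xie construction; equal weights assumed):

import math

def cauchy_combination(pvals, weights=None):
    # T = sum_i w_i * tan((0.5 - p_i) * pi); under the null, T is
    # approximately standard Cauchy even for dependent tests, so the
    # combined p-value is 0.5 - arctan(T) / pi.
    n = len(pvals)
    w = weights or [1.0 / n] * n
    t = sum(wi * math.tan((0.5 - p) * math.pi) for wi, p in zip(w, pvals))
    return 0.5 - math.atan(t) / math.pi

# e.g. combining the CLC p-values over all candidate cluster numbers:
# cauchy_combination([0.04, 0.20, 0.01])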





□ Figbird: A probabilistic method for filling gaps in genome assemblies

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469861v1.full.pdf

Figbird, a probabilistic method for filling gaps in draft genome assemblies using second generation reads based on a generative model for sequencing that takes into account information on insert sizes of read pairs and sequencing errors.

Figbird uses an iterative approach based on the expectation-maximization (EM) algorithm. The method is based on a generative model for sequencing proposed in CGAL and subsequently used to develop a scaffolding tool SWALO.





□ TSEBRA: transcript selector for BRAKER

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04482-0

TSEBRA uses a set of arbitrarily many gene prediction files in GTF format together with a set of files of heterogeneous extrinsic evidence to produce a combined output.

TSEBRA uses extrinsic evidence in the form of intron regions or start/stop codon positions to evaluate and filter transcripts from gene predictions.





□ VG-Pedigree: A Complete Pedigree-Based Graph Workflow for Rare Candidate Variant Analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469912v1.full.pdf

VG-Pedigree, a pedigree-aware workflow based on the pangenome mapper Giraffe and the variant caller DeepTrio, using a specially-trained model for Giraffe-based alignments.

VG-Pedigree improves mapping and variant calling for both SNVs and INDELs over alignments created with BWA-MEM against a linear reference and with Giraffe against a pangenome graph containing data from the 1000 Genomes Project.





□ Detecting fabrication in large-scale molecular omics data

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0260395

Just as has been previously shown in the financial sector, digit frequencies are a powerful data representation when used in combination with machine learning to predict the authenticity of data. Fraud detection methods must be updated for sophisticated computational fraud.

They present fabrication-detection methods for biomedical research and show that machine learning can be used to detect fraud in large-scale omics experiments. The Benford-like digit frequency method can be generalized to any tabular numeric data.
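
A sketch of the digit-frequency representation, assuming first significant digits and Benford's expected frequencies P(d) = log10(1 + 1/d); the classifier on top is whatever ML model one prefers:

import math
from collections import Counter

def leading_digit(x):
    # first significant digit of a nonzero number
    x = abs(x)
    while x < 1:
        x *= 10
    while x >= 10:
        x /= 10
    return int(x)

def digit_features(values):
    # observed leading-digit frequencies (d = 1..9): a nine-dimensional
    # feature vector for an authenticity classifier
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return [counts.get(d, 0) / total for d in range(1, 10)]

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]  # expected under Benford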





□ monaLisa: an R/Bioconductor package for identifying regulatory motifs

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470570v1.full.pdf

monaLisa (MOtif aNAlysis with Lisa), an R/Bioconductor package that implements approaches to identify relevant transcription factors from experimental data.

monaLisa uses randomized lasso stability selection. monaLisa further provides helpful functions for motif analyses, including functions to predict motif matches and calculate similarity between motifs.
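
A rough sketch of randomized lasso stability selection with scikit-learn (the subsample size, penalty value, and 0.5–1.0 reweighting range are assumptions; monaLisa itself works in R on motif-score matrices):

import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, n_boot=100, alpha=0.05, seed=0):
    # Fit a lasso on random halves of the data with randomly
    # down-weighted features; features (motifs) selected in a large
    # fraction of fits are considered stable.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.choice(n, size=n // 2, replace=False)
        w = rng.uniform(0.5, 1.0, size=p)            # randomized penalty reweighting
        fit = Lasso(alpha=alpha, max_iter=5000).fit(X[idx] * w, y[idx])
        freq += fit.coef_ != 0
    return freq / n_boot                             # per-feature selection frequency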





□ BreakNet: detecting deletions using long reads and a deep learning approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04499-5

BreakNet first extracts feature matrices from long-read alignments. Second, it uses a time-distributed CNN to integrate and map the feature matrices to feature vectors.

BreakNet employs a BLSTM model to analyse the produced set of continuous feature vectors in both the forward and backward directions. A classification module then determines whether a region contains a deletion.





□ Variance in Variants: Propagating Genome Sequence Uncertainty into Phylogenetic Lineage Assignment

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470642v1.full.pdf

a probabilistic matrix representation of individual sequences which incorporates base quality scores as a measure of uncertainty, naturally leading to resampling and replication as a framework for uncertainty propagation.

With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis.

This framework involves converting the uncertainty scores into a matrix of probabilities, and repeatedly sampling from this matrix and using the resultant samples in downstream analysis.
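
A compact sketch of the probability matrix and the resampling step, assuming Phred qualities and equal sharing of the error probability among the three alternative bases:

import numpy as np

BASES = "ACGT"

def prob_matrix(seq, quals):
    # a base call with Phred quality Q has error probability e = 10**(-Q/10);
    # the called base gets 1 - e and the other three bases share e equally
    P = np.zeros((len(seq), 4))
    for i, (b, q) in enumerate(zip(seq, quals)):
        e = 10 ** (-q / 10)
        P[i, :] = e / 3
        P[i, BASES.index(b)] = 1 - e
    return P

def resample(P, rng):
    # draw one plausible sequence: a bootstrap-like replicate for
    # downstream lineage assignment
    return "".join(BASES[rng.choice(4, p=row)] for row in P)

# rng = np.random.default_rng(0)
# resample(prob_matrix("ACGT", [30, 40, 20, 10]), rng)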





□ Macarons: Uncovering complementary sets of variants for predicting quantitative phenotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab803/6448209

Macarons, a fast and simple algorithm, to select a small, complementary subset of variants by avoiding redundant pairs that are likely to be in linkage disequilibrium.

Macarons features two simple, interpretable parameters to control the time/performance trade-off: the number of SNPs to be selected (k), and the maximum intra-chromosomal distance (D, in base pairs) used to reduce the search space for redundant SNPs.
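
A greedy sketch of this selection scheme (the univariate scores and the redundancy test are placeholders supplied by the caller; Macarons' actual criterion is its own):

def macarons_like_select(snps, scores, k, D, redundant):
    # snps: list of (chrom, pos); scores: per-SNP univariate relevance;
    # redundant(i, j): hypothetical LD check between two SNP indices.
    order = sorted(range(len(snps)), key=lambda i: scores[i], reverse=True)
    picked = []
    for i in order:
        ci, pi = snps[i]
        # keep SNP i only if no selected SNP on the same chromosome lies
        # within D base pairs and is flagged as redundant (likely LD)
        if all(ci != snps[j][0] or abs(pi - snps[j][1]) > D or not redundant(i, j)
               for j in picked):
            picked.append(i)
        if len(picked) == k:
            break
    return picked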





□ Detecting Spatially Co-expressed Gene Clusters with Functional Coherence by Graph-regularized Convolutional Neural Network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab812/6448221

The graph-regularized CNN models the expressions of a gene over spatial locations as an image of a gene activity map, and naturally utilizes the spatial localization information by performing convolution operation to capture the nearby tissue textures.

The model further exploits prior knowledge of gene relationships encoded in PPI network as a regularization by graph Laplacian of the network to enhance biological interpretation of the detected gene clusters.





□ deepMNN: Deep Learning-Based Single-Cell RNA Sequencing Data Batch Correction Using Mutual Nearest Neighbors

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.708981/full

deepMNN identifies mutual nearest neighbor (MNN) pairs across different batches in a PCA subspace. A residual-based batch correction network was then constructed and employed to remove batch effects based on these MNN pairs.

The overall loss of deepMNN was designed as the sum of a batch loss and a weighted regularization loss. The batch loss computes the distance between cells in MNN pairs in the PCA subspace, while the regularization loss encourages the output of the network to remain similar to its input.
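
In NumPy, the combined objective might look like the following sketch (lam is an assumed weight name; the real network evaluates this over MNN-pair embeddings during training):

import numpy as np

def deepmnn_loss(emb_a, emb_b, out, inp, lam=0.1):
    # emb_a/emb_b: corrected PCA-space embeddings of the two cells in
    # each MNN pair (n_pairs x d); out/inp: network output and input
    batch_loss = np.mean(np.linalg.norm(emb_a - emb_b, axis=1))
    reg_loss = np.mean(np.linalg.norm(out - inp, axis=1))
    return batch_loss + lam * reg_loss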





Desolation.

2021-12-13 22:07:13 | Science News




□ Adjoining colimits

>> https://arxiv.org/abs/2111.12117v1

a theory of colimit sketches ‘with constructions’ in higher category theory, formalising the input to the ubiquitous procedure of adjoining specified ‘constructible’ colimits to a category such that specified ‘relation’ colimits are enforced.

Morel-Voevodsky’s category of motivic spaces, resp. Robalo’s category of non-commutative motives are universal among categories under Sch, resp. ncSch, admitting all colimits such that Nisnevich descent is preserved and A1-localisation is enforced.

This language makes explicit the rôle colimit diagrams play as presentations of objects of ∞-categories, expressing how they are put together from objects of a dense subcategory. It may be useful to theory builders embarking on a construction of their own ‘designer’ ∞-category.





□ SAT: Efficient iterative Hi-C scaffolder based on N-best neighbors

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04453-5

Hi-C based scaffolding tool, pin_hic, which takes advantage of contact information from Hi-C reads to construct a scaffolding graph iteratively based on N-best neighbors of contigs. It identifies potential misjoins and breaks them to keep the scaffolding accuracy.

SAT, a new format inspired by GFA and extended to keep scaffolding information. In each iteration, if the SAT file is used as an input, the paths are constructed first, and each original contig in the draft assembly keeps a record of its corresponding scaffold.





□ EnGRaiN: A Supervised Ensemble Learning Method for Recovery of Large-scale Gene Regulatory Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab829/6458321

EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs.

EnGRaiN integrates interaction/co-expression predictions from multiple gene network inference methods to generate a comprehensive ensemble network of gene interactions. EnGRaiN leverages the ground truth to learn optimal distribution over its various features.





□ SCRIP: an accurate simulator for single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab824/6454945

SCRIP provides a flexible Gamma-Poisson mixture and a Beta-Gamma-Poisson mixture framework to simulate scRNA-seq data. The SCRIP package is built on the framework of splatter. Both the Gamma-Poisson and Beta-Poisson distributions model the overdispersion of scRNA-seq data.

Specifically, the Beta-Poisson model is used to model the bursting effect. The dispersion is accurately simulated by fitting the mean-BCV dependency with a Generalized Additive Model.

SCRIP models other key characteristics of scRNA-seq data incl. library size, zero inflation and outliers. SCRIP enables various applications for different experimental designs and goals, including DE analysis, clustering analysis, trajectory-based analysis and bursting analysis.





□ schist: Nested Stochastic Block Models applied to the analysis of single cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04489-7

schist is a convenient wrapper to the graph-tool python library, designed to be used with scanpy. The most prominent function is schist.inference.nested_model(), which takes an AnnData object as input and fits a nested Stochastic Block Model on the kNN graph built with scanpy.
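
A minimal usage sketch following that description (the file name and the exact location of the resulting labels are assumptions):

import scanpy as sc
import schist

adata = sc.read_h5ad("pbmc.h5ad")        # any AnnData object
sc.pp.neighbors(adata)                   # kNN graph the model is fit on
schist.inference.nested_model(adata)     # fits the nested SBM in place
# hierarchical group labels land in adata.obs (e.g. per-level nsbm columns)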

The Bayesian formulation of Stochastic Block Models provides the possibility to perform inference on a graph for any partition configuration, thus allowing reliable model selection using an interpretable measure, entropy.





□ scShaper: an ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab831/6458323

scShaper, a new trajectory inference method that enables accurate linear trajectory inference. The ensemble approach of scShaper generates a continuous smooth pseudotime based on a set of discrete pseudotimes.

scShaper is a fast method with few hyperparameters, making it a promising alternative to the principal curves method for linear pseudotemporal ordering.

scShaper is based on graph theory and solves the shortest Hamiltonian path of a clustering, utilizing a greedy algorithm to permute clusterings computed using the k-means method to obtain a set of discrete pseudotimes.





□ GNNImpute: An efficient scRNA-seq dropout imputation method using graph attention network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04493-x

GNNImpute, an autoencoder structure network that uses graph attention convolution to aggregate multi-level similar cell information and implements convolution operations on non-Euclidean space.

GNNImpute compensates for the low expression intensity of some genes by aggregating the feature information of similar cells. It can recover dropout events in the scRNA-seq data while retaining the specificity between cells, avoiding excessive smoothing of expression.

GNNImpute can accurately and effectively impute the dropout and reduce dropout noise. GNNImpute enables the expression of the cells in the same tissue area to be embedded in low-dimensional vectors.





□ scBERT: a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.12.05.471261v1.full.pdf

scBERT (single-cell Bidirectional Encoder Representations from Transformers) follows the state-of-the-art paradigm of pre-train and fine-tune in the deep learning field.

scBERT formulates the expression profile of each single cell into embeddings for genes. scBERT computes the probability for the provided cell to be any cell type labelled in the reference dataset.

scBERT keeps the full gene-level interpretation, abandons the use of HVGs and dimensionality reduction, and lets discriminative genes and useful interactions come to the surface by themselves.

scBERT allows for the discovery of gene expression patterns that account for cell type annotation in an unbiased data-driven manner. scBERT pioneered the application of Transformer architectures in scRNA-seq data analysis with innovatively designed embeddings for genes.





□ GINCCo: Unsupervised construction of computational graphs for gene expression data with explicit structural inductive biases

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab830/6458322

GINCCo (Gene Interaction Network Constrained Construction), an unsupervised method for automated construction of computational graph models for gene expression data that are structurally constrained by prior knowledge of gene interaction networks.

Each of the entities in the GINCCo computational graph represent biological entities such as genes, candidate protein complexes and phenotypes instead of arbitrary hidden nodes of a neural network.

GINCCo performs model construction in a completely automated and deterministic manner; this can be seen as a preprocessing step, allowing GINCCo to scale immensely and study factor graphs without task-specific optimization dictating the shape of the models.





□ sciCAN: Single-cell chromatin accessibility and gene expression data integration via Cycle-consistent Adversarial Network

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470677v1.full.pdf

sciCAN removes modality differences while keeping true biological variation. The model architecture of sciCAN contains two major components: representation learning and modality alignment.

sciCAN doesn’t require cell anchors and thus can be applied to most non-jointly profiled single-cell data. sciCAN enables co-embedding and co-clustering of RNA-seq and ATAC-seq data. sciCAN reduces each dataset into a 128-dimension space.





□ propeller: testing for differences in cell type proportions in single cell data

>> https://www.biorxiv.org/content/10.1101/2021.11.28.470236v1.full.pdf

propeller, a robust and flexible method that leverages biological replication to find statistically significant differences in cell type proportions between groups.

Propeller leverages biological replication to estimate the high sample-to-sample variability in cell type counts often observed in real single cell data.

The minimal annotation information that propeller requires for each cell is cluster/cell type, sample and group/condition, which can be automatically extracted from Seurat and SingleCellExperiment class objects.

The propeller function calculates cell type proportions for each biological replicate, performs a variance stabilising transformation on the matrix of proportions and fits a linear model for each cell type or cluster using the limma framework.
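
The proportion-and-transform step can be sketched as follows, assuming an arcsin-square-root variance stabilising transformation (one common choice; a sketch of the idea, not the R package itself):

import numpy as np

def propeller_transform(counts):
    # counts: replicates x cell-types matrix of cell counts
    props = counts / counts.sum(axis=1, keepdims=True)   # per-replicate proportions
    return np.arcsin(np.sqrt(props))                     # variance stabilisation

# the transformed matrix is then fit with one linear model per cell type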




□ AlphaFill: enriching the AlphaFold models with ligands and co-factors

>> https://www.biorxiv.org/content/10.1101/2021.11.26.470110v1.full.pdf

AlphaFill, an algorithm based on sequence and structure similarity, to “transplant” such “missing” small molecules and ions from experimentally determined structures. AlphaFill should be complemented by structure-based transfer algorithms.

The sequence of the AlphaFold model is BLASTed against the sequence file of the LAHMA webserver, which contains all sequences present in the PDB-REDO databank. The hits are sorted by E-value and a maximum of 250 hits, the default for BLAST, is returned.

The selection of hits is then structurally aligned, based on the Cα-atoms of the residues matched in the BLAST alignment. The root-mean-square deviation (RMSD) of this global alignment is stored in the AlphaFill metadata.





□ HiCArch: A Deep Learning-based Hi-C Data Predictor

>> https://www.biorxiv.org/content/10.1101/2021.11.26.470146v1.full.pdf

HiCArch, a transformer-based model architecture for Hi-C contact matrices prediction based on the 11 types of K562 epigenomic features, consisting of chromatin binding factors and histone modifications.

HiCArch processes the sequential input and generates the 2D Hi-C matrix via two main modules: sequence-to-sequence (seqToSeq, or STS) module, sequence-to-matrix (seqToMat, or STM) module.




□ Predicting environmentally responsive transgenerational differential DNA methylated regions (epimutations) in the genome using a hybrid deep-machine learning approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04491-z

a hybrid DL-ML approach that uses a deep neural network for extracting molecular features and a non-DL classifier to predict environmentally responsive transgenerational differential DNA methylated regions (DMRs), termed epimutations, based on the extracted DL-based features.

The process of generating features is supervised. A 1000 bp input DNA sequence is one-hot encoded as a 5 × 1000 binary matrix. Each convolutional layer is followed by a batch-normalization layer and a ReLU layer.





□ Navigating the pitfalls of applying machine learning in genomics

>> https://www.nature.com/articles/s41576-021-00434-9

Jacob Schreiber:
Although this high-level explanation covers our main point, we describe five specific (related) pitfalls that one can encounter in this space through the lens of train/test/prediction sets to drive home how common it is to make a mistake in an evaluation setting.

Importantly: CROSS-FOLD VALIDATION IS NOT THE SOLUTION. In fact, blindly applying cross-fold validation to biological data without thinking about your anticipated use case (the prediction set) can give you a false sense of security in the face of complexity.




□ Codex DNA increases productivity & efficiency of mRNA synthesis, launching BioXP kits with CleanCap Reagent AG

Automated platform accelerates development of mRNA-based #vaccines & therapies

>> https://codexdna.com/products/bioxp-kits/mrna-synthesis/




□ KaKs_Calculator 3.0: calculating selective pressure on coding and non-coding sequences

>> https://www.biorxiv.org/content/10.1101/2021.11.25.469998v1.full.pdf

Similar to the nonsynonymous/synonymous substitution rate ratio for coding sequences, selection on non-coding sequences can be quantified as non-coding nucleotide substitution rate normalized by synonymous substitution rate of adjacent coding sequences.

KaKs_Calculator detects the mode of selection operating on molecular sequences, demonstrating its potential for genome-wide scans of natural selection on diverse sequences and for identification of potentially functional elements at whole-genome scale.





□ Systematic evaluation of cell-type deconvolution pipelines for sequencing-based bulk DNA methylomes

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470374v1.full.pdf

All compared sequencing-based methods consist of two common steps, informative region selection and cell-type composition estimation.

In the informative region selection step, the sequencing-based cell-type deconvolution methods filter out CpGs where the methylation patterns do not clearly demonstrate cell-type heterogeneity.

Whereas selecting similar genomic regions to DMRs generally contributed to increasing the performance in bi-component mixtures, the uniformity of cell-type distribution showed a high correlation with the performance in five cell-type bulk analyses.





□ GraphPrompt: Biomedical Entity Normalization Using Graph-based Prompt Templates

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470486v1.full.pdf

OBO-syn encompasses 70 biomedical entity types and 2 million entity-synonym pairs. OBO-syn demonstrates small overlaps with existing datasets and more challenging entity-synonym predictions.

GraphPrompt, a prompt-based learning method for entity normalization with the consideration of graph structures. GraphPrompt solves a masked-language model task. GraphPrompt has obtained superior performance to the other approaches on both few-shot and zero-shot settings.





□ CLA: Automated identification of cell-type–specific genes and alternative promoters

>> https://www.biorxiv.org/content/10.1101/2021.12.01.470587v1.full.pdf

Cell Lineage Analysis (CLA), a computational method which identifies transcriptional features with expression patterns that discriminate cell types, incorporating Cell Ontology knowledge on the relationship between different cell types.

CLA uses random forest classification with a stratified bootstrap to increase the accuracy of binary classifiers when each cell type has a different number of samples.

CLA runs multiple instances of regularized random forest and reports the transcriptional features consistently selected. CLA not only discriminates individual cell types but can also discriminate lineages of cell types related in the developmental hierarchy.





□ CSmiR: Exploring cell-specific miRNA regulation with single-cell miRNA-mRNA co-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04498-6

CSmiR (Cell-Specific miRNA regulation) to combine single-cell miRNA-mRNA co-sequencing data and putative miRNA-mRNA binding information to identify miRNA regulatory networks at the resolution of individual cells.

CSmiR is effective in predicting cell-specific miRNA targets. Finally, through exploring cell–cell similarity matrix characterized by cell-specific miRNA regulation, CSmiR provides a novel strategy for clustering single-cells and helps to understand cell–cell crosstalk.





□ CombSAFE: Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab815/6448225

CombSAFE allows analyzing the whole genome, by clustering patterns of regions with similar functional elements and through enrichment analyses to discover ontological terms significantly associated with them.

CombSAFE allows comparing functional states of a specific genomic region to analyze their different behavior throughout the various semantic annotations.





□ KAGE: Fast alignment-free graph-based genotyping of SNPs and short indels

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471074v1.full.pdf

Since traditional reference genomes do not include genetic variation, traditional genotypers suffer from reference bias and poor accuracy in variation-rich regions where reads cannot accurately be mapped.

These methods work by representing genetic variants by their surrounding kmers (sequences with length k covering each variant) and looking for support for these kmers in the sequenced reads.

KAGE, a genotyper for SNPs and short indels that is inspired by recent developments within graph-based genome representations and alignment-free genotyping.
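
A toy sketch of the kmer-support idea (the dictionary layout and naive counting are illustrative; KAGE's probabilistic model and counting machinery are far more efficient):

def kmer_support(reads, allele_kmers, k=31):
    # allele_kmers: maps allele name -> set of kmers covering the variant;
    # genotype likelihoods are then derived from these support counts
    counts = {a: 0 for a in allele_kmers}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            for allele, kmers in allele_kmers.items():
                if kmer in kmers:
                    counts[allele] += 1
    return counts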





□ FastMLST: A Multi-core Tool for Multilocus Sequence Typing of Draft Genome Assemblies

>> https://journals.sagepub.com/doi/10.1177/11779322211059238

FastMLST, a tool that is designed to perform PubMLST searches using BLASTn and a divide-and-conquer approach that processes each genome assembly in parallel.

The output offered by FastMLST includes a table with the ST, allelic profile, and clonal complex or clade (when available), detected for a query, as well as a multi-FASTA file or a series of FASTA files with the concatenated or single allele sequences detected.

FastMLST assigns STs to thousands of genomes in minutes with 100% concordance in genomes without suspected contamination in a wide variety of species with different genome lengths, %GC, and assembly fragmentation levels.





□ TRAWLING: a Transcriptome Reference Aware of spLIciNG events.

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471115v1.full.pdf

TRAWLING makes the identification of splicing events from RNA-seq data simple and fast, while leveraging the suite of tools developed for alignment-free methods. It allows the aggregation of read counts based on donor and acceptor splice motifs.

TRAWLING was evaluated on three different RNA sequencing datasets: whole transcriptome sequencing, single cell RNA sequencing, and Digital RNA w/ pertUrbation of Genes. TRAWLING did not misalign or lose reads, so it can be used by default w/o loss of generality for gene-level quantification.





□ DARTS: an Algorithm for Domain-Associated RetroTransposon Search in Genome Assemblies

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471067v1.full.pdf

DARTS has radically higher sensitivity for identifying long terminal repeat retrotransposons (LTR-RTs) compared to the widely used LTRharvest tool.

DARTS returns a set of structurally annotated nucleotide and amino acid sequences which can be readily used in subsequent comparative and phylogenetic analyses.




□ pystablemotifs: Python library for attractor identification and control in Boolean networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab825/6454946

pystablemotifs is a Python 3 library for analyzing Boolean networks. Its non-heuristic and exhaustive attractor identification algorithm was previously presented in (Rozum et al. 2021).

The authors illustrate its performance improvements over similar methods and discuss how it uses outputs of the attractor identification process to drive a system to one of its attractors from any initial state.





□ CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471436v1.full.pdf

CMash combines a modified MinHash technique (ArgMinHash) with a data structure called a k-mer ternary search tree (KTST), which allows Jaccard and containment indices to be computed at multiple k-mer sizes efficiently and simultaneously.

This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient.

The CMash estimates of the Jaccard and containment indices do not deviate significantly from the ground truth, indicating that this approach can give fast and reliable results with minimal bias.
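
A conceptual sketch of the truncation trick, assuming plain Python sets in place of the KTST and MinHash sketching:

def containment(kmers_a, kmers_b):
    # |A ∩ B| / |A|
    return len(kmers_a & kmers_b) / len(kmers_a)

def truncated_kmers(kmers, k):
    # truncating long kmers to length k lets one stored set serve many
    # k sizes without rebuilding (the real tool stores prefixes in a
    # ternary search tree)
    return {km[:k] for km in kmers}

# containment(truncated_kmers(set_a, 21), truncated_kmers(set_b, 21))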





□ Genovo: A method to build extended sequence context models of point mutations and indels

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471476v1.full.pdf

a new method that solves this problem by grouping similar k-mers using IUPAC patterns. It calculates a table with the number of times each possible k-mer is observed with the central base mutated and unmutated.

Genovo predicts the expected number of synonymous, missense, and other functional mutation types for each gene. the created mutation rate models increase the statistical power to detect genes containing disease-causing variants and to identify genes under strong constraint.





□ DALI (Diversity AnaLysis Interface): a novel tool for the integrated analysis of multimodal single cell RNAseq data and immune receptor profiling.

>> https://www.biorxiv.org/content/10.1101/2021.12.07.471549v1.full.pdf

Diversity AnaLysis Interface (DALI) interacts with the Seurat R package and is aimed to support the advanced bioinformatician with a set of novel methods and an easier integration of existing tools for BCR and TCR analysis in their single cell workflow.





□ LEXAS: a web application for life science experiment search and suggestion

>> https://www.biorxiv.org/content/10.1101/2021.12.05.471323v1.full.pdf

LEXAS (Life-science EXperiment seArch and Suggestion) curates the description of biomedical experiments and suggests the experiments on genes that could be performed next.

LEXAS allows users to choose between two machine learning models that are used for the suggestion. One is a “reliable” model that uses seven major biomedical databases, such as BioGRID, and four knowledgebases, such as the Gene Ontology.





□ MCKAT: a multi-dimensional copy number variant kernel association test

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04494-w

MCKAT utilizes both multi-dimensional features of the CNVs & their heterogeneity effect. MCKAT not only indicates stronger evidence in detecting significant associations b/n CNVs & disease-related traits, but is also applicable to both rare & common CNV datasets.





Ritardando.

2021-12-13 22:03:07 | Science News




□ Fugue: Scalable batch-correction method for integrating large-scale single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472307v1.full.pdf

Fugue extends the deep learning method at the heart of the recently published Miscell approach. Miscell learns representations of single-cell expression profiles through contrastive learning and achieves high performance on canonical single-cell analysis tasks.

Fugue encodes the batch information of each cell as a trainable parameter added to its expression profile; a contrastive learning approach is used to learn the feature representation. Fugue can learn smooth embeddings for time-course trajectories and a joint embedding space.
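
A PyTorch-style sketch of the trainable batch term (layer sizes and names are illustrative assumptions):

import torch.nn as nn

class BatchAwareEncoder(nn.Module):
    # each batch gets a trainable gene-dimensional embedding that is
    # added to the expression profile before the shared encoder;
    # contrastive training then shapes the representation
    def __init__(self, n_genes, n_batches, dim=128):
        super().__init__()
        self.batch_emb = nn.Embedding(n_batches, n_genes)
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, expr, batch_idx):
        return self.encoder(expr + self.batch_emb(batch_idx))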





□ FIN: Bayesian Factor Analysis for Inference on Interactions

>> https://www.tandfonline.com/doi/full/10.1080/01621459.2020.1745813

Current methods for quadratic regression are not ideal in these applications due to the level of correlation in the predictors, the fact that strong sparsity assumptions are not appropriate, and the need for uncertainty quantification.

FIN exploits the correlation structure of the predictors, and estimates interaction effects in high dimensional settings. FIN uses a latent factor joint model, which incl. shared factors in both the predictor and response components while assuming conditional independence.





□ Pint: A Fast Lasso-Based Method for Inferring Higher-Order Interactions

>> https://www.biorxiv.org/content/10.1101/2021.12.13.471844v1.full.pdf

Pint performs square-root lasso regression on all pairwise interactions on a one thousand gene screen, using ten thousand siRNAs, in 15 seconds, and all three-way interactions on the same set in under ten minutes.

Pint is based on an existing fast algorithm, adapted for use on binary matrices. The three components of the algorithm, pruning, active set calculation, and solving the sub-problem, can all be done in parallel.





□ TopHap: Rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472454v1.full.pdf

TopHap determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods.

In the TopHap approach, bootstrap branch support for the inferred phylogeny of common haplotypes is calculated by resampling genomes to build bootstrap replicate datasets.

This procedure assesses the robustness of the inferred phylogeny to the inclusion/exclusion of haplotypes likely created by sequencing errors and convergent changes that are expected to have relatively low frequencies spatiotemporally.





□ swCAM: estimation of subtype-specific expressions in individual samples with unsupervised sample-wise deconvolution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab839/6460803

a sample-wise Convex Analysis of Mixtures (swCAM) can accurately estimate subtype-specific expressions of major subtypes in individual samples and successfully extract co-expression networks in particular subtypes that are otherwise unobtainable using bulk expression data.

Fundamental to the success of the swCAM solution is nuclear-norm and l2,1-norm regularized low-rank latent variable modeling.

Hyperparameter values are determined using cross-validation with random entry exclusion, and a swCAM solution is obtained using an efficient alternating direction method of multipliers.





□ Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

>> https://www.biorxiv.org/content/10.1101/2021.12.14.472718v1.full.pdf

The compacted de Bruijn graph forms a vertex-decomposition of the graph, while preserving the graph topology. However, for some applications, only the vertex-decomposition is sufficient, and preservation of the topology is redundant.

for applications such as performing presence-absence queries for k-mers or associating information to the constituent k-mers of the input, any set of strings that preserves the exact set of k-mers from the input sequences can be sufficient.

Relaxing the defining requirement of unitigs, that the paths be non-branching in the underlying graph, and seeking instead a set of maximal non-overlapping paths covering the de Bruijn graph, results in a more compact representation of the input data.

Cuttlefish 2 can seamlessly extract such maximal path covers by simply constraining the algorithm to operate on some specific subgraph(s) of the original graph.





□ Matchtigs: minimum plain text of kmer sets

>> https://www.biorxiv.org/content/10.1101/2021.12.15.472871v1.full.pdf

Matchtigs, a polynomial algorithm computing a minimum representation (which was previously posed as a potentially NP-hard open problem), as well as an efficient near-minimum greedy heuristic.

Matchtigs finds an SPSS (spectrum preserving string set) of minimum size (CL). The SPSS problem allowing repeated kmers is polynomially solvable, based on a many-to-many min-cost path query and a min-cost perfect matching approach.





□ AliSim: A Fast and Versatile Phylogenetic Sequence Simulator For the Genomic Era

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472905v1.full.pdf

AliSim integrates a wide range of evolutionary models, available in the IQ-TREE. AliSim can simulate MSAs that mimic the evolutionary processes underlying empirical alignments.

AliSim implements an adaptive approach that combines the commonly-used rate matrix and probability matrix approach. AliSim works by first generating a sequence at the root of the tree following the stationarity of the model.

AliSim then recursively traverses along the tree to generate sequences at each node of the tree based on the sequence of its ancestral node. AliSim completes this process once all the sequences at the tips are generated.
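
A toy version of that root-then-traverse scheme, assuming a nested-dict tree and a per-branch 4x4 substitution probability matrix P (e.g. from expm(Q·t) for branch length t):

import numpy as np

BASES = "ACGT"

def evolve(seq, P, rng):
    # substitute each site according to P[parent_base, child_base]
    return "".join(BASES[rng.choice(4, p=P[BASES.index(b)])] for b in seq)

def simulate(node, seq, P, rng, out):
    # deposit the current sequence, then recurse into the children;
    # node is {'name': ..., 'children': [...]} (an assumed toy format)
    out[node["name"]] = seq
    for child in node.get("children", []):
        simulate(child, evolve(seq, P, rng), P, rng, out)

# the root sequence would be drawn from the model's stationary frequencies;
# out = {}; simulate(tree, root_seq, P, rng, out) fills in all node sequences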





□ ortho2align: a sensitive approach for searching for orthologues of novel lncRNAs

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472946v1.full.pdf

lncRNAs exhibit low sequence conservation, so specific methods for enhancing the signal-to-noise ratio were developed. Nevertheless, current methods such as transcriptome comparison or searches for conserved secondary structures are not applicable to novel lncRNAs by design.

ortho2align — a synteny-based approach for finding orthologues of novel lncRNAs with a statistical assessment of sequence conservation. ortho2align allows control of the specificity of the search process and optional annotation of found orthologues.





□ EmptyNN: A neural network based on positive and unlabeled learning to remove cell-free droplets and recover lost cells in scRNA-seq data

>> https://www.cell.com/patterns/fulltext/S2666-3899(21)00154-9

EmptyNN accurately removed cell-free droplets while recovering lost cell clusters, achieving areas under the receiver operating characteristic curve of 94.73% and 96.30%, respectively.

EmptyNN takes the raw count matrix as input, where rows represent barcodes and columns represent genes. The output is a list containing a Boolean vector indicating whether each droplet is cell-containing or cell-free, as well as the probability for each droplet.





□ AMAW: automated gene annotation for non-model eukaryotic genomes

>> https://www.biorxiv.org/content/10.1101/2021.12.07.471566v1.full.pdf

Iterative runs of MAKER2 must also be coordinated to aim for accurate predictions, which includes intermediary specific training of different gene predictor models.

AMAW (Automated MAKER2 Annotation Wrapper), a program devised to annotate non-model unicellular eukaryotic genomes by automating the acquisition of evidence data.




□ Pak RT

Merge supply is decreasing.
Watch.

>> https://etherscan.io/token/0x27d270b7d58d15d455c85c02286413075f3c8a31





□ HolistIC: leveraging Hi-C and whole genome shotgun sequencing for double minute chromosome discovery

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab816/6458320

HolistIC can enhance double minute chromosome predictions by predicting DMs with overlapping amplicon coordinates. HolistIC can uncover double minutes, even in the presence of DM segments with overlapping coordinates.

HolistIC is ideal for confirming the true association of amplicons to circular extrachromosomal DNA. It is modular in that the double minute prediction input can come from any program, which lends additional flexibility for future eccDNA discovery algorithms.





□ geneBasis: an iterative approach for unsupervised selection of targeted gene panels from scRNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02548-z

geneBasis, an iterative approach for selecting an optimal gene panel, where each newly added gene captures the maximum distance between the true manifold and the manifold constructed using the currently selected gene panel.

geneBasis allows recovery of local and global variability. geneBasis accounts for batch effect and handles unbalanced cell type composition.

geneBasis constructs k-NN graphs within each batch, thereby assigning nearest neighbors only from the same batch and mitigating technical effects. Minkowski distances per gene are calculated across all cells from every batch, resulting in a single scalar value for each gene.





□ scMARK an 'MNIST' like benchmark to evaluate and optimize models for unifying scRNA data

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471773v1.full.pdf

scMARK uses unsupervised models to reduce the complete set of single-cell gene expression matrices into a unified cell-type embedding space, and trains a collection of supervised models to predict author labels from all but one held-out dataset in this unified cell-type space.

scMARK shows that scVI is the only tested method that benefits from larger training datasets. Qualitative assessment of the unified cell-type space indicates that the scVI embedding is suitable for automatic cell-type labeling and discovery of new cell-types.





□ DISA tool: discriminative and informative subspace assessment with categorical and numerical outcomes

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471785v1.full.pdf

DISA (Discriminative & Informative Subspace Assessment) is proposed to assess patterns in the presence of numerical outcomes using well-established measures together w/ a novel principle able to statistically assess the correlation gain of the subspace against the overall space.

If DISA receives a numerical outcome, a range of values in which samples are valid is determined. DISA accomplishes this by approximating two probability density functions (e.g. Gaussians), one for all the observed targets and the other with targets of the target subspace.





□ Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471868v1.full.pdf

a new release of StringTie which allows transcriptome assembly and quantification using a hybrid dataset containing both short and long reads.

Hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone.





□ scATAK: Efficient pre-processing of Single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471788v1.full.pdf

The scATAK track module generates group ATAC signal tracks (normalized by mapped group read counts) from a cell barcode–cell group table and a sample pseudo-bulk alignment file.

The scATAK hic module utilizes a provided bulk Hi-C or HiChIP interactome map together with a single-cell accessible chromatin region matrix to infer potential chromatin looping events for individual cells and generate group Hi-C interaction tracks.





□ DeepPlnc: Discovering plant lncRNAs through multimodal deep learning on sequential data

>> https://www.biorxiv.org/content/10.1101/2021.12.10.472074v1.full.pdf

LncRNAs are thought to act as key modulators of various biological processes; their involvement in controlling transcription through enhancers and in providing regulatory binding sites is well reported.

DeepPlnc can accurately annotate even incomplete-length transcripts, which are very common in de novo assembled transcriptomes. It incorporates a bi-modal architecture of Convolutional Neural Nets while extracting information from the sequences of nucleotides.




□ A mosaic bulk-solvent model improves density maps and the fit between model and data

>> https://www.biorxiv.org/content/10.1101/2021.12.09.471976v1

The mosaic bulk-solvent model considers solvent variation across the unit cell. The mosaic model is implemented in the computational crystallography toolbox and can be used in Phenix in most contexts where accounting for bulk-solvent is required.

Using the mosaic solvent model improves the overall fit of the model to the data and reduces artifacts in residual maps. The mosaic model algorithm was systematically exercised against a large subset of PDB entries to ensure its robustness and practical utility to improve maps.




□ Coalescent tree recording with selection for fast forward-in-time simulations

>> https://www.biorxiv.org/content/10.1101/2021.12.06.470918v1.full.pdf

The algorithm records the genetic history of a species, directly places the mutations on the tree and infers fitness of subsets of the genome from parental haplotypes. The algorithm explores the tree to reconstruct the genetic data at the recombining segment.

When reproducing, if a segment is transmitted without recombination, then the fitness contribution of this segment in the offspring individual is simply the fitness contribution of the parental segment multiplied by the effects of eventual new mutations.





□ snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

>> https://f1000research.com/articles/10-567

snpQT: a scalable, stand-alone software pipeline using nextflow and BioContainers, for comprehensive, reproducible and interactive quality control of human genomic data.

snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of data before and after each quality control procedure.





□ High performance of a GPU-accelerated variant calling tool in genome data analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472266v1.full.pdf

Sequencing data were analyzed on the GPU server using BaseNumber, the variant calling outputs of which were compared to the reference VCF or the results generated by the Burrows-Wheeler Aligner (BWA) + Genome Analysis Toolkit (GATK) pipeline on a generic CPU server.

BaseNumber demonstrated high precision (99.32%) and recall (99.86%) rates in variant calls compared to the standard reference. The variant calling outputs of the BaseNumber and GATK pipelines were very similar, with a mean F1 of 99.69%.




□ treedata.table: a wrapper for data.table that enables fast manipulation of large phylogenetic trees matched to data

>> https://peerj.com/articles/12450/

treedata.table, the first R package extending the functionality and syntax of data.table to explicitly deal with phylogenetic comparative datasets.

treedata.table significantly increases speed and reproducibility during the data manipulation involved in the phylogenetic comparative workflow. After an initial tree/data matching step, treedata.table continuously preserves the tree/data matching across data.table operations.





□ tRForest: a novel random forest-based algorithm for tRNA-derived fragment target prediction

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472430v1.full.pdf

A significant advantage of using random forests is that they avoid overfitting, a common limitation of machine learning algorithms in which they become tailored specifically to the dataset they were trained on and thus become less predictive in independent datasets.

tRForest, a tRF target prediction algorithm built using the random forest machine learning algorithm. This algorithm predicts targets for all tRFs, including tRF-1s and includes a broad range of features to fully capture tRF-mRNA interaction.





□ Flimma: a federated and privacy-aware tool for differential gene expression analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02553-2

Flimma (Federated Limma-Voom) preserves the privacy of the local data, since the expression profiles never leave the local execution sites.

In contrast to meta-analysis approaches, Flimma is particularly robust against heterogeneous distributions of data across the different cohorts, which makes it a powerful alternative for multi-center studies where patient privacy matters.





□ GREPore-seq: A Robust Workflow to Detect Changes after Gene Editing through Long-range PCR and Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472514v1.full.pdf

GREPore-seq captures the barcoded sequences by grepping reads of nanopore amplicon sequencing. GREPore-seq combines indel-correcting DNA barcodes with the sequencing of long amplicons on the ONT platforms.

GREPore-seq can detect NHEJ-mediated double-stranded oligodeoxynucleotide (dsODN) insertions with comparable accuracy to Illumina NGS. GREPore-seq also identifies HDR-mediated large gene knock-in, which excellently correlates with FACS analysis data.





□ CellOT: Learning Single-Cell Perturbation Responses using Neural Optimal Transport

>> https://www.biorxiv.org/content/10.1101/2021.12.15.472775v1.full.pdf

Leveraging the theory of optimal transport and recent advances in convex neural architectures, they learn a coupling describing the response of cell populations upon perturbation, enabling prediction of state trajectories at the single-cell level.

CellOT, a novel approach to predict single-cell perturbation responses by uncovering couplings between control and perturbed cell states while accounting for heterogeneous subpopulation structures of molecular environments.





□ splatPop: simulating population scale single-cell RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02546-1

splatPop, a model for flexible, reproducible, and well-documented simulation of population-scale scRNA-seq data with known expression quantitative trait loci. splatPop can also be instructed to assign pairs of eGenes the same eSNP.

The splatPop model utilizes the flexible framework of Splatter, and can simulate complex batch, cell group, and conditional effects between individuals from different cohorts as well as genetically-driven co-expression.





□ Nfeature: A platform for computing features of nucleotide sequences

>> https://www.biorxiv.org/content/10.1101/2021.12.14.472723v1.full.pdf

Nfeature comprises three major modules, namely Composition, Correlation, and Binary profiles. The Composition module computes different types of compositions, including mono-/di-/tri-nucleotide composition, reverse complement composition, and pseudo composition.

The Correlation module computes various types of correlations, including auto-correlation, cross-correlation, and pseudo-correlation. Similarly, the binary profile module computes binary profiles based on nucleotides, di-nucleotides, and di-/tri-nucleotide properties.

Nfeature also computes sequence entropy, repeats, and the distribution of nucleotides in sequences. In total, the tool computes 29,217 and 14,385 features for DNA and RNA sequences, respectively.
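
For instance, the composition features for small k reduce to normalized k-mer counts over all 4^k possibilities (a sketch of the concept, not Nfeature's code):

from itertools import product
from collections import Counter

def composition(seq, k):
    # fraction of each of the 4**k possible k-mers; k = 1, 2, 3 gives
    # mono-, di-, and tri-nucleotide composition respectively
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return {"".join(km): counts["".join(km)] / total
            for km in product("ACGT", repeat=k)}

# composition("ACGTACGT", 1) -> {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}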





□ GENPPI: standalone software for creating protein interaction networks from genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04501-0

GENPPI can help fill the gap concerning the considerable number of novel genomes assembled monthly and our ability to process interaction networks considering the noncore genes for all completed genome versions.

GENPPI transfers the question of topological annotation from centralized databases to the final user, the researcher, at the initial point of research. The GENPPI topological annotation information is directly proportional to the number of genomes used to create an annotation.





□ Sim-it: A benchmark of structural variation detection by long reads through a realistic simulated model

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02551-4

Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it reveal the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms.

combiSV is a new method that can combine the results from structural variation callers into a superior call set with increased recall and precision, which is also observed for the latest structural variation benchmark set.





□ seGMM: a new tool to infer sex from massively parallel sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472877v1.full.pdf

seGMM, a new sex inference tool that determines the sex of a sample from called genotypes integrated with aligned reads, jointly considering information on the X and Y chromosomes in diverse genomic data, including TGS panel data.

seGMM applies Gaussian Mixture Model (GMM) clustering to classify the samples into different clusters. seGMM provides a reproducible framework to infer sex from massively parallel sequencing data and has great promise in clinical genetics.
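
A sketch of the GMM clustering step with scikit-learn, assuming two sex-informative features per sample (e.g. X-chromosome heterozygosity and normalized Y read depth; the feature choice here is an assumption):

import numpy as np
from sklearn.mixture import GaussianMixture

def infer_sex(features):
    # features: samples x 2 matrix; column 1 assumed to be Y coverage
    gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
    labels = gmm.predict(features)
    # label the component with the higher mean Y coverage as male
    male = int(np.argmax(gmm.means_[:, 1]))
    return ["male" if l == male else "female" for l in labels]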





□ FourierDist: HarmonicNet: Fully Automatic Cell Segmentation with Fourier Descriptors

>> https://www.biorxiv.org/content/10.1101/2021.12.17.472408v1.full.pdf

FourierDist, a network, which is a modification of the popular StarDist and SplineDist architectures. FourierDist utilizes Fourier descriptors, predicting a coefficient vector for every pixel on the image, which implicitly define the resulting segmentation.

FourierDist is also capable of accurately segmenting objects that are not star-shaped, a case where StarDist performs suboptimally.





□ Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04247-9

Firstly, for entity identification and classification, they implemented two bidirectional Long Short-Term Memory (Bi-LSTM) layers with a CRF layer based on the NeuroNER model. The architecture of this model consists of a first Bi-LSTM layer for character embeddings.

In the second layer, they concatenate the output of the first layer with word embeddings and sense-disambiguated embeddings for the second Bi-LSTM layer. Finally, the last layer uses a CRF to obtain the most suitable labels for each token.