□ DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning
DeepSVP significantly improves the success rate of finding causative variants over StrVCTVRE and CADD-SV. DeepSVP uses as input an annotated VCF file of an individual and clinical phenotypes encoded using the Human Phenotype Ontology.
DeepSVP overcomes the limitation of missing phenotypes by incorporating gene-level information through ontologies, mainly the functions of gene products, gene expression in individual cell types, and anatomical sites of expression, and by systematically relating these to their phenotypic consequences.
□ MultiMAP: dimensionality reduction and integration of multimodal data
MultiMAP is based on a framework of Riemannian geometry and algebraic topology and generalizes the UMAP framework to the setting of multiple datasets each with different dimensionality.
MultiMAP takes as input any number of datasets of potentially differing dimensions and recovers geodesic distances on a single latent manifold on which all of the data is uniformly distributed.
□ MSRCall: A Multi-scale Deep Neural Network to Basecall Oxford Nanopore Sequences
MSRCall first uses convolutional layers to perform multi-scale downsampling. These back-to-back convolutional layers aim to capture features with receptive fields at different levels of complexity.
MSRCall simultaneously utilizes multi-scale convolutional and bidirectional LSTM layers to capture semantic information. MSRCall disentangles the relationship between raw signal data and nucleotide labels.
□ cLoops2: a full-stack comprehensive analytical tool for chromatin interactions
cLoops2 consists of core modules for peak-calling, loop-calling, differentially enriched loops calling and loops annotation. cLoops2 addresses the practical analysis requirements, especially for loop-centric analysis with preferential design for Hi-TrAC/TrAC-looping data.
cLoops2 directly analyzes the paired-end tags to find candidate peaks and loops. It estimates the statistical significance of the peak/loop features with a permuted local background, eliminating the bias introduced by tuning third-party peak-calling parameters when calling loops.
□ CMIA: Gene regulation network inference using k-nearest neighbor-based mutual information estimation- Revisiting an old DREAM
Estimating mutual information (MI) with the kNN-based Kraskov-Stögbauer-Grassberger (KSG) algorithm leads to a significant improvement in GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR).
CMIA (Conditional Mutual Information Augmentation) is a novel inference algorithm inspired by Synergy-Augmented CLR. Looking forward, the goal of complete reconstruction of GRNs may require new inference algorithms and probably MI in more than three dimensions.
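To make the kNN-based MI estimate concrete, here is a minimal pure-Python sketch of the first KSG estimator for two one-dimensional samples. It is not the CMIA code; the function names and the toy Gaussian data are assumptions made for illustration, and the brute-force neighbour search is only suitable for small samples.

```python
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer n: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def ksg_mutual_information(xs, ys, k=3):
    """KSG estimator (algorithm 1) of I(X;Y) in nats, brute-force O(N^2)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Max-norm distances from point i to all other points in the joint (x, y) space.
        dists = sorted(
            max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j])) for j in range(n) if j != i
        )
        eps = dists[k - 1]  # distance to the k-th nearest neighbour
        # Count points strictly closer than eps in each marginal space.
        n_x = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) < eps)
        n_y = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) < eps)
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)
    return digamma_int(k) + digamma_int(n) - total / n


# Toy data (assumption): one strongly coupled pair and one independent pair.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y_coupled = [v + 0.3 * random.gauss(0, 1) for v in x]
y_free = [random.gauss(0, 1) for _ in range(200)]
print(ksg_mutual_information(x, y_coupled), ksg_mutual_information(x, y_free))
```

In a GRN setting, a pairwise MI matrix built this way would be the input to a scoring scheme such as CLR; that step is not shown here.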
□ CoRE-ATAC: A deep learning model for the functional classification of regulatory elements from single cell and bulk ATAC-seq data
CoRE-ATAC can infer regulatory functions in diverse cell types, capture activity differences modulated by genetic mutations, and can be applied to single cell ATAC-seq data to study rare cell populations.
CoRE-ATAC integrates DNA sequence data with chromatin accessibility data using a novel ATAC-seq data encoder that is designed to be able to integrate an individual’s genotype with the chromatin accessibility maps by inferring the genotype from ATAC-seq read alignments.
□ CosNeti: ComplexOme-Structural Network Interpreter used to study spatial enrichment in metazoan ribosomes
CosNeti translates experimentally determined structures into graphs, with nodes representing proteins and edges the spatial proximity between them. CosNeti considers rProteins and ignores rRNA and other objects.
Spatial regions are defined using a random walk with restart methodology, followed by a procedure to obtain a minimum set of regions that cover all proteins in the complex.
Structural coherence is achieved by applying weights to the edges reflecting the physical proximity between purportedly contacting proteins. The weighting probabilistically guides the random-walk path trajectory.
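The weighted random walk with restart described above can be sketched in a few lines. This is an illustrative sketch, not CosNeti's implementation: the node names, edge weights, and restart probability are made-up assumptions, and the graph is a plain dictionary rather than a parsed structure file.

```python
def random_walk_with_restart(weights, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for a walk that restarts at `seed`.

    `weights` maps node -> {neighbour: edge weight}; a larger weight means a
    closer contact, so the walker follows that edge more often.
    """
    nodes = list(weights)
    p = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(max_iter):
        # Restart mass always returns to the seed node.
        nxt = {v: (restart if v == seed else 0.0) for v in nodes}
        for u in nodes:
            out = sum(weights[u].values())
            if out == 0:
                # Isolated node: send its remaining mass back to the seed.
                nxt[seed] += (1 - restart) * p[u]
                continue
            for v, w in weights[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical four-protein complex; weights stand in for physical proximity.
contacts = {
    "A": {"B": 1.0, "C": 0.2},
    "B": {"A": 1.0, "C": 0.5},
    "C": {"A": 0.2, "B": 0.5, "D": 0.1},
    "D": {"C": 0.1},
}
scores = random_walk_with_restart(contacts, seed="A")
print(sorted(scores, key=scores.get, reverse=True))
```

Thresholding the resulting probabilities around each seed is one simple way to define a spatial region; the procedure for picking a minimal covering set of regions is not shown.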
□ 2FAST2Q: A general-purpose sequence search and counting program for FASTQ files
2FAST2Q is a versatile and intuitive standalone program capable of extracting and counting feature occurrences in FASTQ files.
2FAST2Q can be used in any experimental setup that requires feature extraction from raw reads. It quickly handles mismatch alignments, nucleotide-wise Phred score filtering, custom read trimming, and sequence searching within a single program.
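As a rough illustration of what feature counting with trimming, Phred filtering, and mismatch tolerance involves, consider the sketch below. It does not reproduce 2FAST2Q's code or command-line interface; the function names, parameters, and the Phred+33 encoding are assumptions.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def count_features(fastq_lines, features, start, length, min_phred=30, max_mismatches=1):
    """Count feature occurrences in 4-line FASTQ records.

    `features` maps feature sequence -> feature name. Each read is trimmed to
    the window [start, start + length) before matching.
    """
    counts = Counter()
    records = [line.rstrip("\n") for line in fastq_lines]
    for i in range(0, len(records) - 3, 4):
        seq, qual = records[i + 1], records[i + 3]
        window = seq[start:start + length]
        window_qual = qual[start:start + length]
        if len(window) < length:
            continue  # read too short for the requested window
        # Nucleotide-wise quality filter, assuming Phred+33 encoding.
        if any(ord(q) - 33 < min_phred for q in window_qual):
            continue
        if window in features:
            counts[features[window]] += 1  # exact match
            continue
        # Otherwise accept a mismatch hit only when it is unambiguous.
        hits = [name for s, name in features.items()
                if len(s) == length and hamming(window, s) <= max_mismatches]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts
```

For example, calling `count_features(open("reads.fastq"), {"GTAC": "g1"}, start=2, length=4)` would count reads whose bases 3 to 6 match `GTAC` with at most one mismatch; the file name here is hypothetical.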
□ Integration of public DNA methylation and expression networks via eQTMs improves prediction of functional gene-gene associations
MethylationNetwork can identify experimentally validated interacting pairs of genes that could not be identified in the RNA-seq datasets.
The authors also present an integration pipeline based on kernel cross-correlation matrix decomposition. Using this pipeline, they integrated GeneNetwork and MethylationNetwork and used the integrated results to predict functional gene–gene correlations that are collected in the STRING database.
□ FineMAV: Prioritising positively selected variants in whole-genome sequencing data
Fine-Mapping of Adaptation Variation (FineMAV) is a statistical method that prioritizes functional SNP candidates under selection and depends upon population differentiation.
It is available as a stand-alone application that can perform FineMAV calculations on whole-genome sequencing data and can output bigWig files, which can be used to visualise the scores graphically in genome browsers.
□ GraphOmics: an interactive platform to explore and integrate multi-omics data
GraphOmics provides an interactive platform that integrates data to Reactome pathways emphasising interactivity and biological contexts. This avoids the presentation of the integrated omics data as a large network graph or as numerous static tables.
GraphOmics offers a way to perform pathway analysis separately on each omics, and integrate the results at the end. The separate pathway analysis results run on different omics datasets can be combined with an AND operator in the Query Builder.
□ anndata: Annotated data
AnnData makes a particular choice for data organization that has been left unaddressed by packages like scikit-learn or PyTorch, which model input and output of model transformations as unstructured sets of tensors.
The AnnData object is a collection of arrays aligned to the common dimensions of observations (obs) and variables (var).
Low-dimensional manifold structure within a desired reduced representation is stored as a k-nearest neighbor graph in the form of a sparse adjacency matrix: a matrix of pairwise relationships between observations.
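The idea of a k-nearest neighbor graph held as a sparse adjacency structure can be sketched without any dependencies. This is not the AnnData API: the function name and the dictionary-of-pairs representation are assumptions chosen for illustration, whereas AnnData itself keeps such pairwise matrices in its `obsp` slot as sparse matrices.

```python
import math


def knn_adjacency(points, k):
    """Return a sparse kNN graph as {(i, j): distance}.

    Only the k nearest neighbours of each observation i get an entry, so the
    structure holds n * k values instead of a dense n x n matrix.
    """
    adjacency = {}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            adjacency[(i, j)] = d
    return adjacency


# Four observations in a hypothetical 2-D reduced representation.
embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(knn_adjacency(embedding, k=1))
```

Keeping only the k nearest entries per observation is what makes the matrix sparse; the row and column indices both refer to observations, so the graph stays aligned with the `obs` dimension.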
□ Class similarity network for coding and long non-coding RNA classification
Class Similarity Network considers more relationships among input samples in a direct way. It focuses on exploring the potential relationships between input samples and samples from both the same class and the different classes.
Class Similarity Network trains the parameters specific to each class to obtain the high-level features. The Fully Connected module learns parameters from different dense branches to integrate similarity information. The Decision module concatenates the nodes to make the prediction.
□ FCLQC: fast and concurrent lossless quality scores compressor
FCLQC achieves a comparable compression rate while being much faster than the baseline algorithms. FCLQC uses concurrent programming to achieve fast compression and decompression.
Concurrent programming runs tasks independently, though not necessarily simultaneously, which is different from error-prone parallel computing. FCLQC shows at least a 31x improvement in compression speed, at the cost of up to a 13.58% degradation in compression ratio.
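The general pattern of compressing independent chunks of quality scores concurrently, and restoring them losslessly, might look like the sketch below. FCLQC's own compressor and file format are not reproduced here; `zlib` is a stand-in, and the function names, chunk size, and worker count are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_quality_scores(quality_lines, chunk_size=1000, workers=4):
    """Split quality lines into chunks and compress each chunk independently."""
    chunks = [
        "\n".join(quality_lines[i:i + chunk_size]).encode("ascii")
        for i in range(0, len(quality_lines), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so chunks can be reassembled later.
        return list(pool.map(zlib.compress, chunks))


def decompress_quality_scores(blobs, workers=4):
    """Decompress chunks concurrently and restore the original lines."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(zlib.decompress, blobs))
    lines = []
    for part in parts:
        lines.extend(part.decode("ascii").split("\n"))
    return lines
```

Because each chunk is self-contained, no shared state has to be coordinated between workers, which is what makes this kind of design easy to run concurrently.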
□ ADClust: A Parameter-free Deep Embedded Clustering Method for Single-cell RNA-seq Data
ADClust first obtains a low-dimensional representation through a pre-trained autoencoder, and uses the representations to cluster cells into initial micro-clusters.
The micro-clusters are then compared with one another using a statistical test for unimodality, the Dip-test, to detect similar micro-clusters. Similar micro-clusters are merged by jointly optimizing the carefully designed clustering and autoencoder loss functions.
□ fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language
The fastMSA framework, consisting of query sequence encoder and context sequences encoder, can improve the scalability and speed of multiple sequence alignment significantly.
fastMSA uses the query sequences to search UniRef90 with JackHMMER v3.3 and takes the resulting MSAs as ground truth. By filtering out unrelated sequences in the low-dimensional space before performing MSA, fastMSA can accelerate the process 35-fold.
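The pre-filtering step can be pictured as a nearest-neighbour search over embeddings. The sketch below assumes that query and database sequences have already been mapped to vectors by some encoder; the vectors, names, and the use of cosine similarity are illustrative assumptions, not fastMSA's actual models or index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_candidates(query_vec, database, top_k):
    """Return the names of the top_k database entries most similar to the query.

    `database` maps sequence name -> embedding vector. Only the returned
    candidates would be passed on to the (expensive) alignment step.
    """
    ranked = sorted(
        database.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical 2-D embeddings; real encoders produce much larger vectors.
db = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.05], "d": [-1.0, 0.0]}
print(retrieve_candidates([1.0, 0.0], db, top_k=2))
```

The speed-up comes from running the aligner on the short candidate list instead of the whole database.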
□ XAE4Exp: Explainable autoencoder-based representation learning for gene expression data
XAE4Exp (eXplainable AutoEncoder for Expression data) integrates AE and SHapley Additive exPlanations (SHAP), a flagship technique in the field of eXplainable AI (XAI).
XAE4Exp quantitatively evaluates the contribution of each gene to the hidden structure learned by an AE, substantially improving the explainability of AE outcomes.
□ DeepLOF: A deep learning framework for predicting human essential genes from population and functional genomic data
DeepLOF is an evolution-based deep learning model for predicting human genes intolerant to LOF mutations. DeepLOF can integrate genomic features and population genomic data to predict LOF-intolerant genes without human-labeled training data.
DeepLOF combines the neural network-based beta prior distribution with the population genetics-based likelihood function to obtain a posterior distribution of η, which represents their belief about LOF intolerance after integrating genomic features and population genomic data.
□ CSNet: Estimating cell-type-specific gene co-expression networks from bulk gene expression data
For finite sample cases, it may be desirable to ensure the positive definiteness of the final estimator. One strategy is to solve a constrained optimization problem to find the nearest correlation matrix in Frobenius norm.
CSNet is a sparse estimator with a SCAD penalty. The authors derive the non-asymptotic convergence rate of CSNet in spectral norm and establish variable selection consistency, ensuring that the edges in the cell-type-specific networks can be correctly identified with probability tending to 1.
□ NanoGeneNet: Using Deep Learning for Gene Detection and Classification in Raw Nanopore Signals
NanoGeneNet is a neural network-based method capable of detecting and classifying specific genomic regions directly in raw nanopore signals (squiggles).
The basecalling process can therefore be omitted entirely, as the raw signals of significant genes or intergenic regions can be analysed directly. Alternatively, if the nucleotide sequences are required, the identified squiggles can be basecalled in preference to others.
□ binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets
binny is a binning tool that produces high-quality metagenome-assembled genomes from both contiguous and highly fragmented genomes.
binny uses k-mer-composition and coverage by metagenomic reads for iterative, non-linear dimension reduction of genomic signatures as well as subsequent automated contig clustering with cluster assessment using lineage-specific marker gene sets.
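A k-mer composition signature for a contig is simply a vector of relative k-mer frequencies. The sketch below illustrates that first step only; it is not binny's code, and the function name, the default k, and the decision to skip k-mers containing ambiguous bases are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=4):
    """Return relative frequencies of all 4**k k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers made of A/C/G/T contribute; windows with N are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy contig; real contigs are thousands of bases long.
print(kmer_profile("ACGTACGT", k=2))
```

Vectors like these, combined with per-contig read coverage, form the feature matrix that a dimension reduction and clustering step would then operate on.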
□ Baltica: integrated splice junction usage analysis
Baltica is a framework that provides workflows for quality control, de novo transcriptome assembly with StringTie2, and currently four differential junction usage (DJU) methods: rMATS, JunctionSeq, Majiq, and LeafCutter.
Baltica is demonstrated on two datasets: the first uses Spike-in RNA Variant Control Mixes (SIRVs), and the second consists of paired Illumina and Oxford Nanopore Technologies data. Integrating the results in Baltica makes it possible to compare the performance of different DJU methods and to test the usability of a meta-classifier.
□ bulkAnalyseR: An accessible, interactive pipeline for analysing and sharing bulk sequencing results
Critically, neither VIPER nor BioJupies offers support for more complex differential expression (DE) tasks beyond simple pair-wise comparisons. This limits the biological interpretations from more complex experimental designs.
bulkAnalyseR provides an accessible, yet flexible framework for the analysis of bulk sequencing data without relying on prior programming expertise. The users can create a shareable shiny app in two lines of code, from an expression matrix and a metadata table.
□ ePat: extended PROVEAN annotation tool
ePat extends the conventional PROVEAN in two ways, covering pathogenicity calculations that the conventional PROVEAN could not perform.
ePat is able to calculate the pathogenicity of variants near the splice junction, frameshift, stop gain, and start lost. In addition, batch processing is used to calculate the pathogenicity of all variants in a VCF file in a single step.
□ A guide to trajectory inference and RNA velocity
Whereas traditional trajectory inference methods reconstruct cellular dynamics given a population of cells of varying maturity, RNA velocity relies on a dynamical model describing splicing dynamics.
However, pseudotime is based solely on transcriptional information, so it cannot be interpreted as an estimator of the true time since initial differentiation.
Rather, it is a high-resolution estimate of cell state, which is likely to be monotonically related to the true chronological time, but there is no guarantee that equivalent changes in transcriptional profiles follow a similar chronological time.
□ GeneTonic: an R/Bioconductor package for streamlining the interpretation of RNA-seq data
GeneTonic serves as a comprehensive toolkit for streamlining the interpretation of functional enrichment analyses, by fully leveraging the information of expression values in a differential expression context.
GeneTonic is not structured as an end-to-end workflow including quantification, preprocessing, exploratory data analysis, and DE modeling—all operations that are also time consuming, but in many scenarios need to be carried out only once.
□ The impact of low input DNA on the reliability of DNA methylation as measured by the Illumina Infinium MethylationEPIC BeadChip
This study demonstrates that although as little as 40 ng is sufficient to produce Illumina Infinium MethylationEPIC BeadChip DNAm data that passes standard QC checks, data quality and reliability diminish as DNA input decreases.
They recommend caution and use of sensitivity analyses when working with less than 200 ng DNA on the Illumina Infinium MethylationEPIC BeadChip.
□ AMC: accurate mutation clustering from single-cell DNA sequencing data
AMC first employs principal component analysis followed by K-means clustering to find mutation clusters, then infers the maximum likelihood estimates of the genotypes of each cluster.
The inferred genotypes can subsequently be used to reconstruct the phylogenetic tree with high efficiency. AMC uses BIC to jointly determine the best number of mutation clusters and the corresponding genotypes.
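Model selection by BIC amounts to trading goodness of fit against the number of free parameters. A minimal sketch follows, assuming the log-likelihood and parameter count for each candidate number of clusters have already been computed elsewhere; the function names and the numbers in the example are hypothetical, not AMC output.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_by_bic(candidates, n_obs):
    """Pick the candidate with the lowest BIC.

    `candidates` maps number of clusters -> (log_likelihood, n_params).
    """
    def score(n_clusters):
        log_likelihood, n_params = candidates[n_clusters]
        return bic(log_likelihood, n_params, n_obs)

    return min(candidates, key=score)


# Hypothetical fits for 2, 3 and 4 mutation clusters on 100 observations.
fits = {2: (-500.0, 20), 3: (-450.0, 30), 4: (-448.0, 40)}
print(select_by_bic(fits, n_obs=100))
```

In this toy example the jump from two to three clusters improves the likelihood enough to justify the extra parameters, while the fourth cluster does not.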
□ LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis
LotuS2 uses only truncated, high-quality reads for sequence clustering (except ITS amplicons), while the read backmapping and seed extension steps restore some of the discarded sequence data.
LotuS2 often reported the fewest ASVs/OTUs, while including more sequence reads in abundance tables. This indicates that LotuS2 has a more efficient usage of input data while covering a larger sequence space per ASV/OTU.
□ EdClust: A heuristic sequence clustering method with higher sensitivity
Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from overestimation of inferred clusters and low clustering sensitivity.
The new method edClust was tested on three large-scale sequence databases and compared to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH.
□ cDNA-detector: detection and removal of cDNA contamination in DNA sequencing libraries
cDNA-detector provides the option to remove contaminant reads from the alignment to reduce the risk of spurious coverage peaks and variant calls in downstream analysis.
When using cDNA-detector on genomic sequence data, they recommend suppressing the “retrocopy” output, such that only potential vector cDNA candidates are reported. With this strategy, contaminants can be removed from alignments, revealing true signal previously obscured.
□ Artificial intelligence “sees” split electrons
Chemical bonds between atoms are stabilized by the exchange-correlation (xc) energy, a quantum-mechanical effect in which “social distancing” by electrons lowers their electrostatic repulsion energy.
Kohn-Sham density functional theory (DFT) states that the electron density determines this xc energy, but the density functional must be approximated.
Two exact constraints—the ensemble-based piecewise linear variation of the total energy with respect to fractional electron number and fractional electron z-component of spin — require hard-to-control nonlocality.
□ RAxML Grove: An empirical Phylogenetic Tree Database
When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shapes.
RAxML Grove currently comprises more than 60,000 inferred trees and respective model parameter estimates from fully anonymized empirical data sets that were analyzed using RAxML and RAxML-NG on two web servers.
□ ifCNV: a novel isolation-forest-based package to detect copy number variations from NGS datasets
About 1500 CNV regions have already been discovered in the human population, accounting for ~12–16% of the entire human genome and making CNVs one of the most common types of genetic variation, although the biological impact of the majority of these CNVs remains uncertain.
ifCNV is a CNV detection tool based on read-depth distribution. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples.
□ DICAST: Alternative splicing analysis benchmark
DICAST offers a modular and extensible framework for the analysis of AS integrating 11 splice-aware mapping and eight event detection tools. DICAST allows researchers to employ a consensus approach to consider the most successful tools jointly for robust event detection.
While DICAST introduces a unifying standard for AS event reporting, AS event detection tools utilize inherently different approaches and lead to inconsistent results.
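A consensus approach of this kind can be as simple as keeping the events reported by at least a minimum number of tools. The sketch below assumes each tool's output has already been converted to a common event representation; the tool names, the tuple layout, and the threshold are illustrative assumptions, not DICAST's unified format.

```python
from collections import Counter


def consensus_events(tool_calls, min_tools=2):
    """Return events reported by at least `min_tools` different tools.

    `tool_calls` maps tool name -> iterable of events, where an event is any
    hashable description such as (chromosome, start, end, event type).
    """
    votes = Counter()
    for events in tool_calls.values():
        votes.update(set(events))  # each tool votes at most once per event
    return {event for event, n in votes.items() if n >= min_tools}


# Hypothetical calls from three tools.
calls = {
    "tool_a": [("chr1", 100, 200, "ES"), ("chr2", 50, 80, "IR")],
    "tool_b": [("chr1", 100, 200, "ES")],
    "tool_c": [("chr1", 100, 200, "ES"), ("chr2", 50, 80, "IR"), ("chr3", 5, 9, "A3")],
}
print(consensus_events(calls, min_tools=2))
```

Exact coordinate matching is the strictest possible rule; since the tools use different approaches, a practical consensus would likely need some tolerance when comparing event boundaries.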
□ scNAME: Neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data
scNAME incorporates a mask estimation task for gene pertinence mining and a neighborhood contrastive learning framework for cell intrinsic structure exploitation.
It uses a neighborhood contrastive paradigm with an offline memory bank, global in scope, which can inspire discriminative feature representation and achieve intra-cluster compactness together with inter-cluster separation.