□ LANTERN: Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive power
LANTERN learns interpretable models of GPLs by finding a latent, low-dimensional space where mutational effects combine additively. LANTERN then captures the non-linear effects of epistasis through a multi-dimensional, non-parametric Gaussian-process model.
□ OptICA: Optimal dimensionality selection for independent component analysis of transcriptomic data
OptICA is a novel method for finding the optimal dimensionality, one that consistently maximizes the number of biologically relevant components revealed while minimizing the potential for over-decomposition.
Validated against known transcriptional regulatory networks, OptICA outperformed previously published algorithms for identifying the optimal dimensionality. OptICA is organism-invariant.
□ Theory of local k-mer selection with applications to long-read alignment
This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers.
Colinear sets of k-mer matches are collected into chains, and then dynamic-programming-based alignment is performed to fill gaps between chains. The modification swaps out the k-mer selection method, originally random minimizers, for open syncmers.
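For readers unfamiliar with open syncmers, here is a minimal sketch of the selection rule: a k-mer is kept when its smallest s-mer starts at a fixed offset t. The function names, the parameter names k, s and t, and the use of lexicographic ordering are assumptions for illustration; practical implementations typically order s-mers by a hash instead.

```python
def is_open_syncmer(kmer, s, t=0):
    """Return True if the smallest s-mer of `kmer` starts at offset t.

    Assumption: s-mers are compared lexicographically, and ties are
    broken in favour of the leftmost occurrence.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers)) == t


def open_syncmers(seq, k, s, t=0):
    """List (position, k-mer) pairs for every open syncmer in `seq`."""
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if is_open_syncmer(seq[i:i + k], s, t)
    ]


if __name__ == "__main__":
    # Toy sequence; parameters chosen only to keep the example small.
    for pos, kmer in open_syncmers("ACGTAC", k=4, s=2):
        print(pos, kmer)
```

Unlike a minimizer, whether a k-mer is selected here depends only on the k-mer itself, not on its neighbours in a window.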
□ GENIES: A new method to study genome mutations using the information entropy
GENIES (GENetic Entropy Information Spectrum) is a fully functional code with an easy-to-use graphical interface that allows maximum versatility in choosing computational parameters such as SS, WS and m-block size.
□ Super-cells untangle large and complex single-cell transcriptome networks
A network-based coarse-graining framework where highly similar cells are merged into super-cells. Super-cells not only preserve but often improve the results of downstream analyses including clustering, DE, cell type annotation, gene correlation, RNA velocity and data integration.
A super-cell gene expression matrix is computed by averaging gene expression within super-cells. Using the walktrap algorithm, it enables users to explore different graining levels without having to recompute the super-cells for each choice of 𝛾.
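The averaging step can be sketched as follows. This is an illustration only, not the package's own code: the function name, the dict-of-lists input layout and the membership vector are assumptions, and the walktrap step that would produce the membership is not shown.

```python
from collections import defaultdict


def supercell_expression(expr, membership):
    """Average per-cell expression within each super-cell.

    expr:       dict mapping gene name -> list of values, one per cell
    membership: list giving the super-cell id of each cell (hypothetical
                output of a prior graph-clustering step)
    Returns (sorted super-cell ids, dict gene -> list of averages).
    """
    cells_in = defaultdict(list)
    for cell_idx, sc_id in enumerate(membership):
        cells_in[sc_id].append(cell_idx)

    ids = sorted(cells_in)
    averaged = {
        gene: [
            sum(values[i] for i in cells_in[sc]) / len(cells_in[sc])
            for sc in ids
        ]
        for gene, values in expr.items()
    }
    return ids, averaged


if __name__ == "__main__":
    expr = {"g1": [1, 3, 10, 20], "g2": [0, 0, 4, 8]}
    print(supercell_expression(expr, [0, 0, 1, 1]))
```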
□ Heng Li
Minimap2 v2.19 released with better and more contiguous alignment over long INDELs and in highly repetitive regions, improvements backported from unimap. These represent the most significant algorithmic change since v2.1. Use with caution.
□ Adam Phillippy RT
"Segmental duplications and their variation in a complete human genome" led by @mrvollger identifies double the number of previously known near-identical SD alignments, revealing massive evolutionary differences in SD organization between humans and apes.
□ Vcflib and tools for processing the VCF variant call format
The vcflib toolkit contains both a library and a collection of executable programs for transforming VCF files, consisting of over 30,000 lines of source code written in C++. vcflib also comes with a toolkit for population genetics: the Genotype Phenotype Association Toolkit (GPAT).
□ Tracking cell lineages to improve research reproducibility go.nature.com/3oDxZ2k
□ Sophie Zaaijer
Cell lineage tracking is important, and is actually pretty easy given the right tools.
Academics please check out our (FREE!) tool called "FIND Cell": you can digitize, organize, and verify your cell line info.
□ HCMB: A stable and efficient algorithm for processing the normalization of highly sparse Hi-C contact data
Hi-C Matrix Balancing (HCMB) is built on an iterative solution of equations, combined with a linear search and projection strategy, to normalize the original Hi-C interaction data.
HCMB can be seen as a variant of the Levenberg-Marquardt-type method, one salient characteristic of which is that the coefficient matrix of the linear equations becomes dense during the iterative process. The HCMB algorithm shows more robust practical behavior on highly sparse matrices.
□ G2S3: A gene graph-based imputation method for single-cell RNA sequencing data
G2S3 imputes dropouts by borrowing information from adjacent genes in a sparse gene graph learned from gene expression profiles across cells.
G2S3 has superior overall performance in recovering gene expression, identifying cell subtypes, reconstructing cell trajectories, identifying differentially expressed genes, and recovering gene regulatory and correlation relationships.
G2S3 optimizes the gene graph structure using graph signal processing that captures nonlinear correlations among genes.
The computational complexity of the G2S3 algorithm is polynomial in the total number of genes in the graph, so it is computationally efficient, especially for large scRNA-seq datasets with hundreds of thousands of cells.
□ MultiTrans: an algorithm for path extraction through mixed integer linear programming for transcriptome assembly
The authors formulate the transcriptome assembly problem as path extraction on splicing graphs (or assembly graphs), and propose a novel algorithm, MultiTrans, for path extraction using mixed integer linear programming.
MultiTrans is able to take into consideration coverage constraints on vertices and edges, the number of paths and the paired-end information simultaneously. MultiTrans generates more accurate transcripts compared to TransLiG and rnaSPAdes.
□ Automated Generation of Novel Fragments Using Screening Data, a Dual SMILES Autoencoder, Transfer Learning and Syntax Correction
The dual model produced valid SMILES with improved features, considering a range of properties including aromatic ring counts, heavy atom count, synthetic accessibility, and a new fragment complexity score we term Feature Complexity.
□ SRC: Accelerating RepeatClassifier Based on Spark and Greedy Algorithm with Dynamic Upper Boundary
Spark-based RepeatClassifier (SRC) uses the Greedy Algorithm with Dynamic Upper Boundary (GDUB) for data division and load balancing, and Spark to improve the parallelism of RepeatClassifier.
SRC not only maintains the same level of accuracy as RepeatClassifier, but also runs 42-88 times faster. A modular interface is also provided to facilitate subsequent upgrades and optimization.
□ BaySiCle: A Bayesian Inference joint kNN method for imputation of single-cell RNA-sequencing data making use of local effect
BaySiCle allows robust imputation of missing values generating realistic transcript distributions that match single molecule fluorescence in situ hybridization measurements.
By using priors obtained from the dataset structure, not just within the experimental set-up batch but also within the same group of cells, BaySiCle improves the accuracy of imputation, bringing it that much closer to its similar alternatives.
□ nf-LO: A scalable, containerised workflow for genome-to-genome lift over
nf-LO (nextflow-LiftOver), a containerised and scalable Nextflow pipeline that enables liftovers within and between any species for which assemblies are available. nf-LO is a workflow to facilitate the generation of genome alignment chain files compatible with the LiftOver utility.
nf-LO can directly pull genomes from public repositories, supports parallelised alignment using a range of alignment tools and can be finely tuned to achieve the desired sensitivity, processing speed and repeatability of analyses.
□ Pseudo-supervised Deep Subspace Clustering
Self-reconstruction loss of an AE ignores rich useful relation information and might lead to indiscriminative representation, which inevitably degrades the clustering performance. It is also challenging to learn high-level similarity without feeding semantic labels.
Pairwise similarity is used to weigh the reconstruction loss to capture local structure information, while a similarity is learned by the self-expression layer.
Pseudo-graphs and pseudo-labels, which allow benefiting from uncertain knowledge acquired during network training, are further employed to supervise similarity learning. Joint learning and iterative training facilitate obtaining an overall optimal solution.
□ Samplot: a platform for structural variant visual validation and automated filtering
Samplot provides a quick platform for rapidly identifying false positives and enhancing the analysis of true-positive SV calls. Samplot images are a concise SV visualization that highlights the most relevant evidence in the variable region and hides less informative reads.
Samplot-ML is a resnet-like model that takes Samplot images of putative deletion SVs as input and predicts a genotype. This model will remove false positives from the output set of an SV caller or genotyper.
□ RMAPPER: Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph
There the term bi-label refers to two k-mers separated by a specified genomic distance. The redefinition of the de Bruijn graph with this extra information was shown to de-tangle the resulting graph, making traversal more efficient and accurate.
An equivalent paradigm can be effective for Rmap assembly. RMAPPER was more than 130 times faster and used less than five times less memory than Solve, and was more than 2,000 times faster than the method of Valouev et al.
RMAPPER successfully assembled the 3.1 million Rmaps of the climbing perch genome into contigs that covered over 95% of the draft genome with zero mis-assemblies.
□ diffBUM-HMM: a robust statistical modeling approach for detecting RNA flexibility changes in high-throughput structure probing data
diffBUM-HMM is widely compatible, accounting for sampling variation and sequence coverage biases, and displays higher sensitivity than existing methods while remaining robust against false positives.
diffBUM-HMM detects more differentially reactive nucleotides (DRNs) in the Xist lncRNA that are preferentially single-stranded A’s and U’s. diffBUM-HMM outperforms deltaSHAPE and dStruct in sensitivity and/or specificity.
□ contrastive-sc: Contrastive self-supervised clustering of scRNA-seq data
contrastive-sc maintains good performance when only a fraction of input cells is provided and is robust to changes in hyperparameters or network architecture.
contrastive-sc computes by default a cell partitioning with KMeans or Leiden. This phenomenon can be explained by the documented tendency of KMeans to identify equal-sized clusters, combined with the significant class imbalance associated with the datasets having more than 8 clusters.
□ baredSC: Bayesian Approach to Retrieve Expression Distribution of Single-Cell
baredSC is a Bayesian approach to disentangle the intrinsic variability in gene expression from the sampling noise. baredSC approximates the expression distribution of a gene by a Gaussian mixture model.
The authors also use real biological data sets to illustrate the power of baredSC to assess the correlation between genes or to reveal the multi-modality of a lowly expressed gene. baredSC reveals the trimodal distribution.
□ GenomicSuperSignature: interpretation of RNA-seq experiments through robust, efficient comparison to public databases
GenomicSuperSignature matches PCA axes in a new dataset to an annotated index of replicable axes of variation (RAV) that are represented in previously published independent datasets.
GenomicSuperSignature also can be used as a tool for transfer learning, utilizing RAVs as well-defined and replicable latent variables defined by multiple previous studies in place of de novo latent variables.
□ Nature Genetics
Long-read sequencing at the population scale presents specific challenges but is becoming increasingly accessible. The authors discuss the major platforms and analytical tools, considerations in project design and challenges in scaling long-read sequencing to populations.
□ Dysgu: efficient structural variant calling using short or long reads
Dysgu detects signals from alignment gaps, discordant and supplementary mappings, and generates consensus contigs, before classifying events using machine learning.
Dysgu employs a fast consensus sequence algorithm, inspired by the positional de Bruijn graph, followed by remapping of anomalous sequences to discover additional small SVs.
□ GeneGrouper: Density-based binning of gene clusters to infer function or evolutionary history
GeneGrouper identified a novel, frequently occurring pduN pseudogene. When replicated in vivo, disruption of pduN with a frameshift mutation negatively impacted microcompartment formation.
Sequences are clustered using mmseqs2 linclust to generate a set of proximate orthology relationships, producing a set of representative amino acid sequences in FASTA format. The E-values from the filtered hits table are used as input for Markov Graph Clustering with MCL.
□ A phylogenetic approach for weighting genetic sequences
The authors formalise the principle by rigorously defining the evolutionary ‘novelty’ of a sequence within an alignment. This results in new sequence weights called ‘phylogenetic novelty scores’.
These phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos.
□ PRESCIENT: Generative modeling of single-cell time series with PRESCIENT enables prediction of cell trajectories with interventions
PRESCIENT (Potential eneRgy undErlying Single Cell gradIENTs) builds upon a diffusion-based model by enabling the model to operate on large numbers of cells over many timepoints with high-dimensional features, and by incorporating cellular growth estimates.
PRESCIENT can generate held-out timepoints and predict cell fate bias, i.e. the probability a cell enters a particular fate given its initial state. PRESCIENT’s objective can be modified to maximize the likelihood of observing individual trajectories given lineage tracing data.
□ MetaVelvet-DL: a MetaVelvet deep learning extension for de novo metagenome assembly
MetaVelvet-DL builds an end-to-end architecture using Convolutional Neural Network and Long Short-Term Memory units. MetaVelvet-DL can more accurately predict how to partition a de Bruijn graph than the Support Vector Machine-based model in MetaVelvet-SL.
□ CaFew: Boosting scRNA-seq data clustering by cluster-aware feature weighting
By resolving the optimization problem of clustering, a weight matrix indicating the importance of features in different clusters is derived. CaFew filters out genes with small weight in all clusters or a small weight variation across all clusters.
With CaFew, the clustering performance of distance-based methods like k-means and SC3 can be considerably improved, but its effectiveness is not so obvious on the other types of methods like Seurat.
□ MiMiC: a bioinformatic approach for generation of synthetic communities from metagenomes
MiMiC is a computational approach for data-driven design of simplified communities from shotgun metagenomes.
MiMiC predicts the composition of minimal consortia using an iterative scoring system based on maximal match-to-mismatch ratios between this database and the Pfam binary vector of any input metagenome.
□ TIGA: Target illumination GWAS analytics
Rational ranking, filtering and interpretation of inferred gene–trait associations and data aggregation across studies by leveraging existing curation and harmonization efforts.
TIGA is a method for assessing confidence in gene–trait associations from evidence aggregated across studies, including a bibliometric assessment of scientific consensus based on the iCite Relative Citation Ratio, and meanRank scores to aggregate multivariate evidence.
□ Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms
The haploidy score is based on the identification of two peaks in the per-base coverage depth distribution: a high-coverage peak that corresponds to bases in collapsed haplotypes, and a peak at about half-coverage of the latter that corresponds to bases in uncollapsed haplotypes.
The haploidy score represents the fraction of collapsed bases in the assembly, and is equal to C/(C+U/2), i.e. the ratio of the area of the collapsed peak (C) divided by the sum of the area of the collapsed peak (C) and half of the area of the uncollapsed peak (U/2).
This metric reaches its maximum of 1.0 when there is no uncollapsed peak, in a perfectly collapsed assembly, whereas it returns 0.0 when the assembly is not collapsed at all.
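The formula above is simple enough to check by hand, but a small sketch makes the two boundary cases explicit. The function and argument names are assumptions for illustration; identifying the two peaks and measuring their areas from the coverage distribution is not shown.

```python
def haploidy_score(collapsed_area, uncollapsed_area):
    """Compute C / (C + U/2) from the two peak areas.

    collapsed_area   (C): area of the high-coverage (collapsed) peak
    uncollapsed_area (U): area of the half-coverage (uncollapsed) peak
    """
    denominator = collapsed_area + uncollapsed_area / 2
    if denominator == 0:
        raise ValueError("both peak areas are zero; score is undefined")
    return collapsed_area / denominator


if __name__ == "__main__":
    print(haploidy_score(100, 0))   # perfectly collapsed assembly
    print(haploidy_score(0, 100))   # assembly not collapsed at all
    print(haploidy_score(50, 100))  # intermediate case
```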
□ BUTTERFLY: addressing the pooled amplification paradox with unique molecular identifiers in single-cell RNA-seq
The naïve removal of duplicates can lead to a bias due to a “pooled amplification paradox.” BUTTERFLY utilizes estimation of unseen species to address the bias caused by incomplete sampling of differentially amplified molecules.
BUTTERFLY uses a zero truncated negative binomial estimator implemented in the kallisto bustools workflow.
Although BUTTERFLY correction can be used to scale the expression of each gene to resemble the expression that more reads would yield, this does not necessarily imply that the corrected expression values are closer to ground truth.
□ NanoSpring: reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach
NanoSpring uses an approximate assembly approach partly inspired by existing assembly algorithms but adapted for significantly better performance, especially for the recent higher quality datasets. NanoSpring achieves close to 3x improvement in compression as compared to ENANO.
NanoSpring uses MinHash to index the reads and find overlapping reads during contig generation. NanoSpring uses the minimap2 aligner to align candidate reads to the consensus sequence and add them to the graph during contig generation.
□ EPIC: Inferring relevant tissues and cell types for complex traits in genome-wide association studies
EPIC (cEll tyPe enrIChment) is a statistical framework that relates large-scale GWAS summary statistics to cell-type-specific omics measurements from single-cell sequencing.
EPIC is the first method that prioritizes tissues and/or cell types for both common and rare variants with a rigorous statistical framework to account for both within- and between-gene correlations.
□ ASURAT: Functional annotation-driven unsupervised clustering of single-cell transcriptomes
ASURAT simultaneously performs unsupervised cell clustering and biological interpretation in a semi-automatic manner, in terms of cell type and various biological functions.
ASURAT creates a functional spectrum matrix, termed a sign-by-sample matrix (SSM). By analyzing SSMs, users can cluster samples to aid their interpretation.
□ eQTLsingle: Discovering single-cell eQTLs from scRNA-seq data only
eQTLsingle discovers eQTLs only with scRNA-seq data, without genomic data. It detects mutations from scRNA-seq data and models gene expression of different genotypes with the ZINB model to find associations between genotypes and phenotypes at single-cell level.
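As background on the ZINB (zero-inflated negative binomial) model mentioned above, here is a minimal sketch of its probability mass function. The parameterisation (zero-inflation probability `pi`, size `r`, success probability `p`) and the function names are assumptions chosen for illustration; the tool itself may parameterise and fit the model differently.

```python
import math


def nb_pmf(k, r, p):
    """Negative binomial pmf: Gamma(k+r) / (k! Gamma(r)) * p^r * (1-p)^k."""
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def zinb_pmf(k, pi, r, p):
    """Zero-inflated NB: with probability `pi` the count is a structural
    zero, otherwise it is drawn from NB(r, p)."""
    base = (1 - pi) * nb_pmf(k, r, p)
    return pi + base if k == 0 else base


if __name__ == "__main__":
    # Probability of observing zero and one counts under toy parameters.
    print(zinb_pmf(0, pi=0.5, r=1.0, p=0.5))
    print(zinb_pmf(1, pi=0.5, r=1.0, p=0.5))
```

The extra mass at zero is what lets the model absorb dropout events that a plain negative binomial would underestimate.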
□ EIR: Deep integrative models for large-scale human genomics
EIR is a deep learning framework for PRS prediction that includes a model, genome-local-net (GLN), specifically designed for large-scale genomics data. The framework supports multi-task (MT) learning, automatic integration of clinical and biochemical data and model explainability.
□ PuffAligner: A Fast, Efficient, and Accurate Aligner Based on the Pufferfish Index
PuffAligner begins read alignment by collecting unique maximal exact matches, querying k-mers from the read in the Pufferfish index.
The aligner then chains together the collected uni-MEMs using a dynamic programming approach, choosing the chains with the highest coverage as potential alignment positions for the reads.
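The chaining step can be illustrated with a generic colinear-chaining sketch. This is not PuffAligner's actual implementation: the anchor tuple layout `(read_pos, ref_pos, length)`, the no-overlap rule and the quadratic dynamic program are all assumptions made to keep the example short.

```python
def best_chain(anchors):
    """Find the colinear chain of exact matches covering the most bases.

    anchors: list of (read_pos, ref_pos, length) tuples with length > 0.
    Two anchors may be chained only if the second starts at or after the
    end of the first on both the read and the reference.
    Returns (covered_bases, chain as a list of anchors).
    """
    if not anchors:
        return 0, []

    anchors = sorted(anchors)            # order by read position
    score = [a[2] for a in anchors]      # best coverage ending at each anchor
    prev = [-1] * len(anchors)           # back-pointers for traceback

    for j, (rj, gj, lj) in enumerate(anchors):
        for i in range(j):
            ri, gi, li = anchors[i]
            if ri + li <= rj and gi + li <= gj and score[i] + lj > score[j]:
                score[j] = score[i] + lj
                prev[j] = i

    # Trace back from the highest-scoring anchor.
    end = max(range(len(anchors)), key=score.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return sum(a[2] for a in chain), chain[::-1]


if __name__ == "__main__":
    toy = [(0, 100, 10), (12, 112, 8), (5, 500, 20), (25, 125, 5)]
    print(best_chain(toy))
```

In the toy input, the 20-base match at reference position 500 is the longest single anchor, but the three colinear matches near position 100 together cover more of the read, so they form the winning chain.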