lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

METANOIA.

2022-12-13 23:13:31 | Science News





□ BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05051-9

BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax) models the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation introduces topological features of language, so the model is no longer concerned only with the distance between words in the sequence.

First, BioByGANS uses periods to segment sentences, and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, while syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy.

A graph attention network is then used to generate a fused representation that considers both the contextual and the syntactic features. Last, a softmax function is used to calculate the probabilities.





□ CARNAGE: Investigating graph neural network for RNA structural embedding

>> https://www.biorxiv.org/content/10.1101/2022.12.02.515916v1

CARNAGE (Clustering/Alignment of RNA with Graph-network Embedding) leverages a graph neural network encoder to imprint structural information into a sequence-like embedding, so that downstream sequence analyses implicitly account for structural constraints.

CARNAGE creates a graph G = (V, E, U), where the nodes V are unit vectors encoding the nucleotide identity. For each node/nucleotide, two rounds of message passing aggregate information. All the node vectors are concatenated to form the Si-seq.





□ bmVAE: a variational autoencoder method for clustering single-cell mutation data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac790/6881080

bmVAE infers the low-dimensional representation of each cell by minimizing the Kullback-Leibler divergence loss and reconstruction loss (measured using cross-entropy). bmVAE takes single-cell binary mutation data as inputs, and outputs inferred cell subpopulations as well as their genotypes.

bmVAE employs a VAE model to learn a latent representation of each cell in a low-dimensional space, then uses a Gaussian mixture model (GMM) to find clusters of cells, and finally uses a Gibbs sampling-based approach to estimate the genotypes of each subpopulation in the latent space.
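
The two loss terms mentioned above can be written out for a single cell. This is a minimal sketch, not bmVAE's code: it assumes the usual VAE setup (a diagonal Gaussian posterior and a standard normal prior), and the function name and toy values are made up for illustration.

```python
import math

def bernoulli_vae_loss(x, x_hat, mu, log_var, eps=1e-12):
    """Cross-entropy reconstruction loss plus KL(q(z|x) || N(0, I)) for one cell."""
    # Reconstruction: binary cross-entropy between observed and decoded mutation states.
    recon = -sum(
        xi * math.log(max(pi, eps)) + (1 - xi) * math.log(max(1 - pi, eps))
        for xi, pi in zip(x, x_hat)
    )
    # KL divergence of a diagonal Gaussian from the standard normal, in closed form.
    kl = -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + kl, recon, kl

x = [1, 0, 1, 1]                  # binary mutation calls for one cell (toy data)
x_hat = [0.9, 0.2, 0.8, 0.6]      # decoder output probabilities (toy data)
mu, log_var = [0.0, 0.0], [0.0, 0.0]
total, recon, kl = bernoulli_vae_loss(x, x_hat, mu, log_var)
```

With the posterior equal to the prior (zero mean, unit variance) the KL term vanishes, so the total loss reduces to the reconstruction term.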





□ rcCAE: a convolutional autoencoder based method for detecting tumor clones and copy number alterations from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.04.519013v1

rcCAE uses a convolutional encoder network to project the log2-transformed read counts (LRC) into a low-dimensional latent space where the cells are clustered into distinct subpopulations through a Gaussian mixture model.

rcCAE leverages a convolutional decoder network to recover the read counts from learned latent representations. rcCAE employs a novel hidden Markov model to jointly segment the genome and infer absolute copy number for each segment.

rcCAE directly deciphers ITH from original read counts, which avoids potential error propagation from copy number analysis to ITH inference. After the algorithm converges, the copy number of each bin is deduced from the state that has the maximum posterior probability.





□ gtexture: Haralick texture analysis for graphs and its application to biological networks

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517417v1

gtexture provides a method for calculating GLCM equivalents and Haralick texture features and applies it to several network types. The authors translate co-occurrence matrix analysis to generic networks for the first time.

If the number of distinct node weights is w, the dimension of the co-occurrence matrix C is w × w. Co-occurrence matrices summarize a network when the number of distinct node weights is less than the number of nodes.
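
A small sketch of a graph co-occurrence matrix and one Haralick feature (contrast) follows. It is not the gtexture implementation. It assumes an undirected graph, already-discretised node weights, and symmetric counting of each edge. All names are illustrative.

```python
def cooccurrence_matrix(edges, node_weight, levels):
    """Count, for every pair of weight levels, the edges joining nodes with those levels."""
    index = {w: i for i, w in enumerate(levels)}
    C = [[0] * len(levels) for _ in levels]
    for u, v in edges:
        i, j = index[node_weight[u]], index[node_weight[v]]
        C[i][j] += 1
        C[j][i] += 1  # undirected edge: count it in both directions
    return C

def contrast(C):
    """Haralick contrast: sum of (i - j)^2 * p(i, j) over the normalised matrix."""
    total = sum(map(sum, C))
    size = len(C)
    return sum((i - j) ** 2 * C[i][j] / total for i in range(size) for j in range(size))

# Path graph a-b-c-d with three distinct node weights, so C is 3 x 3.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
node_weight = {"a": 1, "b": 1, "c": 2, "d": 3}
C = cooccurrence_matrix(edges, node_weight, levels=[1, 2, 3])
```

Here four nodes are summarised by a 3 × 3 matrix; the saving grows as the number of nodes increases relative to the number of distinct weights.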

gtexture reduces the number of unique node weights, incl. node weight binning options for continuous node weights. Continuous data can be transformed via several discretisation methods.

The Haralick features calculated on different landscapes and networks of the same size but with different topologies vary. Although highly specific methods designed for detecting landscape ruggedness exist, this discretization and co-occurrence matrix method is more generalizable.





□ CRMnet: a deep learning model for predicting gene expression from large regulatory sequence datasets

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518786v1

CRMnet takes a Transformer-encoded U-Net from the image semantic segmentation task and applies it to genomic sequences as a feature extractor. CRMnet utilizes transformer encoders, which leverage self-attention mechanisms to extract additional useful information from genomic sequences.

CRMnet consists of Squeeze and Excitation (SE) Encoder Blocks, Transformer Encoder Blocks, SE Decoder Blocks, SE Block and Multi-Layer Perceptron (MLP). CRMnet has an initial encoding stage that extracts feature maps at progressively lower dimensions.

A decoder stage then upscales these feature maps back to the original sequence dimension, whilst concatenating them with the higher-resolution feature maps of the encoder at each level to retain prior information despite the sparse upscaling.





□ SRGS: sparse partial least squares-based recursive gene selection for gene regulatory network inference

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-09020-7

SRGS, SPLS (sparse partial least squares)-based recursive gene selection, infers GRNs from bulk or single-cell expression data. SRGS recursively selects and scores the genes which may have regulations on the considered target gene based on SPLS.

The authors randomly scramble samples, set some values in the expression matrix to zero, and generate multiple copies of the data through multiple iterations.





□ WINC: M-Band Wavelet-Based Imputation of scRNA-seq Matrix and Multi-view Clustering of Cell

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519090v1

WINC integrates M-band wavelet analysis and UMAP to a panel of single cell sequencing datasets via breaking up the data matrix into a trend (low frequency or low resolution) component and (M − 1) fluctuation (high frequency or high resolution) components.

This strategy resolves the notorious chaotic sparsity of the droplet RNA-Seq matrix and uncovers missed / rare cell types, identities, states. WINC is a non-parametric wavelet-based imputation algorithm for sparse data that integrates M-band orthogonal wavelets to recover dropout events.





□ DeepPHiC: Predicting promoter-centered chromatin interactions using a novel deep learning approach

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac801/6887158

DeepPHiC adopts a “shared knowledge transfer” strategy for training the multi-task learning model. When tissue A/B is of interest, it aggregates all chromatin interactions from the other tissues, excluding tissue A/B, to pretrain the shared feature extractor.

DeepPHiC consists of three types of input features, which include genomic sequence and epigenetic signal in the anchors as well as anchor distance. DeepPHiC uses one-hot encoding for the genomic sequence. As a result, the genomic sequence is converted into a 2000 × 4 matrix.
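
The one-hot step can be illustrated with a short sketch. The column order (A, C, G, T) and the all-zero row for ambiguous bases are assumptions made here, not details taken from DeepPHiC.

```python
def one_hot_encode(seq, alphabet="ACGT"):
    """Return a len(seq) x 4 matrix; bases outside the alphabet (e.g. N) give an all-zero row."""
    column = {base: i for i, base in enumerate(alphabet)}
    matrix = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in column:
            row[column[base]] = 1
        matrix.append(row)
    return matrix

anchor = "ACGTN" * 400            # toy 2000-bp anchor sequence
encoded = one_hot_encode(anchor)  # 2000 rows, 4 columns
```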

The network architecture of DeepPHiC is developed based on the DenseNet. DeepPHiC uses a ResNet-style structure with skip connections. During back propagation, each layer has a direct access to the output gradients, resulting in faster network convergence.





□ DPMUnc: Bayesian clustering with uncertain data

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519476v1

Dirichlet Process Mixtures with Uncertainty (DPMUnc) is an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points.

DPMUnc outperformed its comparators kmeans and mclust by a small margin when observation noise and cluster variance were small, and this margin increased with increasing cluster variance or observation noise.

DPMZeroUnc denotes DPMUnc run on an adjusted version of the datasets where the uncertainty estimates were shrunk to 0. The latent variables are then essentially fixed to be equal to the observed data points throughout.





□ LAST: Latent Space-Assisted Adaptive Sampling for Protein Trajectories

>> https://pubs.acs.org/doi/10.1021/acs.jcim.2c01213

LAST accelerates the exploration of protein conformational space. This method comprises cycles of (i) variational autoencoder training, (ii) seed structure selection on the latent space, and (iii) conformational sampling through additional molecular dynamics simulations.

In metastable ADK simulations, LAST explored two transition paths toward two stable states, while SDS explored only one and cMD neither. In VVD light state simulations, LAST was three times faster than cMD simulation with a similar conformational space.





□ FiniMOM: Genetic fine-mapping from summary data using a non-local prior improves detection of multiple causal variants

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518898v1

FiniMOM (fine-mapping using product inverse-moment priors) is a novel Bayesian fine-mapping method for summarized genetic associations. The method uses a non-local inverse-moment prior, which is a natural prior distribution to model non-null effects in finite samples.

FiniMOM allows a non-zero probability for all variables, instead of considering only the variables that correlate highly with the residuals of the current model.

FiniMOM’s sampling scheme is related to the reversible jump MCMC algorithm; however, this formulation and the use of Laplace’s method avoid complicated sampling from a varying-dimensional model space.





□ DeepCellEss: Cell line-specific essential protein prediction with attention-based interpretable deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac779/6865030

DeepCellEss utilizes convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability.

DeepCellEss converts a protein sequence into a numerical matrix using one-hot encoding. The multi-head self-attention is used to produce residue-level attention scores. After this, a bi-LSTM module is applied to model sequential data by learning long-range dependencies.





□ DiffDomain enables identification of structurally reorganized topologically associating domains

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519135v1

DiffDomain is an algorithm leveraging high-dimensional random matrix theory to identify structurally reorganized TADs using chromatin contact maps. DiffDomain outperforms alternative methods in FPRs and TPRs, and in identifying a new subtype of reorganized TADs.

DiffDomain directly computes a difference matrix and then normalizes it properly, skipping the challenging normalization steps for individual Hi-C contact matrices. DiffDomain then borrows well-established theoretical results in random matrix theory to compute a theoretical P value.

DiffDomain identifies reorganized TADs b/n cell types w/ reasonable reproducibility using pseudo-bulk Hi-C data from as few as 100 cells per condition. DiffDomain reveals that TADs have clear differential cell-to-population variability and heterogeneous cell-to-cell variability.





□ Efficient inference and identifiability analysis for differential equation models with random parameters

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010734

A new likelihood-based framework, based on moment matching, for inference and identifiability analysis of differential equation models that capture biological heterogeneity through parameters that vary according to probability distributions.

The availability of a surrogate likelihood allows us to perform inference and identifiability analysis of random parameter models using the standard suite of tools, including profile likelihood, Fisher information, and Markov-chain Monte-Carlo.





□ EDIR: Exome Database of Interspersed Repeats

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac771/6858440

The Exome Database of Interspersed Repeats (EDIR) was developed to provide an overview of the positions of repetitive structures within the human genome composed of interspersed repeats encompassing a coding sequence.

EDIR can be queried for interspersed repeat sequences (IRS) in a gene of interest. Additional parameters which can be entered are the length of the repeat (7-20 bp), the minimum (0 bp) and maximum distance (1000 bp) of the spacer sequence, and whether to allow a 1-bp mismatch.

As output, a table is given that lists, for each repeat length, the number of interspersed repeat structures, the average distance separating two repeats, the number of interspersed repeat structures per megabase, and whether a 1-bp mismatch has occurred.





□ T3E: a tool for characterising the epigenetic profile of transposable elements using ChIP-seq data

>> https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00285-z

The Transposable Element Enrichment Estimator (T3E) weights the number of read mappings assigned to the individual TE copies of a family/subfamily by the overall number of genomic loci to which the corresponding reads map, and this is done at the single nucleotide level.

T3E maps ChIP-seq reads to the entire genome of interest w/o subsequently remapping the reads to particular consensus or pseudogenome sequences. In its calculations T3E considers the number of both repetitive / non-repetitive genomic loci to which each multimapper mapped.
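
The weighting idea can be sketched at the level of whole reads: each read is split evenly over all loci it maps to, and only the shares landing in TE copies are credited to a family. T3E does this at single-nucleotide resolution, so the read-level version below is a deliberate simplification with made-up data and names.

```python
from collections import defaultdict

def weighted_family_counts(read_mappings):
    """read_mappings: {read_id: [TE family, or None for a non-TE locus, per mapped locus]}."""
    counts = defaultdict(float)
    for loci in read_mappings.values():
        weight = 1.0 / len(loci)       # split the read evenly over all of its loci
        for family in loci:
            if family is not None:     # non-repetitive loci still dilute the weight
                counts[family] += weight
    return dict(counts)

read_mappings = {
    "r1": ["L1"],                      # unique mapper inside an L1 copy
    "r2": ["L1", "L1", "Alu", None],   # multimapper: two L1 copies, one Alu copy, one non-TE locus
    "r3": ["Alu", None],
}
counts = weighted_family_counts(read_mappings)
```

Including the non-TE loci in the denominator is what keeps a multimapper from being over-counted towards the TE family.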





□ Hi-LASSO: High-performance python and apache spark packages for feature selection with high-dimensional data

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0278570

Random LASSO does not take advantage of the global oracle property. Although Random LASSO uses bootstrapping with weights being proportional to importance scores of predictors in the second procedure, the final coefficients are estimated without the weights.

Hi-LASSO computes importance scores of variables by averaging absolute coefficients. Hi-LASSO alleviates bias from bootstrapping, improves performance by taking advantage of the global oracle property, and provides a statistical strategy to determine the number of bootstrapping iterations.





□ Scaling Neighbor-Joining to One Million Taxa with Dynamic and Heuristic Neighbor-Joining

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac774/6858462

Dynamic and Heuristic Neighbor-Joining are presented, which optimize the canonical Neighbor-Joining method to scale to millions of taxa without increasing the memory requirements.

Both Dynamic and Heuristic Neighbor-Joining outperform the current gold standard methods to construct Neighbor-Joining trees, while Dynamic Neighbor-Joining is guaranteed to produce exact Neighbor-Joining trees.

Asymptotically, DNJ reaches a runtime of O(n^3) when updates to D cause frequent updates. This worst-case time complexity can be reduced to O(n^2) with an approximating search heuristic, which brings the time complexity of HNJ to O(n^2), while the space complexity remains at O(n^2), as for DNJ.
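
For orientation, here is the pair-selection step of canonical Neighbor-Joining, which scans all pairs for the minimum of Q(i, j) = (n − 2)·d(i, j) − Σ_k d(i, k) − Σ_k d(j, k). Repeating this full scan at every join is what makes the naive method cubic. The sketch shows only that baseline step, not the DNJ or HNJ search strategies, and the distance matrix is a toy example.

```python
def q_min_pair(D):
    """Return (i, j, Q_ij) minimising the neighbor-joining Q criterion for distance matrix D."""
    n = len(D)
    row_sums = [sum(row) for row in D]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[2]:
                best = (i, j, q)
    return best

D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair = q_min_pair(D)  # taxa 0 and 1 are joined first
```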





□ GLCM-WSRC: Robust and accurate prediction of self-interacting proteins from protein sequence information by exploiting weighted sparse representation based classifier

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04880-y

GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification) predicts SIPs automatically based on protein evolutionary information from protein primary sequences.

The GLCM algorithm is employed to capture the valuable information from the PSSMs and form feature vectors. ADASYN is then applied to these GLCM feature vectors to balance the training data set, forming the new feature vectors used as the input of the classifier.





□ Treenome Browser: co-visualization of enormous phylogenies and millions of genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac772/6858450

Treenome Browser displays mutations as vertical lines spanning the mutation’s presence among samples in the phylogeny, drawn at their horizontal position in an associated reference genome.

The core algorithm used by Treenome Browser decodes a mutation-annotated tree to compute the on-screen position of each mutation in the tree. To compute vertical positions, the vertical span of each subclade of the tree is first stored using dynamic programming.
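
The dynamic-programming idea can be sketched on a toy tree: a post-order pass stores how many leaves sit under each node, and a second pass turns those spans into row intervals. This is an illustration under simplifying assumptions (rows counted in leaves rather than pixels, a plain child-list tree rather than a mutation-annotated tree), and the names are hypothetical.

```python
def leaf_counts(children, root):
    """Post-order pass storing the number of leaves (vertical span) under every node."""
    span = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not kids:
            span[node] = 1
        elif expanded:
            span[node] = sum(span[k] for k in kids)  # children are already stored
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return span

def row_intervals(children, root, span):
    """Assign each node the [top, bottom) row interval covered by its subclade."""
    rows = {root: (0, span[root])}
    stack = [root]
    while stack:
        node = stack.pop()
        top = rows[node][0]
        for k in children.get(node, []):
            rows[k] = (top, top + span[k])
            top += span[k]
            stack.append(k)
    return rows

children = {"root": ["A", "B"], "A": ["s1", "s2"], "B": ["s3"]}
span = leaf_counts(children, "root")
rows = row_intervals(children, "root", span)
```

A mutation assigned to node A would then be drawn as a vertical line covering rows 0 to 2, that is, samples s1 and s2.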





□ Accurate quantification of single-nucleus and single-cell RNA-seq transcripts

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518832v1

The presence of both nascent and mature mRNA molecules in single-cell RNA-seq data leads to ambiguity in the notion of a “count matrix”. Underlying this ambiguity is the challenging problem of separately quantifying nascent and mature mRNAs.

By utilizing k-mers, this approach has the benefit of being efficient, as it is compatible with pseudoalignment. The authors also present an approach to the quantification of single-nucleus RNA-seq that focuses on the nascent transcripts, thereby mirroring the approach that focuses on mature transcripts.





□ Variational inference accelerates accurate DNA mixture deconvolution

>> https://www.biorxiv.org/content/10.1101/2022.12.01.518640v1

The paper considers Stein Variational Gradient Descent (SVGD) and Variational Inference (VI) with an evidence lower-bound objective. Both provide alternatives to the commonly used Markov-Chain Monte-Carlo methods for estimating the model posterior in Bayesian probabilistic genotyping.

The model defines the unnormalised posterior, and the estimator defines how an approximation of this distribution is obtained. These two parts are largely independent of each other, meaning that, for example, an estimator can be replaced with another one.

The singularities are not a problem for HMC estimators, which will avoid them because of the high curvature of the posterior in the vicinity of the singularities. There, the trajectory of the simulated Hamiltonian differs too much from the expected Hamiltonian.





□ HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518395v1

HTRX defines a template for each haplotype using the combination of ‘0’, ‘1’ and ‘X’, which represent the reference allele, the alternative allele and either of the alleles, respectively, at each SNP. A four-SNP haplotype ‘1XX0’ only refers to the interaction between the first and the fourth SNP.
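
Template matching of this kind is easy to sketch. The function below is an illustration in Python rather than the HTRX R package API; haplotypes are assumed to be strings of '0' and '1', one character per SNP.

```python
def matches_template(haplotype, template):
    """True if a 0/1 haplotype string agrees with the template at every non-'X' position."""
    if len(haplotype) != len(template):
        raise ValueError("haplotype and template must cover the same SNPs")
    return all(t == "X" or h == t for h, t in zip(haplotype, template))

haplotypes = ["1000", "1110", "1011", "0100"]
# Indicator feature for template '1XX0': only the first and fourth SNP matter.
feature = [int(matches_template(h, "1XX0")) for h in haplotypes]
```

The resulting 0/1 column is the kind of feature that a regression on the phenotype could then select or discard.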

HTRX considers lasso penalisation. AIC and BIC penalise the number of features through forward regression, and the features whose parameters do not shrink to 0 are retained. The objective function of HTRX is the out-of-sample variance explained by haplotypes within a region.





□ GSSNNG: Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq)

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518384v1

GSSNNG produces a gene set score for each individual cell, addressing the problems of low read counts and many zeros while retaining gradations that remain visible in UMAP plots.

The method works by using a nearest neighbor graph in gene expression space to smooth the count matrix. The smoothed expression profiles are then used in single sample gene set scoring calculations.

Using gssnng, large collections of cells can be scored quickly even on a modest desktop. Smoothing the gene expression count matrix over the nearest neighbor (kNN) graph of cells decreases sparsity and improves gene set scoring.
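
A brute-force sketch of the smoothing step follows. It assumes Euclidean distance on the raw counts and a plain mean over each cell and its k nearest neighbours. gssnng itself may build and weight the graph differently, so this should be read as an illustration of the idea only.

```python
import math

def knn_smooth(counts, k):
    """Replace each cell's profile by the mean over itself and its k nearest cells."""
    n = len(counts)
    smoothed = []
    for i in range(n):
        # Rank all other cells by Euclidean distance in expression space.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(counts[i], counts[j]),
        )
        group = [i] + order[:k]
        smoothed.append(
            [sum(counts[j][g] for j in group) / len(group) for g in range(len(counts[i]))]
        )
    return smoothed

counts = [      # rows are cells, columns are genes (toy data)
    [0, 4, 0],
    [2, 4, 0],
    [0, 0, 6],
    [0, 2, 6],
]
smoothed = knn_smooth(counts, k=1)
zeros_before = sum(v == 0 for row in counts for v in row)
zeros_after = sum(v == 0 for row in smoothed for v in row)
```

In this toy matrix the number of zero entries drops after smoothing, which is the sparsity reduction described above.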





□ Annotation-agnostic discovery of associations between novel gene isoforms and phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518787v1

A bi-directed de Bruijn Graph (dBG) is constructed, using Bifrost, from these reads with k-mer size 𝑘 = 31 and then compacted such that consecutive k-mers with out-degree 1 and in-degree 1 respectively are folded into a single, maximal unitig, which is a high-confidence contig.





□ MCProj: Metacell projection for interpretable and quantitative use of transcriptional atlases

>> https://www.biorxiv.org/content/10.1101/2022.12.01.518678v1

MCProj is an algorithm for quantitative analysis of query scRNA-seq given a reference atlas. The algorithm transforms single cells to quantitative states using a metacell representation of the atlas and the query.

MCProj infers each query state as a mixture of atlas states, and tags cases in which such inference is imprecise, suggestive of novel or noisy states in the query. MCProj tags novel query states and compares them to atlas states.





□ Finemap-MiXeR: A variational Bayesian approach for genetic finemapping

>> https://www.biorxiv.org/content/10.1101/2022.11.30.518509v1

The Finemap-MiXeR is based on a variational Bayesian approach for finemapping genomic data, i.e., determining the causal SNPs associated with a trait at a given locus after controlling for correlation among genetic variants due to linkage disequilibrium.

Finemap-MiXeR relies on the optimization of the Evidence Lower Bound of the likelihood function obtained from the MiXeR model. The optimization is done using the Adaptive Moment Estimation (Adam) algorithm, which yields the posterior probability of each SNP being a causal variant.





□ Visual Omics: A web-based platform for omics data analysis and visualization with rich graph-tuning capabilities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac777/6865031

Visual Omics integrates multiple omics analyses which include differential expression analysis, enrichment analysis, protein domain prediction and protein-protein interaction analysis with extensive graph presentations.

The extensive use of the powerful downstream ggplot2 and its family packages enables almost all analysis results to be visualized by Visual Omics and adapted to the online tuning system almost without modification.





□ associationSubgraphs: Interactive network-based clustering and investigation of multimorbidity association matrices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac780/6874541

associationSubgraphs, a new interactive visualization method to quickly and intuitively explore high-dimensional association datasets using network percolation and clustering.

The algorithm for computing associationSubgraphs at all given cutoffs is closely related to single-linkage clustering but differs philosophically by viewing nodes that are yet to be merged with other nodes as unclustered rather than residing within their own cluster of size one.

It investigates association subgraphs efficiently, each containing a subset of variables with more frequent associations than the remaining variables outside the subset, by showing the entire clustering dynamics and providing subgraphs under all possible cutoff values at once.




Starbright.

2022-12-13 23:12:13 | Science News




□ MoDLE: high-performance stochastic modeling of DNA loop extrusion interactions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02815-7

MoDLE uses fast stochastic simulation to sample DNA-DNA contacts generated by loop extrusion. Binding and release of LEFs and barriers and the extrusion process are modeled as an iterative process.

MoDLE goes through a burn-in phase where LEFs are progressively bound to DNA, w/o sampling molecular contacts. The burn-in phase runs until the average loop size has stabilized. LEFs are extruded through randomly sampled strides along the DNA in reverse / forward directions.

Extrusion barriers (e.g., CTCF binding sites) are modeled using a two-state (bound and unbound) Markov process. Each extrusion barrier consists of a position, a blocking direction and the Markov process transition probabilities.
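
A two-state barrier of this kind can be sketched in a few lines. The transition probabilities, field names and position below are invented for illustration and are not MoDLE parameters or defaults.

```python
import random

def simulate_barrier(p_bind, p_release, steps, seed=0):
    """Simulate a bound/unbound barrier and return the fraction of steps spent bound."""
    rng = random.Random(seed)
    bound = False
    bound_steps = 0
    for _ in range(steps):
        if bound:
            if rng.random() < p_release:
                bound = False
        elif rng.random() < p_bind:
            bound = True
        bound_steps += bound
    return bound_steps / steps

def stationary_occupancy(p_bind, p_release):
    """Long-run probability of the bound state for the two-state chain."""
    return p_bind / (p_bind + p_release)

# A barrier: position, blocking direction and the two transition probabilities.
barrier = {"position": 1_250_000, "direction": "+", "p_bind": 0.09, "p_release": 0.01}
occupancy = stationary_occupancy(barrier["p_bind"], barrier["p_release"])
simulated = simulate_barrier(barrier["p_bind"], barrier["p_release"], steps=200_000)
```

The simulated fraction should settle near the closed-form occupancy, which gives a quick sanity check on the chosen transition probabilities.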





□ Reconstructing gene regulatory networks of biological function using differential equations of multilayer perceptrons

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05055-5

A multi-layer perceptron-based differential equation method transforms the gene regulation network (GRN) system into an input-output regression problem, where the input is gene expression data and the output is the derivative estimated from the expression data.

The method utilizes time-series gene expression data to train a regulatory function that simulates the transcription rate of a gene; this function is a fully connected neural network (NN) with a four-layer structure.
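
The regression targets can be sketched as follows. The summary does not say how the derivative is estimated, so finite differences are used here as an assumption; gene names and values are toy data.

```python
def finite_difference(times, values):
    """Estimate dx/dt at each time point: central differences inside, one-sided at the ends."""
    n = len(values)
    if n < 2 or n != len(times):
        raise ValueError("need at least two matched time points")
    derivative = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        derivative.append((values[hi] - values[lo]) / (times[hi] - times[lo]))
    return derivative

times = [0.0, 1.0, 2.0, 4.0]
expression = {"geneA": [1.0, 2.0, 4.0, 8.0], "geneB": [5.0, 4.0, 4.0, 2.0]}

# Output of the regression: the estimated derivative of each target gene.
targets = {gene: finite_difference(times, x) for gene, x in expression.items()}
# Input of the regression: the expression of all genes at the same time point.
inputs = [[expression[g][i] for g in sorted(expression)] for i in range(len(times))]
```

Each pair (inputs[i], targets[gene][i]) would then be one training example for the network that models that gene's regulatory function.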





□ BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517691v1

BLEND utilizes a technique called SimHash, that can generate the same hash value for similar sets, and provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently.
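
A generic SimHash over a set of k-mers looks roughly like this. It is a sketch of the technique, not BLEND's implementation: the hash function (BLAKE2b), the 32-bit width and the way seeds are turned into k-mer sets are all choices made here for illustration.

```python
import hashlib

def simhash(items, bits=32):
    """SimHash of a set of strings: a per-bit majority vote over the items' hash values."""
    votes = [0] * bits
    for item in set(items):
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hamming(a, b):
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")

seed_a = "ACGTACGGTCAGTTGACCATGA"
seed_b = "ACGTACGGTCAGTTGACCATGC"   # differs from seed_a in the last base only
hash_a = simhash(kmers(seed_a, 5))
hash_b = simhash(kmers(seed_b, 5))
distance = hamming(hash_a, hash_b)
```

Because the two seeds share all but one k-mer, most per-bit votes agree, so their hash values tend to differ in few bits and can coincide exactly. Identical sets always produce the same value regardless of item order.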

BLEND is faster by 2.4×-83.9× (average 19.3×), has a lower memory footprint by 0.9×-14.1× (average 3.8×), and finds higher quality overlaps leading to more accurate de novo assemblies than minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (average 1.7×) than minimap2.





□ SIEVE: joint inference of single-nucleotide variants and cell phylogeny from single-cell DNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02813-9

SIEVE, a statistical method for the joint inference of somatic variants and cell phylogeny under the finite-sites assumption from single-cell DNA sequencing. SIEVE leverages raw read counts for all nucleotides and corrects the acquisition bias of branch lengths.

SIEVE takes as input raw read count data, accounting for the read counts for nucleotides and the total depth at each site and combines a phylogenetic model with a probabilistic graphical model, incorporating a Dirichlet Multinomial distribution of the nucleotide counts.
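
The Dirichlet-multinomial likelihood of nucleotide counts at a single site can be evaluated as below. This is the textbook distribution, not SIEVE's exact parameterisation; the concentration vectors are hypothetical values chosen for illustration.

```python
import math

def dirichlet_multinomial_logpmf(counts, alpha):
    """log P(counts | alpha) under a Dirichlet-multinomial distribution."""
    n = sum(counts)
    a = sum(alpha)
    log_p = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        log_p += math.lgamma(x + al) - math.lgamma(al) - math.lgamma(x + 1)
    return log_p

site_counts = [28, 1, 0, 11]          # reads supporting A, C, G, T at one site (toy values)
alpha_het = [5.0, 0.1, 0.1, 5.0]      # hypothetical profile for an A/T heterozygous site
alpha_hom = [10.0, 0.1, 0.1, 0.1]     # hypothetical profile for an A homozygous site
loglik_het = dirichlet_multinomial_logpmf(site_counts, alpha_het)
loglik_hom = dirichlet_multinomial_logpmf(site_counts, alpha_hom)
```

Comparing the two log-likelihoods shows how overdispersed read counts can be scored against alternative genotype hypotheses.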





□ scEvoNet: a gradient boosting-based method for prediction of cell state evolution

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519467v1

scEvoNet builds the confusion matrix of cell states and a bipartite network connecting genes and cell states. It allows a user to obtain a set of genes shared by the characteristic signature of two cell states even between distantly-related datasets.

scEvoNet implements a shortest path search in order to generate a subnetwork of interest. scEvoNet builds a cell type-to-gene network using the Light Gradient Boosting Machine (LGBM) algorithm overcoming different domain effects and dropouts that are inherent.





□ seqwish: Unbiased pangenome graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac743/6854971

The seqwish algorithm builds a variation graph from a set of sequences and alignments between them. seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph encoding the sequences and their alignments.

seqwish transforms the alignment set into an implicit interval tree. seqwish queries this representation to reduce transitive matches into single DNA segments in a sequence graph. seqwish traces the original paths through this graph, yielding a pangenome variation graph.





□ RawMap: Rapid Real-time Squiggle Classification for Read Until

>> https://www.biorxiv.org/content/10.1101/2022.11.22.517599v1

RawMap is a direct squiggle-space metagenomic classifier which complements Minimap2 for filtering non-targeted reads. RawMap uses an SVM with an RBF kernel, which is trained to capture the non-linear and non-stationary characteristics of the nanopore squiggles.

Each normalized squiggle segment y corresponding to 450 basepairs of a read is mapped to a 3-D feature space. Features are derived from a modified ver. of Hjorth parameters, where the mean and standard deviation are replaced w/ median and median absolute deviation respectively.





□ scSHARP: Consensus Label Propagation with Graph Convolutional Networks for Single-Cell RNA Sequencing Cell Type Annotation

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517739v1

scSHARP provides evidence for the accuracy of the GCN approach through comparison to state-of-the-art methods ScType, ScSorter, SCINA, SingleR, and ScPred on a variety of data sets.

They implemented a non-parametric neighbor majority approach as an additional baseline to test their GCN model. This method operates on the 500D vectors produced as the principal components of the gene expression matrices for each data set.





□ Matrix prior for data transfer between single cell data types in latent Dirichlet allocation

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517534v1

When applied to scATAC-seq data, the outputs of latent Dirichlet allocation (LDA) are a cell-topic matrix, describing the topics assigned to each cell, and a topic-peak matrix, describing how strongly a peak contributes to the definition of each topic.

LDA is also well-suited to model single cell genomics data because it expects a matrix of integers as input, and thus can naturally operate on the raw count matrices generated by scATAC-seq or scRNA-seq.

The hyperparameters for the LDA model are the concentration parameters for the document/topic Dirichlet distribution. These distributions are assumed to be symmetric Dirichlet distributions. In that case the Dirichlet distribution can be parameterized with a single scalar value.
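
To see what the single scalar controls, one can draw from a symmetric Dirichlet by normalising independent gamma draws. This is a minimal standard-library sketch; the dimension and concentration values are arbitrary.

```python
import random

def sample_symmetric_dirichlet(dim, concentration, rng):
    """Draw from Dirichlet(c, ..., c) by normalising independent Gamma(c, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
sparse = sample_symmetric_dirichlet(5, 0.1, rng)   # small value: mass tends to pile onto few topics
even = sample_symmetric_dirichlet(5, 50.0, rng)    # large value: close to uniform over topics
```

A small concentration favours cells dominated by one or two topics, while a large one spreads each cell almost evenly across topics.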





□ Interactive explainable AI platform for graph neural networks

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517358v1

An interactive XAI platform that allows the domain expert to ask counterfactual ("what-if") questions. This platform allows a domain expert to observe how changes based on their questions affect the AI decision and the XAI explanation.

This human-in-the-loop approach to GNN classification will pave the way for implementation of GNNs in the clinical setting. This interactive XAI platform will pave the way for informed medical decision-making and the application of AI models as CDSS.

The authors generated 1000 Barabási networks comprising 30 nodes and 29 edges. The networks had the same topology, but with varying node feature values. The features of the nodes were randomly sampled from a normal distribution N(0, 0.1). It should uncover these patterns in an algorithmic way.
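
A synthetic set of this shape can be sketched with preferential attachment, adding one edge per new node so that 30 nodes give 29 edges. Two assumptions are made here: each node carries a single scalar feature, and the 0.1 in N(0, 0.1) is read as the standard deviation (the text does not say whether it is the variance).

```python
import random

def barabasi_albert_tree(n, rng):
    """Preferential attachment with one edge per new node: n nodes and n - 1 edges."""
    edges = [(0, 1)]
    endpoints = [0, 1]                    # each node is listed once per incident edge
    for new in range(2, n):
        target = rng.choice(endpoints)    # picked with probability proportional to degree
        edges.append((target, new))
        endpoints += [target, new]
    return edges

rng = random.Random(42)
edges = barabasi_albert_tree(30, rng)     # one topology shared by all networks
feature_sets = [
    [rng.gauss(0.0, 0.1) for _ in range(30)]   # assumption: 0.1 is the standard deviation
    for _ in range(1000)
]
```

Each of the 1000 networks is then the same edge list paired with one of the feature vectors.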





□ ANNA16: Deep Learning for Predicting 16S rRNA Copy Number

>> https://www.biorxiv.org/content/10.1101/2022.11.26.518038v1

The proposed approach, i.e., Artificial Neural Network Approximator for 16S rRNA Gene Copy Number (ANNA16), essentially links 16S sequence string directly to GCN, without the construction of taxonomy or phylogeny.

ANNA16 is capable of detecting informative positions and weighing K-mers unequally according to their informativeness to more effectively utilize the information contained in 16S sequence.





□ IBDphase: Accurate genome-wide phasing from IBD data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05066-2

Identity by descent (IBD) occurs when one of a person’s two haplotypes is identical to one of another person’s in a segment of the genome because the two share a common ancestor. IBD data can be used to phase and determine the parent from which haplotypes are inherited.

IBDphase is able to separate the DNA inherited from each parent in our test set with an average accuracy over 95%. IBDphase also labels each IBD segment as being on one side of the family or the other.

IBDphase performs better when the DB is large, when many IBD segments are discovered, when a large proportion of sites overlap at least a few IBD segments, and when there are close genetic relationships to provide long IBD segments and help phase across multiple chromosomes.





□ Transposable element finder (TEF): finding active transposable elements from next generation sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05011-3

The new algorithm Transposable Element Finder (TEF) enables the detection of TE transpositions, even for TEs with an unknown sequence. TEF is a tool for finding transposed TEs, whereas TIF detects transposition sites for TEs with a known sequence.

TEF detects transposed TEs from the target site duplications (TSDs) that result from TE transposition, reporting the sequences of both ends of each transposed TE and its insertion position. Genotypes of transpositions are verified by counting head and tail junctions and non-insertion sequences in the NGS reads.





□ scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517598v1

scCDC (single-cell Contamination Detection and Correction), which first detects the “contamination-causing genes,” which encode the most abundant ambient RNAs, and then only corrects these genes’ measured expression levels.

scCDC locates the cell cluster in which the GCG has the lowest mean expression. scCDC groups the cell cluster w/ similar clusters in terms of the Wasserstein distance. Genes w/ significant entropy divergence were selected in each cluster and the common genes were defined as GCGs.





□ MAGE: Strain Level Profiling of Metagenome Samples

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517382v1

MAGE builds a k-mer lookup index for the sequence collection. It comprises strain level genome sequences from across a set of species. MAGE performs a novel local search based optimization which computes maximum likelihood estimates subject to constraints on read coverage.

The MAGE index consists of two levels. In the level-2 index, the T sub-collections are indexed separately using FM-index-based full-text indexing that supports k-mer lookup. MAGE performs read mapping purely based on k-mer hits and without any gapped alignment.





□ SCALA: A web application for multimodal analysis of single cell next generation sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517826v1

SCALA, a holistic pipeline which integrates all the aforementioned procedures and enables biomedical researchers to get actively involved in the downstream analysis and exploration of both scRNA-seq and scATAC-seq datasets.

SCALA supports additional analysis modes such as automatic cluster annotation, functional enrichment analysis, ligand-receptor analysis, trajectory inference and reconstruction of GRNs.





□ RNAlysis: analyze your RNA sequencing data without writing a single line of code

>> https://www.biorxiv.org/content/10.1101/2022.11.25.517851v1

RNAlysis allows users to build customized analysis pipelines suiting their specific research questions, going all the way from raw FASTQ files, through exploratory data analysis and data visualization, clustering analysis, and gene-set enrichment analysis.

RNAlysis uses a modular approach, and provides an intuitive and flexible GUI, allowing users to answer a wide variety of biological questions, whether they are general or highly specific, and explore their data interactively without writing a single line of code.





□ PRESGENE: A web server for PRediction of ESsential GENE using integrative machine learning strategies

>> https://www.biorxiv.org/content/10.1101/2022.11.25.517801v1

PRESGENE, a ML-based web server for prediction of essential genes in unexplored eukaryotic and prokaryotic organisms.

PRESGENE algorithms mitigate the problems of training dataset imbalance and limited availability of experimentally labeled data for essential genes.





□ WGDTree: a phylogenetic software tool to examine conditional probabilities of retention following whole genome duplication events

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05042-w

Using gene tree-species tree reconciliation to label gene duplicate nodes and differentiate b/n WGD and SSD duplicates, WGDTree calculates a statistic based upon the conditional probability of a gene duplicate being retained after a second WGD dependent upon the retention status.

The inference tool performed well for a range of tree topologies and SSD rates particularly when loss and small-scale duplication rates were small and when event pairs were placed further apart. Therefore, WGDTree can be used to reliably calculate Pratio values in other lineages.





□ Monopogen: single nucleotide variant calling from single cell sequencing

>> https://www.biorxiv.org/content/10.1101/2022.12.04.519058v1

Monopogen, a computational framework that enables researchers to detect single nucleotide variants (SNVs) from a variety of single cell transcriptomic and epigenomic sequencing data. Monopogen starts from individual BAM files produced by single cell sequencing technologies.

Monopogen leverages linkage disequilibrium (LD) data from an external reference panel to increase SNV detection sensitivity and genotyping accuracy. Monopogen uses Monovar, a probabilistic SNV caller that effectively accounts for allelic dropout and false-positive errors.





□ SysBiolPGWAS: Simplifying Post GWAS analysis through the use of computational technologies and integration of diverse Omics datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac791/6883906

SysBiolPGWAS, a post-GWAS web application that provides a comprehensive functionality for biologists and non-bioinformaticians to conduct several post-GWAS analyses. It targets researchers in the area of the human genome and performs its analysis mainly in the autosomal chromosomes.

SysBiolPGWAS can select causal variants based on the linkage disequilibrium information in 1000 Genomes using the clumping method of the PLINK software. The process of variant clumping iteratively reports the most significant variant in the defined LD regions across the genome.
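
The iterative step can be sketched as a greedy loop. This is a simplified illustration, not PLINK's implementation: the variant names, the r² values and both thresholds are made up, and PLINK's physical distance window and secondary significance threshold are omitted.

```python
def clump(p_values, r2, p_threshold=5e-8, r2_threshold=0.5):
    """Greedy clumping: take the most significant unassigned variant as an
    index variant, then attach every unassigned variant in LD with it."""
    by_significance = sorted(p_values, key=p_values.get)
    assigned = set()
    clumps = {}
    for index_variant in by_significance:
        if index_variant in assigned or p_values[index_variant] > p_threshold:
            continue
        members = [
            v for v in by_significance
            if v != index_variant
            and v not in assigned
            and r2.get(frozenset((index_variant, v)), 0.0) >= r2_threshold
        ]
        assigned.update(members, [index_variant])
        clumps[index_variant] = members
    return clumps

# Hypothetical toy data: p-values per variant and pairwise r^2 values.
p_values = {"rs1": 1e-12, "rs2": 1e-9, "rs3": 3e-8, "rs4": 0.2}
r2 = {frozenset(("rs1", "rs2")): 0.9, frozenset(("rs3", "rs4")): 0.8}
clumps = clump(p_values, r2)
```

Here `rs1` and `rs3` come out as index variants, with `rs2` and `rs4` attached to them respectively.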





□ Atlas-scale single-cell multi-sample multi-condition data integration using scMerge2

>> https://www.biorxiv.org/content/10.1101/2022.12.08.519588v1

scMerge2 algorithm is able to integrate many millions of cells from single-cell studies generated from various single-cell technologies, incl. scRNA-seq, CyTOF. scMerge2 is generalizable to other single cell modalities including spatially resolved modality and multi-modalities.

The robustness of scMerge2 is achieved by varying the key tuning parameters of the algorithm, including the number of unwanted variation factors, the number of pseudo-bulk, the ways of pseudo-bulk construction and the number of nearest neighbours.





□ Dysfunctional analysis of the pre-training model on nucleotide sequences and the evaluation of different k-mer embeddings

>> https://www.biorxiv.org/content/10.1101/2022.12.05.518770v1

Decomposing a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into embedding and encoding modules to illustrate what a pre-trained model learns from pre-training data.

The context-consistent k-mer representation is the primary product that a typical BERT model learns in the embedding layer. Surprisingly, single usage of the k-mer embedding on the random data can achieve comparable performance to that of the k-mer embedding on actual sequences.





□ Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1112/6882131

Freddie is an annotation-free isoform detection and discovery tool that uses as input transcriptomic long-reads aligned to the reference genome using a splice aligner. Freddie partitions the input reads into sets that can be processed independently and in parallel.

Freddie segments the genomic alignment of the reads into canonical exon segments. Freddie reconstructs the isoforms by jointly clustering and error-correcting the reads using the canonical segmentation as a succinct representation.





□ Optimising a coordinate ascent algorithm for the meta-analysis of test accuracy studies

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519131v1

Considering six closed-form methods for estimating the initial values of the parameters for a coordinate ascent algorithm used to fit the bivariate model, and comparing them with numerically derived robust initial values.

All the closed form methods lead to a reduction in computation time of around 80% and rank higher overall across the metrics when compared with the robust initial values method.

Although no initial values estimator dominated the others across all parameters and metrics, the two-step Hedges-Olkin estimator ranked highest overall across the different scenarios.





□ Megan Server: facilitating interactive access to metagenomic data on a server

>> https://www.biorxiv.org/content/10.1101/2022.12.05.518498v1

Megan Server, a stand-alone program that serves MEGAN files to the web, using a RESTful API, facilitating interactive analysis without downloading the complete data.

A root directory is specified and then all appropriate files found in or below the root directory are served. The API provides endpoints for obtaining file-related information, classification-related information, for accessing reads and matches and for administrating the server.





□ VASCA: Variable-selection ANOVA Simultaneous Component Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac795/6887137

Variable-selection ASCA (VASCA), a method that generalizes ASCA through variable selection, augmenting its statistical power without inflating the Type-I error risk. The method is evaluated with simulations and with a real data set from a multi-omic clinical experiment.

VASCA is compared to ASCA and the BH (FDR) method in terms of statistical power, and to Partial Least Squares Discriminant Analysis (PLS-DA) and its sparse counterpart (sPLS-DA) in terms of exploratory power.





□ GeneticsMakie.jl: A versatile and scalable toolkit for visualizing locus-level genetic and genomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac786/6887175

GeneticsMakie.jl allows scalable and flexible visual display of high-dimensional genetic and genomic data within the Julia ecosystem. It produces high-quality, publication-ready figures by default.

GeneticsMakie.jl harmonizes column names of GWAS or QTL summary statistics and their SNP IDs, and calculates Z-scores if they are missing. P values of highly significant SNPs can be too small to represent as floating-point numbers; GeneticsMakie.jl mitigates this issue by clamping the P values of such SNPs to the smallest floating-point number when munging summary statistics.
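
Clamping of this kind can be sketched in a few lines. The example is written in Python rather than Julia and is not GeneticsMakie.jl's code. The helper name `clamp_p_value` and the sample P values are assumptions, and "smallest floating-point number" is read here as the smallest positive normal double.

```python
import math
import sys

def clamp_p_value(p):
    """Raise P values that underflowed to 0 up to the smallest positive normal float."""
    return max(p, sys.float_info.min)

p_values = [0.05, 1e-300, 0.0]  # the last value has underflowed to zero
clamped = [clamp_p_value(p) for p in p_values]

# After clamping, -log10(P) is finite for every SNP and can be plotted.
neg_log10 = [-math.log10(p) for p in clamped]
```

Without the clamp, `math.log10(0.0)` raises an error, so the most significant SNPs would be the ones missing from the plot.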





□ AutoGater: A Weakly Supervised Neural Network Model to Gate Cells in Flow Cytometric Analyses

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519491v1

Autogater, using a neural network model, can utilize information across multiple channels to distinguish between live and dead cell populations. While the precise definition of dead cells utilized by Autogater is unknown, the model was trained on information only from Forward Scatter and Side Scatter channels.

Autogater has a couple of significant advantages over nucleic acid stains or CFU analyses. When trained on both SYTOX and CFU analyses, Autogater appears to account for features of dead cells identified by both approaches while allowing real-time determination of which cells are dead or alive.





□ TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

>> https://www.biorxiv.org/content/10.1101/2022.12.09.519749v1

TargetCall performs light-weight basecalling to compute noisy reads using LightCall, and labels these noisy reads as on-target/off-target using Similarity Check. TargetCall eliminates the wasted computation in basecalling by performing basecalling only on the on-target reads.

TargetCall improves the performance of the entire genome sequence analysis pipeline by 2.03×-3.00×. Because a highly accurate neural-network-based variant caller is used, the execution time of variant calling dominates that of read mapping.





□ DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05093-z

DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to the optional solutions combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions).

DiviK is an original stepwise deglomerative algorithm. It uses a locally optimised K-means algorithm iteratively. They implemented local feature engineering as filtering based on GMM decomposition of the feature variance across the subregion.





□ Codetta: predicting the genetic code from nucleotide sequence

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac802/6895099

Codetta can analyze an arbitrary nucleotide sequence and needs no sequence annotation or taxonomic placement. The most likely amino acid decoding for each of the 64 codons is inferred from alignments of profile hidden Markov models of conserved proteins to the input sequence.

Codetta takes nucleotide sequences from a single organism as input and predicts the genetic code from coding regions with recognizable homology. For each codon, the best amino acid meaning is selected; Codetta can detect canonical stop and sense codons w/ new amino acid meanings.





□ PYPE: A Python pipeline for phenome-wide association (PheWAS) and mendelian randomization in investigator-driven phenotypes and genotypes of biobank data

>> https://www.biorxiv.org/content/10.1101/2022.12.10.519906v1

PYPE provides the user with the ability to run Mendelian Randomization under a variety of causal effect modeling scenarios (e.g., Inverse Variance Weighted Regression, Egger Regression, and Weighted Median Estimation) to identify possible causal relationships between phenotypes.












Maroon.

2022-12-13 23:11:11 | Science News




□ HELIOS: High-speed sequence alignment in optics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010665

HELIOS, an all-optical high-throughput method for aligning DNA, RNA, and protein sequences. HELIOS locates matches, mutations, and single/multiple indels, while the coding procedure presents distinct coding patterns for input sequences and reduces noise in the output vector.

The HELIOS optical architecture exploits high-speed processing and operational parallelism by using the wavelength and polarization of optical beams. The HELIOS method and the HELIOS optical architecture are each designed to enhance the other, and together they form a single coherent system.





□ SimMCMC: Inferring delays in partially observed gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.11.27.518074v1

SimMCMC, a simulation-based Bayesian method for the inference of kinetic / delay parameters of a GRN when only the products of the genes in the network are observed. SimMCMC is applicable even if only the most downstream genes, i.e. the final outputs, of the network are observed.

SimMCMC uses a continuous-time Markov chain, which efficiently describes a biochemical reaction network; one can also use a stochastic differential equation, which is accurate when the copy numbers are higher, an agent-based model, or a delay differential equation.





□ Syllable-PBWT for space-efficient haplotype long-match query

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac734/6849513

Syllable-PBWT, a space-efficient variation of the positional Burrows-Wheeler transform (PBWT) which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function.

Syllable-Query, an algorithm that solves the L-long match query problem. Syllable-Query searches for ongoing long matches, as opposed to past solutions’ focus on terminated matches, due to the chaotic behavior upon match termination of general sequences in reverse prefix order.
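
The syllable and rolling-hash ideas can be illustrated with a small sketch. It is not the Syllable-PBWT implementation: the syllable width, `BASE`, `MOD` and the two toy haplotypes are assumptions chosen for readability. The point it shows is that prefix hashes let two runs of syllables be compared in constant time.

```python
MOD = (1 << 61) - 1  # a large prime modulus (illustrative choice)
BASE = 1_000_003     # polynomial base (illustrative choice)

def to_syllables(haplotype, width):
    """Split a 0/1 string into fixed-width chunks and read each chunk as an integer."""
    return [int(haplotype[i:i + width], 2) for i in range(0, len(haplotype), width)]

def prefix_hashes(syllables):
    """prefix[i] is the polynomial hash of syllables[:i]."""
    prefix = [0]
    for s in syllables:
        # The +1 keeps an all-zero syllable from hashing like "nothing".
        prefix.append((prefix[-1] * BASE + s + 1) % MOD)
    return prefix

def range_hash(prefix, powers, lo, hi):
    """Hash of syllables[lo:hi], computed in O(1) from the prefix hashes."""
    return (prefix[hi] - prefix[lo] * powers[hi - lo]) % MOD

hap_a = "011010011111"
hap_b = "000010011111"  # differs from hap_a only in the first syllable
syl_a, syl_b = to_syllables(hap_a, 4), to_syllables(hap_b, 4)

powers = [1]
for _ in syl_a:
    powers.append(powers[-1] * BASE % MOD)

pre_a, pre_b = prefix_hashes(syl_a), prefix_hashes(syl_b)
same_tail = range_hash(pre_a, powers, 1, 3) == range_hash(pre_b, powers, 1, 3)
same_all = range_hash(pre_a, powers, 0, 3) == range_hash(pre_b, powers, 0, 3)
```

Equal hashes only suggest equal syllable runs; a real tool has to account for hash collisions.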





□ IRM / ns-HAL: The Inherited Rate Matrix algorithm for phylogenetic model selection for non-stationary Markov processes

>> https://www.biorxiv.org/content/10.1101/2022.12.06.519392v1

The Inherited Rate Matrix algorithm (IRM) reduces the complexity of identifying a sufficient solution to the problem of time-heterogeneous substitution processes across lineages. fast-IRM makes the parameters from the parent model constant to reduce numerical optimisation time.

The non-stationary heterogeneous across lineages model (ns-HAL) extends the HAL algorithm to the general nucleotide Markov process. This is a discrete-time process. The complexity-reducing approach employs a top-down algorithm to identify optimal time-heterogeneous models.





□ Progres: Fast protein structure searching using structure graph embedding

>> https://www.biorxiv.org/content/10.1101/2022.11.28.518224v1

Progres (PROtein GRaph Embedding Search), a simple GNN to embed a protein structure independent of its sequence. Because Progres uses distance features based on coordinates, the embedding is E(3)-invariant: it doesn’t change w/ translation, rotation or reflection of the input structure.
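
The invariance claim rests on a simple fact: pairwise distances do not change under rotation, reflection or translation. The sketch below checks that numerically on four made-up 3D points. It only illustrates the property of distance features and does not reproduce Progres's actual features or network.

```python
import math

def pairwise_distances(coords):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def rotate_z(p, angle):
    """Rotate a point about the z axis."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def reflect_x(p):
    """Mirror a point through the yz plane."""
    x, y, z = p
    return (-x, y, z)

def translate(p, shift):
    """Shift a point by a fixed vector."""
    return tuple(a + b for a, b in zip(p, shift))

# Hypothetical coordinates standing in for residue positions.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.5), (0.2, 1.0, 3.0)]
moved = [translate(reflect_x(rotate_z(p, 0.7)), (5.0, -2.0, 1.0)) for p in coords]

before = pairwise_distances(coords)
after = pairwise_distances(moved)
max_diff = max(abs(x - y) for r0, r1 in zip(before, after) for x, y in zip(r0, r1))
```

`max_diff` stays at floating-point noise level, so any model that sees only these distances gives the same output for the moved structure.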

A decoder generates structures from the embedding space. Properties of proteins such as evolution, topological classification, the completeness of fold space, the continuity of fold space, function and dynamics could be explored in the context of the low-dimensional fold space.





□ dnadna: a deep learning framework for population genetics inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac765/6851140

dnadna, a flexible Python-based software for deep learning inference in population genetics. It is task-agnostic and aims at facilitating the development, reproducibility, dissemination, and reusability of neural networks designed for population genetic data.

dnadna defines multiple workflows. First, users can implement new architectures and tasks, while benefiting from dnadna utility functions, training procedure and test environment. Second, the implemented networks can be re-optimized based on user-specified training sets / tasks.





□ Active Learning for Efficient Analysis of High-throughput Nanopore Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac764/6851141

This work applies several advanced active learning technologies to the nanopore data, including the RNA classification dataset (RNA-CD) and the Oxford Nanopore Technologies barcode dataset (ONT-BD).

Due to the complexity of the nanopore data (with noise sequence), the bias constraint is introduced to improve the sample selection strategy in active learning. Active learning technology can assist experts in labeling samples, and significantly reduce the labeling cost.





□ NanoTrans: an integrated computational framework for comprehensive transcriptome analyses with Nanopore direct-RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518309v1

Nanopore direct-RNA sequencing (DRS) provides the direct access to native RNA strands with full-length information, shedding light on rich qualitative and quantitative properties of gene expression profiles.

NanoTrans, an integrated computational framework that comprehensively covers all major DRS-based application scopes, including isoform clustering and quantification, poly(A) tail length estimation, RNA modification profiling, and fusion gene detection.





□ NanoPack2: Population scale evaluation of long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.28.518232v1

NanoPack now offers tools ready for the evaluation of large populations with implementations in a more performant programming language, with a focus on features relevant to long-read sequencing.

In this manuscript, NanoPack presents newly developed tools that fulfill this need and efficiently assess characteristics specifically relevant to long-read genome sequencing, including alignments spanning structural variants and phasing read alignments.

Phasing, i.e. assigning each sequenced fragment to a parental haplotype by identifying co-occurring variants, is important in identifying potential functional variants in association studies and for the pathogenicity of putative compound heterozygous variation.





□ NOMAD+: Unsupervised reference-free inference reveals unrecognized regulated transcriptomic complexity in human single cells

>> https://www.biorxiv.org/content/10.1101/2022.12.06.519414v1

NOMAD+, a new analytic method that performs unified, reference-free statistical inference directly on raw sequencing reads, extending the core NOMAD algorithm to include a micro-assembly and interpretation framework.

NOMAD+ discovers broad and new examples of transcript diversification in single cells that are impossible to find with current algorithms, bypassing genome alignment and requiring no cell type metadata. NOMAD+ simultaneously discovers diversification in centromeric RNA expression.





□ SCExecute: custom cell barcode-stratified analyses of scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac768/6854977

SCExecute can be restricted to specific genomic regions and can limit the number of generated scBAMs. SCExecute can be configured to use cleaned up cell barcodes, raw cell barcodes, to use a list of acceptable cell barcodes, or all cell-barcodes found in the BAM file.

Demonstrating SCExecute w/ variant callers designed for bulk (DNA-)sequencing data to identify sceSNVs. SceSNVs from 10xGenomics are vastly understudied, as traditional variant callers estimate quality metrics, incl. allele frequency / genotype confidence, based on all reads.





□ Mathematical model of the cell signaling pathway based on the extended Boolean network model with a stochastic process

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05077-z

A new mathematical model of cell signaling pathways based on the extended Boolean method with the Waller–Kraft operator and a stochastic process. The model was employed to simulate the mitogen-activated protein kinase (MAPK) signaling pathway.

In the model, the activity of proteins in the pathway is regulated by a Boolean function, which is determined by the weights of protein–protein interactions. The model also considers the effect of stochastic factors of protein self-activity on signaling transduction.





□ Transfer learning for genotype–phenotype prediction using deep learning models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05036-8

Any algorithm, such as TCA, CORAL, 1DCNN, or SVC, can also be used for transfer learning, and these algorithms may yield higher accuracy when transferring knowledge. So, in the model section, any number of algorithms can be employed without affecting the methodology.

Transfer learning with deep transfer learning. The time to train the model on a large population's genotype is O(E * (T1 + T2 + ... + TN)). When transferring knowledge from a large population, one must decide the number of trainable and non-trainable layers.

If the number of trainable layers is 0, the final computation time would be O(E * (T1 + T2 + ... + TN)). If t layers are trainable, the actual computation time would be O(E * (T1 + T2 + ... + TN)) + O(E * (TN + TN-1 + ... + Tt)), where t is the number of trainable layers from bottom to top.





□ Scalable transcriptomics analysis with Dask: applications in data science and machine learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05065-3

The simplicity of Dask greatly reduces the barrier to entry for analysts that are new to distributed and parallel computing. The Dask framework combines blocked algorithms with task scheduling to achieve parallel and out-of-core computation.

Dask minimizes the changes required to port pre-existing code. Dask can scale several tasks commonly performed in the preprocessing of scRNA-seq data. Dask can improve the performance of transcriptomics data analysis and scale computation beyond the usual limits.





□ Persistent memory as an effective alternative to random access memory in metagenome assembly

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05052-8

Exploring the possibility of using Persistent Memory (PMem) as a less expensive substitute for dynamic random access memory (DRAM) to reduce OOM and increase the scalability of metagenome assemblers.

PMem can enable metagenome assemblers on terabyte-sized datasets by partially or fully substituting DRAM. Depending on the configured DRAM/PMEM ratio, running assemblies with PMem can achieve a similar speed as DRAM, while in the worst case it showed a roughly two-fold slowdown.





□ Secuer: Ultrafast, scalable and accurate clustering of single-cell RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010753

Secuer, a Scalable and Efficient speCtral clUstERing algorithm for scRNA-seq data. By employing an anchor-based bipartite graph representation algorithm, Secuer enjoys reduced runtime and memory usage over one order of magnitude for datasets with more than 1 million cells.

Secuer pivots p anchors and constructs a weighted bipartite graph by a modified approximate k-nearest neighbor algorithm. Secuer determines the weights of the bipartite graph by a scaled Gaussian kernel function to capture the geometry of the cell-to-anchor similarity network.





□ Mean Dimension of Generative Models for Protein Sequences

>> https://www.biorxiv.org/content/10.1101/2022.12.12.520028v1

The log probability log p(s) of a sequence s in a model can be expanded into terms of different orders. Under some assumptions on the expansion, the corresponding variance under the uniform distribution can be decomposed into contributions of different orders as well.

The mean dimension is then defined as the average of orders under weights that correspond to contributions of orders to the total variance. The contribution of an order to the variance is proportional to the sum of squared interaction coefficients of that order.
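
One hedged way to write this down is given below. The notation is assumed here for illustration and may differ from the preprint's: $u$ ranges over subsets of the $L$ sequence positions, $f_u$ is the expansion term that depends only on positions in $u$, and $\sigma_k^2$ is the variance contributed by all terms of order $k$.

```latex
\log p(s) = \sum_{u \subseteq \{1,\dots,L\}} f_u(s_u),
\qquad
\sigma^2 = \operatorname{Var}_{s \sim \mathrm{Unif}}\bigl[\log p(s)\bigr] = \sum_{k=1}^{L} \sigma_k^2,
\qquad
\sigma_k^2 = \sum_{|u| = k} \operatorname{Var}\bigl[f_u(s_u)\bigr]

D = \sum_{k=1}^{L} k \,\frac{\sigma_k^2}{\sigma^2}
```

As a made-up numerical example, if orders 1, 2 and 3 contribute 60%, 30% and 10% of the total variance, then $D = 1(0.6) + 2(0.3) + 3(0.1) = 1.5$. A purely site-independent model would have $D = 1$.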





□ Nanophase: Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes

>> https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-022-01415-8

Although Nanopore sequencing has difficulty fully characterizing long homopolymer regions, introducing insertion/deletion errors, the continuous improvement of sequencing accuracy, throughput and theoretically unlimited read length empower efficient genome reconstruction.

NanoPhase uses metaFlye to assemble filtered Nanopore long reads to generate assemblies. Then MetaBAT2 and MaxBin2 integrated w/ the coverage information were adopted to reconstruct two candidate genome sets, followed by the bin refinement step of MetaWRAP to generate draft bins.





□ STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02826-4

STRling, software capable of detecting both novel and reference STR expansions, including pathogenic STR expansions. It calls alleles both within the read length and greater than the read length. It is capable of accurately detecting the genomic position of expansions.

STRling can detect STR expansions that are annotated in the reference genome. STRling uses k-mer counting to recover mis-mapped STR reads. It then uses soft-clipped reads to precisely discover the position of the STR expansion in the reference genome.





□ Pseudoalignment tools as an efficient alternative to detect repeated transposable elements in scRNAseq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac737/6909008

Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases and, therefore, bypassing the multiple-mapping issues related to TE detection by conventional alignment tools.

It does so by creating an index through a transcriptome de Bruijn Graph (t-DBG) where nodes are k-mers. Reads are hashed and pseudoaligned to a transcript based on the intersection of their k-compatibility classes.
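
The intersection step can be sketched with a plain dictionary. This is a simplification, not kallisto's implementation: a hash map from k-mer to transcript set stands in for the t-DBG, a k-mer absent from the index is treated as compatible with no transcript, and the transcript names and sequences are made up.

```python
def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts containing it
    (its k-compatibility class)."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index

def pseudoalign(read, index, k):
    """Intersect the compatibility classes of all k-mers in the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript can explain every k-mer seen so far
    return compatible or set()

# Hypothetical transcripts; t1 and t2 share a prefix, as repeated elements would.
transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACCTT", "t3": "TTTTGGGCA"}
index = build_index(transcripts, 5)

shared = pseudoalign("ACGTAC", index, 5)
unique = pseudoalign("GTACGG", index, 5)
unknown = pseudoalign("GGGGGG", index, 5)
```

The read from the shared prefix is reported as compatible with both `t1` and `t2` instead of being forced onto one of them, which is how the multiple-mapping problem is sidestepped.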





□ Strobealign: flexible seed size enables ultra-fast and accurate read alignment

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02831-7

strobealign is a fast short-read aligner. It achieves the speedup by using a dynamic seed size obtained from syncmer-thinned strobemers. strobealign is multithreaded, aligns single-end and paired-end reads, and outputs mapped reads either in SAM format or PAF format.

The main idea of the seeding approach is to create fuzzy seeds by first computing open syncmers from the reference sequences, then linking the syncmers together using the randstrobe method with two syncmers.
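
A toy version of this two-stage seeding is sketched below. It is not strobealign's code. Lexicographic order stands in for the hash ordering of s-mers, the offset `t`, the window bounds and the CRC32-based scoring of syncmer pairs are all assumptions made for illustration.

```python
import zlib

def open_syncmers(seq, k, s, t=0):
    """Return (position, k-mer) pairs whose smallest s-mer starts at offset t."""
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # Lexicographic order replaces the hash ordering a real tool would use.
        if smers.index(min(smers)) == t:
            selected.append((i, kmer))
    return selected

def link_randstrobes(syncmers, w_min, w_max):
    """Pair each syncmer with one downstream syncmer picked by a pseudo-random score."""
    seeds = []
    for i, (pos1, first) in enumerate(syncmers):
        window = syncmers[i + w_min:i + w_max + 1]
        if not window:
            break  # no downstream syncmers left to link to
        # The second strobe is the candidate with the lowest pair score.
        pos2, second = min(window, key=lambda c: zlib.crc32((first + c[1]).encode()))
        seeds.append((pos1, pos2, first + second))
    return seeds

seq = "ACGTTGCATGCCAGTACGATCGATTGCA"  # made-up reference fragment
syncmers = open_syncmers(seq, k=6, s=3)
seeds = link_randstrobes(syncmers, w_min=1, w_max=3)
```

Because the second strobe can sit at a variable distance from the first, the combined seed spans a variable amount of sequence, which is one way to read the "dynamic seed size" in the summary above.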





□ CS-CORE: Cell-type-specific co-expression inference from single cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520181v1

CS-CORE estimates cell-type-specific co-expressions, built on a general expression-measurement model that explicitly accounts for sequencing depth variations and measurement errors in the observed single cell data.

CS-CORE models the unobserved true gene expression levels as latent variables, linked to the observed UMI counts through a measurement model that accounts for both sequencing depth variations and measurement errors.





□ multiGroupVI: Disentangling shared and group-specific variations in single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520349v1

multi-Group Variational Inference (multiGroupVI), a DGM that explicitly decomposes the gene expression patterns in scRNA-seq data into shared and group-specific factors of variation.

multiGroupVI models the variations underlying the data using gamma + 1 sets of latent variables: Group-specific encoders embed cells into group-specific latent spaces. For a cell from a given group γ, the latent variables for other groups γ′ ≠ γ are fixed to be zero vectors.





□ TASSEL: Merging short and stranded long reads improves transcript assembly

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520317v1

TASSEL (Transcript Assembly using Short and Strand Emended Long reads), that merges qualitative features of stranded long reads w/ the quantitative depth of short-read sequencing. TASSEL outperforms other assembly in terms of sensitivity / complete assembly on the correct strand.

TASSEL resulted in substantially improved capture of key transcriptomic features such as transcription start and termination sites as well as better enrichment of active histone marks and RNA Pol II. TASSEL TSSs are a better indicator of active TSSs than StringTie Mix TSSs.





□ NanopoReaTA: a user-friendly tool for nanopore-seq real-time transcriptional analysis

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520220v1

NanopoReaTA provides biologically relevant snapshots of the sequencing run, which in turn can enable interactive fine-tuning of the run itself, facilitate decisions to abort the run when sufficient accuracy is achieved, or accelerate the resolution of clinical cases.

NanopoReaTA focuses on the analysis of cDNA and direct RNA-sequencing reads and covers the steps up to the final visualization of results from, e.g., differential expression or gene body coverage. NanopoReaTA can be run in real time right after starting a run via MinKNOW.





□ Insane in the vembrane: filtering and transforming VCF/BCF files

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac810/6909012

vembrane is a new filtering tool for all versions of the VCF and BCF formats. vembrane consolidates and extends the functionality of previously available tools and uses standard Python syntax, while achieving very good processing speed.

vembrane is the first tool to comprehensively handle breakend variants (BNDs): BNDs are a way of encoding structural variants by grouping two or more genomic breakpoints into a joint structural variant event. vembrane thus needs to ensure that each event is removed or kept as a whole.
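
A minimal sketch of the "keep or remove as a whole" requirement, on plain dictionaries instead of real VCF records. The rule used here (an event is kept if at least one of its breakends passes) is an assumption for illustration, and `filter_keeping_events_whole` is a hypothetical function, not vembrane's internal code.

```python
def filter_keeping_events_whole(records, passes):
    """Filter records while treating every breakend event as a unit.

    Each record is a dict; its "event" key is None for ordinary variants
    and an event identifier for breakends that belong together.
    """
    # First pass: find events with at least one passing breakend.
    kept_events = {
        r["event"] for r in records if r.get("event") is not None and passes(r)
    }
    # Second pass: keep passing records plus every mate of a kept event.
    return [r for r in records if passes(r) or r.get("event") in kept_events]
```

The two passes are needed because a breakend that fails the filter can appear before the mate that rescues its event.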





□ EquiPPIS: E(3) equivariant graph neural networks for robust and accurate protein-protein interaction site prediction

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520476v1

EquiPPIS converts the input protein monomer into an undirected graph 𝒢 = (𝒱,E), with 𝒱 denoting the residues (nodes) and E denoting the interaction between nonsequential residue pairs according to their pairwise spatial proximity.

EquiPPIS uses a deep E(3) equivariant graph neural network that conducts a series of transformations of its input through a stack of equivariant graph convolution layers (EGCLs).

A sigmoidal function is applied to the last EGCL node embedding to predict the probability that each residue in the input monomer is a PPI site, thereby converting PPI site prediction into a graph node classification task.
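
The graph construction and the final sigmoid can be sketched as follows. The distance cutoff, the reading of "nonsequential" as a sequence separation of at least 2, and both function names are assumptions for illustration. The equivariant layers themselves are not reproduced here.

```python
import math


def build_edges(coords, cutoff, min_separation=2):
    """Undirected edges between residues that are close in space but not
    adjacent in sequence (sequence separation >= min_separation)."""
    edges = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
    return edges


def sigmoid(score):
    """Map a per-node score to a probability of being a PPI site."""
    return 1.0 / (1.0 + math.exp(-score))
```

Each residue then gets its own probability, which is what turns the problem into node classification.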





□ Mirage2's high-quality spliced protein-to-genome mappings produce accurate multiple-sequence alignments of isoforms

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520492v1

Mirage2 retains the fundamental algorithms of the original Mirage implementation while benefiting from a substantial overhaul of several core components, resulting in software that improves the results of translated mapping and records informative intermediate outputs.

Isoforms are first mapped back to their coding exons. Once all isoforms within a gene family have been mapped, those genome mapping coordinates serve as the basis for intra-species alignment, resulting in an MSA with explicit splice site awareness and exon delineation.





□ Unsupervized identification of prognostic copy-number alterations using segmentation and lasso regularization

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520497v1

Using Fisher's noncentral hypergeometric distribution to model survival with a segmentation model avoids the high dependency issue of univariate testing and almost always identifies all regions, but suffers from the difficulty of selecting the correct number of segments.
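
For reference, the distribution named above has a simple closed form that can be computed directly. This sketch only evaluates its probability mass function. How the paper ties it to survival and segmentation is not shown, and the parameter names are illustrative.

```python
from math import comb


def fisher_nchg_pmf(x, m1, m2, n, omega):
    """P(X = x) when n items are drawn from m1 items of type 1 and m2 items
    of type 2, with odds ratio omega favouring type 1."""
    lowest, highest = max(0, n - m2), min(n, m1)
    if not lowest <= x <= highest:
        return 0.0

    def weight(y):
        return comb(m1, y) * comb(m2, n - y) * omega ** y

    return weight(x) / sum(weight(y) for y in range(lowest, highest + 1))
```

With omega = 1 this reduces to the ordinary hypergeometric distribution; omega different from 1 shifts probability toward or away from type 1.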

Combining this approach with Lasso-penalized selection significantly improves the ability to recover true regions of interest. Surprisingly, downscaling the data to wider bins seemed to affect only the performance of methods using lasso regularization.

Combining a segmentation approach to create initial meta-regions of similar prognostic impact with a lasso-regularization scheme to select the significant ones provided the best results, especially in the smallest scale situation.





□ PyDESeq2: a python package for bulk RNA-seq differential expression analysis

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520412v1

PyDESeq2 implements differential expression analysis (DEA), which consists of modeling raw counts using a negative binomial distribution. Dispersion parameters are estimated independently for each gene by fitting a negative binomial generalized linear model (GLM), and shrunk towards a global trend curve.
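
As a rough illustration of gene-wise dispersion estimation, the sketch below fits the dispersion of one gene by maximizing a negative binomial likelihood over a grid. It assumes an intercept-only model with equal size factors and the parametrization variance = mu + alpha * mu^2. It is not the PyDESeq2 implementation and skips the shrinkage step.

```python
import math


def nb_logpmf(k, mu, alpha):
    """Log-probability of count k under a negative binomial with mean mu
    and dispersion alpha (variance = mu + alpha * mu**2)."""
    r = 1.0 / alpha
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )


def fit_dispersion(counts):
    """Grid-search maximum-likelihood dispersion for one gene, with the
    mean fixed at the sample mean (hypothetical helper)."""
    mu = sum(counts) / len(counts)
    if mu <= 0:
        raise ValueError("gene has no counts")
    grid = [10 ** (-3 + 0.1 * i) for i in range(51)]  # 1e-3 .. 1e2
    return max(grid, key=lambda a: sum(nb_logpmf(k, mu, a) for k in counts))
```

A gene with bursty counts gets a larger dispersion than a gene whose counts barely vary around the mean.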

PyDESeq2 returns sets of significant genes and pathways very similar to those of the original DESeq2, while achieving better likelihood for dispersion and log-fold change (LFC) parameters on a vast majority of genes, at comparable speeds.

PyDESeq2 is structured around two classes of objects: a DeseqDataSet class, handling data-modeling steps from normalization to LFC fitting, and a DeseqStats class for statistical tests and optional LFC shrinkage.





□ LegNet: resetting the bar in deep learning for accurate prediction of promoter activity and variant effects from massive parallel reporter assays

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521582v1

LegNet is an EfficientNetV2-based fully convolutional neural network employing several domain-specific ideas and improvements to reach accurate expression modeling and prediction from a DNA sequence.

LegNet was trained to predict not a single expression value but a vector of expression bin probabilities. At the model evaluation stage, the predicted probabilities are multiplied by bin numbers to convert the vector into a single predicted expression value.
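
The conversion step amounts to a probability-weighted sum over bins. A minimal sketch, assuming bins are numbered from 0 and that the weighted terms are summed (`expected_expression` is a hypothetical name, not taken from the LegNet code):

```python
def expected_expression(bin_probabilities):
    """Collapse a vector of bin probabilities into one value by weighting
    each bin number with its predicted probability and summing."""
    return sum(index * p for index, p in enumerate(bin_probabilities))
```

For three bins with probabilities 0.1, 0.2 and 0.7, the prediction lands between bins 1 and 2, closer to 2.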





□ Best: A Tool for Characterizing Sequencing Errors

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521488v1

Best (Bam Error Stats Tool) is a tool for characterizing sequencing errors using a reference assembly. best builds upon bamConcordance, a Python script published in Wenger et al.

best is written in Rust and quantifies sequencing errors based on alignments to a reference assembly. At its core, best iterates through reads aligned to a high-quality reference assembly, counts the number and types of errors, and aggregates these values into multiple outputs.
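
The counting step can be illustrated on a single alignment. This sketch assumes extended CIGAR strings that use "=" and "X", counts errors in bases, and defines concordance as matches over all aligned and gapped bases. These are illustrative choices, not necessarily the definitions used by best.

```python
import math
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
OP_NAMES = {"=": "match", "X": "mismatch", "I": "insertion", "D": "deletion"}


def count_errors(cigar):
    """Count matched, mismatched, inserted and deleted bases in one CIGAR."""
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    for length, op in CIGAR_OP.findall(cigar):
        if op == "M":
            raise ValueError("M does not distinguish matches from mismatches")
        if op in OP_NAMES:  # clipping, skips and padding are ignored
            counts[OP_NAMES[op]] += int(length)
    return counts


def concordance(counts):
    """Fraction of matching bases among all counted bases."""
    return counts["match"] / sum(counts.values())


def phred(concordance_value):
    """Express concordance as a Phred-scaled quality value."""
    if concordance_value >= 1.0:
        return math.inf
    return -10 * math.log10(1 - concordance_value)
```

Summing these per-read counts over a whole BAM file would give the kind of aggregate error profile described above.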