lens, align.

Long is the time, but what is true comes to pass.

We were Once Kings.

2023-03-31 03:33:33 | Science News

(Photo by Joanne Hollings)




□ TXGNN: Zero-shot prediction of therapeutic use with geometric deep learning and clinician centered design

>> https://www.medrxiv.org/content/10.1101/2023.03.19.23287458v1

TXGNN is a graph neural network pre-trained on a comprehensive knowledge graph of 17,080 clinically-recognized diseases and 7,957 therapeutic candidates. The model can process various therapeutic tasks, such as indication and contraindication prediction, in a unified formulation.

TXGNN can perform zero-shot inference on new diseases without additional parameters or fine-tuning on ground truth labels. TXGNN uses a metric learning module that operates on the latent representation space.

TXGNN transforms points in the latent space representing the candidate and disease into predictions about their relationship. TXGNN obtains a disease signature vector for each disease based on the set of neighboring proteins, exposures, and other biomedical entities.





□ NeuLay: Accelerating network layouts using graph neural networks

>> https://www.nature.com/articles/s41467-023-37189-2

The NeuLay algorithm, a Graph Neural Network (GNN) developed to parameterize node features, significantly improves both the speed and the quality of graph layouts, opening up the possibility to quickly and reliably visualize large networks.

NeuLay allows the use of GNN architectures other than GCN, such as graph attention networks. NeuLay encodes the graph structure with a graph neural network that maps the adjacency matrix to the node positions. NeuLay-2, which uses two GCN layers, shows the fastest convergence of the energy.





□ Con-AAE: Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad162/7091469

Con-AAE (Contrastive cycle adversarial Autoencoders) aims to integrate and align multi-omics data at the single-cell level. The contrastive loss minimizes the distance between positive pairs and maximizes the distance between negative pairs.

Con-AAE uses two autoencoders to map data from two modalities into two low-dimensional manifolds under the constraint of an adversarial loss, trying to develop representations for each modality that are separated but cannot be identified by an adversarial network in a coordinated subspace.
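
The contrastive term described above can be sketched as follows. This is a minimal sketch that assumes the classic margin-based contrastive loss on Euclidean distances; the exact loss used by Con-AAE may differ, and the function names and the `margin` parameter are illustrative only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, is_positive, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings (assumed form).

    Positive pairs are pulled together (loss grows with distance);
    negative pairs are pushed apart until they are at least `margin` away.
    """
    d = euclidean(u, v)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical 2-D embeddings of one cell in two modalities, plus an unrelated cell.
rna_cell = [0.0, 0.0]
atac_same_cell = [0.3, 0.4]
atac_other_cell = [0.6, 0.8]

positive_term = contrastive_loss(rna_cell, atac_same_cell, is_positive=True)
negative_term = contrastive_loss(rna_cell, atac_other_cell, is_positive=False, margin=2.0)
```

A negative pair that is already farther apart than the margin contributes nothing, so training effort goes to pairs that are still confusable.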





□ Phenonaut; multiomics data integration for phenotypic space exploration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad143/7082955

Phenonaut is a framework for applying workflows to multi-omics data. It originally targeted high-content imaging and the exploration of phenotypic space, with different visualisations and metrics.

Phenonaut runs are accompanied by cryptographic hashes proving reported inputs. Phenonaut now operates in a data-agnostic manner, allowing users to describe their data (multi-view/multi-omics) and apply a series of generic or specialised data-centric transforms.





□ Accurate Flow Decomposition via Robust Integer Linear Programming

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533019v1

This work presents a new ILP formulation of the flow decomposition problem that handles edge weights that do not form a flow. It enables a macroscopic management of errors by attaching an error to each solution path instead of each edge.

This formulation defines the minimum path-error flow decomposition problem as the problem of finding a set of weighted paths with associated error variables, such that the superposition difference of each edge is within the sum of the error variables of the paths using the edge.
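
The feasibility condition in this definition can be checked directly. The sketch below is not the ILP itself: it only verifies, for a given candidate solution, that on every edge the gap between the observed weight and the superposition of path weights is covered by the summed error variables of the paths using that edge. The function name and the toy graph are illustrative assumptions, and paths are assumed to be simple (no repeated edges).

```python
from collections import defaultdict

def satisfies_path_errors(edge_weights, paths, weights, errors):
    """Return True if |f(e) - sum of path weights on e| <= sum of path errors on e
    holds for every edge e in `edge_weights`."""
    superposition = defaultdict(float)  # total weight of the paths crossing each edge
    slack = defaultdict(float)          # total error budget of the paths crossing each edge
    for path, w, rho in zip(paths, weights, errors):
        for edge in zip(path, path[1:]):
            superposition[edge] += w
            slack[edge] += rho
    return all(
        abs(f - superposition[edge]) <= slack[edge]
        for edge, f in edge_weights.items()
    )

# Toy input: 5 units enter node "a" but 6 leave it, so the weights do not form a flow.
edge_weights = {("s", "a"): 5, ("a", "t"): 6, ("s", "t"): 3}
paths = [["s", "a", "t"], ["s", "t"]]
ok = satisfies_path_errors(edge_weights, paths, weights=[5, 3], errors=[1, 0])
```

With an error of 1 attached to the first path, the mismatch on edge ("a", "t") is absorbed; with all errors set to 0 the same paths would be rejected.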





□ multiWGCNA: an R package for deep mining gene co-expression networks in multi-trait expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05233-z

multiWGCNA is a WGCNA-based procedure that leverages the multidimensionality of experimental designs to study co-expression networks across variable conditions, such as space or time.

multiWGCNA generates a network for each condition separately, then maps the resulting modules across designs and performs relevant downstream analyses, including module-trait correlation and module preservation.





□ GVC: efficient random access compression for gene sequence variations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05240-0

The Genomic Variant Codec (GVC) is a novel approach for compressing gene sequence variations with random access capability. The genotypes are extracted from a VCF file and divided into blocks. Each block represents the genotypes of all samples in a certain range of loci in a chromosome.

GVC uses two alternative binarization approaches to decompose the allele matrix into a binary representation: bit plane binarization and row binarization. GVC uses the Hamming distance to measure the similarity between adjacent rows/columns. Each binary matrix is entropy-encoded.
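
To make the bit plane idea concrete, here is a minimal sketch that splits a small allele matrix into one binary matrix per bit and measures the Hamming distance between rows. It is an illustration under assumptions, not GVC's implementation: the function names are hypothetical, and how GVC arranges, reorders, or entropy-codes the planes is not shown.

```python
def bit_planes(allele_matrix):
    """Split a matrix of small non-negative integers into binary matrices,
    one per bit position (index 0 is the least significant bit)."""
    max_value = max(max(row) for row in allele_matrix)
    n_planes = max(1, max_value.bit_length())
    return [
        [[(value >> b) & 1 for value in row] for row in allele_matrix]
        for b in range(n_planes)
    ]

def hamming(row_a, row_b):
    """Number of positions at which two equally long rows differ."""
    return sum(a != b for a, b in zip(row_a, row_b))

# Hypothetical allele indices: rows are loci, columns are haplotypes.
alleles = [
    [0, 1, 2],
    [0, 1, 0],
    [2, 2, 1],
]
planes = bit_planes(alleles)
```

Rows with a small Hamming distance are natural candidates to place next to each other before entropy coding, since long runs of identical bits compress well.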





□ SoCube: an innovative end-to-end doublet detection algorithm for analyzing scRNA-seq data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad104/7081128

Several doublet detection algorithms are currently available, but their generalization performance could be further improved due to the lack of effective feature-embedding strategies with suitable model architectures.

SoCube proposed a novel 3D composite feature-embedding strategy that embedded latent gene information and constructed a multikernel, multichannel CNN-ensembled architecture in conjunction with the feature-embedding strategy.





□ OASIS: An interpretable, finite sample valid alternative to Pearson's X2 for scientific discovery

>> https://www.biorxiv.org/content/10.1101/2023.03.16.533008v1

OASIS (Optimized Adaptive Statistic for Inferring Structure) constructs a test-statistic which is linear in the normalized data matrix, providing closed form p-value bounds through classical concentration inequalities.

OASIS computes a bilinear form of residuals. OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. The finite-sample bounds correctly characterize the p-value bound derived up to a variance term.





□ AIM: A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad155/7087101

Alignment-in-Memory (AIM) is a framework for PIM-based sequence alignment that targets the UPMEM system. AIM dispatches a large number of sequence pairs across different memory modules and aligns each pair using compute cores within the memory module where the pair resides.

AIM supports multiple alignment algorithms including NW, SWG, GenASM, WFA, and WFA-adaptive. Each algorithm has alternate implementations that manage the UPMEM memory hierarchy differently and are suitable for different read lengths.





□ scQA: Clustering scRNA-seq data via qualitative and quantitative analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.25.534232v1

scQA (an architecture for clustering Single-Cell RNA-seq data based on Qualitative and Quantitative Analysis) efficiently clusters cells at various scales based on so-called landmarks, each of which indicates the consensus of genes with similar expression patterns.

scQA constructs the consensus vector of genes whose qualitative expression across certain cells follows a similar trend (quasi-trend-preserved genes). To identify distinct cell types, scQA then analyzes the characteristics of the ID landmarks both internally and externally.





□ SpaceWalker: Interactive Gradient Exploration for Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2023.03.20.532934v1

Intrinsic dimensionality can guide the user to anatomically distinct regions: changes in local intrinsic dimensionality in many cases mirror transitions between cell subclasses.

SpaceWalker consists of two key innovations: an interactive, real-time flood-fill and spatial projection of the local topology of the high-dimensional space, and a gradient gene detector.





□ exFINDER: identify external communication signals using single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.03.24.533888v1

exFINDER analyzes the exSigNet by predicting signaling strength, calculating the maximal signal flow, clustering different ligand-target signaling paths, quantifying the signaling activities using the activation index, and evaluating the GO analysis outputs of exSigNet.





□ NOMAD2 provides ultra-efficient, scalable, and unsupervised discovery on raw sequencing reads

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533189v1

NOMAD2 rapidly identifies candidate RNA editing de novo, including detecting potentially hyperedited events, filling a gap in existing bioinformatic tools. Anchors are classified as “mismatch” when the two most abundant targets differ by single-base mismatches.

NOMAD2 enumerates all (a+g+t)-mers and sorts these sequences lexicographically with KMC-tools. All occurrences of a unique anchor are then adjacent, which enables efficient gap removal and collapsing of unique targets in the third step via a linear traversal over the (a+g+t)-mers.
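
The sort-then-scan idea can be sketched in a few lines. This is a toy illustration, assuming in-memory strings and Python's built-in `sorted` in place of KMC-tools; the function name is hypothetical and NOMAD2's actual data structures are far more compact.

```python
from itertools import groupby

def anchor_target_counts(reads, a, g, t):
    """Count targets per anchor by sorting all (a+g+t)-mers, then scanning once.

    Each k-mer is anchor (first `a` bases) + gap (`g` bases) + target (last `t` bases).
    """
    k = a + g + t
    kmers = sorted(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    counts = {}
    # After sorting, all k-mers that share an anchor sit next to each other,
    # so a single linear pass is enough to process each anchor exactly once.
    for anchor, group in groupby(kmers, key=lambda kmer: kmer[:a]):
        targets = {}
        for kmer in group:
            target = kmer[a + g:]  # drop the gap, keep only the target
            targets[target] = targets.get(target, 0) + 1
        counts[anchor] = targets
    return counts

# Hypothetical reads, each exactly one (2+1+2)-mer long.
reads = ["ACGTA", "ACCTA", "ACGGG"]
table = anchor_target_counts(reads, a=2, g=1, t=2)
```

Here the anchor "AC" is followed by the target "TA" twice (with different gap bases) and by "GG" once; the gap is discarded before counting.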





□ PWN: enhanced random walk on a warped network for disease target prioritization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05227-x

PWN (Prioritization with a Warped Network) uses the Forman–Ricci curvature instead of the Ollivier–Ricci curvature. PWN can be used for identifying the targets with properly given prior knowledge and gene scores.

PWN is designed to be an efficient variant of random walk with restart (RWR). PWN uses a weighted asymmetric network that is generated from an unweighted and undirected network. The weights come from two distinct features.

PWN is designed to manage the proportion of information circulating in and flowing out of certain regions by controlling the internal feature. PWN warps the network by assigning higher weights to prior knowledge-related edges.
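
For readers unfamiliar with random walk with restart, the sketch below runs plain RWR on a weighted, directed (asymmetric) network by power iteration. It is not PWN: the edge weights are taken as given, whereas PWN derives them from curvature and prior knowledge, and the function name, the `restart` value, and the handling of nodes without outgoing edges are assumptions made for illustration.

```python
def random_walk_with_restart(weights, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary RWR scores on a weighted directed graph.

    weights[u][v] is the weight of edge u -> v; seeds maps node -> prior mass.
    """
    nodes = sorted(set(weights) | {v for nbrs in weights.values() for v in nbrs})
    total = sum(seeds.values())
    p0 = {n: seeds.get(n, 0.0) / total for n in nodes}
    out_sum = {u: sum(weights.get(u, {}).values()) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            if out_sum[u] == 0:
                # Assumption: a node with no outgoing edge returns its mass to the seeds.
                for n in nodes:
                    nxt[n] += (1 - restart) * p[u] * p0[n]
                continue
            for v, w in weights[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out_sum[u]
        if sum(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical network: the edge A -> C carries three times the weight of A -> B.
network = {"A": {"B": 1.0, "C": 3.0}, "B": {"A": 1.0}, "C": {"A": 1.0}}
scores = random_walk_with_restart(network, seeds={"A": 1.0})
```

Raising the weight of an edge steers more of the walker's probability mass through it, which is the lever a warped network uses to favor edges related to prior knowledge.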





□ Multi-Omics Integration For Disease Prediction Via Multi-Level Graph Attention Network And Adaptive Fusion

>> https://www.biorxiv.org/content/10.1101/2023.03.19.533326v1

This framework involves constructing co-expression and co-methylation networks for each subject, followed by applying multi-level graph attention to incorporate biomolecule interaction information.

The true-class-probability strategy is employed to evaluate omics-level confidence for classification, and the loss is designed using an adaptive mechanism to leverage both within- and across-omics information.

The initial feature is generated by the multi-level Graph Attention Network for each type of omics data respectively. The decision feature of each type of omics data is generated by the TCP module. The decision features of each omics are concatenated into one fusion feature.





□ QADD: De Novo Drug Design by Iterative Multi-Objective Deep Reinforcement Learning with Graph-based Molecular Quality Assessment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad157/7085596

QADD designs a multi-objective deep reinforcement learning pipeline to generate molecules with multiple desired properties iteratively, where a graph neural network-based model for accurate molecular quality assessment on drug potentials is introduced to guide molecule generation.

QADD uses the Deep Q-Network, a value-based reinforcement learning method, to estimate the action-value function under different action selection strategies. Since it does not require a fixed-dimensional action space, it is particularly suitable for discontinuous space search.





□ Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs)

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533737v1

The authors recommend selecting a distance measure for SNP genotype data that does not give differing outcomes depending on the arbitrary choice of reference and alternate alleles, and considering which state should be treated as zero when applying binary distance measures to fragment presence-absence data.
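
A small numeric check shows why an arbitrary coding choice can matter. The sketch below assumes genotypes coded as 0/1/2 counts of one allele and compares two measures picked purely for illustration (Euclidean and Bray-Curtis); recoding the data as counts of the other allele leaves one distance unchanged and alters the other. The function names and genotype vectors are hypothetical.

```python
def euclidean(x, y):
    """Euclidean distance between two genotype vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum of absolute differences over sum of totals."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def swap_alleles(genotypes):
    """Recode 0/1/2 counts of one allele as counts of the other allele."""
    return [2 - g for g in genotypes]

# Two hypothetical individuals scored at four SNPs.
ind1 = [0, 0, 1, 2]
ind2 = [0, 1, 1, 2]

same_under_swap = euclidean(ind1, ind2) == euclidean(swap_alleles(ind1), swap_alleles(ind2))
bc_original = bray_curtis(ind1, ind2)
bc_swapped = bray_curtis(swap_alleles(ind1), swap_alleles(ind2))
```

The Euclidean distance depends only on the differences between individuals, so the recoding cancels out; the Bray-Curtis denominator depends on the absolute counts, so the result shifts with the coding.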





□ BSP: Dimension-agnostic and granularity-based spatially variable gene identification

>> https://www.biorxiv.org/content/10.1101/2023.03.21.533713v1

BSP (big-small patch) is a spatial granularity-guided, non-parametric model that identifies spatially variable genes (SVGs) from two- or three-dimensional spatial transcriptomics data in a fast and robust manner.

BSP selects a set of neighboring spots within a certain distance to capture the regional means with different granularities. The variances of the expression mean across all spots are then calculated under different scales, and genes with high ratios are identified as the SVGs.
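
The big-small patch idea can be illustrated with a toy computation. This is a sketch under assumptions, not the BSP implementation: it takes the ratio to be the variance of big-patch means over the variance of small-patch means, uses fixed radii on a 2D grid, and omits any normalization or significance testing. All names and values are hypothetical.

```python
from statistics import pvariance

def local_means(coords, values, radius):
    """Mean expression over all spots within `radius` of each spot (itself included)."""
    means = []
    for x0, y0 in coords:
        patch = [v for (x, y), v in zip(coords, values)
                 if (x - x0) ** 2 + (y - y0) ** 2 <= radius ** 2]
        means.append(sum(patch) / len(patch))
    return means

def patch_variance_ratio(coords, values, small=1.0, big=2.0):
    """Variance of big-patch means divided by variance of small-patch means (assumed form)."""
    var_small = pvariance(local_means(coords, values, small))
    var_big = pvariance(local_means(coords, values, big))
    return var_big / var_small

# A 6 x 6 grid of spots with two hypothetical genes.
coords = [(x, y) for x in range(6) for y in range(6)]
regional_gene = [1.0 if x < 3 else 0.0 for x, y in coords]    # expressed in one half
checkerboard_gene = [float((x + y) % 2) for x, y in coords]   # no regional structure

regional_ratio = patch_variance_ratio(coords, regional_gene)
checkerboard_ratio = patch_variance_ratio(coords, checkerboard_gene)
```

Averaging over a bigger patch smooths away spot-to-spot alternation much faster than it smooths a regional pattern, so the regionally expressed gene keeps a higher ratio than the checkerboard gene.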





□ Capturing Spatiotemporal Signaling Patterns in Cellular Data with Geometric Scattering Trajectory Homology

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533807v1

GSTH is a general framework that encapsulates time-lapse signals on a cell adjacency graph in a low-dimensional trajectory. GSTH integrates geometric scattering and topological data analysis (TDA) to provide a comprehensive understanding of complex cellular interactions.

Geometric scattering employs wavelet-based transformations to extract multiscale representations of the signaling data, capturing the intricate hierarchical structures present in the spatial organization of cells and the temporal evolution of signaling events.





□ Ensemble-GNN: federated ensemble learning with graph neural networks for disease module discovery and classification

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533772v1

Ensemble-GNN allows users to quickly build predictive models utilizing PPI networks consisting of various node features such as gene expression and/or DNA methylation.

Ensemble-GNNs were combined into a global federated model. In the federated case, each client has its own dedicated data, on which a GNN classifier is trained. The trained models of the ensembles are shared among all clients, and predictions are again made via majority vote.
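
The final aggregation step is simple enough to sketch. The example below assumes each shared model has already produced one class label per sample; the function name is hypothetical, and ties are broken by whichever label was seen first, which is a choice made here rather than something stated in the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    predictions[m][i] is the label that model m assigns to sample i.
    """
    voted = []
    for labels in zip(*predictions):
        # most_common(1) returns the most frequent label; ties go to the first one seen.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical predictions from three client models on three samples.
client_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
]
consensus = majority_vote(client_predictions)
```

Because only trained models and their votes are exchanged, no client needs to see another client's raw data.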





□ Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad151/7085594

Scrooge is a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and number of operations in the GenASM algorithm.

GenASM-DC uses only cheap bitwise operations to calculate the edit distance between two strings, text and pattern. It builds an (n+1)×(k+1) dynamic programming (DP) table R, where n=length(text) and k is the maximum number of edits considered.
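
To show how bitwise operations can stand in for a cell-by-cell DP, here is a sketch of the classic Wu-Manber style Bitap recurrence. It is written in the same spirit as GenASM-DC but is not that algorithm: bit conventions and traversal order may differ, only the current row of k+1 bitvectors is kept instead of the full table R, and the pattern is matched against any substring of the text. The function name is hypothetical.

```python
def bitap_min_edits(text, pattern, k):
    """Smallest d <= k such that `pattern` matches some substring of `text`
    with at most d edits (substitutions, insertions, deletions), else None."""
    m = len(pattern)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    # R[d] bit p is set when pattern[:p+1] matches a suffix of the text read so far
    # with at most d edits. Before any text is read, only deletions are possible.
    R = [((1 << d) - 1) & full for d in range(k + 1)]
    best = next((d for d in range(k + 1) if R[d] & accept), None)
    for ch in text:
        b = mask.get(ch, 0)
        prev_old = R[0]
        R[0] = ((R[0] << 1) | 1) & b
        for d in range(1, k + 1):
            old = R[d]
            R[d] = ((((old << 1) | 1) & b)       # characters match
                    | prev_old                   # extra character in the text
                    | ((prev_old << 1) | 1)      # substitution
                    | ((R[d - 1] << 1) | 1)      # pattern character skipped (R[d-1] already updated)
                    ) & full
            prev_old = old
        for d in range(k + 1):
            if R[d] & accept and (best is None or d < best):
                best = d
    return best

exact = bitap_min_edits("GATTACA", "TTAC", k=1)
one_edit = bitap_min_edits("GATTACA", "TTGC", k=1)
too_far = bitap_min_edits("GATTACA", "CCCC", k=1)
```

Each text character costs a handful of shifts, ANDs, and ORs per allowed edit, regardless of pattern length (as long as the pattern fits in the bitvector), which is what makes this family of algorithms attractive for hardware.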





□ Estimation of a treatment effect based on a modified covariates method with L0 norm

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533735v1

This work proposes two new treatment effect estimation approaches based on the modified covariate method, one using lasso regression and the other ridge regression, both combined with the L0 norm.

The modified covariate method is based on the L0 norm and the Lq norm (q = 1, 2). The first method estimates treatment effects using lasso regression with the L0 norm. The second method uses ridge regression with the L0 norm.





□ PENCIL: Supervised learning of high-confidence phenotypic subpopulations from single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.23.533712v1

PENCIL can perform gene selection during the training process, which allows learning proper gene spaces that facilitate accurate subpopulation identifications from single-cell data.

PENCIL has the flexibility to address various phenotypes such as binary, multi-category and continuous phenotypes. PENCIL can order cells to reveal the subpopulations undergoing continuous transitions between conditions.





□ xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2023.03.24.534055v1

xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today.

xTrimoGene proposes an asymmetric encoder-decoder framework that takes advantage of the sparse gene expression matrix, and establishes the projection strategy of continuous values with a higher resolution.





□ EnsInfer: a simple ensemble approach to network inference outperforms any single method

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05231-1

EnsInfer is an ensemble approach to the network inference problem: each individual network inference method works as a first-level learning algorithm that gives a set of predictions from the gene expression input.

EnsInfer uses a combination of state-of-the-art inference approaches and combines them using a simple Naive Bayes ensemble model. EnsInfer essentially turns all the predictions from different inference algorithms into priors about each edge in the network.





□ Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02899-9

Enformer was not trained specifically on GTEx or Cardoso-Moreira et al. data and does not directly give predictions for many human tissues. To match CAGE tracks to tissues and stages of development in a simple, yet data-driven, way, the authors fitted a ridge regression.

Enformer can predict endogenous RNA abundance very well and consistently outperforms previous models. Enformer substantially outperformed Basenji2 even when it is restricted to the latter model's input window and even on tasks where the receptive field size is irrelevant.





□ ElasticBLAST: accelerating sequence search via cloud computing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05245-9

ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs.

ElasticBLAST leverages the cloud to provide multiple worker nodes to parallelize the computation by breaking the queries into query batches. ElasticBLAST relies on BLAST DB metadata that is automatically generated to determine the amount of main memory needed for that database.
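
The batching step can be sketched as a simple greedy split. This is an illustration only: the function name and the `batch_len` parameter are hypothetical, and the real tool's batching rules and its memory sizing from BLAST DB metadata are not reproduced here.

```python
def make_query_batches(queries, batch_len):
    """Group (name, sequence) pairs into batches of roughly `batch_len` total residues.

    A sequence longer than `batch_len` gets a batch of its own rather than being split.
    """
    batches, current, current_len = [], [], 0
    for name, seq in queries:
        if current and current_len + len(seq) > batch_len:
            batches.append(current)
            current, current_len = [], 0
        current.append((name, seq))
        current_len += len(seq)
    if current:
        batches.append(current)
    return batches

# Hypothetical queries of length 4, 4, 10 and 2.
queries = [("q1", "ACGT"), ("q2", "TTGA"), ("q3", "ACGTACGTAC"), ("q4", "GG")]
batches = make_query_batches(queries, batch_len=8)
batch_names = [[name for name, _ in batch] for batch in batches]
```

Each batch can then be handed to a separate worker node, so the number of batches bounds how far the search can be parallelized.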





□ SiPSiC: A novel method to accurately estimate pathway activity in single cells for clustering and differential analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534310v1

SiPSiC is a novel method for inferring pathway scores from scRNA-seq data. It shows high sensitivity, accuracy, and consistency with existing knowledge across different data types, including findings often missed by the original conventional analyses.

SiPSiC scores can be used to cluster the cells and compute their UMAP projections in a manner that better captures the biological underpinnings of tissue heterogeneity.





□ cnnLSV: detecting structural variants by encoding long-read alignment information and convolutional neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05243-x

cnnLSV can automatically adjust the images from different variants to a uniform size according to the length of each variant and the coverage of the dataset for training the filtering model.

cnnLSV converts the images in the training set into one-dimensional arrays, then applies principal component analysis and k-means clustering to eliminate incorrectly labeled images and improve the filtering performance of the model.





□ KGETCDA: an efficient representation learning framework based on knowledge graph encoder from transformer for predicting circRNA-disease associations

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534642v1

Knowledge Graph Encoder from Transformer for predicting CDA (KGETCDA) integrates more than 10 databases to construct a large heterogeneous non-coding RNA dataset, which contains multiple relationships between circRNA, miRNA, lncRNA and disease.

A biological knowledge graph is created based on this dataset and Transformer-based knowledge representation learning and attentive propagation layers are applied to obtain high-quality embeddings with accurately captured high-order interaction information.





□ C-DEPP: Scaling deep phylogenetic embedding to ultra-large reference trees: a tree-aware ensemble approach

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534201v1

Clustered-DEPP (C-DEPP) uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing twenty million 16S fragments on the GG2 reference tree in 41 hours of computation.

C-DEPP trains a separate model for each of several overlapping subtrees; for each query, C-DEPP uses a 2-level classifier to select one or more subtrees, computes distances using those subtrees, and uses these distances as input to APPLES-II, leaving the other distances blank.





□ simpleaf: A simple, flexible, and scalable framework for single-cell transcriptomics data processing using alevin-fry

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534653v1

simpleaf is a program that simplifies the processing of single-cell data using tools from the alevin-fry ecosystem, and adds new functionality and capabilities, while retaining the flexibility and performance of the underlying tools.

simpleaf quant will automatically recruit and parameterize the correct mapper, and will automatically locate and provide the file containing the transcript-to-gene mapping information to later quantification stages where appropriate.





□ Sequencing accuracy and systematic errors in nanopore direct RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2023.03.29.534691v1

The presence of the same systematic error patterns in RODAN points to more fundamental causes of errors in the raw signal data, necessitating further development of better pore chemistry to produce higher quality dRNA-seq data.

Clearly, further development of dRNA-seq protocols, pore chemistry and basecalling algorithms is desirable. Appropriate quality control and error correction methods are needed to mitigate the effects of high error rates and systematic biases in downstream analyses.



Dihedral.

2023-03-31 02:22:22 | Science News

(Art by ekaitza)

GPT models are affected by various factors such as the size of the training dataset and architecture, which may influence the Kolmogorov complexity. Simpler algorithms can compress complex data. The performance of the GPT model is expected to improve as its complexity increases.






□ Split-Transformer Impute (STI): Genotype Imputation Using a Transformer-Based Model

>> https://www.biorxiv.org/content/10.1101/2023.03.05.531190v3

The model utilizes attention to capture correlations among the SNPs/SNVs in the data. It achieves high imputation accuracy at a modest memory consumption cost by dividing the data into chunks, enabling efficient application to long sequences.

STI uses a Cat-Embedding layer to capture allele information per SNV. In conjunction with multi-headed attention layers, this enables STI to model correlations among SNVs and to impute missing values based on known and missing values per position.






□ Dagger Linear Logic and Categorical Quantum Mechanics

>> https://arxiv.org/abs/2303.14231

The existing frameworks of Categorical Quantum Mechanics (CQM) are categorical proof theories of compact dagger linear logic, and are motivated by the interpretation of quantum systems in the category of finite dimensional Hilbert spaces.

Mixed Unitary Categories is a novel non-compact framework. MUC is built upon linearly distributive categories and ∗-autonomous categories, which serve as categorical proof theories of non-compact multiplicative linear logic and can be applied to infinite dimensional systems.





□ AIBMD: Artificial Intelligence Boosted Molecular Dynamics

>> https://www.biorxiv.org/content/10.1101/2023.03.25.534210v1

In AIBMD, probabilistic Bayesian neural network models were used to construct boost potentials that exhibit Gaussian distribution with minimized anharmonicity for accurate energetic reweighting and enhanced sampling.

AIBMD has been demonstrated on model systems of the alanine dipeptide in explicit and implicit solvent, the chignolin fast-folding protein, and three hairpin RNAs with the GCAA, GAAA, and UUCG tetraloops.





□ Boolean Network Sketches: A Unifying Framework for Logical Model Inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad158/7099622

Boolean network sketch starts with an initial sketch that corresponds to the prior literature-based knowledge only. Subsequently, it is extended by adding restrictions representing experimental data resulting in the data-informed sketch.

A Boolean network sketch integrates partial knowledge about the network’s topology and the update logic, as well as dynamical restrictions representing knowledge or assumptions about the properties of the network’s transitions (e.g., attractor landscape), and restrictions on the model dynamics.





(Art by jaanus03)

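People are born of people, resembling people, and jostle together aboard a ship they happened to share. As if reading the stars, they give meaning to similar signs and come to feel that only the direction in which the wind drives them is certain. Whatever one finds, whatever one feels, whatever one tries to accomplish, one learns that none of it can be held back from the wind that sweeps everything away. Everyone dissolves, forgetting everyone's name.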



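□ A warning from a top StackOverflow engineer: if we keep depending on GPT-4, we "risk drinking from a dry riverbed." The question is whether a phase of knowledge reproduction will still be possible. In fact, some also point out that Google's traffic is declining. Incidentally, the image quoted in the post is Hotel Juvet in Norway, which was used as a filming location for the AI-themed film "Ex Machina".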

Peter Nixey

I'm in the top 2% of users on StackOverflow. My content there has been viewed by over 1.7M people. And it's unlikely I'll ever write anything there again.

Which may be a much bigger problem than it seems. Because it may be the canary in the mine of our collective knowledge.

A canary that signals a change in the airflow of knowledge: from human-human via machine, to human-machine only. Don’t pass human, don’t collect 200 virtual internet points along the way.

StackOverflow is *the* repository for programming Q&A. It has 100M users & saves man-years of time & wig-factories-worth of grey hair every single day.

It is driven by people like me who ask questions that other developers answer. Or vice-versa. Over 10 years I've asked 217 questions & answered 77. Those questions have been read by millions of developers & had tens of millions of views.

But since GPT4 it looks less & less likely any of that will happen; at least for me. Which will be bad for StackOverflow. But if I'm representative of other knowledge-workers then it presents a larger & more alarming problem for us as humans.

What happens when we stop pooling our knowledge with each other & instead pour it straight into The Machine? Where will our libraries be? How can we avoid total dependency on The Machine? What content do we even feed the next version of The Machine to train on?

When it comes time to train GPTx it risks drinking from a dry riverbed. Because programmers won't be asking many questions on StackOverflow. GPT4 will have answered them in private. So while GPT4 was trained on all of the questions asked before 2021 what will GPT6 train on?

This raises a more profound question. If this pattern replicates elsewhere & the direction of our collective knowledge alters from outward to humanity to inward into the machine then we are dependent on it in a way that supersedes all of our prior machine-dependencies.

Whether or not it "wants" to take over, the change in the nature of where information goes will mean that it takes over by default.

Like a fast-growing Covid variant, AI will become the dominant source of knowledge simply by virtue of growth. If we take the example of StackOverflow, that pool of human knowledge that used to belong to us - may be reduced down to a mere weighting inside the transformer.

Or, perhaps even more alarmingly, if we trust that the current GPT doesn't learn from its inputs, it may be lost altogether. Because if it doesn't remember what we talk about & we don't share it then where does the knowledge even go?

We already have an irreversible dependency on machines to store our knowledge. But at least we control it. We can extract it, duplicate it, go & store it in a vault in the Arctic (as Github has done).

So what happens next? I don't know, I only have questions.

None of which you'll find on StackOverflow.





□ CONGAS+: A Bayesian method to infer copy number clones from single-cell RNA and ATAC sequencing

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535197v1

CONGAS+ is a Bayesian model that maps single-cell RNA and ATAC profiles generated from independent or multimodal assays onto the latent space of copy number clones. CONGAS+ is equipped with a shrinkage hyperparameter that can be used to weigh the evidence differently across RNA/ATAC.

CONGAS+ did retrieve complex subclonal architectures while providing a coherent mapping among ATAC and RNA, facilitating the study of genotype-phenotype mapping.






□ Reconstruction of Gene Regulatory Networks using sparse graph recovery models

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535294v1

The authors categorize graph recovery methods into four main types based on the underlying formulations: Regression-based, Graphical Lasso, Markov Networks and Directed Acyclic Graphs. They incorporate transcription factor information as a prior to ensure successful reconstruction of GRNs.

They modified the uGLAD algorithm to take TF information into account. The resulting method, uGLAD-GRN, uses a post-hoc masking operation that only retains the edges having at least one node as a TF. This masking can be applied to most of the algorithms that recover Conditional Independence graphs.
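
The post-hoc masking step can be sketched independently of the graph recovery model. The example below assumes the recovered graph is given as a square matrix of edge weights (for instance a precision matrix) indexed by gene; the function name and the toy values are hypothetical, and uGLAD itself is not reproduced.

```python
def mask_non_tf_edges(matrix, genes, tfs):
    """Zero out every off-diagonal entry whose two genes are both non-TFs.

    An edge survives only if at least one of its endpoints is a transcription factor.
    """
    is_tf = [gene in tfs for gene in genes]
    n = len(genes)
    return [
        [matrix[i][j] if (i == j or is_tf[i] or is_tf[j]) else 0.0
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical recovered graph over one TF and two target genes.
genes = ["TF1", "G1", "G2"]
recovered = [
    [1.0, 0.4, -0.2],
    [0.4, 1.0, 0.3],
    [-0.2, 0.3, 1.0],
]
masked = mask_non_tf_edges(recovered, genes, tfs={"TF1"})
```

Since the mask only touches the output matrix, the same function can follow any method that returns a gene-by-gene edge matrix.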





□ STGRNS: An interpretable Transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad165/7099621

STGRNS, a Transformer-based model, provides a fast and accurate tool to infer gene regulatory networks from a single-cell RNA-seq profile. By leveraging the newly designed neural network structure, STGRNS performs especially well on GRN inference.

STGRNS has certain transferability on the TF-gene prediction task. STGRNS can accurately infer GRNs based on known relationships between genes, irrespective of whether the data is static, pseudo-time, or time-series.





□ SEQUENCE VS. STRUCTURE: DELVING DEEP INTO DATA DRIVEN PROTEIN FUNCTION PREDICTION

>> https://www.biorxiv.org/content/10.1101/2023.04.02.534383v1

The difference between the RGC TN and RG AT methods is that the former employs a transformer network and incorporates direction, orientation, and distance distribution information in the edge features, while the latter only includes distance and dihedral angle information.

The first fusion method directly splices the output of the ESM-1b model and the GAT model and feeds it to the classifier for the final prediction. The second fusion method involves taking the output of the ESM-1b model as the initialization characteristics of nodes in the graph.





□ Single-cell RNA-seq differential expression tests within a sample should use pseudo-bulk data of pseudo-replicates

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534443v1

The results of the simulation experiments showed that bulk methods that use pseudo-bulk raw count data from pseudo-replicates ranked highest and were most effective in controlling the false discovery rate (FDR) for highly expressed genes.

For real scRNA-seq data, the top-performing pipelines were also dominated by the same kind of pipelines, but the differences between single-cell and pseudo-replicate methods were less clear.
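
One way to build pseudo-bulk data from pseudo-replicates is sketched below. It assumes cells from a single sample are randomly partitioned into a fixed number of groups and that raw counts are summed per gene within each group; the paper's exact procedure may differ, and the function name and seed are illustrative.

```python
import random

def pseudo_bulk_replicates(counts, n_replicates, seed=0):
    """Randomly assign cells to pseudo-replicates and sum raw counts per gene.

    counts[c][g] is the raw count of gene g in cell c.
    Returns one summed count vector per pseudo-replicate.
    """
    rng = random.Random(seed)
    order = list(range(len(counts)))
    rng.shuffle(order)
    n_genes = len(counts[0])
    bulks = [[0] * n_genes for _ in range(n_replicates)]
    for position, cell in enumerate(order):
        replicate = position % n_replicates  # deal cells out like cards
        for gene, count in enumerate(counts[cell]):
            bulks[replicate][gene] += count
    return bulks

# Hypothetical raw counts: six cells, two genes.
cells = [[3, 0], [1, 2], [0, 5], [4, 1], [2, 2], [0, 0]]
bulks = pseudo_bulk_replicates(cells, n_replicates=3)
```

The resulting vectors can be passed to a bulk differential expression method as if they were replicate samples.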





□ sciPENN: A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation

>> https://www.nature.com/articles/s42256-022-00545-w

sciPENN is a flexible method that supports completion of multiple CITE-seq references (by imputing missing proteins for each reference) as well as protein expression prediction in an scRNA-seq test set, all in one framework.

sciPENN can transfer cell type labels from a training set to a test set, and can also integrate cells from the multiple datasets into a common latent space.

sciPENN’s model architecture comprises an input block, followed by a sequence of feed-forward (FF) blocks interleaved with updates to an internally maintained hidden state updated via an RNN cell.

The final hidden state is passed through three dense layers to compute protein predictions, protein prediction bounds and cell type class probability vectors.





□ Bayesian Multi-Study Non-Negative Matrix Factorization for Mutational Signatures

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534619v1

A Bayesian multi-study NMF method that jointly decomposes multiple studies or conditions to identify signatures that are common, specific, or partially shared by any subset.

A “discovery-only” model that estimates de novo signatures in a completely unsupervised manner, and a “recovery-discovery” model that builds informative priors from previously known signatures to both update the estimates of these signatures and identify any novel signatures.





□ The impact of FASTQ and alignment read order on structural variation calling from long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534439v1

Comparisons of variant call format (VCF) files generated from the original and permuted FASTQ files demonstrated that the order of input data had a large impact on SV prediction, particularly for pbsv. The type of variant most affected by read order varied by caller.

For pbsv, most differences occurred for deletions and duplications, while for Sniffles, permuting the read order had a stronger impact on insertions. For SVIM, inversions and deletions accounted for most differences.





□ Spatial Transcriptomics Analysis of Gene Expression Prediction using Exemplar Guided Graph Neural Network

>> https://www.biorxiv.org/content/10.1101/2023.03.30.534914v1

The method proposes a graph exemplar bridging (GEB) block that updates window features using the exemplars and the gene expression of the exemplars. To allow dynamic information propagation, the exemplar features are in turn updated with the state of the window features.

Semantically, the former update corresponds to ‘the known gene expression’, and the latter corresponds to ‘the gene expression the model wants to know’. Finally, an attention-based prediction block aggregates the exemplars of each window and the exemplar-revised window features.





□ CellTrackVis: interactive browser-based visualization for analyzing cell trajectories and lineages

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05218-y

CellTrackVis visualizes tracking results, e.g., cell trajectories, segmentation, raw or processed image sequences, cell lineages, or quantified information, on interconnected views. The quantified information generally includes the number of cell divisions or appearances/disappearances at each time step.

Distinct time-series data are plotted as line graphs, and exact values appear alongside a vertical bar that follows the mouse pointer. The statistical data set is not a mandatory input, so the tool supports its visual analysis while keeping the input data flexible.





□ A self-propagating, barcoded transposon system for the dynamic rewiring of genomic networks

>> https://www.embopress.org/doi/full/10.15252/msb.202211398

A modular, combinatorial assembly pipeline for the functionalization of transposons with synthetic or endogenous gene regulatory elements as well as DNA barcodes.

The continuous mobilization of transposons throughout the host genome yields multi-site adaptive mutations and growth phenotypes in both static and dynamic selective environments.

The first version mimics a natural transposon, with the transposase acting in cis from within the region flanked by the inverted repeat sequences, while the second uses a medium-copy helper plasmid (pHelper) to provide transposase acting in trans.





□ Sparse clusterability: testing for cluster structure in high dimensions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05210-6

Clusterlab generates clusters of a user-provided dimension by a linear projection of two-dimensional Gaussian principal components into the desired higher-dimensional space. The clusterlab manual highlights 12 example two-dimensional structures to project into higher dimension.

Methods combining the dip test with either sparse PCA or traditional PCA detected known cluster structure in high-dimensional omics-based data and had high power in simulations. Type I error was controlled at or below the nominal level across all dimensions.





□ MBE: Model-based differential sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.29.534803v1

Model-based enrichment (MBE) is based on sound theoretical principles, is easy to implement, and can trivially make use of advances in modern-day machine learning classification architectures or related innovations.

Increasingly, log-enrichment estimates are also being used as supervised labels for training machine learning models so that one may predict enrichment for unobserved sequences, or probe the model to gain further insights.
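
For context, the conventional count-based log-enrichment estimate that such labels are derived from can be sketched as below. This is the standard ratio-of-frequencies calculation with a pseudocount, not the MBE method itself; the function name and the pseudocount value are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def log_enrichment(pre: List[str], post: List[str], pseudocount: float = 1.0) -> Dict[str, float]:
    """Log ratio of post-selection to pre-selection frequency for each sequence."""
    pre_counts, post_counts = Counter(pre), Counter(post)
    seqs = set(pre_counts) | set(post_counts)
    # Smoothed library sizes: every observed sequence receives the pseudocount.
    n_pre = sum(pre_counts.values()) + pseudocount * len(seqs)
    n_post = sum(post_counts.values()) + pseudocount * len(seqs)
    return {
        s: math.log((post_counts[s] + pseudocount) / n_post)
           - math.log((pre_counts[s] + pseudocount) / n_pre)
        for s in seqs
    }

# Toy libraries (hypothetical reads).
pre_reads = ["AAA"] * 4 + ["CCC"] * 4
post_reads = ["AAA"] * 6 + ["CCC"] * 2
le = log_enrichment(pre_reads, post_reads)
```

Positive values mark sequences that became more frequent after selection, negative values those that were depleted.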





□ PanKmer: k-mer based and reference-free pangenome analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.31.535143v1

PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence-absence values in each genome. These are stored in an efficient k-mer index data format that encodes all forms of variation within the pangenome, including SNPs, INDELs, and SVs.

PanKmer includes functions for downstream analysis, such as calculating sequence similarity statistics between individuals at whole-genome or local scales. k-mers can be “anchored” in any individual genome to quantify sequence variability or conservation at a specific locus.
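
The idea of a k-mer presence-absence table can be sketched in a few lines. This toy version uses plain Python dictionaries, ignores reverse complements, and uses the Jaccard index as one possible similarity statistic; none of the names below come from the PanKmer API.

```python
from typing import Dict, List, Set, Tuple

def kmers(seq: str, k: int) -> Set[str]:
    """All distinct k-mers of a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def presence_absence(genomes: Dict[str, str], k: int) -> Tuple[List[str], Dict[str, List[int]]]:
    """Map each observed k-mer to a 0/1 vector with one entry per genome."""
    names = list(genomes)
    per_genome = {name: kmers(genomes[name], k) for name in names}
    all_kmers = sorted(set().union(*per_genome.values()))
    table = {km: [int(km in per_genome[n]) for n in names] for km in all_kmers}
    return names, table

def jaccard(table: Dict[str, List[int]], i: int, j: int) -> float:
    """Shared k-mers divided by k-mers present in either genome."""
    shared = sum(1 for row in table.values() if row[i] and row[j])
    either = sum(1 for row in table.values() if row[i] or row[j])
    return shared / either if either else 0.0

# Toy genomes (hypothetical sequences).
genomes = {"g1": "ACGTAC", "g2": "ACGTTC"}
names, table = presence_absence(genomes, k=3)
similarity = jaccard(table, 0, 1)
```

A single substitution changes several overlapping k-mers at once, which is why such a table reflects SNPs, indels, and larger variants without a reference.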





□ MOGAT: An Improved Multi-Omics Integration Framework Using Graph Attention Networks

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535195v1

MOGAT is a novel multi-omics integration framework for cancer subtype prediction. It uses a graph attention network (GAT) model that combines graph-based learning with an attention mechanism to analyze multi-omics data.

MOGAT utilizes a multi-head attention mechanism that can efficiently extract information for a specific patient by assigning unique attention coefficients to its neighboring patients, i.e., getting the relative influence of neighboring patients in the patient similarity graph.
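
How attention coefficients over neighboring patients are obtained can be sketched for a single head. This follows the generic GAT recipe (score each neighbor, apply LeakyReLU, normalize with softmax) and omits the learned linear projection; the feature vectors and the attention vector `a` are made-up values, not MOGAT parameters.

```python
import math
from typing import Dict, List

def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x

def attention_coefficients(h: Dict[int, List[float]],
                           neighbors: Dict[int, List[int]],
                           a: List[float],
                           i: int) -> Dict[int, float]:
    """Softmax-normalized attention of patient i over its neighbors (one head)."""
    # Score each neighbor j from the concatenation [h_i || h_j].
    scores = [leaky_relu(sum(w * x for w, x in zip(a, h[i] + h[j])))
              for j in neighbors[i]]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {j: e / z for j, e in zip(neighbors[i], exps)}

# Toy patient similarity graph (hypothetical features and weights).
h = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1, 2]}
a = [0.5, 0.5, 1.0, -1.0]
alpha = attention_coefficients(h, neighbors, a, 0)
```

A multi-head version would repeat this with several independent attention vectors and combine the resulting neighbor aggregates.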





□ mlf-core: a framework for deterministic machine learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad164/7099608

mlf-core is a machine learning framework that enables building fully deterministic, and therefore reproducible, machine learning projects. mlf-core is based on MLflow for machine learning experiment tracking, visualization and model deployment.

mlf-core provides project templates and static code analysis (linting) functionality that ensures the sole usage of deterministic algorithms for GPU computing as well as setting all necessary random seeds for deterministic results.
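
The seed-setting part of this can be illustrated with the Python standard library alone. This is a generic sketch, not mlf-core code; GPU frameworks expose their own seed and deterministic-algorithm switches, which are not shown here, and the function names are hypothetical.

```python
import os
import random
from typing import List

def set_seeds(seed: int) -> None:
    """Seed the sources of randomness used in this toy example."""
    # PYTHONHASHSEED only takes effect in interpreters started after this point,
    # e.g. worker subprocesses; it does not change the current process.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

def toy_training_run(seed: int) -> List[float]:
    """Stand-in for a training run whose result depends on random draws."""
    set_seeds(seed)
    return [random.random() for _ in range(3)]

run_a = toy_training_run(42)
run_b = toy_training_run(42)
```

With every seed fixed, two runs give identical results; a linter can then check that no unseeded or non-deterministic call slips into the project.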





□ Discovering motifs and genomic patterns with SMT: a high-performance data structure for counting kmers

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535163v1

The Sparse Motif Tree (SMT) is a data structure specifically designed to store and count kmers efficiently, optimizing both memory usage and computation.

The SMT provides advanced features, such as exact search in constant time, retrieval of the most abundant kmers, and approximate search in linear time to find fragments with up to d mutations uniformly distributed across their bases.
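
The three operations (exact lookup, most abundant kmers, search with up to d mismatches) can be shown with a plain dictionary-based counter. This is deliberately not the SMT: it is a baseline whose approximate search scans every stored kmer, and all names are illustrative.

```python
from collections import Counter

def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping kmer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def approximate_count(counts: Counter, query: str, d: int) -> int:
    """Total occurrences of kmers within d mismatches of the query."""
    return sum(c for kmer, c in counts.items() if hamming(kmer, query) <= d)

counts = count_kmers("ACGACGACT", k=3)
exact = counts["ACG"]                        # exact lookup in a hash map
top = counts.most_common(2)                  # most abundant kmers
near = approximate_count(counts, "ACG", d=1)
```

A tree-based index aims to answer the same queries without visiting every stored kmer.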





□ PanGraphViewer: A Versatile Tool to Visualize Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2023.03.30.534931v1

PanGraphViewer targets pangenome graphs and allows the viewing of pangenome graphs built from multiple genomes in either the graphical fragment assembly format or the VCF. PanGraphViewer also integrates genome annotations with graph nodes to analyze insertions / deletions.

The graph node shapes in PanGraphViewer can represent different types of genomic variations when a VCF file is used. Notably, PanGraphViewer displays subgraphs from a chromosome or sequence segment based on any given coordinates.





□ ScRAT: Clinical Phenotype Prediction From Single-cell RNA-seq Data using Attention-Based Neural Networks

>> https://www.biorxiv.org/content/10.1101/2023.03.31.532253v1

ScRAT is a clinical phenotype prediction framework that can learn from limited numbers of scRNA-seq samples with minimal dependence on cell-type annotations.

ScRAT establishes the connection between the input (cells) and the output (phenotypes) of the Transformer model simply using the attention weights.





□ NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad161/7099619

NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. Its distinguishing features are the annotation of nested named entities and its suitability as a benchmark for cross-domain and cross-language transfer.

Transferability of trained models across two datasets with completely different contexts can be limited due to domain shift, while sequential training can cause complete retraining of model weights.





□ Dipwmsearch: a python package for searching di-PWM motifs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad141/7100340

dipwmsearch provides an easy and efficient procedure to find occurrences of di-PWMs in nucleotide sequences, along with well-documented snippets. It offers practical advantages over an existing solution, such as processing IUPAC codes and an adaptable output.

dipwmsearch uses an original enumeration-based search algorithm that handles di-PWMs. Coping with non-selective positions was necessary to make the search effective for some di-PWMs, which calls into question their information content and, in turn, their construction process.
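
What searching a di-PWM means can be shown with a naive sliding-window scan. This is not the enumeration-based algorithm of dipwmsearch and does not use its API; the matrix below is a made-up example, and treating missing dinucleotides as score 0.0 is a simplifying assumption.

```python
from typing import Dict, List, Tuple

DiPWM = List[Dict[str, float]]  # one dinucleotide-to-score column per position

def score_window(dipwm: DiPWM, window: str) -> float:
    """Sum the scores of the overlapping dinucleotides in the window."""
    return sum(col.get(window[i:i + 2], 0.0) for i, col in enumerate(dipwm))

def scan(dipwm: DiPWM, seq: str, threshold: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every window scoring at or above the threshold."""
    span = len(dipwm) + 1  # m dinucleotide columns cover m + 1 nucleotides
    hits = []
    for start in range(len(seq) - span + 1):
        s = score_window(dipwm, seq[start:start + span])
        if s >= threshold:
            hits.append((start, s))
    return hits

# Toy di-PWM with two columns (hypothetical scores).
dipwm = [{"AC": 2.0, "AG": 1.0}, {"CG": 2.0, "GG": 0.5}]
hits = scan(dipwm, "TACGAGG", threshold=1.5)
```

The scan costs time proportional to sequence length times motif length.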





□ FRASER 2.0: Improved detection of aberrant splicing using the Intron Jaccard Index

>> https://www.medrxiv.org/content/10.1101/2023.03.31.23287997v1

As FRASER’s autoencoder works with values in the logit space, which is defined for values greater than 0 and less than 1, a pseudocount needs to be added to both the numerator and denominator when calculating each metric on raw read counts.
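
The need for the pseudocount can be seen in a short sketch. The specific smoothing used here, (k + 1) / (n + 2), is an assumption chosen for illustration; the text above only states that a pseudocount is added to both numerator and denominator.

```python
import math

def logit(p: float) -> float:
    """Log-odds; only defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def smoothed_logit(k: int, n: int, pseudocount: float = 1.0) -> float:
    """Logit of a ratio k/n after adding pseudocounts so it never hits 0 or 1.

    k: reads supporting the event, n: total reads considered (k <= n).
    """
    p = (k + pseudocount) / (n + 2.0 * pseudocount)
    return logit(p)

low = smoothed_logit(0, 10)    # raw ratio 0.0 would give log(0)
mid = smoothed_logit(5, 10)    # raw ratio 0.5
high = smoothed_logit(10, 10)  # raw ratio 1.0 would divide by zero
```

Without the pseudocount, introns with zero or full support would map to minus or plus infinity and could not be fed to the autoencoder.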

FRASER 2.0 is a method to detect aberrant splicing using a novel intron-centric metric, the Intron Jaccard Index. In a single metric, the Intron Jaccard Index captures former metrics of splicing efficiency as well as alternative donor and acceptor site choice.

FRASER 2.0 decreases the number of reported splicing outliers by one order of magnitude, recovers splicing outliers associated with candidate splice-disrupting rare variants more accurately than competitor methods, and is more robust to variations in sequencing depth.





□ catchSalmon / catchKallisto: Dividing out quantification uncertainty allows efficient assessment of differential transcript expression

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535231v1

Bootstrap samples generated by lightweight aligners can be used to accurately estimate the mapping ambiguity overdispersion which, in turn, can be used to scale down estimated transcript counts so that the resulting effective library sizes reflect their true precision.

As a result, standard methods designed for gene-level differential expression analyses can be applied to the transformed transcript counts for DTE analyses.

Functions catchSalmon and catchKallisto from edgeR import transcript-specific estimated counts (including bootstrap resamples) from Salmon and kallisto, respectively, and estimate the associated mapping ambiguity overdispersion.
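
The scaling idea can be sketched in Python for a single sample. This is a simplified illustration, not the edgeR estimator: here the overdispersion of each transcript is taken as the variance-to-mean ratio of its bootstrap counts, floored at 1, and the function names and toy numbers are assumptions.

```python
from statistics import mean, variance
from typing import Dict, List

def overdispersion(boot_counts: List[float]) -> float:
    """Variance-to-mean ratio of bootstrap counts, never below 1."""
    m = mean(boot_counts)
    if m == 0:
        return 1.0
    return max(1.0, variance(boot_counts) / m)

def scale_counts(counts: Dict[str, float],
                 bootstraps: Dict[str, List[float]]) -> Dict[str, float]:
    """Divide each estimated count by its mapping-ambiguity overdispersion."""
    return {t: counts[t] / overdispersion(bootstraps[t]) for t in counts}

# Toy data: tx1 is assigned unambiguously, tx2 shares reads with other isoforms.
counts = {"tx1": 100.0, "tx2": 100.0}
bootstraps = {"tx1": [98, 100, 102, 100], "tx2": [60, 140, 80, 120]}
scaled = scale_counts(counts, bootstraps)
```

The ambiguously assigned transcript keeps the same nominal count but ends up with a much smaller effective count, so downstream count-based tests treat it as less precise.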





□ HTOreader: A hybrid single-cell demultiplexing strategy that increases both cell recovery rate and calling accuracy

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535299v1

HTOreader is an improved algorithm for cell hashing that distinguishes true positives from background for each individual hashtag with higher accuracy. This hybrid strategy increases cell recovery and calling accuracy while lowering experimental cost.

HTOreader uses a hybrid demultiplexing strategy for single-cell sample pooling and super loading. Integrating the results of cell hashing and SNP profiling lets the two approaches complement each other and compensate for each other's weaknesses.