lens, align.

Long is the time, but what is true comes to pass.

Elevation.

2024-01-17 23:33:55 | Science News




□ PCA-Plus: Enhanced principal component analysis with illustrative applications to batch effects and their quantitation

>> https://www.biorxiv.org/content/10.1101/2024.01.02.573793v1

DSC (the dispersion separability criterion), a novel variant metric for quantifying the global dissimilarity of sets of pre-defined groups, with application to PCA plots.

The DSC can be used, for instance, to assess the magnitude of batch effects or the differences among classes or subtypes of biological samples.

PCA-Plus features group centroids; trend arrows (when pertinent); separate coloring of centroids, rays, and data points; and quantitation in terms of the new DSC metric with corresponding permutation test p-values.
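
The summary above does not give the DSC formula. As a minimal sketch, assume the DSC is the ratio of between-group dispersion (centroids around the grand mean) to within-group dispersion (points around their own centroid), with a p-value from permuting group labels. The function names and the exact normalisation are assumptions for illustration, not the PCA-Plus implementation.

```python
import numpy as np

def dsc(points, labels):
    """Assumed DSC: between-group dispersion divided by within-group dispersion."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    grand_mean = points.mean(axis=0)
    between = 0.0
    within = 0.0
    for group in np.unique(labels):
        members = points[labels == group]
        centroid = members.mean(axis=0)
        # Squared distance of the centroid to the grand mean, weighted by group size.
        between += len(members) * np.sum((centroid - grand_mean) ** 2)
        # Squared distances of the members to their own centroid.
        within += np.sum((members - centroid) ** 2)
    n = len(points)
    return np.sqrt(between / n) / np.sqrt(within / n)

def dsc_permutation_pvalue(points, labels, n_perm=999, seed=0):
    """Share of label permutations whose DSC is at least the observed DSC."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = dsc(points, labels)
    hits = sum(dsc(points, rng.permutation(labels)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```

Applied to the first few principal components with batch identifiers as labels, a larger value would indicate stronger separation between batches under this assumed definition.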





□ Reformer: Deep learning model for characterizing protein-RNA interactions from sequence at single-base resolution

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575540v1

Reformer is a transformer-based model that aims to improve prediction resolution and facilitate greater information flow between peaks and their surrounding contexts.

Reformer provides a unified framework for characterizing RBP binding and prioritizing mutations that affect RNA regulation at base resolution. For each base, the transformer layer computes a weighted sum across the representations of all other bases of the sequence.

Reformer refines predictions by incorporating information from relevant regions across the entire sequence. Employing a regression layer for coverage prediction, Reformer outputs binding affinities for all bases.
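
The weighted sum over base representations, followed by a per-base regression output, can be sketched with plain single-head self-attention. This is a generic illustration, not Reformer's architecture: the weight matrices, dimensions, and function names are hypothetical, and standard self-attention also lets each base attend to itself.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(x, w_q, w_k, w_v):
    """x: (bases, features). Returns mixed representations and attention weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # one row of weights per base
    return weights @ v, weights

def per_base_coverage(x, w_q, w_k, w_v, w_out):
    """Linear regression head: one predicted value for every base."""
    mixed, _ = attention_layer(x, w_q, w_k, w_v)
    return mixed @ w_out
```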





□ DeepCycle: Unraveling the oscillatory dynamics of mRNA metabolism and chromatin accessibility during the cell cycle through integration of single-cell multiomic data

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575159v1

DeepCycle, a deep learning tool that uses single-cell RNA sequencing to map the gene expression profiles of every cell to a continuous latent variable, θ, representing the cell cycle phase.

DeepCycle predicts the cell cycle dependence of transcription, nuclear export, and degradation rates for every gene, revealing waves of transcriptional and post-transcriptional regulation during the cell cycle.





□ PathFinder: a novel graph transformer model to infer multi-cell intra- and inter-cellular signaling pathways and communications

>> https://www.biorxiv.org/content/10.1101/2024.01.13.575534v1

PathFinder is based on a divide-and-conquer strategy: it divides complex signaling networks into signaling paths, then scores and ranks them using a novel graph transformer architecture to infer intra- and inter-cell signaling networks.

PathFinder can effectively separate cells from different conditions by selecting differentially expressed signaling paths. The trainable path weight will be learned to assign each path an importance score, which can be used to generate intra-cell communication networks.





□ scKWARN: Kernel-weighted-average robust normalization for single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae008/7574580

scKWARN, a Kernel Weighted Average Robust Normalization designed to correct known or hidden technical confounders w/o assuming specific data distributions or count-depth relationships. scKWARN inherently considers any technical factors contributing to unwanted expression variation.

scKWARN generates a pseudo expression profile for each cell using information from its fuzzy technical neighbors through a kernel smoother. It then compares this profile against the reference derived from cells w/ the same bimodality patterns to determine the normalization factor.
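
The kernel-smoothing step can be illustrated with a Nadaraya-Watson-style weighted average. This sketch assumes a Gaussian kernel over a single hypothetical technical covariate per cell (for instance log library size); how scKWARN actually defines technical neighbors and the reference profile is not covered here.

```python
import numpy as np

def kernel_weighted_profiles(expr, covariate, bandwidth):
    """expr: (cells, genes); covariate: one assumed technical value per cell.

    Each cell's pseudo profile is the kernel-weighted average of all cells'
    profiles, so cells with a similar covariate value contribute the most.
    """
    expr = np.asarray(expr, dtype=float)
    t = np.asarray(covariate, dtype=float)
    scaled = (t[:, None] - t[None, :]) / bandwidth
    weights = np.exp(-0.5 * scaled ** 2)           # Gaussian kernel
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ expr
```

A small bandwidth keeps each pseudo profile close to the cell's own counts, while a large one pulls every profile toward the global average.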





□ BSAlign: a library for nucleotide sequence alignment

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575791v1

BSAlign is a library/tool for adaptive banding striped 8/2-bit-scoring global/extend/overlap DNA sequence pairwise/multiple alignment.

BSAlign delivers alignment results at an ultra-fast speed by knitting a series of novel methods together to take advantage of all of the aforementioned three perspectives w/ highlights such as active F-loop in striped vectorization and striped move in banded dynamic programming.





□ SI: Quantifying the distribution of feature values over data represented in arbitrary dimensional spaces

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011768

Structure Index (SI), a new metric aimed at quantifying how a given feature is structured along an arbitrary point cloud. The SI aims at quantifying the amount of structure present in the distribution of a given feature over a point cloud in an arbitrary D-dimensional space.

By definition, the SI is agnostic to the type of structure (e.g., gradient, patchy, etc.) since bin groups do not need to follow any specific arrangement. SI permits examination of the local and global distribution of features, whether categorical/continuous or scalar/vectorial.





□ SPE: On the Stability of Expressive Positional Encodings for Graph Neural Networks

>> https://arxiv.org/abs/2310.02579

Stable and Expressive Positional Encodings (SPE), an architecture for processing eigenvectors that uses eigenvalues to "softly partition" eigenspaces.

SPE is the first architecture that is provably stable, and universally expressive for basis invariant functions whilst respecting all symmetries of eigenvectors.





□ MetaNorm: Incorporating meta-analytic priors into normalization of NanoString nCounter data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae024/7574576

MetaNorm, a Bayesian algorithm for normalizing NanoString nCounter gene expression data. MetaNorm employs priors carefully constructed from a rigorous meta-analysis to leverage information.

MetaNorm is based on RCRnorm, a powerful method designed under an integrated series of hierarchical models that allow various sources of error to be explained by different types of probes in the nCounter system.





□ scMAE: a masked autoencoder for single-cell RNA-seq clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae020/7564641

scMAE perturbs gene expression and employs a masked autoencoder to reconstruct the original data, learning robust and informative cell representations. scMAE effectively captures latent structures and dependencies in the data, enhancing clustering performance.

scMAE employs partial corruption to the gene expression data and incorporates a masking predictor to capture the correlations between genes. scMAE takes the corrupted data as input to the encoder, obtains a low-dimensional embedding, and then passes it to the masking predictor.





□ FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae014/7515251

FMAlign2 utilizes Maximal Exact Matches (MEMs) instead of k-mers to identify partial chains in sequences. FMAlign2 constructs suffix array and longest common prefix (LCP) array, identifies MEMs, and generates a colinear set of MEMs for alignment.

FMAlign2 employs the striped Smith-Waterman (SSW) algorithm to identify similar substrings for each MEM in sequences where MEMs are absent. The identified substrings, combined with MEMs, form the partial chains used for subsequent sequence segmentation to generate segments.
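
To make the suffix array / LCP step concrete, here is a small sketch that builds both arrays and uses them to find the longest exact match shared by two sequences. It is a toy illustration, not FMAlign2's code: the construction is the naive sort-based one, the function names are hypothetical, and a real MEM search reports every maximal match above a length threshold across many sequences.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """Kasai's algorithm: lcp[k] is the common prefix length of suffixes sa[k-1] and sa[k]."""
    n = len(text)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp

def longest_shared_match(a, b, sep="#"):
    """Longest exact match between a and b; sep must not occur in either sequence."""
    text = a + sep + b
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    best = ""
    for k in range(1, len(sa)):
        # Only neighbouring suffixes that start in different sequences count.
        if (sa[k - 1] < len(a)) != (sa[k] < len(a)) and lcp[k] > len(best):
            best = text[sa[k]:sa[k] + lcp[k]]
    return best
```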





□ SC-VAE: A Supervised Contrastive Framework for Learning Disentangled Representations of Cell Perturbation Data

>> https://www.biorxiv.org/content/10.1101/2024.01.05.574421v1

SC-VAE (Supervised Contrastive Variational Autoencoder), a novel framework for learning disentangled representations from Perturb-Seq data. SC-VAE learns two latent spaces with the same semantics, but also jointly models guide RNA identity alongside gene expression measurements.

SC-VAE employs the Hilbert-Schmidt Independence Criterion as a regularization technique. SC-VAE extends the CA framework by adding a supervision component to the generative model.

SC-VAE incorporates two distinct encoders: a background encoder, capturing biological attributes like cell cycle processes, and a salient encoder, specifically targeting perturbation effects.

The salient space induces a much higher energy distance compared to the background space, suggesting that the two spaces are disentangled. The energy distances for SC-VAE's salient space were consistently higher than those for ContrastiveVI's salient space or for the PCA space.
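
For reference, the energy distance between two sets of embeddings can be computed as below. This is the standard sample (V-statistic) form, 2·E|X−Y| − E|X−X'| − E|Y−Y'|; whether the SC-VAE evaluation uses exactly this estimator is an assumption, and the function name is hypothetical.

```python
import numpy as np

def energy_distance(x, y):
    """x: (n, d) and y: (m, d) arrays of embeddings, e.g. perturbed vs. control cells."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def mean_pairwise(a, b):
        # Mean Euclidean distance over all pairs (one row of a, one row of b).
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    return 2 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)
```

A value near zero means the two groups are hard to tell apart in that space, and larger values mean they are better separated.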





□ TEMINET: A Co-Informative and Trustworthy Multi-Omics Integration Network for Diagnostic Prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.03.574118v1

TEMINET utilizes intra-omics features to construct disease-specific networks, then applies graph attention networks and a multi-level framework to capture more collective informativeness than pairwise relations.

TEMINET operates on a sample-wise basis with multi-omics information for each individual sample being imported into the model. The first intra-omics network is built using WGCNA. The intra-omic information at each omics-level is augmented using the multi-level GAT.

The evidence is evaluated by the subjective logic module to obtain uncertainty. During the integration phase, the trustworthy informativeness and uncertainty from each omics are amalgamated into a composite embedding encompassing inter-omics information.





□ scDirect: key transcription factor identification for directing cell state transitions based on single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.01.08.574757v1

scDirect models cell state transition as a linear process. scDirect constructs a primary GRN with scRNA-seq data and scATAC-seq data, and then enhances the GRN with graph attention network (GAT) to obtain more putative TF-target pairs with high confidence.

scDirect uses CellOracle to calculate a primary GRN, and then applies GAT to enhance the GRN. scDirect models the TF identification task as a linear inverse problem and solves the expected alteration of each TF with Tikhonov regularization.
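
A Tikhonov-regularised linear inverse problem has a closed-form solution, sketched below. The mapping is an assumption for illustration: `a` stands for a hypothetical matrix of TF-to-gene effects taken from the GRN, `b` for the desired expression change between two cell states, and `x` for the TF alterations to be solved for. scDirect's actual formulation may differ.

```python
import numpy as np

def tikhonov_solve(a, b, lam):
    """Minimise ||a @ x - b||^2 + lam * ||x||^2 via the normal equations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_unknowns = a.shape[1]
    return np.linalg.solve(a.T @ a + lam * np.eye(n_unknowns), a.T @ b)
```

Larger values of `lam` shrink the solution toward zero, which keeps the problem well-posed when many TFs have correlated targets.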





□ Biolord: Disentanglement of single-cell data

>> https://www.nature.com/articles/s41587-023-02079-x

Biolord is a deep generative method for disentangling single-cell multi-omic data to known and unknown attributes, including spatial, temporal and disease states, used to reveal the decoupled biological signatures over diverse single-cell modalities and biological systems.

Decomposed latent space - for each known attribute, a dedicated subnetwork is constructed. The architecture of each subnetwork is chosen based on the attribute's type (categorical or ordered).

The decomposed latent space and the generative prediction are optimized jointly, such that the embeddings in the decomposed latent space are optimized with respect to the reconstruction error of the generator.






□ PDGrapher: Combinatorial prediction of therapeutic perturbations using causally-inspired neural networks

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573985v2

PDGRAPHER efficiently predicts perturbagens to shift cell line gene expression from a diseased to a treated state across two evaluation settings and eight datasets of genetic and chemical interventions.

Training PDGRAPHER models is up to 30 times faster than response prediction methods that use indirect prediction to nominate candidate perturbagens.

PDGRAPHER can illuminate the mode of action of predicted perturbagens given that it predicts gene targets based on network proximity which governs similarity between genes.

PDGRAPHER posits that leveraging representation learning can overcome incomplete causal graph approximations. A valuable research direction is to theoretically examine the impact of using the approximations, focusing on how they influence the reliability of predicted likelihoods.






□ Transformers are Multi-State RNNs

>> https://arxiv.org/abs/2401.06104

Transformers can be thought of as infinite multi-state RNNs (MSRNNs), with the key/value vectors corresponding to a multi-state that dynamically grows infinitely. Pretrained transformers can be converted into finite MSRNNs, which keep a fixed-size multi-state by dropping one state at each decoding step.

TOVA is a powerful MSRNN compression policy. TOVA selects which tokens to keep in the multi-state based solely on their attention scores. TOVA performs comparably to the infinite MSRNN model. Although transformers are not trained as such, they often function as finite MSRNNs.
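
The idea of a fixed-size multi-state with attention-based eviction can be sketched for a single attention head as follows. This is a simplified illustration based only on the description above: the function names are hypothetical, and details such as how scores are combined across heads and layers are not modelled.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(keys, values, new_key, new_value, query, max_states):
    """Append the new token's key/value, attend, then evict if over budget.

    keys, values: (t, d) cached states; query: (d,) for the current step.
    """
    keys = np.vstack([keys, new_key])
    values = np.vstack([values, new_value])
    weights = softmax(keys @ query / np.sqrt(len(query)))
    output = weights @ values
    if len(keys) > max_states:
        # Drop the cached token that received the lowest attention weight.
        drop = int(np.argmin(weights))
        keys = np.delete(keys, drop, axis=0)
        values = np.delete(values, drop, axis=0)
    return output, keys, values
```

Starting from empty `(0, d)` caches and calling `decode_step` once per generated token keeps the cache at no more than `max_states` rows.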





□ SuperCell: Coarse-graining of large single-cell RNA-seq data into metacells

>> https://github.com/GfellerLab/SuperCell

SuperCell is an R package for coarse-graining large single-cell RNA-seq data into metacells and performing downstream analysis at the metacell level.

Unlike clustering, the aim of metacells is not to identify large groups of cells that comprehensively capture biological concepts, like cell types, but to merge cells that share highly similar profiles, and may carry repetitive information.

Therefore, metacells represent a compromise structure that optimally removes redundant information in scRNA-seq data while preserving the biologically relevant heterogeneity.





□ Cellograph: a semi-supervised approach to analyzing multi-condition single-cell RNA-sequencing data using graph neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05641-9

Cellograph uses Graph Convolutional Networks (GCNs) to perform node classification on cells from multiple samples to quantify how representative cells are of each sample.

Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable data visualization and clustering. The learned gene weight matrix from training reveals pertinent genes driving the differences between conditions.





□ ABC: Batch correction of single cell sequencing data via an autoencoder architecture

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbad186/7502962

Autoencoder-based Batch Correction (ABC), a semi-supervised deep learning architecture for integrating single cell sequencing. ABC removes batch effects through a guided process of data compression using supervised cell type classifier branches for biological signal retention.

ABC is based on an autoencoder architecture trained in an adversarial manner alongside a batch label discriminator, similar to GANs.

The architecture takes as input molecular measurements from a given cell, containing the normalized counts of each locus/gene in the cell, and outputs a corrected vector of values that can be used for downstream analysis.

In the ABC approach, cell type classifiers are utilized to guide both encoding and decoding processes, ensuring the retention of cell type-specific variations. This is particularly relevant for cell types that are unique to a specific batch and represented by a small number of cells.





□ HyperPCM: Robust Task-Conditioned Modeling of Drug–Target Interactions

>> https://pubs.acs.org/doi/10.1021/acs.jcim.3c01417

HyperPCM, a novel neural network architecture that achieves state-of-the-art performance in various settings including during zero-shot inference, where predictions are made for previously unseen protein targets.

HyperPCM leverages the power of a HyperNetwork that learns to predict parameters for other neural networks. The specialized weight initialization strategy of the HyperNetwork stabilizes the signal propagation through the QSAR model.





□ Dagger categories and the complex numbers: Axioms for the category of finite-dimensional Hilbert spaces and linear contractions

>> https://arxiv.org/abs/2401.06584

Characterising the category of finite-dimensional Hilbert spaces and linear contractions using simple category-theoretic axioms that do not refer to norms, continuity, dimension, or real numbers.

The scalar localisation of a category satisfying these axioms is shown to be equivalent to the category of finite-dimensional Hilbert spaces and all linear maps; the original category is then identified with the full subcategory of linear contractions.






□ BaseMEMOIR: Reconstructing cell histories in space with image-readable base editor recording

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573434v1

baseMEMOIR combines base editing, sequential hybridization imaging, and Bayesian inference to allow reconstruction of high-resolution cell lineage trees and cell state dynamics while preserving spatial organization.

BaseMEMOIR stochastically and irreversibly edits engineered dinucleotides to one of three alternative image-readable states. baseMEMOIR achieves high density recording, while maintaining compatibility with FISH-based readout of endogenous genes.





□ MoCoLo: a testing framework for motif co-localization

>> https://www.biorxiv.org/content/10.1101/2024.01.04.574249v1

MoCoLo employs a unique approach to co-localization testing that directly probes for genomic co-localization with duo-hypotheses testing. This means that MoCoLo can deliver more detailed and nuanced insights into the interplay between different genomic features.

MoCoLo features a novel method for informed genomic simulation, taking into account intrinsic sequence properties such as length and guanine-content.

MoCoLo enables us to identify genome-wide co-localization of 8-oxo-dG sites and non-B DNA forming regions, providing a deeper understanding of the interactions between these genomic elements.





□ PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574780v1

PathIntegrate employs single-sample pathway analysis (ssPA) to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data.

PathIntegrate Single-View produces a multi-omics pathway-transformed dataset and applies a classification or regression model. PathIntegrate Multi-View uses a multi-block partial least squares (MB-PLS) latent variable model to integrate ssPA-transformed multi-omics data.





□ GatekeepR: an R shiny application for the identification of nodes with high dynamic impact in boolean networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae007/7513690

GatekeepR provides a ranked list of network components whose perturbation (i.e. knockout or overexpression) is likely to have a high impact on dynamics, resulting in a large change in the system's attractor landscape.

Such a change is defined by the loss of previously existing attractors along with the appearance of new attractors which possess a high Hamming distance with respect to all attractors of the unperturbed system.

The recommended nodes have been found to be sparsely connected and to preferentially exchange mutual information with highly connected hub nodes and have thus been named "gatekeepers".

GatekeepR does not perform any analyses on the state transition graph of a network, which scales exponentially with network size, but relies only on measures defined by the network's logical rules and their resulting interaction graph.





□ Hierarchical Causal Models

>> https://arxiv.org/abs/2401.05330

Hierarchical causal models (HCM), which extend structural causal models and causal graphical models by adding inner plates. It uses a general graphical identification technique for hierarchical causal models that extends do-calculus.

In the HCM identification problem, infinite data from both units and subunits is considered. We find many situations in which hierarchical data can enable causal identification even when it would be impossible with non-hierarchical data.





□ Generative artificial intelligence performs rudimentary structural biology modelling

>> https://www.biorxiv.org/content/10.1101/2024.01.10.575113v1

Using ChatGPT to model 3D structures for the 20 standard amino acids as well as an α-helical polypeptide chain, with the latter involving incorporation of the Wolfram plugin for advanced mathematical computation.

For amino acid modelling, distances and angles between atoms of the generated structures in most cases approximated experimentally-determined values.

For α-helix modelling, the generated structures were comparable to that of an experimentally-determined α-helical structure. However, both amino acid and α-helix modelling were sporadically error-prone and increased molecular complexity was not well tolerated.





□ Genopyc: a python library for investigating the genomic basis of complex diseases

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575316v1

Genopyc performs various tasks such as retrieving the functional elements neighbouring genomic coordinates, annotating variants, retrieving genes affected by non-coding variants, and performing and visualizing functional enrichment analysis.

Genopyc can also retrieve a linkage-disequilibrium (LD) matrix for a set of SNPs by using LDlink, convert genome coordinates between genome versions, and retrieve gene coordinates in the genome.

Genopyc queries the variant effect predictor (VEP) to predict the consequences of the SNPs on the transcript and its effect on neighboring genes and functional elements.





□ CEL: A Continual Learning Model for Disease Outbreak Prediction by Leveraging Domain Adaptation via Elastic Weight Consolidation

>> https://www.biorxiv.org/content/10.1101/2024.01.13.575497v1

CEL (Continual Learning by EWC and LSTM), a model for disease outbreak prediction designed to combat catastrophic forgetting in domain-incremental learning setting where the Fisher Information Matrix in Elastic Weight Consolidation is used to construct a regularization term.

CEL starts w/ data segmentation for contextual learning, followed by domain adaptation where a neural network incorporates EWC and retains earlier knowledge while integrating new contexts. Finally, performance evaluation measures knowledge retention versus new learning.





□ SupirFactor: Structure-primed embedding on the transcription factor manifold enables transparent model architectures for gene regulatory network and latent activity inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03134-1

SupirFactor (StrUcture Primed Inference of Regulation using latent Factor ACTivity), a novel autoencoder-based framework for modeling, and a metric, explained relative variance (ERV), for interpretation of GRNs.

SupirFactor incorporates knowledge priming by using prior, known regulatory evidence to constrain connectivity between an input gene expression layer and the first latent layer, which is explicitly defined to be TF-specific.




Year of the Dragon.

2024-01-17 23:22:33 | Science News





□ Scalable network reconstruction in subquadratic time

>> https://arxiv.org/abs/2401.01404

A general algorithm applicable to a broad range of reconstruction problems that achieves its result in subquadratic time, with a data-dependent complexity loosely upper bounded by O(N^{3/2} log N), but with a more typical log-linear complexity of O(N log^2 N).

This algorithm relies on a stochastic second neighbor search that produces the best edge candidates with high probability, thus bypassing an exhaustive quadratic search.

This algorithm achieves a performance that is many orders of magnitude faster than the quadratic baseline and allows for easy parallelization. The strategy is applicable for algorithms that can be used w/ non-convex objectives, e.g. stochastic gradient descent / simulated annealing.





□ OmniNA: A foundation model for nucleotide sequences

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575543v1

OmniNA represents an endeavor in leveraging foundation models for comprehensive nucleotide learning across diverse species and genome contexts. OmniNA can be fine-tuned to align multiple nucleotide learning tasks with natural language paradigms.

OmniNA employs a transformer-based decoder and is pre-trained with an auto-regressive approach. OmniNA was pre-trained on 91.7 million nucleotide sequences encompassing 1076.2 billion bases, spanning a wide range of species and biological contexts.





□ STIGMA: Single-cell tissue-specific gene prioritization using machine learning

>> https://www.sciencedirect.com/science/article/pii/S0002929723004433

STIGMA predicts the disease-causing probability of genes based on their expression profiles across cell types, while considering the temporal dynamics during the embryogenesis of a healthy (wild-type) organism, as well as several intrinsic gene properties.

In STIGMA, supervised machine learning is applied to the single-cell gene expression data as well as intrinsic gene properties on positive and negative classes.

The STIGMA score that each gene receives is based on the cell type-specific temporal dynamics in gene expression and, to a smaller extent, is based on the gene-intrinsic metrics, including the population level constraint metrics.





□ RfamGen: Deep generative design of RNA family sequences

>> https://www.nature.com/articles/s41592-023-02148-8

RfamGen (RNA family sequence generator), a deep generative model that designs RNA family sequences in a data-efficient manner by explicitly incorporating alignment and consensus secondary structure information.

RfamGen can generate novel and functional RNA family sequences by sampling points from a semantically rich and continuous representation. RfamGen successfully generates artificial sequences with higher activity than natural sequences.





□ SYNTERUPTOR: mining genomic islands for non-classical specialised metabolite gene clusters

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573040v1

SYNTERUPTOR identifies genomic islands in a given genome by comparing its genomic sequence with those of closely related species. SYNTERUPTOR was designed for, and is focused on, identifying SMBGC-containing genomic islands.

The SYNTERUPTOR pipeline requires a dataset consisting of genome files selected by the user from species that are related enough to possess synteny blocks.

SYNTERUPTOR proceeds by performing pairwise comparisons between the amino acid sequences of all Coding DNA Sequences (CDSs) to identify orthologs. Subsequently, it constructs synteny blocks and detects any instances of synteny breaks.





□ ALG-DDI: A multi-scale feature fusion model based on biological knowledge graph and transformer-encoder for drug-drug interaction prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.12.575305v1

ALG-DDI can comprehensively incorporate attribute information, local biological information, and global semantic information. ALG-DDI first employs the Attribute Masking method to obtain the embedding vector of the molecular graph.

ALG-DDI leverages heterogeneous graphs to capture the local biological information between drugs and several highly related biological entities. The global semantic information is also learned from the medicine-oriented large knowledge graphs.

ALG-DDI employs a transformer encoder to fuse the multi-scale drug representations and feed the resulting drug pair vector into a fully connected neural network for prediction.





□ FAVA: High-quality functional association networks inferred from scRNA-seq and proteomics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae010/7513163

FAVA (Functional Associations using Variational Autoencoders) compresses high-dimensional data into a low-dimensional space. FAVA infers networks from high-dimensional omics data with much higher accuracy, across a diverse collection of real as well as simulated datasets.

In latent space, FAVA calculates the Pearson correlation coefficient (PCC) for each pair of proteins, resulting in a functional association network. FAVA can process large datasets w/ over 0.5 million conditions and has predicted 4,210 interactions b/n 1,039 understudied proteins.
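
The correlation step can be sketched as follows, assuming each protein already has a latent vector (here a plain array standing in for a VAE encoder's output). The cut-off `threshold` and the function name are hypothetical; how FAVA selects or scores edges is not described above.

```python
import numpy as np

def correlation_network(latent, names, threshold):
    """latent: (proteins, latent_dims). Returns (name_a, name_b, pcc) edges, strongest first."""
    corr = np.corrcoef(np.asarray(latent, dtype=float))  # rows are treated as variables
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i, j] >= threshold:
                edges.append((names[i], names[j], float(corr[i, j])))
    return sorted(edges, key=lambda edge: -edge[2])
```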





□ FFS: Fractal feature selection model for enhancing high-dimensional biological problems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05619-z

In fractals, a central tenet posits that patterns recur at differing scales. This principle suggests that when one examines a minuscule segment of a fractal and juxtaposes it with a more significant portion of the same fractal, the patterns observed will bear striking resemblance.

FFS (Fractal Feature Selection) is proof of harmonic convergence of a low-complexity system with remarkable performance. FFS partitions features into blocks, measures similarity using the Root Mean Square Error (RMSE), and determines feature importance based on low RMSE values.

By conceptualizing these attributes as blocks, where each block corresponds to a particular data category, the proposed model finds that blocks with common similarities are often associated with specific data categories.





□ CytoCommunity: Unsupervised and supervised discovery of tissue cellular neighborhoods from cell phenotypes

>> https://www.nature.com/articles/s41592-023-02124-2

CytoCommunity learns a mapping directly from the cell phenotype space to the TCN space using a graph neural network model without intermediate clustering of cell embeddings.

By leveraging graph pooling, CytoCommunity enables de novo identification of condition-specific and predictive TCNs under the supervision of sample labels.

CytoCommunity formulates TCN identification as a community detection problem on graphs and uses a graph minimum cut (MinCut)-based GNN model to identify TCNs.

CytoCommunity directly uses cell phenotypes as features to learn TCN partitions and thus facilitates the interpretation of TCN functions.

CytoCommunity can also identify condition-specific TCNs from a cohort of labeled tissue samples by leveraging differentiable graph pooling and sample labels, which is an effective strategy to address the difficulty of graph alignment.





□ scSNV-seq: high-throughput phenotyping of single nucleotide variants by coupled single-cell genotyping and transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03169-y

scSNV-seq uses transcribed genetic barcodes to couple targeted single-cell genotyping with transcriptomics to identify the edited genotype and transcriptome of each individual cell rather than predicting genotype from gRNA identity.

scSNV-seq allows us to identify benign variants or variants with an intermediate phenotype which would otherwise not be possible.

The methodology is applicable to any other methods for introducing variation such as HDR, prime editing, or saturation genome editing since it does not rely on gRNA identity to infer genotype.





□ Fragmentstein: Facilitating data reuse for cell-free DNA fragment analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae017/7550024

Fragmentstein, a command-line tool for converting non-sensitive cfDNA-fragmentation data into alignment mapping (BAM) files. Fragmentstein complements fragment coordinates with sequence information from a reference genome to reconstruct BAM files.

Fragmentstein creates alignment files for each sample using only non-sensitive information. The original alignment files and the alignment files generated by Fragmentstein were subjected to fragment length, copy number and nucleosome occupancy analysis.





□ DLemb / BioKG2Vec: PREDICTING GENE DISEASE ASSOCIATIONS WITH KNOWLEDGE GRAPH EMBEDDINGS FOR DISEASES WITH CURTAILED INFORMATION

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575314v1

BioKG2Vec relies on a biased random-walk approach in which the user can prioritize specific connections by assigning a weight to edges. In the KG defined in this work we used 4 different node-types: drug, protein, function and disease.

DLemb is a shallow neural network. The input layer takes as input KG entities as numbers and outputs them to the embedding layer. Subsequently, embeddings are normalized, and a dot product is calculated between them resulting in the output layer.

DLemb is trained on batches of correct and wrong links in the KG, which serve as positive and negative examples in what can be conceived as a link-prediction task. Embeddings are then optimized at every epoch by minimizing RMSE with the Adam optimizer.
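
The forward pass and loss described above can be sketched with NumPy. This is an illustration only: the embedding size, the 1/0 labels for correct and wrong links, and the helper names are assumptions, and the Adam update step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, dim = 6, 4
# Embedding layer: one row per KG entity (sizes are arbitrary assumptions).
embeddings = rng.normal(size=(n_entities, dim))

def score_links(emb, heads, tails):
    """L2-normalize the embeddings and return the dot product for each link."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.sum(unit[heads] * unit[tails], axis=1)

def rmse(pred, target):
    """Root mean squared error between predicted scores and labels."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# One batch of links given as entity indices.
# Assumed labels: 1.0 for correct links, 0.0 for wrong links.
heads = np.array([0, 1, 2, 3])
tails = np.array([1, 2, 4, 5])
labels = np.array([1.0, 1.0, 0.0, 0.0])

scores = score_links(embeddings, heads, tails)
loss = rmse(scores, labels)
print(scores, loss)
```

Since the embeddings are normalized, each score is a cosine similarity in [-1, 1]; training would adjust the embedding rows so that correct links score higher than wrong ones.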





□ POP-GWAS: Valid inference for machine learning-assisted GWAS

>> https://www.medrxiv.org/content/10.1101/2024.01.03.24300779v1

POP-GWAS (Post-prediction GWAS) provides unbiased estimates and well-calibrated type-I error, is universally more powerful than conventional GWAS on the observed phenotype, and makes minimal assumptions about the variables used for imputation and the choice of prediction algorithm.

POP-GWAS imputes the phenotype in both labeled and unlabeled samples, and performs three GWAS: GWAS of the observed and imputed phenotype in labeled samples, and GWAS on the imputed phenotype in unlabeled samples.
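
The three association scans can be illustrated for a single SNP on simulated data. The simulation settings, the `slope` helper, and especially the `combine` rule with its weight `omega` are assumptions of this sketch; the summary above does not state how POP-GWAS combines the three estimates.

```python
import numpy as np

def slope(x, y):
    """Marginal regression slope of y on one SNP x (both centered)."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (x @ x))

rng = np.random.default_rng(1)
n_lab, n_unlab = 500, 5000
g_lab = rng.binomial(2, 0.3, n_lab).astype(float)
g_unlab = rng.binomial(2, 0.3, n_unlab).astype(float)

# Simulated phenotypes; the unlabeled phenotype is never used directly.
y_lab = 0.2 * g_lab + rng.normal(size=n_lab)
y_unlab = 0.2 * g_unlab + rng.normal(size=n_unlab)
# Stand-in for a prediction algorithm: a noisy copy of the phenotype.
yhat_lab = 0.7 * y_lab + rng.normal(scale=0.5, size=n_lab)
yhat_unlab = 0.7 * y_unlab + rng.normal(scale=0.5, size=n_unlab)

# The three GWAS, reduced here to one SNP each.
b_y_lab = slope(g_lab, y_lab)              # observed phenotype, labeled
b_yhat_lab = slope(g_lab, yhat_lab)        # imputed phenotype, labeled
b_yhat_unlab = slope(g_unlab, yhat_unlab)  # imputed phenotype, unlabeled

def combine(b_y_lab, b_yhat_lab, b_yhat_unlab, omega):
    """Assumed combination: correct the labeled estimate using the gap
    between the two imputed-phenotype estimates, scaled by omega."""
    return b_y_lab - omega * (b_yhat_lab - b_yhat_unlab)

b_pop = combine(b_y_lab, b_yhat_lab, b_yhat_unlab, omega=0.5)
print(b_y_lab, b_yhat_lab, b_yhat_unlab, b_pop)
```

With `omega=0` the combination falls back to the conventional GWAS estimate on the observed phenotype, which is one way to see why borrowing from the unlabeled samples should not cost validity.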





□ GLDADec: marker-gene guided LDA modelling for bulk gene expression deconvolution

>> https://www.biorxiv.org/content/10.1101/2024.01.08.574749v1

GLDADec (Guided Latent Dirichlet Allocation Deconvolution) utilizes marker gene names as partial prior information to estimate cell type proportions, thereby overcoming the challenges of conventional reference-based and reference-free methods simultaneously.

GLDADec employs a semi-supervised learning algorithm that combines cell-type marker genes with additional factors that may influence gene expression profiles to achieve a robust estimation of cell type proportions. An ensemble strategy is used to aggregate the output.





□ scGOclust: leveraging gene ontology to compare cell types across distant species using scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574675v1

scGOclust constructs a functional profile of individual cells by multiplication of a gene expression count matrix of cells and a binary matrix with GO BP annotations of genes.
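
The matrix product described above can be shown on a toy example. The matrix values and variable names are made up for illustration.

```python
import numpy as np

# Gene expression counts: 2 cells x 3 genes (toy values).
counts = np.array([[5, 0, 2],
                   [0, 3, 1]])

# Binary GO BP annotation: 3 genes x 2 GO terms (1 = gene annotated to term).
go_bp = np.array([[1, 0],
                  [0, 1],
                  [1, 1]])

# Functional profile: cells x GO terms, summing counts of annotated genes.
go_profile = counts @ go_bp
print(go_profile)
```

Each entry of `go_profile` is the total count of the genes annotated to that GO term in that cell, so the result can be passed to the same normalization and clustering steps as an ordinary count matrix.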

This GO BP feature matrix is treated similarly to a count matrix in classic single-cell RNA sequencing (scRNA-seq) analysis and is subjected to dimensionality reduction and clustering analyses.

scGOclust recapitulates the function spectrum of different cell types, characterises functional similarities between homologous cell types, and reveals functional convergence between unrelated cell types.





□ MATES: A Deep Learning-Based Model for Locus-specific Quantification of Transposable Elements in Single Cell

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574909v1

MATES (Multi-mapping Alignment for TE loci quantification in Single-cell), a novel deep neural network-based method tailored for accurate locus-specific TE quantification in single-cell sequencing data across modalities.

MATES harnesses the distribution of uniquely mapped reads flanking TE loci to assign multi-mapping TE reads for locus-specific TE quantification.

MATES captures complex relationships between the context distribution of unique-mapping reads flanking TE loci and the probability of multi-mapping reads being assigned to those loci, and handles multi-mapping read assignments probabilistically based on the local context of the TE loci.
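
A generic sketch of probabilistic multi-mapping assignment, assuming per-locus probabilities are already available. In MATES these would come from the neural network; here `locus_prob` holds made-up numbers, and the fractional-count scheme and the `assign_multimappers` helper are assumptions rather than the published procedure.

```python
from collections import defaultdict

def assign_multimappers(reads, locus_prob):
    """Split each multi-mapping read across its candidate loci in
    proportion to the probability attached to each locus."""
    counts = defaultdict(float)
    for candidates in reads:
        total = sum(locus_prob[locus] for locus in candidates)
        if total == 0:
            continue  # no locus has support; leave the read unassigned
        for locus in candidates:
            counts[locus] += locus_prob[locus] / total
    return dict(counts)

# Hypothetical per-locus probabilities and candidate loci per read.
locus_prob = {"L1": 0.6, "L2": 0.3, "L3": 0.1}
reads = [["L1", "L2"], ["L1", "L2"], ["L2", "L3"]]

te_counts = assign_multimappers(reads, locus_prob)
print(te_counts)
```

Every assigned read contributes a total weight of one, so the fractional counts can be added to the unique-read counts without inflating library size.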





□ COFFEE: CONSENSUS SINGLE CELL-TYPE SPECIFIC INFERENCE FOR GENE REGULATORY NETWORKS

>> https://www.biorxiv.org/content/10.1101/2024.01.05.574445v1

COFFEE (COnsensus single cell-type speciFic inFerence for gEnE regulatory networks), a Borda voting based consensus algorithm that integrates information from 10 established GRN inference methods.
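
Borda voting over ranked edge lists can be sketched as follows. This is a plain Borda count on toy rankings from three hypothetical methods; the edge names, the tie-breaking rule, and the `borda_consensus` helper are assumptions and not COFFEE's implementation.

```python
def borda_consensus(rankings):
    """Sum Borda points per edge: in a ranking of n edges, the edge at
    0-based position p receives n - p points. Ties break by edge name."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, edge in enumerate(ranking):
            scores[edge] = scores.get(edge, 0) + (n - position)
    ordered = sorted(scores, key=lambda edge: (-scores[edge], edge))
    return ordered, scores

# Each inner list is one method's edges, best first (toy data).
rankings = [
    [("TF1", "G1"), ("TF2", "G1"), ("TF1", "G2")],
    [("TF2", "G1"), ("TF1", "G1"), ("TF1", "G2")],
    [("TF1", "G1"), ("TF1", "G2"), ("TF2", "G1")],
]

consensus, scores = borda_consensus(rankings)
print(consensus)
```

Because only ranks enter the vote, methods whose raw edge scores live on different scales can be combined without rescaling.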

COFFEE has improved performance across synthetic, curated and experimental datasets when compared to baseline methods.

COFFEE is stable across differing datasets; even with curated data, the consensus approach captures high-confidence edges when compared to the ground truth data.





□ HAT: de novo variant calling for highly accurate short-read and long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad775/7510834

Hare And Tortoise (HAT) is an automated de novo variant (DNV) detection workflow for highly accurate short-read and long-read sequencing data.

HAT is a computational workflow that begins with aligned read data (i.e., CRAM or BAM) from a parent-child sequenced trio and outputs DNVs. The HAT workflow consists of three main steps: GVCF generation, family-level genotyping, and filtering of variants to get final DNVs.
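
The core genotype pattern behind a de novo call can be sketched in a few lines. This shows only the textbook trio criterion (alternate allele in the child, both parents homozygous reference); the `is_candidate_dnv` helper and the tuple genotype encoding are assumptions, and HAT's actual filtering step is not reproduced here.

```python
HOM_REF = (0, 0)

def is_candidate_dnv(child_gt, father_gt, mother_gt):
    """Return True when the child carries a non-reference allele that
    neither parent's genotype contains (both parents are 0/0)."""
    child_has_alt = any(allele != 0 for allele in child_gt)
    return child_has_alt and father_gt == HOM_REF and mother_gt == HOM_REF

# Toy family-level genotypes at three sites: (child, father, mother).
sites = {
    "site1": ((0, 1), (0, 0), (0, 0)),  # candidate DNV
    "site2": ((0, 1), (0, 1), (0, 0)),  # inherited from the father
    "site3": ((0, 0), (0, 0), (0, 0)),  # no variant
}
candidates = [name for name, gts in sites.items() if is_candidate_dnv(*gts)]
print(candidates)
```

A real workflow would add quality filters on top of this pattern, since a genotyping error in any of the three family members can mimic a de novo event.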

HAT detects high-quality DNVs from Illumina short-read whole-exome sequencing, Illumina short-read whole-genome sequencing, and highly accurate PacBio HiFi long-read whole-genome sequencing data.





□ SVCR: The Scalable Variant Call Representation: Enabling Genetic Analysis Beyond One Million Genomes

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574205v1

SVCR achieves its scalability by adopting reference blocks from the Genomic Variant Call Format (GVCF) and employing local allele indices. SVCR is also lossless and mergeable, allowing for N+1 and N+K incremental joint-calling.
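
The idea of local allele indices can be illustrated with a toy multiallelic site. The field names `LA` (local alleles) and `LGT` (local genotype) are used here as assumed labels for the sketch, and the helper function is hypothetical.

```python
def local_to_global_gt(local_gt, local_alleles):
    """Translate a genotype written in a sample's local allele indices
    into indices over the site-wide allele list."""
    return tuple(local_alleles[index] for index in local_gt)

# Site-wide alleles across the whole cohort; index 0 is the reference.
site_alleles = ["A", "C", "G", "AT"]

# This sample only ever saw the reference and "AT", so it stores two
# local alleles and a genotype over those two, not over all four.
sample = {"LA": [0, 3], "LGT": (0, 1)}

global_gt = local_to_global_gt(sample["LGT"], sample["LA"])
called_bases = [site_alleles[index] for index in global_gt]
print(global_gt, called_bases)
```

Per-sample list fields then scale with the handful of alleles a sample actually carries rather than with every alternate allele seen in the cohort, which is where the space saving at highly multiallelic sites would come from.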

Two implementations are presented: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Their experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files.

The authors also describe VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs, along with unique features of VDS that enable rapid data analysis.

PVCF defines the semantics of fields such as GT, AD, GP, PL, and, for list fields, the relationship between their length and the number of alternate alleles. VCF, as a format, describes, for example, how a number or a list is rendered in plaintext.

PVCF represents a collection of sequences as a dense matrix, with one column per sequenced sample and one row for every variant site. PVCF permits both a multiallelic representation (wherein each locus appears in at most one row) and a biallelic representation.





□ Poincaré and SimBio: a versatile and extensible Python ecosystem for modeling systems.

>> https://www.biorxiv.org/content/10.1101/2024.01.10.574883v1

Poincaré allows defining differential equation systems, while SimBio builds on it for defining reaction networks. They are focused on providing an ergonomic experience to end-users by integrating well with IDEs and static analysis tools through the use of standard modern Python syntax.

The models built using these packages can be introspected to create other representations, such as graphs connecting species and/or reactions, or tables with parameters or equations.





□ Secreted Particle Information Transfer (SPIT) - A Cellular Platform For In Vivo Genetic Engineering

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575257v1

Compared to the limited packaging capacities of contemporary in vivo gene therapy delivery platforms, a human cell's nucleus contains approximately 6 billion base pairs of information. They hypothesized that human cells could be applied as vectors for in vivo gene therapy.

In SPIT, a cell is modified to secrete a genetic engineering enzyme within a particle that transfers this enzyme into a recipient cell, where it manipulates genetic information.





□ Decoder-seq enhances mRNA capture efficiency in spatial RNA sequencing

>> https://www.nature.com/articles/s41587-023-02086-y

Decoder-seq (Dendrimeric DNA coordinate barcoding design for spatial RNA sequencing) combines dendrimeric nanosubstrates with microfluidic coordinate barcoding to generate spatial arrays with a DNA density approximately ten times higher than previously reported methods.

Decoder-seq improves the detection of lowly expressed olfactory receptor (Olfr) genes in mouse olfactory bulbs and contributes to the discovery of a unique layer enrichment pattern for two Olfr genes.





□ GVRP: Genome Variant Refinement Pipeline for variant analysis in non-human species using machine learning

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575595v1

GVRP employs a machine learning-based approach to refine variant calls in non-human species. Rather than training separate variant callers for each species, it uses a machine learning model to accurately identify variations and filter out false positives from DeepVariant.

In GVRP, they omit certain DeepVariant preprocessing steps and leverage the ground-truth Genome In A Bottle (GIAB) variant calls to train the machine learning model for non-human species genome variant refinement.





□ BAMBI: Integrative biostatistical and artificial-intelligence method discover coding and non-coding RNA genes as biomarkers

>> https://www.biorxiv.org/content/10.1101/2024.01.12.575460v1

BAMBI (Biostatistics and Artificial-Intelligence integrated Method for Biomarker Identification), a robust pipeline that identifies both coding and non-coding RNA biomarkers for disease diagnosis and prognosis.

BAMBI can process RNA-seq data and microarray data to pinpoint a minimal yet highly predictive set of RNA biomarkers, thus facilitating their clinical application.

BAMBI offers visualization of biomarker expression and interpretation of their functions using co-expression networks and literature mining, enhancing the interpretability of the results.





□ PoMoCNV: Inferring the selective history of CNVs using a maximum likelihood model

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575676v1

PoMoCNV (POlymorphism-aware phylogenetic MOdel for CNV datasets) infers the fitness parameters and transition rates associated with different copy numbers along branches in the phylogenetic tree, tracing back in time.

Using the phylogenetic tree of populations and estimated copy numbers, PoMoCNV infers the evolutionary parameters governing CNV evolution along branches.

In PoMoCNV, the likelihood of this birth-death process is modeled per genomic segment, taking into account the copy number (allele) fitness and frequencies.





□ O-LGT: Online Hybrid Neural Network for Stock Price Prediction: A Case Study of High-Frequency Stock Trading in the Chinese Market

>> https://www.mdpi.com/2225-1146/11/2/13

O-LGT, an online hybrid recurrent neural network model tailored for analyzing limit order book (LOB) data and predicting stock price fluctuations in a high-frequency trading (HFT) environment.

O-LGT combines LSTM, GRU, and transformer layers, and features efficient storage management. When computing the stock forecast for the immediate future, O-LGT uses only the output calculated from the previous trading data together with the current trading data.





□ GYOSA: A Distributed Computing Solution for Privacy-Preserving Genome-Wide Association Studies

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575678v1

GYOSA, a secure and privacy-preserving distributed genomic analysis solution. Unlike in previous work, GYOSA follows a distributed processing design that enables handling larger amounts of genomic data in a scalable and efficient fashion.

GYOSA provides transparent authenticated encryption, which protects sensitive data from being disclosed to unwanted parties and ensures anti-tampering properties for clients' data stored in untrusted infrastructures.





□ KaMRaT: a C++ toolkit for k-mer count matrix dimension reduction

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575511v1

KaMRaT (k-mer Matrix Reduction Toolkit) is a program for processing large k-mer count tables extracted from high throughput sequencing data.

Major functions include scoring k-mers based on count statistics, merging overlapping k-mers into longer contigs and selecting k-mers based on their presence in certain samples.

KaMRaT merge builds on the concept of local k-mer extension ("unitigs") to improve extension precision by leveraging count data. KaMRaT enables the identification of condition-specific or differential sequences, irrespective of any gene or transcript annotation.
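
Local k-mer extension can be sketched as a greedy merge of k-mers that overlap by k-1 bases whenever the extension is unambiguous. This simplified version ignores reverse complements and the count-based checks that KaMRaT adds; the `merge_kmers` helper is hypothetical.

```python
def merge_kmers(kmers):
    """Merge k-mers overlapping by k-1 bases into contigs, extending only
    when a k-mer has exactly one successor and that successor has exactly
    one predecessor. Pure cycles are skipped in this sketch."""
    kmers = set(kmers)
    by_prefix, by_suffix = {}, {}
    for km in kmers:
        by_prefix.setdefault(km[:-1], []).append(km)
        by_suffix.setdefault(km[1:], []).append(km)

    def unique_next(km):
        candidates = by_prefix.get(km[1:], [])
        if len(candidates) == 1 and len(by_suffix[km[1:]]) == 1:
            return candidates[0]
        return None

    successors = {km: unique_next(km) for km in kmers}
    has_predecessor = {nxt for nxt in successors.values() if nxt is not None}

    contigs = []
    for start in sorted(kmers - has_predecessor):
        contig, current = start, start
        while successors[current] is not None:
            current = successors[current]
            contig += current[-1]  # each step adds one new base
        contigs.append(contig)
    return contigs

contigs = merge_kmers(["ACGT", "CGTA", "GTAC", "TTTG"])
print(contigs)
```

Stopping at every branch keeps the contigs conservative; using sample counts to decide whether two overlapping k-mers really belong together is the refinement the summary attributes to KaMRaT merge.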





□ EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow

>> https://www.biorxiv.org/content/10.1101/2024.01.17.575961v1

EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis.

EvoAug-TF is a TensorFlow implementation of EvoAug (a PyTorch package) that provides the ability to train genomic DNNs with evolution-inspired data augmentations. EvoAug-TF improves generalization and model interpretability with attribution methods.
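
To give a feel for evolution-inspired augmentations, the sketch below applies three simple perturbations to a DNA string in plain Python. The function names, parameters, and the choice to pad after a deletion are assumptions made for illustration; this is not the EvoAug-TF API.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def random_mutation(seq, rate, rng):
    """Replace each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def random_deletion(seq, length, rng):
    """Delete a random stretch of `length` bases, then pad the end with
    random bases so the model input keeps a fixed length."""
    start = rng.randrange(len(seq) - length + 1)
    kept = seq[:start] + seq[start + length:]
    pad = "".join(rng.choice("ACGT") for _ in range(length))
    return kept + pad

rng = random.Random(0)
seq = "ACGTTGCAACGT"
augmented = [
    random_mutation(seq, 0.1, rng),
    reverse_complement(seq),
    random_deletion(seq, 3, rng),
]
print(augmented)
```

In training, such perturbations would be applied on the fly to each mini-batch, so the network sees a slightly different version of every sequence at each epoch.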





□ SLEDGe: Inference of ancient whole genome duplications using machine learning

>> https://www.biorxiv.org/content/10.1101/2024.01.17.574559v1

SLEDGe (Supervised Learning Estimation of Duplicated Genomes) provides a novel means to repeatably and rapidly infer ancient WGD events from Ks plots derived from genomic or transcriptomic data.

SLEDGe can simulate ancient WGDs of multiple ages and across a range of gene birth and death rates. It provides the first model-based approach to infer WGDs in Ks plots and makes WGD interpretation more repeatable and consistent.




Peter Kochinsky

>> https://rapport.bio/all-stories/semper-maior-spirits-rising-january-2024

Do you think of biotech as wasteful? How much of the biotech Universe's cash is locked away in companies that have lingered all year with a negative enterprise value? We looked.

Interested in the relevance of M&A to sector returns? How much of the returns from M&A accrue to companies held by at least one specialist? At least three? We looked.

What's it all mean for private companies looking to get public?

And overshadowing it all is a question: what can we do to protect the biotech sector and biomedical innovation from the wrong stroke of a pen?