goo blog サービス終了のお知らせ 

lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

Nona.

2023-06-06 18:03:06 | Science News




□ scANNA: Boosting Single-Cell RNA Sequencing Analysis with Simple Neural Attention

>> https://www.biorxiv.org/content/10.1101/2023.05.29.542760v1

scANNA (single-cell Analysis using Neural-Attention) learns salient genes for each cluster enabling accurate / scalable unsupervised annotations. After training scANNA's DL core, the gene attention weights from the Additive Attention Module are used as input for downstream tasks.

scANNA uses the Deep Projection Blocks, which are an ensemble of operators learning a nonlinear mapping between gene scores. This mapping is designed
to increase model capacity and connect the gene associations to the auxiliary objective.





□ COMSE: Analysis of Single-Cell RNA-seq Data Using Community Detection Based Feature Selection

>> https://www.biorxiv.org/content/10.1101/2023.06.03.543526v1

COMSE partitions all genes into different communities in latent space using the Louvain algorithm. A denoising procedure removes noise introduced during sequencing or other procedures. It then selects highly informative genes from each community based on the Laplacian score.

COMSE calculates the Laplacian score with multi-subsample randomization and choose genes with the smallest scores, assuming that data from the same class are often close to each other. COMSE then rank the genes based on gene-gene correlation to remove redundancy.





□ scATAnno: Automated Cell Type Annotation for single-cell ATAC-seq Data

>> https://www.biorxiv.org/content/10.1101/2023.06.01.543296v1

scATAnno, a workflow that directly and automatically annotates scATAC-seq data based on scATAC-seq reference atlases. scATAnno directly uses peaks or CRE genomic regions as input features, eliminating the need to convert the epigenomic features into gene activity scores.

scATAnno uses chromatin state profile of large-scale reference atlas to generate peak signals and reference peaks. scATAnno tackles the high dimensionality of SCATAC-seq data by leveraging spectral embedding to efficiently transform the data into a low dimensional space.

Each query cell is assigned a cell type along with two uncertainty scores: the first uncertainty score is based on the KNN, and the second uncertainty score is derived from a novel computation of the weighted distance between the query cell and reference cell type centroids.






□ Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

>> https://arxiv.org/abs/2305.14342

Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. Sophia only estimates the diagonal Hessian every handful of iterations.

The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. It controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory.

Sophia has a more aggressive pre-conditioner than Adam Sophia applies a stronger penalization to updates in sharp dimensions (where the Hessian is large) than the flat dimensions (where the Hessian is small), ensuring a uniform loss decrease across all parameter dimensions.





□ AlphaDev: Faster sorting algorithms discovered using deep reinforcement learning

>> https://www.nature.com/articles/s41586-023-06004-9

Formulating the problem of discovering new, efficient sorting algorithms as a single-player game that they refer to as AssemblyGame. The AlphaDev learning algorithm can incorporate both DRL as well as stochastic search optimization algorithms to play AssemblyGame.

The primary AlphaDev representation is based on Transformers. AlphaDev discovered small sorting algorithms from scratch that outperformed previously known human benchmarks. These algorithms have been integrated into the LLVM standard C++ sort library.





□ NOS: diffusioN Optimized Sampling: Protein Design with Guided Discrete Diffusion

>> https://arxiv.org/abs/2305.20009

NOS, a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS can perform design directly in sequence space, circumventing significant limitations of structure-based methods, incl. scarce data and inverse design.

NOS generalizes LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance through a novel application of saliency maps.





□ MISATO - Machine learning dataset for structure-based drug discovery

>> https://www.biorxiv.org/content/10.1101/2023.05.24.542082v1

MISATO, a curated dataset of 20000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine protonation states of proteins and small molecule ligands.

Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple python data-loaders. AI baseline models are provided for dynamical and electronic properties.





□ SifiNet: A robust and accurate method to identify feature gene sets and annotate cells

>> https://www.biorxiv.org/content/10.1101/2023.05.24.541352v1

SifiNet (Single-cell feature identification w/ Network topology), a cell-clustering-independent method for directly identifying feature gene sets. SifiNet is based on the observation that co-differentially-expressed genes w/ a cell subpopulation exhibit co-expression patterns.

SifiNet constructs a gene co-expression network and explores its topology to identify feature gene sets. It also applies to scATAC-seq data, generating a gene co-open-chromatin network and exploring network topology to identify epigenomic feature gene sets.





□ scTIE: data integration and inference of gene regulation using single-cell temporal multimodal data

>> https://www.biorxiv.org/content/10.1101/2023.05.18.541381v1

scTIE, an autoencoder-based method for integrating multimodal profiling of scRNA-seq / scATAC-seq data over a time course. scTIE provides the first unified framework for the integration of temporal data and the inference of context-specific GRNs that predict cell fates.

scTIE uses iterative optimal transport (OT) fitting to align cells in similar states between different time points and estimate their transition probabilities. scTIE removes the need for selecting highly variable genes (HVGs) as input through a pair of coupled batchnorm layers.

scTIE provides the means to extract interpretable features from the embedding space by linking the developmental trajectories of cell representations. scTIE formulates a trajectory prediction using the estimated transition probabilities and uses gradient-based saliency mapping.





□ scME: A Dual-Modality Factor Model for Single-Cell Multi-Omics Embedding https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad337/7176368

scME can generate a better joint representation of multiple modalities than those generated by other single-cell multi-omics integration algorithms, which gives a clear elucidation of nuanced differences among cells.

scME relies on clustering to determine the shared and complementary information between modalities. Hence, the parameters of a clustering algorithm, such as resolution of the Leiden algorithm, could affect the efficacy of this algorithm.





□ scBalance – a scalable sparse neural network framework for rare cell type annotation of single-cell transcriptome data

>> https://www.nature.com/articles/s42003-023-04928-6

scBalance, a sparse neural network framework that can automatically label rare cell types in scRNA-seq datasets of all scales. scBalance will automatically choose the weight for each cell type in the reference dataset and construct the training batch.

scBalance leverages the combination of weight sampling and sparse neural network, whereby minor (rare) cell types are more informative without harming the annotation efficiency of the common (major) cell populations.

scBalance will iteratively learn mini batches from a three-layer neural network until the cross-entropy loss converges. In the training stage, scBalance randomly disables neurons in the network.





□ SIMBA: single-cell embedding along with features

>> https://www.nature.com/articles/s41592-023-01899-8

SIMBA is a single-cell embedding method that supports single- or multi-modality analyses. It leverages recent graph embedding techniques to embed cells and genomic features into a shared latent space.

SIMBA introduces several crucial procedures, including Softmax transformation, weight decay for controlling overfitting and entity-type constraints to generate comparable embeddings (co-embeddings) of cells and features and to address unique challenges in single-cell data.





□ gRNAde: Multi-State RNA Design with Geometric Multi-Graph Neural Networks

>> https://arxiv.org/abs/2305.14749

gRNAde, a geometric deep learning-based pipeline for RNA sequence design conditioned on multiple backbone conformations.

gRNAde explicitly accounts for RNA conformational flexibility via a novel multi-Graph Neural Network architecture which independently encodes a set of conformers via message passing, followed by conformer order-invariant pooling and sequence design.





□ Cellenium—a scalable and interactive visual analytics app for exploring multimodal single-cell data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad349/7188099

Cellenium, a full-stack scalable visual analytics web application which enables users to semantically integrate and organize all their single-cell RNA-, ATAC- , and CITE-sequencing studies.

Cellenium consists of a central Postgres database for hosting all expression- and meta-data, a Postgraphile based GraphQL API layer. Cellenium precalculates differential gene expressions between each annotated cell type and all other cells.





□ Lineage motifs: developmental modules for control of cell type proportions

>> https://www.biorxiv.org/content/10.1101/2023.06.06.543925v1

Lineage Motif Analysis (LMA), a method that recursively identifies statistically overrepresented patterns of cell fates on lineage trees as potential signatures of committed progenitor states.

LMA is based on motif detection, which has been used to identify the building blocks of complex regulatory networks, DNA sequences, and other biological features.

Motifs could be generated by progenitors intrinsically programmed to autonomously give rise to specific patterns of descendant cell fates. It reflects developmental programs invl. extrinsic cues and cell-cell signaling that generate correlated cell fate patterns on lineage trees.





□ MCPNet: A parallel maximum capacity-based genome-scale gene network construction framework

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad373/7192172

MCP (Maximum Capacity Path) Score, a novel maximum-capacity-path based metric to quantify the relative strengths of direct and indirect gene-gene interactions. MCPNet combines interactions from multiple path lengths using optimized weights identified with partial groundtruth.





□ Spider: a flexible and unified framework for simulating spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.05.21.541605v1

Spider generates locations of cells on a plate randomly or in a uniform grid-like pattern. Spider supports various neighborhood metrics, such as k-nearest neighbors or neighbors identified by Delaunay triangulation.





□ SanntiS: Expansion of novel biosynthetic gene clusters from diverse environments

>> https://www.biorxiv.org/content/10.1101/2023.05.23.540769v1

At the core of SanntiS is the detection model, an Artificial Neural Network with a one-dimensional convolutional layer, plus a BiLSTM. The model was developed using linearized sequences of protein annotations based on a subset of InterPro as input.

SanntiS employes a duration robust loss function (RLF). RLF mitigates the issue of class imbalance, which can arise from the disparities in BGC counts by class and the variation in the duration of detection events - the disparities in length across different BGC classes.





□ Identification of Biochemical Pathways Responsible for Distinct Phenotypes Using Gene Ontology Causal Activity Models

>> https://www.biorxiv.org/content/10.1101/2023.05.22.541760v1

Phenotypic variability among affected individuals described as incomplete penetrance and variable expressivity can be the result of interactions between the mutated gene and other genes with which it normally interacts.

Integrating the information about human biology from Reactome with model-organism biology from MGI. It can be used not only to understand the similarities of the pathways but as a testing ground for manipulation of pathways in more experimentally tractable organisms than human.





□ MetaBayesDTA: codeless Bayesian meta-analysis of test accuracy, with or without a gold standard

>> https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-023-01910-y

MetaBayesDTA uses the bivariate model to conduct analysis assuming a perfect reference test, and users can also conduct univariate meta-regression and subgroup analysis. It uses latent class models (LCMs) to conduct analyses without assuming a perfect gold standard.

MetaBayesDTA allows the user to run models assuming conditional independence or dependence, options for whether to model the reference and index test sensitivities and specificities as fixed or random effects, and can model multiple reference tests using a meta-regression covariate.





□ WebAtlas pipeline for integrated single cell and spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.05.19.541329v1

WebAtlas incorporates integrated scRNA-seq, imaging- and sequencing-based ST datasets for interactive web visualisation, enabling cross-query of cell types and gene expressions across modalities.

WebAtlas unifies commonly used atlassing technologies into the cloud-optimised Zarr format and builds on Vitessce to enable remote data navigation. On WebAtlas, single cell and spatial datasets are linked by biomolecular metadata.

Linkage is performed prior to WebAtlas ingestion using existing data integration methods like Cell2location and StabMap that map scRNA-seq cell type references onto ST datasets and impute unobserved gene expression in the latter.





□ ROCCO: A Robust Method for Detection of Open Chromatin via Convex Optimization

>> https://www.biorxiv.org/content/10.1101/2023.05.24.542132v1

ROCCO determines consensus open chromatin regions across multiple samples simultaneously. ROCCO uses robust summary statistics across samples by solving a constrained optimization problem formulated to account for both enrichment & spatial features of open chromatin signal data.

The model accounts for features common to the edges of accessible chromatin regions, which are often hard to determine based on independently determined sample peaks that can vary widely in their genomic locations.





□ FuzzyPPI: Human Proteome at Fuzzy Semantic Space

>> https://www.biorxiv.org/content/10.1101/2023.05.24.541959v1

FuzzyPPI, a fuzzy semantic scoring function using the Gene Ontology (GO) graphs to assess the binding affinity between any two proteins at an organism level.

FuzzyPPI also constructs a fuzzy semantic network at proteome level from the above designed binding affinity function and extraction of meaningful biological insights.





□ Classifying high-dimensional phenotypes with ensemble learning

>> https://www.biorxiv.org/content/10.1101/2023.05.29.542750v1

A meta-analysis of 33 algorithms across 20 datasets containing over 20,000 high-dimensional shape phenotypes using an ensemble learning framework. Both binary and multi-class (e.g., species, genotype, population) classification tasks were considered.

They employs phenotypic datasets containing a range of anatomical data from different organisms with unique class distributions. Blending ensemble approaches involve strategically stacking a set of individual classifiers using a holdout validation set to improve performance.





□ buttery-eel: Accelerated nanopore basecalling with SLOW5 data format

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad352/7186500

Buttery-eel, an open-source wrapper for Oxford Nanopore’s Guppy basecaller that enables SLOW5 data access, resulting in performance improvements that are essential for scalable, affordable basecalling.

Buttery-eel/BLOW5 demonstrates a ~3-fold performance improvement when using FAST basecalling, compared to ~20% improvement with HAC basecalling. This suggests that there is an underlying bottleneck in data access on the PromethION.





□ FAST: Flexible Analysis of Spatial Transcriptomics Data: A Deconvolution Approach

>> https://www.biorxiv.org/content/10.1101/2023.05.26.542550v1

A novel reference-free method based on regularized non-negative matrix factorization (NMF), named Flexible Analysis of Spatial Transcriptomics (FAST), that can effectively incorporate gene expression data, spatial coordinates, and histology information into a unified deconvolution framework.

FADT is adaptable to any graph Laplacian matrix, allowing for flexibility in its application. The second term imposes a constraint on cell proportions, encouraging their summation equals one.





□ autoStreamTree: Genomic variant data fitted to geospatial networks

>> https://www.biorxiv.org/content/10.1101/2023.05.27.542562v1

autoStreamTree provides a companion library of functions for calculating various measures of genetic distances among individuals or populations, including model-corrected p-distances as well as those based on allele frequencies.

autoStreamTree includes integrated functions for parsing an input vector shapefile of streams for calculation of pairwise stream distances b/n sites, as well as the ordinary or weighted least-squares fitting of reach-wise genetic distances according to the "stream tree" model.





□ Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02971-4

the Hierarchical Interleaved Bloom Filter (HIBF) that overcomes major limitations of the IBF data structure. The HIBF successfully decouples the user input from the internal representation, enabling it to handle unbalanced size distributions and millions of samples.

The HIBF structure has enormous potential. It can be used on its own, like in the tool Raptor, or can serve as a prefilter to distribute more advanced analyses such as read mapping. Querying ten million reads could be done by querying 11 HIBFs on different machines in parallel.





□ A survey of mapping algorithms in the long-reads era

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02972-3

Adapting and tailoring long-read aligners to such applications will significantly improve analysis over the limited possibilities existing with short reads. Moreover, using pangenomes represented as graphs made from a set of reference genomes is becoming more prevalent.

As a result, long-read mapping to these structures is a novel and active field for genomic reads but should soon expand to other applications such as transcriptomics.

Notably, pangenome graphs vary in definition and structure (overlap graphs, de Bruijn graphs, graphs of minimizers) and therefore expect a diversified algorithmic response to mapping sequences on these graphs.





□ SVcnn: an accurate deep learning-based method for detecting structural variation based on long-read data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05324-x

SVcnn accurately detects DELs, INSs, DUPs, and INVs. SVcnn is a convolutional neural network (CNN) based method. It uses hierarchical clustering to identify if a region contains multi-allelic SVs. Moreover, SVcnn utilizes the LetNet model to distinguish whether an SV is a true SV or not.

The input of SVcnn consists of (i) a sorted long read bam file and (ii) a reference file. SVcnn mainly consists of three main steps: (1) Detecting candidate SVs, (2) Converting to image and building model, (3) Filtering and outputting SVs.





□ Epiphany: predicting Hi-C contact maps from 1D epigenomic signals

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02934-9

Epiphany, a neural network to predict cell-type-specific Hi-C contact maps from widely available epigenomic tracks. Epiphany uses Bi-LSTM layers to capture long-range dependencies and optionally a generative adversarial network architecture to encourage contact map realism.

Epiphany can be trained with MSE alone or with a combination of MSE and GAN loss. In the latter case, the full model consists of two parts: a generator to extract information and make predictions, and a discriminator to introduce adversarial loss into the training process.





□ networkGWAS: A network-based approach to discover genetic associations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad370/7191773

networkGWAS, a statistically sound approach to network-based GWAS using mixed models and neighborhood aggregation. It allows for population structure correction and for well-calibrated p-values, which are obtained through circular and degree-preserving network permutations.

networkGWAS successfully detects known associations on diverse synthetic phenotypes. It employs a FaST-LMM-Set like model to estimate the statistical associations with the phenotype of choice. networkGWAS presents higher recall in comparison to dmGWAS per each precision value.





□ NoVaTeST: Identifying Genes with Location Dependent Noise Variance in Spatial Transcriptomics Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad372/7191774

NoVaTeST pipeline that offers a more general spatial gene expression modeling in ST data using the heteroscedastic Gaussian process. The pipeline uses Wilcoxon signed rank test and FDR correction to identify genes with location-dependent noise variance.





□ TreeTerminus - Creating transcript trees using inferential replicate counts

>> https://www.sciencedirect.com/science/article/pii/S2589004223010386

TreeTerminus, a data-driven approach for grouping transcripts into a tree structure where leaves represent individual transcripts and internal nodes represent an aggregation of a transcript set.

TreeTerminus constructs trees such that, on average, the inferential uncertainty decreases as ascending the tree topology. It provides the flexibility to analyze data at nodes that are at different levels of resolution and can be tuned depending on the analysis of interest.





Morta.

2023-06-06 18:00:06 | Science News




□ hifiasm ultra-long (UL): Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph

>> https://arxiv.org/abs/2306.03399

hifiasm (UL) provides an ultra-fast and robust solution for telomere-to-telomere genome assemblies in a population-scale. hifiasm (UL) will facilitate a more comprehensive understanding of complex genomic regions such as centromeres and highly repetitive segmental duplications.

Hifiasm (UL) constructs an integer graph by utilizing ultra-long integer sequences and their overlaps. hifiasm (UL) employs highly aggressive graph cleaning strategies to eliminate ambiguous edges associated with each node.

hifiasm (UL) produces its sequence by concatenating the subsequences of nodes. Each resulting contig is an integer sequence that is significantly longer than any individual ultra-long read. These integer contigs represent the paths that can untangle intricate structures.





□ hifieval: Evaluation of haplotype-aware long-read error correction

>> https://www.biorxiv.org/content/10.1101/2023.06.05.543788v1

hifieval evaluates phased assemblies and can distinguish under-corrections and over-corrections. It is perhaps the first user-facing EC evaluation tool that can be easily deployed to users' own datasets.

hifieval calculates three metrics: correct corrections (CC), errors that are in raw reads but not in corrected reads; under-corrections (UC), errors present in both raw and corrected reads; and over-corrections (OC), new errors found in corrected reads but not in raw reads.





□ Ewald-based Long-Range Message Passing for Molecular Graphs

>> https://arxiv.org/abs/2303.04791

Ewald message passing (MP) is a general framework that complements existing GNN layers in analogy to how the frequency-truncated long-range part complements the distance-truncated short-range part in Ewald summation.

Ewald message passing is architecture-agnostic and computationally efficient, which we demonstrate by implementing and testing it as a modification on top of existing GNN models. Ewald MP is more suitable for large or periodic structures containing a diverse set of atoms.





□ Pathformer: biological pathway informed Transformer model integrating multi-modal data of cancer

>> https://www.biorxiv.org/content/10.1101/2023.05.23.541554v1

Pathformer, a biological pathway informed deep learning model based on Transformer with bias to integrate multi-modal data. Pathformer leverages criss-cross attention mechanism to capture crosstalk between different biological pathways and between different modalities.

Pathformer utilizes a sparse neural network based on pathway knowledge to transform gene embeddings into pathway embeddings. Pathway crosstalk matrix is used to guide the direction of information flow, and updated according to encoded pathway embedding in each Transformer block.





□ DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing

>> https://arxiv.org/abs/2306.01794

DiffPack, an autoregressive torsional diffusion model that learns the joint distribution of side-chain torsional angles, the only degrees of freedom in side-chain packing, by diffusing and denoising on the torsional space.

DiffPack use s an SE(3)- invariant network to learn the gradient field for the joint distribution of torsional angles. This result in a much smaller conformation space of side-chain, thereby capturing the intricate energy landscape of protein side chains.





□ scGHOST: Identifying single-cell 3D genome subcompartments

>> https://www.biorxiv.org/content/10.1101/2023.05.24.542032v1

scGHOST is a single-cell compartmentalization framework and views scHi-C contact maps as graphs, where genomic loci are vertices in the graph and are connected through edge weights defined by Hi-C contact frequencies among loci.

scGHOST employs a unique random sampling procedure that filters noise in imputed scHi-C data, represents each genomic locus as a continuous-valued vector, and uses graph embedding neural networks to discretize single-cell genomes and identify 3D genome subcompartments.





□ Construction and representation of human pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2023.06.02.542089v1

They collect all publicly available high-quality human haplotypes and constructed the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38).

Building variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb.

Counter-intuitively, a pangenome graph construction tool may in some cases generate different outputs when executed multiple times with the same haplotypes as input.

This unstability could be due to a permutation in the order of the sequences given as input, or non-determinism in the construction algorithm.


□ AGILE Platform: A Deep Learning-Powered Approach to Accelerate LNP Development for mRNA Delivery

>> https://www.biorxiv.org/content/10.1101/2023.06.01.543345v1

The AI-Guided Ionizable Lipid Engineering (AGILE) platform, a synergistic combination of deep learning and combinatorial chemistry. AGILE streamlines the iterative development of ionizable lipids, crucial components for LNP-mediated mRNA delivery.

AGILE utilizes vast amounts of unlabeled data, employing a self-supervised approach to learn differentiable lipid representations. AGILE can identify promising lipids for high mRNA transfection potency in specific cells from a significantly larger combinatorial library.





□ Dictionary learning for integrative, multimodal and scalable single-cell analysis

>> https://www.nature.com/articles/s41587-023-01767-y

‘bridge integration’, which integrates single-cell datasets measuring different modalities by leveraging a separate dataset where both modalities are simultaneously measured as a molecular ‘bridge’.

‘atomic sketch integration’, which combines dictionary learning and dataset sketching to improve the computational efficiency of large-scale single-cell analysis and enables rapid integration of dozens of datasets spanning millions of cells.

Motivated by a similar problem addressed by Laplacian Eigenmaps, they compute an eigen decomposition of the graph Laplacian for the multiomic dataset to reduce the dimensionality from the number of atoms to the number of selected eigenvectors.





□ minimap2-fpga: Integrating hardware-accelerated chaining for efficient end-to-end long-read sequence mapping

>> https://www.biorxiv.org/content/10.1101/2023.05.30.542681v1

minimap2-fpga, a Field Programmable Gate Array (FPGA) based hardware-accelerated version of minimap2 that is end-to-end integrated. minimap2-fpga speeds up the mapping process by integrating an FPGA kernel optimised for chaining.

minimap2-fpga is up to 79% and 53% faster than minimap2 for ∼ 30× ONT and ∼ 50× PacBio datasets, when mapping without base-level alignment. When mapping w/ base-level alignment, minimap2-fpga is up to 62% and 10% faster than minimap2 for ∼ 30× ONT and ∼ 50× PacBio datasets.

The accuracy is near-identical to that of original minimap2 for both ONT and PacBio data, when mapping both with and without base-level alignment. minimap2-fpga is supported on Intel FPGA-based systems and Xilinx FPGA-based systems.





□ StabMap: Stabilized mosaic single-cell data integration using unshared features

>> https://www.nature.com/articles/s41587-023-01766-z

StabMap, a mosaic data integration technique that stabilizes mapping of single-cell data by exploiting the non-overlapping features. StabMap accurately embeds single-cell data from multiple technology sources into the same low-dimensional coordinate space.

StabMap projects all cells onto supervised or unsupervised reference coordinates using all available features regardless of overlap with other datasets, instead relying on traversal along the mosaic data topology.






□ The Damage to Lunar Orbiting Spacecraft Caused by the Ejecta of Lunar Landers

>> https://arxiv.org/abs/2305.12234

The results for ~40 t landers show that the Lunar Orbital Gateway will be impacted by 1000s to 10,000s of particles per square meter but the particle sizes are very small and the impact velocity is low so the damage will be slight.

A spacecraft in Low Lunar Orbit that happens to pass through the ejecta sheet will sustain extensive damage w/ hundreds of millions of impacts per SQM: they are in the hypervelocity regime, and exposed glass on the spacecraft will sustain spallation over 4% of its surface.





□ Anansi: Knowledge-based Integration of Multi-Omic Datasets: Annotation-based Analysis of Specific Interactions

>> https://arxiv.org/abs/2305.10832

Anansi (Annotation-based Analysis of Specific Interactions) relies on the structure provided from knowledge databases. Typically, these are databases that contain knowledge on features and how they interact, for example in the form of a molecular interaction network.

Anansi takes a knowledge-based approach where external databases like KEGG are used to constrain the all-vs-all association hypothesis space, only considering pairwise associations that are a priori known to occur.





□ GNN-C2L: Spatio-relational inductive biases in spatial cell-type deconvolution

>> https://www.biorxiv.org/content/10.1101/2023.05.19.541474v1

GNN-C2L propagates learnable messages on the proximity graph of spot transcripts, effectively leveraging the spatial relationships between spots and exploiting the co-location of cell-types.

GNN-C2L achieves increased deconvolution performance over spatial-agnostic variants. GNN-C2L leverages proximal inductive biases to facilitate enhanced reconstruction of tissue architectures.





□ Tara Oceans + anvi’o: The story behind Mirusviruses

>> https://anvio.org/blog/mirus-discovery/

An unusual phylogenetic signal that guided the recovery of large eukaryotic virus genomes forming their very own phylum at the cross-road between two realms.





□ CADA-BioRE: A Co-adaptive Duality-aware Framework for Biomedical Relation Extraction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad301/7176367

CADA-BioRE, is designed as a bidirectional extraction structure that fully takes interdependence into account in the duality- aware extraction process of subject-object entity pair and relation.

CADA leverages a duality module for inverse extracting triplets and a matching module to correct errors. CADA-BioRE achieves outstanding performance gains even in complex scenarios involving various overlapping patterns, multiple triplets, and cross-sentence triplets.





□ Genekitr: Empowering biologists to decode omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05342-9

Genekitr comprises four modules: gene information retrieval, ID (identifier) conversion, enrichment analysis and publication-ready plotting. The ID conversion module assists in ID-mapping of genes, probes, proteins, and aliases.

Genekitr integrates various functionalities into a single web server: GeneInfo module for batch query gene information, IDConvert and ProbeConvert for gene and probe identifier conversion, GeneEnrich for gene enrichment analysis and Plot module for publication-ready plotting.





□ BioModelsML: Building a FAIR and reproducible collection of machine learning models in life sciences and medicine for easy reuse

>> https://www.biorxiv.org/content/10.1101/2023.05.22.540599v1

The formalisation and pilot implementation of community protocol to enable FAIReR (Findable, Accessible, Interoperable, Reusable, and Reproducible) sharing of ML models.

The trained model should be made available in either native format or ONNX (Open Neural Network Exchange) format when possible. ONNX is a widely used open-source format designed to foster interoperability between different machine learning frameworks.





□ itol.toolkit accelerates working with iTOL (Interactive Tree Of Life) by an automated generation of annotation files

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad339/7177989

This R package also provides an all-in-one data structure to store data and themes, accelerating the step from metadata to annotation files of iTOL visualizations through automatic workflows.





□ kmindex and ORA: indexing and real-time user-friendly queries in terabytes-sized complex genomic datasets

>> https://www.biorxiv.org/content/10.1101/2023.05.31.543043v1

kmindex offers the possibility to index thousands of highly complex metagenomes into an index that answers sequences queries in the tenth of a second. Using kmindex, the resulting indexes can be registered into a single meta-index allowing users to easily query multiple indexes.

Ocean Read Atlas (ORA) allows query one or several sequences across all of the Tara Oceans metagenomic raw datasets. ORA enables the visualization of the results on a geographic map. ORA provides new perspectives on the deep exploitation of Tara oceans resources.





□ Mutate and Observe: Utilizing Deep Neural Networks to Investigate the Impact of Mutations on Translation Initiation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad338/7177993

Getting DNNs to describe the biological relevance of what was learned from the data and extracting novel biological knowledge from DNNs are two tasks that are not easy to accomplish. In this research effort.

The usefulness of in silico mutations, in combination with meticulous experimental routines, to achieve a certain degree of biological relevance and to obtain novel insights into translation.





□ Pygenomics: manipulating genomic intervals and data files in Python

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad346/7179791

Unlike general numeric intervals, the genomic intervals are associated with assembled genome sequences (chromosomes, scaffolds, or contigs) and the interval start and end positions are specified by non-negative integers bounded by sizes of the assembled genome sequences.

pygenomics, a Python package for working with genomic intervals and bioinfor- matic data files. The package implements interval operations, provides both API and CLI, and supports reading and writing data in widely used bioinformatic formats, including BAM, BED, GFF3 and VCF.





□ mutscan: a flexible R package for efficient end-to-end analysis of multiplexed assays of variant effect data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02967-0

mutscan, a novel R package that provides a unified, flexible interface to the analysis of MAVE experiments, covering the entire workflow from FASTQ files to count tables and statistical analysis and visualization.

mutscan is directly applicable also to other types of data aimed at identifying and tabulating substitution variants compared to a provided reference sequence, or tabulating unique sequences directly, potentially after collapsing variants within a certain distance.





□ IsoTools: a flexible workflow for long-read transcriptome sequencing analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad364/7189737

IsoTools integrates a graph-based method for identifying alternative splicing events and a statistical approach based on the beta-binomial distribution for detecting differential events. IsoTools uses a novel model based approach to estimating the required depth of sequencing.

The model is based on the Cumulative Distribution Function of a negative binomial distribution. To reconstruct the transcriptome from aligned reads, IsoTools groups reads w/ the same intron chain into transcripts/groups transcripts sharing at least one splice junction into genes.

IsoTools uses the reference positions. To determine the positions of transcription start sites (TSSs) and polyadenylation sites (PASs), IsoTools employs a gene-wise peak calling approach to identify the most prominent start and end positions of reads.





□ WAGS: User-friendly, rapid, containerized pipelines for processing, variant discovery, and annotation of short read whole genome sequencing data

>> https://academic.oup.com/g3journal/advance-article/doi/10.1093/g3journal/jkad117/7181376

WAGS is an open-source set of user-friendly, containerized pipelines designed to simplify the process of identifying germline short (SNP and indel) and structural variants geared toward the veterinary community but adaptable to any species with a suitable reference genome.

WAGS consists of three pipelines for (1) processing raw short-read FAST files into GVCFs: OneWAG, (2) joint genotyping and annotating variants: ManyWAGS, and (3) the identification of private variants in a single sample: OnlyWAG.





□ MR-Horse: A Bayesian approach to Mendelian randomization using summary statistics in the univariable and multivariable settings with correlated pleiotropy

>> https://www.biorxiv.org/content/10.1101/2023.05.30.542988v1

MR-Horse had comparable power and bias to CAUSE, with substantially lower type I error rates. It again had slightly higher coverage and lower type I error rates compared with MR-cML-DP, and outperformed all other methods across each metric.


MVMR-Horse outperformed IVW, MVMR-Median and GRAPPLE in terms of bias, precision and type I error rates in all scenarios. MVMR-Horse retained type I error rates below the nominal level in all scenarios, with the trade-off of lower power compared with MVMR-cML-DP.





□ Jupyter AI

>> https://jupyter-ai.readthedocs.io/en/latest/index.html

Jupyter AI provides a user-friendly and powerful way to explore generative AI models in notebooks and improve your productivity in JupyterLab and the Jupyter Notebook





□ MolXPT: Wrapping Molecules with Text for Generative Pre-training

>> https://arxiv.org/abs/2305.10688

MolXPT, a unified language model of text and molecules pre-trained on SMILES (a sequence representation of molecules) wrapped by text. MolXPT can be finetuned for various text and molecular downstream tasks, like molecular property prediction and molecule-text translation.

MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet, performs comparably to the best model in text-molecule translation while using less than half of its parameters, and enables zero-shot molecular generation without finetuning.





□ MTM: a multi-task learning framework to predict individualized tissue gene expression profiles

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad363/7190366

Multi-tissue Transcriptome Mapping (MTM), a deep learning-based multi-task learning framework to predict individualized tissue gene expression profiles using any available tissue from a specific person.

By jointly leveraging individualized cross-tissue information from multi-tissue reference samples through multi-task learning, MTM achieves superior sample-level and gene-level accuracy, and larger proportions of predictable genes than existing methods on unseen individuals.





□ SMEAR: Soft Merging of Experts with Adaptive Routing

>> https://arxiv.org/abs/2306.03745

SMEAR avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters. SMEAR provides an effective alternative for modular models that use adaptive routing among expert subnetworks.

All components of SMEAR are fully differentiable enables standard gradient-based training. Empirically, SMEAR significantly attains a favorable performance/cost tradeoff to discrete routing solutions found via gradient estimation.





□ Orthogonal Statistical Learning

>> https://arxiv.org/abs/1901.09036

A meta-algorithm that takes as input arbitrary estimation algorithms for the target/nuisance parameter. If the population risk satisfies a Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order.

The theorem is agnostic to the particular algorithms used for the target/nuisance and only makes an assumption on their individual performance. It enables the use of a plethora of existing results from machine learning to give new guarantees for learning w/ a nuisance component.

This method can accommodate settings in which the target parameter belongs to a complex nonparametric class. It provide conditions on the metric entropy of the nuisance and target classes such that oracle rates of the same order as if we knew the nuisance parameter are achieved.





□ Biotite: new tools for a versatile Python bioinformatics library

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05345-6

flexibility can be harnessed to tackle a wide range of problems, without the need to write ‘glue’ code for communication between different programs. For most tasks the implementation in Biotite performs similar or is even faster than dedicated software.

Biotite is able to create sequence profiles from multiple sequence alignments consisting of nucleotide, protein or custom sequences. The usefulness of profiles lies in their better representation of information than a consensus sequence or a multiple sequence alignment.





□ DesiRNA: structure-based design of RNA sequences with a Monte Carlo approach

>> https://www.biorxiv.org/content/10.1101/2023.06.04.543636v1

DesiRNA, a versatile Python-based software tool for RNA sequence design. This program considers a comprehensive array of constraints, ranging from secondary structures (including pseudoknots) and GC content, to the distribution of dinucleotides emulating natural RNAs.

Additionally, it factors in the presence or absence of specific sequence motifs and prevents or promotes oligomerization, thereby ensuring a robust and flexible design process.

DesiRNA utilizes the Monte Carlo algorithm for the selection and acceptance of mutation sites. In tests on the EteRNA benchmark, DesiRNA displayed high accuracy and computational efficiency, outperforming most existing RNA design programs.





□ DiffSegR: An RNA-Seq data driven method for differential expression analysis using changepoint detection

>> https://www.biorxiv.org/content/10.1101/2023.06.05.543691v1

DiffSegR, an R package that uses a new strategy for delineating the boundaries of DERs. It segments the per-base log2 fold change using FPOP, a method designed to identify changepoints in the mean of a Gaussian signal.





□ The NanoFlow Repository

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad368/7191772

The NanoFlow Repository to provide the first implementation of the MIFlowCyt-EV framework. It enables sharing of EP-FC data and standards-compliant metadata about experimental design, samples, instrument configuration, and analysis parameters.



على متن مركبة "دراجون”

2023-05-21 20:08:08 | Science News








Orpheus.

2023-05-15 05:15:05 | Science News
(Art by ekaitsa)





□ ORFeus: A Computational Method to Detect Programmed Ribosomal Frameshifts and Other Non-Canonical Translation Events

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538127v1

ORFeus uses a hidden Markov model to infer translation patterns from ribo-seq data that is inherently noisy and sparse. The model identifies changes in reading frame and additional upstream or downstream reading frames, making it suitable for detection of many alternative translation events.

ORFeus can identify novel or extended ORFs (including uORFs and dORFs) with either canonical or non-canonical start codons, as well as programmed ribosomal frameshifts and stop codon readthrough events. For each transcript, ORFeus returns the most probable state path.





□ scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

>> https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1

scGPT, a single-cell foundation model by GPT on over 10 million cells. scGPT uses an in-memory data structure to store hundreds of datasets that allow fast access. The learned gene embedding maps decode known pathways by grouping together genes that are functionally relevant.

With zero-shot learning, the pre-trained model is able to reveal meaningful cell clusters on unseen datasets. With finetuning in a few-shot learning setting, the model achieves state-of-the-art performance on a wide range of downstream tasks.

scGPT employes the generative self-supervised objective to iteratively predict GE values of unknown tokens from known tokens in an auto-regressive manner. scGPT's embedding architecture can easily extend to multiple sequencing modalities, batches, and perturbation states.





□ REVNANO: Reverse Engineering DNA Origami Nanostructure Designs from Raw Scaffold and Staple Sequence Lists

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539261v1

REVNANO, a constraint programming solver that recovers the (approximate) staple-scaffold contact map from origami sequences. REVNANO uses graph layout techniques to convert the topological contact map into an approximate geometric origami schematic.

REVNANO leverages the unique physical features of origami nanostructures as heuristics. DNA, RNA or hybrid scaffolded origami are all supported. The quality of the REVNANO solution is quantified by taking the base hamming distance between the ground truth contact map.





□ UnitedNet: Explainable multi-task learning for multi-modality biological data analysis

>> https://www.nature.com/articles/s41467-023-37477-x

UnitedNet has an encoder-decoder-discriminator structure and is trained by joint group identification / cross-modal prediction. Its structure does not presume that the data distributions are known - instead implicitly approximates the statistical characteristics of each modality.

UnitedNet uses SHapley Additive exPlanations algorithm and indicates the relevance relationship between gene expression and DNA accessibility with cell-type specificity. UnitedNet fuses these codes into shared latent codes using an adaptive weighting scheme.





□ AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431517v2

AirLift, a methodology and tool for quickly, comprehensively, and accurately remapping a read data set that had previously been mapped to an older reference genome to a newer reference genome.

AirLift provides BAM-to-BAM remapping results on which downstream analysis can be immediately performed. AirLift Index exploits the similarity b/n two references to quickly identify candidate locations that a read should be remapped to based on its original mapping.





□ DELVE: Feature selection for preserving biological trajectories in single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.05.09.540043v1

DELVE (dynamic selection of locally covarying features), an unsupervised feature selection method for identifying a representative subset of dynamically-expressed molecular features that recapitulates cellular trajectories.

DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference, and instead models cell states from dynamic feature modules that constitute core regulatory complexes.





□ Designing molecular RNA switches with Restricted Boltzmann machines

>> https://www.biorxiv.org/content/10.1101/2023.05.10.540155v1

Restricted Boltzmann machines (RBM), a simple two-layer machine learning model, capture intricate sequence dependencies induced by secondary and tertiary structure, as well as the switching mechanism, resulting in a model that can be used for the design of allosteric RNA.

The hidden units of the RBM must extract features shared by the data sequences and thus likely to be important for their biological function. Conservation of probability mass implies that regions of sequence space not populated by data sequences must be penalized.

The RBM is able to model complex interactions. After marginalizing over the hidden units configurations, effective interactions arise between the visible units. RBM can represent schematically a three-body interaction, arising from the three connections of the summed hidden unit.





□ metapaths: similarity search in heterogeneous knowledge graphs via meta paths

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad297/7152274

Once informative meta paths for a given KG have been defined, these meta paths define the semantics of the relationships between nodes in the KG, thereby enabling heterogeneous graph convolutional and graph attention networks for downstream machine learning analyses.

The primitives of the metapaths package identify the neighbors of a specified node with a given type by querying either an edge t or, for efficiency, an adjacency list precomputed from the edge list.

The meta path traversal function accepts an origin node, a destination node, and a specified meta path; then, via the neighbor identification functions, it starts at the origin node and recursively expounds the sequence of node types until the destination node is reached.






□ EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02941-w

Random transformation of DNA sequences can potentially alter their function in unknown ways. EvoAug pretrains sequence-based deep learning models for regulatory genomics data w/ evolution-inspired augmentations followed by a finetuning on the original, unperturbed sequence data.

EvoAug data augmentations introduce a modeling bias to learn invariances of the (un)natural symmetries generated by the augmentations.

Random insertions and deletions assume that the distance between motifs is not critical, whereas random inversions and translocations promote invariances to motif strand orientation and the order of motifs.





□ ProteinSGM: Score-based generative modeling for de novo protein design

>> https://www.nature.com/articles/s43588-023-00440-3

ProteinSGM, a continuous-time score-based generative model that generates high-quality de novo proteins. ProteinSGM learns to generate four matrices that fully describes a protein's backbone, which are used as smoothed harmonic constraints in the Rosetta minimization protocol.

ProteinSGM generates variable-length structures with a mean < -3.9 REU per residue, indicative of native-like structures. It provides an alternative approach that uses MinMover for backbone minimization, and ProteinMPNN and OmegaFold for sequence design and structure prediction.





□ CEBRA: Learnable latent embeddings for joint behavioural and neural analysis

>> https://www.nature.com/articles/s41586-023-06031-6

CEBRA is a nonlinear dimensionality reduction method newly developed to explicitly leverage auxiliary (behaviour) labels and/or time to discover latent features in time series data—in this case, latent neural embeddings.

CEBRA can be used for supervised and self-supervised analysis, thereby directly facilitating hypothesis- and discovery-driven science. It produces both consistent embeddings across subjects and can find the dimensionality of neural spaces that are topologically robust.





□ The categorical basis of dynamical entropy

>> https://arxiv.org/abs/2301.09205

The focus of topological Dynamical systems theory is to derive properties of the system. The objects that are usually in consideration are invariant behavior such as attractors, invariant sets and omega-limit sets, and asymptotic properties such as invariant measures and entropy.

A category-theoretic view of topological dynamical entropy, which reveals that the common limit is a consequence of the structural assumptions on these notions. One of the key tools developed is that of a qualifying pair of functors, which ensure a limit preserving property.

The diameter and Lebesgue number of open covers of a compact space, form a qualifying pair of functors. The various notions of complexity are expressed as functors, and natural transformations between these functors lead to their joint convergence to the common limit.





□ A draft human pangenome reference

>> https://www.nature.com/articles/s41586-023-05896-x

Flagger detects different types of misassemblies within a phased diploid assembly. The pipeline works by mapping the HiFi reads to the combined maternal and paternal assembly in a haplotype-aware manner.

Flagger identifies coverage inconsistencies within these read mappings. Coverage is calculated across the genome and a mixture model is fit to account for reliably assembled haploid sequence and various classes of unreliably assembled sequence.





□ Squigulator: simulation of nanopore sequencing signal data with tunable noise parameters

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539953v1

Squigulator generates simulated nanopore signal data based on an input reference genome or transcriptome sequence, or directly from a set of basecalled reads.

Squigulator uses an idealised 'pore model' that specifies the predicted current signal reading associated with every possible DNA or RNA k-mer, as appropriate to the specific nanopore protocol being emulated.

Squigulator generates sequential signal values corresponding to sequential k-mers in the provided reference sequence. squigulator transforms the data using Gaussian noise functions in both the time and amplitude domains to produce realistic, rather than ideal, signal reads.





□ Ariadne: Synthetic Long Read Deconvolution Using Assembly Graphs

>> https://www.biorxiv.org/content/10.1101/2021.05.09.443255v3

Ariadne, a novel assembly graph-based SLR deconvolution algorithm, that can be used to extract single-species read-clouds from SLR datasets to improve the taxonomic classification and de novo assembly of complex populations, such as metagenomes.

Ariadne leverages the linkage information encoded in the full de Bruin-based assembly graph generated by a de novo assembly tool such as cloudSPAdes to generate up to 37.5-fold more read clouds containing only reads from a single fragment.





□ Merizo: a rapid and accurate domain segmentation method using invariant point attention

>> https://www.biorxiv.org/content/10.1101/2023.02.19.529114v2

Network inputs to the IPA encoder are the single and pairwise representations and backbone frames in the style of AlphaFold2. The IPA encoder comprises six weight-shared blocks, each containing a single IPA block with RoPE positional encoding, and a bi-GRU transition block.

In the Masked transformer decoder, learnable domain mask embeddings dare concatenated to the single representation and passed through a 10-layer MHA stack with ALiBi positional encoding.

The predicted domain mask tensor is split according to the predicted domain and is passed through a two-layer biGRU, followed by projection into one dimension to produce a single ploU value for each domain. ndom represents the number of predicted domains.





□ Evolutionary graph theory on rugged fitness landscapes

>> https://www.biorxiv.org/content/10.1101/2023.05.04.539435v1

A unifying theory of how heterogenous structure shapes evolutionary dynamics. Even a simple extension to a two-mutational landscape can exhibit evolutionary dynamics not observed in deme-based models and that cannot be predicted using single-mutation results.

This model can be applied to understand the evolutionary trajectory of cellular systems with complex architectures. Heterogenous structure can affect fitness landscape crossing by allowing intermediate mutants to persist for longer, until the final beneficial mutation occurs.





□ The Compositional Structure of Bayesian Inference

>> https://arxiv.org/abs/2305.06112

A compositional Bayesian inversion of Markov kernels in isolation, using a suitable axiomatisation of a category of Markov kernels. It builds categories whose morphisms are pairs of a Markov kernel and an associated 'Bayesian inverter', which is itself built compositionally.

Symmetric monoidal categories with compatible families of copy and delete morphisms have been identified as an expressive language for synthetically representing concepts from probability theory.

A categorical translation of Bayes allows for a general definition of a Bayesian inverse to a morphism in a Markov category. The category of Bayesian lenses is constructed as a fired category that is closely related to the families fibration, in the semantics of dependent types.





□ CoCoNat: a novel method based on deep-learning for coiled-coil prediction

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539816v1

CoCoNat encodes sequences with the combination of two state-of-the- art protein language models and implements a three-step deep learning procedure concatenated with a Grammatical-Restrained Hidden Conditional Random Field (GRHCRF) for CCD identification and refinement.

CoCoNat makes use of residue embeddings obtained with large-scale protein Language Models (pLMs) to represent proteins in training and testing sets. CoCoNat adopts a 15 residue long sliding window, takes as input, where each residue is represented with a 2304-feature vector.





□ snATAK: Assessing the multimodal tradeoff

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471788v2

snATAK incorporates kallisto and other tools in a workflow that facilitates the preprocessing of snATAC-seq data from numerous technologies in minimal computing environments. snATAK can be used for allele-specific analysis of multimodal data, even in the absence of genotype data.

snATACK consists of first mapping reads to a reference genome using Minimap2. snATAK identifies putative open chromatin regions with Genrich. A kallisto pseudoalignment index is made and reads are remapped using kalisto. The snATAK output is compatible with the Signac and ArchR.





□ GenPhys: From Physical Processes to Generative Models

>> https://arxiv.org/abs/2304.02637

GenPhys (Generative Models from Physical Processes), a frame-work that can convert physical Partial differential equations (PDEs) to generative models. Diffusion models and Poisson flow generative models leverage the diffusion equation and the Poisson equation.

There exists non s-generative model which can also provide useful generative modeling, such as the case in quantum machine learning with dynamics based on the Schrödinger equation and quantum circuits.





□ Learning Decision Trees with Gradient Descent

>> https://arxiv.org/abs/2305.03515

Gradient-based decision trees (GDTs), a novel approach for learning hard, axis-aligned Decision Trees (DTs) with gradient descent. The proposed method uses backpropagation with a straight-through operator on a dense DT representation to jointly optimize all tree parameters.

GDTs are less prone to overfitting. GDT optimizes the gradient descent algorithm by exploiting common stochastic gradient descent techniques, including mini-batch calculation and momentum using the Adam optimizer with weight averaging.





□ LatentDiff: A Latent Diffusion Model for Protein Structure Generation

>> https://arxiv.org/abs/2305.04120

Latent Diff generates a novel protein backbone structure. They first sample multivariate Gaussian noise and use the learned latent diffusion model to generate 3D positions and node embeddings in the latent space.

Latent Diff uses a pre-trained equivariant 3D autoencoder to transform protein backbones into a more compact latent space, and models the latent distribution with an equivariant latent diffusion model.





□ Sequence UNET: High-throughput deep learning variant effect prediction

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02948-3

Sequence UNET is trained to directly predict variant frequency or to classify low frequency variants, as a proxy for deleteriousness, and then fine-tuned for pathogenicity prediction.

Sequence UNET uses a fully convolutional architecture. Convolutional kernels also naturally integrate information from nearby amino acids. The model outputs a matrix of per position features and can therefore be trained to predict various positional properties.





□ aaHash: recursive amino acid sequence hashing

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539909v1

aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10X faster than generic string hashing algorithms in hashing adjacent k-mers.

aaHlash builds on ntHash, a rolling hash algorithm for DNA/RNA sequences, and adapts it for amino acid sequences. aaHash also supports using different levels of hashes together to create a multi-level pattern, mimicking the functionality of spaced seeds.





□ BGWAS: Bayesian variable selection in linear mixed models with nonlocal priors for genome-wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05316-x

BGWAS uses a novel nonlocal prior for linear mixed models (LMMs). The screening step fits as many LMMs as the number of SNPs using a mixture of a Dirac delta at zero and a nonlocal prior, and estimates the probability of the Dirac delta component.

BGWAS uses a pMOM nonlocal prior for LMMs that uses the full Fisher information matrix. BGWAS either uses complete enumeration or searches the model space with a genetic algorithm.





□ AIONER: All-in-one scheme-based biomedical named entity recognition using deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad310/7160912

AIONER, a new NER tagger that takes full advantage of various existing datasets for recognizing multiple entities simultaneously, despite their inherent differences in scope and quality, through a novel all-in-one (AIO) scheme.

The AIO scheme utilizes a small dataset recently annotated with multiple Entity types as a bridge to integrate multiple datasets annotated with a subset of entity types, thereby recognizing multiple entities at once, resulting in improved accuracy and robustness.





□ NanoPack2: Population scale evaluation of long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad311/7160911

The cramino, chopper, kyber, and phasius tools are written in Rust and available as executable binaries without requiring installation or managing dependencies. Binaries build on musl are available for broad compatibility.

Phasius is developed to visualize the results of read phasing, which shows in a dynamic genome browser style the length and interruptions between contiguously phased blocks from a large number of individuals together with genome annotation, for example, segmental duplications.





□ copMEM2: Robust and scalable maximum exact match finding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad313/7160910

copMEM2, a multi-threaded MEM finding tool, targeting the execution speed and reducing the memory, as well as incorporating an improvement to speed up its processing by orders of magnitude when the pair of genomes is highly similar.

copMEM2 allows to compute all MEMs of minimum length 50 between the human and mouse genomes in 59s, using 10.40 GB of RAM and 12 threads, being at least a few times faster than its main contenders. On a pair of human genomes, hg18 and hg19, the results are 324s and 16.57 GB.





□ Integration of a multi-omics stem cell differentiation dataset using a dynamical model

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010744

A hierarchical dynamical model that allowed us to integrate all data sets. This model was able to explain mRNA-protein discordance for most genes and identified instances of potential microRNA-mediated regulation.

Overexpression or depletion of microRNAs identified by the model, followed by RNA sequencing and protein quantification, were used to follow up on the predictions of the model.





□ Improving variant calling using population data and deep learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05294-0

A population-aware DeepVariant models with a new channel encoding allele frequencies. This model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic clinvar calls cohort-wide.

The relative advantage of the population-aware models increase at lower coverage, suggesting that population information is most valuable in difficult examples, where read-level information alone may not be sufficient for confident calling.





□ DeSide: A unified deep learning approach for cellular decomposition of bulk tumors based on limited scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540466v1

The DeSide architecture considers only non-cancerous cells during the training process, indirectly calculating the proportion of cancerous cells.

DeSide avoids directly handling the often more variable heterogeneity of cancerous cells, and instead leverages scRNA-seq data from three different cancer types to empower the DNN model with a robust generalization capability across diverse cancers.





□ A Superior Thumb Drive: Optimizing DNA Stability for DNA Data Storage

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540302v1

While methods to achieve DNA stability for hundreds or even millennia are possible, they call for completely enclosing DNA inside a silica matrix.

For instance, for an Archival Storage system whose DNA is enclosed in silica, the probability of strand loss or breakage is much lower, thereby enabling the use of longer DNA strands and higher information densities.

Conversely, for Working or Short-Term Storage systems, shorter strand lengths and lower information density requirements would be more appropriate due to the higher likelihood of strand loss.




Tranquility.

2023-05-15 05:13:05 | Science News

(Art by ekaitza)




□ scSpace: Reconstruction of the cell pseudo-space from single-cell RNA sequencing data

>> https://www.nature.com/articles/s41467-023-38121-4

scSpace (single-cell spatial position associated co-embeddings), an integrative method that uses ST data as a spatial reference to reconstruct the pseudo-space. A space-informed clustering is conducted to identify spatially variable cell subpopulations within the scRNA-seq data.

scSpace uses a transfer component analysis (TCA), it enables eliminating the batch effect between single-cell and ST data and extracting the shared latent feature. TCA projects the scRNA-seq and spatial transcriptomics data into a Reproducing Kernel Hilbert Space.





□ DEGAP: Dynamic Elongation of a Genome Assembly Path

>> https://www.biorxiv.org/content/10.1101/2023.04.25.538224v1

DEGAP (Dynamic Elongation of a Genome Assembly Path), a novel gap-filling software that can resolve gap regions in genomes. DEGAP optimizes HiFi reads by identifying the differences b/n reads and provides ‘GapFiller’ or ‘CtgLinker’ modes to eliminate or shorten gaps in genomes.

DEGAP elongates all contigs with supplied HiFi data, assesses the potentially neighbored contigs. DEGAP adopts a cyclic elongation strategy that automatically and dynamically adjusts parameters according to the complexity of the sequences and selects the optimal extension path.





□ scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.05.01.538975v1

scDisInFact (single cell disentangled Integration preserving condition-specific Factors) learns latent factors that disentangle condition effects from batch effects, enabling it to simultaneously perform: batch effect removal, CKG detection, and perturbation prediction.

The disentangled latent space allows scDisInFact to perform the CKG detection and perturbation prediction, and to overcome the limitation of existing methods for each task. scDisInFact can remove batch effect while keeping the condition effect in gene expression data.





□ scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics

>> https://www.nature.com/articles/s41587-023-01772-1

The scDesign3 model is flexible to incorporate cell covariates (such as cell type, pseudotime, and spatial coordinates) via the use of generalized additive models, making the scDesign3 model fit well to various single-cell and spatial omics data a property confirmed by scDesign3's realistic simulation.

scDesign3 has a model alteration functionality enabled by its transparent probabilistic modeling: given the scDesign3 model parameters estimated on real data, users can alter the model parameters to reflect a hypothesis and generate the corresponding synthetic data that bear real data characteristics.





□ CellTypist v2.0: Automatic cell type harmonization and integration across Human Cell Atlas datasets

>> https://www.biorxiv.org/content/10.1101/2023.05.01.538994v1

CellTypist v2.0 accurately guantifies cell-cell transcriptomic similarities and enables robust and efficient cross-dataset meta-analyses. Cell types are placed into a relationship graph that hierarchically defines shared and novel cell subtypes.

CellTypist uses PCT, a multi-target regression tree algorithm. CellTypist defines semantic relationships among cell types / captures their underlying hierarchies, which are further leveraged to guide the downstream data integration at different levels of annotation granularities.





□ GATE: Moving Fast With Broken Data

>> https://arxiv.org/pdf/2303.06094.pdf

GATE, the Partition Summarization (PS) approach to data validation. The method creates a vector of statistics for each time step and performs a k-nearest neighbor algorithm against historical vectors to label the current time step's vector as anomalous or acceptable.

GATE significantly outperforms other methods in terms of mitigating false positives when ML pipelines have many correlated features because of GATE's clustering component, which only triggers an alert when an entire group of correlated features is anomalous.





□ ATOMRefine: Atomic protein structure refinement using all-atom graph representations and SE(3)-equivariant graph transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad298/7152976

ATOMRefine, a deep learning-based, end-to-end, all-atom protein structural model refinement method. It uses a SE(3)-equivariant graph transformer network to directly refine protein atomic coordinates in a predicted tertiary structure represented as a molecular graph.

ATOMRefine enables the network to leverage sequence-based and spatial information from the entire protein structures to update node and edge features and catch the global and local structural variation from the initial model to the native structure iteratively.





□ Restrander: rapid orientation and QC of long-read cDNA data

>> https://www.biorxiv.org/content/10.1101/2023.05.02.539165v1

Restrander was faster than Oxford Nanopore Technologies’ existing tool Pychopper, and correctly restranded more reads due to its strategy of searching for polyA/T tails in addition to primer sequences from the reverse transcription and template-switch steps.

Each read from the reverse strand is replaced with reverse-complement, ensuring all reads in the output have the same orientation as the original transcripts. Restrander classifies artefactual reads for QC and ensure only high-quality reads are taken for downstream processing.





□ ROptimus: a parallel general-purpose adaptive optimisation engine

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad292/7152277

ROptimus, a general-purpose optimisation engine in R that can be plugged to any, simple or complex, modelling initiative through a few lucid interfacing functions, to perform a seamless optimisation with rigorous parameter sampling.

ROptimus features simulated annealing and replica exchange implementations equipped with adaptive thermoregulation to drive Monte Carlo optimisation process in a flexible manner, through constrained acceptance frequency but unconstrained adaptive pseudo temperature regimens.





□ Unifilar Machines and the Adjoint Structure of Bayesian Models

>> https://arxiv.org/abs/2305.02826

There is an adjunction between ‘dynamical’ and ‘epistemic’ models of a hidden Markov process. Concepts such as Bayesian filtering and conjugate priors arise as natural consequences of this adjunction.

Strongly representable Markov categories include BorelStoch (whose objects are standard Borel spaces and whose morphisms are Markov kernels) and the Kleisli category of the (real-valued) distribution monad, which is called Dist.

Unifilar machines outputs are stochastic but whose state updates are deterministic. Its state space consists of probability distributions over the hidden states of the system, and its dynamics are given by Bayesian updating.




□ StarCoder: A State-of-the-Art LLM for Code

>> https://huggingface.co/blog/starcoder

15B LLM with 8k context
Trained on permissively-licensed code
Acts as tech assistant
80+ programming languages
Open source and data
Online demos
VSCode plugin
1 trillion tokens





□ A Bayesian Noisy Logic Model for Inference of Transcription Factor Activity from Single Cell and Bulk Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539308v1

NLBayes: A noisy Boolean logic Bayesian model for TF activity inference from differential gene expression data and causal graphs. This approach provides a flexible framework to incorporate biologically motivated TF-gene regulation logic models.

NLBayes incorporates the prior information on causal regulatory interactions and makes posterior adjustments to further account for noise and determine the context-specific posterior network structure and active regulators through a Gibbs sampling procedure.





□ Dawnn: single-cell differential abundance with neural networks

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539427v1

Dawnn uses a deep neural network model that has been trained to estimate the relative abundance of cells from each sample or condition in a cell’s neighbourhood. Dawnn predicts the probability w/ which each cell was drawn from a given sample or condition using simulated datasets.

Dawn controls the false discovery rate (FDR), the proportion of cells incorrectly cssified as belonging to regions exhibiting DA, using the Benjamini-Yekutieli procedure, a variant of the Benjamini-Hochberg procedure that does not assume independence between hypotheses.





□ Ribotin: rDNA consensus sequence builder

>> https://github.com/maickrau/ribotin

Ribotin inputs hifi or duplex, and optionally ultralong ONT. Extracts rDNA-specific reads based on k-mer matches to a reference rDNA sequence or based on a verkko assembly

Ribotin builds a DBG out of them, extracts the most covered path as a consensus and bubbles as variants. Optionally assembles highly abundant rDNA morphs using the ultralong ONT reads.





□ Aggregating network inferences: towards useful networks

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539529v1

They suggest to combine edge frequencies directly to reconstruct the network. This approach ensures that only robust and reproducible edges are included in the consensus network.

The first consensus step relies on selecting edges w/ high inclusion frequency in the networks reconstructed from resampled data. The 2nd aggregation step is the inference of a consensus network considering each method advantages and counter balancing each estimation's default.





□ Foldseek: Fast and accurate protein structure search

>> https://www.nature.com/articles/s41587-023-01773-0

Foldseek discretizes the query structures into sequences over the 3Di alphabet and uses a pre-trained 3Di substitution matrix to search through the 3Di sequences of the target structures using the double-diagonal k-mer-based prefilter and gapless alignment prefilter modules.

Foldseek uses vectorized Smith–Waterman local alignment combining 3Di and amino acid substitution scores. Alternatively, a global alignment is computed with a 1.7-times accelerated TM-align.





□ ProteinGenerator: Joint Generation of Protein Sequence and Structure with RoseTTAFold Sequence Space Diffusion

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539766v1

Beginning from random amino acid sequences, ProteinGenerator generates sequence and structure pairs by iterative denoising, guided by any desired sequence and structural protein attributes.

ProteinGenerator readily generates sequence-structure pairs satisfying the input conditioning criteria, and experimental validation showed that the designs were monomeric by size exclusion chromatography, had the desired secondary structure content by circular dichroism.





□ Improving de novo protein binder design with deep learning

>> https://www.nature.com/articles/s41467-023-38328-5

The physically based Rosetta approach frames both the folding and binding problems in energetic terms; for the approach to succeed, the designed sequence must have as its lowest energy state in isolation the designed monomer structure.

ProteinMPNN, a novel deep learning-augmented de novo protein binder design protocol. It shows retrospectively and prospectively that this improved protocol has nearly 10-fold higher success rate than the original energy-based method.





□ HMMerge: an ensemble method for multiple sequence alignment

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad052/7126611

HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble.

HMMerge builds a new ‘merged’ HMM from the ensemble, and then using that merged HMM to align the query sequences. We show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments.





□ Correcting gradient-based interpretations of deep neural networks for genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02956-3

Even though DNNs can learn a function everywhere in Euclidean space, one-hot encoded DNA is a categorical variable that lives on a lower-dimensional simplex.

Random off-simplex function behavior can introduce a random gradient component orthogonal to the simplex, which manifest as spurious noise in the input gradients

This proposed gradient correction—subtracting the original gradient components by the mean gradients across components for each position—is general for all data with categorical inputs, including DNA, RNA, and protein sequences.





□ GKLOMLI: a link prediction model for inferring miRNA–lncRNA interactions by using Gaussian kernel-based method on network profile and linear optimization algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05309-w

GKLOMLI, a novel link prediction model based on Gaussian kernel-based method and linear optimization algorithm for inferring miRNA–lncRNA interactions. The Gaussian kernel-based method was employed to output two similarity matrixes of miRNAs and lncRNAs.

Based on the integrated matrix combined with similarity matrixes and the observed interaction network, a linear optimization-based link prediction model was trained for inferring miRNA–lncRNA interactions.





□ Estimating the mean in the space of ranked phylogenetic trees

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539790v1

A simulation study to validate our method and compare it to other tree summary approaches such as the Maximum Clade Credibility (MCC) method. They assess suitability of a treespace for statistical analyses, e.g. its "smoothness" w/ respect to probability distributions over trees.

The RNNI space is a treespace of ranked phylogenetic trees, which are rooted binary trees where internal nodes are ordered according to times of the corre-ponding evolutionary events, assuming no co-occurrence.

The RNNI space is then defined as a graph where vertices are ranked trees and edges are representing either a rank or an NNI move that transforms one tree into another.

The CENTROID algorithm minimizes the sum of squared (SoS) distances b/n a summary tree and a given tree sample and stops when it finds a locally optimal tree, approximating a centroid tree. The algorithm proceeds iteratively by computing the SoS values for all neighbors.





□ Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05304-1

A Negative Binomial NMF with a patient specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation.

A novel model selection procedure inspired by cross-validation to determine the number of signatures. It uses the Kullback–Leibler divergence which would favor the Poisson model. This means that a direct comparison b/n the cost values for Po-NMF / NBN-NMF is not feasible.





□ STAGEs: A web-based tool that integrates data visualization and pathway enrichment analysis for gene expression studies

>> https://www.nature.com/articles/s41598-023-34163-2

STAGEs (Static and Temporal Analysis of Gene Expression studies) is a web-based and high-throughput analysis pipeline with an intuitive user interface that allows systematic characterisation of static and temporal transcriptomic data.

STAGEs converts the ratio values to log2-transformed fold change values at backend, and the correlation matrix is generated by performing pairwise correlations of the log2-transformed fold changes between the different experimental conditions.





□ Insights from a genome-wide truth set of tandem repeat variation

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539588v1

By identifying the subset of insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs, we obtained accurate genotypes for 139,795 pure and 6,845 interrupted repeats in a single diploid sample.

This approach did not require running existing genotyping tools on short read or long read sequencing data and provided an alternative, more accurate view of tandem repeat variation.

The Synthetic Diploid (SynDip) Benchmark provides genotypes for 5, 182,765 SNV, insertion and deletion variants, as well as a set of high-confidence regions spanning 2.71 gigabases where genotypes are highly accurate.





□ Butt-seq: a new method for facile profiling of transcription

>> https://genesdev.cshlp.org/content/early/2023/05/10/gad.350434.123.abstract

Butt-seq (bulk analysis of nascent transcript termini sequencing), which can produce libraries from purified nascent RNA in 6 h and from as few as 10,000 cells—an improvement of at least 10-fold over existing techniques.

Butt-seq shows that inhibition of the superelongation complex (SEC) causes promoter-proximal pausing to move upstream in a fashion correlated with subnucleosomal fragments.





□ NGBO: Introducing -omics metadata to biobanking ontology

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539725v1

NGBO is based on available genomics standards (e.g., Minimum information about a microarray experiment (MIAME)), the College of American Pathologists (CAP) laboratory accreditation requirements, and the Open Biological and Biomedical Ontologies Foundry principles.

NGBO fills the need for semantically enabling the discovery and integration of omics datasets and realization of FAIR data representation, which will impact the efficiency of finding, integrating, and re-using biobanking data of interest.





□ Robust discovery of causal gene networks via measurement error estimation and correction

>> https://www.biorxiv.org/content/10.1101/2023.05.09.540002v1

A new framework for causal discovery that is robust against measurement noise by extending an established statistical approach CIT (Causal Inference Test).

RCD (Robust Causal Discovery) estimates measurement error from gene expression data and then incorporate it to get consistent parameter estimates that could be used with appropriately extended statistical tests of correlation or mediation done in the original CIT.





□ Simple Tidy GeneCoEx: A gene co-expression analysis workflow powered by tidyverse and graph-based clustering in R

>> https://acsess.onlinelibrary.wiley.com/doi/10.1002/tpg2.20323

Simple Tidy GeneCoEx detects co-expression modules enriched in specific cell types, which were used to discover candidate genes in a biosynthetic pathway for complex plant natural products.

Simple Tidy GeneCoEx detects modules that are, on average, equivalently tight or tighter than those detected by WGCNA. A potential reason underlying the differences in module tightness might be due to the module detection methods.

By default, WGCNA uses hierarchical clustering followed by tree cutting to detect modules. Simple Tidy GeneCoEx uses the Leiden algorithm to detect modules, which returns modules that are highly interconnected.





□ Fulgor: A fast and compact k-mer index for large-scale matching and color queries

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539895v1

Fulgor is a colored compacted de Bruijn graph index for large-scale matching and color queries, powered by SSHash. Fulgor has a generic intersection algorithm that can work over any compressed color sets, provided that an iterator over each color supports two primitives - Next and NextGEQ(x).

Themisto, an index for alignment-free matching that substantially outperforms these prior methods in the context of indexing and mapping against large collections of genomes. Compared to Bifrost, Themisto uses practically the same space, but is faster to build and query.

Compared to the fastest variant of Metagraph, Themisto offers similar query performance, but is much more space-efficient; on the other hand, Themisto is much faster to query than Metagraph-BRWT, the most-space efficient variant of Metagraph.





□ RaPID-Query for Fast Identity by Descent Search and Genealogical Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad312/7160137

A new method, random projection-based identical-by-descent (IBD) detection (RaPID) query, is introduced to make fast genealogical search possible. RaPID-Query identifies IBD segments between a query haplotype and a panel of haplotypes.

By integrating matches over multiple PBWT indexes, RaPID- Query manages to locate IBD segments quickly with a given cutoff length while allowing mismatched sites.





□ CARMA is a new Bayesian model for fine-mapping in genome-wide association meta-analyses

>> https://www.nature.com/articles/s41588-023-01392-0

CARMA, a Bayesian model for fine-mapping that includes flexible specification of the prior distribution of effect sizes, joint modeling of summary statistics and functional annotations and accounting for discrepancies b/n summary statistics and external linkage disequilibrium in meta-analyses.

CARMA has higher power and lower false discovery rate (FDR) when including functional annotations, and higher power, lower FDR and higher coverage for credible sets in meta-analyses.





□ DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540424v1

DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method which directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space.

DeCOIL can be used to generate a designed library for screening based on computational predictors (ZS scores or ML models) at many possible points along the route to engineering a protein. DeCOIL enables protein engineering using ftMLDE with comparable outcomes.





□ moscot: Mapping cells through time and space

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540374v1

moscot supports multimodal data throughout the framework by exploiting joint cellular representations. moscot improves scalability by adapting and demonstrating the applicability of recent methodological innovations to atlas-scale datasets.

moscot unifies previous single-cell applications of OT in the temporal and spatial domain and introduces a novel spatiotemporal application. All of this is achieved with a robust and intuitive API that interacts with the broader scverse ecosystem.







Equanimity.

2023-05-15 05:10:05 | Science News
(Art by ekaitza)






Mark

>> https://www.vastspace.com/roadmap

Very exciting timeline from Haven-1 in 2025 on F9 to 2030 Starship class space station/modules to 100m spinning station in the 2040’s.

Excellent plan and realistic timeline.




□ NextPolish2: a repeat-aware polishing tool for genomes assembled using HiFi long reads

>> https://www.biorxiv.org/content/10.1101/2023.04.26.538352v1

NextPolish2 can fix base errors in “highly accurate” draft assemblies without introducing overcorrections, even in regions with highly repetitive elements. Through the built-in phasing module, it can not only correct the error bases, but also maintain the original haplotype consistency.

NextPolish2 follows the Kmer Score Chain (KSC) algorithm of its previous version to perform an initial rough correction, and detect low-quality positions (LQPs) where the chosen alleles account for ≤ 0.95 of the total during a traceback procedure.

NextPolish2 repeats the above procedure until all conflict communities are resolved (the number of iterations can be adjusted according to user settings) and then use the KSC algorithm to generate a draft consensus sequence.





□ CODEC: Single duplex DNA sequencing with CODEC detects mutations with high sensitivity

>> https://www.nature.com/articles/s41588-023-01376-0

CODEC (Concatenating Original Duplex for Error Correction), a hybrid method that combines the massively parallel nature of NGS and the resolution of single-molecule sequencing by reading both strands of each DNA duplex with single NGS read pairs.

The CODEC structure can be built by replacing a typical adapter duplex with the CODEC adapter quadruplex, containing all elements required for NGS.

CODEC to physically concatenate the Watson strand with the reverse complement of the Crick strand into a single strand without forming a prohibitive hairpin or inverted repeat structure from two complementary sequences.





□ TRASH: Tandem Repeat Annotation and Structural Hierarchy

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad308/7159186

TRASH (Tandem Repeat Annotation and Structural Hierarchy) is a tool that identifies and maps tandem repeats in nucleotide sequence, without prior knowledge of repeat composition.

TRASH analyses a fasta assembly file, identifies regions occupied by repeats and then precisely maps them and their higher order structures.

TRASH searches for continuous, highly similar, tandemly arranged DNA repeats of a similar unit size. This excludes transposable elements and interspersed repeats from analysis and allows for precise definition of tandemly arranged repeats.





□ GraNA: Supervised biological network alignment with graph neural networks

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538184v1

GraNA, a deep learning framework for the supervised NA paradigm for the pairwise network alignment problem. GraNA utilizes within-network interactions and across-network anchor links for learning protein representations and predicting functional correspondence.

GraNA integrates sequence similarity edges as additional anchor links to guide the alignment and pre-computed network embeddings as node features to better encode the topological roles of network nodes.





□ Riboformer: A Deep Learning Framework for Predicting Context-Dependent Translation Dynamics

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538053v1

Riboformer uses a transformer architecture that detects long-range dependencies in the regulation of elongation. Riboformer models the context-dependent changes in ribosome dynamics at codon resolution.

The transformer block consists of self-attention layers that gather the impact of distant codons based on their sequence representations, in contrast to convolutional neural network that relies on convolution operators to detect local sequence motifs.

Riboformer can be combined with in silico mutagenesis analysis to identify sequence motifs that contribute to ribosome stalling. It also utilizes a reference input to prevent the learning of noninformative signals due to the experimental bias.





□ CellANOVA: Signal recovery in single cell batch integration

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539614v1

CellANOVA utilizes a “pool-of-controls”, applicable across diverse settings, to separate unwanted variation from biological variation. CellANOVA allows the recovery of subtle biological signals and corrects, to a large extent, the data distortion introduced by integration.

A control-pool is a set of samples whereby variation beyond what is preserved by the existing integration are not of interest to the study. The control-pool samples are utilized to estimate a latent linear space that captures cell- and gene-specific unwanted batch variations.

CellANOVA produces a batch corrected GE matrix which can be used for gene-pathway level downstream analyses. By using the control pool in the estimation of the batch variation space, CellANOVA recovers any variation in the non-control samples that lie outside this space.





□ ProteiNN: a Transformer-based model for end-to-end single-sequence protein structure prediction

>> https://www.biorxiv.org/content/10.1101/2023.04.26.538026v1

ProteiNN predicts protein secondary and tertiary structures directly from integer-encoded amino acid sequences. The model was trained and evaluated using the SideChainNet dataset, which provides the basis for complete model training.

The input to the module is a sequence of feature vectors mapped to these component spaces via linear transformations. The multi-head mechanism enables the model to learn relationships between amino acids in parallel.

ProteiNN uses a gating mechanism that modulates the information flow between the input and output, allowing the model to emphasize specific relationships and discard irrelevant information selectively.





□ DeepUMQA3: a web server for model quality assessment of protein complexes

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538194v1

DeepUMQA and DeepUMQA2, new features were designed for complex structures, and the lDDT of each residue and the accuracy of interface residues were predicted using an improved deep neural network.

At the level of overall complex, the overall complex is regarded as a large monomer structure. DeepUMQA3 provides fast and accurate interface residue accuracy prediction and per-residue lDDT prediction services for protein complexes.





□ ecpc: an R-package for generic co-data models for high-dimensional prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05289-x

ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data were handled by adaptive discretisation, potentially inefficiently modelling and losing information.

An extension to the method for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation.




□ MaxKAT: A maximum kernel-based association test to detect the pleiotropic genetic effects on multiple phenotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad291/7146028

MaxKAT reduces computational intensity greatly while maintaining high accuracy. Extensive simulations demonstrate that MaxKAT can properly control type I error rates and obtain remarkably higher power than KAT under most of the considered scenarios.

A generalized extreme value distribution is employed to calculate the statistical significance of MaxKAT under the null hypothesis. In addition, the proposed test can accommodate high-dimensional data and yield high power against various alternative hypotheses.





□ SeqImprove: Machine Learning Assisted Creation of Machine Readable Sequence Information

>> https://www.biorxiv.org/content/10.1101/2023.04.25.538300v1

SeqImprove is designed to aid authors in creating machine readable sequence data with complete metadata. It consists of a user-interface that was built using modular code. It can be reused by others to work as the front-end for their curation software.

As input, SeqImprove takes in a sequence file in the Synthetic Biology Open Language (SBOL) format or a link to a sequence stored in SynBioHub. It makes the information machine readable by using existing ontologies to structure the metadata.





□ CNV-ClinViewer: Enhancing the clinical interpretation of large copy-number variants online

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad290/7146044

CNV-ClinViewer enables real-time interactive exploration of large CNV datasets in a user-friendly designed interface and facilitates semi-automated clinical CNV interpretation following the ACMG guidelines by integrating the ClassifCNV tool.

The CNV-ClinViewer allows analysis of single or multiple CNVs, of the used to identify them. Minimal required information for each CNV, including whole chromosome trisomies and monosomies, is the chromosome, start, end and CNV type.





□ OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad313/7146343

OrthoVenn3 provides gene family contraction and expansion analysis to support researchers better understanding the evolutionary history of gene families, as well as collinearity analysis to detect conserved and variable genomic structures.

OrthoVenn3 offers multiple out-puts, including the UpSet table, occurrence table, phylogenetic tree, and collinearity graph, which provides users with various perspectives on their data.





□ ELVAR: Cell-attribute aware community detection improves differential abundance testing from single-cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538653v1

ELVAR uses cell attribute aware clustering when inferring differentially enriched communities within the single-cell manifold. ELVAR can detect disease relevant DA-shifts in other cell-types and biological conditions.

The improved sensitivity to detect DA-shifts, as displayed by ELVAR, was also seen when benchmarked against an analogous clustering-based DA-method that uses Louvain in place of EVA.





□ xQTLbiolinks: a comprehensive and scalable tool for integrative analysis of molecular QTLs

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538654v1

xQTLbiolinks is a end-to-end bioinformatic tool for efficient mining and analyzing public and user-customized xQTLs data for the discovery of disease susceptibility genes.

xQTLbiolinks allows users to conveniently retrieve ×QTLs data and metainformation for further analysis through gene names/IDs, tissue names, or genomic regions of interest.





□ Combining LIANA and Tensor-cell2cell to decipher cell-cell communication across multiple samples

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538731v1

Integrating LIANA and Tensor-cell2cell, which combined can deploy multiple existing methods and resources, to enable the robust and flexible identification of cell-cell communication programs across multiple samples.

In this protocol, the integration of the tools facilitates the choice of method to infer cell-cell communication and subsequently perform an unsupervised deconvolution to obtain and summarize biological insights.





□ Signed distance correlation (SiDCo): an online implementation of distance correlation and partial distance correlation for data-driven network analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad210/7151065

SiDCo is a GUI-platform for calculation of distance correlation in omics data, measuring linear and non-linear dependences between variables, as well as correlation between vectors of different lengths, e.g., different sample sizes.

Distance correlations can be selected as one-to-one / one-to-all correlations, showing relationships b/n each / all other features one at a time. SiDCo uses partial distance correlation, calculated using the Gaussian Graphical model approach adapted to distance covariance.





□ ERStruct: a fast Python package for inferring the number of top principal components from whole genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05305-0

ERStruct enables the inference of population structure using whole-genome sequencing data. By leveraging parallel computing and GPU acceleration, ERStruct achieves significant improvements in the speed of matrix operations for large-scale data.

In GOE.py, Monte Carlo method is used in the ERStruct algorithm to obtain the null distribution of our proposed ERStruct test statistic, which starts by generating multiple replications of high-dimensional Gaussian Orthogonal Ensemble matrices.





□ PascalX: a python library for GWAS gene and pathway enrichment tests

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad296/7151067

PascalX allows for scoring genes and annotated gene sets for enrichment signals based on data from, both, single GWAS and pairs of GWAS. The gene scores take into account the correlation pattern between SNPs.

They are based on the cumulative density function of a linear combination of χ2 distributed random variables, which can be calculated either approximately or exactly to high precision.





□ CZ CELLxGENE Discover Census

>> https://chanzuckerberg.github.io/cellxgene-census/

The Census provides efficient computational tooling to access, query, and analyze all single-cell RNA data from CZ CELLxGENE Discover.

Using a new access paradigm of cell-based slicing and querying, you can interact with the data through TileDB-SOMA, or get slices in AnnData or Seurat objects, thus accelerating your research by significantly minimizing data harmonization.





□ kimma: flexible linear mixed effects modeling with kinship covariance for RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad279/7152273

kimma supports DEG analyzes incl. covariance random effects. Kimma is an open-source R package that provides flexible linear mixed effects modeling for bulk RNA-seq data including univariate, multivariate, random, and covariance random effects as well as gene-level weights.

kimma utilizes a single function, kmFit, for modeling, ensuring consistent syntax, inputs, and outputs. Moreover, kimma provides post-hoc pairwise tests, model fit metrics like AIC, and fit warnings on a per gene basis.





□ CAGECAT: The CompArative GEne Cluster Analysis Toolbox for rapid search and visualisation of homologous gene clusters

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05311-2

CAGECAT has been designed to provide rapid interoperability between these functions, where homologous clusters of interest can be selected to be used in subsequent analysis.

CAGECAT can yield relevant matches that aid in the comparison, taxonomic distribution, or evolution of an unknown query. The search module leverages the cblaster pipeline, which utilises remote BLAST searches via NCBI’s servers as well as accelerated local Hidden Markov Model.





□ cellsnake: a user-friendly tool for single cell RNA sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539204v1

Cellsnake allows parallelization and readily utilizes high performance computing (HPC) platforms. cellsnake provides metagenome analysis capabilities if unmapped reads are available.

cellsnake can utilize different scRNA-seq algorithms to simplify tasks such as automatic mitochondrial gene trimming, selection of optimal clustering resolution, doublet filtering, visualization of marker genes, enrichment analysis and pathway analysis.





□ Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall

>> https://www.biorxiv.org/content/10.1101/2023.05.04.539448v1

Defining read-based methodologies as those requiring alignment of individual sequencing reads to a reference genome and applying specific read-based variant-calling algorithms to these alignments to identify variants.

Assembly-based methods first generate ab initio a whole-genome assembly from LRS reads without guidance from a particular reference genome, and then proceed analogously by aligning this assembly to a reference genome to call variants using assembly-based calling algorithms.





□ HiPhase: Jointly phasing small and structural variants from HiFi sequencing

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539241v1

HiPhase jointly phases SNVs, indels, and structural variants called from PacBio HiFi sequencing on diploid organisms. HiPhase uses two novel approaches to solve the phasing problem: dual mode allele assignment and a phasing algorithm based on the A* search algorithm.

HiPhase offers additional benefits: no down-sampling, multi-allelic variation, logic to span coverage gaps with supplementary alignments, innate multi-threading, built-in statistics gathering, and assigning aligned reads to a haplotype (“haplotagging”) while phasing.





□ scMayoMap: an easy-to-use tool for cell type annotation in single-cell RNA-sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.03.538463v1

ScMayoMap takes the standard cluster marker gene list as input and returns the cell type prediction results in a plot and the mapped gene list. scMayoMap allows assignment of multiple cell types to the same cluster if their evidence is similar.

scMayoMap can predict PBMC cell types with small errors, suggesting that marker-based approach is still a promising approach if applied properly.





□ DeepGNN: Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05303-2

DeepGNN, a paradigm shift with semi-supervised learning, which does not only exploit labeled sequences (e.g. human genome with ChIP-seq experiment), but also unlabeled sequences available in much larger amounts.

In parallel, the model takes as a secondary input the graph matrix connecting homologous sequences between species. An improvement would be to infer the homology matrix from the sequence embedding itself during training.





□ Challenges and considerations for reproducibility of STARR-seq assays

>> https://genome.cshlp.org/content/early/2023/05/02/gr.277204.122.long

A strong advantage of STARR-seg is its ability to screen random fragments of DNA from any source for enhancer activity. To this effect, DNA can be sourced from commercially available DNA repositories, from specific populations carrying non-coding mutations or SNPs to be assayed.

Cloning strategies such as In-fusion HD, Gibson assembly, and NEBuilder HiFi DNA Assembly allow for fast and one-step reactions that use complimentary overhang sequences on the inserts and the vector.

Highlighting the different challenges in performing STARR-seg, a particularly long and difficult assay with huge potential to identify detailed enhancer landscapes and validate enhancer function.





□ STEMSIM: a simulator of within-strain short-term evolutionary mutations for longitudinal metagenomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad302/7156836

STEMSIM (short-term evolutionary mutations simulator), which can generate mutations incl. SNV and InDel with various frequency distributions within strains in raw metagenomic sequencing data under a specified nucleotide substitution model.

STEMSIM directly takes the output of CAMISIM as input data. Next, the raw sequencing reads are mapped to the original reference genomes to obtain the alignment files (sam/bam) by Bowtie2.

Then, the details of mutations are gerated according to the specified parameters, such as the number of nucleotide substitutions, and the distribution and trajectory of allele frequency.





□ scDist: Robust identification of perturbed cell types in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.05.06.539326v1

scDist estimates the distance between condition means in high-dimensional gene expression space for each cell type. scDist can recover biologically relevant between-group differences while also controlling for sample-level variability.

scDist is based on a linear mixed-effects model of single-cell GE counts. scDist uses an approximation for the between-group differences, based on a low-dimensional embedding, which results in a computationally convenient implementation that is substantially faster than Augur.





□ crosshap: Local haplotype visualization for trait association analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.07.539781v1

crosshap performs density-based clustering of variants based on their linkage profiles to capture haplotype structures in local genomic regions. Tightly linked variants are clustered into MGs, and individuals are grouped into local haplotypes by shared allelic combinations.

Visualization tools are provided by crosshap for choosing optimal clustering parameters and producing intuitive crosshap figures that present information on the complex relationships between linked variants, haplotype combinations, and phenotypic/metadata traits of individuals.





□ SpatialData: an open and universal data framework for spatial omics

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539647v1

SpatialData, a framework that establishes a unified and extensible multi-platform file-format, lazy representation of larger-than-memory data, transformations, and alignment to common coordinate systems.

SpatialData facilitates spatial annotations and cross-modal aggregation and analysis, the utility of which is illustrated via multiple vignettes including integrative analysis on a multi-modal Xenium and Visium breast cancer study.





Pitfalls.

2023-05-15 03:03:03 | Science News


The Pitfalls of generative AI can generally be replaced by the problem of determining where to intervene with evaluation procedures for noise and bias in procedures.

Regulatory measures for high-risk classification are operational evaluations and have not accurately estimated the technical hurdles at present. However, retroactively tracing from "facts" to "outcomes" becomes mechanically difficult.


生成系AIのPitfallは、概してプロシージャにおいて不可測なノイズやバイアスの評価手順をどこに介在させるかという問題に置き換えられる。ハイリスク分類の規制は運用上の評価であり、技術的なハードルを現状正確に見積もってはいない。但し『事実』から『結果』へ遡行するのは力学的に困難となる。





Chrysanthemum.

2023-04-24 04:44:44 | Science News

(Art by gen_ericai)




□ Genomic language model: Deep learning of genomic contexts predicts protein co-regulation and function

>> https://www.biorxiv.org/content/10.1101/2023.04.07.536042v1

A genomic language model (gLM) learns contextualized protein embeddings that capture the genomic context as well as the protein sequence itself, and appears to encode biologically meaningful and functionally relevant information. gLM is learning co-regulated functional modules.

gLM is based on the transformer architecture. gLM is trained with the masked language modeling objective, with the hypothesis that its ability to attend to different parts of a multi-gene sequence will result in the learning of gene functional semantics and regulatory syntax.





□ GPN: DNA language models are powerful zero-shot predictors of genome-wide variant effects

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504706v2

GPN’s internal representation of DNA sequences can distinguish genomic regions like introns, untranslated regions, and coding sequences. The confidence of GPN’s predictions can help reveal regulatory grammar.

GPN can be employed to calculate a pathogenicity or functionality score for any SNP in the genome using the log-likelihood ratio between the alternate and reference allele. GPN can learn from joint nucleotide distributions across all similar contexts appearing in the genome.

GPN uses the Hugging Face library to one-hot encode the masked DNA sequence and process it thru 25 convolutional blocks. Each block contains a dilated layer, feed-forward layer, intermediate residual connections, and layer normalization. The embedding is fixed at 512 dimensions.





□ The architecture of information processing in biological systems

>> https://arxiv.org/abs/2301.12812

An archetypal model for sensing that starts from a thermodynamically consistent description. The combined effects of storage and negative feedback promote the emergence of a rich information dynamics shaped by adaptation and finite-time memory.

A chemical information reservoir for the system allows it to dynamically build up information on an external environment while reducing internal dissipation. Optimal sensing emerges from a dissipation-information trade-off, requires far-from-equilibrium in low-noise regimes.





□ DeepCORE: An interpretable multi-view deep neural network model to detect co-operative regulatory elements

>> https://www.biorxiv.org/content/10.1101/2023.04.19.536807v1

DeepCORE uses a multi-view architecture to integrate genetic and epigenetic profiles in a DNN. It captures short-range and long-range interactions between REs through BiLSTM.

The learnt attention is a vector of length equal to the number of output nodes from the CNN layer containing importance score of each genomic region.

DeepCORE then joins the two views by concatenating the decoder outputs from each view and giving it to a fully connected feedforward neural network (FNN) to predict continuous gene transcription levels.





□ An in-depth comparison of linear and non-linear joint embedding methods for bulk and single-cell multi-omics

>> https://www.biorxiv.org/content/10.1101/2023.04.10.535672v1

Non-linear methods developed in other fields generally outperformed the linear and simple non-linear ones at imputing missing modalities. CGVAE and ccVAE did better than PoE and MoE on both bulk and single-cell data, while they typically underperformed in the other tasks.

ccVAE uses a single encoder for the concatenation of both modalities, which might be beneficial for generation coherence, as the latent space is directly and concurrently influenced by matched samples from all modalities.

The architecture of CGVAE is identical to that of MoE and PoE with separate encoders per modality. MOFA+ has the advantage that it provides useful diagnostic messages about the input data as well as the learnt space.





□ Genotyping variants at population scale using DRAGEN gVCF Genotyper

>> https://www.illumina.com/science/genomics-research/articles/gVCF-Genotyper.html

DRAGEN gVCF Genotyper implements an iterative workflow to add new samples to an existing cohort. This workflow allows users to efficiently combine new batches of samples with existing batches without repeated processing.

DRAGEN gVCF Genotyper computes many variant metrics on the fly, among them allele counts. DRAGEN gVCF Genotyper relies on the gVCF input format, which contains both variant information, like a VCF, and a measure of confidence of a variant not existing at a given position.





□ SAVEMONEY: Barcode-free multiplex plasmid sequencing using Bayesian analysis and nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2023.04.12.536413v1

SAVEMONEY (Simple Algorithm for Very Efficient Multiplexing of Oxford Nanopore Experiments for You) guides researchers to mix multiple plasmids and subsequently computationally de-mixes the resultant sequences.

SAVEMONEY involves submitting samples with multiple different plasmids mix and deconvolving the obtained sequencing results while maintaining the quality of the analysis. SAVEMONEY leverages plasmid maps, which are in most cases already made prior to plasmid construction.





□ GraphPart: Homology partitioning for biological sequence analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536886v1

GraphPart, an algorithm for homology partitioning, where as many sequences as possible are kept in the dataset, but partitions are defined such that closely related sequences always end up in the same partition.

GraphPart operates on real-valued distance metrics. Sequence identities ranging from 0 to 1 are converted to distances as d(a,b) = 1-identity(a,b). The partitioning threshold undergoes the same conversion. GraphPart can accept any similarity metric and skip the alignment step.





□ RASP / FAAST: Assisting and Accelerating NMR Assignment with Restrainted Structure Prediction

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536890v1

RASP (Restraints Assisted Structure Predictor) is an architecture derived from AlphaFold evoformer and structure module, and it accepts abstract or experimental restraints, sparse or dense, to generate structures.

FAAST(iterative Folding Assisted peak ASsignmenT) is an iterative NMR NOESY peak assignment pipeline. Using chemical shift and NOE peak lists as input, FAAST assigns NOE peaks iteratively and generates a structure ensemble.





□ Emergent autonomous scientific research capabilities of large language models

>> https://arxiv.org/abs/2304.05332

An Intelligent Agent system that combines multiple large language models for autonomous design/planning/execution. The Agent's scientific research capabilities with 3 distinct examples, with the most complex being the successful performance of catalyzed cross-coupling reactions.

The Agent calculates the required volumes of all reactants and writes the protocol. Subsequent GC-MS analysis of the reaction mixtures revealed the formation of the target products for both reactions. Agent corrects its own code based on the automatically generated outputs.






□ Generative Agents: Interactive Simulacra of Human Behavior

>> https://arxiv.org/abs/2304.03442

Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day.

An architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior.





□ Many bioinformatics programming tasks can be automated with ChatGPT

>> https://arxiv.org/abs/2303.13528


ChatGPT failed to solve 5 of the exercises within 10 attempts. This table summarizes characteristics of these exercises and provides a brief summary of complications that ChatGPT faced when attempting to solve them.





□ LOCC: a novel visualization and scoring of cutoffs for continuous variables

>> https://www.biorxiv.org/content/10.1101/2023.04.11.536461v1

Luo’s Optimization Categorization Curves (LOCC) helps visualize more information for better cutoff selection and understanding of the importance of the continuous variable against the measured outcome.

The LOCC score is made of three numeric components: a significance aspect, a range aspect, and an impact aspect. The higher the LOCC score, the more critical and predictive the expression is for prognosis.





□ Demultiplex2: robust sample demultiplexing for scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2023.04.11.536275v1

deMULTIplex2, a mechanism-guided classification algorithm for multiplexed scRNA-seq data that successfully recovers many more cells across a spectrum of challenging datasets compared to existing methods.

deMULTIplex2 is built on a statistical model of tag read counts derived from the physical mechanism of tag cross-contamination. Using GLM and expectation-maximization, deMULTIplex2 probabilistically infers the sample identity of each cell and classifies singlets w/ high accuracy.





□ acorn: an R package for de novo variant analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.11.536422v1

Acorn is an R package that works with de novo variants (DNVs) already called using a DNV caller. The toolkit is useful for extracting different types of DNVs and summarizing characteristics of the DNVs.

Acorn consists of several functions to analyze DNVs. readDNV reads in DNV data and turns it into an R object for use with other functions within acorn. Acorn fills a gap in genomic DNV analyses between the calling of DNVs and ultimate downstream statistical assessment.





□ VIPRS: Fast and accurate Bayesian polygenic risk modeling with variational inference

>> https://www.cell.com/ajhg/fulltext/S0002-9297(23)00093-9

VIPRS, a Bayesian summary statistics-based PRS method that utilizes variational inference techniques to approximate the posterior distribution for the effect sizes.

VIPRS is consistently competitive w/ the state-of-the-art in prediction accuracy while being more than twice as fast as popular MCMC-based approaches. This performance advantage is robust across a variety of genetic architectures, SNP heritabilities, and independent GWAS cohorts.





□ A gene-level test for directional selection on gene expression

>> https://academic.oup.com/genetics/advance-article/doi/10.1093/genetics/iyad060/7111744

Applying The QX test for polygenic selection to regulatory variants identified using Joint-tissue Imputation (JTI) models to test for population-specific selection on gene regulation in 26 human populations.

The gamma-corrected approach was uniformly more powerful than the permutation approach. Indeed, while the gamma-corrected test approaches a power of 1.0 under regimes with stronger selection, the effect-permuted version never reached that.





□ bootRanges: Flexible generation of null sets of genomic ranges for hypothesis testing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad190/7115835

bootRanges provides fast functions for generation of block bootstrapped genomic ranges representing the null hypothesis in enrichment analysis. bootRanges offers greater flexibility for computing various test statistics leveraging other Bioconductor packages.

Shuffling/permutation schemes may result in overly narrow test statistic null distributions and over-estimation of statistical significance, while creating new range sets w/ a block bootstrap preserves local genomic correlation structure and generates reliable null distributions.










□ catELMo: Context-Aware Amino Acid Embedding Advances Analysis of TCR-Epitope Interactions

>> https://www.biorxiv.org/content/10.1101/2023.04.12.536635v1

catELMo, whose architecture is adapted from ELMo (Embeddings from Language Models), a bi-directional context-aware language model. catELMo consists of a charCNN layer and four bidirectional LSTM layers followed by a softmax activation.

catELMo is trained on more than four million TCR sequences collected from ImmunoSEQ in an unsupervised manner, by contextualizing amino acid inputs and predicting the next amino acid token.





□ Streamlining PacBio HiFi assembly and QC with the hifi2genome workflow

>> https://research.arcadiascience.com/pub/resource-hifi2genome

hifi2genome assembles PacBio HiFi reads from a single organism and produce quality control statistics for the resulting assembly. The product of this pipeline is an assembly, mapped reads, and interactive visualizations reported with MultiQC.

hifi2genome uses Flye to assemble PacBio HiFi reads into contigs, followed by parallel processing steps for generating QC statistics. These steps include assembly QC stats with QUAST, lineage-specific QC stats with BUSCO, and mapping stats using SAMtools and minimap2.





□ disperseNN: Dispersal inference from population genetic variation using a convolutional neural network

>> https://academic.oup.com/genetics/advance-article/doi/10.1093/genetics/iyad068/7117621

disperseNN uses forward in time spatial genetic simulationsto train a deep neural network to infer the mean, per-generation dispersal distance from a single population sample of single nucleotide polymorphism (SNP) genotypes, e.g., whole genome data or RADseq data.

disperseNN predicts σ from full-spatial test data after simulations w/ 100 generations. Using successive layers of data compression, through convolution / pooling, to coerce disperseNN to look at the genotypes at different scales and learn the extent of linkage disequilibrium.





□ LinRace: single cell lineage reconstruction using paired lineage barcode and gene expression data

>> https://www.biorxiv.org/content/10.1101/2023.04.12.536601v1

LinRace (Lineage Reconstruction w/ asymmetric cell division model), that integrates the lineage barcode and gene expression data using the asymmetric cell division model and infers cell lineage under a framework combining Neighbor Joining and maximum-likelihood heuristics.

LinRace outputs more accurate cell division trees than existing methods for lineage reconstruction. Moreover, LinRace can output the cell states (cell types) of ancestral cells, which is rarely performed with existing lineage reconstruction methods.





□ Automatic extraction of ranked SNP-phenotype associations from text using a BERT-LSTM-based method

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05236-w

Transforms do not embed positional information as they do in recurrent models; however, they still embody positional information in modeling sentence order. Early stopping is a regularization technique to prevent over fitting when learning something iteratively.

Although the used linguist features could be employed to implement a superior association extraction method outperforming the kernel-based counterparts, the used BERT-CNN-LSTM-based methods exhibited the best performance.





□ nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05270-8

nRCFV, a truly normalised Relative Compositional Frequency Variation value. This new metrics add a normalisation constant to each of the different RCFV values (total, character-specific, taxon-specific) to mitigate the effect of increasing taxa number and sequence length.





□ Wearable-ome meets epigenome: A novel approach to measuring biological age with wearable devices.

>> https://www.biorxiv.org/content/10.1101/2023.04.11.536462v1

Aging is a dynamic process and simply utilizing chronological age as a predictor of All-Cause Mortality and disease onset is insufficient. Instead, measuring the organismal state of function, biological age, may provide greater insight.





□ PhenoCellPy: A Python package for biological cell behavior modeling

>> https://www.biorxiv.org/content/10.1101/2023.04.12.535625v1

PhenoCellPy defines Python classes for the Cell Volume (which it subdivides between the cytoplasm and nucleus) and its evolution, the state of the cell and the behaviors the cell displays in each state (called the Phase), and the sequence of behaviors (called the Phenotype).

PhenoCellPy’s can extend existing modeling frameworks as an embedded model. It integrates with the frameworks by defining the cell states (phases), signaling when a state change occurs, if division occurs, and by receiving information from the framework.





□ scVAEDer: The Power of Two: integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.13.536789v1

scVAEDer, a scalable deep-learning model that combines the power of variational autoencoders and deep diffusion models to learn a meaningful representation which can capture both global semantics and local variations in the data.

scVAEDer combes the strengths of VAEs and Denoising Diffusion Models (DDMs). It incorporates both VAE and DDM priors to more precisely capture the distribution of latent encodings in the data. By using vector arithmetic in the DDM space scVAEDer outperforms SOTA methods.





□ DeepEdit: single-molecule detection and phasing of A-to-I RNA editing events using nanopore direct RNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02921-0

DeepEdit can identify A-to-I editing events on single nanopore reads and determine the phasing information on transcripts through nanopore direct RNA sequencing.

DeepEdit is a fully connected neural network model which takes advantage of the raw electrical signal features flanking the editing sites. A total of 40,823 I-type reads from FY-ADAR2 and randomly chosen 47,757 HFF1 reads were used as the positive and negative controls.





□ GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02906-z

Genotype Block Compressor (GBC) manages genotypes in Genotype Block (GTB). GTB is a unified data structure to store large-scale genotypes into many highly addressable byte-encoding compression blocks. Then, multiple advanced algorithms were developed for efficient compression.

The AMDO (approximate minimum discrepancy ordering) algorithm is applied on the variant level to sort the variants with similar genotype distributions for improving the compression ratio. The ZSTD algorithm is then adopted to compress the sorted data in each block.





□ Multivariate Genome-wide Association Analysis by Iterative Hard Thresholding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad193/7126408

Multivariate IHT for analyzing multiple correlated traits. In simulation studies, multivariate IHT exhibits similar true positive rates, significantly lower false positive rates, and better overall speed than linear mixed models and canonical correlation analysis.

In IHT the most computationally intensive operations are the matrix-vector and matrix- matrix multiplications required in computing gradients. To accelerate these operations, SIMD (single instruction, multiple data) is employed for vectorization and tiling.





□ moslin: Mapping lineage-traced cells across time points

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536867v1

moslin, a Fused Gromov-Wasserstein-based model to couple matching cellular profiles across time points. moslin leverages both intra-individual lineage relations and inter-individual gene expression similarity.

moslin uses lineage information at two or more time points and to include the effects of cellular growth and stochastic cell sampling. The algorithm combines gene expression with lineage information at all time points to reconstruct precise differentiation trajectories.





□ MolCode: An Equivariant Generative Framework for Molecular Graph-Structure Co-Design

>> https://www.biorxiv.org/content/10.1101/2023.04.13.536803v1

MolCode, a roto-translation equivariant generative framework for Molecular graph-structure Co-design. In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure.

MolCode not only consistently generates valid and diverse molecular graphs/structures with desirable properties, but also generate drug-like molecules with high affinity to target proteins, which demonstrates MolCode’s potential applications in material design and drug discovery.





□ AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537157v1

AlcoR addresses the challenge of automatically modeling and distinguishing LCRs. AlcoR enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns.

AlcoR is reference- and alignment-free, providing additional methodologies for testing, incl. a highly-flexible simulation method for generating biological sequences with different complexity levels, sequence masking, and a automatic computation of the LCR maps into ideogram.





□ De novo reconstruction of satellite repeat units from sequence data

>> https://arxiv.org/abs/2304.09729

Satellite Repeat Finder (SRF), a de novo assembler for reconstructing SatDNA repeat units and can identify most known HORs and SatDNA in well-studied species without prior knowledge on monomer sequences or repeat structures.

SRF uses a greedy algorithm to assemble SatDNA repeat units, but may miss the lower abundance and higher diversity unit when sharing long similar sequences. SRF may reconstruct repeat units similar in sequence. The similar repeat units may be mapped the same genomic locus.





Cloud nine.

2023-04-24 04:43:44 | Science News

(Art by gen_ericai)




□ Cellcano: supervised cell type identification for single cell ATAC-seq data

>> https://www.nature.com/articles/s41467-023-37439-3

Cellcano adopts a two-round prediction strategy. In the first round, Cellcano trains a Multi-layer Preceptron (MLP) model on reference gene scores with known cell labels. Then, Cellcano uses the trained MLP to predict cell types on target gene scores.

With the predicted probability matrix, entropies are calculated for each cell. Cells with relatively low entropies are selected as anchors to train a Knowledge Distillation (KD) model. The trained KD model is used to predict cell types in remaining non-anchors.





□ Building pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2023.04.05.535718v1

PanGenome Graph Builder (PGGB), a reference-free pipeline to construct unbiased pangenome graphs. Its output presents a base-level representation of the pangenome, including variants of all scales from SNPs to SVs. The graph is unbiased - all genomes are treated equivalently.

PGGB uses an all-to-all alignment of the input sequences. PGGB makes no assumptions about phylogenetic relationships, orthology groups, or evolution- ary histories, allowing data to speak for itself without risk of implicit bias that may affect inference made on the graph.





□ scMSGL: Kernelized multiview signed graph learning for single-cell RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05250-y

scMSGL is based on recently developed graph signal processing (GSP) based graph learning, where GRNs and gene expressions are modeled as signed graphs and graph signals.

scMSGL learns functional relationships between genes across multiple related classes of single cell gene expression datasets under the assumption that there exists a shared structure across classes.

scMSGL formulates a highly efficient optimization framework that extends the signed graph learning approach to high dimensional datasets with multiple classes. The kernelization trick embedded within the algorithm renders it capable of handling sparse and noisy features.





□ SLAT: Spatial-linked alignment tool for aligning heterogenous slices properly

>> https://www.biorxiv.org/content/10.1101/2023.04.07.535976v1

SLAT (Spatially-Linked Alignment Tool), a graph-based algorithm for efficient and effective alignment of spatial omics data. SLAT is the first algorithm capable of aligning heterogenous spatial data across distinct technologies and modalities.

By modeling the intercellular relationship as a spatial graph, SLAT adopts graph neural networks and adversarial matching for aligning spatial slices. SLAT calculates a similarity score for each aligned cell pair, making it possible to pinpoint spatially discrepant regions.





□ XClone: detection of allele-specific subclonal copy number variations from single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.04.03.535352v1

XClone accounts for two modules: the B-allele frequency (BAF) of heterozygous variants and the sequencing read depth ratio (RDR) of individual genes, respectively detecting the variation states on allelic balance and absolute copy number, which are further combined to generate the final CNV states.

XClone is implemented a three-step of haplotyping, from individual SNPs to a gene by population-based phasing, from consecutive genes to a gene bin by an Expectation-Maximization algorithm, and from gene bins to a chromosome arm by a dynamic programming method.

XClone employs two orthogonal strategies to smooth the CNV-state assignments on BAF/RDR: horizontally w/ hidden Markov models along the genome and vertically w/ KNN cell-cell connectivity graph, which not only denoises the data but also preserves the single-cell resolution.





□ TESA: A Weighted Two-stage Sequence Alignment Framework to Identify DNA Motifs from ChIP-exo Data

>> https://www.biorxiv.org/content/10.1101/2023.04.06.535915v1

TESA constructs a graph, in which vertices represent sequence segments and an unweighted edge connecting two vertices indicates a highly ranked similarity between them among all pairs of sequence segments between two sequences.

TESA identifies dense subgraphs as the seed for graph clustering. Then, TESA performs graph clustering based on seeds, leading to vertex clusters, each of which corresponds to a preliminary motif.

TESA optimizes the lengths of preliminary motifs using a bookend model. We call the sequence segments corresponding to the assembled clusters as motif seeds. TESA refines the sequence segments for each motif, by scoring them using the motif profile built from the motif seeds.





□ scART: recognizing cell clusters and constructing trajectory from single-cell epigenomic data

>> https://www.biorxiv.org/content/10.1101/2023.04.08.536108v1

scART integrates the MST and DDRTree algorithms used in reserved graph embedding (RGE), a population graph-based pseudotime analysis algorithm in scRNA-seq analysis.

scART predicts the developmental trajectory based on the lower dimensional space that the cells lie upon and use a cell-cell graph to describe the structure among cells. scART identifies branch points that describe significant divergences in cellular states automatically.





□ CLAMP: Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language

>> https://arxiv.org/pdf/2303.03363.pdf

Scientific language models (SLMs) can utilize both natural language and chemical structure but are suboptimal activity predictors. Large language models have demonstrated great zero- and few-shot capabilities.

The SLMs Galactica and KV-PLM tokenize the SMILES representations of chemical structures and embed those chemical tokens in the same embedding space as language tokens.

CLAMP improves predictive performance on few-shot learning benchmarks and zero-shot problems in drug discovery. CLAMP uses separate encoders for chemical and natural language data and embeds them into a joint embedding space.





□ LRU: Resurrecting Recurrent Neural Networks for Long Sequences

>> https://arxiv.org/abs/2303.06349

Deep Linear Recurrent Unit (LRU) architecture is inspired by S4. The model is a stack of LRU blocks, with nonlinear projections in between, and also uses skip connections 90% and normalization methods like batch/layer normalization.

Normalizing the hidden activations on the forward pass is important when learning tasks w/ long-sequences. LRU shares similarities w/ modern deep state-space models, its design does not rely on discretization of a latent continous-time system or on structured transition matrices.





□ K-RET: Knowledgeable Biomedical Relation Extraction System

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad174/7108769

RET is a flexible biomedical Relation Extraction (RE) system, allowing for the use of any pre-trained BERT-based system (e.g., SciBERT and BioBERT) to inject knowledge in the form of knowledge bases from a single source, multiple sources, and multi-token entities.

Adding a Knowledge layer to these entities from associations made w/ their domain ontologies. The tokens are flattened into a sequence for token embedding. Embedding / Seeing layers are fed to the Mask-transformer, corresponding to a stack of multiple mask-self-attention blocks.





□ mixMVPLN: Finite Mixtures of Matrix Variate Poisson-Log Normal Distributions for Three-Way Count Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad167/7108770

mixMVPLN is an R package for performing model-based clustering of three-way count data using mixtures of matrix variate Poisson-log normal (mixMVPLN) distributions.

mixMVPLN consists of three different frameworks: A Markov chain Monte Carlo expectation-maximization algorithm (MCMC-EM), Variational Gaussian approximations (VGAs), and a hybrid approach that combines both the variational approximation-based approach and MCMC-EM-based approach.





□ ICOR: improving codon optimization with recurrent neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05246-8

ICOR adopts the Bidirectional Long-Short-Term Memory (BiLSTM) architecture because of its ability to preserve temporal information from both the past and future. In a gene, the BiLSTM would theoretically use surrounding synonymous codons to make a prediction.

The ICOR architecture consists of a 12-layer RNN. It serves as the “brain” for the codon optimization tool. By providing the amino acid sequence as an input, ICORnet can output a nucleotide codon sequence that would ideally match the codon biases of the host genome.





□ Reconstruction Set Test (RESET): a computationally efficient method for single sample gene set testing based on randomized reduced rank reconstruction error

>> https://www.biorxiv.org/content/10.1101/2023.04.03.535366v1

RESET quantifies gene set importance at both the sample-level and for the entire data based on the ability of genes in each set to reconstruct values for all measured genes.

RESET is realized using a computationally efficient randomized reduced rank reconstruction algorithm and can effectively detect patterns of differential abundance and differential correlation for both self-contained and competitive scenarios.





□ transXpress: a Snakemake pipeline for streamlined de novo transcriptome assembly and annotation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05254-8

transXpress supports two popular assembly programs, Trinity and rnaSPAdes, and allows parallel execution on heterogeneous cluster computing hardware. The transXpress pipeline performs parallel execution of the underlying tools whenever possible.

transXpress splits the input datafiles (Trimmomatic / FASTA steps) into multiple partitions (batches) to speed up even single-threaded tasks by parallelization. The partial results files from such split tasks are then merged automatically back into a single output file.





□ bioseq2seq / LFNet: Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task

>> https://www.biorxiv.org/content/10.1101/2023.04.03.535488v1

bioseq2seq can recover potentially translated micropeptides is a proof-of-concept for using machine predictions to explore the cryptic space of proteome. Local Filter Network (LFNet) is a computationally efficient network layer based on the short-time Fourier transform.

The LFNet architecture will be of broad utility in biological sequence modeling tasks, w/ frequency-domain multiplication enabling larger context convolutions than in common convolutional architectures and lower computational complexity of O(Nlog N) in comparison to transformers.





□ SpliceAI-10k calculator for the prediction of pseudoexonization, intron retention, and exon deletion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad179/7109800

SAI-10k-calc was designed to predict specific types of splicing aberrations, namely: pseudoexonization, partial intron retention, partial exon deletion, (multi)exon skipping, and whole intron retention.

SAI-10k-calc can process SpliceAI scores resulting from SNVs at any exonic or intronic position, but not scores resulting from indels due to the complexity of distance interpretations for such variants.





□ Modular response analysis reformulated as a multilinear regression problem

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad166/7109803

This formulation brings a number of advantages over the classical approach by providing a natural way to model data variability across experimental replicates, or even multiple perturbations at certain or all the modules.

This work dramatically extended the domain of application of MRA to much larger networks of sizes up to 1,000. This is a 100-fold increase compared to MRA with standard linear algebra, which had difficulties going beyond 10-node networks in the experiments.





□ BLAZE: Identification of cell barcodes from long-read single-cell RNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02907-y

BLAZE eliminates the requirement for matched short-read scRNA-seq, simplifying long-read scRNA-seq workflows. BLAZE performs well across different sample types, sequencing depths, and sequencing accuracies and outperforms other barcode identification tools such as Sockeye.

BLAZE seamlessly integrates with the existing FLT-seq—FLAMES pipeline which performs UMI calling, read assignment, and mapping to enable the identification and quantification of RNA isoforms and their expression profiles across individual cells and cell types.





□ Genomics to Notebook (g2nb): extending the electronic notebook to address the challenges of bioinformatics analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.04.535621v1

The g2nb environment incorporates multiple bioinformatics software platforms within the notebook interface. A standard Jupyter notebook consists of a sequence of cells, each of which can contain text or executable code.

g2nb provides an interface within the notebook to tools that are hosted on a remote Galaxy or GenePattern server. g2nb presents a form-like interface similar to the web interface of the original platforms, requiring that an investigator provide only the input parameters and data.





□ THAPBI PICT - a fast, cautious, and accurate metabarcoding analysis pipeline

>> https://www.biorxiv.org/content/10.1101/2023.03.24.534090v1

The THAPBI PICT core workflow comprises data reduction to unique marker sequences, often called amplicon sequence variants (ASVs), discard of low abundance sequences to remove noise and artifacts, and classification using a curated reference database.





□ Smmit: Integrating multiple single-cell multi-omics samples

>> https://www.biorxiv.org/content/10.1101/2023.04.06.535857v1

Smmit, a computational pipeline that leverages existing integration methods to simultaneously integrate both samples and modalities and produces a unified representation of reduced dimensions.

Smmit builds upon existing integration methods of Harmony and Seurat. Smmit employs Harmony to integrate multiple samples within each data modality. Smmit applies Seurat’s Weighted Nearest Neighbor function to integrate multiple data modalities and produces a single UMAP space.





□ clusterMaker2: a major update to clusterMaker, a multi-algorithm clustering app for Cytoscape

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05225-z

clusterMaker2 provides new capabilities to use remote servers to execute algorithms asynchronously. clusterMaker2 performs a variety of analyses, incl. Leiden clustering to break the entire network into smaller clusters, hierarchical clustering and dimensionality reduction.





□ GenoVi: an open-source automated circular genome visualizer for bacteria and archaea

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010998

GenoVi automatically calculates the GC content and GC skew from a genome, and unless specified, assigns CDS to COG categories. GenoVi produces histograms, heatmaps and tables of COG categories and frequency, and a table with general information about each contig/replicon.

GenoVi, a Python command-line tool able to create custom circular genome representations for the analysis and visualization of microbial genomes and sequence elements.





□ streammd: fast low-memory duplicate marking using a Bloom filter

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad181/7110893

streammd is implemented as a C++ program running in a single process. A Bloom filter is initialized with k = 10 hash functions and a bit array sized to meet user-specified memory and false-positive requirements. Input is QNAME-grouped SAM records.





□ Automatic block-wise genotype-phenotype association detection based on hidden Markov model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05265-5

A Hidden Markov Model for the classification of influential sites. The states themselves are governed by a Markov process, with a starting state probability vector for the first site and a transition probability matrix when passing from one site to the next.

The algorithm accepts as input a matrix of genotypes and a vector of phenotypes, and alternates between updating the most probable state sequence and updating the model parameters, until finally halting and outputting its best estimate of the most probable state sequence.





□ scSPARKL: Apache Spark based parallel analytical framework for the downstream analysis of scRNA-seq data.

>> https://www.biorxiv.org/content/10.1101/2023.04.07.536003v1

scSPARKL leverages the power of Apache Spark to enable the efficient analysis of single-cell transcriptomic data. It incorporates six key operations: data reshaping, data preprocessing, cell/gene filtering, data normalization, dimensionality reduction, and clustering.

The dataframe is arranged according to the ranks of the rows. The retained rank matrix is used to reposition the obtained averages at their respective place. This makes it easier to compare the values of different distribution, while preserving the actual coherence of the matrix.





□ A modular metagenomics analysis system for integrated multi-step data exploration

>> https://www.biorxiv.org/content/10.1101/2023.04.09.536171v1

Each module is designed to complete a single analytic task (ex. de novo assembly), accepting a standardized input format (ex. CSV of paths to FastQ files) generated by antecedent modules, and generating a standardized output format(s) (ex. CSV of paths to assembled contigs).





□ Regression Transformer enables concurrent sequence regression and generation for molecular language modelling

>> https://www.nature.com/articles/s42256-023-00639-z

The Regression Transformer (RT), a method that abstracts regression as a conditional sequence modelling problem. This introduces a new direction for multitask language models, seamlessly bridging sequence regression and conditional sequence generation.

Despite solely relying on tokenization of numbers and cross-entropy loss, RT can successfully solve regression tasks. The same model can generate text sequences given continuous properties. They devises numerical encodings (NEs) to inform the model about the semantic proximity.





□ GALBA: Genome Annotation with Miniprot and AUGUSTUS

>> https://www.biorxiv.org/content/10.1101/2023.04.10.536199v1

GALBA is a fully automated pipeline that takes protein sequences of one or many species and a genome sequence as input, aligns the proteins to the genome with miniprot, trains AUGUSTUS, and then predicts genes with AUGUSTUS using the protein evidence.

GALBA uses miniprothint - an alignment scorer. miniprothint discards the least reliable evidence and separates the remaining evidence into high/low confidence. High-confidence evidence is used to select training gene candidates and is enforced during gene prediction w/ AUGUSTUS.





□ Comparison of transformations for single-cell RNA-seq data

>> https://www.nature.com/articles/s41592-023-01814-1

Variance-stabilizing transformations based on the delta method promise an easy fix for heteroskedasticity if the variance predominantly depends on the mean.

Considering the acosh transformation equation, the shifted logarithm equation with pseudo-count y0 = 1 or y0 = 1 / (4α) and the shifted logarithm with CPM.

The Pearson residuals-based transformation has attractive theoretical properties and, in this benchmarks, performed similarly well as the shifted logarithm transformation. It stabilizes the variance across all genes and is less sensitive to variations of the size factor.





□ Comprehensive benchmark and architectural analysis of deep learning models for nanopore sequencing basecalling

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02903-2

This toolbox can be used as benchmark for cross-comparison of existing and future basecallers.

Transformer layers have gained popularity in other fields due to increased performance and speed. However, the top ten models all use RNN (LSTM) layers in their encoders. A direct comparison shows that RNNs outperform Transformer layers in all the metrics.





□ uTR: Decomposing mosaic tandem repeats accurately from long reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad185/7114028

uTR estimates a mosaic TR pattern for an input DNA string, but the pattern and the string may have a number of mismatches because of variants in units.





□ FISHFactor: A Probabilistic Factor Model for Spatial Transcriptomics Data with Subcellular Resolution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad183/7114027

FISHFactor is a non-negative, spatially informed factor analysis model with a Poisson point process likelihood to model single-molecule resolved data, as for example obtained from multiplexed fluorescence in-situ hybridization methods.

FISHFactor allows to integrate multiple cells by jointly inferring cell-specific factors and a weight matrix that is shared between cells. The model is implemented using the deep probabilistic programming language Pyro and the Gaussian process package GPyTorch.





□ methylR: a graphical interface for comprehensive DNA methylation array data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad184/7114023

methylR, a complete pipeline for the analysis of both 450K and EPIC Illumina arrays which not only offers data visualization and normalization but also provide additional features such as the annotation of the genomic features resulting from the analysis.





□ The scverse project provides a computational ecosystem for single-cell omics data analysis The scverse project provides a computational ecosystem for single-cell omics data analysis

>> https://www.nature.com/articles/s41587-023-01733-8




FierceBiotech

Nothing good lasts forever. That sentiment held true for private biotech financing in 2022. After two record-setting years, fundraising finally fell, dipping 24% from the highs of 2021. Let’s take a deeper dive into 2022 VC trends.



Raft.

2023-04-24 04:42:44 | Science News
(Art by Beau Wright)



□ PAUSE: principled feature attribution for unsupervised gene expression analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02901-4

PAUSE, a novel, fully-unsupervised attribution method and demonstrate how it can be used to identify important pathways in transcriptomic datasets when combined with biologically-constrained autoencoders.

PAUSE uses a pathway module VAE, which is a sparse variational autoencoder model with deep, non-linear encoders / decoders. pmVAE uses sparse masked weight matrices to separate the weights of the encoder and decoder neural networks into non-interacting modules for each pathway.





□ GRACES: Graph Convolutional Network-based Feature Selection for High-dimensional and Low-sample Size Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad135/7135826

GRACES can select important features for HDLSS data. GRACES exploits latent relations between samples using various overfitting-reducing techniques to iteratively find a set of optimal features that give rise to the greatest decrease in the optimization loss.

GRACES outperforms HSIC Lasso and DNP (and other baseline methods) on both synthetic and real-world datasets. GRACES constructs a dynamic similarity graph based on the selected feature at each iteration; GRACES exploits advanced GCN (i.e., GraphSAGE) to refine sample embeddings.

GRACES is an iterative algorithm w/ 5 components: feature initialization / graph construction/NN/multiple dropouts/gradient computation. It involves considering weights along the dimensions corresponding to the selected features in the input weight matrix, w/o a bias vector.





□ Entropy predicts sensitivity of pseudo-random seeds

>> https://www.biorxiv.org/content/10.1101/2022.10.13.512198v2

Although the entropy curves are in general more spread out, the relative distances are relatively well preserved. The relative increase in entropy correlates well with the relative increase in sensitivity. And providing 3 new seed constructs, mixedstrobes/altstrobes/multistrobes.

Pseudo-random seed constructs have over k-mers reduces. This is because the high overlap of k-mers is removed with subsampling. Since the minimap2 implementation is centered around minimizers, it is possible that aligners customized for strobemers or other pseudo-random seeds.






□ Barren plateaus in quantum tensor network optimization

>> https://quantum-journal.org/papers/q-2023-04-13-974/

Analyzing the barren plateau phenomenon in the variational optimization of quantum circuits inspired by matrix product states (qMPS), tree tensor networks (qTTN), and the multiscale entanglement renormalization ansatz (qMERA).

The variance of the cost function gradient decreases exponentially with the distance of a Hamiltonian term from the canonical centre in the quantum tensor network. qMPS most gradient variances decrease exponentially and for qTTN as well as qMERA they decrease polynomially.

Focusing on k-local Hamiltonians, i.e. sums of observables which act on at most k qubits. One example of a 2-local Hamiltonian is the transverse-field quantum Ising chain. qMPS avoids the barren plateau problem for a Hamiltonian that is a sum of local terms acting on all qubits.





□ Bayes Hilbert Spaces for Posterior Approximation

>> https://arxiv.org/abs/2304.09053

Bayes Hilbert spaces are studied in functional data analysis in the context where observed functions are probability density functions and their application to computational Bayesian problems is in its infancy.

Exploring Bayes Hilbert spaces and their connection to Bayesian computation, in particular novel connections between Bayes Hilbert spaces, Bayesian coreset algorithms, and Bayesian coresets constructed using the Kullback-Leibler divergence and Bayes Hilbert spaces.





□ Categorical Structure in Theory of Arithmetic

>> https://arxiv.org/abs/2304.05477

A categorical analysis of the arithmetic theory 𝐼Σ1. It provides a categorical proof of the classical result that the provably total recursive functions in 𝐼Σ1 are exactly the primitive recursive functions. They construct the category PriM and show it is a pr-coherent category.

This strategy is to construct a coherent theory of arithmetic T, and prove that T presents the initial coherent category equipped with a parametrised natural number object. T is the Π2-fragment of 𝐼Σ1, and conclude they have the same class of provably total recursive functions.





□ Categories of hypermagmas, hypergroups, and related hyperstructures

>> https://arxiv.org/abs/2304.09273

Investigating the categories of hyperstructures that generalize hypergroups. By allowing hyperoperations w/ possibly empty products, one obtains categories with desirable features such as completeness and cocompleteness, free functors, regularity, and closed monoidal structures.

Unital, reversible hypermagmas -- mosaics -- form a worthwhile generalization of (canonical) hypergroups from the categorical perspective. Notably, mosaics contain pointed simple matroids as a subcategory, and projective geometries as a full subcategory.










□ COVET / ENVI: The covariance environment defines cellular niches for spatial inference

>> https://www.biorxiv.org/content/10.1101/2023.04.18.537375v1

COVET (the covariance environment), a representation that can capture the rich, continuous multivariate nature of cellular niches by capturing the gene-gene covariate structure across cells in the niche, which can reflect the cell-cell communication between them.

ENVI (Environmental variational inference), a conditional variational autoencoder that jointly embeds spatial and single-cell RNA-seq data into a latent space.

ENVI architecture includes a single encoder for both spatial and single-cell genomics data, and two decoder networks—one for the full transcriptome, and the second for the COVET matrix, providing spatial context.





□ LogBTF: Gene regulatory network inference using Boolean threshold network model from single-cell gene expression data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad256/7133738

LogBTF, a novel embedded Boolean threshold network method which effectively infers GRN by integrating regularized logistic regression and Boolean threshold function.

First, the continuous gene expression values are converted into Boolean values and the elastic net regression model is adopted to fit the binarized time series data.

Then, the estimated regression coefficients are applied to represent the unknown Boolean threshold function of the candidate Boolean threshold network as the dynamical equations.

To overcome the multi-collinearity and over-fitting problems, an effective approach is designed to optimize the network topology by adding a perturbation design matrix to the input data and thereafter setting sufficiently small elements of the output coefficient vector to zeros.





□ TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

>> https://arxiv.org/abs/2210.02186

TimesNet extends the analysis of temporal variations into the 2D space by transforming the 1D time series into a set of 2D tensors. This transformation can embed the intra/inter period-variations into the 2D tensors, making the 2D-variations to be modeled by 2D kernels.

TimesNet uses TimesBlock as a task-general backbone for time series analysis. TimesBlock can discover the multi-periodicity adaptively and extract the complex temporal variations from transformed 2D tensors by a parameter-efficient inception block.





□ GearNet: Protein Representation Learning by Geometric Structure Pretraining

>> https://arxiv.org/abs/2203.06125

GearNet (GeomEtry-Aware Relational Graph Neural Network) a simple yet effective structure-based encoder, which encodes spatial information by adding different types of sequential or structural edges and then performs relational message passing on protein residue graphs.

GearNet uses a sparse edge message passing mechanism to enhance the protein structure encoder, which is the first attempt to incorporate edge-level message passing on GNNs for protein structure encoding.





□ Automatic Gradient Descent: Deep Learning without Hyperparameters

>> https://arxiv.org/abs/2304.05187

The theory extends mirror descent to non-convex composite objective functions: the idea is to transform a Bregman divergence to account for the non-linear structure of neural architecture.

Automatic gradient descent trains both fully-connected and convolutional networks. This framework is properly placed in the context of existing frameworks such as the majorise-minimise meta-algorithm, mirror descent and natural gradient descent.





□ HyperDB: A hyper-fast local vector database for use with LLM Agents. HyperDB separates relevant from irrelevant documents by the support of HW accelerated vector operations

>> https://github.com/jdagdelen/hyperDB




□ RedPajama

>> https://www.together.xyz/blog/redpajama

The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.





□ RNA covariation at helix-level resolution for the identification of evolutionarily conserved RNA structure

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536965v1

R-scape calculates covariation between all pairs of position in an alignment. However, RNA base pairs do not occur in isolation. The Watson-Crick base pairs stack together forming helices that constitute the scaffold that facilitates the formation of the non-WC base pairs

Helix-level aggregated covariation increases sensitivity in the detection of evolutionarily conserved RNA structure. To achieve this, a new measure has been introduced that aggregates the covariation significance and power calculated at the base-pair level resolution.





□ SPADE: Spatial Deconvolution for Domain Specific Cell-type Estimation

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536924v1

SPADE (SPAtial DEconvolution) incorporates spatial patterns during cell type decomposition. SPADE utilizes a combination of scRNA-seq data, spatial location information, and histological information to estimate the proportion of cell types present at each spatial location.

The SPADE algorithm formulates the cell type deconvolution task as a constrained nonlinear optimization problem. It aims to minimize the relative error between true and estimated gene expression while adhering to non-negativity and sum-to-one constraints.





□ cogeqc: Assessing the quality of comparative genomics data and results

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536860v1

cogeqc calculates a protein domain-aware orthogroup score that aims at maximizing the number of shared protein domains within the same orthogroup.

The assessment of synteny detection consists in representing anchor gene pairs as a synteny network and analyzing its graph properties, such as clustering coefficient, node count, and scale-free topology fit.





□ Building Block-Based Binding Predictions for DNA-Encoded Libraries

>> https://chemrxiv.org/engage/chemrxiv/article-details/6438943f08c86922ffeffe57

A method for analyzing DNA-encoded library (DEL) selection data at the building block-level, with the goal of gaining insights we can use to design better DELs for subsequent screening rounds.

A simple and interpretable method is developed to predict the behavior of new building blocks, their interactions with known building blocks, and the activity of full compounds.

They calculates all-by-all similarity matrices for building blocks at each position individually and then evaluated combinatorial effects at a later step in the analysis. Additionally, it mimics considerations involved in library design.





□ DESpace: spatially variable gene detection via differential expression testing of spatial clusters

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537189v1

DESpace, a novel approach to discover spatially variable genes (SVGs). The framework inputs all types of SRT data, summarizes spatial information via spatial clusters, and identifies spatially variable genes by performing differential gene expression testing between clusters.

DESpace displays a higher true positive rate than competitors, controls for FP and FDR. DESpace leads to analogous results when inputting spatial clusters estimated from StLearn or BayesSpace, which indicates that DESpace is robust with respect to the spatial clusters provided.





□ Identification of genetic variants that impact gene co-expression relationships using large-scale single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02897-x

Conducting a co-eQTL meta-analysis across four scRNA-seq peripheral blood mononuclear cell datasets using a novel filtering strategy followed by a permutation-based multiple testing approach.

Part of the variable correlation could be explained by the sparsity of the single-cell data, as higher expressed gene pairs correlated better, but at least a few example cases showed the potential occurrence of Simpson’s paradox.





□ BREADR: An R Package for the Bayesian Estimation of Genetic Relatedness from Low-coverage Genotype Data

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537144v1

BREADR (Biological RElatedness from Ancient DNA in R) leverages the so-called pairwise mismatch rate, calculated on optimally-thinned genome-wide pseudo-haploid sequence data, to estimate genetic relatedness up to the second degree, assuming an underlying binomial distribution.

BREADR also returns a posterior probability for each degree of relatedness, from identical twins/same individual, first-degree, second-degree or "unrelated" pairs, allowing researchers to quantify and report the uncertainty, even for particularly low-coverage data.





□ NucleosomeDB - a database of 3D nucleosome structures and their complexes with comparative analysis toolkit

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537230v1

NucleosomeDB allows researchers to search, explore, and compare nucleosomes with each other, despite differences in composition and peculiarities of their representation.

By utilizing the information contained within the NucleosomeDB, researchers can gain valuable insights into how nucleosomes interact with DNA and other proteins, assess the implications of mutations and protein binding on nucleosome structure.





□ MARVEL: an integrated alternative splicing analysis platform for single-cell RNA sequencing data

>> https://academic.oup.com/nar/article/51/5/e29/6985826

MARVEL, a comprehensive R package for single-cell splicing analysis applicable to RNA-seq generated from the plate- and droplet-based methods. MARVEL enables systematic and integrated splicing and gene expression analysis of single cells to characterize the splicing landscape.

MARVEL uses a splice junction- based approach to compute PSI values. For MAST, MARVEL computes the number of genes detected per cell (gene detection rate) and includes this variable as a covariate in the zero- inflated regression model.





□ Transcriptome Complexity Disentangled: A Regulatory Elements Approach

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537241v1

By using the prior knowledge of the critical roles of transcription factors and microRNAs in gene regulation, it can establish a low-dimensional representation of cell states and infer the entire transcriptome from a limited number of regulatory elements.

The value of a reduced cell state representation lies in its ability to capture the gene expression distribution not only under normal conditions but also under various perturbations, such as drugs, mutations, or gene knockouts.





□ Benchmarking causal reasoning algorithms for gene expression-based compound mechanism of action analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05277-1

According to statistical analysis (negative binomial model), the combination of algorithm and network most significantly dictated the performance of causal reasoning algorithms, with the SigNet recovering the greatest number of direct targets.

CARNIVAL with the Omnipath network was able to recover the most informative pathways containing compound targets, based on the Reactome pathway hierarchy. CARNIVAL, SigNet and CausalR ScanR all outperformed baseline gene expression pathway enrichment results.





□ PyAGH: a python package to fast construct kinship matrices based on different levels of omic dataPyAGH: a python package to fast construct kinship matrices based on different levels of omic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05280-6

PyAGH can calculate additive, dominant and epistatic kinship matrices based on genomic data within one population and different additive kinship matrices across multiple populations efficiently.

PyAGH supports construction of kinship matrices using pedigree, microbiome and transcriptome data. In addition, the output of PyAGH can be easily provided to downstream mainstream software, such as DMU, GCTA, GEMMA and BOLT-LMM.





□ StableLM: Stability AI Language Models

>> https://github.com/Stability-AI/StableLM

StableLM-Tuned-Alpha is the fine-tuned model with Stanford Alpaca's procedure, using a combination of five recent datasets for conversational agents: Stanford's Alpaca, Nomic-AI's gpt4all, RyokoAI's ShareGPT52K datasets, Databricks labs' Dolly, and Anthropic's HH.

StableLM-Alpha models are trained on the new dataset that build on The Pile, which contains 1.5 trillion tokens, roughly 3x the size of The Pile. These models will be trained on up to 1.5 trillion tokens. The context length for these models is 4096 tokens.





□ PyHMMER: A Python library binding to HMMER for efficient sequence analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad214/7131068

PyHMMER provides Python integration of the popular profile Hidden Markov Model software HMMER via Cython bindings. A new parallelization model greatly improves performance when running multithreaded searches, while producing the exact same results as HMMER.

PyHMMER increases flexibility of use, allowing creating queries directly from Python code, launching searches and obtaining results without I/O, or accessing previously unavailable statistics like uncorrected p-values.





□ Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss

>> https://www.science.org/doi/10.1126/sciadv.adg6175

Grambank - a systematic sample of the structural diversity of the world’s languages. With over 400,000 data points, Grambank covers 2467 languages, grammatical phenomena in 195 features, from word order to verbal tense, nominal plurals, and many other linguistic variables.

Grambank deploys a Bayesian regression model of unusualness. The spatial and phylogenetic effects are both variance covariance (VCV) matrices based on a Brownian motion approach. The spatial data are taken from Glottolog, and the phylogeny is the global language tree.

Grambank uses the Agglomerated Endangerment Scale (AES) and categorize languages as either nonthreatened or threatened. The rest of the analysis is different in that it uses BRMS rather than Bayesian inference for latent Gaussian models (INLA).





□ UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining

>> https://arxiv.org/abs/2304.09151

UNIMAX, a new sampling method that delivers more uniform coverage of head lan- guages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language’s corpus.

UNIMAX controls the extent of data repeats of any language, providing a direct solution to overfitting on low-resource languages, w/o imposing any reprioritization on higher-resource. UNIMAX performs well across several benchmarks and model scales, up to 13 billion parameters.





□ satmut_utils: a simulation and variant calling package for multiplexed assays of variant effect

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02922-z

The satmut_utils “call” workflow is an end-to-end variant caller for MAVEs that supports direct analysis of targeted sequencing data from both (a) amplicon and (b) rapid amplification of cDNA ends (RACE)-like library preparation methods.

The satmut_utils “sim” workflow takes a Variant Call Format (VCF) and alignment (BAM) file with paired reads as input and generates variants in the reads at specified frequencies. Outputs are a VCF of true positive (truth) variants and counts, along with edited reads (FASTQ).

The number of fragments to edit and the read positions to edit are determined for each variant based on specified frequencies in the input VCF. “sim” employs a heuristic to sample reads for editing at each target position while prohibiting variant conversion.





□ pycoMeth: a toolbox for differential methylation testing from Nanopore methylation calls

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02917-w

pycoMeth Meth_Seg, a Bayesian changepoint detection algorithm for multi-read-group segmentation of methylation profiles, designed for the de novo discovery of methylation patterns from multiple (haplotyped) ONT sequenced samples.

pycoMeth Meth_Seg takes into account an arbitrary number of read groups (e.g., biological samples, haplotypes, or individual molecules/reads) to detect a dynamic set of methylation patterns from which it then derives a single consensus segmentation.





□ DFHiC: A dilated full convolution model to enhance the resolution of Hi-C data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad211/7135829

DFHiC has no restrictions on the input size of Hi-C data. The limitation effect caused by cutting matrix is eliminated and the Hi-C matrix no longer needs to be divided into several parts to enhance the entire chromosome.

The dilated convolution is able to effectively explore the global patterns in the overall Hi-C matrix by taking advantage of the information of the Hi-C matrix in a way of the longer genomic distance.





□ PSGRN: A gene regulatory network inference model based on pseudo-siamese network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05253-9

PSGRN (pseudo-Siamese GRN), a multilevel, multi-structure framework) for inferring large-scale GRNs from time-series expression datasets.

Based on the pseudo-Siamese network, Gated recurrent unit captures the time features of each TF and target matrix and learn the spatial features of the matrices after merging by applying the DenseNet framework. Finally, they applied a sigmoid function to evaluate interactions.







Morph.

2023-04-24 04:40:44 | Science News

(Artwork by ekaitza)




□ D-SPIN constructs gene regulatory network models from multiplexed scRNA-seq data revealing organizing principles of cellular perturbation response

>> https://www.biorxiv.org/content/10.1101/2023.04.19.537364v1

D-SPIN ((Dimension-reduced Single-cell Perturbation Integration Network), a mathematical modeling and network inference framework that constructs gene regulatory network models directly from single-cell perturbation-response data.

D-SPIN exploits a natural factoring within the mathematical structure of Markov random fields inference to separate the learning problem into two steps, construction of a unified GRN and inference of how each perturbation interacts w/ the gene programs within the unified network.






□ Dynamic Jacobian Ensemble: Emergent stability in complex network dynamics

>> https://www.nature.com/articles/s41567-023-02020-8

The dynamic Jacobian ensemble, which allows us to systematically investigate the fixed-point dynamics of a range of relevant network-based models. Within this ensemble, complex systems exhibit discrete stability classes. These range from asymptotically unstable to sensitive.

The asymptotic predictions capture the system’s global stability, but have no bearing on the dynamic stability of small motifs or sub-networks, which may be locally unstable.

Still in an asymptotically stable system, the global impact of such unstable motifs, vanishes in the limit of large N, and hence the system as a whole remains insensitive to these local discrepancies.





□ Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.16.537094v1

GPT-4 can automatically annotate cell types by utilizing marker gene information. GPT-4 annotations fully or partially match manual annotations for at least 75% of cell types, demonstrating GPT-4’s ability to generate cell type annotations comparable to those of human experts.

GPT-4 offers cost-efficiency and seamless integration into existing single-cell analysis pipelines, such as Seurat and Scanpy. For each cell type, reproducibility is defined as the proportion of instances in which GPT-4 generates the most prevalent cell type annotation.





□ Applications of transformer-based language models in bioinformatics: a survey

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad001/6984737

Transformer-based models have brushed up on SOTA performance with a large margin in most bioinformatics tasks. GeneBERT was pre-trained using large-scale genomic data in a multi-modal and self-supervised manner.

scBERT reused large-scale unlabeled scRNA-seq data to accurately capture the expression information of a single gene and the gene–gene interactions. The accuracy of scBERT in the prediction of novel and known cell types increased by 0.155 and 0.158, respectively.





□ Categories enriched over symmetric closed multicategories

>> https://arxiv.org/abs/2304.11227

Constructing a machine which takes as input a locally small symmetric closed complete multicategory V. And its output is again a locally small symmetric closed complete multicategory V-Cat, the multicategory of small V-categories and multi-entry V-functors.

A complete multicategory V is a multicategory which has all small products and all equalizers. Morphisms are short multilinear maps. The internal hom object is a vector space of multilinear maps. The symmetric multicategory has products and kernels / equalizers.





□ RMV-VAE: Representation Learning to Effectively Integrate and Interpret Omics Data

>> https://www.biorxiv.org/content/10.1101/2023.04.23.537975v1

RMV-VAE (Regularised Multi-View Variational Autoencoder) is composed of two Variational Autoencoders that take datasets as input and generate a regularised low dimensional representation of the data.

RMV-VAE uses a reconstruction loss between the model’s input X and the output Xˆ and a KL divergence between the encoded data and a Normal distribution, the model is forced to learn the "real" signal present in the data, thus prioritising signal to noise.

RMV-VAE formulates an ah-hoc regularisation of the latent space to obtain embeddings where patients with similar expression of fundamental genes are found close together.





□ Aligning distant sequences to graphs using long seed sketches

>> https://genome.cshlp.org/content/early/2023/04/18/gr.277659.123

Using long inexact seeds based on Tensor Sketching, to be able to efficiently retrieve similar sketch vectors, the sketches of nodes are stored in a Hierarchical Navigable Small Worlds.

The method scales to graphs with 1 billion nodes, with time and memory requirements for preprocessing growing linearly with graph size and query time growing quasi-logarithmically with query length.





□ CCNNs: Topological Deep Learning: Going Beyond Graph Data:

>> https://www.researchgate.net/publication/370134352_Topological_Deep_Learning_Going_Beyond_Graph_Data

Combinatorial complexes, a novel type of topological domain. Combinatorial complexes can be seen as generalizations of graphs that maintain certain desirable properties. Similar to hypergraphs, combinatorial complexes impose no constraints on the set of relations.

Combinatorial complexes permit the construction of hierarchical higher-order relations, analogous to those found in simplicial / cell complexes. A general class of message-passing Combinatorial Complex Neural Networks is developed for focusing primarily on attention-based CCNNs.

Combinatorial complexes generalize and combine useful traits of both hypergraphs and cell complexes, which have emerged as two promising abstractions that facilitate the generalization of graph neural networks to topological spaces.





□ Cameron R. Wolfe RT

>> https://twitter.com/cwolferesearch/status/1649476511248818182

Nearly all recently-proposed large language models (LLMs) are based upon the decoder-only transformer architecture. But, is this always the best architecture to use? It depends… 🧵 [1/8]





□ HOTSPOT: Hierarchical hOst predicTion for aSsembled Plasmid cOntigs with Transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad283/7136643

HOTSPOT is based on a phylogenetic tree of plasmids' hosts from phylum to species. By incorporating the Transformer model, in each node’s taxon classifier, the top-down tree search achieves an accurate host taxonomy prediction for the input plasmid contigs.

HOTSPOT conducts hierarchical search from the root node to lower ranks to predict taxon. The tree search allows early stop when the prediction uncertainty based on Monte Carlo Dropout is above a given cutoff, we can improve prediction accuracy with minimal loss in resolution.

The Transformer block applied in HOTSPOT is the Transformer encoder, which can convert the input sentence into a latent vector w/ a fixed length. The feature vectors output by the 2 Transformer blocks and the Inc one-hot vector will be concatenated for the taxon classification.





□ Scaling Transformer to 1M tokens and beyond with Recurrent Memory Transformer

>> https://arxiv.org/abs/2304.11062

By employing a recurrent approach and memory, the quadratic complexity can be reduced to linear. Furthermore, models trained on sufficiently large inputs can extrapolate their abilities to texts orders of magnitude longer.

While larger models (OPT-30B, OPT-175B) tend to exhibit near-linear scaling on relatively short sequences up to 32,000, they reach quadratic scaling on longer sequences. Smaller models (OPT-125M, OPT-1.3B) demonstrate quadratic scaling even on shorter sequences.

The RMT’s capacity to successfully extrapolate to tasks of varying lengths, including those exceeding 1 million tokens with linear scaling of computations required. On sequences w/ 2,048,000 tokens, RMT can run OPT-175B w/ ×29 fewer FLOPs and w/ ×295 fewer FLOPs than OPT-135M.





□ Consequences and opportunities arising due to sparser single-cell RNA-seq datasets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02933-w

As zeros become more abundant, a binarized expression might be as informative as counts. Using ~ 1.5 million cells, a strong point-biserial correlation b/n the normalized expression counts is observed, and its respective binarized variant, although differences b/n datasets exist.

This strong correlation implies that the binarized signal already captures most of the signal present in the normalized count data. This strong correlation is primarily explained by the detection rate and the variance of the non-zero counts of a cell.





□ CellHeap: A scRNA-seq workflow for large-scale bioinformatics data analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.19.537508v1

CellHeap, a flexible, portable, and robust platform for analyzing large scRNA-seq datasets, with quality control throughout the execution steps, and deployable on platforms that support large-scale data, such as supercomputers or clouds.

One CellHeap’s phase can include many computational tools and couple them such that inputs and outputs consume/generate data in a flow that meets requirements for subsequent phases. It employs quality control to ensure correct results and relies on high-performance parallelizing.





□ TopoDoE: A Design of Experiment strategy for selection and refinement in ensembles of executable Gene Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537619v1

TopoDoE, an iterative method for the in silico identification of the most informative perturbation – that is eliminating as many incorrect candidate GRNs as possible from the data gathered in one experiment.

GRNs generated by WASABI were defined by a mechanistic model of gene expression based on coupled Piecewise-Deterministic Markov Processes (PDMPs) governing how the mRNA and Protein quantities change over time.

When applied as a follow-up step to WASABI’s GRN inference algorithm, the presented strategy of network selection allowed to first identify and remove incorrect GRN topologies and then to recover a new GRN better fitting experimental data than any other candidate.





□ matchRanges: Generating null hypothesis genomic ranges via covariate-matched sampling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad197/7135828

matchRanges computes for each range a propensity score, the probability of assigning a range to focal or background groups, given a chosen set of covariates. It provides 3 methods incl. nearest-neighbor matching, rejection sampling, and stratified sampling for null set selection.

matchRanges provides utilities for accessing matched data, assessing matching quality, and visualizing covariate distributions. The code has been optimized to accommodate genome scale data, matchRanges can efficiently process sets of millions of loci in seconds on a single core.





□ Fibertools: fast and accurate DNA-m6A calling using single-molecule long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2023.04.20.537673v1

Fibertools enables highly accurate (over 90% precision and recall) m6A identification along multi-kilobase DNA molecules with a ~1,000-fold improvement in speed and the capacity to generalize to new sequencing chemistries.

fibertools also substantially reduces the amount of false-negative methylation calls, an improvement primarily driven by enabling m6A calling along multi-kilobase reads with fewer subread passes - a limitation of prior m6A calling tools.





□ scTenifoldXct: A semi-supervised method for predicting cell-cell interactions and mapping cellular communication graphs

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(23)00030-3

scTenifoldXct detects ligand-receptor (LR)-mediated cell-cell interactions and mapping cellular communication graphs. Neural networks are employed to minimize the distance between corresponding genes while preserving the structure of gene regression networks.

scTenifoldXct is based on manifold alignment, using LR pairs as inter-data correspondences to embed ligand and receptor genes expressed in interacting cells into a unified latent space.





□ AsymmeTrix: Asymmetric Vector Embeddings for Directional Similarity Search

>> https://yoheinakajima.com/asymmetrix-asymmetric-vector-embeddings-for-directional-similarity-search/

By introducing a weighting factor based on a domain-specific asymmetric weighting function, AsymmeTrix is able to capture the inherent directionality of relationships between objects in various application domains.

Asymmetric kernel functions modify standard kernel functions or design custom functions to model asymmetric relationships. AsymmeTrix leverages graph-based structures to capture complex relationships and continuous vector spaces to represent objects in a continuous space.





□ A graph neural network-based interpretable framework reveals a novel DNA fragility–associated chromatin structural unit

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02916-x

A framework that integrates graph neural network (GNN) to unravel the relationship between 3D chromatin structure and DSBs using an advanced interpretable technique GNNExplainer.

FaCIN (DNA fragility–associated chromatin interaction network) is a bottleneck-like structure, and it helps to reveal a universal form of how the fragility of a piece of DNA might be affected by the whole genome through chromatin interactions.





□ Read2Tree: Inference of phylogenetic trees directly from raw sequencing reads

>> https://www.nature.com/articles/s41587-023-01753-4

Read2Tree directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy.

Read2Tree is 10–100 times faster than assembly-based approaches and in most cases more accurate—the exception being when sequencing coverage is high and reference species very distant.

Read2Tree is able to also provide accurate trees and species comparisons using only low-coverage (0.1×) datasets as well as RNA versus genomic sequencing and operates on long or short reads.





□ Transformation of alignment files improves performance of variant callers for long-read RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02923-y

Transformations of the BAM alignment encodings are critical. This is because while variant calling from aligned DNA sequences data involves analysis of contiguously aligned reads, variant calling from lrRNA-seq alignments must handle reads with gaps representing intronic regions.

flagCorrection ensures all fragments retain the original flag. It enables an increase in recall of DeepVariant and the precision of Clair3’s pileup model (indel calling); Clair3-mix and SNCR + flagCorrection + DeepVariant are among the best-performing pipelines to call indels.





□ TimeAttackGenComp: Fast all versus all genotype comparison using DNA/RNA sequencing data: method and workflow

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05288-y

A Perl tool to rapidly compare genotypes from thousands of samples in an all vs. all manner. All vs. all is an O(n2) problem, and scalability is an issue for larger projects.

An end-to-end Workflow Descriptor Language (WDL)/Cromwell workflow taking FASTQ, BAM, or VCF files as input was developed for reproducibility and ease of use. Memory usage could be further improved with bit-packing, bit-vectors, and the use of lower-level languages.





□ squigualiser: A simple tool to Visualise nanopore raw signal-base alignment

>> https://github.com/hiruna72/squigualiser





□ BEERS2: RNA-Seq simulation through high fidelity in silico modeling

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537847v1

BEERS2 takes input transcripts from either customizable input or from CAMPAREE simulated RNA samples. It produces realistic reads of these transcripts as FASTQ, SAM, or BAM formats with the SAM or BAM formats containing the true alignment to the reference genome.

BEERS2 combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline and is designed to incl. the effects of polyA selection and RiboZero for ribosomal depletion and hexamer priming sequence biases.





□ pyInfinityFlow: Optimized imputation and analysis of high-dimensional Flow Cytometry data for millions of cells

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad287/7142555

pyInfinityFlow is a Python package that enables imputation of hundreds of features from Flow Cytometry using XGBoost regression. It is an adaptation of the original implementation in R2 with the goal of optimizing the workflow for large datasets.

The final Infinity Flow object can be stored as sparse data objects (h5ad) or as a data frame stored in a binary feather file format, enabling direct manipulation with Scanpy, or other tools, to identify broad and rare cell populations with Leiden clustering.





□ hipFG: High-throughput harmonization and integration pipeline for functional genomics data

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537695v1

hipFG, an automatically customized pipeline for efficient and scalable normalization of heterogenous FG data collections into standardized, indexed, rapidly searchable analysis-ready datasets while accounting for FG datatypes.

hipFG includes datatype-specific pipelines to process diverse types of FG data. These FG datatypes are categorized into three groups: annotated genomic intervals, quantitative trait loci (QTLs), and chromatin interactions.





□ Capture-recapture for -omics data meta-analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.24.537481v1

The capture-recapture framework (C-R) statistically formalises the idea of inspecting list overlaps. The C-R model is a consistent estimator for the causal gene number in simple situations.

The estimate from C-R can be biased upwards, if the LD structure is ignored, because the causal signal spreads between linked SNPs which can then tag several different genes.





□ scDEED: a statistical method for detecting dubious 2D single-cell embeddings

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537839v1

scDEED (single-cell dubious embedding detector) assigns every cell a “reliability score,” whose large value indicates that the cell’s immediate to mid-range neighbors are well preserved after the embedding.

scDEED offers users the flexibility to optimize hyperparameters in an intuitive and graphical way (users can see which cell embeddings are dubious under each hyperparameter setting), without modifying the embedding method’s algorithm.

scDEED’s definition of dubious cell embeddings distinguishes scDEED from DynamicViz, a method that optimizes hyperparameters by minimizing the variance of cell embeddings’ Euclidean distances across multiple bootstraps.





□ scGBM: Model-based dimensionality reduction for single-cell RNA-seq using generalized bilinear models

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537881v1

scGBM, a novel method for model-based dimensionality reduction of single-cell RNA-seq data. scGBM employs a scalable algorithm to fit a Poisson bilinear model to datasets with millions of cells and quantifies the uncertainty in each cell’s latent position.

scGBM uses iteratively reweighted singular value decomposition (IRSVD) algorithm. IRSVD is asymptotically faster than Fisher scoring, and leverages special properties of Poisson GLMs to obtain vectorized updates for the intercepts.





□ SurVIndel2: improving local CNVs calling from next-generation sequencing using novel hidden information

>> https://www.biorxiv.org/content/10.1101/2023.04.23.538018v1

SurVIndel2 significantly reduces the number of called false positives, while retaining or even improving the sensitivity of the original SurVIndel, and generates precise breakpoints for most of the called CNVs.

SurVIndel2 detects candidate CNVs using split reads, discordant pairs, and a new type of evidence called hidden split reads. Hidden split reads can determine the existence and precise breakpoints of CNVs in repetitive regions.





□ AtlasXplore: a web platform for visualizing and sharing spatial epigenome data

>> https://www.biorxiv.org/content/10.1101/2023.04.23.537969v1.full.pdf

AtlasXplore integrates multiple layers of spatial epigenome. With the integration with Celery workers, there is unlimited potential for AtlasXplore to incorporate other software and functions for interactive exploration of high-dimensional data sets.

AtlasXplore protects private data with Amazon Cognito authentication, and makes published and exemplar data accessible to both guests and registered users. Users can search via PMID/author or filter by research group, type, species, and tissue.





□ cellDancer: A relay velocity model infers cell-dependent RNA velocity

>> https://www.nature.com/articles/s41587-023-01728-5

cellDancer, a scalable deep neural network that locally infers velocity for each cell from its neighbors and then relays a series of local velocities to provide single-cell resolution inference of velocity kinetics. The cellDancer algorithm separately trains a DNN for each gene.

cellDancer assesses the spliced and unspliced mRNA velocities of each cell in a DNN to calculate the cell-specific transcription, splicing and degradation rates (α, β, γ) and to predict the future spliced and unspliced mRNA by the outputted α, β and γ using an RNA velocity model.





□ TiDE: a time-series dense encoder for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies.

>> https://ai.googleblog.com/2023/04/recent-advances-in-deep-long-horizon.html

TiDE (Time-series Dense Encoder), a Multi-layer Perceptron (MLP) based encoder-decoder model for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies.

TiDE is more than 10x faster in training compared to transformer-based baselines while being more accurate on benchmarks. Similar gains can be observed in inference as it only scales linearly with the length of the context and the prediction horizon.





The eyes of truth.

2023-04-23 07:13:37 | Science News

(Artwork by ekaitza)


“An AI that cares about understanding the universe is unlikely to annihilate humans because we are an interesting part of the universe.”

TruthGPT、宇宙の真実の探究を行うAIが『人類の救済に繋がるかもしれない』と考えるのは、「自分は予め宇宙の真実に関わる性質を知っている」と言う前提がなければ成り立たないし、そんな「小さい島」でしかない人間の思考を、別の島へと漕ぎ着けるための方舟こそ、AIに期待される役割ではないか

「AI利用を合理的思考や手段のために民主化する」と言う意図であれば、『人類の救済』と言う意義には少し近くなるかもしれない。事実、Elon Maskの発言からは、大規模言語モデルのAI利用が一部の権利者や勢力に寡占されることへの懸念も窺える。





We were Once Kings.

2023-03-31 03:33:33 | Science News

(Photo by Joanne Hollings)




□ TXGNN: Zero-shot prediction of therapeutic use with geometric deep learning and clinician centered design

>> https://www.medrxiv.org/content/10.1101/2023.03.19.23287458v1

TXGNN is a graph neural network pre-trained on a comprehensive knowledge graph of 17,080 clinically-recognized diseases and 7,957 therapeutic candidates. The model can process various therapeutic tasks, such as indication and contraindication prediction, in a unified formulation.

TXGNN can perform zero-shot inference on new diseases without additional parameters or fine-tuning on ground truth labels. TXGNN uses a metric learning module that operates on the latent representation space.

TXGNN transforms points in the latent space representing the candidate and disease into predictions about their relationship. In TXGNN, we obtain a disease signature vector for each disease based on the set of neighboring proteins, exposures, and other biomedical entities.





□ NeuLay: Accelerating network layouts using graph neural networks

>> https://www.nature.com/articles/s41467-023-37189-2

The NeuLay algorithm, a Graph Neural Network (GNN) developed to parameterize node features, significantly improves both the speed and the quality of graph layouts, opening up the possibility to quickly and reliably visualize large networks.

NeuLay allows for the use of different GNN architecture other than GCN, such as Graph Attention. NeuLay encodes the graph structure by graph neural networks that maps the adjacency matrix to the node positions. NeuLay-2 w/ two GCN layers has the fastest convergence of the energy.





□ Con-AAE: Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad162/7091469

Con-AAE (Contrastive cycle adversarial Autoencoders), aiming at integrating and aligning the multi-omics data at the single-cell level. The contrastive loss minimizes the distance between positive pairs and maximizes the distance between negative pairs.

Con-AAE uses two autoencoders to map two modality data into two low-dimensional manifolds under the constrain of adversarial loss, trying to develop representations for each modality that are separated but cannot be identified by an adversarial network in a coordinated subspace.





□ Phenonaut; multiomics data integration for phenotypic space exploration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad143/7082955

Phenonaut is a framework for applying workflows to multi-omics data. Originally targeting high-content imaging and the exploration of phenotypic space, with different visualisations and metrics.

Phenonaut runs are accompanied by cryptographic hashes proving reported inputs. Phenonaut allows now operates in a data agnostic manner, allowing users to describe their data (multi-view/multi-omics) and apply a series of generic or specialised data-centric transforms.





□ Accurate Flow Decomposition via Robust Integer Linear Programming

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533019v1

A new ILP formulation for the flow decomposition problem for dealing with edge weights not forming a flow. It enables a macroscopic management of errors by attaching an error to each solution path instead of each edge.

This formulation defines the minimum path-error flow decomposition problem as the problem of finding a set of weighted paths with associated error variables, such that the superposition difference of each edge is within the sum of the error variables of the paths using the edge.





□ multiWGCNA: an R package for deep mining gene co-expression networks in multi-trait expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05233-z

multiWGCNA, a WGCNA-based procedure that can leverage the multidimensionality of experimental designs to study co-expression networks across variable conditions, such as space or time.

multiWGCNA generates a network for each condition separately, and subsequently maps these modules across designs, and performs relevant downstream analyses, incl. module-trait correlation and module preservation.





□ GVC: efficient random access compression for gene sequence variations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05240-0

The Genomic Variant Codec(GVC), a novel approach for compressing gene sequence variations with random access capability. The genotypes are extracted from a VCF file and divided into blocks. Each block represents genotypes of all samples in a certain range of loci in a chromosome.

GVC uses two alternative binarization approaches to decompose the allele matrix into a binary representation: bit plane binarization and row binarization. GVC uses the Hamming distance to measure the similarity b/n adjacent rows/columns. Each binary matrix is entropy-encoded.





□ SoCube: an innovative end-to-end doublet detection algorithm for analyzing scRNA-seq data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad104/7081128

Several doublet detection algorithms are currently available, but their generalization performance could be further improved due to the lack of effective feature-embedding strategies with suitable model architectures.

SoCube proposed a novel 3D composite feature-embedding strategy that embedded latent gene information and constructed a multikernel, multichannel CNN-ensembled architecture in conjunction with the feature-embedding strategy.





□ OASIS: An interpretable, finite sample valid alternative to Pearson's X2 for scientific discovery

>> https://www.biorxiv.org/content/10.1101/2023.03.16.533008v1

OASIS (Optimized Adaptive Statistic for Inferring Structure) constructs a test-statistic which is linear in the normalized data matrix, providing closed form p-value bounds through classical concentration inequalities.

OASIS computes a bilinear form of residuals. OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. The finite-sample bounds correctly characterize the p-value bound derived up to a variance term.





□ AIM: A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad155/7087101

Alignment-in-Memory (AIM), a framework for PIM-based sequence alignment that targets the UPMEM system. AIM dispatches a large number of sequence pairs across different memory modules and aligns each pair using compute cores within the memory module where the pair resides.

AIM supports multiple alignment algorithms including NW, SWG, GenASM, WFA, and WFA-adaptive. Each algorithm has alternate implementations that manage the UPMEM memory hierarchy differently and are suitable for different read lengths.





□ scQA: Clustering scRNA-seq data via qualitative and quantitative analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.25.534232v1

scQA (an architecture for clustering Single-Cell RNA-seq data based on Qualitative and Quantitative Analysis), which can efficiently cluster cells at various scale based on so called landmarks and each indicates the consensus of genes with similar expression patterns.

scQA constructs the consensus vector of genes whose qualitative expressions under certain cells are of similar trend: quasi-trend-preserved genes. scQA identifies distinct cell types, it proceeds to analyze the characteristics of the ID landmarks both internally / externally.





□ SpaceWalker: Interactive Gradient Exploration for Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2023.03.20.532934v1

The intrinsic dimensionality can serve to guide the user to anatomically distinct regions, that changes in local intrinsic dimensionality in many cases mirror transitions between cell subclasses.

SpaceWalker consists of two key innovations: an interactive, real-time flood-fill and spatial projection of the local topology of the High-Dimensional space, and a gradient gene detector.





□ exFINDER: identify external communication signals using single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.03.24.533888v1

exFINDER analyzes the exSigNet by predicting signaling strength, calculating the maximal signal flow, clustering different ligand-target signaling paths, quantifying the signaling activities using the activation index, and evaluating the GO analysis outputs of exSigNet.





□ NOMAD2 provides ultra-efficient, scalable, and unsupervised discovery on raw sequencing reads

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533189v1

NOMAD2 rapidly identifies candidate RNA editing de novo, including detecting potentially hyperedited events, filling a gap in existing bioinformatic tools. classified anchors as “mismatch” defined as cases where the two most abundant targets differ by single-base mismatches.

NOMAD2 enumerates all (a+g+t)-mers, these sequences are sorted lexicographically with KMC-tools. All occurrences of unique anchors are adjacent, which enables efficient gap removal and unique targets collapsing in the third step via a linear traversal over the (a+g+t)-mers.





□ PWN: enhanced random walk on a warped network for disease target prioritization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05227-x

PWN (Prioritization with a Warped Network) uses the Forman–Ricci curvature instead of the Ollivier–Ricci curvature. PWN can be used for identifying the targets with properly given prior knowledge and gene scores.

PWN is designed to be an efficient variant of random walk with restart (RWR). PWN uses a weighted asymmetric network that is generated from an unweighted and undirected network. The weights come from two distinct features.

PWN is designed to manage the proportion of information circulating in and flowing out of certain regions by controlling the internal feature. PWN warps the network by assigning higher weights to prior knowledge-related edges.





□ Multi-Omics Integration For Disease Prediction Via Multi-Level Graph Attention Network And Adaptive Fusion

>> https://www.biorxiv.org/content/10.1101/2023.03.19.533326v1

This framework involves constructing co-expression and co-methylation networks for each subject, followed by applying multi-level graph attention to incorporate biomolecule interaction information.

The true-class-probability strategy is employed to evaluate omics-level confidence for classification, and the loss is designed using an adaptive mechanism to leverage both within- and across-omics information.

The initial feature is generated by the multi-level Graph Attention Network for each type of omics data respectively. The dicision feature of each type of omics data is generated by the TCP module. The decision features of each omics are concatenated into one fusion feature.





□ QADD: De Novo Drug Design by Iterative Multi-Objective Deep Reinforcement Learning with Graph-based Molecular Quality Assessment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad157/7085596

QADD designs a multi-objective deep reinforcement learning pipeline to generate molecules w/ multiple desired properties iteratively, where a graph neural network-based model for accurate molecular quality assessment on drug potentials is introduced to guide molecule generation.

QADD uses the Deep Q-Network, a value-based reinforcement learning method, to estimate the action-value function under different action selection strategies. Since it does not require a fixed-dimensional action space, it is particularly suitable for discontinuous space search.





□ Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs)

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533737v1

They recommend selection of a distance measure for SNP genotype data that does not give differing outcomes depending on the arbitrary choice, and consideration of which state should be considered as zero when applying binary distance measures to fragment presence-absence data.





□ BSP: Dimension-agnostic and granularity-based spatially variable gene identification

>> https://www.biorxiv.org/content/10.1101/2023.03.21.533713v1

BSP (big-small patch), a spatial granularity-guided and non-parametric model to identify spatially variable genes SVGs from two or three- dimensional spatial transcriptomics data in a fast and robust manner.

BSP selects a set of neighboring spots within a certain distance to capture the regional means with different granularities. The variances of the expression mean across all spots are then calculated under different scales, and genes with high ratios are identified as the SVGs.





□ Capturing Spatiotemporal Signaling Patterns in Cellular Data with Geometric Scattering Trajectory Homology

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533807v1

GSTH, a general framework that encapsulates time-lapse signals on a cell adjacency graph in a low-dimensional trajectory. GSTH integrates geometric scattering and topological data analysis (TDA) to provide a comprehensive understanding of complex cellular interactions.

Geometric scattering employs wavelet-based transformations to extract multiscale representations of the signaling data, capturing the intricate hierarchical structures present in the spatial organization of cells and the temporal evolution of signaling events.





□ Ensemble-GNN: federated ensemble learning with graph neural networks for disease module discovery and classification

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533772v1

Ensemble-GNN allows to quickly build predictive models utilizing PPI networks consisting of various node features such as gene expression and/or DNA methylation.

Ensemble-GNNs were combined into a global federated model. In the federated case, each client has its dedicated data based on which a GNN classifier is trained. The trained models of the ensembles are shared among all clients, and predictions are again made via Majority Vote.





□ Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad151/7085594

Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm.

GenASM-DC uses only cheap bitwise operations to calculate the edit distance between two strings text and pattern. It builds an (n+1)×(k+1) dynamic programming (DP) table R, where n=length(text) and k is the maximum number of edits considered.





□ Estimation of a treatment effect based on a modified covariates method with L0 norm

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533735v1

A new treatment effect estimation approaches based on the modified covariate method, one using lasso regression and the other ridge regression, using the L0 norm.

A modified covariate method based on the L0 norm and Lq norm (q = 1, 2). The first method estimates treatment effects using lasso regression with the L0 norm. The second method uses ridge regression with the L0 norm.





□ PENCIL: Supervised learning of high-confidence phenotypic subpopulations from single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.23.533712v1

PENCIL can perform gene selection during the training process, which allows learning proper gene spaces that facilitate accurate subpopulation identifications from single-cell data.

PENCIL has the flexibility to address various phenotypes such as binary, multi-category and continuous phenotypes. PENCIL can order cells to reveal the subpopulations undergoing continuous transitions between conditions.





□ xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2023.03.24.534055v1

xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today.

xTrimoGene proposes an asymmetric encoder-decoder framework that takes advantage of the sparse gene expression matrix, and establishes the projection strategy of continuous values with a higher resolution.





□ EnsInfer: a simple ensemble approach to network inference outperforms any single method

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05231-1

EnsInfer, an ensemble approach to the network inference problem: each individual network inference method will work as a first level learning algorithm that gives a set of predictions from the gene expression input.

EnsInfer uses a combination of state-of-the-art inference approaches and combines them using a simple Naive Bayes ensemble model. EnsInfer essentially turns all the predictions from different inference algorithms into priors about each edge in the network.





□ Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02899-9

Enformer were not trained on GTEx / Cardoso-Moreira et al. data. specifically and do not directly give predictions for many human tissues. To match CAGE tracks to tissues and stages of development in a simple, yet data-driven, way, they fitted a ridge regression.

Enformer can predict endogenous RNA abundance very well and consistently outperforms previous models. Enformer substantially outperformed Basenji2 even when it is restricted to the latter model‘s input window and even on tasks where the receptive field size is irrelevant.





□ ElasticBLAST: accelerating sequence search via cloud computing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05245-9

ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs.

ElasticBLAST leverages the cloud to provide multiple worker nodes to parallelize the computation by breaking the queries into query batches. ElasticBLAST relies on BLAST DB metadata that is automatically generated to determine the amount of main memory needed for that database.





□ SiPSiC: A novel method to accurately estimate pathway activity in single cells for clustering and differential analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534310v1

SiPSiC, a novel method for inferring pathway scores from scRNA-seq data. It has a high sensitivity, accuracy, and consistency with existing knowledge across different data types, including findings often missed by the original conventional analyses.

SiPSiC scores can be used to cluster the cells and compute their UMAP projections in a manner that better captures the biological underpinnings of tissue heterogeneity.





□ cnnLSV: detecting structural variants by encoding long-read alignment information and convolutional neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05243-x

cnnLSV can automatically adjust the images from different variants to a uniform size according to the length of each variant and the coverage of the dataset for training the filtering model.

cnnLSV converts the images in training set into one-dimensional arrays, and executes the principal component analysis and k-means clustering to eliminate the incorrectly labeled images to improve the filtering performance of the model.





□ KGETCDA: an efficient representation learning framework based on knowledge graph encoder from transformer for predicting circRNA-disease associations

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534642v1

Knowledge Graph Encoder from Transformer for predicting CDA (KGETCDA) integrates more than 10 databases to construct a large heterogeneous non-coding RNA dataset, which contains multiple relationships between circRNA, miRNA, lncRNA and disease.

A biological knowledge graph is created based on this dataset and Transformer-based knowledge representation learning and attentive propagation layers are applied to obtain high-quality embeddings with accurately captured high-order interaction information.





□ C-DEPP: Scaling deep phylogenetic embedding to ultra-large reference trees: a tree-aware ensemble approach

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534201v1

Clustered-DEPP (C-DEPP) uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing twenty million 16S fragments on the GG2 reference tree in 41 hours of computation.

C-DEPP trains a separate model for each of several overlapping subtrees; for each query, C-DEPP uses a 2-level classifier to select one or more subtrees, computes distances using those subtrees, and uses these distances as input to APPLES-II, leaving the other distances blank.





□ simpleaf: A simple, flexible, and scalable framework for single-cell transcriptomics data processing using alevin-fry

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534653v1

simpleaf, a program that simplifies the processing of single-cell data using tools from the alevin-fry ecosystem, and adds new functionality and capabilities, while retaining the flexibility and performance of the underlying tools.

simpleaf quant, simpleaf quant will automatically recruit and parameterize the correct mapper, and will automatically locate and provide the file containing the transcript-to-gene mapping information to later quantification stages where appropriate.





□ Sequencing accuracy and systematic errors in nanopore direct RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2023.03.29.534691v1

The presence of the same systematic error patterns in RODAN points to more fundamental causes of errors in the raw signal data, necessitating further development of better pore chemistry to produce higher quality dRNA-seq data.

Clearly, further development of dRNA-seq protocols, pore chemistry and basecalling algorithms are desirable. Appropriate quality control and error correction methods are needed to mitigate the effects of high error rates and systematic biases in downstream analyses.



Dihedral.

2023-03-31 02:22:22 | Science News

(Art by ekaitza)

GPT models are affected by various factors such as the size of the training dataset and architecture, which may influence the Kolmogorov complexity. Simpler algorithms can compress complex data. The performance of the GPT model is expected to improve as its complexity increases.






□ Split-Transformer Impute (STI): Genotype Imputation Using a Transformer-Based Model

>> https://www.biorxiv.org/content/10.1101/2023.03.05.531190v3

The model utilizes attention to capture correlations among the SNPs/SNVs in the data. It achieves high imputation accuracy at a modest memory consumption cost by dividing the data into chunks, enabling efficient application to long sequences.

STI uses Cat-Embedding layer in order to capture allele information per SNV. In conjunction with multi-headed attention layers, enables STI to model correlations among SNVs to impute missing values based on known and missing values per position.






□ Dagger Linear Logic and Categorical Quantum Mechanics

>> https://arxiv.org/abs/2303.14231

The existing frameworks of Categorical Quantum Mechanics (CQM) are categorical proof theories of compact dagger linear logic, and are motivated by the interpretation of quantum systems in the category of finite dimensional Hilbert spaces.

Mixed Unitary Categories is a novel non-compact framework. MUC is built upon linearly distributive categories and ∗-autonomous categories, which serve as categorical proof theories of non-compact multiplicative linear logic and can be applied to infinite dimensional systems.





□ AIBMD: Artificial Intelligence Boosted Molecular Dynamics

>> https://www.biorxiv.org/content/10.1101/2023.03.25.534210v1

In AIBMD, probabilistic Bayesian neural network models were used to construct boost potentials that exhibit Gaussian distribution with minimized anharmonicity for accurate energetic reweighting and enhanced sampling.

AIBMD has been demonstrated on model systems of the alanine dipeptide in explicit and implicit solvent, the chignolin fast-folding protein, and three hairpin RNAs with the GCAA, GAAA, and UUCG tetraloops.





□ Boolean Network Sketches: A Unifying Framework for Logical Model Inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad158/7099622

Boolean network sketch starts with an initial sketch that corresponds to the prior literature-based knowledge only. Subsequently, it is extended by adding restrictions representing experimental data resulting in the data-informed sketch.

BNs integrates partial knowledge about the network’s topology and the update logic, as well as dynamical restrictions representing knowledge or assumptions about the properties of the network’s transitions (e.g., attractor landscape), and restrictions on the model dynamics.





(Art by jaanus03)

人は人から人に似て産まれ、偶然乗り合わせた船の上で犇めき合っている。星を読むように類似した記号に意味を与え、風の追いやる方だけが確からしいと覚える。己が何を見つけて、何を想い、何を遂げようとしても、押し流していく風からは留めて置けないことを知る。誰もが、誰の名も忘れて解けていく。



□ StackOverflowのトップエンジニアからの提言。GPT-4に依存し続けると「枯れた川床から水を飲む危険がある。」という指摘。知識の再生産フェーズが可能になるかどうか。事実、Googleのトラフィックが下がっているという指摘も。因みに引用されている画像は、AIをテーマにした映画『Ex Machina』の撮影に使われたノルウェーのHotel Juvetですね。

Peter Nixey

I'm in the top 2% of users on StackOverflow. My content there has been viewed by over 1.7M people. And it's unlikely I'll ever write anything there again.

Which may be a much bigger problem than it seems. Because it may be the canary in the mine of our collective knowledge.

A canary that signals a change in the airflow of knowledge: from human-human via machine, to human-machine only. Don’t pass human, don’t collect 200 virtual internet points along the way.

StackOverflow is *the* repository for programming Q&A. It has 100M users & saves man-years of time & wig-factories-worth of grey hair every single day.

It is driven by people like me who ask questions that other developers answer. Or vice-versa. Over 10 years I've asked 217 questions & answered 77. Those questions have been read by millions of developers & had tens of millions of views.

But since GPT4 it looks less & less likely any of that will happen; at least for me. Which will be bad for StackOverflow. But if I'm representative of other knowledge-workers then it presents a larger & more alarming problem for us as humans.

What happens when we stop pooling our knowledge with each other & instead pour it straight into The Machine? Where will our libraries be? How can we avoid total dependency on The Machine? What content do we even feed the next version of The Machine to train on?

When it comes time to train GPTx it risks drinking from a dry riverbed. Because programmers won't be asking many questions on StackOverflow. GPT4 will have answered them in private. So while GPT4 was trained on all of the questions asked before 2021 what will GPT6 train on?

This raises a more profound question. If this pattern replicates elsewhere & the direction of our collective knowledge alters from outward to humanity to inward into the machine then we are dependent on it in a way that supercedes all of our prior machine-dependencies.

Whether or not it "wants" to take over, the change in the nature of where information goes will mean that it takes over by default.

Like a fast-growing Covid variant, AI will become the dominant source of knowledge simply by virtue of growth. If we take the example of StackOverflow, that pool of human knowledge that used to belong to us - may be reduced down to a mere weighting inside the transformer.

Or, perhaps even more alarmingly, if we trust that the current GPT doesn't learn from its inputs, it may be lost altogether. Because if it doesn't remember what we talk about & we don't share it then where does the knowledge even go?

We already have an irreversible dependency on machines to store our knowledge. But at least we control it. We can extract it, duplicate it, go & store it in a vault in the Arctic (as Github has done).

So what happens next? I don't know, I only have questions.

None of which you'll find on StackOverflow.





□ CONGAS+: A Bayesian method to infer copy number clones from single-cell RNA and ATAC sequencing

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535197v1

CONGAS+, a Bayesian model to map single-cell RNA and ATAC profiles generated from independent or multimodal assays on the latent space of copy numbers clones. CONGAS+ is equipped with a shrinkage hyperparameter that can be used to weigh the evidence differently across RNA/ATAC.

CONGAS+ did retrieve complex subclonal architectures while providing a coherent mapping among ATAC and RNA, facilitating the study of genotype-phenotype mapping.






□ Reconstruction of Gene Regulatory Networks using sparse graph recovery models

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535294v1

Categorizing graph recovery methods into four main types based on the underlying formulations: Regression-based, Graphical Lasso, Markov Networks and Directed Acyclic Graphs. And incorporate transcription factor information as a prior to ensure successful reconstruction of GRNs.

They modified the uGLAD algorithm to take into account TF information, called uGLAD-GRN, by using a post-hoc masking operation that only retains the edges having at least one node as a TF. It can be applied to most of the algorithms that recover Conditional Independence graphs.





□ STGRNS: An interpretable Transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad165/7099621

STGRNS, a Transformer-based model, provides a fast and accurate tool to infer gene regulatory networks from a single-cell RNA-seq profile. By leveraging the newly designed neural network structure, STGRNS especially obtains an outperformance on GRN inference.

STGRNS has certain transferability on the TF-gene prediction task. STGRNS can accurately infer GRNs based on known relationships between genes, irrespective of whether the data is static, pseudo-time, or time-series.





□ SEQUENCE VS. STRUCTURE: DELVING DEEP INTO DATA DRIVEN PROTEIN FUNCTION PREDICTION

>> https://www.biorxiv.org/content/10.1101/2023.04.02.534383v1

The difference between the RGC TN and RG AT methods is that the former employs a transformer network and incorporates direction, orientation, and distance distribution information in the edge features, while the latter only includes distance and dihedral angle information.

The first fusion method directly splices the output of the ESM-1b model and the GAT model and feeds it to the classifier for the final prediction. The second fusion method involves taking the output of the ESM-1b model as the initialization characteristics of nodes in the graph.





□ Single-cell RNA-seq differential expression tests within a sample should use pseudo-bulk data of pseudo-replicates

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534443v1

The results of the simulation experiments showed that bulk methods that use pseudo-bulk raw count data from pseudo-replicates ranked highest and were most effective in controlling the false discovery rate (FDR) for highly expressed genes.

For real scRNA-seq data, the top- performing pipelines were also dominated by the same kind of pipelines, but the differences between single-cell and pseudo-replicate methods were less clear.





□ sciPENN: A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation

>> https://www.nature.com/articles/s42256-022-00545-w

sciPENN is a flexible method that supports completion of multiple CITE-seq references (by imputing missing proteins for each reference) as well as protein expression prediction in an scRNA-seq test set, all in one framework.

sciPENN can transfer cell type labels from a training set to a test set, and can also integrate cells from the multiple datasets into a common latent space.

sciPENN’s model architecture comprises an input block, followed by a sequence of feed-forward (FF) blocks interleaved with updates to an internally maintained hidden state updated via an RNN cell.

The final hidden state is passed through three dense layers to compute protein predictions, protein prediction bounds and cell type class probability vectors.





□ Bayesian Multi-Study Non-Negative Matrix Factorization for Mutational Signatures

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534619v1

A Bayesian multi-study NMF method that jointly decomposes multiple studies or conditions to identify signatures that are common, specific, or partially shared by any subset.

A “discovery-only" model that estimates de novo signatures in a completely unsupervised manner, and a “recovery-discovery" model that builds informative priors from previously known signatures to both update the estimates of these signatures and identify any novel signatures.





□ The impact of FASTQ and alignment read order on structural variation calling from long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534439v1

Comparisons of variant call format (VCF) files generated from the original and permutated FASTQ files demonstrated that the order of input data had a large impact on SV prediction, particularly for pbsv. The type of variant most affected by read order varied by caller.

For pbsv, most differences occurred for deletions and duplications, while for Sniffles, permutating the read order had a stronger impact on insertions. For SVIM, inversions and deletions accounted for most differences.





□ Spatial Transcriptomics Analysis of Gene Expression Prediction using Exemplar Guided Graph Neural Network

>> https://www.biorxiv.org/content/10.1101/2023.03.30.534914v1

Proposing a graph exemplar bridging (GEB) block to update window features by the exemplars and the gene expression of exemplars. Allowing dynamic information propagation, the exemplar feature also receives and is updated with the status of the window features.

Semantically, the former update corresponds w/ ‘the known gene expression’, and the latter corresponds w/ ‘the GE the model wants to be known’. Finally, It has an attention-based prediction block to aggregate exemplars of each window and the exemplar-revised window features.





□ CellTrackVis: interactive browser-based visualization for analyzing cell trajectories and lineages

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05218-y

CellTrackVis visualizes tracking results, e.g., cell trajectories, segmentation, raw or processed image sequence, cell lineages, or quantified information, on interconnected views. Those generally include the number of cell division or appearance/disappearance at each time step.

Distinct time-series data are plotted using line graphs and exact values appear with a vertical bar, moved by a mouse pointer. The statistic data set is not the mandatory input, and thus our tool supports its visual analysis while retaining the flexibility of input data.





□ A self-propagating, barcoded transposon system for the dynamic rewiring of genomic networks

>> https://www.embopress.org/doi/full/10.15252/msb.202211398

A modular, combinatorial assembly pipeline for the functionalization of transposons with synthetic or endogenous gene regulatory elements as well as DNA barcodes.

The continuous mobilization of transposons throughout the host genome yields multi-site adaptive mutations and growth phenotypes in both static and dynamic selective environments.

It first mimics a natural transposon, with the transposase acting in cis from within the region flanked by the inverted repeat sequences, while the second uses a medium copy helper plasmid (pHelper) to provide transposase acting in trans.





□ Sparse clusterability: testing for cluster structure in high dimensions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05210-6

Clusterlab generates clusters of a user-provided dimension by a linear projection of two-dimensional Gaussian principal components into the desired higher-dimensional space. The clusterlab manual highlights 12 example two-dimensional structures to project into higher dimension.

Methods with the dip test and either sparse PCA or traditional PCA detected known cluster structure in high dimensional-omics based data and had high power in simulations. Type I error was controlled at or below the nominal level across all dimensions.





□ MBE: Model-based differential sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.29.534803v1

Model-based enrichment (MBE) is based on sound theoretical principles, is easy to implement, and can trivially make use of advances in modern-day machine learning classification architectures or related innovations.

Increasingly, log-enrichment estimates are also being used as supervised labels for training machine learning models so that one may predict enrichment for unobserved sequences, or probe the model to gain further insights.





□ PanKmer: k-mer based and reference-free pangenome analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.31.535143v1

PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence-absence values in each genome. These are stored in an efficient k-mer index data format that encodes all forms of variation within the pangenome, including SNPs, INDELs, and SVs.

PanKmer includes functions for downstream analysis, such as calculating sequence similarity statistics b/n individuals at whole-genome or local scales. k-mers can be “anchored” in any individual genome to quantify sequence variability or conservation at a specific locus.





□ MOGAT: An Improved Multi-Omics Integration Framework Using Graph Attention Networks

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535195v1

MOGAT, a novel multi-omics integration-based cancer subtype prediction leveraging a graph attention network (GAT) model that incorporates graph-based learning with an attention mechanism for analyzing multi-omics data.

MOGAT utilizes a multi-head attention mechanism that can efficiently extract information for a specific patient by assigning unique attention coefficients to its neighboring patients, i.e., getting the relative influence of neighboring patients in the patient similarity graph.





□ mlf-core: a framework for deterministic machine learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad164/7099608

mlf-core, a machine learning framework that enables building fully deterministic and therefore also reproducible machine learning projects. mlf-core is based on MLflow for machine learning experiment tracking, visualization and model deployment.

mlf-core provides project templates and static code analysis (linting) functionality that ensures the sole usage of deterministic algorithms for GPU computing as well as setting all necessary random seeds for deterministic results.





□ Discovering motifs and genomic patterns with SMT: a high-performance data structure for counting kmers

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535163v1

The Sparse Motif Tree (SMT), an innovative tool specifically designed to store and count kmers efficiently. The SMT optimizes memory usage and computation.

The SMT provides advanced features, such as exact search in constant time, retrieval of the most abundant kmers, and approximate search in linear time to find fragments with up to d mutations uniformly distributed across their bases.





□ PanGraphViewer: A Versatile Tool to Visualize Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2023.03.30.534931v1

PanGraphViewer targets pangenome graphs and allows the viewing of pangenome graphs built from multiple genomes in either the graphical fragment assembly format or the VCF. PanGraphViewer also integrates genome annotations with graph nodes to analyze insertions / deletions.

The graph node shapes in PanGraphViewer can represent different types of genomic variations when a VCF file is used. Notably, PanGraphViewer displays subgraphs from a chromosome or sequence segment based on any given coordinates.





□ ScRAT: Clinical Phenotype Prediction From Single-cell RNA-seq Data using Attention-Based Neural Networks

>> https://www.biorxiv.org/content/10.1101/2023.03.31.532253v1

ScRAT, a clinical phenotype prediction framework that can learn from limited numbers of scRNA-seq samples with minimal dependence on cell- type annotations.

ScRAT establishes the connection between the input (cells) and the output (phenotypes) of the Transformer model simply using the attention weights.





□ NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad161/7099619

NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. NEREL-BIO comprises the following specific features: annotation of nested named entities, it can be used as a benchmark for cross-domain and cross-language transfer.

Transferability of trained models across two datasets with completely different contexts can be limited due to domain shift, while sequential training can cause complete retraining of model weights.





□ Dipwmsearch: a python package for searching di-PWM motifs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad141/7100340

dipwmsearch provides an easy and efficient procedure to find occurrences of di-PWMs in nucleotidic sequences, and well documented snippets. It offers practical advantages compared to an existing solution (like processing IUPAC codes, or an adaptable output).

dipwmsearch uses an original enumeration based search algorithm that handles di-PWMs. Coping with non selective positions was necessary to make search effective for some di-PWMs, which questions their information content, and in turn their construction process.





□ FRASER 2.0: Improved detection of aberrant splicing using the Intron Jaccard Index

>> https://www.medrxiv.org/content/10.1101/2023.03.31.23287997v1

As FRASER’s autoencoder works with values in the logit space, which is defined for values greater than 0 and less than 1, a pseudocount needs to be added to both the numerator and denominator when calculating each metric on raw read counts.

FRASER 2.0, a method to detect aberrant splicing using a novel intron-centric metric, the Intron Jaccard Index. In a single metric, the Intron Jaccard Index captures former metrics of splicing efficiency as well as alternative donor and acceptor site choice.

FRASER 2.0 decreases the number of reported splicing outliers by one order of magnitude, recovers splicing outliers associated with candidate splice-disrupting rare variants more accurately than competitor methods, and is more robust to variations in sequencing depth.





□ catchSalmon / catchKallisto: Dividing out quantification uncertainty allows efficient assessment of differential transcript expression

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535231v1

Bootstrap samples generated by lightweight aligners can be used to accurately estimate the mapping ambiguity overdispersion which, in turn, can be used to scale down estimated transcript counts so that the resulting effective library sizes reflect their true precision.

As a result, standard methods designed for the differential expression analyses at the gene-level can be applied to transformed transcript counts for DTE analyses.

Functions catchSalmon and catchKallisto from edgeR import transcript-specific estimated counts (including bootstrap resamples) from Salmon and kallisto, respectively, and estimate the associated mapping ambiguity overdispersion.





□ HTOreader: A hybrid single-cell demultiplexing strategy that increases both cell recovery rate and calling accuracy

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535299v1

HTOreader, an improved algorithm for cell hashing that distinguishes true positive from background for each individual hashtag at higher accuracy. This hybrid strategy increases cell recovery and calling accuracy while lowering experimental cost.

HTOreader uses a hybrid demultiplexing strategy for single-cell sample pooling and super loading. By integrating results of both cell hashing and SNP profiling, they successfully complement the two approaches with each other and hugely improve their weaknesses.




εν αρχη ην ο λογος.

2023-03-13 03:13:13 | Science News

(Art by joeryba.eth)

私たちが直面する問題は2種類に分けられる。それは「己の限界」と「他者の檻」である。全ての主観者が『反復』するプロセスを織り込んで、2つの問題は常に背中合わせとなる。自らが解決した問題は常に他者を囚え続け、鏡のようにその逆が成り立つ。檻から出た先は檻であり、入れ子のように循環する。




□ Φ-SO: Deep symbolic regression for physics guided by units constraints: toward the automated discovery of physical laws

>> https://arxiv.org/abs/2303.03192

Φ-SO, a Physical Symbolic Optimization framework for recovering analytical symbolic expressions from physics data using deep reinforcement learning techniques by learning units constraints.

Φ-SO restricts the freedom of the equation generator, and balanced units are proposed by construction, thus greatly reducing the search space. It enables the algorithm to zero-out the probability of forbidden symbols that would result in expressions that violate units rules.





□ scPheno: Extraction of biological signals by factorization enables the reliable analysis of single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.03.04.531126v1

scPheno, a deep auto-regressive factor model that is used to extract the biological signals imbedded in transcriptome, identify gene expression variations associated with each of the phenotypes, and re-build the accumulative effect of multiple phenotypes on cell states.

scPheno will factorize gene expression pertaining to a phenotypic factor and project cells onto a latent variable space, where the latent variable specifies a hidden cell state and cells of the same hidden states will cluster together.

The deep factor model will infer the factorized latent variable spaces. The factorization neural networks and the reconstruction neural network can be coupled to predict gene expression in relation to any factor combination.





□ INSnet: a method for detecting insertions based on deep learning network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05216-0

INSnet divides the reference genome into continuous sub-regions and takes five features for each locus through alignments between long reads and the reference genome. Next, INSnet uses a depthwise separable convolutional network.

INSnet uses two attention mechanisms, the convolutional block attention module (CBAM) and efficient channel attention (ECA) to extract key alignment features in each sub-region. INSnet uses a gated recurrent unit (GRU) network to further extract more important SV signatures.





□ LEMUR: Analysis of multi-condition single-cell data with latent embedding multivariate regression

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531268v1

A new statistical model for differential expression analysis (or ANOVA) of multi-condition single-cell data that combines the ideas of linear models and principal compo- nent analysis (PCA).

Latent embedding multivariate regression (LEMUR) is based on a parametric mapping of latent space representations into each other and uses a design matrix to encode categorical and continuous covariates.





□ The Network Zoo: a multilingual package for the inference and analysis of gene regulatory networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02877-1

The Network Zoo, a platform that harmonizes the codebase for these methods, in line with recent similar efforts, and provides implementations in R, Python, MATLAB, and C. The netZoo codebase has helped develop an ecosystem of online resources for GRN inference and analysis.

netZoo integrates PANDA, LIONESS, and MONSTER to infer TF-gene targeting to explore how regulatory changes affect disease phenotype, and used DRAGON to integrate nine types of genomic information and find multi-omic markers that are associated with drug sensitivity.





□ RGT: a toolbox for the integrative analysis of high throughput regulatory genomics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05184-5

Regulatory Genomics Toolbox (RGT) was programmed in an oriented-object fashion and its core classes provided functionalities to handle typical regulatory genomics data: regions and signals.

RGT built distinct regulatory genomics tools, i.e., HINT for footprinting analysis, TDF for finding DNA–RNA triplex, THOR for ChIP-seq differential peak calling, motif analysis for TFBS matching and enrichment, and RGT-viz for regions association tests and data visualization.

THOR is a Hidden Markov Model-based approach to detect and analyze differential peaks in two sets of ChIP-seq data from distinct biological conditions with replicates. Triplex Domain Finder (TDF) characterizes the triplex-forming potential between RNA and DNA regions.





□ phytools 2.0: An updated R ecosystem for phylogenetic comparative methods (and other things)

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531791v1

The phytools library has now grown to be very large – consisting of hundreds of functions, a documentation manual that’s over 200 pages in length, and tens of thousands of lines of computer code.

For Mk model-fitter (which here will be the phytools function fitMk), and for the other discrete character methods of the phytools R package, the input phenotypic trait data will typically takes the form of a character or factor vector.





□ NextDenovo: An efficient error correction and accurate assembly tool for noisy long reads

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531669v1

NextDenovo, a highly efficient error correction and CTA-based assembly tool for noisy long reads. NextDenovo can rapidly correct reads; these corrected reads contain fewer errors than other comparable tools and are characterized by fewer chimeric alignments.

NextDenovo uses the BOG algorithm to remove edges for non-repeat nodes. The graph usually contained some linear paths connecting some complex subgraphs. All paths were broken at the node connecting with multi-paths, and contigs were outputted from these broken linear paths.





□ vcfdist: Accurately benchmarking phased small variant calls in human genomes

>> https://www.biorxiv.org/content/10.1101/2023.03.10.532078v1

vcfdist, an alignment-based small variant calling evaluator that standardizes query and truth VCF variants to a consistent representation, requires local phasing of both input VCFs, and gives partial credit to variant calls which are mostly (but not exactly) correct.

A novel variant clustering algorithm reduces downstream computation while discovering long range variant dependencies. A novel alignment distance based metrics which are independent of variant representation, and measure the distance b/n the final diploid truth / query sequences.





□ scEvoNet: a gradient boosting-based method for prediction of cell state evolution

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05213-3

scEvoNet, a method that builds a cell type-to-gene network using the Light Gradient Boosting Machine (LGBM) algorithm overcoming different domain effects (different species/different datasets) and dropouts that are inherent for the scRNA-seq data.

ScEvoNet builds the confusion matrix of cell states and a bipartite network connecting genes and cell states. It allows a user to obtain a set of genes shared by the characteristic signature of two cell states even between distantly-related datasets.





□ NGenomeSyn: an easy-to-use and flexible tool for publication-ready visualization of syntenic relationships across multiple genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad121/7072460

NGenomeSyn, an easy-to-use and flexible tool, for publication-quality visualization of syntenic relationships (user-defined or generated by our custom script) and genomic features (e.g. repeats, structural variations, genes) on tens of genomes with high customization.

NGenomeSyn allows its user to adjust default options for genome and link styles defined in the configuration file and simply adjusts options of moving, scaling, and rotation of target genomes, yielding a rich layout and publication-ready figure.





□ containX: Coverage-preserving sparsification of overlap graphs for long-read assembly

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad124/7074174

ContainX heuristics are promising in terms of improving assembly quality by avoiding coverage gaps. The string graph model filters out contained reads during graph construction.

containX is a prototype implementation of an algorithm that decides which contained reads can be dropped during overlap graph sparsfication. Reads which are substrings of longer reads are typically referred to as contained reads.

Hifiasm retained fewer contained reads than ContainX but it failed to resolve a majority of coverage gaps. The unitig graph of Hifiasm has the least number of junction reads because it does additional graph pruning which is necessary for computing longer unitigs.





□ LoMA: Localized assembly for long reads enables genome-wide analysis of repetitive regions at single-base resolution in human genomes

>> https://pubmed.ncbi.nlm.nih.gov/36895025/

LoMA constructs a CS spanning a target region. This process is initiated by finding overlaps of raw reads using pairwise all-to-all alignment of minimap2, followed by a layout of overlapped reads. It divides the layout into multiple blocks to make partial consensus sequences.

LoMA captures haplotype structures based on SVs and produces haplotype-resolved CSs. LoMA predicts heterozygous loci in the region based on the extent of deviation from the binomial distribution, and the reads derived from each estimated haplotype are gathered.





□ HiFiCNV : Copy number variant caller and depth visualization utility for PacBio HiFi reads

>> https://www.pacb.com/blog/hificnv/

HiFiCNV can generate several CNV related track files which can be loaded into IGV for visualization and assessment of its variant calls. HiFiCNV detected all large CNVs from this dataset, and 90% of those calls had high overlap accuracy when compared to the reported CNV.

Segmentation is performed by a Viterbi parse of the depth bins assuming the bin depth represents a Poisson sampling from a mean depth based on haploid depth. The haploid depth is computed from the zero-excluded mean depth of this chromosome set.





□ ReCo: automated NGS read-counting of single and combinatorial CRISPR gRNAs.

>> https://www.biorxiv.org/content/10.1101/2023.03.09.530923v1

ReCo! finds gRNA read counts (ReCo) in fastq files and runs as a standalone script or a python package. It can be used for single and combinatorial CRISPR-Cas libraries that have been sequenced with single-end or paired-end sequencing strategies.

ReCo works with conventionally cloned CRISPR-Cas libraries and 3Cs/3Cs-MPX libraries. ReCo can process multiple samples in a single run. It automatically determines the constant regions flanking the gRNAs, and utilizes Cutadapt to trim the fastq files.





□ StonPy: a tool to parse and query collections of SBGN maps in a graph database

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad100/7075543

The StonPy library allows users to store SBGN-ML maps into a running Neo4j database, and conversely retrieve them into SBGN-ML. StonPy includes a completion module that allows users to build valid SBGN maps from query results representing parts of maps automatically.

SBGN arcs are optionally modelled using additional Neo4j relationships that mimic the structure of the SBGN map. StonPy brings new capabilities for storing and analyzing large collections of CellDesigner and SBGN maps using Neo4j and Cypher.





□ SLEMM: million-scale genomic predictions with window-based SNP weighting

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad127/7075542

SLEMM (Stochastic-Lanczos-Expedited Mixed Models) uses the Stochastic Lanczos REML and SNP effects for large datasets. SLEMM is fast enough for million-scale genomic predictions.

SLEMM with SNP weighting had overall the best predictive ability among a variety of genomic prediction methods including GCTA’s empirical BLUP, BayesR, KAML, and LDAK’s BOLT and BayesR models.





□ scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531861v1

scDeepInsight can directly annotate the query dataset based on the model trained on the reference dataset. scDeepInsight does preprocessing of scRNA-seq data, including quality control and integration through batch normalization.

scDeepInsight is a single-cell labeling model based on supervised learning, so a reference dataset is also required. DeepInsight is utilized to convert the processed non-image data into images.





□ A general minimal perfect hash function for canonical k-mers on arbitrary alphabets with an application to DNA sequences

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531845v1

A minimal perfect hash function of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σk /2−1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation.

The encoding is based on the observation that there are fewer canonical k-mers than there are k-mers in general. A mapping is only required if k-mer x is canonical, i.e., x is lexicographically smaller than or equal to x^−1.





□ scBubbletree: quantitative visualization of single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531263v1

scBubbletree, a new scalable method for visualization of scRNA-seq data. The method identifies clusters of cells of similar transcriptomes and visualizes such clusters as “bubbles” at the tips of dendrograms, corresponding to quantitative summaries of cluster properties.

scBubbletree stacks bubble trees w/ further cluster-associated information. scBubbletree relies on the gap statistic method. scBubbletree can cluster scRNA-seq data in two ways, namely by graph-based community detection (GCD) algorithms such as Louvain or Leiden, and by k-means.





□ Panpipes: a pipeline for multiomic single-cell data analysis.

>> https://www.biorxiv.org/content/10.1101/2023.03.11.532085v1

Panpipes, a set of workflows designed to automate the analysis of multimodal single-cell datasets by incorporating widely used Python-based tools to efficiently perform QC, preprocessing, integration, clustering, and reference mapping at scale in the multiomic setting.

Panpipes generates a cluster matching metric, the Adjusted Rand Index, for global concordance evaluation. Panpipes can aid building unimodal or multimodal references and enables the user to query multiple references simultaneously using scArches.





□ plasma: Partial LeAst Squares for Multiomics Analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.10.532096v1

plasma, a novel two-step algorithm to find models that can predict time-to-event outcomes on samples from multiomics data sets even in the presence of incomplete data. These components will be automatically associated with the outcome.

plasma uses partial least squares (PLS) for both steps, using Cox regression to learn the single omics models and linear regression. The plasma components are learned in a way that maximizes the covariance in the predictors and the response.





□ eOmics: an R package for improved omics data analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.11.532240v1

eOmics combines an ensemble framework with limma, improving its performance on imbalanced data. It couples a mediation model with WGCNA, so the causal relationship among WGCNA modules, module features, and phenotypes can be found.

eOmics has some novel functional enrichment methods, capturing the influence of topological structure on gene set functions. It contains multi-omics clustering and classification functions to facilitate ML tasks. Some basic functions, such as ANOVA analysis, are also available.





□ Biomappings: Prediction and Curation of Missing Biomedical Identifier Mappings

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad130/7077133

Biomappings, a framework for semi-automatically creating and maintaining mappings in a public, version-controlled repository.

Biomappings combines multiple contributions: (i) a "curation cycle" workflow for creating mappings, (ii) an extensible pipeline for automatically predicting missing mappings between resources, and automatically detecting inconsistencies.

Biomappings currently makes available 9,274 curated mappings and 40,691 predicted ones, providing previously missing mappings between widely used identifier resources covering small molecules, cell lines, diseases, and other concepts.





□ fraguracy: overlapping bases in read-pairs from a fragment indicate accuracy.

>> https://github.com/brentp/fraguracy

Many factors can be predictive of the likelihood of an error. The dimensionality is a consideration because if the data is too sparse, prediction is less reliable. For each combination, while iterating over the bam, it stores the number of errors and the number of total bases in each bin.

fraguracy calculates real error rates using overlapping paired-end reads in a fragment. This avoids some bias. It does limit to the (potentially) small percentage of bases that overlap and it will sample less at the beginning of read 1 and the end of read2.





□ Genes2Genes: Gene-level alignment of single cell trajectories informs the progression of in vitro T cell differentiation

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531713v1

Genes2Genes overcomes current limitations and is able to capture sequential matches and mismatches between a reference and a query at single gene resolution, highlighting distinct clusters of genes with varying patterns of gene expression dynamics.

Genes2Genes utilizes a Bayesian information-theoretic Dynamic Programming alignment algorithm that accounts for matches, warps and indels by combining the classical Gotoh’s biological sequence alignment algorithm and Dynamic Time Warping.





□ GenoPipe: identifying the genotype of origin within (epi)genomic datasets

>> https://www.biorxiv.org/content/10.1101/2023.03.14.532660v1

The three core modules of GenoPipe: EpitopeID, DeletionID, and StrainID were developed to identify major genotypical determinants of cellular identity. GenoPipe can detect genotype perturbations at realistic and practical sequencing depths as defined by ENCODE.

The DeletionID module models the background of a genomic experiment to identify depleted regions of the genome to predict genomic deletions. The StrainID uses existing SNP or variant calls databases of common cell lines to match a cell’s genetic identity inherent to each dataset.

The EpitopeID module identifies the presence and approximate location of specific DNA sequences within the genome. The algorithm functions by first aligning the raw sequencing data (i.e., FASTQ) against a curated DNA sequence database (tagDB) of common protein epitopes.





□ BioConvert: a comprehensive format converter for life sciences

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532455v1

BioConvert aggregates existing software within a single framework and complemented them with original code when needed. It provides a common interface to make the user experience more streamlined instead of having to learn tens of them.

BioConvert supports about 50 formats and 100 direct conversions in areas such as alignment, sequencing, phylogeny, and variant calling. BioConvert can also be utilized by developers as a universal benchmarking framework for evaluating and comparing numerous conversion.





□ Fast Approximate IsoRank for Scalable Global Alignment of Biological Networks

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532445v1

A new IsoRank approximation, which exploits the mathematical properties of IsoRank's linear system to solve the problem in quadratic time with respect to the maximum size of the two PPI networks.

A computationally cheaper refinement is proposed to this initial approximation so that the updated result is even closer to the original IsoRank formulation.

In synthetic experiments, they create random graphs using the Erd ̋os R ́enyi and Barab ́asi-Albert models, and ask IsoRank to recover the graph isomorphism between the graphs and a random node permutation.





□ IntLIM 2.0: identifying multi-omic relationships dependent on discrete or continuous phenotypic measurements

>> https://academic.oup.com/bioinformaticsadvances/article-abstract/3/1/vbad009/7022005

IntLIM 2.0 uncovers phenotype-dependent linear associations between two types of analytes. IntLIM 2.0 extends IntLIM 1.0 to support generalized analyte measurement data types, continuous phenotypic measurement, covariate correction, model validation and unit testing.

IntLIM 2.0 supports model validation using cross-validation and random permutation models.





□ NanoSquiggleVar: A method for direct analysis of targeted variants based on nanopore sequencing signals

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532860v1

NanoSquiggleVar can directly identify targeted variants from the nanopore sequencing electrical signal without the requirement of base calling, sequence alignment, or variant detection with downstream analysis.

In each sequencing iteration, the signal is sliced into fragments by a moving window of 1-unit step size. Dynamic time warping is used to compare the signal squiggles to the detected variants. NanoSquiggleVar can only determine the existence of a mutation and not its frequency.





□ HiDecon: Accurate estimation of rare cell type fractions from tissue omics data via hierarchical deconvolution

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532820v1

HiDecon, a penalized approach with constraints from both “parent” and “children” cell types to make full use of a hierarchical tree structure. The hierarchical tree is readily available from well-studied cell lineages or can be learned from hierarchical clustering of scRNA-seq.

The basic intuition of HiDecon is that there exists a summation relationship b/n the estimation results of adjacent layers. HiDecon implements the sum constraint penalties from the upper and lower layers to aggregate estimates across layers for more accurate cellular fraction.






□ Implementing Dynamic Time Warping (DTW) with Neural Networks and analyzing single-cell RNA data involves creating a custom model architecture with GPT-4.




Yubais RT

昔のAI観ではまず「知性そのもの」みたいなのをコンピュータ内に作って、それと人間が会話するためのインターフェースを別途作るようなイメージだったんだが、インターフェースであるはずの言語に知性っぽいものが内包されていたんじゃないか、と現状を見ていて思う