lens, align.

Long is the time, but the true comes to pass.

Dominator.

2023-10-17 22:17:37 | Science News

(Created with Midjourney v5.2)





□ Design Patterns of Biological Cells

>> https://arxiv.org/abs/2310.07880

Because design patterns exist at all levels of detail within biology, from the designs of specific molecules to the designs of multi-cellular organisms, they restrict this work to the chemical reaction networks that animate individual cells.

There are three dominant versions of this pattern, which are DNA replication, DNA transcription to RNA, and RNA translation to proteins.

Each is performed by complex biochemical machinery that moves along the template and catalyzes the production of the newly synthesized molecule, and each includes its own version of kinetic proofreading.





□ Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes

>> https://www.biorxiv.org/content/10.1101/2023.09.27.559668v1

Deepurify developed two distinct encoders, a genomic sequence encoder (GseqFormer) and a taxonomic encoder (LSTM) to encode genomic sequences and their source genomes' taxonomic lineages.

Deepurify initially quantified the taxonomic similarities of contigs by assigning taxonomic lineages to them. It then used these lineages to construct a MAG-separated tree, partitioning the MAG into distinct sections, each containing contigs with the same lineage.
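
The partitioning step can be pictured with a small sketch. This is not Deepurify's code: the function name `partition_by_lineage`, the `depth` parameter, and the toy lineages are all hypothetical, and the real tool predicts lineages with its encoders rather than reading them from a dictionary.

```python
from collections import defaultdict

def partition_by_lineage(contig_lineages, depth):
    """Group contigs whose assigned lineages agree down to `depth` ranks.

    contig_lineages: dict mapping a contig name to a list of taxon names,
    ordered from the broadest rank to the most specific (hypothetical input).
    """
    groups = defaultdict(list)
    for contig, lineage in contig_lineages.items():
        groups[tuple(lineage[:depth])].append(contig)
    return dict(groups)

# Toy MAG with three contigs (made-up assignments).
contigs = {
    "contig_1": ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    "contig_2": ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
    "contig_3": ["Bacteria", "Firmicutes", "Bacilli"],
}
sections = partition_by_lineage(contigs, depth=2)
```

Calling the function at increasing depths gives progressively finer sections, which is one way to think about the levels of a lineage-separated tree.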

Deepurify optimized contig utilization within the MAG, avoiding immediate removal of contaminated contigs. A tree traversal algorithm was devised to maximize the count of medium- and high-quality MAGs within the MAG-separated tree.





□ scDILT: a model-based and constrained deep learning framework for single-cell Data Integration, Label Transferring, and clustering

>> https://www.biorxiv.org/content/10.1101/2023.10.09.561605v1

scDILT (Single-Cell Deep Data Integration and Label Transferring) leverages a conditional autoencoder (CAE). The CAE receives the concatenated count matrix of multiple datasets, along with a vector indicating the batch IDs.

scDILT generates an integrated latent space representing the input datasets along with predicted labels for all cells. The cell-to-cell constraints will be built based on the labels of these data and implemented on the bottle-neck layer Z of the autoencoder.
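
A minimal sketch of how the CAE input described above might be assembled. The one-hot encoding of the batch ID and the name `build_cae_input` are assumptions made for illustration; the summary only says that a vector indicating the batch IDs is supplied.

```python
def build_cae_input(datasets):
    """Stack per-dataset count matrices and append a one-hot batch indicator.

    datasets: list of count matrices (lists of cells, each a list of gene
    counts) that share the same gene order. Returns the conditioned rows and
    the plain batch-ID vector.
    """
    n_batches = len(datasets)
    rows, batch_ids = [], []
    for batch, counts in enumerate(datasets):
        for cell in counts:
            # Assumption: the batch ID is passed to the network as a one-hot code.
            one_hot = [1.0 if b == batch else 0.0 for b in range(n_batches)]
            rows.append([float(x) for x in cell] + one_hot)
            batch_ids.append(batch)
    return rows, batch_ids

dataset_a = [[3, 0, 1], [0, 2, 5]]  # two cells, three genes (toy counts)
dataset_b = [[1, 1, 0]]             # one cell from a second batch
cae_input, batch_ids = build_cae_input([dataset_a, dataset_b])
```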





□ ProxyTyper: Generation of Proxy Panels for Privacy-aware Outsourcing of Genotype Imputation

>> https://www.biorxiv.org/content/10.1101/2023.10.01.560384v1

ProxyTyper is a framework for building proxy panels, i.e., panels that are similar in statistical properties to the original panel but are anonymized. ProxyTyper uses three mechanisms to protect haplotype datasets in terms of variant positions, genetic maps, and variant genotypes.

The first mechanism protects the variant positions and genetic maps, which can leak side-channel information. The second is resampling of the original haplotype panels using a Li-Stephens Markov model, with privacy parameters for tuning the privacy level and utility.

ProxyTyper generates a mosaic of the original haplotypes so that each chromosome-wide haplotype is a mosaic of the haplotypes in the original panel. The third mechanism consists of encoding the alleles in resampled panels using locality-based hashing and permutation.
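
The mosaic resampling can be sketched as a simplified Li-Stephens-style copying process. This is an illustration under stated assumptions, not ProxyTyper's implementation: the switch probability is constant per variant here (a real Li-Stephens model would derive it from genetic distances), and `switch_prob` and `error_prob` are hypothetical stand-ins for the privacy parameters.

```python
import random

def resample_mosaic(panel, switch_prob, error_prob, rng):
    """Sample one mosaic haplotype by copying from a panel of 0/1 haplotypes.

    At each variant the copied haplotype may switch (probability switch_prob),
    and the copied allele may be flipped (probability error_prob).
    """
    n_variants = len(panel[0])
    current = rng.randrange(len(panel))
    mosaic, path = [], []
    for j in range(n_variants):
        if j > 0 and rng.random() < switch_prob:
            current = rng.randrange(len(panel))  # jump to a random haplotype
        allele = panel[current][j]
        if rng.random() < error_prob:
            allele = 1 - allele                  # copying error
        mosaic.append(allele)
        path.append(current)
    return mosaic, path

panel = [
    [0, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0],
]
rng = random.Random(0)
mosaic, path = resample_mosaic(panel, switch_prob=0.3, error_prob=0.0, rng=rng)
```

Raising `switch_prob` or `error_prob` moves the sampled haplotype further from any single original, which is the privacy versus utility trade-off the parameters are meant to tune.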





□ DiffDec: Structure-Aware Scaffold Decoration with an End-to-End Diffusion Model

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561377v1

DiffDec optimizes molecules through molecular scaffold decoration conditioned on the 3D protein pocket by an E(3)-equivariant graph neural network and diffusion model. DiffDec could identify the growth anchors and generate R-groups well for the scaffolds without provided anchors.

The diffusion process iteratively adds Gaussian noise to the data, while the generative process gradually denoises the noise distribution under the condition of scaffold and protein pocket to recover the ground truth R-groups.
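
The forward (noising) half of this process can be sketched with the standard closed-form DDPM expression, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear beta schedule, the number of steps, and the toy coordinates are assumptions; DiffDec's actual schedule and its equivariant denoiser are not reproduced here.

```python
import math
import random

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) up to and including step t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod

def forward_diffuse(x0, betas, t, rng):
    """Sample x_t from x_0 in one shot using the closed-form Gaussian."""
    a = alpha_bar(betas, t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

# Assumed linear noise schedule with 100 steps.
T = 100
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

x0 = [1.2, -0.4, 0.8]  # toy coordinates of one R-group atom
rng = random.Random(0)
x_noisy = forward_diffuse(x0, betas, t=T - 1, rng=rng)
```

The generative process would run the other way: a network conditioned on the scaffold and pocket predicts the noise at each step so that it can be removed gradually.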






□ ILIAD: A suite of automated Snakemake workflows for processing genomic data for downstream applications

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561910v1

ILIAD is a suite of Snakemake workflows with several modules for automatic and reliable processing of raw or stored genomic data, producing the ready-to-use genotypic information needed to drive downstream applications.

ILIAD offers a containerized workflow with optional automatic downloads of desired files from file transfer protocol (FTP) sites coupled with the use of any genome reference assembly for variant calling using BCFtools.

ILIAD features independent submodules for lifting over reference assembly genomic positions (GRCh37 to GRCh38 and vice versa) and merging multiple VCF files at once.





□ MSXFGP: combining improved sparrow search algorithm with XGBoost for enhanced genomic prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05514-7

Chaos theory is a nonlinear theory and has good applications in random number generation. Many swarm intelligence optimization methods use chaos mapping as random number generators to initialize populations.

MSXFGP is based on a multi-strategy improved sparrow search algorithm (SSA) to optimize XGBoost parameters and feature selection. Firstly, logistic chaos mapping, elite learning, adaptive parameter adjustment, Levy flight, and an early stop strategy are incorporated into the SSA.
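
As an illustration of chaos-based initialization, the sketch below uses the logistic map x_{n+1} = μ·x_n·(1 − x_n) with μ = 4 to seed a population inside parameter bounds. The function names, the seed value 0.37, and the example bounds are hypothetical; the paper's exact settings are not given in this summary.

```python
def logistic_map_sequence(x0, n, mu=4.0):
    """Iterate the logistic map n times starting from x0 in (0, 1)."""
    values, x = [], x0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        values.append(x)
    return values

def init_population(pop_size, bounds, x0=0.37):
    """Map chaotic values in [0, 1] onto each parameter's (low, high) range."""
    dim = len(bounds)
    chaos = logistic_map_sequence(x0, pop_size * dim)
    population = []
    for i in range(pop_size):
        individual = []
        for d, (low, high) in enumerate(bounds):
            individual.append(low + chaos[i * dim + d] * (high - low))
        population.append(individual)
    return population

# Hypothetical search ranges, e.g. a learning rate and a tree depth.
bounds = [(0.01, 0.3), (3.0, 10.0)]
population = init_population(pop_size=5, bounds=bounds)
```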





□ PhyGCN: Pre-trained Hypergraph Convolutional Neural Networks with Self-supervised Learning

>> https://www.biorxiv.org/content/10.1101/2023.10.01.560404v1

PhyGCN aims to enhance node representation learning in hypergraphs by effectively leveraging abundant unlabeled data. Hyperedge prediction is employed as a self-supervised task for model pre-training. The pre-trained embedding model is then used for downstream tasks.

To calculate the embedding for a target node, the hypergraph convolutional network aggregates information from neighboring nodes connected to it via hyperedges, and combines it with the target node embedding to output a final embedding.
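
The aggregation step can be sketched without any learned weights. Here the neighbor information is a plain average and is mixed 50/50 with the target node's own embedding; both choices are simplifying assumptions, since the actual model uses trainable layers.

```python
def aggregate_node(target, hyperedges, embeddings):
    """Combine a node's embedding with the mean of its hyperedge neighbors.

    hyperedges: list of sets of node IDs; embeddings: dict node ID -> vector.
    """
    neighbors = set()
    for edge in hyperedges:
        if target in edge:
            neighbors.update(edge)
    neighbors.discard(target)
    own = embeddings[target]
    if not neighbors:
        return list(own)
    dim = len(own)
    neighbor_mean = [
        sum(embeddings[n][d] for n in neighbors) / len(neighbors) for d in range(dim)
    ]
    # Assumption: equal weighting of self and neighborhood information.
    return [0.5 * o + 0.5 * m for o, m in zip(own, neighbor_mean)]

hyperedges = [{0, 1, 2}, {2, 3}]
embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}
updated = aggregate_node(0, hyperedges, embeddings)
```

Node 0 shares a hyperedge only with nodes 1 and 2, so node 3 does not contribute to its updated embedding.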

PhyGCN employs two adapted strategies: DropHyperedge and Skip/Dense Connection. These strategies randomly mask the values of the adjacency matrix for the base hypergraph convolutional network during each iteration, which helps prevent overfitting and improves generalization.





□ Monopogen: Single-nucleotide variant calling in single-cell sequencing data

>> https://www.nature.com/articles/s41587-023-01873-x

Monopogen, a computational framework that enables researchers to detect single-nucleotide variants (SNVs) from a variety of single-cell transcriptomic and epigenomic sequencing data.

Monopogen uses high-quality haplotype and linkage disequilibrium (LD) data from an external reference panel to overcome uneven sequencing coverage, allelic dropout and sequencing errors in single-cell sequencing data.

Monopogen further conducts LD scoring at the cell population level within each sample, leveraging the expectation that most alleles are identical and in perfect LD with neighboring alleles across the genome, except for those that are somatically altered in a subpopulation of cells.





□ Ribotin: Automated assembly and phasing of rDNA morphs

>> https://www.biorxiv.org/content/10.1101/2023.09.29.560103v1

Ribotin uses highly accurate long reads to build a graph that represents all variation within the rDNA. Ultralong ONT reads are then aligned to the graph and used to detect rDNA repeat units. The ONT read paths are clustered into rDNA morphs.

Ribotin has integration with the assembly tool verkko to assemble rDNA morphs per chromosome. Ribotin also has a mode to run without a verkko assembly using only a related reference rDNA sequence. Ribotin detects the rDNA tangles using the reference k-mers and graph topology.





□ LMSRGC: Reference-based genome compression using the longest matched substrings with parallelization consideration

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05500-z

LMSRGC, an algorithm based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMS) for the compression of genome data in FASTA format.

The proposed algorithm utilizes the characteristics of SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence and then utilizes LMSs to compress the target genome sequence.
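
A simplified sketch of reference-based compression with longest matches. It builds a naive suffix array of the reference and finds the longest match for each target position by binary search; the LCP array, the parallelization, and the output encoding of LMSRGC are omitted. The token format and the `min_len` threshold are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def longest_match(reference, sa, query):
    """Return (ref_pos, length) of the longest prefix of query found in reference."""
    lo, hi = 0, len(sa)
    while lo < hi:  # locate where query would be inserted among the suffixes
        mid = (lo + hi) // 2
        if reference[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    best = (0, 0)
    for idx in (lo - 1, lo):  # the best match is adjacent to the insertion point
        if 0 <= idx < len(sa):
            length = common_prefix_len(reference[sa[idx]:], query)
            if length > best[1]:
                best = (sa[idx], length)
    return best

def compress(target, reference, min_len=4):
    """Greedy encoding: (position, length) tokens for matches, else literal bases."""
    sa = build_suffix_array(reference)
    tokens, i = [], 0
    while i < len(target):
        pos, length = longest_match(reference, sa, target[i:])
        if length >= min_len:
            tokens.append((pos, length))
            i += length
        else:
            tokens.append(target[i])
            i += 1
    return tokens

def decompress(tokens, reference):
    parts = []
    for tok in tokens:
        if isinstance(tok, tuple):
            pos, length = tok
            parts.append(reference[pos:pos + length])
        else:
            parts.append(tok)
    return "".join(parts)

reference = "ACGTACGTTAGCCGATTACA"
target = "ACGTTAGCCGTTTACAACGT"
tokens = compress(target, reference)
```

The naive suffix array construction is quadratic or worse and is only suitable for toy inputs; a production tool would use a linear-time construction.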






□ CEN-DGCNN: Co-embedding of edges and nodes with deep graph convolutional neural networks

>> https://www.nature.com/articles/s41598-023-44224-1

CEN-DGCNN (Co-embedding of Edges and Nodes with Deep Graph Convolutional Neural Networks) introduces multi-dimensional edge embedding representation. It constructs a message passing framework which introduces the idea of residual connection and dense connection.

Based on CEN-DGCNN, a deep graph convolutional neural network can be designed to mine remote dependency relationships between nodes. Each layer can learn node features and edge features simultaneously, and both can be updated iteratively across layers.





□ StrastiveVI: Isolating structured salient variations in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.10.06.561320v1

StrastiveVI (Structured Contrastive Variational Inference) leverages previous advances in conditionally invariant representation learning to model the variations underlying scRNA-seq data using two sets of latent variables.

StrastiveVI separates the target variations from the dominant background variations. The background variables are invariant to the given covariate of interest. The target variables capture variations related to the covariate of interest.





□ HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03053-1

HycDemux integrates an unsupervised hybrid approach to achieve accurate clustering, in which a nucleotide-based greedy algorithm is used to obtain initial clusters, and raw signal information is measured to guide the continuous optimization of the clustering.

HycDemux integrates a module that uses a voting mechanism to determine the final demultiplexing result. This module selects n representatives (5 by default) for each cluster and calculates the Dynamic Time Warping (DTW) distance.
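
For readers unfamiliar with Dynamic Time Warping, the sketch below implements the classic dynamic-programming DTW distance and assigns a signal to the cluster whose representatives are closest on average. The averaging rule is a stand-in chosen for brevity and is not HycDemux's voting mechanism; the toy signals and barcode labels are made up.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences (absolute-difference cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def assign_read(signal, representatives):
    """Pick the cluster with the smallest mean DTW distance to its representatives.

    representatives: dict mapping a cluster label to a list of signals.
    """
    best_label, best_score = None, float("inf")
    for label, reps in representatives.items():
        score = sum(dtw_distance(signal, r) for r in reps) / len(reps)
        if score < best_score:
            best_label, best_score = label, score
    return best_label

representatives = {
    "barcode_1": [[1, 2, 3], [1, 2, 2, 3]],
    "barcode_2": [[5, 5, 5], [6, 5, 6]],
}
label = assign_read([1, 1, 2, 3], representatives)
```

DTW tolerates stretching along the time axis, which is why it suits raw nanopore signals whose translocation speed varies between reads.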





□ diVas: Digenic variant interpretation with hypothesis-driven explainable AI

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560464v1

diVas is an ML-based approach for digenic variant interpretation that aims to overcome the limitations of existing tools. Unlike other tools, diVas leverages the proband's phenotypic information to predict the probability that each pair is causative.

diVas employs cutting-edge Explainable Artificial Intelligence (XAI) techniques for further subclassification into distinct digenic mechanisms: True Digenic/Composite and Dual Molecular Diagnosis.





□ Incorporating extrinsic noise into mechanistic modelling of single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.09.30.560282v1

A fully Bayesian framework for the mechanistic analysis of scRNAseq data based on the telegraph model of gene expression, building on single cell sequencing / Kinetics analysis and including cell size effects via a cell-specific scaling factor.

This framework is implemented in the probabilistic programming language Stan and relies on a state-of-the-art Hamiltonian Monte Carlo sampler. It uses Bayesian model selection to distinguish between modes of gene expression and evaluate the possible presence of zero-inflation.






□ MINI-AC: inference of plant gene regulatory networks using bulk or single-cell accessible chromatin profiles

>> https://onlinelibrary.wiley.com/doi/10.1111/tpj.16483

MINI-AC (Motif-Informed Network Inference based on Accessible Chromatin), a computational method that integrates TF motif information with bulk or single-cell derived chromatin accessibility data to perform motif enrichment analysis and GRN inference.

MINI-AC generates information about motifs showing enrichment on the ACRs, a network that is context-specific for a functional enrichment analysis. MINI-AC can be used in two alternative modes - genome-wide and locus-based - to select different non-coding genomic spaces.






□ MBE: model-based enrichment estimation and prediction for differential sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03058-w

MBE can readily make use of modern-day neural network models in a plug-and-play manner, which also enables us to easily handle (possibly overlapping) reads of different lengths.

For example, fully convolutional neural network classifiers naturally handle variable-length sequences because the convolutional kernels and pooling operations in each layer are applied in the same manner across the input sequence, regardless of its length.

MBE trivially generalizes to settings with more than two conditions of interest by replacing the binary classifier with a multi-class classifier.

The multi-class classification model is trained to predict the condition from which each read arose; then, the density ratio for any pair of conditions can be estimated using the ratio of its corresponding predicted class probabilities.
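
The density-ratio trick can be written in a few lines. Given a classifier's predicted class probabilities for one read, the log ratio of two of them estimates the log density ratio between those conditions. The optional prior correction for unequal read counts per condition is an assumption added for completeness, and the condition names are hypothetical.

```python
import math

def log_density_ratio(class_probs, cond_a, cond_b, class_priors=None):
    """Estimate log p(x | cond_a) / p(x | cond_b) from classifier outputs.

    class_probs: dict condition -> predicted probability for one sequence.
    class_priors: optional dict condition -> fraction of training reads from
    that condition; used to undo class imbalance (assumed correction).
    """
    ratio = math.log(class_probs[cond_a]) - math.log(class_probs[cond_b])
    if class_priors is not None:
        ratio -= math.log(class_priors[cond_a]) - math.log(class_priors[cond_b])
    return ratio

# Toy classifier output for one read across three conditions.
probs = {"pre": 0.2, "post_round1": 0.3, "post_round2": 0.5}
log_enrichment = log_density_ratio(probs, "post_round2", "pre")
```

Because the same multi-class output serves every pair of conditions, no additional model has to be trained when a new comparison is needed.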





□ LIANA: Comparison of methods and resources for cell-cell communication inference from single-cell RNA-Seq data

>> https://www.nature.com/articles/s41467-022-30755-0

CCC events are typically represented as a one-to-one interaction between a transmitter and receiver protein, accordingly expressed by the source and target cell clusters. The information about which transmitter binds to which receiver is extracted from diverse sources.

LIANA (a LIgand-receptor ANalysis frAmework) takes any annotated single-cell RNA (scRNA) dataset as input and establishes a common interface to all the resources and methods in any combination. LIANA provides a consensus ranking of the methods' predictions.





□ Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures

>> https://www.nature.com/articles/s41592-023-02026-3

This study is the first to compare differential transcript expression (DTE) and differential transcript usage (DTU) methods for long-read RNA-seq on a controlled dataset with tens of millions of reads per sample, as is typically available in short-read studies.

DTU analysis calculates the proportion of transcript expression relative to all transcripts, which can be impacted more readily by changes in quantification of any transcript from a gene. Therefore, the difference of quantification in ONT and Illumina data had a larger impact.
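
A small worked example shows why usage proportions are sensitive to the quantification of any one transcript. The gene and transcript names and the counts are made up.

```python
def transcript_usage(counts_by_gene):
    """Convert transcript counts into per-gene usage proportions."""
    usage = {}
    for gene, transcripts in counts_by_gene.items():
        total = sum(transcripts.values())
        usage[gene] = {
            tx: (count / total if total else 0.0) for tx, count in transcripts.items()
        }
    return usage

# Same gene quantified twice; only transcript tx_b differs between the two.
quant_1 = {"GENE1": {"tx_a": 30, "tx_b": 10}}
quant_2 = {"GENE1": {"tx_a": 30, "tx_b": 30}}

usage_1 = transcript_usage(quant_1)
usage_2 = transcript_usage(quant_2)
```

The count for `tx_a` is identical in both quantifications, yet its usage drops from 0.75 to 0.5 because the count for `tx_b` changed.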





□ happi: a hierarchical approach to pangenomics inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03040-6

happi is a method for modeling gene presence in pangenomics that leverages information about genome quality. happi models the association between an experimental condition and gene presence where the experimental condition is the primary predictor of interest.

happi provides sensible results in an analysis of metagenome-assembled genome data and improves statistical inference under simulation. The latent variable structure of the model makes the expectation-maximization algorithm an appealing choice for estimating unknown parameters.





□ PaGeSearch: A Tool for Identifying Genes within Pathways in Unannotated Genomes

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559665v1

PaGeSearch identifies a list of genes within a genome, with a focus on genes associated with specific pathways. By identifying candidate regions through a sequence similarity search and performing gene prediction within them, PaGeSearch significantly reduces the search space.

PaGeSearch uses a neural network model to provide candidates that are the most likely orthologs of the query genes.





□ GenArk: towards a million UCSC genome browsers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03057-x

GenArk (Genome Archive), a collection of UCSC Genome Browsers from NCBI assemblies. Built on our established track hub system, this enables fast visualization of annotations. Assemblies come with gene models, repeat masks, BLAT, and in silico PCR.

The GenArk genome browsers cover multiple clades: 159 primates, 409 mammals, 270 birds, 271 fishes, 115 other vertebrates, 598 invertebrates, 554 fungi, and 230 plants. It also includes 446 assemblies from the Vertebrate Genome Project (VGP) and 336 legacy assemblies.





□ scRANK: Ranking of cell clusters in a single-cell RNA-sequencing analysis framework using prior knowledge

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560416v1

A novel methodology that exploits prior knowledge for a disease in combination with expert-user information to accentuate cell types from a scRNA-seq analysis that are most closely related to the molecular mechanism of a disease of interest.

The methodology is fully automated and generates a ranking of all cell types, based on topology information obtained from the CellChat networks.





□ Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing

>> https://www.nature.com/articles/s41587-022-01221-5

An approach for ultra-rapid nanopore WGS that combines an optimized sample preparation protocol, distributing sequencing over 48 flow cells, near real-time base calling and alignment, accelerated variant calling and fast variant filtration for efficient manual review.

The cloud-based pipeline scales compute-intensive base calling and alignment across 16 instances with 4× Tesla V100 GPUs each and runs concurrently with sequencing.

The instances aim for maximum resource utilization, where base calling using Guppy runs on GPU and alignment using Minimap2 runs on 42 virtual CPUs in parallel. Small-variant calling is performed using GPU-accelerated PEPPER–Margin–DeepVariant.





□ AutoClass: A universal deep neural network for in-depth cleaning of single-cell RNA-Seq data

>> https://www.nature.com/articles/s41467-022-29576-y

AutoClass integrates two DNN components, an autoencoder and a classifier, so as to maximize both noise removal and signal retention. AutoClass is distribution agnostic, as it makes no assumption about specific data distributions, and hence can effectively clean a wide range of noise and artifacts.

AutoClass effectively models and cleans a wide range of noises and artifacts in scRNA-Seq data including dropouts, random uniform, Gaussian, Gamma, Poisson, and negative binomial noises, as well as batch effects.





□ Mabs: a suite of tools for gene-informed genome assembly

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05499-3

Mabs is a genome assembly tool which optimizes parameters of genome assemblers Hifiasm and Flye. Mabs optimizes parameters of a genome assembler to make an assembly where protein-coding genes are assembled more accurately.

Mabs is able to distinguish true multicopy orthogroups from false multicopy orthogroups, because genes originating from haplotypic duplications have two times lower coverage than correctly assembled genes.





□ The longest intron rule

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560625v1

The presence of introns substantially increases the complexity of ribosomal protein gene expression as they variably slow the expression cycle, and in addition, many introns can contain non-coding RNA involved in other layers of regulation.

The localization of the longest intron in the second or third third is significantly more frequent for certain functionally related groups of genes, e.g. for DNA repair genes.





□ DAESC: Single-cell allele-specific expression analysis reveals dynamic and cell-type-specific regulatory effects

>> https://www.nature.com/articles/s41467-023-42016-9

DAESC (Differential Allelic Expression using Single-Cell data) accounts for haplotype switching using latent variables and handles sample repeat structure of single-cell data using random effects.

DAESC is based on a beta-binomial regression model and can be used for differential ASE against any independent variable, such as cell type, continuous developmental trajectories, genotype (eQTLs), or disease status.

The baseline model DAESC-BB is a beta-binomial model with individual-specific random effects that account for the sample repeat structure arising from multiple cells measured per individual inherent to single-cell data.
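
For reference, the beta-binomial distribution underlying this kind of model can be evaluated directly. The sketch covers only the probability mass function with shape parameters α and β; the regression structure and the individual-specific random effects of DAESC-BB are not modeled, and the read counts are made up.

```python
import math

def log_beta(a, b):
    """Log of the Beta function, computed with log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for K ~ BetaBinomial(n, alpha, beta).

    In an ASE setting, k could be the alternative-allele read count and n the
    total allele-specific read count for one gene in one cell.
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

# Toy example: 7 alternative-allele reads out of 10, with overdispersion.
p = beta_binomial_pmf(7, 10, alpha=2.0, beta=2.0)
```

Smaller values of α + β give heavier overdispersion relative to a plain binomial, which is why the beta-binomial is a common choice for noisy allelic counts.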

DAESC-BB can be used generally for differential ASE regardless of sample size (number of individuals, N). When the sample size is reasonably large (e.g., N ≥ 20), a full model, DAESC-Mix, accounts for both sample repeat structure and implicit haplotype phasing.





□ KmerSV: a visualization and annotation tool for structural variants using Human Pangenome derived k-mers

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561941v1

KmerSV, a new tool for SV visualization and annotation. To mediate these functions, KmerSV uses a reference sequence deconstructed into its component k-mers, each having a length of 31 bp. These reference-derived k-mers are compared to the sequence of interest.

The program maps the Pangenome or other reference 31-mers against one or multiple target sequences which can include either contigs or sequence reads.

Initially, they retrieve these k-mers via a sliding window across a segment of the reference with its coordinate information. Then, the retrieved k-mers are systematically mapped against the target.

Unique 31-mers (as defined by the reference) serve as "anchor" points in the target sequence to facilitate using k-mers with multiple coordinates. This anchoring process eliminates ambiguous k-mers and improves the visualization of complex SVs such as duplications.
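
The anchoring idea can be sketched as follows: index every reference k-mer with its coordinates, then keep only target k-mers that occur at exactly one reference position. This is an illustration rather than KmerSV's code; the function names are hypothetical, and a small k is used in the example so the toy sequences stay readable (the tool itself uses 31-mers).

```python
from collections import defaultdict

def index_kmers(reference, k=31):
    """Slide a window over the reference and record each k-mer's start positions."""
    positions = defaultdict(list)
    for start in range(len(reference) - k + 1):
        positions[reference[start:start + k]].append(start)
    return positions

def find_anchors(reference, target, k=31):
    """Return (ref_pos, target_pos) pairs for k-mers that are unique in the reference."""
    ref_index = index_kmers(reference, k)
    anchors = []
    for t_pos in range(len(target) - k + 1):
        hits = ref_index.get(target[t_pos:t_pos + k])
        if hits and len(hits) == 1:  # skip k-mers with multiple reference coordinates
            anchors.append((hits[0], t_pos))
    return anchors

reference = "ACGACGTT"  # "ACG" occurs twice, so it cannot serve as an anchor
target = "GACGTT"
anchors = find_anchors(reference, target, k=3)
```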





□ PanKmer: k-mer based and reference-free pangenome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad621/7319363

PanKmer, a non-graphical k-mer decomposition method designed to efficiently represent and analyze many forms of variation in large pangenomic datasets, with no reliance on a reference genome and no assumption of annotation.

PanKmer includes a function to calculate the number of shared k-mers between all pairs of input genomes and return them as an adjacency matrix. Subsequently, the adjacency values can be used to perform a hierarchical clustering of input genomes.
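
A minimal sketch of a shared k-mer adjacency matrix, assuming each genome is a single string and ignoring reverse complements. It does not use PanKmer's index or API; `kmer_set` and `shared_kmer_matrix` are hypothetical names.

```python
def kmer_set(sequence, k):
    """All distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def shared_kmer_matrix(genomes, k):
    """Symmetric matrix whose (i, j) entry counts k-mers shared by genomes i and j."""
    sets = [kmer_set(g, k) for g in genomes]
    n = len(sets)
    return [[len(sets[i] & sets[j]) for j in range(n)] for i in range(n)]

genomes = ["ACGTAC", "CGTACG", "TTTTTT"]
adjacency = shared_kmer_matrix(genomes, k=3)
```

The first two toy genomes contain the same four 3-mers, so they would be joined first by any hierarchical clustering run on these values.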



Oxford Nanopore

>> https://nanoporetech.com/about/events/community-meetings/ncm-2023-houston

This week is #WorldSpaceWeek! At #nanoporeconf, Sarah Castro-Wallace will share @NASA’s project to take the MinION device to Mars — which will prove invaluable if we are to discover life beyond Earth.






Focal Point.

2023-10-17 22:17:36 | Science News

(Artwork by Andrew Kramer)




□ CellPLM: Pre-training of Cell Language Model Beyond Single Cells

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560734v1

CellPLM (a novel single-Cell Pre-trained Language Model) proposes a cell language model to account for cell-cell relations. The cell embeddings are initialized by aggregating gene embeddings since gene expressions are bag-of-word features.

CellPLM leverages a new type of data, spatially-resolved transcriptomic (SRT) data, to gain an additional reference for uncovering cell-cell interactions. SRT data provides positional information for cells. Both types of data are jointly modeled by transformers.

CellPLM consists of a gene expression embedder, a transformer encoder, a Gaussian mixture model, and a batch-aware decoder. CellPLM introduces an inductive bias to overcome data quantity limitations by utilizing a Gaussian mixture as the prior distribution in the latent space.





□ SONATA: Disambiguated manifold alignment of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561049v1

SONATA represents the low-dimensional manifold structure of each single-cell dataset using a geodesic distance matrix of the cells. To do this, SONATA first constructs a weighted k-nearest neighbor (k-NN) graph of cells based on Euclidean distance.

SONATA then calculates the shortest distance between each node pair on the graph because the shortest distances approximate geodesic distances. SONATA measures the likelihood that one cell from the dataset can be substituted for another in a cross-modality alignment.
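
The geodesic approximation can be sketched in pure Python: build a symmetric k-NN graph weighted by Euclidean distance, then run an all-pairs shortest-path algorithm (Floyd-Warshall here, chosen for brevity rather than speed). The function names and the toy points are assumptions; SONATA's substitution likelihood is not covered.

```python
import math

def knn_graph(points, k):
    """Symmetric k-NN graph as a dense matrix; missing edges are infinity."""
    n = len(points)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph undirected
    return graph

def geodesic_distances(graph):
    """All-pairs shortest paths (Floyd-Warshall) on the k-NN graph."""
    n = len(graph)
    dist = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist

# Four cells along a line: the only route from the first to the last is the chain.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
geodesic = geodesic_distances(knn_graph(points, k=1))
```

On a curved manifold the shortest path through the graph follows the data rather than cutting across empty space, which is the property the geodesic matrix is meant to capture.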





□ TreePPL: A Universal Probabilistic Programming Language for Phylogenetics

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561673v1

TreePPL introduces universal probabilistic programming and extensible Monte Carlo inference to a wider audience in statistical phylogenetics. It allows practitioners to craft probabilistic programs that utilize the sophisticated Miking CorePPL inference on the back-end.

To describe the problem of tree inference in a PPL, they use stochastic recursion. The core idea is to control a recursive function using a random variable, such that successive iterations generate a valid draw from the prior probability distribution over tree space.
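
Stochastic recursion can be shown in ordinary Python, although TreePPL has its own syntax that is not reproduced here. In this sketch a random draw decides at each call whether the recursion stops (a leaf) or continues (two subtrees). The splitting probability and the depth cap are assumptions; the cap is only a safety guard and slightly truncates the distribution over trees.

```python
import random

def sample_tree(rng, split_prob=0.4, depth=0, max_depth=20):
    """Draw a random binary tree: a node splits with probability split_prob."""
    if depth >= max_depth or rng.random() >= split_prob:
        return "leaf"
    left = sample_tree(rng, split_prob, depth + 1, max_depth)
    right = sample_tree(rng, split_prob, depth + 1, max_depth)
    return (left, right)

def count_leaves(tree):
    if tree == "leaf":
        return 1
    return count_leaves(tree[0]) + count_leaves(tree[1])

rng = random.Random(1)
tree = sample_tree(rng)
n_leaves = count_leaves(tree)
```

Each run of the program corresponds to one draw from the prior over tree shapes implied by the recursion, and an inference engine would weight such draws against observed data.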





□ Graphite: painting genomes using a colored De Bruijn graph

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561343v1

Graphite starts with two graph files and a set of query identifiers. It then builds a suffix array of the queries, along with other data structures, to speed up matching. Then each sequence (i.e., "reference") is read from the graph file and mapped onto the suffix array.

Each mapping is an identical sequence shared between the queries and the reference, also called a Maximal Exact Match (MEM). Each time a MEM is found, its length is compared to previously discovered MEMs so that only the Longest MEM (LMEM) is retained.





□ PARSEC: Rationalised experiment design for parameter estimation with sensitivity clustering

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561860v1

PARSEC (PARameter SEnsitivity Clustering) uses the model architecture of the system through parameter sensitivity analysis to direct the search for informative experiment designs. PARSEC generates an 'optimal' DoE effectively.

PARSEC computes the parameter sensitivity indices (PSI) vectors at various parameter values that sample the distribution linked to parameter uncertainty. Concatenating the PSI vectors for a measurement candidate yields the composite PARSEC-PSI vector.





□ SC-Track: a robust cell tracking algorithm for generating accurate single cell lineages from diverse cell segmentations

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560639v1

SC-Track employs a hierarchical probabilistic cache-cascade model to overcome the noisy output of deep learning models. SC-Track can generate robust single cell tracks from noisy segmented cell outputs, with noise ranging from missing segmentations to false detections.

SC-Track provides smoothed classification tracks to aid the accurate classification of cellular events. SC-Track has a built-in biologically inspired cell division algorithm that can robustly assign mother-daughter associations from segmented nuclear or cellular masks.

SC-Track employs a tracking-by-detection approach, whereby detected cells are associated between frames. A TrackTree data structure was used to store the tracking relationships between each segmented cell temporally and spatially.





□ optima: an Open-source R Package for the Tapestri platform for Integrative single cell Multi-omics data Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad611/7291856

optima stores all data matrices for a single biological sample, incl. DNA (amplicon data for DNA variants), CNV, and protein. optima also stores all the metadata, incl. cell barcodes, panels of amplicon names, as well as metadata to keep track of normalization/filter status.

The first step is DNA variant data filtering with the filterVariant() function. Several factors, including sequencing depth, genotype quality, etc., are imported from the h5 file and used in this filtering step. A cell/variant will be removed if too many loci fail QC.

After filtering, the DNA data will be used for cell clone identification. To identify clones, a user may choose to use the non-supervised clustering method dbscan. The clustering result will be stored in the cell labels vector contained within the optima object.





□ FedGMMAT: Federated Generalized Linear Mixed Model Association Tests

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560753v1

FedGMMAT, a federated genetic association testing tool that utilizes a federated statistical testing approach for efficient association tests that can correct for arbitrary fixed and random effects among different collaborating sites.

FedGMMAT executes the null model fitting using a round-robin schedule among the sites wherein each site locally updates the model parameters, encrypts the intermediate results and passes them to the next site to be securely aggregated.

After the model parameters have converged, FedGMMAT fits the mixed-effect model parameters using a similar round-robin algorithm. FedGMMAT assigns the score-test statistics to each variant. The central server computes an aggregated projection matrix from all sites.





□ DegCre: Probabilistic association of differential gene expression with regulatory regions

>> https://www.biorxiv.org/content/10.1101/2023.10.04.560923v1

DegCre, a method that probabilistically associates CREs to target gene TSSs over a wide range of genomic distances. The premise of DegCre is that true CRE to DEG pairs should change in concert with one another as a result of a perturbation, such as a differentiation protocol.

DegCre is a non-parametric method that estimates an association probability for each possible pair of CRE and DEG. It considers CRE-DEG distance but avoids arbitrary thresholds. Because DegCre uses rank-order statistics, it can use various types of CRE-associated data.





□ The Bias of Using Cross-Validation in Genomic Predictions and Its Correction

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560782v1

A comprehensive examination of CV bias across various models, including Ordinary Least Squares (OLS), Generalized Least Squares (GLS), the polygenic method (i.e., LMM with its predictor gBLUP), and three regularization methods (Ridge, Lasso, and ENET).

The CVc method calculates the correction by adding the difference of covariance of the predicted dependent variable and the dependent variable in the cross-validation process with the covariance in the testing process.

To calculate the covariance, one extracts the projection matrix from the covariance, which means only linear methods with closed-form solutions can be applied to rectify the CV bias.





□ SNAIL: Adjustment of spurious correlations in co-expression measurements from RNA-Sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad610/7295542

SNAIL (Smooth-quantile Normalization Adaptation for Inference of co-expression Links) is a modified implementation of smooth quantile normalization, which uses a trimmed mean to determine the quantile distribution and applies median aggregation for genes with shared read counts.

SNAIL effectively removes false-positive associations between genes, without the need to select an arbitrary threshold or to exclude genes from the analysis.





□ simpleaf: A simple, flexible, and scalable framework for single-cell data processing using alevin-fry

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad614/7295550

simpleaf encapsulates the process of creating an expanded reference for quantification into a single command (index) and the quantification of a sample into a single command (quant). It also exposes various other functionality, and is actively being developed and expanded.

simpleaf provides a simple and flexible interface to access the state-of-the-art features provided by the alevin-fry ecosystem, tracks best practices using the underlying tools, and enables users to transparently process data with complex fragment geometry.





□ Aliro: an Automated Machine Learning Tool Leveraging Large Language Models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad606/7291858

Aliro is an easy-to-use data science assistant. It allows researchers without machine learning or coding expertise to run supervised machine learning analysis through a clean web interface.

By infusing the power of large language models (LLM), the user can interact with their data by seamlessly retrieving and executing code pulled from the LLM, accelerating automated discovery of new insights from data.

Aliro includes a pre-trained machine learning recommendation system that can help the user automate the selection of machine learning algorithms and their hyperparameters, and it provides visualization of the evaluated model and data.





□ Segzoo: a turnkey system that summarizes genome annotations

>> https://www.biorxiv.org/content/10.1101/2023.10.03.559369v1

Segzoo is a tool designed to automate various genomic analyses on segmentations obtained using Segway. It provides detailed results for each analysis and a comprehensive visualization summarizing the outcomes.

Segzoo generates segmentation-centric summary statistics using Segtools and BEDTools. Segzoo uses Go Get Data (GGD) to automatically download all required data for these analyses and produces an easy to interpret figure which reveals patterns of segmented regions.





□ GradHC: Highly Reliable Gradual Hash-based Clustering for DNA Storage Systems

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561008v1

An implementation of the Gradual Hash-based clustering algorithm for DNA storage systems. The primary strength of GradHC lies in its capability to cluster with excellent accuracy various types of designs, incl. varying strand lengths, cluster sizes, and different error ranges.

Given an input design (with potential similarity among different DNA strands), one can randomly choose a seed and use it to generate pseudo-random DNA strands matching the original design's length and input set size.

Each input strand is then XORed with its corresponding pseudo-random DNA strand, ensuring a high likelihood that the new strands are far from each other (in terms of edit distance) and do not contain repeated substrings across different input strands.
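
The XOR step can be sketched as follows. This is a minimal illustration, assuming a 2-bit encoding of bases (A=0, C=1, G=2, T=3) and Python's `random.Random` as the seeded generator; both are assumptions made here and are not taken from the GradHC paper.

```python
import random

BASES = "ACGT"  # assumed 2-bit encoding: A=0, C=1, G=2, T=3


def pseudo_random_strand(rng, length):
    """Draw one pseudo-random DNA strand of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def xor_strands(a, b):
    """XOR two equal-length strands base by base under the 2-bit encoding."""
    if len(a) != len(b):
        raise ValueError("strands must have equal length")
    return "".join(BASES[BASES.index(x) ^ BASES.index(y)] for x, y in zip(a, b))


def randomize_design(strands, seed):
    """XOR every input strand with its own seeded pseudo-random strand."""
    rng = random.Random(seed)
    masks = [pseudo_random_strand(rng, len(s)) for s in strands]
    return [xor_strands(s, m) for s, m in zip(strands, masks)], masks


# Hypothetical input design with highly similar strands.
design = ["ACGTACGT", "ACGTACGA", "ACGTACGG"]
randomized, masks = randomize_design(design, seed=42)

# XOR is its own inverse, so applying the same masks restores the design.
restored = [xor_strands(r, m) for r, m in zip(randomized, masks)]
```

Because the masks depend only on the seed and the design dimensions, storing the seed is enough to undo the transformation in this sketch.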





□ Multimodal joint deconvolution and integrative signature selection in proteomics

>> https://www.biorxiv.org/content/10.1101/2023.10.04.560979v1

A novel algorithm to estimate the proteomics cell fractions by integrating bulk transcriptome-proteome without reference proteome, implemented in R package MICSQTL.

The method enables the downstream cell-type-specific protein quantitative trait loci mapping (cspQTL) based on the mixed-cell proteomes and pre-estimated proteomics cellular composition, without the need for large-scale single cell sequencing [9] or cell sorting.





□ The DeMixSC deconvolution framework uses single-cell sequencing plus a small benchmark dataset for improved analysis of cell-type ratios in complex tissue samples

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561733v1

DeMixSC employs a benchmark dataset and an improved weighted nonnegative least-squares (WNNLS) framework to identify and adjust for genes consistently affected by technological discrepancies.

DeMixSC starts with a benchmark dataset of matched bulk and sc/snRNA-seq data with the same cell-type proportions. Pseudo-bulk mixtures are generated from the sc/sn data. DeMixSC identifies DE genes and non-DE genes between the matched real-bulk and pseudo-bulk data.





□ Afanc: a Metagenomics Tool for Variant Level Disambiguation of NGS Datasets

>> https://www.biorxiv.org/content/10.1101/2023.10.05.560444v1

Afanc, a novel metagenomic profiler which is sensitive down to species and strain level taxa, and capable of elucidating the complex pathogen profile of compound datasets.

Afanc solves the issues by carrying out species and subspecies level profiling using a novel Kraken2 report disambiguation algorithm and lineage-level profiling using a variant profiling approach.





□ Ocelli: an open-source tool for the visualization of developmental multimodal single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561074v1

Ocelli is an explainable multimodal framework to learn a low-dimensional representation of developmental trajectories. In the data preprocessing step, we find modality-specific programs with topic modeling using Latent Dirichlet Allocation.

Ocelli constructs the Multimodal Markov Chain as a weighted sum of the unimodal affinities between cells. Ocelli determines the latent space of multimodal diffusion maps (MDM) by factoring the MMC into eigenvectors and eigenvalues.
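
A minimal sketch of the weighted-sum construction, assuming one scalar weight per modality and plain row normalization to obtain transition probabilities. Ocelli's actual weighting scheme and normalization may differ; the function name and the toy matrices are hypothetical.

```python
def multimodal_markov_chain(affinities, weights):
    """Combine unimodal affinity matrices and row-normalize the result.

    affinities: list of n x n matrices (lists of lists), one per modality.
    weights: one non-negative weight per modality (simplifying assumption).
    """
    n = len(affinities[0])
    combined = [
        [sum(w * a[i][j] for a, w in zip(affinities, weights)) for j in range(n)]
        for i in range(n)
    ]
    chain = []
    for row in combined:
        total = sum(row)
        if total <= 0:
            raise ValueError("each cell needs a positive total affinity")
        # Each row becomes a probability distribution over target cells.
        chain.append([value / total for value in row])
    return chain


# Toy affinities between two cells in two hypothetical modalities.
rna = [[1.0, 1.0], [1.0, 1.0]]
atac = [[1.0, 0.0], [0.0, 1.0]]
mmc = multimodal_markov_chain([rna, atac], weights=[0.5, 0.5])
```

The eigenvectors and eigenvalues of such a row-stochastic matrix are what a diffusion-map step would then factor.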





□ AleRax: A tool for species and gene tree co-estimation and reconciliation under a probabilistic model of duplication, transfer, and loss

>> https://www.biorxiv.org/content/10.1101/2023.10.06.561091v1

AleRax, a novel probabilistic method for phylogenetic tree inference that can perform both species tree inference and reconciled gene tree inference from a sample of gene trees.

AleRax is on par with ALE in terms of reconciled gene tree accuracy, while being one order of magnitude faster and more robust to numerical errors. AleRax infers more accurate species trees than SpeciesRax and ASTRAL-Pro 2, because it can accommodate gene tree uncertainty.





□ Pindel-TD: a tandem duplication detector based on a pattern growth approach

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561441v1

Pindel-TD is a tandem duplication (TD) detection model that specifically optimizes the pattern growth approach in Pindel. It redesigns the search strategies for the minimum and maximum unique substrings for TDs of different sizes, resulting in high and robust TD detection performance.

First, they select the read-pairs in which only one read is mapped uniquely (mapped with only the 'M' character in its CIGAR string) while its mate shows a split-read.

For each selected read-pair, the mapped read with a high mapping quality is considered a reliable anchor read, which determines the search direction of the subsequent split-read analysis of the soft-clipped read.

A pattern growth approach is then applied to find the minimum and maximum unique substrings, starting from either the leftmost or the rightmost end of the unmapped read.

Next, they carefully process the split-read information to identify the TDs with accurate breakpoints. Finally, Pindel-TD removes the redundant TDs according to their length and breakpoints to get the final TD set.





□ PopGenAdapt: Semi-Supervised Domain Adaptation for Genotype-to-Phenotype Prediction in Underrepresented Populations

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561715v1

PopGenAdapt is a deep learning model that applies semi-supervised domain adaptation (SSDA) to improve genotype-to-phenotype prediction in underrepresented populations.

PopGenAdapt leverages the large amount of labeled data from well-represented populations, as well as the limited labeled and the larger amount of unlabeled data from underrepresented populations.

PopGenAdapt adapts the state-of-the-art SSDA method, Minimax Entropy (MME) with Source Label Adaptation (SLA), to genotype-to-phenotype prediction. Specifically, PopGenAdapt uses a 4-layer MLP with GELU activations, layer normalization, and a residual connection.





□ CUDASW++4.0: Ultra-fast GPU-based Smith-Waterman Protein Sequence Database Search

>> https://www.biorxiv.org/content/10.1101/2023.10.09.561526v1

CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. This approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions.

The parallelization scheme is based on computing an independent alignment for each (sub)warp. A (sub)warp consists of synchronized threads executed in lockstep, and they can communicate using warp shuffles. Within a (sub)warp, threads cooperatively compute DP matrix cell values.
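
For reference, the DP recurrence that each (sub)warp evaluates can be written as a scalar Python sketch. This is not the GPU scheme itself: it assumes a linear gap penalty and simple match/mismatch scores for brevity, whereas protein database search typically uses a substitution matrix and affine gaps.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (illustrative scoring)."""
    prev = [0] * (len(subject) + 1)  # previous DP row
    best = 0
    for q in query:
        curr = [0]  # first column is always 0 for local alignment
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            cell = max(0, diag, up, left)  # local alignment never goes below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


score = smith_waterman_score("XXACGTXX", "YYACGTYY")
```

Only two rows are kept in memory here, which mirrors the general idea of limiting memory traffic, although the actual kernel distributes cells across threads rather than iterating row by row.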





□ cgMSI: pathogen detection within species from nanopore metagenomic sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05512-9

cgMSI formulates strain identification as a maximum a posteriori (MAP) estimation problem to take both sequencing errors and genome similarity between different strains into consideration for accurate strain-typing at low abundance.

cgMSI uses the core genome, and selects candidate strains using MAP probability estimation. After that, cgMSI maps the aligned reads to the full reference genomes of the candidate strains and identifies the target strain using the second-stage MAP probability estimation.





□ Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561790v1

While many GRN platforms have been developed, a majority do not allow for perturbation analyses where a user is able to impose modifications onto a network and invoke a statistical reanalysis to learn how a phenotype might change with new sets of molecular interactions.

Multioviz enables perturbation analyses using Biologically Annotated Neural Networks (BANNs), a class of feedforward Bayesian ML models that integrate known biological relationships to perform association mapping on multiple molecular levels simultaneously.





□ SpeakEasy2: Champagne: Robust, scalable, and informative clustering for diverse biological networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03062-0

SpeakEasy 2: Champagne (SE2) retains the core approach of popularity-corrected label propagation, but aims to reach a more accurate end state. The changes increase accuracy by escaping from label configurations that become prematurely stuck in globally suboptimal states.

SE2 utilizes a common approach in dynamical systems: making larger updates to jump out of suboptimal states, specifically using clusters-of-clusters, which allow it to reach configurations that would not be attained by only updating individual nodes.

SE2 increases runtime efficiency by initializing networks with far fewer labels than nodes, updates nodes to reflect the labels most specific to their neighbors, then divides the labels when their fit to the network drops below a certain level.

This reduced number of labels actually increases the opportunity for the label assignment to become stuck in suboptimal solution-states, but the more effective meta-clustering offsets this risk.





□ GASTON: Mapping the topography of spatial gene expression with interpretable deep learning

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561757v1

GASTON (Gradient Analysis of Spatial Transcriptomics Organization with Neural networks) learns the isodepth of a tissue slice, the vector field of spatial gradients of gene expression, and spatial expression functions for individual genes directly from SRT data.

GASTON models gene expression as a piecewise linear function of the isodepth, thus describing both continuous gradients and sharp discontinuities in gene expression. GASTON reveals the geometry and continuous gene expression gradients of multiple tissues.
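
A minimal sketch of a piecewise linear expression function of the isodepth. The breakpoints, slopes, and intercepts below are hypothetical values chosen for illustration; in GASTON they are learned from the data.

```python
from bisect import bisect_right


def piecewise_linear_expression(isodepth, breakpoints, slopes, intercepts):
    """Evaluate intercepts[p] + slopes[p] * isodepth on the segment p containing isodepth."""
    if not (len(slopes) == len(intercepts) == len(breakpoints) + 1):
        raise ValueError("need one slope and one intercept per segment")
    # Segment index: number of breakpoints at or below the isodepth value.
    p = bisect_right(breakpoints, isodepth)
    return intercepts[p] + slopes[p] * isodepth


# Hypothetical gene with three spatial domains.
breakpoints = [1.0, 2.0]
slopes = [0.5, 0.0, -1.0]
intercepts = [0.0, 3.0, 4.0]

inside_first = piecewise_linear_expression(0.5, breakpoints, slopes, intercepts)
inside_second = piecewise_linear_expression(1.5, breakpoints, slopes, intercepts)
inside_third = piecewise_linear_expression(2.5, breakpoints, slopes, intercepts)
```

Within a segment the slope captures a continuous gradient, while a jump in value at a breakpoint (here from 0.5 to 3.0 at isodepth 1.0) corresponds to a sharp discontinuity between domains.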





□ sincFold: end-to-end learning of short- and long-range interactions for RNA folding

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561771v1

sincFold, an end-to-end deep learning model for RNA secondary structure prediction. Local and distant relationships can be encoded effectively using a hierarchical 1D-2D ResNet architecture, improving the state-of-the-art in RNA secondary structure prediction.

The sincFold model is based on ResNet blocks, bottleneck layers and a 1D-to-2D projection. It has proven to be better suited to identify structures that might defy traditional modeling.





□ MkcDBGAS: a reference-free approach to identify comprehensive alternative splicing events in a transcriptome

>> https://academic.oup.com/bib/article/24/6/bbad367/7313457

MkcDBGAS uses a colored de Bruijn graph with dynamic and mixed k-mers to identify bubbles generated by AS with precision higher than 98.17% and to detect AS types overlooked by other tools. MkcDBGAS uses XGBoost to increase the accuracy of classification.

By leveraging cDBG with mixed k-mers and XGBoost with added motif features, MkcDBGAS accurately predicts all seven types of AS transcriptome-wide using only transcripts. In particular, MkcDBGAS can accurately detect AS in other species, meaning that it is scalable.





□ STew: Uncover spatially informed shared variations for single-cell spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561789v1

STew is a Spatial Transcriptomic multi-viEW representation learning method that jointly characterizes gene expression variation and spatial information in a shared low-dimensional space in a scalable manner.

STew will output distinct spatially informed cell gradients, robust clusters, and statistical goodness of model fit to reveal significant genes that reflect subtle spatial niches in complex tissues.





□ dnctree: Scalable distance-based phylogeny inference using divide-and-conquer

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561902v1

dnctree, a randomized divide-and-conquer heuristic which selectively estimates pairwise sequence distances and infers a tree by connecting increasingly large subtrees. The time complexity is at worst quadratic, and seems to scale like O(n log n) on average.





□ Designing efficient randstrobes for sequence similarity analyses

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561924v1

Constructing randstrobes consists of converting strings to integers through a hash function and selecting candidate k-mers to link through a link function and a comparator operator.

Always use a hash function to hash the strobes before linking. It does not result in a large overhead in construction time while being beneficial for pseudo-randomness for most link functions.
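
The construction can be sketched for order-2 randstrobes as follows. This is an illustration only: BLAKE2b as the hash function, XOR as the link function, and "minimum" as the comparator are assumptions chosen here, and the paper compares several alternatives.

```python
import hashlib


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def link_xor(h1, h2):
    """Example link function: XOR of the two strobe hashes."""
    return h1 ^ h2


def randstrobes(seq, k, w_min, w_max, link=link_xor):
    """Return (pos1, pos2) start positions of order-2 randstrobes over seq."""
    # Hash every k-mer once, before any linking takes place.
    hashes = [hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    result = []
    for i in range(len(hashes)):
        window = range(i + w_min, min(i + w_max, len(hashes) - 1) + 1)
        if len(window) == 0:
            break  # no room left for a second strobe
        # Comparator: keep the candidate that minimizes the link value.
        j = min(window, key=lambda c: link(hashes[i], hashes[c]))
        result.append((i, j))
    return result


pairs = randstrobes("ACGTTGCATGCAAGCTTAGC", k=3, w_min=3, w_max=6)
```

Swapping `link_xor` for another two-argument function is enough to experiment with different link functions in this sketch.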




Astrolabe.

2023-10-17 22:17:33 | Science News

(Artwork by Viktor Blinnikov)




□ GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561776v1

GPN-MSA, a novel DNA language model which is designed for genome-wide variant effect prediction and is based on the biologically-motivated integration of a multiple-sequence alignment (MSA) across diverse species using the flexible Transformer architecture.

GPN-MSA is trained with a weighted cross-entropy loss, designed to downweight repetitive elements and up-weight conserved elements. As data augmentation in non-conserved regions, prior to computing the loss, the reference is sometimes replaced by a random nucleotide.
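
A weighted cross-entropy of this kind can be sketched as below. The per-position weights are hypothetical placeholders (small for a repetitive position, large for a conserved one); the actual weighting formula used by GPN-MSA is not reproduced here.

```python
import math


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target) over positions.

    probs: one dict per position mapping nucleotide -> predicted probability.
    targets: the true nucleotide at each position.
    weights: one non-negative weight per position (illustrative values).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    loss = 0.0
    for p, t, w in zip(probs, targets, weights):
        loss += w * -math.log(p[t])
    return loss / total


# Two toy positions with hypothetical model outputs.
probs = [{"A": 0.5, "C": 0.5}, {"A": 0.25, "C": 0.75}]
targets = ["A", "C"]

uniform = weighted_cross_entropy(probs, targets, weights=[1.0, 1.0])
# Down-weighting the second position to zero leaves only the first term.
first_only = weighted_cross_entropy(probs, targets, weights=[1.0, 0.0])
```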





□ DEMINING: A deep learning model embedded framework to distinguish DNA and RNA mutations directly from RNA-seq

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562625v1

DEMINING incorporated a deep learning model named DeepDDR, which achieved the differentiation of expressed DMs from RMs directly from aligned RNA-seq reads. DEMINING uncovered previously-underappreciated DMs and RMs in unannotated AML-associated gene loci.

DEMINING employs the Light Gradient Boosting Machine (LightGBM), Logistic Regression and Random Forest, RNN and a hybrid of CNN+RNN. DeepDDR with two layers of CNN and the CNN+RNN hybrid model demonstrated comparable performance.





□ scIBD: a self-supervised iterative-optimizing model for boosting the detection of heterotypic doublets in single-cell chromatin accessibility data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03072-y

scIBD, a scCAS-specific self-supervised iterative-optimizing method to boost the detection of heterotypic doublets. As a simulation-based method, scIBD discards the routine random selection strategy that may yield excessive homotypic doublets in the simulation process.

scIBD uses an adaptive strategy to simulate high-confidence heterotypic doublets and self-supervise for doublet detection. scIBD adopts an iterative-optimizing strategy to detect the heterotypic doublets iteratively and finally outputs doublet scores based on an ensemble strategy.





□ CellContrast: Reconstructing Spatial Relationships in Single-Cell RNA Sequencing Data via Deep Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2023.10.12.562026v1

cellContrast, a deep-learning method that employs a contrastive learning framework for spatial relationship reconstruction. The fundamental assumption is that GE profiles can be projected into a latent space, where physically proximate cells demonstrate higher similarities.

cellContrast employs a contrastive framework of an encoder-projector. During inference, cellContrast discards the projector and uses the output of the encoder for spatial reconstruction, based on the principle that higher cosine similarity indicates shorter spatial distance.
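
The inference principle can be illustrated with a small cosine-similarity ranking. The embeddings below are hypothetical stand-ins for encoder outputs; the encoder itself is not modeled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, references):
    """Return reference ids ordered from most to least similar to the query."""
    return sorted(
        references,
        key=lambda name: cosine_similarity(query, references[name]),
        reverse=True,
    )


# Hypothetical encoder outputs for three reference cells and one query cell.
references = {"cell_a": [1.0, 0.0], "cell_b": [0.0, 1.0], "cell_c": [1.0, 1.0]}
query = [1.0, 0.1]
ranking = rank_by_similarity(query, references)
```

Under the stated principle, cells appearing earlier in `ranking` would be interpreted as spatially closer to the query cell.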





□ sharp: Automated calibration of consensus weighted distance-based clustering approaches

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad635/7320014

The proposed consensus weighted clustering is controlled by two hyper-parameters, including the regularisation parameter and the number of clusters.

These two hyper-parameters are calibrated jointly in a grid search maximising the sharp score, a novel score measuring clustering stability from (weighted) consensus clustering outputs.

The assumption that co-membership probabilities are the same for all pairs of items within a given consensus cluster or between a given pair of consensus clusters, respectively, constitutes a potential limitation of the sharp score.





□ Assessing the limits of zero-shot foundation models in single-cell biology

>> https://www.biorxiv.org/content/10.1101/2023.10.16.561085v1

Geneformer and scGPT exhibit limited reliability in zero-shot settings and often underperform compared to simpler methods. These findings serve as a cautionary note for the deployment of proposed single-cell foundation models.

scGPT defaults to predicting the median bin when only given access to gene embeddings. Masked language modeling (MLM) is not effective at learning gene embeddings, which would also impact Geneformer, given that it produces a cell embedding by averaging over gene embeddings.





□ Relational Composition of Physical Systems: A Categorical Approach

>> https://arxiv.org/abs/2310.06088

The fact that each quadratic form has a unique signature despite the diagonalizing basis being non-unique is analogous to how each finite-dimensional vector space has a unique dimension, although the basis that proves that the vector space has a given dimension is non-unique.

Dirac diagrams, a novel notation inspired by both bond graphs and string diagrams. They describe the syntax and semantics of Dirac diagrams. We can construct a category of vector spaces with quadratic forms using the Grothendieck construction.






□ scTab: Scaling cross-tissue single-cell annotation models

>> https://www.biorxiv.org/content/10.1101/2023.10.07.561331v1

scTab is an automated, feature-attention-based cell type prediction model specific to tabular data, trained using a novel data augmentation scheme across a large corpus of single-cell RNA-seq observations (22.2 million human cells in total).

scTab leverages deep ensembles for uncertainty quantification. Moreover, we account for ontological relationships between labels in the model evaluation to accommodate for differences in annotation granularity across datasets.

The adapted TabNet architecture for scTab consists of two key building blocks: The first building block is the feature transformer, which is a multi-layer perceptron with batch normalization (BN), skip connections, and a gated linear unit nonlinearity (GLU).





□ scPoli: Population-level integration of single-cell datasets enables multi-scale analysis across samples

>> https://www.nature.com/articles/s41592-023-02035-2

scPoli, an open-world learner that incorporates generative models to learn sample and cell representations for data integration, label transfer and reference mapping.

scPoli introduces two modifications to the CVAE architecture. These modifications are the replacement of OHE vectors with continuous vectors of fixed dimensionality to represent the conditional term, and the usage of cell type prototypes to enable label transfer.





□ Hifieval: Evaluation of haplotype-aware long-read error correction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad631/7321114

Hifieval compares the alignment of the raw read and the alignment of the corrected read. Hifieval evaluates phased assemblies and can distinguish under-corrections and over-corrections.

Hifieval calculates three metrics: correct corrections (CC), errors that are in raw reads but not in corrected reads; under-corrections (UC), errors present in both raw and corrected reads; and over-corrections (OC), new errors found in corrected reads but not in raw reads.
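
The three metrics map directly onto set operations. The sketch below assumes that errors have already been extracted from the two alignments and are represented as hashable (position, type) tuples; this representation is an assumption made for illustration, not Hifieval's internal format.

```python
def classify_corrections(raw_errors, corrected_errors):
    """Split errors into correct corrections, under-corrections and over-corrections."""
    raw, corrected = set(raw_errors), set(corrected_errors)
    return {
        "CC": raw - corrected,   # fixed: in raw reads only
        "UC": raw & corrected,   # missed: in both raw and corrected reads
        "OC": corrected - raw,   # introduced: in corrected reads only
    }


# Hypothetical errors as (reference position, error type).
raw_errors = {(101, "sub"), (250, "ins"), (400, "del")}
corrected_errors = {(250, "ins"), (512, "sub")}
metrics = classify_corrections(raw_errors, corrected_errors)
```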





□ AtaCNV: Detecting copy number variations from single-cell chromatin sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.10.15.562383v1

AtaCNV generates a single-cell read count matrix over genomic bins of 1 million base pairs. Cells and genomic bins are filtered according to bin mappability and number of zero entries. AtaCNV smooths the count matrix by fitting a one-order dynamic linear model for each cell.

AtaCNV normalizes the smoothed count data against those of normal cells to deconvolute copy number signals from other confounding factors. AtaCNV clusters the cells and identifies a group of high confidence normal cells and normalizes the data against their smoothed depth data.

AtaCNV applies the multi-sample BIC-seq algorithm to jointly segment all single cells and estimates the copy number ratios for each cell in each segment. CNV burden scores are also derived and cells with high CNV scores are regarded as malignant cells.





□ BatchEval Pipeline: Batch Effect Evaluation Workflow for Multiple Datasets Joint Analysis

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561465v1

BatchEval Pipeline performs Min-Max normalization and logarithmic mapping preprocessing on each spot/cell gene expression levels and integrates multiple batches of gene expression data into low-dimensional representations.

BatchEval Pipeline employs the Kruskal-Wallis H test to evaluate the variation in the average level of gene expression across different tissue sections and performs variance analysis on gene expression total counts for each tissue section.





□ TEclass2: Classification of transposable elements using Transformers

>> https://www.biorxiv.org/content/10.1101/2023.10.13.562246v1

TEclass2, a new architecture based on the Longformer model for the classification of selected TE sequences, including various sequence-specific augmentations, a k-mer specialized tokenizer, and sliding window dilation.

TEclass2 is an all-in-one classifier that can be used to rapidly predict TE orders and superfamilies using TE models built upon the Transformer architecture. For TE DNA sequences, TEclass2 uses only the encoder-block, followed by a classification head as in a linear layer.





□ SPACO: Dimension Reduction by Spatial Components Analysis Improves Pattern Detection in Multivariate Spatial Data

>> https://www.biorxiv.org/content/10.1101/2023.10.12.562016v1

SPACO (Spatial Component Analysis), a proximity-aware kernel method for spatial data. By replacing PCA's global variance target with Moran's I, a measure of local (co)variance, SPACO constructs an ordered sequence of basis vectors, the spatial components (SpaC).

Orthogonal data projection onto the first k SpaCs maximises Moran's I, thereby pooling evidence of spatial dependence across genes with similar patterns. This enhances the sensitivity and spatial precision of the signal.
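
For orientation, the standard global Moran's I of a single feature can be computed as in the sketch below. It uses the textbook definition with a user-supplied spatial weight matrix; SPACO's kernel and its multivariate formulation are not reproduced, and the toy neighbourhood graph is hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for one feature given an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    denom = sum(d * d for d in dev)
    if total_weight == 0 or denom == 0:
        raise ValueError("need non-zero weights and non-constant values")
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / total_weight) * num / denom


# Four spots on a line; neighbouring spots have weight 1.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([0, 0, 1, 1], chain)       # spatially coherent pattern
alternating = morans_i([0, 1, 0, 1], chain)  # checkerboard-like pattern
```

Positive values indicate that neighbouring spots carry similar values, which is the kind of signal a Moran's I objective rewards.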





□ CAAStools: a toolbox to identify and test Convergent Amino Acid Substitutions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad623/7319365

CAAStools, a toolbox to identify and validate CAAS in a phylogenetic context. CAAStools implements different testing strategies through bootstrap analysis. CAAStools is designed to be included in parallel workflows and is optimized to allow scalability at proteome level.





□ Semla: A versatile toolkit for spatially resolved transcriptomics analysis and visualization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad626/7319366

semla, a toolbox for data processing, exploration, analysis, and visualization of spatial gene expression patterns in tissues. Semla takes advantage of the tidyverse framework for data handling and the patchwork framework for customizable visualization.

semla requires data generated with the Visium Gene Expression profiling platform, including expression matrices, histological images and spot coordinate files produced with the 10x Genomics Space Ranger pipeline.





□ Ggkegg: analysis and visualization of KEGG data utilizing grammar of graphics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad622/7319364

ggkegg to extend these packages. ggkegg retrieves information such as the KEGG PATHWAY and MODULE, formats them into a structure that is easy to analyze, and offers a series of functions for further analyses and visualization.

ggkegg can also be viewed as an extension of ggplot2, an R package that deconstructs graphical components and composes images as grammar of graphics and serves as the foundation for visualization in numerous publications on bioinformatics.





□ GeneSegNet: a deep learning framework for cell segmentation by integrating gene expression and imaging

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03054-0

GeneSegNet makes a joint use of gene spatial coordinates and imaging information for cell segmentation, and is recursively learned by alternating between the optimization of network parameters and estimation of training labels for noise-tolerant training.

GeneSegNet exploits both imaging information and spatial locations of RNA reads for cell segmentation, based on a general U-Net architecture. U-Net downsamples convolutional features several times and then reversely upsamples them in a mirror-symmetric manner.





□ scHiCDiff: Detecting Differential Chromatin Interactions in Single-cell Hi-C Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad625/7320006

scHiCDiff, a novel statistical software tool, which applied two non-parametric tests (KS and CVM) and two parametric models (NB and ZINB) to distinguish the bin pairs showing significant changes in contact frequencies between two groups of scHi-C data.

scHiCDiff detects DCIs. Each scHi-C dataset is imputed by a Gaussian convolution filter to tackle the sparsity issue, then processed by scHiNorm with the Negative Binomial Hurdle option to remove systematic biases, and finally normalized for the cell-specific genomic distance effect.





□ iLSGRN: Inference of large-Scale Gene Regulatory Networks based on multi-model fusion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad619/7321113

iLSGRN reconstructs large-scale GRNs from steady-state and time-series GE data based on nonlinear ODEs. The regulatory gene recognition algorithm calculates the Maximal Information Coefficient and excludes redundant regulatory relationships to achieve dimensionality reduction.

The feature fusion algorithm constructs a model leveraging the feature importance derived from XGBoost and Random Forest models, which can effectively train the nonlinear ODEs model of GRNs and improve the accuracy and stability of the inference algorithm.





□ scLinaX: Quantification of the escape from X chromosome inactivation with the million cell-scale human single-cell omics datasets reveals heterogeneity of escape across cell types and tissues

>> https://www.biorxiv.org/content/10.1101/2023.10.14.561800v1

scLinaX directly quantifies relative gene expression from the inactivated X chromosome with droplet-based scRNA-seq data. scLinaX-multi, an extension for the multiome (RNA + ATAC) dataset to evaluate the escape at the chromatin accessibility level.

First, pseudobulk allele-specific expression profiles are generated for cells expressing each candidate reference SNP. Then, alleles of the reference SNPs on the same X chromosome are listed by correlation analysis of the pseudobulk ASE profiles.

scLinaX determines which X chromosome is inactivated in each cell based on the allelic expression of the reference SNPs, generates a nearly completely skewed XCI condition in silico, and estimates the ratio of expression from Xi.





□ Asterics: a simple tool for the ExploRation and Integration of omiCS data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05504-9

ASTERICS is designed to make both standard and complex exploratory and integration analysis workflows easily available to biologists and to provide high quality interactive plots.

ASTERICS allows the integration of multiple omics, i.e., it includes exploratory analysis able to explain the typology of individuals described by omics and/or characters simultaneously obtained at different levels of the living organisms.





□ AIWrap: Artificial Intelligence based wrapper for high dimensional feature selection

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05502-x

AIWrap, a novel Artificial Intelligence based Wrapper algorithm. The algorithm predicts the performance of an unknown feature subset using an AI model, referred to here as the Performance Prediction Model (PPM).

The performance of AIWrap is evaluated and compared with standard algorithms like LASSO, Adaptive LASSO (ALASSO), Group LASSO (GLASSO), Elastic net (Enet), Adaptive Elastic net (AEnet) and Sparse Partial Least Squares (SPLS) for both the simulated datasets and real data studies.





□ GENEPT: A SIMPLE BUT HARD-TO-BEAT FOUNDATION MODEL FOR GENES AND CELLS BUILT FROM CHATGPT

>> https://www.biorxiv.org/content/10.1101/2023.10.16.562533v1

GenePT demonstrates that LLM embedding of literature is a simple and effective path for biological foundation models. GenePT achieves comparable, and often better, performance than Geneformer and other methods.

GenePT generates single-cell embeddings in two ways: (i) by averaging the gene embeddings, weighted by each gene’s expression level; or (ii) by creating a sentence embedding for each cell, using gene names ordered by the expression level.
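
Both constructions can be sketched in a few lines. The gene embeddings below are hypothetical two-dimensional vectors rather than real LLM embeddings, and dropping unexpressed genes from the cell sentence is an assumption made here.

```python
def weighted_cell_embedding(expression, gene_embeddings):
    """(i) Expression-weighted average of gene embeddings for one cell."""
    dim = len(next(iter(gene_embeddings.values())))
    total = sum(level for gene, level in expression.items() if gene in gene_embeddings)
    if total <= 0:
        raise ValueError("cell has no expressed gene with an embedding")
    cell = [0.0] * dim
    for gene, level in expression.items():
        if gene not in gene_embeddings:
            continue  # genes without an embedding are skipped
        for d, value in enumerate(gene_embeddings[gene]):
            cell[d] += level * value / total
    return cell


def cell_sentence(expression):
    """(ii) Gene names ordered by decreasing expression, to be embedded as text."""
    expressed = [gene for gene, level in expression.items() if level > 0]
    return " ".join(sorted(expressed, key=lambda g: expression[g], reverse=True))


# Hypothetical embeddings and expression levels for one cell.
gene_embeddings = {"CD3E": [1.0, 0.0], "MS4A1": [0.0, 1.0]}
expression = {"CD3E": 3.0, "MS4A1": 1.0, "GAPDH": 0.0}

embedding = weighted_cell_embedding(expression, gene_embeddings)
sentence = cell_sentence(expression)
```

In the second variant the resulting string would be passed to a text embedding model; that call is left out because it depends on an external service.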





□ TDS: Privacy-Preserving Federated Genome-wide Association Studies via Dynamic Sampling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad639/7323577

TDS (Two-Step Dynamic Sampling), a new efficient, privacy-preserving federated GWAS framework. In the first phase, local parties collaboratively identify loci in their local data that are not significantly associated.

This phase substantially curbs computation and communication costs by removing a large number of non-significant loci from subsequent analysis.

In the second phase, all the local parties iteratively share portions of their private datasets with the server. The server performs GWAS on the pooled data and returns the results to the local parties.





□ GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03067-9

The concept of “Grade of Membership Differential Expression” (GoM DE) builds upon existing methods to analyze differential expression. By extending these established techniques, we can explore a variety of cell features beyond just discrete cell populations.

The authors investigate the question of how to interpret the individual dimensions of a parts-based representation learned by fitting a topic model (in the topic model, the dimensions are also called “topics”).

The GoM DE analysis yields much larger LFC estimates of the cell-type-specific genes. This is because the topic model isolates the biological processes related to cell type while removing background biological processes that do not relate to cell type.





□ SPIRAL: integrating and aligning spatially resolved transcriptomics data across different experiments, conditions, and technologies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03078-6

SPIRAL effectively integrates data in both feature space (low-dimensional embeddings and high-dimensional gene expressions) and physical space.

SPIRAL combines gene expressions and spatial relationships in the consecutive processes of batch effect removal and coordinate alignment by employing graph-based domain adaptation and cluster-aware Gromov-Wasserstein optimal transport.





□ DIVE: a reference-free statistical approach to diversity-generating and mobile genetic element discovery

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03038-0

DIVE is a novel reference-free algorithm designed to identify sequences that cause genetic diversification such as transposable elements, within MGE variability hotspots, or CRISPR repeats. DIVE operates directly on sequencing reads and does not rely on a reference genome.

DIVE turns this logic into a statistical algorithm: it aims to find anchors whose neighboring sequences are statistically highly diverse. DIVE processes each read sequentially, using a sliding window to construct target dictionaries for each anchor encountered in each read.
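
The sliding-window construction of target dictionaries can be sketched roughly as follows. This is an illustration under stated assumptions, not DIVE's implementation: anchors and targets are taken to be k-mers, the target is the k-mer immediately downstream of the anchor (after an optional `gap`), and diversity is scored crudely as the number of distinct targets rather than with DIVE's actual statistic. All function names and parameters are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_target_dictionaries(
    reads: Iterable[str], k: int = 4, gap: int = 0
) -> Dict[str, Counter]:
    """Map each anchor k-mer to counts of the k-mers found downstream of it."""
    targets: Dict[str, Counter] = defaultdict(Counter)
    for read in reads:
        # Slide a window over the read; each valid position yields one anchor.
        last_start = len(read) - (2 * k + gap)
        for start in range(last_start + 1):
            anchor = read[start:start + k]
            target_start = start + k + gap
            target = read[target_start:target_start + k]
            targets[anchor][target] += 1
    return dict(targets)


def most_diverse_anchors(
    targets: Dict[str, Counter], min_count: int = 2
) -> List[Tuple[int, str]]:
    """Rank anchors by number of distinct targets (a crude diversity score)."""
    scored = [
        (len(counter), anchor)
        for anchor, counter in targets.items()
        if sum(counter.values()) >= min_count
    ]
    return sorted(scored, reverse=True)


# Toy reads: "ACGT" is followed by three different sequences, "TTTT" by one.
toy_reads = ["ACGTAAAA", "ACGTCCCC", "ACGTGGGG", "TTTTACGA", "TTTTACGA"]
dictionaries = build_target_dictionaries(toy_reads, k=4, gap=0)
ranked = most_diverse_anchors(dictionaries)
```

In this toy input the anchor "ACGT" ranks first because three distinct targets follow it, whereas "TTTT" is always followed by the same sequence.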





□ stVAE deconvolves cell-type composition in large-scale cellular resolution spatial transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad642/7325351

stVAE employs a variational encoder-decoder framework to decompose cell-type mixtures for cellular resolution spatial transcriptomic data. stVAE is scalable to large-scale datasets and has a shorter running time.

stVAE constructs a pseudo-spatial transcriptomic dataset to guide its training on the small spatial transcriptomic dataset. stVAE can accurately capture the sparsity of cell-type composition in the spots of cellular resolution spatial transcriptomic data.





□ SEM: sized-based expectation maximization for characterizing nucleosome positions and subtypes

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562727v1

SEM (Size-based Expectation Maximization) is a new nucleosome-calling package. SEM analyzes the overall fragment size distribution to determine which types of nucleosomes are detectable within a given MNase-seq dataset.

SEM employs a hierarchical Gaussian mixture model to accurately estimate the locations and occupancy properties of nucleosomes and to assign subtype identities to each detected nucleosome.
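
To give a feel for mixture modeling of fragment sizes, here is a minimal sketch that fits a plain two-component, one-dimensional Gaussian mixture by expectation maximization. It is not SEM's hierarchical model and does not estimate positions or occupancy; the function names, the starting means, the initial standard deviation, and the toy fragment sizes are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_size_mixture(
    sizes: Sequence[float],
    init_means: Tuple[float, float],
    init_sd: float = 20.0,
    n_iter: int = 100,
) -> Tuple[List[float], List[float], List[float]]:
    """Fit a two-component 1D Gaussian mixture with EM.

    Returns (weights, means, standard deviations).
    """
    means = list(init_means)
    sds = [init_sd, init_sd]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each fragment size.
        resp = []
        for x in sizes:
            dens = [weights[j] * normal_pdf(x, means[j], sds[j]) for j in range(2)]
            total = sum(dens)
            if total == 0.0:
                resp.append([0.5, 0.5])  # numerical fallback
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and spread of each component.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue
            weights[j] = nj / len(sizes)
            means[j] = sum(r[j] * x for r, x in zip(resp, sizes)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, sizes)) / nj
            sds[j] = max(math.sqrt(var), 1.0)  # floor keeps a component from collapsing
    return weights, means, sds


# Toy fragment sizes (bp): one group near 120 bp, one near 147 bp.
fragment_sizes = [118, 120, 122, 119, 121, 145, 147, 149, 146, 148]
weights, means, sds = fit_size_mixture(fragment_sizes, init_means=(110.0, 150.0))
```

On well-separated toy data like this, the two fitted means settle near the two size groups; real MNase-seq fragment distributions overlap far more, which is one reason a richer model is needed in practice.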





□ MOAL: Multi-Omic Analysis at Lab. A simplified methodology workflow to make reproducible omic bioanalysis.

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562686v1

MOAL (Multi Omic Analysis at Lab) is an R package that includes an omic() function, which automates most classical tasks. MOAL automates the bioanalysis corresponding to biostatistics and functional integration procedures.

For annotation tasks, symbols are automatically re-annotated using synonym checking to avoid information loss. MOAL also integrates the NCBI orthologs gene database to open functional enrichment analysis to species that have identified ortholog genes in human.





□ OMICmAge: An integrative multi-omics approach to quantify biological age with electronic medical records

>> https://www.biorxiv.org/content/10.1101/2023.10.16.562114v1

The authors develop EMRAge, a robust, predictive biological aging phenotype that balances clinical biomarkers with overall mortality risk and can be broadly recapitulated across EMRs.

Subsequently, they applied elastic-net regression to model EMRAge with DNA-methylation (DNAm) and multiple omics, generating DNAmEMRAge and OMICmAge, respectively.





□ CRAQ: Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement

>> https://www.nature.com/articles/s41467-023-42336-w

CRAQ (Clipping information for Revealing Assembly Quality) is a reference-free tool that maps raw reads back to assembled sequences to identify regional and structural assembly errors based on effective clipped alignment information.

CRAQ can identify assembly errors at different scales and transform error counts into corresponding assembly quality indicators (AQIs) that reflect assembly quality at the regional and structural levels.
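
The idea of using clipped alignments as error evidence can be illustrated with a small sketch that parses SAM-style CIGAR strings and counts how many reads are clipped at each assembly position. This is not CRAQ's algorithm: the function names, the `min_clipped` threshold, and the toy alignments are assumptions, and CRAQ's criteria for "effective" clipping and its AQI computation are not reproduced here.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the assembly


def clip_breakpoints(pos: int, cigar: str) -> List[int]:
    """Return 1-based assembly positions at which a read alignment is clipped.

    `pos` is the 1-based leftmost aligned position, as in the SAM POS field.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    breakpoints = []
    if ops and ops[0][1] in "SH":
        breakpoints.append(pos)  # left clip: first aligned base
    if len(ops) > 1 and ops[-1][1] in "SH":
        breakpoints.append(pos + ref_len - 1)  # right clip: last aligned base
    return breakpoints


def candidate_error_sites(
    alignments: Iterable[Tuple[int, str]], min_clipped: int = 2
) -> List[int]:
    """Positions supported by at least `min_clipped` clipped reads."""
    counts: Counter = Counter()
    for pos, cigar in alignments:
        counts.update(clip_breakpoints(pos, cigar))
    return sorted(p for p, c in counts.items() if c >= min_clipped)


# Toy alignments as (POS, CIGAR) pairs; two reads are clipped at position 149.
toy_alignments = [(100, "50M20S"), (120, "30M40S"), (150, "25S45M"), (10, "70M")]
sites = candidate_error_sites(toy_alignments, min_clipped=2)
```

A recurring breakpoint shared by several reads is more suspicious than an isolated clip, which is why the sketch thresholds on the number of supporting reads.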