lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre. (Long is the time, yet the true comes to pass.)

Atlas.

2023-10-31 22:33:37 | Science News

(Art by carlhauser)



□ scDiff: A General Single-Cell Analysis Framework via Conditional Diffusion Generative Models

>> https://www.biorxiv.org/content/10.1101/2023.10.13.562243v1

scDiff enables extensive conditioning strategies. Besides LLMs and GNNs, we can enhance scDiff with other guidance methods, like CLIP. scDiff can be promptly extended to multiomics or multi-modality tasks.

scDiff uses a conditional diffusion generative model to approximate the posterior with a Markov chain. scDiff shows outstanding few-shot and zero-shot results, and outperforms GEARS across all metrics and datasets except the MSE on Norman.
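
A minimal numpy sketch of one reverse (denoising) step of a conditional diffusion model's Markov chain, of the general kind described above; `eps_model`, its conditioning argument, and the noise schedule are hypothetical placeholders, not scDiff's actual interface.

```python
import numpy as np

def reverse_step(x_t, t, cond, eps_model, betas, rng):
    """One ancestral-sampling step x_t -> x_{t-1} of a conditional DDPM.

    eps_model(x_t, t, cond) stands in for a conditional noise predictor
    (e.g. guided by an LLM/GNN embedding of the condition).
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps = eps_model(x_t, t, cond)                     # predicted noise
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z               # add posterior noise

# usage: iterate t = T-1 ... 0, starting from pure Gaussian noise
```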






□ SecDATA: Secure Data Access and de novo Transcript Assembly protocol - To meet the challenge of reliable NGS data analysis

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564229v1

SecDATA, an optimized pipeline for de novo transcript assembly that adopts a blockchain-based strategy. The major focus lies on implementing a pipeline that (a) accesses secured data with the help of DLT and (b) performs de novo transcript sequence reconstruction.

The "Optimized length" represents the minimum number of nodes traversed for building all transcripts i.e. minimum path length for transcript construction. SecDATA uses overlaps in k-mers to determine which k-mer pairs are adjacent in the read sequences.

SecDATA uses Ethereum techniques. SecDATA encompasses blocks or nodes, which are connected through a network. The nodes communicate through a secure channel and use the hash value as a key.





□ DeepGenomeVector: Towards AI-designed genomes using a variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2023.10.22.563484v1

DeepGenomeVector can learn the basic genetic principles underlying genome composition. In-depth functional analysis of a generated genome vector suggests that it is near-complete and encodes largely intact pathways that are interconnected.

DeepGenomeVector involves training a generative variational autoencoder, consisting of three layers, with a latent representation size of 100 neurons. The model was trained to optimize the sum of binary cross-entropy loss and Kullback-Leibler divergence.
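
A minimal PyTorch sketch of the training objective described above (binary cross-entropy plus KL divergence) for a VAE with a 100-dimensional latent space; the class name and layer sizes are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenomeVAE(nn.Module):
    def __init__(self, n_genes, latent=100, hidden=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, latent), nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_genes))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(logits, x, mu, logvar):
    bce = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld   # sum of BCE and KL divergence, as in the description

# usage:
# model = GenomeVAE(n_genes=5000)          # x: binary gene-presence vectors
# logits, mu, logvar = model(x)
# loss = vae_loss(logits, x, mu, logvar)
```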





□ Deep DNAshape: Predicting DNA shape considering extended flanking regions using a deep learning method

>> https://www.biorxiv.org/content/10.1101/2023.10.22.563383v1

Deep DNAshape overcomes the limitation of DNAshape, particularly its reliance on a query-table lookup. This advancement is pivotal, given that the limitation was imposed solely by the amount of available data.

Deep DNAshape enhances the capability to discern how the shape at the center of a pentamer region is influenced by its extended flanking regions, providing a model that offers a more accurate representation of DNA.

Deep DNAshape can process a given DNA sequence as a string of characters (A, C, G and T) and predict any specific DNA shape for each nucleotide position of a sequence. Deep DNAshape predicts DNA shape and shape fluctuations considering extended flanking influences without biases.





□ NetREm: Network Regression Embeddings reveal cell-type transcription factor coordination for gene regulation

>> https://www.biorxiv.org/content/10.1101/2023.10.25.563769v1

NetREm incorporates information from prior biological networks to improve predictions and identify complex relationships among predictors (e.g. TF-TF coordination: direct/indirect interactions among TFs).

NetREm can highlight important nodes and edges in the network, reveal novel regularized embeddings for genes. NetREm employs Singular Value Decomposition (SVD) to create new latent space gene embeddings, which are then used in a Lasso regression model to predict TG expression.
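
A toy sketch (not NetREm itself) of the two-stage idea: build latent embeddings via SVD, then fit a Lasso regression on them to predict target-gene (TG) expression. The data and dimensions are invented.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))               # samples x TF expression (toy data)
y = X[:, 0] * 2.0 + rng.normal(size=200)     # toy TG expression

# SVD-based latent embedding of the (network-informed) predictor matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = U[:, :10] * s[:10]                       # 10-dimensional latent space

model = Lasso(alpha=0.1).fit(Z, y)           # sparse regression on embeddings
print(model.coef_)
```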





□ eaDCA: Towards Parsimonious Generative Modeling of RNA Families

>> https://www.biorxiv.org/content/10.1101/2023.10.19.562525v1

eaDCA (Edge Activation Direct Coupling Analysis) starts from an empty coupling network and systematically constructs a non-trivial network from scratch, rather than starting with a fully connected network and subsequently simplifying it.

eaDCA operates more swiftly than starting with a fully connected model, leading to generative Potts models. By employing analytical likelihood maximization, it allows normalized sequence probabilities to be tracked easily and entropies to be estimated throughout the network-building process.





□ BTR: A Bioinformatics Tool Recommendation System

>> https://www.biorxiv.org/content/10.1101/2023.10.13.562252v1

Bioinformatics Tool Recommendation system (BTR) models workflow construction as a session-based recommendation problem and leverages emergent graph neural network technologies to enable a workflow graph representation that captures extensive structural context.

BTR represents the workflow as a directed graph. A variant of the system is constrained to employ linear sequence representations for the purpose of comparison with other methods.

BTR takes the input query as a sequence: each tool instance is encoded by an initial embedding layer, and the initial embeddings are passed to a Gated Graph Neural Network to learn contextual features from neighboring nodes using the full workflow graph.

An attention mechanism aggregates the latent graph node embeddings into a full workflow representation, which is concatenated with the representation of the last tool and transformed to yield the final workflow representation vector.
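
A small numpy sketch of the aggregation step just described: attention pooling over node embeddings followed by concatenation with the last tool's embedding. The scoring vector stands in for BTR's learned attention layers.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def workflow_representation(node_emb, last_tool_emb, w):
    """node_emb: (n_nodes, d); last_tool_emb: (d,); w: (d,) attention scoring vector."""
    scores = node_emb @ w                      # attention logits per node
    alpha = softmax(scores)                    # attention weights
    graph_vec = alpha @ node_emb               # weighted sum over nodes
    return np.concatenate([graph_vec, last_tool_emb])  # final workflow vector

rng = np.random.default_rng(1)
emb = rng.normal(size=(6, 8))
print(workflow_representation(emb, emb[-1], rng.normal(size=8)).shape)  # (16,)
```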





□ DPAMSA: Multiple sequence alignment based on deep reinforcement learning with self-attention and positional encoding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad636/7323576

DPAMSA (Deep reinforcement learning with Positional encoding and self-Attention for MSA) is based on deep reinforcement learning (DRL). DPAMSA combines natural language processing technology and deep reinforcement learning in MSA.

DPAMSA is mainly based on progressive column alignment, and the sub-alignment of each column is calculated step by step. Then all sub-alignments are spliced into a complete alignment.

DPAMSA inserts a gap according to the current sequence state. A Deep Q-Network (DQN) is the deep reinforcement learning model; its Q-network is divided into positional encoding, self-attention, and multi-layer perceptron modules.





□ Derived ∞-categories as exact completions

>> https://arxiv.org/abs/2310.12925

A finitely complete ∞-category is exact and additive if and only if it is prestable, extending a classical characterization of abelian categories.

In the ∞-categorical setting, the connection between ∞-topoi, finitary Grothendieck topologies, and coherent ∞-topoi was studied, where it is proven that small hypercomplete ∞-topoi are in correspondence with hypercomplete coherent and locally coherent ∞-topoi.





□ Cyclone: Open-source package for simulation and analysis of finite dynamical systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad634/7323572

While there are software packages that analyze Boolean, ternary, or other multi-state models, none compute the complete state space of function-based models over any finite set.

Cyclone simulates the complete state space for an input finite dynamical system and finds all attractors (steady states and limit cycles). Cyclone takes as input functions over any finite set and outputs the entire state space or single trajectories.





□ MultiXrank: Random Walk with Restart on multilayer networks: from node prioritisation to supervised link prediction and beyond

>> https://www.biorxiv.org/content/10.1101/2023.10.18.562848v1

MultiXrank, a Random Walk with Restart algorithm able to explore generic multilayer networks. They define a generic multilayer network as a multilayer network composed of any number and combination of multiplex and monoplex networks connected by bipartite interaction networks.

In this multilayer framework, all the networks can also be weighted and/or directed. MultiXrank outputs scores representing a measure of proximity between the seed(s) and all the nodes of the multilayer network. MultiXrank scores can be used to compute diffusion profiles.





□ LexicHash: Sequence Similarity Estimation via Lexicographic Comparison of Hashes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad652/7329717

LexicHash, a new approach to pairwise sequence similarity estimation that combines the sketching strategy of MinHash with a lexicographic-based hashing scheme.

LexicHash is similar to MinHash in that distinct hash functions are used to create sketches of a sequence by storing the vector of minimum hash values over all k-mers in the sequence.

However, the k-value used in LexicHash actually corresponds to a maximum match length Kmax, and the hashing scheme maintains the ability to capture any match-length below the chosen Kmax.

LexicHash can identify variable-length substring matches between reads from their sketches. The sketches are also constructed in such a way that, to compare sketches, we can traverse the sketches position-by-position, as with the MinHash sketch.
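
For orientation, a minimal MinHash-style sketch constructor in the spirit of the comparison above (one minimum per hash function); LexicHash's lexicographic hashing and variable-length matching are deliberately not reproduced here.

```python
import hashlib

def hash_kmer(kmer, seed):
    digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def sketch(seq, k=16, n_hashes=64):
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # one minimum hash value per hash function -> fixed-size sketch
    return [min(hash_kmer(km, s) for km in kmers) for s in range(n_hashes)]

def estimated_similarity(sk_a, sk_b):
    # fraction of positions with equal minima approximates the Jaccard similarity
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

a = "ACGT" * 20
b = "ACGT" * 18 + "TTTTTTTT"
print(estimated_similarity(sketch(a), sketch(b)))
```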





□ gtfsort: a tool to efficiently sort GTF files

>> https://www.biorxiv.org/content/10.1101/2023.10.21.563454v1

gtfsort, a sorting tool that utilizes a lexicographically-based index ordering algorithm. gtfsort not only outperforms similar tools such as GFF3sort or AGAT but also provides a more natural, ordered, and user-friendly perspective on GTF structure.

gtfsort utilizes multiple layers to efficiently write transcript blocks: an outer layer for the highest-level hierarchy, an inner layer for lower-level hierarchies, and a transcript-mapper layer responsible for managing isoforms and their associated features for a given gene.

Each line in the GTF file is parsed and grouped according to its feature, aligning with the specific layer-dependent data flow, including genes, transcripts, and lower-level hierarchies.





□ Back to sequences: find the origin of kmers

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564040v1

back_to_sequences extracts, from a set of sequences, those that contain some of the k-mers given as input, and counts the number of occurrences of each such k-mer. The k-mers can be considered in their original, reverse-complemented, or canonical form.

back_to_sequences uses native Rust data structures (HashMap) to index and query k-mers. Sequence filtration is based on the minimal and maximal percentage of k-mers shared with the indexed set.
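
A simplified Python analogue of this behaviour (the tool itself is written in Rust): index canonical k-mers in a hash map, then keep reads whose percentage of shared k-mers lies within the given bounds.

```python
COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)                            # canonical = lexicographic minimum

def build_index(query_kmers):
    return {canonical(k): 0 for k in query_kmers}   # k-mer -> occurrence count

def filter_read(read, index, k, min_pct=0.0, max_pct=100.0):
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    hits = [km for km in kmers if canonical(km) in index]
    for km in hits:
        index[canonical(km)] += 1                   # count occurrences per indexed k-mer
    pct = 100.0 * len(hits) / max(len(kmers), 1)
    return min_pct <= pct <= max_pct                # keep read if within bounds

idx = build_index(["ATGCG", "CGTAC"])
print(filter_read("TTATGCGTACAA", idx, k=5, min_pct=10))
```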

On the GenOuest node, back_to_sequences retrieved all reads containing at least one of the indexed k-mers in 5 min 17 s, with a negligible RAM usage of 45 MB.

They ran back_to_sequences on the full read set, composed of ~26.3 billion k-mers and 381 million reads, again searching for the 69 k-mers contained in its first read. This operation took 20 min 11 s.





□ Bioinfo-Bench: A Simple Benchmark Framework for LLM Bioinformatics Skills Evaluation

>> https://www.biorxiv.org/content/10.1101/2023.10.18.563023v1

BIOINFO-BENCH, a bioinformatics evaluation suite to thoroughly assess LLMs' advanced knowledge and problem-solving abilities in bioinformatics scenarios. The authors conduct experiments evaluating state-of-the-art LLMs including ChatGPT, Llama, and Galactica on BIOINFO-BENCH.

These LLMs excel in knowledge acquisition, drawing heavily upon their training data for retention. However, their proficiency in addressing practical professional queries and conducting nuanced knowledge inference remains constrained.






□ Scalable genetic screening for regulatory circuits using compressed Perturb-seq

>> https://www.nature.com/articles/s41587-023-01964-9

An alternative approach to greatly increase the efficiency and power of Perturb-seq for both single and combinatorial perturbation screens, inspired by theoretical results from compressed sensing that apply to the sparse and modular nature of regulatory circuits in cells.

To elaborate, perturbation effects tend to be ‘sparse’, in that most perturbations affect only a small number of genes or co-regulated gene programs.

In this scenario, we can measure a much smaller number of random combinations of perturbations and accurately learn the effects of individual perturbations from the composite samples using sparsity-promoting algorithms.





□ PhyloES: An evolution strategy approach for the Balanced Minimum Evolution Problem

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad660/7331089

PhyloES, a novel heuristic that sets a new reference in approximating optimal solutions to the Balanced Minimum Evolution Problem (BMEP). PhyloES works around this problem by making the search of the solution space nondeterministic.

PhyloES first generates a new set of solutions to the problem by using local search strategies similar to those implemented in FastME. Subsequently, PhyloES stochastically recombines the new phylogenies so obtained by means of the so-called ES operator.

The two phases, iterated local search and recombination, allow the whole BMEP solution space to be spanned, enabling potential convergence to the optimum over a sufficiently long run.





□ SeQual-Stream: approaching stream processing to quality control of NGS datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05530-7

SeQual-Stream relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS.

These operations are grouped into three different categories depending on the functionality they provide: (1) single filters, responsible for discarding input sequences that do not meet a certain criterion (e.g. sequence length), evaluating each sequence independently of the others;

(2) trimmers, operations that trim certain sequence bases at the beginning or end; and (3) formatters, operations to change the format of the input dataset (DNA to RNA). SeQual-Stream can receive as input single- or paired-end datasets, supporting FASTQ and FASTA formats.





□ cubeVB: Variational Bayesian Phylogenies through Matrix Representation of Tree Space

>> https://www.biorxiv.org/content/10.1101/2023.10.19.563180v1

cubeVB uses a symmetric matrix with dimension equal to the number of taxa and applies a hierarchical clustering algorithm, such as single-linkage clustering, to obtain a tree with internal node heights specified by the values of the matrix.

The entries in the matrix form a Euclidean space; however, there are many ways to represent the same tree, so the transformation is not bijective.

By restricting ourselves to the 1-off-diagonal entries of the matrix (and leaving the rest at infinity), the transformation becomes a bijection, but it can no longer represent all possible trees.

cubeVB captures the most interesting part of posterior tree space using this approach. cubeVB is a variational Bayesian algorithm based on MCMC, and a well-calibrated simulation study shows that it can recover parameters of interest such as tree height and length.
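
A tiny SciPy illustration of the matrix-to-tree mapping described above: single-linkage clustering of a symmetric matrix whose 1-off-diagonal entries encode internal node heights. It is a conceptual sketch, not the cubeVB implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

n_taxa = 4
BIG = 1e9                                 # stands in for "infinity" off the 1-off-diagonal
M = np.full((n_taxa, n_taxa), BIG)
heights = [0.3, 0.7, 1.2]                 # 1-off-diagonal entries parameterize the tree
for i, h in enumerate(heights):
    M[i, i + 1] = M[i + 1, i] = h
np.fill_diagonal(M, 0.0)

# single-linkage clustering turns the matrix into a rooted tree with these node heights
Z = linkage(squareform(M, checks=False), method="single")
tree = to_tree(Z)
print(tree.dist)                          # root height (1.2 here)
```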





□ LAVASET: Latent Variable Stochastic Ensemble of Trees. A novel ensemble method for correlated datasets

>> https://www.biorxiv.org/content/10.1101/2023.10.20.563223v1

LAVASET derives latent variables based on the distance characteristics of each feature and thereby incorporates the correlation factor into the splitting step. Hence, it inherently groups correlated features and ensures similar importance assignment for them.

LAVASET addresses a major limitation in the interpretation of feature importance of Random Forests when the data are collinear, such as is the case for spectroscopic and imaging data. LAVASET can perform on different types of omics data, from 1D to 3D.





□ ULTRA: Towards Foundation Models for Knowledge Graph Reasoning

>> https://arxiv.org/abs/2310.04562

ULTRA, a method for unified, learnable, and transferable Knowledge Graph (KG) representations that leverages the invariance of the relational structure and employs relative relation representations on top of this structure for parameterizing any unseen relation.

ULTRA constructs a graph of relations (where each node is a relation from the original graph) capturing their interactions. ULTRA obtains a unique relative representation of each relation. It enables zero-shot generalization to any other KG of any size and any relation.





□ PerFSeeB: designing long high-weight single spaced seeds for full sensitivity alignment with a given number of mismatches

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05517-4

PerFSeeB is based on designing periodic blocks. For a given number of mismatches, the resulting spaced seeds are guaranteed to find all positions within a reference sequence. Each periodic seed consists of an integer number of periodic blocks and a “remainder”.

Those blocks can be used to generate spaced seeds required for any given read length. The best periodic seeds are seeds of maximum possible weight, since this helps reduce the number of candidate positions when aligning reads to the reference sequence.
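
A small sketch of what full sensitivity means operationally: a spaced seed (1 = must match, 0 = don't care) detects an alignment with given mismatch positions if, at some offset, none of its '1' positions coincides with a mismatch. The helper and the example block are hypothetical, not PerFSeeB output.

```python
def seed_hits(seed, read_len, mismatch_positions):
    """True if the spaced seed detects the read for this placement of mismatches."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    for offset in range(read_len - len(seed) + 1):
        if all(offset + i not in mismatch_positions for i in care):
            return True
    return False

# a periodic seed built from repeated blocks plus a "remainder"
block = "1110100"
seed = block * 3 + "111"
print(seed_hits(seed, read_len=40, mismatch_positions={5, 19, 33}))
```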





□ Benchmarking algorithms for joint integration of unpaired and paired single-cell RNA-seq and ATAC-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03073-x

The incorporation of multiome data improves the cell type annotation accuracy of scRNA-seq and snATAC-seq data when there are a sufficient number of cells in the multiome data to reveal cell type identities.

When generating a multiome dataset, the number of cells is more important than sequencing depth for cell type annotation. Seurat v4 is the best at integrating scRNA-seq, snATAC-seq, and multiome data even in the presence of complex batch effects.





□ OMEinfo: Global Geographic Metadata for -omics Experiments

>> https://www.biorxiv.org/content/10.1101/2023.10.23.563576v1

OMEinfo leverages open data sources such as the Global Human Settlement Layer, Köppen-Geiger climate classification models, and Open-Data Inventory for Anthropogenic Carbon dioxide, to ensure metadata accuracy and provenance.

OMEinfo's Dash application enables users to visualise their sample metadata on an interactive map and to investigate the spatial distribution of metadata features, which is complemented by data visualisation to analyse patterns and trends in the geographical data.





□ MGA-seq: robust identification of extrachromosomal DNA and genetic variants using multiple genetic abnormality sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03081-x

MGA-Seq (multiple genetic abnormality sequencing) simultaneously detects structural variation, copy number variation, single-nucleotide polymorphism, homogeneously staining regions, and extrachromosomal DNA (ecDNA) from a single tube.

MGA-Seq directly sequences proximity-ligated genomic fragments, yielding a dataset with concurrent genome three-dimensional and whole-genome sequencing information, enabling approximate localization of genomic structural variations and facilitating breakpoint identification.





□ Open MoA: Revealing the Mechanism of Action (MoA) based on Network Topology and Hierarchy

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad666/7334463

Open MoA assigns confidence scores to edges that represent connections between genes/proteins in the integrated network. The interactions with the highest confidence scores may indicate potential drug targets and reveal the underlying molecular MoAs.

Open MoA reveals the MoA of a repositioned drug (JNK-IN-5A) that modulates PKLR expression in HepG2 cells, identifying STAT1 as the key transcription factor.

With transcriptomic data, by inputting the known starting point and endpoints, Open MoA can output a confidence score for each interaction in the context-specific subnetworks, leading to the identification of the most probable pathway.





□ ORI-Explorer: A unified cell-specific tool for origin of replication sites prediction by feature fusion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad664/7334464

ORI-Explorer, a unique AI-based technique that combines multiple feature-engineering techniques to train a CatBoost classifier for recognizing ORIs from four distinct eukaryotic species.

ORI-Explorer was created by utilizing a unique combination of three traditional feature-encoding techniques and a feature set obtained from a deep-learning neural network model.

ORI-Explorer uses 4 different feature descriptors, where one is extracted using the distinctive neural network architecture while the other 3 are composition k-spaced nucleic acid pairs, Parallel Correlation Pseudo Dinucleotide Composition and Dinucleotide-based Cross Covariance.

These features are concatenated and given to SHapley Additive exPlanations (SHAP) to select the most important features, which are then used by CatBoost to predict ORI regions.
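
A hedged sketch of that selection-then-classification step using the public catboost and shap packages; the data, feature count, and the top-20 cutoff are invented for illustration and are not the paper's pipeline.

```python
import numpy as np
from catboost import CatBoostClassifier
import shap

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 60))              # concatenated feature descriptors (toy)
y = rng.integers(0, 2, size=300)            # ORI vs non-ORI labels (toy)

model = CatBoostClassifier(iterations=200, verbose=False).fit(X, y)
sv = np.asarray(shap.TreeExplainer(model).shap_values(X))

# keep the features with the highest mean |SHAP| and retrain on them
importance = np.abs(sv).reshape(-1, X.shape[1]).mean(axis=0)
top = np.argsort(importance)[::-1][:20]
model_top = CatBoostClassifier(iterations=200, verbose=False).fit(X[:, top], y)
```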





□ High-fidelity (repeat) consensus sequences from short reads using combined read clustering and assembly

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564123v1

The presented repeat assembly workflow uses clustering and assembly tools to create informed consensus sequences from repeats to answer a wide variety of questions.

They use the term "informed consensus" to suggest that the derived sequences are not mere averages or sequence profiles, but that they have been carefully constructed using relevant data and analysis.





□ DENIS: Uncovering uncharacterized binding of transcription factors from ATAC-seq footprinting data

>> https://www.biorxiv.org/content/10.1101/2023.10.26.563982v1

DENIS (DE Novo motIf diScovery) i) isolates UBM events from ATAC-seq data, ii) performs de novo motif generation, iii) calculates information content, motif novelty and quality parameters, and iv) characterizes de novo motifs through open chromatin enrichment analysis.

DENIS is designed to robustly explore DNA binding events on a global scale, to compare ATAC-seq datasets from one or multiple conditions, and is suitable to be applied to any organism.

DENIS merges very similar motifs found in multiple iterations and continues with the consensus motif. DENIS generated a total of 141 motifs over 26 iterations, which were finally merged to 30 unique motifs.





□ CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03088-4

CHESS 3 takes a stricter approach to including genes and transcripts than other human gene catalogs. CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites.

CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes.




Andromeda.

2023-10-31 22:31:13 | Science News

(Art by carlhauser)




□ Nexus: Pan-genome de Bruijn graph using the bidirectional FM-index

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05531-6

Nexus, a memory-efficient representation of the colored compacted de Bruijn graph enabling subgraph visualization and lossless approximate pattern matching of reads to the graph, developed to store pan-genomes.

Nexus provides other functionalities (such as visualization) next to read alignment. In contrast to a k-mer hash table, both the A4 algorithm by Beller and Ohlebusch and Nexus are based on a full-text index of the concatenation of all input genomes.






□ VRP Assembler: haplotype-resolved de novo assembly of diploid and polyploid genomes using quantum computing

>> https://www.biorxiv.org/content/10.1101/2023.10.19.563028v1

VRP assembler, a haplotype assembly method that combines both the phasing and assembly processes into a single optimization model. It enables the optimization procedure to be solved on quantum annealers as well as gate-based quantum computers to harness potential quantum acceleration.

The core system in quantum annealing for VRP is a time-dependent Hamiltonian of the transverse-field Ising model. The reconstructed sequences exactly match the original sequences with zero Hamming distance in all runs.

The VRP assembler has demonstrated its potential and feasibility through a proof of concept on short synthetic diploid and triploid genomes using a D-Wave quantum annealer.





□ Rosace: a robust deep mutational scanning analysis framework employing position and mean-variance shrinkage

>> https://www.biorxiv.org/content/10.1101/2023.10.24.562292v1

Rosace, the first growth-based Deep Mutational Scanning method that incorporates local positional information. Its companion simulation framework, Rosette, recapitulates several properties of DMS data such as bimodality, similarities in behavior across similar substitutions, and the overdispersion of counts.

Rosace uses Rosette to simulate several screening modalities. Rosace implements a hierarchical model that parameterizes each variant's effect as a function of the positional effect, providing a way to incorporate both position-specific information and shrinkage into the model.





□ AAMB: Adversarial and variational autoencoders improve metagenomic binning

>> https://www.nature.com/articles/s42003-023-05452-3

AAMB (Adversarial Autoencoders for Metagenomic Binning), an extension of the VAMB program. AAMB leverages AAEs to yield more accurate bins than VAMB’s VAE-based approach.

In AAMB, tetranucleotide frequencies (TNF) and per-sample co-abundances are extracted from the contigs and from BAM files of reads mapped to contigs, and are input to the AAMB model as a concatenated vector. AAMB uses both a continuous and a categorical latent space.
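
A minimal sketch of the tetranucleotide-frequency (TNF) part of that input vector; canonicalization of reverse complements and the exact projection used by VAMB/AAMB are omitted.

```python
from itertools import product
import numpy as np

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]   # 256 raw tetramers
INDEX = {t: i for i, t in enumerate(TETRAMERS)}

def tnf(contig):
    counts = np.zeros(len(TETRAMERS))
    for i in range(len(contig) - 3):
        idx = INDEX.get(contig[i:i + 4])
        if idx is not None:                  # skip tetramers containing N
            counts[idx] += 1
    return counts / max(counts.sum(), 1)     # frequencies, later concatenated with abundances

print(tnf("ATGCGCGTATGCGN" * 10).shape)      # (256,)
```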





□ A Safety Framework for Flow Decomposition Problems via Integer Linear Programming

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad640/7325350

mfd-safety is a tool reporting maximal safe paths for minimum flow decompositions (MFD) using Integer Linear Programming (ILP) calls, and implementing several optimizations to reduce the number of ILP calls or their size (number of variables/constraints).

The weighted precision of a graph is computed as the average weighted precision over all reported paths in the graph, and the maximum coverage of a graph as the average maximum coverage over all ground-truth paths in the graph.

Two algorithms find all maximal safe paths. Both use a similar approach; the first is top-down, starting from the original full solution paths, reporting all safe paths, and then trimming the unsafe paths to find new maximal safe paths.





□ aMeta: an accurate and memory-efficient ancient metagenomic profiling workflow

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03083-9

aMeta, an accurate metagenomic profiling workflow for ancient DNA designed to minimize the amount of false discoveries and computer memory requirements. aMeta consumed nearly half as much computer memory as Heuristic Operations for Pathogen Screening.

aMeta combines taxonomic classification with KrakenUniq and performs alignments with the MALT aligner. The main advantage of MALT, and the motivation for using it in aMeta, is that MALT is a metagenomics-specific aligner that applies the LCA algorithm.





□ Capricorn: Enhancing Hi-C contact matrices for loop detection with Capricorn, a multi-view diffusion model

>> https://www.biorxiv.org/content/10.1101/2023.10.25.564065v1

They hypothesize that resolution enhancement can produce contact matrices that can better capture these higher-order chromatin structures if we design a loss function that explicitly models structures like loops and TADs during resolution enhancement.

Capricorn incorporates additional biological views of the contact matrix to emphasize important chromatin interactions and leverages powerful computer vision diffusion models for the model backbone.

Capricorn learns a diffusion model that enhances a five-channel image, containing both the primary Hi-C matrix as well as representations of TADs, loops, and distance-normalized counts computed from the original low-resolution matrix.





□ dsRID: in silico identification of dsRNA regions using long-read RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad649/7328386

dsRID detects dsRNA regions in an editing-agnostic manner. dsRID is built upon previous observations by the authors and others that dsRNA structures may induce region skipping in RNA-seq reads, an artifact likely reflecting intra-molecular template switching during reverse transcription.

dsRNAs are potent triggers of innate immune responses upon recognition by cytosolic dsRNA sensor proteins. Identification of endogenous dsRNAs is critical to better understand the dsRNAome and its relevance to innate immunity related to human diseases.





□ TBLMM: Bayesian linear mixed model with multiple random effects for prediction analysis on high-dimensional multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad647/7330404

TBLMM (a two-step Bayesian Linear mixed model for predictive modeling of multi-omics data) uses BLMM-based integrative framework to fuse multiple designated kernel functions, which can account for heterogeneous effects and interactions, into one kernel for each genomic region.

TBLMM uses random-effect terms to capture within- and cross-omics interactions, where the variance-covariance is modeled using three non-linear kernels: a polynomial kernel with 2 degrees of freedom, the neural network kernel, and the Hadamard product between linear kernels.





□ Equivariant flow matching

>> https://arxiv.org/abs/2306.15030

A novel flow matching objective designed for invariant densities, yielding optimal integration paths. Additionally, they introduce a new invariant dataset of alanine dipeptide and a large Lennard-Jones cluster.

The resulting Boltzmann Generator is capable of producing samples from the equilibrium Boltzmann distribution of a molecule in Cartesian coordinates. The method exploits the physical symmetries of the target energy for simulation-free training of equivariant continuous normalizing flows.





□ Flow-Lenia: Towards open-ended evolution in cellular automata through mass conservation and parameter localization

>> https://arxiv.org/abs/2212.07906

Some spatially localized patterns (SLPs) resemble life-like artificial creatures and display complex behaviors. However, those creatures are found in only a small subspace of the Lenia parameter space and are not trivial to discover, necessitating advanced search algorithms.

Flow Lenia can integrate the parameters of the Cellular Automata update rules within the CA dynamics, allowing for multi-species simulations, with locally coherent update rules that define properties of the emerging creatures and that can be mixed with neighbouring rules.





□ CONE: COntext-specific Network Embedding via Contextualized Graph Attention

>> https://www.biorxiv.org/content/10.1101/2023.10.21.563390v1

The core component of CONE consists of a graph attention network with contextual conditioning, and it is trained in a noise contrastive fashion using contextualized interactome random walks localized around contextual genes.

CONE contains two main components: a GNN decoder and an MLP context encoder. The GNN decoder converts the raw, learnable node embeddings into the final embeddings.

On the other hand, the MLP context encoder projects the context-specific similarity profile that describes the relationships among different contexts into a condition embedding.

When added with the raw embeddings, the condition embedding serves as a high-level contextual semantics, similar to the widely-used positional encodings in Transformer models.





□ GNorm2: an improved gene name recognition and normalization system

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad599/7329714

GNorm2 integrates a range of advanced deep learning-based methods, resulting in the highest levels of accuracy and efficiency for gene recognition and normalization to date.

GNorm2 utilizes the Transformer-based infrastructure to recognize gene names mentioned in free text instead of Conditional Random Fields.

Bioformer is a language model based on the BERT architecture that is tailored for biomedical text mining. It employs a specialized vocabulary and reduces the model size by 60% compared to the original BERT, making it much more computationally efficient.





□ GRAIGH: Gene Regulation accessibility integrating GeneHancer database

>> https://www.biorxiv.org/content/10.1101/2023.10.24.563720v1

GRAIGH, a novel computational approach to interpret scATAC-seq features and understand the information they provide. GRAIGH aims to integrate scATAC-seq datasets with the GeneHancer database, which describes genome-wide enhancer-to-gene and promoter-to-gene associations.

These associations have unique identifiers which have the potential to overcome one of the limitations of the scATAC-seq data, thus enabling interoperability of datasets obtained from different experiments.

GRAIGH is validated by comparing the results obtained from the GH matrix data with the original scATAC-seq data, showing the integration does not introduce any significant biases.





□ CoreDetector: A flexible and efficient program for core-genome alignment of evolutionary diverse genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad628/7329718

CoreDetector generates a multiple core-genome alignment for closely and more distantly related genomes. The single longest genome with the fewest ambiguous bases (non-ATGC) is initially selected as the query from the pool of genomes for pairwise alignment using Minimap2.

CoreDetector scaled computationally from a smaller diploid fungal pathogen to larger rodent and hexaploid plant genomes without the need for high-performance computing (HPC) resources, even in the case of the larger and more diverse rodent dataset.





□ Pumping the brakes on RNA velocity by understanding and interpreting RNA velocity estimates

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03065-x

Deconstructing the underlying workflow by separating the (gene-level) velocity estimation from the vector field visualization. Their findings reveal a significant dependence of the RNA velocity workflow on smoothing via the k-nearest-neighbors (k-NN) graph of the observed data.

They analyzed how the methods for mapping and visualizing the vector field impact the interpretation of RNA velocity and discover the central role played by the k-NN graph in both velocity estimation and vector field visualization.





□ The Quartet Data Portal: integration of community-wide resources for multiomics quality control

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03091-9

The Quartet Data Portal facilitates community access to well-characterized reference materials / datasets, and related resources. Users can request DNA, RNA, protein, and reference materials, as well as datasets generated across omics, platforms, labs, protocols, and batches.

The Quartet Data Portal uses a “distribution-collection-evaluation-integration” closed-loop workflow. Continuous requests for reference materials by the community will generate large amounts of data from the Quartet reference samples under different platforms and labs.





□ AMAS: An Automated Model Annotation System for SBML Models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad658/7330406

AMAS may produce an empty prediction set. This occurs if the query element is rejected by the Element Filter. It also occurs if the largest match score for the query element is smaller than the match score cutoff.

AMAS calculates the similarity between two species based on the similarity of strings associated with the two species. For the query species, the preferred string is the SBML display name, if it exists.





□ deltaXpress (ΔXpress): a tool for mapping differentially correlated genes using single-cell qPCR data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05541-4

ΔXpress uses cycle threshold (Ct) values and categorical information for each sample. ΔXpress emulates a bulk analysis by observing differentially expressed genes. It allows the discovery of pairwise genes differentially correlated when comparing two experimental conditions.

ΔXpress uses the NormFinder algorithm. The NormFinder algorithm produces two gene lists (single and paired) with their respective stability values. ΔXpress uses the best pair of genes to calculate the mean value per sample and normalizes all genes using the Livak method.
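
As a reminder of the Livak (2^-ΔΔCt) normalization referred to above, a few lines of Python; the Ct values are invented for illustration.

```python
import numpy as np

def livak_fold_change(ct_target, ct_ref, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression: 2^-((Ct_target - Ct_ref) - (Ct_target,ctrl - Ct_ref,ctrl))."""
    delta_ct = ct_target - ct_ref                   # normalize to reference gene(s)
    delta_ct_ctrl = ct_target_ctrl - ct_ref_ctrl    # same in the control condition
    return 2.0 ** -(delta_ct - delta_ct_ctrl)

# the reference Ct may be the mean of the best pair of stable genes chosen by NormFinder
print(livak_fold_change(ct_target=24.1, ct_ref=np.mean([18.2, 18.6]),
                        ct_target_ctrl=26.3, ct_ref_ctrl=np.mean([18.4, 18.5])))
```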





□ BIDARA: Bio-Inspired Design and Research Assistant (NASA)

>> https://www1.grc.nasa.gov/research-and-engineering/vine/petal/

BIDARA can guide users through the Biomimicry Institute’s Design Process, a step-by-step method to propose biomimetic solutions using Generative AI. This process includes defining the problem, biologizing the challenge, discovering natural models, and emulating the strategies.





□ Foundation Models Meet Imbalanced Single-Cell Data When Learning Cell Type Annotations

>> https://www.biorxiv.org/content/10.1101/2023.10.24.563625v1

Benchmarking foundation models, scGPT, scBERT, and Geneformer, for cell-type annotation. scGPT, using FlashAttention, has the fastest computational speed, whereas scBERT is much more memory-efficient.

Notably, in contrast to scGPT and scBERT, Geneformer uses ordinal positions of the tokenized genes rather than actual raw gene expression values. Random oversampling, but not random undersampling, improved the performance for all three foundation models.





□ JIVE: Joint and Individual Variation Explained: Batch-effect correction in single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.10.25.563973v1

JIVE, a multi-source dimension reduction method that decomposes two or more biological datasets into three low-rank approximation components: a joint structure among the datasets, individual structures unique to each distinct dataset, and residual noise.

The JIVE decomposition estimates the joint and individual structures by minimizing the sum of squared error of the residual matrix.

Given an initial estimate for the joint structure, it finds the individual structures to minimize the sum of squared error. Then, given the new individual structures, it finds a new estimate for the joint structure which minimizes the sum of squared error.

The original R JIVE code utilizes full singular value decompositions (SVD) in many places; this implementation instead uses a partial SVD function that returns only the largest singular values/vectors of a given matrix.
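
A compact numpy sketch of the alternating, rank-constrained estimation described above; ranks, convergence checks, and the orthogonality constraints of real JIVE are simplified away.

```python
import numpy as np

def low_rank(M, r):
    """Best rank-r approximation (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def jive(blocks, r_joint=2, r_indiv=2, n_iter=50):
    """Toy alternating estimation; blocks are (features_i x samples) with shared samples."""
    X = np.vstack(blocks)                          # stack feature blocks over shared samples
    sizes = [b.shape[0] for b in blocks]
    J = np.zeros_like(X)
    for _ in range(n_iter):
        J_blocks = np.split(J, np.cumsum(sizes)[:-1], axis=0)
        # individual structures minimize SSE given the current joint estimate
        A = [low_rank(b - j, r_indiv) for b, j in zip(blocks, J_blocks)]
        # joint structure minimizes SSE given the new individual estimates
        J = low_rank(X - np.vstack(A), r_joint)
    return J, A

rng = np.random.default_rng(0)
blocks = [rng.normal(size=(30, 40)), rng.normal(size=(20, 40))]
J, A = jive(blocks)
print(J.shape, [a.shape for a in A])   # (50, 40) [(30, 40), (20, 40)]
```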





□ Flash entropy search to query all mass spectral libraries in real time

>> https://www.nature.com/articles/s41592-023-02012-9

Public repositories of metabolomics mass spectra encompass more than 1 billion entries. With open search, dot product or entropy similarity, comparisons of a single tandem mass spectrometry spectrum take more than 8 h.

Flash entropy search speeds up calculations more than 10,000 times to query 1 billion spectra in less than 2 s, without loss in accuracy. It benefits from using multiple threads and GPU calculations.
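
For reference, the unweighted spectral entropy similarity that this line of work builds on, in a few lines of numpy (assuming the two spectra are already aligned on common m/z bins; peak matching by m/z tolerance is omitted).

```python
import numpy as np

def shannon_entropy(intensities):
    p = intensities / intensities.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def entropy_similarity(spec_a, spec_b):
    """spec_a, spec_b: intensity vectors aligned on the same m/z bins."""
    a = spec_a / spec_a.sum()
    b = spec_b / spec_b.sum()
    s_ab = shannon_entropy((a + b) / 2)            # entropy of the merged spectrum
    return 1.0 - (2.0 * s_ab - shannon_entropy(a) - shannon_entropy(b)) / np.log(4)

a = np.array([100.0, 30.0, 0.0, 5.0])
b = np.array([90.0, 25.0, 10.0, 0.0])
print(entropy_similarity(a, b))
```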






□ Linked-Pair Long-Read Sequencing Strategy for Targeted Resequencing and Enrichment

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564243v1

A linked-pair sequencing strategy. This approach relies on generating library-sized DNA fragments from long DNA molecules such that the 300-1000 bp at the ends of the adjacent DNA fragments are duplicated.

A long contiguous DNA molecule is non-randomly fragmented into many smaller fragments in such a way that the ends of adjacent fragments share identical sequences of up to 1,000 bp, called linkers or linker sequences.

The sequencing library constructed using these fragments maintains the contiguity of reads through the tandem duplicated sequences at fragment ends and improves the sequencing efficiency of targeted regions.





□ Rank and Select on Degenerate Strings

>> https://arxiv.org/abs/2310.19702

Recently, Alanko et al. generalized the rank-select problem to degenerate strings, where given a character c and position i the goal is to find either the ith set containing c or the number of occurrences of c in the first i sets.

The problem has applications to pangenomics; in another work by Alanko et al. they use it as the basis for a compact representation of de Bruijn Graphs that supports fast membership queries.

They revisit the rank-select problem on degenerate strings, providing reductions to rank-select on regular strings. Plugging in standard data structures, they improve the time bounds for queries exponentially while essentially matching, or improving, the space bounds.





□ CluStrat: Structure-informed clustering for population stratification in association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05511-w

CluStrat, which corrects for complex arbitrarily structured populations while leveraging the linkage disequilibrium induced distances between genetic markers. It performs an agglomerative hierarchical clustering using the Mahalanobis distance covariance matrix of the markers.

The regularized Mahalanobis distance-based GRM used in CluStrat has a straightforward yet possibly not widely recognized connection with the leverage and cross-leverage scores, which becomes particularly interesting when applied to the genotype matrix.





□ Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05539-y

SKiM performs LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship.

The knowledge graph, built by extracting biomedical entities and relationships from PubMed abstracts with ML, is queried for the A–B and B–C relationships. If these are found in the database, the relationships that SKiM found are annotated.





□ Matrix and analysis metadata standards (MAMS) to facilitate harmonization and reproducibility of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531314v1

MAMS captures the relevant information about the data matrices. MAMS defines fields that describe what type of data is contained within a matrix, relationships between matrices, and provenance related to the algorithm that created the matrix.

Feature and observation matrices (FOMs) contain biological data at different stages of processing, including reduced-dimensional representations. Metadata fields for the other classes are also defined in MAMS. Fields are included to denote whether an ID is a compound ID separated by a delimiter.





□ MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad651/7335842

MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. MedCPT re-ranker is trained with the negative distribution sampled from the pre-trained MedCPT retriever.

MedCPT contains a query encoder (QEnc), a document encoder (DEnc), and a cross-encoder (CrossEnc). The query encoder and document encoder compose the MedCPT retriever, which is contrastively trained on 255M query-article pairs and in-batch negatives from PubMed search logs.





□ Next-generation phenotyping: Introducing phecodeX for enhanced discovery research in medical phenomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad655/7335839

phecodeX, an expanded version of phecodes with a revised structure and 1,761 new codes. PhecodeX adds granularity to phenotypes in key disease domains that are underrepresented in the current phecode structure.

PhecodeX 1) aligns its structure with the ICD-10 coding system, 2) revises the phecode labeling system, 3) leverages multi-mapping of both ICD-9 and -10 codes, 4) removes exclude ranges used to define controls, and 5) reorganizes phecode categories.





Oxford Nanopore

>> http://nanoporetech.com/about-us/news/blog-oxford-nanopore-meets-apples-m3-silicon-chip-hailing-new-era-distributed-genome

Today @Apple highlighted how their M3 silicon chip provides powerful, accessible compute — citing the ability to run the complex analysis required for DNA/RNA #nanopore sequencing, by anyone, anywhere in the world.




□ Coste energético de la bioinformática (The energy cost of bioinformatics)

>> https://bioinfoperl.blogspot.com/2023/10/coste-energetico-de-la-bioinformatica.html




Veera Rajagopal

>> https://www.businesswire.com/news/home/2023

Something big happened a few days ago. Industry leaders in the genomics field (Regeneron, AstraZeneca, Novo Nordisk, Roche) announced a collaboration with the US's largest Black medical school, Meharry Medical College in Nashville, to establish what might become the "UK Biobank of Africa", the largest genomics database of 500,000 volunteers of African ancestries.




I’m caught in a storm!

2023-10-31 21:36:37 | Photo


(Instagram post shared by @razoralign)


Killers of the Flower Moon.

2023-10-30 00:08:53 | Film


『Killers of the Flower Moon』

>> https://www.apple.com/tv-pr/originals/killers-of-the-flower-moon/

Apple TV+ / Paramount Pictures (2023)

Directed by Martin Scorsese
Screenplay by Martin Scorsese / Eric Roth
Based on the book by David Grann

Music by Robbie Robertson
Cinematography by Rodrigo Prieto
Set Decoration by Adam Willis
Production design by Jack Fisk



An Apple Original Films production. White settlers swarm around the oil money of an Indigenous nation: organized crime over headrights (equal shares in the oil royalties), and the birth of modern federal law enforcement. Economic forces draw every people in the world into their orbit, obscuring the malice at work within them, and this tragedy is by no means a thing of the past. Scorsese's cold-eyed realism gives off a dull gleam.







□ The Halluci Nation Ft. Black Bear / Stadium Pow Wow

The music in the official trailer for 『Killers of the Flower Moon』 is ethnotronica, fusing traditional Native American vocals with dubstep. The film's own soundtrack also features actual Osage choral singing.


The Creator.

2023-10-29 00:10:10 | Film

『The Creator』

>> https://www.20thcenturystudios.com/movies/the-creator

20th Century Studios (2023)

Directed by Gareth Edwards
Written by Gareth Edwards
Screenplay by Chris Weitz

Music by Hans Zimmer
Cinematography by Greig Fraser / Oren Soffer

Production Design by James Clyne
Art Direction by Lek Chaiyan Chunsuttiwat / Chris DiPaola
Costume Design by Jeremy Hanna / Preeyanan ‘Lin’ Suwannathda



An unending struggle between AI that takes human form to stay close to humanity and the humans who play at being gods. Classic cyberpunk in premise, yet through advanced visual effects and direction it etches some of the most beautiful, cathartic imagery into the history of science-fiction film. The stylized, music-video-like visuals and editing, together with Hans Zimmer's sublime score, paint single frames that feel like myth.

An AI created to resemble and serve as a companion to humans, is engaged in a struggle with humans who play god. This cyberpunk sci-fi film, employing cutting-edge visual effects and direction, combined with the divine music of Hans Zimmer, creates scenes of breathtaking beauty.





□ Hans Zimmer / “A Place in the Sky”



□ Hans Zimmer / “Prayer”



□ Hans Zimmer / “True love”



Dominator.

2023-10-17 22:17:37 | Science News

(Created with Midjourney v5.2)





□ Design Patterns of Biological Cells

>> https://arxiv.org/abs/2310.07880

Because design patterns exist at all levels of detail within biology, from the designs of specific molecules to the designs of multi-cellular organisms, they restrict this work to the chemical reaction networks that animate individual cells.

There are three dominant versions of this pattern, which are DNA replication, DNA transcription to RNA, and RNA translation to proteins.

Each is performed by complex biochemical machinery that moves along the template and catalyzes the production of the newly synthesized molecule, and each includes its own version of kinetic proofreading.





□ Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes

>> https://www.biorxiv.org/content/10.1101/2023.09.27.559668v1

Deepurify uses two distinct encoders, a genomic sequence encoder (GseqFormer) and a taxonomic encoder (LSTM), to encode genomic sequences and their source genomes' taxonomic lineages.

Deepurify initially quantified the taxonomic similarities of contigs by assigning taxonomic lineages to them. It then used these lineages to construct a MAG-separated tree, partitioning the MAG into distinct sections, each containing contigs with the same lineage.

Deepurify optimized contig utilization within the MAG, avoiding immediate removal of contaminated contigs. A tree traversal algorithm was devised to maximize the count of medium- and high-quality MAGs within the MAG-separated tree.





□ scDILT: a model-based and constrained deep learning framework for single-cell Data Integration, Label Transferring, and clustering

>> https://www.biorxiv.org/content/10.1101/2023.10.09.561605v1

scDILT (Single-Cell Deep Data Integration and Label Transferring) leverages a conditional autoencoder (CAE). The CAE receives the concatenated count matrix of multiple datasets, along with a vector indicating the batch IDs.

scDILT generates an integrated latent space representing the input datasets along with predicted labels for all cells. The cell-to-cell constraints are built based on the labels of these data and implemented on the bottleneck layer Z of the autoencoder.





□ ProxyTyper: Generation of Proxy Panels for Privacy-aware Outsourcing of Genotype Imputation

>> https://www.biorxiv.org/content/10.1101/2023.10.01.560384v1

ProxyTyper, a framework for building proxy panels, i.e. panels that are similar in statistical properties to the original panel but are anonymized. ProxyTyper utilizes 3 mechanisms to protect haplotype datasets in terms of variant positions, genetic maps, and variant genotypes.

First mechanism protects the variant positions and genetic maps that can leak side-channel information. Second is resampling of original haplotype panels using a Li-Stephens Markov model with privacy parameters for tuning privacy level and utility.

ProxyTyper generates a mosaic of the original haplotypes so that each chromosome-wide haplotype is a mosaic of the haplotypes in the original panel. The third mechanism consists of encoding the alleles in resampled panels using locality-based hashing and permutation.





□ DiffDec: Structure-Aware Scaffold Decoration with an End-to-End Diffusion Model

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561377v1

DiffDec optimizes molecules through molecular scaffold decoration conditioned on the 3D protein pocket by an E(3)-equivariant graph neural network and diffusion model. DiffDec could identify the growth anchors and generate R-groups well for the scaffolds without provided anchors.

The diffusion process iteratively adds Gaussian noise to the data, while the generative process gradually denoises the noise distribution under the condition of scaffold and protein pocket to recover the ground truth R-groups.






□ ILIAD: A suite of automated Snakemake workflows for processing genomic data for downstream applications

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561910v1

ILIAD, a suite of Snakemake workflows developed with several modules for automatic and reliable processing of raw or stored genomic data that lead to the output of ready-to-use genotypic information necessary to drive downstream applications.

ILIAD offers a containerized workflow with optional automatic downloads of desired files from file transfer protocol (FTP) sites coupled with the use of any genome reference assembly for variant calling using BCFtools.

Iliad features independent submodules for lifting over reference assembly genomic positions (GRCh37 to GRCh38 and vice versa) and merging multiple VCF files at once.





□ MSXFGP: combining improved sparrow search algorithm with XGBoost for enhanced genomic prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05514-7

Chaos theory is a nonlinear theory and has good applications in random number generation. Many swarm intelligence optimization methods use chaos mapping as random number generators to initialize populations.

MSXFGP is based on a multi-strategy improved sparrow search algorithm (SSA) to optimize XGBoost parameters and feature selection. Firstly, logistic chaos mapping, elite learning, adaptive parameter adjustment, Levy flight, and an early stop strategy are incorporated into the SSA.
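
A tiny example of logistic chaos mapping used as a population initializer, as mentioned above; parameter names and the seed are generic, not MSXFGP's.

```python
import numpy as np

def logistic_map_population(n_individuals, n_dims, mu=4.0, seed=0.37):
    """Generate values in (0, 1) with the logistic map x_{k+1} = mu * x_k * (1 - x_k)."""
    pop = np.empty((n_individuals, n_dims))
    x = seed
    for i in range(n_individuals):
        for j in range(n_dims):
            x = mu * x * (1.0 - x)
            pop[i, j] = x
    return pop   # later rescaled to each parameter's search range

print(logistic_map_population(3, 4))
```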





□ PhyGCN: Pre-trained Hypergraph Convolutional Neural Networks with Self-supervised Learning

>> https://www.biorxiv.org/content/10.1101/2023.10.01.560404v1

PhyGCN aims to enhance node representation learning in hypergraphs by effectively leveraging abundant unlabeled data. Hyperedge prediction is employed as a self-supervised task for model pre-training. The pre-trained embedding model is then used for downstream tasks.

To calculate the embedding for a target node, the hypergraph convolutional network aggregates information from neighboring nodes connected to it via hyperedges, and combines it with the target node embedding to output a final embedding.

PhyGCN employs two adapted strategies: DropHyperedge and skip/dense connections. DropHyperedge randomly masks values of the adjacency matrix of the base hypergraph convolutional network at each iteration, which helps prevent overfitting and improves generalization.
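
A minimal numpy sketch of the DropHyperedge idea as summarized above (whether masking acts on whole hyperedges or individual entries, and the drop rate, are assumptions here):

import numpy as np

def drop_hyperedge(H, drop_rate=0.2, rng=None):
    """Randomly zero out hyperedges (columns of the node-by-hyperedge incidence matrix H).

    Re-applied at every training iteration, so the convolution sees a slightly
    different hypergraph each time, which discourages overfitting.
    """
    rng = rng or np.random.default_rng()
    keep = rng.random(H.shape[1]) >= drop_rate      # one keep/drop decision per hyperedge
    return H * keep[np.newaxis, :]

H = np.random.randint(0, 2, size=(6, 4))            # 6 nodes, 4 hyperedges
H_aug = drop_hyperedge(H, drop_rate=0.3)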





□ Monopogen: Single-nucleotide variant calling in single-cell sequencing data

>> https://www.nature.com/articles/s41587-023-01873-x

Monopogen, a computational framework that enables researchers to detect single-nucleotide variants (SNVs) from a variety of single-cell transcriptomic and epigenomic sequencing data.

Monopogen uses high-quality haplotype and linkage disequilibrium (LD) data from an external reference panel to overcome uneven sequencing coverage, allelic dropout and sequencing errors in single-cell sequencing data.

Monopogen further conducts LD scoring at the cell population level within each sample, leveraging the expectation that most alleles are identical and in perfect LD with neighboring alleles across the genome, except for those that are somatically altered in a subpopulation of cells.





□ Ribotin: Automated assembly and phasing of rDNA morphs

>> https://www.biorxiv.org/content/10.1101/2023.09.29.560103v1

Ribotin uses highly accurate long reads to build a graph that represents all variation within the rDNA. Ultralong ONT reads are then aligned to the graph and used to detect rDNA repeat units, and the ONT read paths are clustered into rDNA morphs.

Ribotin integrates with the assembly tool verkko to assemble rDNA morphs per chromosome, and also offers a mode that runs without a verkko assembly, using only a related reference rDNA sequence. Ribotin detects rDNA tangles using reference k-mers and graph topology.





□ LMSRGC: Reference-based genome compression using the longest matched substrings with parallelization consideration

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05500-z

LMSRGC, an algorithm based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMS) for the compression of genome data in FASTA format.

The proposed algorithm utilizes the characteristics of SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence and then utilizes LMSs to compress the target genome sequence.






□ CEN-DGCNN: Co-embedding of edges and nodes with deep graph convolutional neural networks

>> https://www.nature.com/articles/s41598-023-44224-1

CEN-DGCNN (Co-embedding of Edges and Nodes with Deep Graph Convolutional Neural Networks) introduces multi-dimensional edge embedding representation. It constructs a message passing framework which introduces the idea of residual connection and dense connection.

Based on CEN-DGCNN, a deep graph convolutional neural network can be designed to mine long-range dependency relationships between nodes. Each layer learns node features and edge features simultaneously, and both are updated iteratively across layers.





□ StrastiveVI: Isolating structured salient variations in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.10.06.561320v1

StrastiveVI (Structured Contrastive Variational Inference) leverages previous advances in conditionally invariant representation learning to model the variations underlying scRNA-seq data using two sets of latent variables.

StrastiveVI separates the target variations from the dominant background variations: the background variables are invariant to the given covariate of interest, while the target variables capture variations related to that covariate.





□ HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03053-1

HycDemux integrates an unsupervised hybrid approach to achieve accurate clustering, in which a nucleotide-based greedy algorithm produces initial clusters and the raw signal information guides the continuous optimization of the clustering.

HycDemux also integrates a module that uses a voting mechanism to determine the final demultiplexing result. This module selects n representatives (5 by default) for each cluster and compares them by Dynamic Time Warping distance.
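
For reference, a bare-bones dynamic time warping distance between two 1-D signal segments (a textbook O(nm) DTW, not HycDemux's optimized implementation):

import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D signals."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# e.g. compare a cluster representative's raw signal against a barcode reference signal
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.2, 2.9, 3.1]))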





□ diVas: Digenic variant interpretation with hypothesis-driven explainable AI

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560464v1

diVas, an ML-based approach for digenic variant interpretation that aims to overcome the limitations of existing tools. Unlike other tools, diVas leverages the proband's phenotypic information to predict the probability that each variant pair is causative.

diVas employs cutting-edge Explainable Artificial Intelligence (XAI) techniques for further subclassification into distinct digenic mechanisms: True Digenic/Composite and Dual Molecular Diagnosis.





□ Incorporating extrinsic noise into mechanistic modelling of single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.09.30.560282v1

A fully Bayesian framework for the mechanistic analysis of scRNAseq data based on the telegraph model of gene expression, building on single cell sequencing / Kinetics analysis and including cell size effects via a cell-specific scaling factor.

This framework is implemented in the probabilistic programming language Stan and relies on a state-of-the-art Hamiltonian Monte Carlo sampler. It uses Bayesian model selection to distinguish between modes of gene expression and evaluate the possible presence of zero-inflation.






□ MINI-AC: inference of plant gene regulatory networks using bulk or single-cell accessible chromatin profiles

>> https://onlinelibrary.wiley.com/doi/10.1111/tpj.16483

MINI-AC (Motif-Informed Network Inference based on Accessible Chromatin), a computational method that integrates TF motif information with bulk or single-cell derived chromatin accessibility data to perform motif enrichment analysis and GRN inference.

MINI-AC generates information about motifs showing enrichment on the ACRs, a network that is context-specific for a functional enrichment analysis. MINI-AC can be used in two alternative modes - genome-wide and locus-based - to select different non-coding genomic spaces.






□ MBE: model-based enrichment estimation and prediction for differential sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03058-w

MBE can readily make use of modern-day neural network models in a plug-and-play manner, which also enables us to easily handle (possibly overlapping) reads of different lengths.

For example, fully convolutional neural network classifiers naturally handle variable-length sequences because the convolutional kernels and pooling operations in each layer are applied in the same manner across the input sequence, regardless of its length.

MBE trivially generalizes to settings with more than two conditions of interest by replacing the binary classifier with a multi-class classifier.

The multi-class classification model is trained to predict the condition from which each read arose; then, the density ratio for any pair of conditions can be estimated using the ratio of its corresponding predicted class probabilities.
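
The density-ratio trick described here can be sketched with any probabilistic classifier; below is a toy version with scikit-learn logistic regression on synthetic features (MBE itself plugs in neural sequence models):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# toy featurized reads from two conditions, labeled 0 and 1
X = np.vstack([rng.normal(0.0, 1.0, (500, 10)), rng.normal(0.5, 1.0, (500, 10))])
y = np.repeat([0, 1], 500)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# enrichment (density ratio p_cond1 / p_cond0) estimated from predicted class probabilities
probs = clf.predict_proba(X)
log_enrichment = np.log(probs[:, 1] / probs[:, 0])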





□ LIANA: Comparison of methods and resources for cell-cell communication inference from single-cell RNA-Seq data

>> https://www.nature.com/articles/s41467-022-30755-0

CCC events are typically represented as a one-to-one interaction between a transmitter and receiver protein, accordingly expressed by the source and target cell clusters. The information about which transmitter binds to which receiver is extracted from diverse sources.

LIANA (a LIgand-receptor ANalysis frAmework) takes any annotated single-cell RNA (scRNA) dataset as input and establishes a common interface to all the resources and methods in any combination. LIANA provides a consensus ranking for the method’s predictions.





□ Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures

>> https://www.nature.com/articles/s41592-023-02026-3

For long-read RNA-seq, this study is the first to compare differential transcript expression (DTE) and differential transcript usage (DTU) methods on a controlled dataset with tens of millions of reads per sample, as is typically available in short-read studies.

DTU analysis calculates the proportion of transcript expression relative to all transcripts of a gene, which is more readily impacted by changes in the quantification of any transcript from that gene. Therefore, differences in quantification between ONT and Illumina data had a larger impact.





□ happi: a hierarchical approach to pangenomics inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03040-6

happi is a method for modeling gene presence in pangenomics that leverages information about genome quality. happi models the association between an experimental condition and gene presence where the experimental condition is the primary predictor of interest.

happi provides sensible results in an analysis of metagenome-assembled genome data, improves statistical inference under simulation. The latent variable structure of the model makes the expectation-maximization algorithm an appealing choice for estimating unknown parameters.





□ PaGeSearch: A Tool for Identifying Genes within Pathways in Unannotated Genomes

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559665v1

PaGeSearch identifies a list of genes within a genome, with a focus on genes associated with specific pathways. By identifying candidate regions through a sequence similarity search and performing gene prediction within them, PaGeSearch significantly reduces the search space.

PaGeSearch uses a neural network model to provide candidates that are the most likely orthologs of the query genes.





□ GenArk: towards a million UCSC genome browsers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03057-x

GenArk (Genome Archive), a collection of UCSC Genome Browsers from NCBI assemblies. Built on our established track hub system, this enables fast visualization of annotations. Assemblies come with gene models, repeat masks, BLAT, and in silico PCR.

The GenArk genome browsers cover multiple clades: 159 primates, 409 mammals, 270 birds, 271 fishes, 115 other vertebrates, 598 invertebrates, 554 fungi, and 230 plants. It also includes 446 assemblies from the Vertebrate Genome Project (VGP) and 336 legacy assemblies.





□ scRANK: Ranking of cell clusters in a single-cell RNA-sequencing analysis framework using prior knowledge

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560416v1

A novel methodology that exploits prior knowledge for a disease in combination with expert-user information to accentuate cell types from a scRNA-seq analysis that are most closely related to the molecular mechanism of a disease of interest.

The methodology is fully automated and generates a ranking for all cell types, based on topology information obtained from the CellChat networks.





□ Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing

>> https://www.nature.com/articles/s41587-022-01221-5

An approach for ultra-rapid nanopore WGS that combines an optimized sample preparation protocol, distributing sequencing over 48 flow cells, near real-time base calling and alignment, accelerated variant calling and fast variant filtration for efficient manual review.

The cloud-based pipeline scales compute-intensive base calling and alignment across 16 instances with 4× Tesla V100 GPUs each and runs concurrently with sequencing.

The instances aim for maximum resource utilization: base calling using Guppy runs on GPUs while alignment using Minimap2 runs on 42 virtual CPUs in parallel. Small-variant calling is performed using GPU-accelerated PEPPER–Margin–DeepVariant.





□ AutoClass: A universal deep neural network for in-depth cleaning of single-cell RNA-Seq data

>> https://www.nature.com/articles/s41467-022-29576-y

AutoClass integrates two DNN components, an autoencoder and a classifier, so as to maximize both noise removal and signal retention. AutoClass is distribution-agnostic as it makes no assumption about specific data distributions, and hence can effectively clean a wide range of noise and artifacts.

AutoClass effectively models and cleans a wide range of noises and artifacts in scRNA-Seq data including dropouts, random uniform, Gaussian, Gamma, Poisson, and negative binomial noises, as well as batch effects.





□ Mabs: a suite of tools for gene-informed genome assembly

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05499-3

Mabs is a genome assembly tool which optimizes parameters of genome assemblers Hifiasm and Flye. Mabs optimizes parameters of a genome assembler to make an assembly where protein-coding genes are assembled more accurately.

Mabs is able to distinguish true multicopy orthogroups from false multicopy orthogroups, because genes originating from haplotypic duplications have two times lower coverage than correctly assembled genes.





□ The longest intron rule

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560625v1

The presence of introns substantially increases the complexity of ribosomal protein gene expression as they variably slow the expression cycle, and in addition, many introns can contain non-coding RNA involved in other layers of regulation.

Localization of the longest intron in the second or final third of the gene is significantly more frequent for certain functionally related groups of genes, e.g. DNA repair genes.





□ DAESC: Single-cell allele-specific expression analysis reveals dynamic and cell-type-specific regulatory effects

>> https://www.nature.com/articles/s41467-023-42016-9

DAESC (Differential Allelic Expression using Single-Cell data) accounts for haplotype switching using latent variables and handles sample repeat structure of single-cell data using random effects.

DAESC is based on a beta-binomial regression model and can be used for differential ASE against any independent variable, such as cell type, continuous developmental trajectories, genotype (eQTLs), or disease status.

The baseline model DAESC-BB is a beta-binomial model with individual-specific random effects that account for the sample repeat structure arising from multiple cells measured per individual inherent to single-cell data.

DAESC-BB can be used generally for differential ASE regardless of sample size (number of individuals, N). When the sample size is reasonably large (e.g., N ≥ 20), the full model DAESC-Mix, which accounts for both sample repeat structure and implicit haplotype phasing, can be used.





□ KmerSV: a visualization and annotation tool for structural variants using Human Pangenome derived k-mers

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561941v1

KmerSV, a new tool for SV visualization and annotation. To mediate these functions, KmerSV uses a reference sequence deconstructed into its component k-mers, each having a length of 31 bp. These reference-derived k-mers are compared to the sequence of interest.

The program maps the Pangenome or other reference 31-mers against one or multiple target sequences which can include either contigs or sequence reads.

Initially, they retrieve these k-mers via a sliding window across a segment of the reference with its coordinate information. Then, the retrieved k-mers are systematically mapped against the target.

Unique 31-mers (as defined by the reference) serve as "anchor" points in the target sequence to facilitate using k-mers with multiple coordinates. This anchoring process eliminates ambiguous k-mers and improves the visualization of complex SVs such as duplications.
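
A simplified sketch of the anchoring step (the helper names are hypothetical and coordinate handling is reduced to the essentials): slide a 31-bp window across the reference, keep the k-mers that occur exactly once, and look them up in the target sequence.

def kmers_with_positions(seq, k=31):
    """Yield (kmer, start) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], i

def unique_anchor_kmers(reference, k=31):
    """Return {kmer: reference_position} for k-mers seen exactly once in the reference."""
    counts, positions = {}, {}
    for kmer, i in kmers_with_positions(reference, k):
        counts[kmer] = counts.get(kmer, 0) + 1
        positions[kmer] = i
    return {km: positions[km] for km, c in counts.items() if c == 1}

def anchor_target(target, anchors, k=31):
    """Map unique reference k-mers onto the target; ambiguous k-mers are simply absent."""
    return [(i, anchors[target[i:i + k]])
            for i in range(len(target) - k + 1)
            if target[i:i + k] in anchors]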





□ PanKmer: k-mer based and reference-free pangenome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad621/7319363

PanKmer, a non-graphical k-mer decomposition method designed to efficiently represent and analyze many forms of variation in large pangenomic datasets, with no reliance on a reference genome and no assumption of annotation.

PanKmer includes a function to calculate the number of shared k-mers between all pairs of input genomes and return them as an adjacency matrix. Subsequently, the adjacency values can be used to perform a hierarchical clustering of input genomes.
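
A toy version of the pairwise shared-k-mer matrix and the downstream hierarchical clustering, using plain Python sets and scipy (PanKmer's own index is far more memory-efficient than this):

from itertools import combinations
import numpy as np
from scipy.cluster.hierarchy import linkage

def kmer_set(seq, k=31):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_matrix(genomes, k=31):
    """genomes: dict name -> sequence. Returns (names, symmetric shared-k-mer count matrix)."""
    names = list(genomes)
    sets = {n: kmer_set(genomes[n], k) for n in names}
    adj = np.zeros((len(names), len(names)))
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        adj[i, j] = adj[j, i] = len(sets[a] & sets[b])
    return names, adj

names, adj = shared_kmer_matrix({"g1": "ACGT" * 50, "g2": "ACGA" * 50, "g3": "ACGT" * 50}, k=7)
dist = adj.max() - adj                              # turn similarity into a distance
tree = linkage(dist[np.triu_indices(len(names), k=1)], method="average")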



Oxford Nanopore

>> https://nanoporetech.com/about/events/community-meetings/ncm-2023-houston

This week is #WorldSpaceWeek! At #nanoporeconf, Sarah Castro-Wallace will share @NASA’s project to take the MinION device to Mars — which will prove invaluable if we are to discover life beyond Earth.






Focal Point.

2023-10-17 22:17:36 | Science News

(Artwork by Andrew Kramer)




□ CellPLM: Pre-training of Cell Language Model Beyond Single Cells

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560734v1

CellPLM (a novel single-Cell Pre-trained Language Model) proposes a cell language model to account for cell-cell relations. The cell embeddings are initialized by aggregating gene embeddings, since gene expressions are bag-of-words features.

CellPLM leverages a new type of data, spatially-resolved transcriptomic (SRT) data, to gain an additional reference for uncovering cell-cell interactions. SRT data provides positional information for cells. Both types of data are jointly modeled by transformers.

CellPLM consists of a gene expression embedder, a transformer encoder, a Gaussian mixture model, and a batch-aware decoder. CellPLM introduces an inductive bias to overcome data quantity limitations by utilizing a Gaussian mixture as the prior distribution in the latent space.





□ SONATA: Disambiguated manifold alignment of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561049v1

SONATA represents the low-dimensional manifold structure of each single-cell dataset using a geodesic distance matrix of the cells. To do this, SONATA first constructs a weighted k-nearest neighbor (k-NN) graph of cells based on Euclidean distance.

SONATA then calculates the shortest distance between each node pair on the graph because the shortest distances approximate geodesic distances. SONATA measures the likelihood that one cell from the dataset can be substituted for another in a cross-modality alignment.
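
The geodesic-distance construction described here corresponds to the standard k-NN graph plus all-pairs shortest paths (as in Isomap); for example:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=15):
    """Approximate geodesic distances: weighted k-NN graph followed by shortest paths."""
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance")   # Euclidean edge weights
    return shortest_path(knn, method="D", directed=False)

cells = np.random.rand(200, 30)        # e.g. 200 cells x 30 principal components
geodesic = geodesic_distances(cells, k=10)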





□ TreePPL: A Universal Probabilistic Programming Language for Phylogenetics

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561673v1

TreePPL introduces universal probabilistic programming and extensible Monte Carlo inference to a wider audience in statistical phylogenetics. It allows practitioners to craft probabilistic programs that utilize the sophisticated Miking CorePPL inference on the back-end.

To describe the problem of tree inference in a PPL, they use stochastic recursion. The core idea is to control a recursive function using a random variable, such that successive iterations generate a valid draw from the prior probability distribution over tree space.





□ Graphite: painting genomes using a colored De Bruijn graph

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561343v1

Graphite starts with two graph files and a set of query identifiers. It then builds a suffix array of the queries, along with other data structures to speed up matching. Each sequence (i.e. "reference") is then read from the graph file and mapped onto the suffix array.

Each mapping is an identical sequence shared between the queries and the reference, also called a Maximal Exact Match (MEM). Each time a MEM is found, its length is compared to previously discovered MEMs so that only the longest MEM (LMEM) is retained.





□ PARSEC: Rationalised experiment design for parameter estimation with sensitivity clustering

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561860v1

PARSEC (PARameter SEnsitivity Clustering) uses the model architecture of the system through parameter sensitivity analysis to direct the search for informative experiment designs. PARSEC generates an 'optimal' DoE effectively.

PARSEC computes the parameter sensitivity indices (PSI) vectors at various parameter values that sample the distribution linked to parameter uncertainty. Concatenating the PSI vectors for a measurement candidate yields the composite PARSEC-PSI vector.





□ SC-Track: a robust cell tracking algorithm for generating accurate single cell lineages from diverse cell segmentations

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560639v1

SC-Track employs a hierarchical probabilistic cache-cascade model to overcome the noisy output of deep learning models. SC-Track can generate robust single cell tracks from noisy segmentation outputs, with artifacts ranging from missing segmentations to false detections.

SC-Track provides smoothed classification tracks to aid the accurate classification of cellular events. SC-Track has a built-in biologically inspired cell division algorithm that can robustly assign mother-daughter associations from segmented nuclear or cellular masks.

SC-Track employs a tracking-by-detection approach, whereby detected cells are associated between frames. A TrackTree data structure was used to store the tracking relationships between each segmented cell temporally and spatially.





□ optima: an Open-source R Package for the Tapestri platform for Integrative single cell Multi-omics data Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad611/7291856

optima stores all data matrices for a single biological sample, incl. DNA (amplicon data for DNA variants), CNV, and protein. optima also stores all the metadata, incl. cell barcodes, panels of amplicon names, as well as metadata to keep track of normalization/filter status.

The first step is DNA variant data filtering with the filterVariant() function. Several factors, including sequencing depth, genotype quality, etc., are imported from the h5 file and used in this filtering step. A cell/variant will be removed if too many loci fail QC.

After filtering, the DNA data will be used for cell clone identification. To identify clones, a user may choose the unsupervised clustering method dbscan. The clustering result will be stored in the cell labels vector contained within the optima object.





□ FedGMMAT: Federated Generalized Linear Mixed Model Association Tests

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560753v1

FedGMMAT, a federated genetic association testing tool that utilizes a federated statistical testing approach for efficient association tests that can correct for arbitrary fixed and random effects among different collaborating sites.

FedGMMAT executes the null model fitting using a round-robin schedule among the sites wherein each site locally updates the model parameters, encrypts the intermediate results and passes them to the next site to be securely aggregated.

After the model parameters have converged, FedGMMAT fits the mixed-effect model parameters using a similar round-robin algorithm. FedGMMAT assigns the score-test statistics to each variant. The central server computes an aggregated projection matrix from all sites.





□ DegCre: Probabilistic association of differential gene expression with regulatory regions

>> https://www.biorxiv.org/content/10.1101/2023.10.04.560923v1

DegCre, a method that probabilistically associates CREs to target gene TSSs over a wide range of genomic distances. The premise of DegCre is that true CRE to DEG pairs should change in concert with one another as a result of a perturbation, such as a differentiation protocol.

DegCre is a non-parametric method that estimates an association probability for each possible pair of CRE and DEG. It considers CRE-DEG distance but avoids arbitrary thresholds. Because DegCre uses rank-order statistics, it can use various types of CRE-associated data.





□ The Bias of Using Cross-Validation in Genomic Predictions and Its Correction

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560782v1

A comprehensive examination of CV bias across various models, including Ordinary Least Squares (OLS), Generalized Least Squares (GLS), a polygenic method (LMM with its predictor gBLUP), and three regularization methods (Ridge, Lasso, and ENET).

The CVc method computes the correction by adding the difference between the covariance of the predicted and observed dependent variable in the cross-validation process and the corresponding covariance in the testing process.

To calculate this covariance, one extracts the projection matrix of the estimator, which means that only linear methods with closed-form solutions can be used to rectify the CV bias.





□ SNAIL: Adjustment of spurious correlations in co-expression measurements from RNA-Sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad610/7295542

SNAIL (Smooth-quantile Normalization Adaptation for Inference of co-expression Links) is a modified implementation of smooth quantile normalization that uses a trimmed mean to determine the quantile distribution and applies median aggregation for genes with shared read counts.

SNAIL effectively removes false-positive associations between genes, without the need to select an arbitrary threshold or to exclude genes from the analysis.





□ simpleaf : A simple, flexible, and scalable framework for single-cell data processing using alevin-fry

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad614/7295550

simpleaf encapsulates the process of creating an expanded reference for quantification into a single command (index) and the quantification of a sample into a single command (quant). It also exposes various other functionality, and is actively being developed and expanded.

Simpleaf provides a simple and flexible interface to access the state-of-the-art features provided by the alevin-fry ecosystem, tracks best practices using the underlying tools, enables users to transparently process data with complex fragment geometry.





□ Aliro: an Automated Machine Learning Tool Leveraging Large Language Models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad606/7291858

Aliro is an easy-to-use data science assistant. It allows researchers without machine learning or coding expertise to run supervised machine learning analysis through a clean web interface.

By infusing the power of large language models (LLM), the user can interact with their data by seamlessly retrieving and executing code pulled from the LLM, accelerating automated discovery of new insights from data.

Aliro includes a pre-trained machine learning recommendation system that can assist the user to automate the selection of machine learning algorithms and its hyperparameters and provides visualization of the evaluated model and data.





□ Segzoo: a turnkey system that summarizes genome annotations

>> https://www.biorxiv.org/content/10.1101/2023.10.03.559369v1

Segzoo is a tool designed to automate various genomic analyses on segmentations obtained using Segway. It provides detailed results for each analysis and a comprehensive visualization summarizing the outcomes.

Segzoo generates segmentation-centric summary statistics using Segtools and BEDTools. Segzoo uses Go Get Data (GGD) to automatically download all required data for these analyses and produces an easy to interpret figure which reveals patterns of segmented regions.





□ GradHC: Highly Reliable Gradual Hash-based Clustering for DNA Storage Systems

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561008v1

GradHC implements a gradual hash-based clustering algorithm for DNA storage systems. Its primary strength lies in its capability to accurately cluster various types of designs, incl. varying strand lengths, cluster sizes, and different error ranges.

Given an input design (with potential similarity among different DNA strands), one can randomly choose a seed and use it to generate pseudo-random DNA strands matching the original design's length and input set size.

Each input strand is then XORed with its corresponding pseudo-random DNA strand, ensuring a high likelihood that the new strands are far from each other (in terms of edit distance) and do not contain repeated substrings across different input strands.
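
A minimal sketch of this randomization step (the 2-bit base encoding is an assumption; the actual GradHC scheme may differ in detail). XOR with the same pad is its own inverse, so the original strands remain recoverable:

import random

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def pseudo_random_strand(length, seed):
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

def xor_strand(strand, pad):
    """XOR each base with the pad base in a 2-bit encoding."""
    return "".join(BITS_TO_BASE[BASE_TO_BITS[a] ^ BASE_TO_BITS[b]]
                   for a, b in zip(strand, pad))

design = ["ACGTACGTAC", "ACGTACGTTT"]                         # similar input strands
pads = [pseudo_random_strand(len(s), seed) for seed, s in enumerate(design)]
randomized = [xor_strand(s, p) for s, p in zip(design, pads)]
recovered = [xor_strand(r, p) for r, p in zip(randomized, pads)]
assert recovered == design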





□ Multimodal joint deconvolution and integrative signature selection in proteomics

>> https://www.biorxiv.org/content/10.1101/2023.10.04.560979v1

A novel algorithm to estimate proteomics cell fractions by integrating bulk transcriptome and proteome data without a reference proteome, implemented in the R package MICSQTL.

The method enables the downstream cell-type-specific protein quantitative trait loci mapping (cspQTL) based on the mixed-cell proteomes and pre-estimated proteomics cellular composition, without the need for large-scale single cell sequencing [9] or cell sorting.





□ The DeMixSC deconvolution framework uses single-cell sequencing plus a small benchmark dataset for improved analysis of cell-type ratios in complex tissue samples

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561733v1

DeMixSC, which employs a benchmark dataset and an improved weighted nonnegative least-squares (WNNLS) framework to identify and adjust for genes consistently affected by technological discrepancies.

DeMixSC starts with a benchmark dataset of matched bulk and sc/snRNA-seq data with the same cell-type proportions. Pseudo-bulk mixtures are generated from the sc/sn data. DeMixSC identifies DE genes and non-DE genes between the matched real-bulk and pseudo-bulk data.





□ Afanc: a Metagenomics Tool for Variant Level Disambiguation of NGS Datasets

>> https://www.biorxiv.org/content/10.1101/2023.10.05.560444v1

Afanc, a novel metagenomic profiler which is sensitive down to species and strain level taxa, and capable of elucidating the complex pathogen profile of compound datasets.

Afanc solves the issues by carrying out species and subspecies level profiling using a novel Kraken2 report disambiguation algorithm and lineage-level profiling using a variant profiling approach.





□ Ocelli: an open-source tool for the visualization of developmental multimodal single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561074v1

Ocelli is an explainable multimodal framework to learn a low-dimensional representation of developmental trajectories. In the data preprocessing step, we find modality-specific programs with topic modeling using Latent Dirichlet Allocation.

Ocelli constructs the Multimodal Markov Chain as a weighted sum of the unimodal affinities between cells. Ocelli determines the latent space of multimodal diffusion maps (MDM) by factoring the MMC into eigenvectors and eigenvalues.
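
Schematically (modality weights, normalization, and component scaling below are simplifying assumptions, not Ocelli's exact procedure), the MMC is a weighted sum of per-modality affinities whose eigendecomposition yields the multimodal diffusion map:

import numpy as np

def multimodal_diffusion_map(affinities, weights, n_components=10):
    """Combine per-modality cell-cell affinity matrices and eigendecompose the Markov chain.

    affinities: list of (n_cells, n_cells) symmetric affinity matrices (one per modality).
    weights:    one weight per modality, summing to 1.
    """
    mmc = sum(w * a for w, a in zip(weights, affinities))
    markov = mmc / mmc.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
    evals, evecs = np.linalg.eig(markov)
    order = np.argsort(-evals.real)
    # drop the trivial stationary component, keep the leading diffusion components
    keep = order[1:n_components + 1]
    return evecs.real[:, keep] * evals.real[keep]

rna = np.random.rand(100, 100)
rna = (rna + rna.T) / 2
atac = np.random.rand(100, 100)
atac = (atac + atac.T) / 2
embedding = multimodal_diffusion_map([rna, atac], weights=[0.6, 0.4], n_components=5)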





□ AleRax: A tool for species and gene tree co-estimation and reconciliation under a probabilistic model of duplication, transfer, and loss

>> https://www.biorxiv.org/content/10.1101/2023.10.06.561091v1

AleRax, a novel probabilistic method for phylogenetic tree inference that can perform both species tree inference and reconciled gene tree inference from a sample of gene trees.

AleRax is on par with ALE in terms of reconciled gene tree accuracy, while being one order of magnitude faster and more robust to numerical errors. AleRax infers more accurate species trees than SpeciesRax and ASTRAL-Pro 2, because it can accommodate gene tree uncertainty.





□ Pindel-TD: a tandem duplication detector based on a pattern growth approach

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561441v1

Pindel-TD, a tandem duplication (TD) detector built by specifically optimizing the pattern growth approach in Pindel. The search strategies for the minimum and maximum unique substrings are redesigned for TDs of different sizes, resulting in high and robust TD detection performance.

First, read pairs are selected in which one read maps uniquely (with only the 'M' character in its CIGAR string) while its mate shows a split-read signature.

For each selected read pair, the uniquely mapped read with high mapping quality is taken as a reliable anchor read, determining the search direction for the subsequent split-read analysis of the soft-clipped read.

A pattern growth approach is then applied to find the minimum and maximum unique substrings, starting from either the leftmost or the rightmost end of the unmapped read.

Next, the split-read information is carefully processed to identify TDs with accurate breakpoints. Finally, Pindel-TD removes redundant TDs according to their length and breakpoints to obtain the final TD set.





□ PopGenAdapt: Semi-Supervised Domain Adaptation for Genotype-to-Phenotype Prediction in Underrepresented Populations

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561715v1

PopGenAdapt is a deep learning model that applies semi-supervised domain adaptation (SSDA) to improve genotype-to-phenotype prediction in underrepresented populations.

PopGenAdapt leverages the large amount of labeled data from well-represented populations, as well as the limited labeled and the larger amount of unlabeled data from underrepresented populations.

PopGenAdapt adapts the state-of-the-art SSDA method of Minimax Entropy (MME) with Source Label Adaptation (SLA) to genotype-to-phenotype prediction. Specifically, PopGenAdapt uses a 4-layer MLP with GELU activations, layer normalization, and a residual connection.
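
The described backbone (a 4-layer MLP with GELU, layer normalization, and a residual connection) could look roughly like the following PyTorch sketch; the hidden size and the exact placement of the residual are assumptions:

import torch
import torch.nn as nn

class GenotypeMLP(nn.Module):
    """Illustrative 4-layer MLP with GELU activations, LayerNorm, and a residual connection."""
    def __init__(self, n_snps, hidden=512, n_classes=2):
        super().__init__()
        self.input_proj = nn.Sequential(nn.Linear(n_snps, hidden), nn.GELU(), nn.LayerNorm(hidden))
        self.block = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
        )
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = self.input_proj(x)
        h = h + self.block(h)            # residual connection around the hidden block
        return self.head(h)

model = GenotypeMLP(n_snps=10_000)
logits = model(torch.randn(8, 10_000))   # 8 individuals, 10,000 SNP features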





□ CUDASW++4.0: Ultra-fast GPU-based Smith-Waterman Protein Sequence Database Search

>> https://www.biorxiv.org/content/10.1101/2023.10.09.561526v1

CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. This approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions.

The parallelization scheme is based on computing an independent alignment for each (sub)warp. A (sub)warp consists of synchronized threads executed in lockstep that can communicate using warp shuffles; within a (sub)warp, threads cooperatively compute DP matrix cell values.





□ cgMSI: pathogen detection within species from nanopore metagenomic sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05512-9

cgMSI formulates strain identification as a maximum a posteriori (MAP) estimation problem to take both sequencing errors and genome similarity between different strains into consideration for accurate strain-typing at low abundance.

cgMSI uses the core genome, and selects candidate strains using MAP probability estimation. After that, cgMSI maps the aligned reads to the full reference genomes of the candidate strains and identifies the target strain using the second-stage MAP probability estimation.





□ Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561790v1

While many GRN platforms have been developed, a majority do not allow for perturbation analyses where a user is able to impose modifications onto a network and invoke a statistical reanalysis to learn how a phenotype might change with new sets of molecular interactions.

Multioviz enables perturbation analyses using Biologically Annotated Neural Networks (BANNs), a class of feedforward Bayesian ML models that integrate known biological relationships to perform association mapping on multiple molecular levels simultaneously.





□ SpeakEasy2: Champagne: Robust, scalable, and informative clustering for diverse biological networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03062-0

SpeakEasy 2: Champagne (SE2) retains the core approach of popularity-corrected label propagation, but aims to reach a more accurate end state. The changes increase accuracy by escaping from label configurations that become prematurely stuck in globally suboptimal states.

SE2 utilizes a common approach in dynamical systems: making larger updates to jump out of suboptimal states, specifically using clusters-of-clusters, which allow it to reach configurations that would not be attained by only updating individual nodes.

SE2 increases runtime efficiency by initializing networks with far fewer labels than nodes, updates nodes to reflect the labels most specific to their neighbors, then divides the labels when their fit to the network drops below a certain level.

This reduced number of labels actually increases the risk that the label assignment becomes stuck in suboptimal solution states, but the more effective clusters-of-clusters meta-updates compensate by allowing escape from such states.





□ GASTON: Mapping the topography of spatial gene expression with interpretable deep learning

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561757v1

GASTON (Gradient Analysis of Spatial Transcriptomics Organization with Neural networks) learns the isodepth of a tissue slice, the vector field of spatial gradients of gene expression, and spatial expression functions for individual genes directly from SRT data.

GASTON models gene expression as a piecewise linear function of the isodepth, thus describing both continuous gradients and sharp discontinuities in gene expression. GASTON reveals the geometry and continuous gene expression gradients of multiple tissues.





□ sincFold: end-to-end learning of short- and long-range interactions for RNA folding

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561771v1

sincFold, an end-to-end deep learning model for RNA secondary structure prediction. Local and distant relationships can be encoded effectively using a hierarchical 1D-2D ResNet architecture, improving the state-of-the-art in RNA secondary structure prediction.

The sincFold model is based on ResNet blocks, bottleneck layers, and a 1D-to-2D projection. It has proven to be better suited to identifying structures that might defy traditional modeling.





□ MkcDBGAS: a reference-free approach to identify comprehensive alternative splicing events in a transcriptome

>> https://academic.oup.com/bib/article/24/6/bbad367/7313457

MkcDBGAS uses a colored de Bruijn graph with dynamic and mixed k-mers to identify bubbles generated by AS with precision higher than 98.17%, and detects AS types overlooked by other tools. MkcDBGAS uses XGBoost to increase the accuracy of classification.

By leveraging the cDBG with mixed k-mers and XGBoost with added motif features, MkcDBGAS accurately predicts all seven types of AS transcriptome-wide using only transcripts. In particular, MkcDBGAS can accurately detect AS in other species, meaning that it is scalable.





□ STew: Uncover spatially informed shared variations for single-cell spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561789v1

STew, a Spatial Transcriptomic multi-viEW representation learning method, jointly characterizes gene expression variation and spatial information in a shared low-dimensional space in a scalable manner.

STew will output distinct spatially informed cell gradients, robust clusters, and statistical goodness of model fit to reveal significant genes that reflect subtle spatial niches in complex tissues.





□ dnctree: Scalable distance-based phylogeny inference using divide-and-conquer

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561902v1

dnctree, a randomized divide-and-conquer heuristic that selectively estimates pairwise sequence distances and infers a tree by connecting increasingly large subtrees. The time complexity is at worst quadratic and appears to scale like O(n log n) on average.





□ Designing efficient randstrobes for sequence similarity analyses

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561924v1

Constructing randstrobes consists of converting strings to integers through a hash function and selecting candidate k-mers to link through a link function and a comparator operator.

The authors recommend always hashing the strobes before linking: it adds little construction-time overhead while improving pseudo-randomness for most link functions.
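
Schematically, a two-strobe randstrobe hashes the first k-mer and then links it to the downstream k-mer that minimizes a hash-based link score; the XOR comparator below is only one of the link-function choices the paper evaluates, and the window parameters are illustrative:

import hashlib

def khash(kmer):
    """Hash a k-mer to a 64-bit integer (any reasonable hash function works here)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def randstrobe(seq, i, k=15, w_min=16, w_max=50):
    """Link the k-mer at position i to the downstream k-mer minimizing the link score."""
    h1 = khash(seq[i:i + k])
    best_j, best_score = None, None
    for j in range(i + w_min, min(i + w_max, len(seq) - k) + 1):
        score = h1 ^ khash(seq[j:j + k])        # one possible link function / comparator
        if best_score is None or score < best_score:
            best_j, best_score = j, score
    return None if best_j is None else (i, best_j, best_score)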




Astrolabe.

2023-10-17 22:17:33 | Science News

(Artwork by Viktor Blinnikov)




□ GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561776v1

GPN-MSA, a novel DNA language model which is designed for genome wide variant effect prediction and is based on the biologically-motivated integration of a multiple-sequence alignment (MSA) across diverse species using the flexible Transformer architecture.

GPN-MSA is trained with a weighted cross-entropy loss, designed to downweight repetitive elements and up-weight conserved elements. As data augmentation in non-conserved regions, prior to computing the loss, the reference is sometimes replaced by a random nucleotide.





□ DEMINING: A deep learning model embedded framework to distinguish DNA and RNA mutations directly from RNA-seq

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562625v1

DEMINING incorporates a deep learning model named DeepDDR, which differentiates expressed DNA mutations (DMs) from RNA mutations (RMs) directly from aligned RNA-seq reads. DEMINING uncovered previously underappreciated DMs and RMs in unannotated AML-associated gene loci.

During model selection, the Light Gradient Boosting Machine (LightGBM), logistic regression, random forest, an RNN, and a hybrid CNN+RNN were evaluated; DeepDDR with two CNN layers and the CNN+RNN hybrid model demonstrated comparable performance.





□ scIBD: a self-supervised iterative-optimizing model for boosting the detection of heterotypic doublets in single-cell chromatin accessibility data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03072-y

scIBD, a scCAS-specific self-supervised iterative-optimizing method to boost the detection of heterotypic doublets. As a simulation-based method, scIBD discards the routine random selection strategy that may yield excessive homotypic doublets in the simulation process.

scIBD uses an adaptive strategy to simulate high-confidence heterotypic doublets and self-supervises doublet detection. scIBD adopts an iterative-optimizing strategy to detect heterotypic doublets iteratively and finally outputs doublet scores based on an ensemble strategy.





□ CellContrast: Reconstructing Spatial Relationships in Single-Cell RNA Sequencing Data via Deep Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2023.10.12.562026v1

cellContrast, a deep-learning method that employs a contrastive learning framework for spatial relationship reconstruction. The fundamental assumption is that GE profiles can be projected into a latent space, where physically proximate cells demonstrate higher similarities.

cellContrast employs a contrastive framework of an encoder-projector. During inference, cellContrast discards the projector and uses the output of the encoder for spatial reconstruction, based on the principle that higher cosine similarity indicates shorter spatial distance.





□ sharp: Automated calibration of consensus weighted distance-based clustering approaches

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad635/7320014

The proposed consensus weighted clustering is controlled by two hyper-parameters: the regularisation parameter and the number of clusters.

These two hyper-parameters are calibrated jointly in a grid search maximising the sharp score, a novel score measuring clustering stability from (weighted) consensus clustering outputs.

The assumption that co-membership probabilities are the same for all pairs of items within a given consensus cluster or between a given pair of consensus clusters, respectively, constitutes a potential limitation of the sharp score.





□ Assessing the limits of zero-shot foundation models in single-cell biology

>> https://www.biorxiv.org/content/10.1101/2023.10.16.561085v1

Geneformer and scGPT exhibit limited reliability in zero-shot settings and often underperform compared to simpler methods. These findings serve as a cautionary note for the deployment of proposed single-cell foundation models.

scGPT defaults to predicting the median bin when only given access to gene embeddings. Masked language modeling (MLM) are not effective at learning gene embeddings, which would also impact Geneformer, given that it produces a cell embedding by averaging over gene embeddings.





□ Relational Composition of Physical Systems: A Categorical Approach

>> https://arxiv.org/abs/2310.06088

The fact that each quadratic form has a unique signature, even though the diagonalizing basis is non-unique, is analogous to how each finite-dimensional vector space has a unique dimension, even though the basis witnessing that dimension is non-unique.

Dirac diagrams are a novel notation inspired by both bond graphs and string diagrams; the authors describe their syntax and semantics. A category of vector spaces with quadratic forms can be constructed using the Grothendieck construction.






□ scTab: Scaling cross-tissue single-cell annotation models

>> https://www.biorxiv.org/content/10.1101/2023.10.07.561331v1

scTab, an automated, feature-attention-based cell type prediction model specific to tabular data, trained using a novel data augmentation scheme across a large corpus of single-cell RNA-seq observations (22.2 million human cells in total).

scTab leverages deep ensembles for uncertainty quantification. Moreover, ontological relationships between labels are accounted for in the model evaluation to accommodate differences in annotation granularity across datasets.

The adapted TabNet architecture for scTab consists of two key building blocks: The first building block is the feature transformer, which is a multi-layer perceptron with batch normalization (BN), skip connections, and a gated linear unit nonlinearity (GLU).





□ scPoli: Population-level integration of single-cell datasets enables multi-scale analysis across samples

>> https://www.nature.com/articles/s41592-023-02035-2

scPoli, an open-world learner that incorporates generative models to learn sample and cell representations for data integration, label transfer and reference mapping.

scPoli introduces two modifications to the CVAE architecture: the replacement of one-hot encoded (OHE) vectors with continuous vectors of fixed dimensionality to represent the conditional term, and the use of cell type prototypes to enable label transfer.





□ Hifieval: Evaluation of haplotype-aware long-read error correction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad631/7321114

Hifieval compares the alignment of the raw read and the alignment of the corrected read. Hifieval evaluates phased assemblies and can distinguish under-corrections and over-corrections.

Hifieval calculates three metrics: correct corrections (CC), errors that are in raw reads but not in corrected reads; under-corrections (UC), errors present in both raw and corrected reads; and over-corrections (OC), new errors found in corrected reads but not in raw reads.
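
With the error positions of a read before and after correction represented as sets, these three metrics reduce to simple set operations, for example:

def correction_metrics(raw_errors, corrected_errors):
    """Count Hifieval-style metrics from per-read error-position sets.

    raw_errors:       positions disagreeing with the reference before correction
    corrected_errors: positions disagreeing with the reference after correction
    """
    cc = raw_errors - corrected_errors          # correct corrections: errors that were fixed
    uc = raw_errors & corrected_errors          # under-corrections: errors left untouched
    oc = corrected_errors - raw_errors          # over-corrections: newly introduced errors
    return len(cc), len(uc), len(oc)

print(correction_metrics({10, 55, 90}, {55, 120}))   # -> (2, 1, 1)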





□ AtaCNV: Detecting copy number variations from single-cell chromatin sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.10.15.562383v1

AtaCNV generates a single-cell read count matrix over genomic bins of 1 million base pairs. Cells and genomic bins are filtered according to bin mappability and the number of zero entries. AtaCNV smooths the count matrix by fitting a first-order dynamic linear model for each cell.

AtaCNV normalizes the smoothed count data against those of normal cells to deconvolute copy number signals from other confounding factors. AtaCNV clusters the cells and identifies a group of high confidence normal cells and normalizes the data against their smoothed depth data.

AtaCNV applies the multi-sample BIC-seq algorithm to jointly segment all single cells and estimates the copy number ratios for each cell in each segment. CNV burden scores are also derived and cells with high CNV scores are regarded as malignant cells.





□ BatchEval Pipeline: Batch Effect Evaluation Workflow for Multiple Datasets Joint Analysis

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561465v1

BatchEval Pipeline performs Min-Max normalization and logarithmic mapping preprocessing on each spot/cell gene expression levels and integrates multiple batches of gene expression data into low-dimensional representations.

BatchEval Pipeline employs the Kruskal-Wallis H test to evaluate the variation in the average level of gene expression across different tissue sections and performs variance analysis on gene expression total counts for each tissue section.





□ TEclass2: Classification of transposable elements using Transformers

>> https://www.biorxiv.org/content/10.1101/2023.10.13.562246v1

TEclass2, a new architecture based on the Longformer model for the classification of selected TE sequences, including various sequence-specific augmentations, a specialized k-mer tokenizer, and sliding-window dilation.

TEclass2 is an all-in-one classifier that can be used to rapidly predict TE orders and superfamilies using TE models built upon the Transformer architecture. For TE DNA sequences, TEclass2 uses only the encoder-block, followed by a classification head as in a linear layer.





□ SPACO: Dimension Reduction by Spatial Components Analysis Improves Pattern Detection in Multivariate Spatial Data

>> https://www.biorxiv.org/content/10.1101/2023.10.12.562016v1

SPACO (Spatial Component Analysis), a proximity-aware kernel method for spatial data. By replacing PCA's global variance target with Moran's I, a measure of local (co)variance, SPACO constructs an ordered sequence of basis vectors, the spatial components (SpaC).

Orthogonal data projection onto the first k SpaCs maximises Moran's I, thereby pooling evidence of spatial dependence across genes with similar patterns. This enhances the sensitivity and spatial precision of the signal.
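
For orientation, Moran's I for a single gene x over spots with spatial weight matrix W is the statistic SPACO targets in place of variance; a direct computation:

import numpy as np

def morans_i(x, W):
    """Moran's I of one variable x over locations with spatial weights W (n x n)."""
    z = np.asarray(x, float) - np.mean(x)
    n = len(z)
    return (n * (z @ W @ z)) / (W.sum() * (z @ z))

# toy example: 4 spots on a line, adjacent spots weighted 1
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
print(morans_i([1.0, 1.2, 3.0, 3.1], W))   # positive: neighboring spots have similar values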





□ CAAStools: a toolbox to identify and test Convergent Amino Acid Substitutions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad623/7319365

CAAStools, a toolbox to identify and validate CAAS in a phylogenetic context. CAAStools implements different testing strategies through bootstrap analysis. CAAStools is designed to be included in parallel workflows and is optimized to allow scalability at proteome level.





□ Semla: A versatile toolkit for spatially resolved transcriptomics analysis and visualization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad626/7319366

semla, a toolbox for data processing, exploration, analysis, and visualization of spatial gene expression patterns in tissues. Semla takes advantage of the tidyverse framework for data handling and the patchwork framework for customizable visualization.

semla requires data generated with the Visium Gene Expression profiling platform, including expression matrices, histological images and spot coordinate files produced with the 10x Genomics Space Ranger pipeline.





□ Ggkegg: analysis and visualization of KEGG data utilizing grammar of graphics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad622/7319364

ggkegg extends existing KEGG-related packages. It retrieves information such as KEGG PATHWAY and MODULE entries, formats them into a structure that is easy to analyze, and offers a series of functions for further analysis and visualization.

ggkegg can also be viewed as an extension of ggplot2, an R package that deconstructs graphical components and composes images as grammar of graphics and serves as the foundation for visualization in numerous publications on bioinformatics.





□ GeneSegNet: a deep learning framework for cell segmentation by integrating gene expression and imaging

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03054-0

GeneSegNet makes a joint use of gene spatial coordinates and imaging information for cell segmentation, and is recursively learned by alternating between the optimization of network parameters and estimation of training labels for noise-tolerant training.

GeneSegNet exploits both imaging information and spatial locations of RNA reads for cell segmentation, based on a general U-Net architecture. U-Net downsamples convolutional features several times and then reversely upsamples them in a mirror-symmetric manner.





□ scHiCDiff: Detecting Differential Chromatin Interactions in Single-cell Hi-C Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad625/7320006

scHiCDiff, a novel statistical software tool that applies two non-parametric tests (KS and CVM) and two parametric models (NB and ZINB) to distinguish bin pairs showing significant changes in contact frequencies between two groups of scHi-C data.

scHiCDiff detects DCIs. Each scHi-C data is imputed by a Gaussian convolution filter to tackle the sparsity issue, then processed by scHiNorm w/ the Negative Binomial Hurdle option to remove systematic biases, and finally normalized for the cell-specific genomic distance effect.





□ iLSGRN: Inference of large-Scale Gene Regulatory Networks based on multi-model fusion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad619/7321113

iLSGRN reconstructs large-scale GRNs from steady-state and time-series GE data based on nonlinear ODEs. The regulatory gene recognition algorithm calculates the Maximal Information Coefficient and excludes redundant regulatory relationships to achieve dimensionality reduction.

The feature fusion algorithm constructs a model leveraging the feature importance derived from XGBoost and Random Forest models, which can effectively train the nonlinear ODEs model of GRNs and improve the accuracy and stability of the inference algorithm.





□ scLinaX: Quantification of the escape from X chromosome inactivation with the million cell-scale human single-cell omics datasets reveals heterogeneity of escape across cell types and tissues

>> https://www.biorxiv.org/content/10.1101/2023.10.14.561800v1

scLinaX directly quantifies relative gene expression from the inactivated X chromosome with droplet-based scRNA-seq data. scLinaX-multi, an extension for the multiome (RNA + ATAC) dataset to evaluate the escape at the chromatin accessibility level.

First, pseudobulk allele-specific expression profiles are generated for cells expressing each candidate reference SNP. Then, alleles of the reference SNPs on the same X chromosome are listed by correlation analysis of the pseudobulk ASE profiles.

scLinaX assigns to each cell which X chromosome is inactivated, based on the allelic expression of the reference SNPs, generates a nearly completely XCI-skewed condition in silico, and estimates the ratio of expression from Xi.





□ Asterics: a simple tool for the ExploRation and Integration of omiCS data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05504-9

ASTERICS is designed to make both standard and complex exploratory and integration analysis workflows easily available to biologists and to provide high quality interactive plots.

ASTERICS allows the integration of multiple omics: it includes exploratory analyses able to explain the typology of individuals described by omics data and/or phenotypic characters simultaneously obtained at different levels of the living organism.





□ AIWrap: Artificial Intelligence based wrapper for high dimensional feature selection

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05502-x

AIWrap, a novel Artificial Intelligence-based wrapper algorithm. The algorithm predicts the performance of an unknown feature subset using an AI model, referred to here as the Performance Prediction Model (PPM).

The performance of AIWrap is evaluated and compared with standard algorithms like LASSO, Adaptive LASSO (ALASSO), Group LASSO (GLASSO), Elastic net (Enet), Adaptive Elastic net (AEnet) and Sparse Partial Least Squares (SPLS) for both the simulated datasets and real data studies.





□ GenePT: a simple but hard-to-beat foundation model for genes and cells built from ChatGPT

>> https://www.biorxiv.org/content/10.1101/2023.10.16.562533v1

GenePT demonstrates that LLM embedding of literature is a simple and effective path for biological foundation models. GenePT achieves comparable, and often better, performance than Geneformer and other methods.

GenePT generates single-cell embeddings in two ways: (i) by averaging the gene embeddings, weighted by each gene’s expression level; or (ii) by creating a sentence embedding for each cell, using gene names ordered by the expression level.
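
A minimal sketch of option (i): a cell embedding computed as the expression-weighted average of per-gene embeddings. The embedding matrix here is random; in GenePT it comes from an LLM embedding of each gene's text description.

```python
# Sketch: expression-weighted average of gene embeddings -> cell embedding.
import numpy as np

rng = np.random.default_rng(3)
n_genes, dim = 1000, 64
gene_emb = rng.normal(size=(n_genes, dim))            # one embedding vector per gene
expr = rng.poisson(1.0, size=n_genes).astype(float)   # one cell's expression vector

w = expr / expr.sum()        # normalize expression to weights
cell_emb = w @ gene_emb      # weighted average -> cell embedding
print(cell_emb.shape)        # (64,)
```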





□ TDS: Privacy-Preserving Federated Genome-wide Association Studies via Dynamic Sampling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad639/7323577

TDS (Two-Step Dynamic Sampling) is a new efficient, privacy-preserving federated GWAS framework. In the first phase, local parties collaboratively identify loci in their local data that are not significantly associated with the phenotype.

This phase substantially curbs computation and communication costs by removing a large number of non-significant loci from subsequent analysis.

In the second phase, all the local parties iteratively share portions of their private datasets with the server. The server performs GWAS on the pooled data and returns the results to the local parties.
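
A hedged sketch of the first phase only, with synthetic allele-count tables and a chi-square test standing in for whatever association statistic the protocol actually uses: loci deemed non-significant by every party are dropped before the pooled analysis.

```python
# Sketch: each party flags locally non-significant loci; the intersection is removed.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(4)
n_loci, n_parties, alpha = 200, 3, 0.05

def local_nonsignificant(rng):
    nonsig = np.zeros(n_loci, dtype=bool)
    for i in range(n_loci):
        table = rng.integers(5, 50, size=(2, 2))  # case/control x allele counts
        _, p, _, _ = chi2_contingency(table)
        nonsig[i] = p >= alpha                    # locally non-significant locus
    return nonsig

drop = np.all([local_nonsignificant(rng) for _ in range(n_parties)], axis=0)
print("loci removed before phase 2:", int(drop.sum()))
```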





□ GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03067-9

The concept of “Grade of Membership Differential Expression” (GoM DE) builds upon existing methods to analyze differential expression. By extending these established techniques, we can explore a variety of cell features beyond just discrete cell populations.

GoM DE investigates how to interpret the individual dimensions of a parts-based representation learned by fitting a topic model (in the topic model, the dimensions are also called “topics”).

The GoM DE analysis yields much larger LFC estimates of the cell-type-specific genes. This is because the topic model isolates the biological processes related to cell type while removing background biological processes that do not relate to cell type.
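
A hedged sketch of the grades-of-membership representation: fit a topic model to a cells-by-genes count matrix and read off per-cell topic proportions. sklearn's LDA is used purely as a stand-in for the topic model fit in the paper.

```python
# Sketch: topic proportions (grades of membership) per cell from a count matrix.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(5)
counts = rng.poisson(1.0, size=(500, 300))   # cells x genes UMI-like counts

lda = LatentDirichletAllocation(n_components=6, random_state=0)
L = lda.fit_transform(counts)                # cells x topics memberships
L = L / L.sum(axis=1, keepdims=True)         # grades of membership per cell
print(L[0])                                  # one cell's topic proportions
```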





□ SPIRAL: integrating and aligning spatially resolved transcriptomics data across different experiments, conditions, and technologies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03078-6

SPIRAL effectively integrates data in both feature space (low-dimensional embeddings and high-dimensional gene expression) and physical space.

SPIRAL combines gene expressions and spatial relationships in the consecutive processes of batch effect removal and coordinate alignment by employing graph-based domain adaption and cluster-aware Gromov-Wasserstein optimal transport.
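
A minimal sketch of the Gromov-Wasserstein piece using POT: spots of two slices are matched through the geometry of their within-slice distance matrices. SPIRAL uses a cluster-aware variant; this plain GW call is only illustrative.

```python
# Sketch: Gromov-Wasserstein coupling between the spots of two slices.
import numpy as np
import ot
from scipy.spatial.distance import cdist

rng = np.random.default_rng(6)
xy1 = rng.uniform(size=(80, 2))    # spot coordinates, slice 1
xy2 = rng.uniform(size=(90, 2))    # spot coordinates, slice 2

C1 = cdist(xy1, xy1)               # within-slice distance matrices
C2 = cdist(xy2, xy2)
p = ot.unif(len(xy1))              # uniform spot weights
q = ot.unif(len(xy2))

T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun="square_loss")
print(T.shape)                     # coupling matrix: slice-1 spots x slice-2 spots
```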





□ DIVE: a reference-free statistical approach to diversity-generating and mobile genetic element discovery

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03038-0

DIVE, a novel reference-free algorithm designed to identify sequences that cause genetic diversification, such as transposable elements within MGE variability hotspots or CRISPR repeats. DIVE operates directly on sequencing reads and does not rely on a reference genome.

DIVE formalizes this logic into a statistical algorithm: it looks for anchor sequences whose neighboring sequences are statistically highly diverse. DIVE processes each read sequentially, using a sliding window to construct target dictionaries for each anchor encountered in the read.
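
A hedged sketch of the anchor/target bookkeeping: slide a window over each read, treat one k-mer as an anchor, record the k-mer found at a fixed downstream offset, and track how many distinct targets each anchor accumulates. Parameters and data structures are illustrative, not DIVE's.

```python
# Sketch: anchor -> target k-mer dictionaries built by sliding over reads.
from collections import defaultdict

k, offset = 8, 12
reads = ["ACGTACGTTTGGCCAATTACGTACGTCCGGAA", "ACGTACGTAACCGGTTACGTACGTGGTTCCAA"]

targets = defaultdict(set)   # anchor k-mer -> set of distinct target k-mers
for read in reads:
    for i in range(len(read) - 2 * k - offset + 1):
        anchor = read[i:i + k]
        target = read[i + k + offset:i + k + offset + k]
        targets[anchor].add(target)

# Anchors with many distinct downstream targets are candidate diversification hotspots.
diversity = {a: len(t) for a, t in targets.items()}
print(max(diversity, key=diversity.get), max(diversity.values()))
```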





□ stVAE deconvolves cell-type composition in large-scale cellular resolution spatial transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad642/7325351

stVAE employs a variational encoder-decoder framework to decompose cell-type mixtures in cellular-resolution spatial transcriptomic data. stVAE scales to large datasets with short running times.

stVAE constructs a pseudo-spatial transcriptomic dataset to guide its training on small spatial transcriptomic datasets, and it accurately captures the sparsity of cell-type composition in the spots of cellular-resolution spatial transcriptomic data.
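
A minimal sketch of pseudo-spot construction: mix a few reference cells of known type into synthetic spots so that the true cell-type proportions are known and can supervise training. The reference matrix and mixing scheme are illustrative.

```python
# Sketch: build pseudo-spots with known cell-type proportions from a scRNA reference.
import numpy as np

rng = np.random.default_rng(7)
n_cells, n_genes, n_types = 300, 200, 5
ref_counts = rng.poisson(1.0, size=(n_cells, n_genes))
ref_types = rng.integers(0, n_types, size=n_cells)

def make_pseudo_spot(cells_per_spot=8):
    idx = rng.choice(n_cells, size=cells_per_spot, replace=False)
    expr = ref_counts[idx].sum(axis=0)                            # summed spot expression
    prop = np.bincount(ref_types[idx], minlength=n_types) / cells_per_spot
    return expr, prop                                             # training pair (x, y)

spots = [make_pseudo_spot() for _ in range(1000)]
X = np.stack([s[0] for s in spots]); Y = np.stack([s[1] for s in spots])
print(X.shape, Y.shape)
```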





□ SEM: sized-based expectation maximization for characterizing nucleosome positions and subtypes

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562727v1

SEM (Size-based Expectation Maximization) is a new nucleosome-calling package. SEM analyzes the overall fragment size distribution to determine which types of nucleosomes are detectable within a given MNase-seq dataset.

SEM employs a hierarchical Gaussian mixture model to accurately estimate the locations and occupancy properties of nucleosomes and to assign subtype identities to each detected nucleosome.
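
A hedged sketch of the size-based component: fit a Gaussian mixture to the fragment-length distribution and read off candidate nucleosome-subtype components. SEM couples this with positional inference; only the size model is illustrated here.

```python
# Sketch: Gaussian mixture over MNase-seq fragment lengths.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
# Hypothetical fragment lengths: subnucleosome-sized and mononucleosome-sized modes.
sizes = np.concatenate([rng.normal(110, 15, 4000), rng.normal(150, 12, 8000)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(sizes.reshape(-1, 1))
for mu, var, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_):
    print(f"component: mean={mu:.0f} bp, sd={var**0.5:.0f} bp, weight={w:.2f}")
```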





□ MOAL: Multi-Omic Analysis at Lab. A simplified methodology workflow to make reproducible omic bioanalysis.

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562686v1

MOAL (Multi Omic Analysis at Lab) is an R package whose omic() function automates most classical tasks. MOAL automates the bioanalysis corresponding to biostatistics and functional integration procedures.

For annotation tasks, symbols are automatically re-annotated using synonym checking to avoid information loss. MOAL also integrates the NCBI ortholog gene database to enable functional enrichment analysis for species whose genes have identified human orthologs.





□ OMICmAge: An integrative multi-omics approach to quantify biological age with electronic medical records

>> https://www.biorxiv.org/content/10.1101/2023.10.16.562114v1

OMICmAge is built on EMRAge, a robust, predictive biological aging phenotype that balances clinical biomarkers with overall mortality risk and can be broadly recapitulated across EMRs.

Subsequently, they applied elastic-net regression to model EMRAge with DNA-methylation (DNAm) and multiple omics, generating DNAmEMRAge and OMICmAge, respectively.
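
A minimal sketch of the modelling step with synthetic data: regress an EMRAge-like phenotype on methylation/omic features with an elastic net, as done in the study to produce DNAmEMRAge and OMICmAge.

```python
# Sketch: elastic-net regression of an aging phenotype on high-dimensional omic features.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(9)
X = rng.normal(size=(400, 2000))                 # e.g. CpG beta values / omic features
beta = np.zeros(2000); beta[:20] = rng.normal(size=20)
age_pheno = X @ beta + rng.normal(scale=1.0, size=400)   # EMRAge-like outcome

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, age_pheno)
print("selected features:", int((model.coef_ != 0).sum()))
```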





□ CRAQ: Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement

>> https://www.nature.com/articles/s41467-023-42336-w

CRAQ (Clipping information for Revealing Assembly Quality), a reference-free tool which maps raw reads back to assembled sequences to identify regional and structural assembly errors based on effective clipped alignment information.

CRAQ can identify assembly errors at different scales and transform error counts into corresponding assembly quality indicators (AQIs) that reflect assembly quality at the regional and structural levels.
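
A hedged sketch of the clipping signal: count reads whose alignments are soft- or hard-clipped along a contig. The file name, per-position tally, and thresholding are illustrative; CRAQ's actual scoring differs.

```python
# Sketch: tally clipped alignments per position of one contig with pysam.
import numpy as np
import pysam

SOFT_CLIP, HARD_CLIP = 4, 5                      # CIGAR operation codes

with pysam.AlignmentFile("reads_vs_assembly.bam", "rb") as bam:   # hypothetical BAM
    contig = bam.references[0]
    clipped = np.zeros(bam.get_reference_length(contig), dtype=int)
    for read in bam.fetch(contig):
        if read.is_unmapped or read.cigartuples is None:
            continue
        ops = [op for op, _ in read.cigartuples]
        if SOFT_CLIP in ops or HARD_CLIP in ops:
            clipped[read.reference_start] += 1   # clip evidence at alignment start

# Positions with an excess of clipped alignments are candidate assembly errors.
print(np.argsort(clipped)[::-1][:10])
```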