lens, align.

Long is the time, but the true comes to pass.

ARBITER.

2023-07-31 19:17:37 | Science News
(Art by William Bao)




□ HYFA: Hypergraph factorization for multi-tissue gene expression imputation

>> https://www.nature.com/articles/s42256-023-00684-8

HYFA (hypergraph factorization) is genotype agnostic, supports a variable number of collected tissues per individual, and imposes strong inductive biases to leverage the shared regulatory architecture of tissues and genes.

HYFA employs a custom message-passing neural network that operates on a 3-uniform hypergraph. HYFA infers latent metagene values for the target tissue—a hyperedge-level prediction task—and maps these representations back to the original gene expression space.





□ Charting cellular differentiation trajectories with Ricci flow

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549833v1

Modern interpretations of Waddington's Landscape have re-framed cell fate trajectories via the phase space of transcriptomic dynamics. The paper presents a framework that employs a discrete Ricci curvature and normalized Ricci flow to predict dynamic trajectories between temporally linked gene expression samples.

Network entropy and the total Forman-Ricci curvature are related quantities but not interchangeable. The paper reports a positive correlation between network entropy and the total discrete curvature of a biological network, obtained by appealing to results on metric-measure spaces.
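As a rough illustration of the curvature quantity mentioned above, the sketch below computes the simplest combinatorial form of Forman-Ricci curvature on an unweighted, undirected graph, F(u, v) = 4 - deg(u) - deg(v), and sums it over all edges. The toy graphs and function names are hypothetical, and the paper's own weighted formulation and normalized Ricci flow are not reproduced here.

```python
from collections import defaultdict


def forman_ricci(edges):
    """Return {edge: curvature} for an unweighted, undirected simple graph.

    Uses the basic combinatorial form F(u, v) = 4 - deg(u) - deg(v),
    which ignores edge weights and higher-order faces (an assumption).
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}


def total_curvature(edges):
    """Sum the edge curvatures over the whole graph."""
    return sum(forman_ricci(edges).values())


# Toy graphs (hypothetical): a 4-node path and a 3-leaf star.
path = [("a", "b"), ("b", "c"), ("c", "d")]
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
```

In this simplified form, edges attached to high-degree hubs receive lower curvature than edges in sparse, chain-like regions.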





□ GraphChainer: Chaining for Accurate Alignment of Erroneous Long Reads to Acyclic Variation Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad460/7231478

A new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer.

GraphChainer connects the anchor paths to obtain a longer path, which is then reported as the answer. GraphChainer splits its solution whenever a path joining consecutive anchors is longer than some parameter g = colinear-gap, and reports the longest path after these splits.
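The splitting rule described above can be sketched in a few lines. This is only a toy model under stated assumptions: anchors and joining paths are reduced to their lengths, the function name and return format are hypothetical, and nothing here reflects GraphChainer's actual data structures.

```python
def split_and_pick_longest(anchor_lengths, gap_lengths, g):
    """Split a chain at gaps longer than g and return the longest piece.

    anchor_lengths[i] is the length of the i-th anchor path and
    gap_lengths[i] is the length of the path joining anchor i to anchor i+1.
    Returns (first_anchor_index, last_anchor_index, total_length).
    """
    if len(gap_lengths) != len(anchor_lengths) - 1:
        raise ValueError("need exactly one gap between consecutive anchors")
    best = None
    start = 0
    length = anchor_lengths[0]
    for i, gap in enumerate(gap_lengths):
        if gap > g:
            # Close the current piece and start a new one at anchor i + 1.
            if best is None or length > best[2]:
                best = (start, i, length)
            start = i + 1
            length = anchor_lengths[i + 1]
        else:
            # The joining path is short enough: extend the current piece.
            length += gap + anchor_lengths[i + 1]
    if best is None or length > best[2]:
        best = (start, len(anchor_lengths) - 1, length)
    return best
```

With anchors of length 100, 50, 200 and 80, gaps of 10, 5000 and 20, and g = 1000, the chain is cut at the 5000-long gap and the second piece (anchors 2 to 3) is reported.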




□ GNNome Assembly: Untangling genome assembly graphs with graph neural networks

>> http://talks.cam.ac.uk/talk/index/202234

GNNome Assembly consists of simulating the synthetic reads, generating the assembly graphs, and decoding edge probabilities with greedy search. The selected path is translated into a contig of the reconstructed genome by concatenating the overlapping reads in the path.

GNNome Assembly constructs assembly graphs using Raven. GatedGCN is utilized to compute d-dimensional representations of nodes and edges. A Multi-Layer Perceptron classifier then outputs a probability indicating whether a given edge can lead to the optimal reconstruction.
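The greedy decoding step can be pictured as a walk that always follows the most probable outgoing edge. The sketch below assumes the edge probabilities are already available as a nested dictionary; the read names, the probabilities and the function itself are hypothetical and are not taken from the GNNome code.

```python
def greedy_decode(edge_prob, start):
    """Walk from `start`, always taking the most probable unvisited successor.

    edge_prob maps node -> {successor: probability}, where each probability
    stands in for a classifier score on that edge (toy values here).
    """
    path = [start]
    visited = {start}
    node = start
    while True:
        candidates = {
            succ: p
            for succ, p in edge_prob.get(node, {}).items()
            if succ not in visited
        }
        if not candidates:
            return path  # dead end: no unvisited successor remains
        node = max(candidates, key=candidates.get)
        visited.add(node)
        path.append(node)


# Toy assembly graph over four reads (hypothetical scores).
toy_graph = {
    "r1": {"r2": 0.9, "r3": 0.2},
    "r2": {"r3": 0.8, "r4": 0.1},
    "r3": {"r4": 0.7},
}
```

The resulting list of reads is what would then be concatenated, overlap by overlap, into a contig.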





□ cloudrnaSPAdes: Isoform assembly using bulk barcoded RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.07.25.550587v1

cloudrnaSPAdes, a novel tool for de novo assembly of full-length isoforms from barcoded RNA-seq data. It constructs a single assembly graph using the entire set of input reads and further derives paths for each read cloud, closing gaps and fixing sequencing errors in the process.

cloudrnaSPAdes is able to accurately reconstruct full-length transcript sequences from read clouds having coverage as low as 1x, including genes with dozens of different expressing isoforms.





□ GRouNdGAN: GRN-guided simulation of single-cell RNA-seq data using causal generative adversarial networks

>> https://www.biorxiv.org/content/10.1101/2023.07.25.550225v1

GRouNdGAN simulates steady-state and transient-state single-cell datasets where genes are causally expressed under the control of their regulating TFs. GRouNdGAN captures non-linear TF-gene dependences and preserves gene identities, cell trajectories and pseudo-time ordering.

The architecture of GRouNdGAN builds on the causal generative adversarial network (CausalGAN) and includes a causal controller, several target generators, a critic, a labeler and an anti-labeler all implemented as separately parameterized neural networks.





□ seq2cells: Single-cell gene expression prediction from DNA sequence at large contexts

>> https://www.biorxiv.org/content/10.1101/2023.07.26.550634v1

seq2cells uses a transfer learning framework that utilizes Enformer as a pre-trained epigenomic model, to create gene embeddings that capture the sequence logic of transcriptional regulation.

seq2cells can in principle use as a seq2emb module any model that embeds the DNA sequence of the TSS. The Enformer trunk takes as input a one-hot encoded 196,608 base pair DNA sequence and outputs a 3,072-dimensional sequence embedding for each of the central 896 sequence windows.





□ TopGen: Unraveling cell differentiation mechanisms through topological exploration of single-cell developmental trajectories

>> https://www.biorxiv.org/content/10.1101/2023.07.28.551057v1

TopGen, a method that uses the representatives of homology groups to analyze gene expression patterns. In essence, the method involves establishing a common basis for the kernel and image of consecutive boundary maps via the Smith Normal Form.

By calculating the n-th Betti number, we can determine the homology group generator from this shared basis. By hypothesis, cyclic topologies would have oscillatory genes that are transiently active in different parts of the cycle.

The eigenfunctions of the Laplace-Beltrami operator encode the geometry of a manifold in an orthogonal basis of harmonic functions. The discrete versions of these harmonic eigenfunctions also turn out to have oscillatory behavior and are eigenvectors of the discrete Laplacian.
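To make the Betti numbers concrete, here is a minimal sketch for the special case of a graph (a one-dimensional complex), where b0 is the number of connected components and b1 = |E| - |V| + b0 counts independent cycles. This shortcut is an assumption of the example; it does not use the Smith Normal Form and is not the TopGen implementation.

```python
def betti_numbers(vertices, edges):
    """Return (b0, b1) for an undirected graph given as vertex and edge lists."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)

    b0 = len({find(v) for v in vertices})   # connected components
    b1 = len(edges) - len(vertices) + b0    # independent cycles
    return b0, b1


# A 4-cycle (one loop) versus an open path (no loop).
cycle = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])
line = ([0, 1, 2], [(0, 1), (1, 2)])
```

A trajectory graph with b1 = 1 is the kind of cyclic topology in which oscillatory genes would be expected under the hypothesis above.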





□ Theory and models of (∞,ω)-categories

>> https://arxiv.org/abs/2307.11931

The models of (∞,ω)-categories. The main result is to establish a Quillen equivalence between Rezk's complete Segal Θ-spaces and Verity's complicial sets.

The (∞,1)-category corresponding to these two model structures, denoted by (∞,ω)-cat. Its connection with Rezk's complete Segal Θ-spaces allows us to use the globular language, while its connection with complicial sets gives us access to a fundamental operation, the Gray tensor product.

The objective will be to implement standard categorical constructions in the context of (∞,ω)-categories. A special emphasis will be placed on the Grothendieck construction.





□ LegNet: a best-in-class deep learning model for short DNA regulatory regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad457/7230784

LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. LegNet can be used in diffusion generative modeling as a step toward the rational design of gene regulatory sequences.

LegNet-Generator corrects the artificial noise by reverting back point mutations introduced in sequences with known expression levels.

Iterative generation by applying LegNet-Generator induces substitutions in a completely random sequence, i.e. by tricking the model to correct "errors" in the provided random sequence so that upon full correction the resulting promoter provides a desired expression level.





□ LPHash: Locality-preserving minimal perfect hashing of k-mers

>> https://academic.oup.com/bioinformatics/article/39/Supplement_1/i534/7210438

LPHash achieves very compact space by exploiting the fact that consecutive k-mers share overlaps of k - 1 symbols. This allows LPHash to actually break the theoretical log2(e) bits/key barrier for MPHFs.

One used to build a BBHash function over the k-mers and spend 3 bits/k-mer and 100-200 ns per lookup. This work shows that it is possible to do significantly better than this when the k-mers come from a spectrum-preserving string set: less than 0.6-0.9 bits/k-mer and 30-60 ns.
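Two of the facts above are easy to check numerically: the log2(e) lower bound is about 1.44 bits/key, and consecutive k-mers of a string overlap in k - 1 symbols. The snippet below is only a small sanity check with a made-up sequence, not a piece of LPHash.

```python
import math


def kmers(sequence, k):
    """All k-mers of a string, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def consecutive_overlap_ok(sequence, k):
    """Check that each k-mer shares its last k-1 symbols with the next one."""
    ks = kmers(sequence, k)
    return all(a[1:] == b[:-1] for a, b in zip(ks, ks[1:]))


# Lower bound for a general minimal perfect hash function, in bits per key.
MPHF_LOWER_BOUND_BITS = math.log2(math.e)  # roughly 1.4427
```

The 0.6-0.9 bits/k-mer quoted above sits below this bound, which is possible only because the keys are not arbitrary: they come from a spectrum-preserving string set.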





□ DeepDynaForecast: Phylogenetic-informed graph deep learning for epidemic transmission dynamic prediction

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549268v1

DeepDynaForecast, a cutting-edge deep learning algorithm designed for forecasting pathogen transmission dynamics. DeepDynaForecast was trained on in-depth data and used more information from the phylogenetic tree, allowing classification of samples according to their dynamics.

DeepDynaForecast incorporates the Primal-Dual Graph Long Short-Term Memory learning architecture. The phylogenetic tree is modeled as a bi-directed graph. DeepDynaForecast can predict near-future transmission dynamics for the external nodes.





□ Unified fate mapping in multiview single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.07.19.549685v1

CellRank 2 models cell-state dynamics from multiview single-cell data. It automatically determines initial and terminal states, computes fate probabilities, charts trajectory-specific gene expression trends, and identifies putative driver genes.

CellRank 2 employs a probabilistic system description wherein each cell constitutes one state in a Markov chain with edges representing cell-cell transition probabilities.

CellRank 2 provides a set of diverse kernels that derive transition probabilities. CellRank 2 generalizes earlier concepts to arbitrary pseudotimes and atlas-scale datasets with the PseudotimeKernel and CytoTRACEKernel. The RealTimeKernel combines across time point transitions.
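The Markov-chain view makes fate probabilities easy to illustrate: given a row-stochastic transition matrix and a set of terminal (absorbing) states, the probability of ending in each terminal state can be found by repeated averaging. The sketch below is a plain-Python toy under these assumptions; the function name and the 4-state chain are hypothetical, and it does not use the CellRank API or its estimators.

```python
def fate_probabilities(P, terminal, n_iter=10_000, tol=1e-12):
    """Absorption probabilities of a Markov chain by fixed-point iteration.

    P is a row-stochastic matrix (list of lists); `terminal` lists the
    indices of absorbing states. F[i][j] is the probability that a walk
    started in state i is absorbed in terminal[j].
    """
    n = len(P)
    F = [[1.0 if i == t else 0.0 for t in terminal] for i in range(n)]
    for _ in range(n_iter):
        new_F = []
        for i in range(n):
            if i in terminal:
                new_F.append(F[i])  # absorbing states keep their one-hot row
            else:
                new_F.append([
                    sum(P[i][k] * F[k][j] for k in range(n))
                    for j in range(len(terminal))
                ])
        delta = max(
            abs(a - b) for ra, rb in zip(F, new_F) for a, b in zip(ra, rb)
        )
        F = new_F
        if delta < tol:
            break
    return F


# Toy chain: states 0 and 3 are terminal, states 1 and 2 step left or right.
P = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
F = fate_probabilities(P, terminal=[0, 3])
```

Here a cell in state 1 reaches terminal state 0 with probability 2/3 and terminal state 3 with probability 1/3.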





□ PINNACLE: Contextualizing protein representations using deep learning on protein networks and single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.07.18.549602v1

PINNACLE (Protein Network-based Algorithm for Contextual Learning), a self-supervised geometric deep learning model adept at generating protein representations through the analysis of protein interactions within various cellular contexts.

In total, PINNACLE's unified multi-scale embedding space comprises 394,760 protein representations, 156 cell type representations, and tissue representations.

PINNACLE generates a distinct representation for each cell type in which a protein-coding gene is activated. PINNACLE learns the topology of proteins, cell types, and tissues by optimizing a unified latent representation space.






□ FraSICL: Molecular Property Prediction by Semantic-invariant Contrastive Learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad462/7233069

FraSICL (Fragment-based Semantic-Invariant Contrastive Learning), a semantic-invariant view generation method by properly breaking molecular graphs into fragment pairs.

FraSICL is an asymmetric model with two branches, the molecule view branch and the fragment view branch. FraSICL is trained by both NT-Xent contrastive loss and an auxiliary similarity loss. In the contrastive loss, two projections of a molecule are treated as a positive pair.
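For readers unfamiliar with NT-Xent, here is a minimal plain-Python sketch of the loss, in which the two projections of the same item form the positive pair and every other projection in the batch acts as a negative. The temperature value and the toy vectors are assumptions of this example; FraSICL's auxiliary similarity loss and its encoders are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss; views_a[i] and views_b[i] are the positive pair for item i."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other projection of the same item
        denom = sum(
            math.exp(cosine(z[i], z[k]) / temperature)
            for k in range(2 * n)
            if k != i
        )
        numer = math.exp(cosine(z[i], z[pos]) / temperature)
        total += -math.log(numer / denom)
    return total / (2 * n)


a = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(a, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = nt_xent(a, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each pair of projections points the same way and large when the pairs are swapped.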





□ XA4C: eXplainable representation learning via Autoencoders revealing Critical genes

>> https://www.biorxiv.org/content/10.1101/2023.07.16.549209v1

XA4C offers optimized autoencoders to process gene expressions at two levels: whole transcriptome (global) autoencoder, and single pathway (local) autoencoders. The decoder is symmetrical to the encoder counterpart to recover the gene expressions.

XA4C disentangles the black box of the neural network of an autoencoder by providing each gene's contribution to the latent variables. XA4C quantifies the Critical index of a gene by averaging the absolute values of its SHapley Additive exPlanations values across all latent variables.
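The averaging step behind the Critical index is simple enough to show directly. The sketch assumes one already-aggregated SHAP value per gene and latent variable; the gene names and numbers are made up, and how XA4C computes or aggregates the SHAP values themselves is not reproduced.

```python
def critical_index(shap_values):
    """Mean absolute SHAP value per gene across latent variables.

    shap_values[gene] is a list with one SHAP value per latent variable
    (assumed to be aggregated over samples beforehand).
    """
    return {
        gene: sum(abs(v) for v in values) / len(values)
        for gene, values in shap_values.items()
    }


# Hypothetical toy input: two genes, three latent variables.
toy_shap = {
    "geneA": [0.4, -0.2, 0.0],
    "geneB": [0.05, -0.05, 0.05],
}
ranking = sorted(critical_index(toy_shap).items(), key=lambda kv: -kv[1])
```

Taking absolute values before averaging means that a gene pushing latent variables in opposite directions still counts as influential.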





□ cycle_finder: de novo analysis of tandem and interspersed repeats based on cycle-finding

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549334v1

cycle_finder constructs a graph structure from low-cost short-read data and constructs units of both types of repeats. The tool can detect cycles with branching and corresponding tandem repeats, and can construct interspersed repeats by exploring non-cycle subgraphs.

cycle_finder can estimate sequences with large copy-number differences. Tandem repeats detected from de Bruijn graphs are output as different sequences if they contain even a single nucleotide difference, so a large number of sequences are detected from sequences in the same cluster.





□ RECOMBINE: Recurrent composite markers of cell types and states

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549344v1

RECOMBINE (recurrent composite markers for biological identities with neighborhood enrichment) is a novel framework. It is a data-driven approach for unbiased selection of composite markers that characterize discrete cell types and continuous cell states in tissue ecosystems.

RECOMBINE selects an optimized set of markers that discriminate hierarchical cell subpopulations. RECOMBINE identifies recurrent composite markers (RCMs) for not only discrete cell types but also continuous cell states with high granularity.





□ AE-TWAS: Autoencoder-transformed transcriptome improves genotype-phenotype association studies

>> https://www.biorxiv.org/content/10.1101/2023.07.23.550223v1

AE-TWAS adds a transformation step before conducting standard TWAS. The transformation is composed of two steps: first splitting the whole transcriptome into co-expression networks, and then using an autoencoder to reconstruct the transcriptome data within each module.

This transformation removes noise (including nonlinear ones) from the transcriptome data, paving the path for downstream TWAS. After transformation, the transcriptome data enjoy higher expression heritability at the low-heritability spectrum and possess higher connectivity.





□ Petasearch: Efficient parallelized peta-scale protein database search

>> https://github.com/steineggerlab/petasearch

Petasearch depends on block-aligner for fast computation of Smith-Waterman alignments in the blockalign module. You can use convert2sradb to convert a FASTA/FASTQ file or an MMseqs2 database into a srasearch database.





□ netSGCCA: Integrating multi-omics and prior knowledge: a study of the Graphnet penalty impact

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad454/7230780

This work studies the effect of injecting prior graphical knowledge as a penalty into a parsimonious variant of the Regularised Generalised Canonical Correlation Analysis (RGCCA) model, namely the Sparse Generalised Canonical Correlation Analysis (SGCCA).

Contrary to Elastic-Net, GraphNet penalty can select a reasonable set of genes and yields informative interpretation from the pathway enrichment analysis. The co-selection of variables is not primarily influenced by the structure of the graph, but rather by its overall density.





□ L-GIREMI uncovers RNA editing sites in long-read RNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03012-w

L-GIREMI (long-read GIREMI) effectively handles sequencing errors and biases in the reads and uses a model-based approach to score RNA editing sites. L-GIREMI allows investigation of RNA editing patterns of single RNA molecules and the co-occurrence of multiple RNA editing events.

L-GIREMI examines the linkage patterns between sequence variants in the same reads, complemented by a model-driven approach, to predict RNA editing sites. L-GIREMI affords high accuracy as reflected by the high fraction of A-to-G sites or known REDIportal sites in its predictions.





□ Voyager: exploratory single-cell genomics data analysis with geospatial statistics

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549945v1

Voyager implements plotting functions for gene expression, cell attributes, and spatial analysis results. The documentation website includes tutorials that demonstrate ESDA on data from multiple spatial -omics technologies, including Visium, Slide-seq, Xenium, CosMX, MERFISH, seqFISH, and CODEX.

Voyager is built on the SFE data structure, which bundles geometries such as cell segmentation polygons with gene expression data. While Voyager is focused on spatial data, neighborhood view ESDA methods can be applied to the k-nearest-neighbor graph in gene expression PCA space.





□ LOCLA: A Novel Genome Optimization Tool for Chromosome-Level Assembly across Diverse Sequencing Techniques

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549842v1

LOCLA (Local Optimization for Chromosome-Level Assembly) identifies reads and contigs aligned locally with high quality on gap flanks or scaffold boundaries of draft assemblies for gap filling and scaffold connection. LOCLA applies to both de novo and reference-based assemblies.

LOCLA can utilize reads produced by diverse sequencing techniques, e.g., 10x Genomics Linked-Reads, and PacBio HiFi reads. LOCLA enhances the draft assemblies by recovering 27.9 million bases and 35.7 million bases of the sequences discarded by the reference-guided assembly tool.





□ MAGICAL: Mapping disease regulatory circuits at cell-type resolution from single-cell multiomics data

>> https://www.nature.com/articles/s43588-023-00476-5

MAGICAL (Multiome Accessibility Gene Integration Calling and Looping), a hierarchical Bayesian approach that leverages paired scRNA-seq and transposase-accessible chromatin sequencing from different conditions to map disease-associated TFs and genes as regulatory circuits.

Using Gibbs sampling, MAGICAL iteratively estimates variable values and optimizes the states of circuit TF–peak–gene linkages.

MAGICAL introduces hidden variables for explicitly modeling the transcriptomic and epigenetic signal variations between conditions and optimization against the noise in both scRNA-seq and scATAC-seq datasets. MAGICAL reconstructs regulatory circuits at cell-type resolution.





□ ISLET: individual-specific reference panel recovery improves cell-type-specific inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03014-8

ISLET (Individual Specific celL typE referencing Tool) estimates the cell-type-specific gene expression reference panel for each participant. The unobserved panel per subject is estimated by the expectation-maximization (EM) algorithm in a mixed-effect regression model.

ISLET leverages multiple or temporal observations of each subject to construct a likelihood-based statistic for csDEG inference. This is the first statistical framework to recover the subject-level reference panel by employing multiple samples per subject.





□ vamos: variable-number tandem repeats annotation using efficient motif sets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03010-y

VNTR Annotation Using Efficient Motifs Set (vamos) finds efficient motif sets using a reference panel of diverse genomes. The StringDecomposer algorithm is integrated into vamos to annotate new genomes sequenced from aligned LRS reads or their assemblies using efficient motif sets.

vamos was used to create a combined VNTR callset across the HGSVC and HPRC assemblies to quantify the diversity of VNTR sequences, and this was compared to the diversity measured by a separate approach that combines calls based on merging similar variants.





□ Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models

>> https://arxiv.org/abs/2307.14587

Transformer-based protein language models provide new opportunities for building global evolutionary models. A variety of Transformer-based models have been developed such as evolutionary scale modeling (ESM), ProGen, ProteinBERT, Tranception and ESM-2.

Persistent hypergraph Laplacians enable the topological description of internal structures or organizations in data. Persistent hyperdigraph Laplacians further allow for the topological Laplacian modeling of directed hypergraphs.

A similar algebraic topology structure is shared by persistent Hodge Laplacians and persistent Laplacians, but the former is a continuum theory for volumetric data and the latter is a discrete formulation for point cloud data.





□ getphylo: rapid and automatic generation of multi-locus phylogenetic trees from genbank files

>> https://www.biorxiv.org/content/10.1101/2023.07.26.550493v1

getphylo, a tool to automatically generate multi-locus phylogenetic trees from GenBank files. It has a low barrier to entry with minimal dependencies. getphylo uses a parallelised, heuristic workflow to keep runtime and system requirements as low as possible.

getphylo consistently produces trees with topologies comparable to other tools in less time. Furthermore, as getphylo does not rely on reference databases, it has a virtually unlimited scope in terms of taxonomy and genetic scale.





□ Gradient-based implementation of linear model outperforms deep learning models

>> https://www.biorxiv.org/content/10.1101/2023.07.29.551062v1

ZINB-Grad uses a scalable algorithm, reminiscent of alternating least squares, for fitting ZINB-WaVE models. In implementing this algorithm, it borrows the stochastic gradient descent-based model fitting machinery used in deep learning.

ZINB-Grad's entropy of batch mixing is better than that of ZINB-WaVE and comparable to that of scVI. ZINB-Grad has a biologically meaningful latent space and performs as well as scVI and ZINB-WaVE regarding data imputation and accounting for technical variations.





□ COMPASS: joint copy number and mutation phylogeny reconstruction from amplicon single-cell sequencing data

>> https://www.nature.com/articles/s41467-023-40378-8

COMPASS (COpy number and Mutation Phylogeny from Amplicon Single-cell Sequencing), a probabilistic model and inference algorithm that can reconstruct the joint phylogeny of SNVs and CNAs from single-cell amplicon sequencing data.

COMPASS models amplicon-specific coverage fluctuations and can efficiently process high-throughput data of thousands of cells. COMPASS vastly outperforms BiTSC in settings where coverage variability resembles targeted scDNAseq.





□ GVP-MSA: Learning protein fitness landscapes with deep mutational scanning data from multiple sources

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(23)00210-7

Geometric Vector Perceptron (GVP)-MSA, a deep learning network to learn the fitness landscapes, in which a 3D equivariant graph neural network was used to extract features from protein structure and a pre-trained model MSA Transformer was applied to embed MSA constraints.

Proof-of-concept trials are designed to validate this training scheme in three aspects: random and positional extrapolation for single-variant effects, zero-shot fitness predictions for new proteins, and extrapolation for higher-order variant effects from single-variant effects.





□ scASfind: Mining alternative splicing patterns in scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.08.19.553947v1

scASfind utilizes an efficient data structure to store the percent spliced-in value for each splicing event. This makes it possible to search for patterns among all differential splicing events, identify marker events, mutually exclusive events, and large blocks of exons.
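As a reminder of what is being stored, percent spliced-in (PSI) is the fraction of reads supporting inclusion of an event, and a pattern search amounts to filtering events by their PSI across a set of cells. The sketch below uses a plain dictionary as a stand-in index; the event and cell names are made up, and scASfind's compressed data structure and API are not reproduced.

```python
def psi(inclusion_reads, exclusion_reads):
    """Percent spliced-in as a fraction in [0, 1]; None if there is no coverage."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


def events_included_in(index, cells, min_psi=0.9):
    """Events whose PSI is at least `min_psi` in every one of the given cells.

    index maps event -> {cell: psi}; cells without a stored value for an
    event are treated as not satisfying the query.
    """
    return sorted(
        event
        for event, per_cell in index.items()
        if all(c in per_cell and per_cell[c] >= min_psi for c in cells)
    )


# Hypothetical toy index: two exon-skipping events in three cells.
toy_index = {
    "exon_7": {"cell1": 0.95, "cell2": 1.0, "cell3": 0.1},
    "exon_12": {"cell1": 0.2, "cell2": 0.3},
}
```

Queries of this kind, run over all events, are what make it possible to look for marker events or mutually exclusive events.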





CHARON.

2023-07-31 19:16:36 | Science News

(Art by William Bao)




□ EMERALD: Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03008-6

EMERALD embraces the diversity of possible alignment solutions, by revealing alignment-safe intervals of the two sequences which appear as conserved (and not even necessarily identical) in the entire space of optimal and suboptimal alignments.

Once all alignment-safe intervals are computed, EMERALD projects these safety intervals back to the representative sequence, thereby annotating the sequence intervals that are robust across all possible alignment configurations within the suboptimal alignment space.





□ Identifying Clusters in Graph Representations of Genomes

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549917v1

Finding a set of vertex-disjoint paths with a maximum score in a weighted directed graph. They defined the maximum-score disjoint paths problem and provided two algorithms for solving it.

The algorithm runs in linear time on n-layered bubble graphs, which can represent pangenomes expressed as elastic-degenerate strings. A fixed-parameter tractable algorithm runs on general DAGs in time O(2^w · w · |V|) where w is the width of a special directed path decomposition.





□ ChromatinHD connects single-cell DNA accessibility and conformation to gene expression through scale-adaptive machine learning

>> https://www.biorxiv.org/content/10.1101/2023.07.21.549899v1

ChromatinHD inputs raw fragments in a neural network architecture, transforms this positional encoding into a fragment embedding, pools the fragment information for each cell and gene, and predicts the gene expression using one or more non-linear layers.

ChromatinHD can capture co-predictivity between two fragments. ChromatinHD captures dependencies between fragment size and gene expression, for example to capture whether larger fragments are more predictive for gene expression than smaller fragments.





□ Splam: a deep-learning-based splice site predictor that improves spliced alignments

>> https://www.biorxiv.org/content/10.1101/2023.07.27.550754v1

The Splam algorithm focuses on training the model to recognize splice junction patterns at the "splice junction" level; i.e., it attempts to recognize donor and acceptor sites in pairs, just as the spliceosome operates in the cell when it splices out an intron.

The Splam model consists of 20 residual units, each containing two convolutional layers, and each convolutional layer follows a batch normalization and a Leaky rectified linear unit (LReLU).

Splam can run on alignment files of either single-end or paired-end RNA-Seq samples. Any alignment containing any spurious splice junctions is removed, and if it is paired, Splam updates the flags to unpair reads for both the aligned read and its mate.





□ Accurate sequencing of DNA motifs able to form alternative (non-B) structures

>> https://genome.cshlp.org/content/early/2023/07/10/gr.277490.122.abstract

A probabilistic approach to determine the number of false positives at non-B motifs depending on sample size and variant frequency, applied to publicly available data sets: 1000 Genomes, Simons Genome Diversity Project, and gnomAD.

Elevated sequencing errors at non-B DNA motifs should be considered in low-read-depth studies (single-cell, ancient DNA, pooled-sample population sequencing) and in scoring rare variants. Combining technologies should maximize sequencing accuracy in future studies of non-B DNA.





□ Taxor: Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549822v1

Taxor, a new tool for long-read metagenomic classification using a hierarchical interleaved XOR filter data structure. Taxor implements k-mer-based approaches such as syncmers for pseudo-alignment to classify reads and an Expectation-Maximization algorithm.

Taxor computes the k-mer content of the input reference genomes and creates an index for each set of reference genomes. The index is a hierarchical interleaved XOR filter (HIXF), a novel space-efficient data structure for approximate membership queries.





□ quickBAM: a parallelized BAM file access API for high throughput sequence analysis informatics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad463/7232227

quickBAM uses the BAM index file (BAI) for parallel data reading, and uses the scatter/gather programming paradigm to parallelize computation tasks over many different genomic regions. quickBAM has the potential to significantly shorten end-to-end analysis turnaround.

When the BAI is available, it utilizes the fixed-bin indices which contain the starting file offset of each 16-kb genomic window. When the BAI is not available (unindexed), it uses a heuristic scanner to directly locate multiple starting locations for parallel parsing.
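The 16-kb windows lend themselves to a simple scatter step: map positions to window numbers and hand contiguous runs of windows to workers. The sketch below assumes 0-based positions and 2^14-base windows; the function names are hypothetical and this is not the quickBAM API, which works with file offsets rather than window numbers.

```python
WINDOW_SHIFT = 14  # 2**14 = 16,384 bases per window


def window_index(pos):
    """Window number containing a 0-based genomic position."""
    return pos >> WINDOW_SHIFT


def scatter_windows(region_start, region_end, n_workers):
    """Split the windows overlapping [region_start, region_end) into chunks.

    Returns a list of contiguous window-number lists, at most one per worker.
    """
    if region_end <= region_start:
        return []
    first = window_index(region_start)
    last = window_index(region_end - 1)
    windows = list(range(first, last + 1))
    chunk = -(-len(windows) // n_workers)  # ceiling division
    return [windows[i:i + chunk] for i in range(0, len(windows), chunk)]
```

Each worker would then parse only the reads in its own windows, and a gather step would combine the per-worker results.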





□ SPLASH: a statistical, reference-free genomic algorithm unifies biological discovery

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549408v1

SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), an approach that directly analyzes raw sequencing data to detect a signature of regulation: sample-specific sequence variation.

SPLASH relies on a simple formalization of sequence variation - short stretches of varying sequences, targets, adjacent to short stretches of a constant sequence, anchors. SPLASH steps through all positions in all reads of all samples, recording all anchor-target pairs.
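The anchor-target bookkeeping can be sketched as a nested counter: for each anchor, count which targets follow it in each sample. The k-mer length, the optional gap between anchor and target, and the toy reads are assumptions of this example, and the statistical test SPLASH applies to these counts is not shown.

```python
from collections import Counter, defaultdict


def anchor_target_counts(reads_by_sample, k=4, gap=0):
    """Count targets following each anchor, per sample.

    Returns counts[anchor][sample][target], where the anchor is a k-mer and
    the target is the k-mer starting `gap` bases after the anchor ends.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    span = 2 * k + gap
    for sample, reads in reads_by_sample.items():
        for read in reads:
            for i in range(len(read) - span + 1):
                anchor = read[i:i + k]
                target = read[i + k + gap:i + span]
                counts[anchor][sample][target] += 1
    return counts


# Toy input: the same anchor is followed by a different target in each sample.
toy_reads = {"s1": ["ACGTAAAA", "ACGTAAAA"], "s2": ["ACGTCCCC"]}
counts = anchor_target_counts(toy_reads, k=4)
```

An anchor whose target distribution differs between samples, as with ACGT here, is the kind of sample-specific sequence variation the method looks for.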




□ DeepTraSynergy: Drug Combinations using Multi-modal Deep Learning with Transformers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad438/7226508

DeepTraSynergy uses a transformer-based architecture to extract features from drugs. One of the main advantages of a transformer, and the reason the authors adopted it, is that it provides context for any position in the drug molecule.

The transformer-based feature extractor simultaneously captures the local structure and encodes the long-range dependencies. DeepTraSynergy outperforms GraphSynergy and NexGB for the prediction of synergistic drug pairs.





□ Mandalorion: Identifying and quantifying isoforms from accurate full-length transcriptome sequencing reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02999-6

Mandalorion v4.1 identifies isoforms with very high recall and precision when applied to either spike-in or simulated data with known ground-truth isoforms. Mandalorion had a distinct performance lead when tools were run entirely without annotation files.

Mandalorion shows the equivalent performance when run on ONT-based R2C2 data or a mix of the two data types. Mandalorion compares favorably to StringTie, Bambu, and IsoQuant—especially in the absence of genome annotation.





□ mEthAE: an Explainable AutoEncoder for methylation data

>> https://www.biorxiv.org/content/10.1101/2023.07.18.549496v1

CpGs are strongly encoded in a common latent feature due to spatial proximity on the chromosome, forming linkage disequilibrium (LD)-like clusters. CpGs highly perturbed for the same latent feature are spatially not clustered together on the chromosome.

mEthAE, a chromosome-wise autoencoder framework for interpretable dimensionality reduction of methylation data. mEthAE is based on latent feature perturbations and yields groups of related CpGs at the latent-feature specific (local) as well as the embedding-wide (global) level.





□ PseudoCell: A collaborative network for in silico prediction of regulatory pathways

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549793v1

Based on a systemic perspective, the PseudoCell tool implements a set of computational methods for asynchronous logical model simulation, including the definition of perturbations in constant or pulsatile frequency, as well as knockout emulation.

In PseudoCell the state of a given node n was given by a discrete or continuous number and assumed values in the interval [0, Max_n], where Max_n is the maximum state value described for that component.

Whenever possible, boolean values were assumed to describe the activation state of the nodes to represent the threshold from which this element can elucidate a certain biological effect.





□ BuDDI: Bulk Deconvolution with Domain Invariance to predict cell-type-specific perturbations from bulk

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549951v1

BuDDI (BUlk Deconvolution with Domain Invariance) utilizes domain adaptation techniques to effectively integrate available corpora of case-control bulk and reference scRNA-seq observations to infer cell-type-specific perturbation effects.

BuDDI achieves this by learning independent latent spaces within a single variational autoencoder (VAE) encompassing at least four sources of variability: 1) cell-type proportion, 2) perturbation effect, 3) structured experimental variability, and 4) remaining variability.





□ SCROAM: Highly accurate estimation of cell type abundance in bulk tissues based on single-cell reference and domain adaptive matching

>> https://www.biorxiv.org/content/10.1101/2023.07.22.550132v1

SCROAM transforms scRNA-seq and bulk RNA-seq data into a shared feature space, effectively eliminating distributional differences in the latent space, and then generates cell-type-specific expression matrices.

When constructing a feature matrix from scRNA-seq data, SCROAM does not rely on average expression; instead, each gene is weighted according to its cell-specific score, allowing larger gene sets to be used in deconvolution.





□ PAIA: Prior Information Assisted Integrative Analysis of Multiple Datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad452/7230782

For regularizing estimation and selecting relevant variables, penalization and other regularization techniques are routinely adopted. "Blindly" searching over a vast number of variables may not be efficient.

In the first step, a CNN model with active learning has been proposed to extract comprehensive and accurate prior information from published studies. In the second step, the prior information has been incorporated for integrative variable selection with group LASSO.





□ hadge: a comprehensive pipeline for donor deconvolution in single cell

>> https://www.biorxiv.org/content/10.1101/2023.07.23.550061v1

hadge (hashing deconvolution combined with genotype information) combines 12 methods to perform both hashing- and genotype-based deconvolution. hadge allows for the automatic determination of the best combination of hashing- and SNP-based donor deconvolution tools.

hadge then generates a new assignment of the cells based on this optimal match between hashing- and genotype-based deconvolution to uncover the true donor identity of the cells, effectively rescuing cells from failed hashing with a valid genotype-based deconvolution assignment.
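
The rescue step can be illustrated with a small sketch. It is not hadge's code: the function `rescue_with_genotype`, the labels `negative` and `doublet`, and the majority-overlap matching rule are assumptions chosen for illustration. Genotype clusters are matched to hashing donors by counting shared cells, and cells with a failed hashing call inherit the donor matched to their genotype cluster.

```python
from collections import Counter

def rescue_with_genotype(hashing, genotype, failed=("negative", "doublet")):
    """Match genotype clusters to hashing donors, then rescue failed cells."""
    # Count cells shared by each (genotype cluster, hashing donor) pair
    overlap = Counter(
        (genotype[cell], call)
        for cell, call in hashing.items()
        if cell in genotype and call not in failed
    )
    best = {}
    for (cluster, donor), n in overlap.items():
        if cluster not in best or n > best[cluster][1]:
            best[cluster] = (donor, n)
    mapping = {cluster: donor for cluster, (donor, _) in best.items()}

    final = {}
    for cell, call in hashing.items():
        if call in failed and genotype.get(cell) in mapping:
            final[cell] = mapping[genotype[cell]]  # rescued by genotype
        else:
            final[cell] = call
    return mapping, final

# Hypothetical calls for five cells
hashing = {"c1": "donorA", "c2": "donorA", "c3": "donorB",
           "c4": "negative", "c5": "donorB"}
genotype = {"c1": "g0", "c2": "g0", "c3": "g1", "c4": "g0", "c5": "g1"}

mapping, final = rescue_with_genotype(hashing, genotype)
print(mapping)
print(final["c4"])
```

Cell `c4` failed hashing but sits in genotype cluster `g0`, which overlaps best with `donorA`, so it is reassigned to that donor.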





□ simCAS: an embedding-based method for simulating single-cell chromatin accessibility sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad453/7231479

simCAS provides three simulation modes, namely pseudo-cell-type mode, discrete mode and continuous mode, to generate synthetic data with pseudo-real manifold, discrete clusters and continuous differentiation trajectories.

For the pseudo-cell-type mode, the input of simCAS is the real scCAS data represented by a peak-by-cell matrix, and matched cell type information represented by a vector.

For the discrete or continuous mode, simCAS only requires the peak-by-cell matrix as the input data, followed by automatically obtaining the variation from multiple cell states. The output of simCAS is a synthetic peak-by-cell matrix with a vector of user-defined ground truths.





□ SCISSORS: Sub-Cluster Identification through Semi-Supervised Optimization of Rare-Cell Silhouettes in Single-Cell RNA-Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad449/7232228

SCISSORS employs silhouette scoring to estimate the heterogeneity of clusters and reveals rare cells in heterogeneous clusters through multistep semi-supervised reclustering. SCISSORS also provides a method for identifying marker genes with high specificity to a cell type.

With a pre-processed count matrix, SCISSORS performs an initial clustering step to define broad clusters using conservative parameters. SCISSORS calculates the silhouette score of each cell, which measures how well cells fit within their assigned clusters.
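
For reference, the per-cell silhouette score is s = (b - a) / max(a, b), where a is the mean distance to the other cells in the same cluster and b is the smallest mean distance to any other cluster. Below is a minimal pure-Python sketch of that standard definition; the function name and the toy 2D coordinates are assumptions, and SCISSORS itself is an R package that works on a reduced-dimensional embedding.

```python
import math

def silhouette_scores(points, labels):
    """Standard silhouette score for every point (0.0 for singleton clusters)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        dist_by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dist_by_cluster.setdefault(lab, []).append(math.dist(p, q))
        same = dist_by_cluster.pop(own, [])
        if not same or not dist_by_cluster:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # cohesion within the own cluster
        b = min(sum(d) / len(d) for d in dist_by_cluster.values())  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

# Toy embedding: the last cell lies halfway between the two clusters
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 0.5)]
labels = [0, 0, 1, 1, 0]
scores = silhouette_scores(points, labels)
print([round(s, 3) for s in scores])
```

The ambiguous cell gets a score near zero while well-placed cells score close to one, which is the kind of signal that can flag a cluster as a candidate for reclustering.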





□ CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning

>> https://www.nature.com/articles/s41592-023-01940-w

CheckM2, a machine learning-based tool for predicting isolate, single-cell and MAG quality. CheckM2 builds models suitable for predicting bacterial and archaeal genome completeness and contamination without explicitly considering taxonomic information.

CheckM2 was trained on simulated genomes with known levels of completeness and contamination, benchmarked, and subsequently applied to MAGs from a range of different environments. CheckM2 performed better on MAGs from novel lineages with sparse or no genomic representation.





□ Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

>> https://www.biorxiv.org/content/10.1101/2023.07.25.550582v1

The LRGASP Consortium Organizers produced long-read and short-read RNA-seq data from aliquots of the same RNA samples using a variety of library protocols and sequencing platforms.

The overall design of the LRGASP Challenge aimed for a fair and transparent process of evaluating long-read methods.

The LRGASP effort was announced to the broader research community via social media and the GENCODE main website to recruit tool developers to submit transcript detection and quantification predictions based on the LRGASP data.





□ mAFiA: Detecting m6A at single-molecular resolution via direct-RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2023.07.28.550944v1

The m6A Finding Algorithm (mAFiA) re-uses internal features generated by the backbone neural network during basecalling and assigns an m6A probability, P(m6A), to a specific A on the read.

mAFiA does not require additional post-processing such as nanopolish and can be integrated into an existing basecaller without altering the latter's accuracy.





□ Quantum machine learning for untangling the real-world problem of cancers classification based on gene expressions

>> https://www.biorxiv.org/content/10.1101/2023.08.09.552597v1

skqulacs-QSVM, a hybrid quantum support vector machine algorithm. A quantum kernel is a function determining the resemblance between two quantum states in the feature space.

Employing the kernel, the QML algorithms classify quantum states according to their similarities. The potentially infinite dimension of the kernel Hilbert space makes the kernel approach powerful.
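
As a toy illustration of a quantum kernel, the sketch below computes the fidelity k(x, y) = |<phi(x)|phi(y)>|^2 for a simple angle encoding in which each feature is mapped to one qubit, cos(x/2)|0> + sin(x/2)|1>. The encoding and the function names are assumptions for the example; this is a classical calculation of a small, finite-dimensional case and does not use the skqulacs API.

```python
import math

def feature_state(x):
    """Amplitudes of the single-qubit state cos(x/2)|0> + sin(x/2)|1>."""
    return (math.cos(x / 2), math.sin(x / 2))

def quantum_kernel(xs, ys):
    """Fidelity between two product states, one qubit per feature."""
    k = 1.0
    for x, y in zip(xs, ys):
        ax, bx = feature_state(x)
        ay, by = feature_state(y)
        k *= (ax * ay + bx * by) ** 2  # squared overlap of the two qubit states
    return k

same = quantum_kernel([0.3, 1.2], [0.3, 1.2])
orthogonal = quantum_kernel([0.0], [math.pi])
print(same, orthogonal)
```

Identical inputs give a kernel value of 1 and orthogonal states give 0; such a kernel matrix could then be handed to a classical support vector machine.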





□ Modeling Single Cell Trajectory Using Forward-Backward Stochastic Differential Equations

>> https://www.biorxiv.org/content/10.1101/2023.08.10.552373v1

This FBSDE model integrates the forward and backward movements of two SDEs in time, aiming to capture the underlying dynamics of single-cell developmental trajectories.

The FBSDE model iterates between the Forward and Backward models; traversing through the Forward model generates new simulated data points which are subsequently used as training set by the backward model and vice versa.





□ RUN-DVC: Generalizing deep variant callers via domain adaptation and semi-supervised learning

>> https://www.biorxiv.org/content/10.1101/2023.08.12.549820v1

RUN-DVC optimizes the DVC model through a novel loss function that combines unsupervised and supervised losses from two training modules. First, the unsupervised loss is derived from the semi-supervised learning module that incorporates consistency training within unlabeled data.

The model propagates labels from labeled data to similar unlabeled data, allowing the model to generalize well from known data to unlabeled data. The supervised loss is derived from the random logit interpolation module that aligns embeddings of the source and target domains.





□ Automated convergence diagnostic for phylogenetic MCMC analyses

>> https://www.biorxiv.org/content/10.1101/2023.08.10.552869v1

In the context of MCMC, samples of trees should exhibit near-indistinguishability between independent chains if drawn from the same distribution over the treespace. The presented tree PSRF value quantifies this property of similarity.

Firstly, this approach is based on the properties of a metric treespace, with geometry based on local tree rearrangements, giving it a strong mathematical and statistical foundation.

Secondly, by utilising the first polynomial-time computable tree-rearrangement-based distance, the authors overcome the previous limitations imposed by the computational complexity of such distances.





□ SATL: Species-Agnostic Transfer Learning for Cross-species Transcriptomics Data Integration without Gene Orthology

>> https://www.biorxiv.org/content/10.1101/2023.08.11.552752v1

SATL not only allows knowledge integration and translation across various species without relying on gene orthology but also identifies similar GO biological processes amongst the most influential genes composing the latent space for species integration.

SATL builds on the Cross-Domain Structural Preserving Projection (CDSPP) method, where the model learns a projection matrix for a domain-invariant feature subspace to reduce the discrepancy between domains. It allows the entire dataset to be incorporated in the cross-species analysis.





□ Accurate human genome analysis with Element Avidity sequencing

>> https://www.biorxiv.org/content/10.1101/2023.08.11.553043v1

Element whole genome sequencing achieves higher mapping and variant calling accuracy compared to Illumina sequencing at the same coverage, with larger differences at lower coverages (20x-30x).

Element can generate paired-end sequencing with longer insert sizes than typical short-read sequencing. Longer insert sizes result in even higher accuracy, with long-insert Element sequencing giving noticeably more accurate genome analyses at all coverages.





□ scover: Predicting the impact of sequence motifs on gene regulation using single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03021-9

scover, a convolutional neural network which performs de novo discovery of regulatory motifs and their cell lineage-specific impact on gene expression and chromatin accessibility. It finds weights for these motifs across pseudo-bulks and also reports the 'influence' of each motif.

scover takes as input a set of one-hot encoded sequences, e.g., promoters or distal enhancers, along with measurements of their activity, e.g., expression levels of the associated genes or accessibility levels of the enhancers.
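
One-hot encoding of a DNA sequence can be sketched as below. The function name, the A/C/G/T column order and the all-zero row for ambiguous bases are assumptions for illustration, not scover's own preprocessing code.

```python
def one_hot_encode(sequence, alphabet="ACGT"):
    """Return one row per base with a single 1 in the column of that base."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:  # ambiguous bases such as N stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded

encoded = one_hot_encode("ACGN")
print(encoded)
```

Each encoded sequence would then be paired with its activity measurement to form one training example for the network.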





□ Pythia: Structure-based self-supervised learning enables ultrafast prediction of stability changes upon mutation at the protein universe scale

>> https://www.biorxiv.org/content/10.1101/2023.08.09.552725v1

Pythia, a self-supervised graph neural network tailored for zero-shot ∆∆G predictions. Pythia outshines its contenders with superior correlations while operating with the fewest parameters, and exhibits a remarkable acceleration in computational speed, up to 10^5-fold.

Pythia paves the way for precise anticipation of mutational impacts. This model operates independently of both evolutionary information and manually derived features from energy functions. Instead, it learns the stability directly from the protein structures.





□ Scientific discovery in the age of artificial intelligence

>> https://www.nature.com/articles/s41586-023-06221-2

Examining breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency.

Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances.





□ CAST: Search and Match across Spatial Omics Samples at Single-cell Resolution

>> https://www.biorxiv.org/content/10.1101/2023.08.13.552987v1

CAST (Cross-sample Alignment of SpaTial omics), a deep graph neural network based method enabling spatial-to-spatial searching. CAST aligns tissues based on intrinsic similarities of spatial molecular features and reconstructs spatially resolved single-cell multi-omic profiles.

CAST enables spatially resolved differential analysis to visualize disease-associated molecular pathways and cell-cell interactions, and single-cell relative translational efficiency (scRTE) profiling to reveal variations in translational control across cell types and regions.





□ Bayesian Flow Networks

>> https://arxiv.org/abs/2308.07037

Bayesian Flow Networks, a novel generative model in which the parameters of a set of independent distributions are modified w/ Bayesian inference in the light of noisy data samples, then passed as input to a neural network that outputs a second, interdependent distribution.

Starting from a simple prior and iteratively updating the two distributions yields a generative procedure similar to the reverse process of diffusion models; however it is conceptually simpler in that no forward process is required.

Discrete and continuous-time loss functions are derived for continuous, discretised and discrete data, along with sample generation procedures.

Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling.





□ Chrysalis: decoding tissue compartments in spatial transcriptomics with archetypal analysis

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553606v1

Chrysalis, a novel computational method for the rapid detection of tissue compartments on grid-based ST datasets. Chrysalis identifies unique spatial compartments by archetypal decomposition of the low-dimensional representation derived from the SVG expression profiles.

Chrysalis features a distinctive approach based on maximum intensity projection to visualise various tissue compartments simultaneously, facilitating the rapid characterisation of spatial relationships across the inferred domains.
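
A maximum intensity projection over compartments could look roughly like the sketch below. This is an assumed reading of the idea rather than Chrysalis's implementation: each compartment is given an RGB colour, every spot carries one score per compartment, and each colour channel of a spot takes the maximum of score times colour across compartments. The function name and toy values are hypothetical.

```python
def max_intensity_projection(scores, colors):
    """Blend per-compartment scores into one RGB pixel per spot."""
    image = []
    for spot in scores:
        pixel = tuple(
            max(weight * color[ch] for weight, color in zip(spot, colors))
            for ch in range(3)
        )
        image.append(pixel)
    return image

colors = [(1, 0, 0), (0, 1, 0)]                 # compartment 0 red, compartment 1 green
scores = [[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]]   # one row per spot
image = max_intensity_projection(scores, colors)
print(image)
```

Spots dominated by one compartment keep its colour, while mixed spots show a blend, so several compartments can be inspected in a single image.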





PHAETHON.

2023-07-31 19:13:37 | Science News

(Art by William Bao)




□ MSV: a modular structural variant caller that reveals nested and complex rearrangements by unifying breakends inferred directly from reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03009-5

A description of genomic rearrangements using skew-symmetric graphs. A highlight of the graph model is a folding scheme for adjacency matrices that unifies forward strand and reverse strand.

Maximal Exact Matches (MEMs) are a particular form of seeds, where seeds are equivalences between a reference genome and a read, typically used by an aligner as the basis for alignment computation.
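
To make the definition concrete, here is a naive quadratic-time sketch that enumerates MEMs between a reference and a read. Real aligners use index structures such as suffix arrays or FM-indexes; the function name and the `min_len` cutoff are assumptions for illustration. A match is reported only if it cannot be extended to the left or to the right.

```python
def maximal_exact_matches(reference, read, min_len=3):
    """Return (ref_start, read_start, length) for every MEM of at least min_len."""
    mems = []
    for i in range(len(reference)):
        for j in range(len(read)):
            if reference[i] != read[j]:
                continue
            if i > 0 and j > 0 and reference[i - 1] == read[j - 1]:
                continue  # extendable to the left, so not maximal here
            length = 0
            while (i + length < len(reference) and j + length < len(read)
                   and reference[i + length] == read[j + length]):
                length += 1  # extend to the right as far as possible
            if length >= min_len:
                mems.append((i, j, length))
    return mems

mems = maximal_exact_matches("ACGTACGGA", "TACGG")
print(mems)
```

Here the read matches the reference over its full length at reference position 3, and a shorter MEM of length 3 appears at reference position 0.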

A sequence that occurs once on the reference genome but many times on the sequenced genome equals one or several duplications. Such duplications create cycles in our graph model that can be resolved via a graph traversal.





□ ScHiCEDRN: Single-cell Hi-C data Enhancement with Deep Residual and Generative Adversarial Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad458/7232230

ScHiCEDRN combines customized deep residual networks and convolutional neural networks (CNN) to create a generator to generate the enhanced data from raw low-coverage single-cell Hi-C data.

ScHiCEDRN can generalize well across individual cells of the same cell line or even between different cell types of two very different species. ScHiCEDRN can generate single-cell Hi-C data more suitable for identifying TAD boundaries and reconstructing 3D chromosome structures.





□ Greengenes2 unifies microbial data in a single reference tree

>> https://www.nature.com/articles/s41587-023-01845-1

By inserting sequences into a whole-genome phylogeny, 16S rRNA and shotgun metagenomic data generated from the same samples agree in principal coordinates space, taxonomy and phenotype effect size when analyzed with the same tree.

Greengenes2 is much larger in its coverage than past resources such as SILVA and GTDB. Because the amplicon library is linked to environments labeled with EMPO categories, it can easily identify the environments that contain samples that can fill out the tree.





□ cellCounts: an R function for quantifying 10x Chromium single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad439/7225850

cellCounts adapted the seed-and-vote aligner Subread for mapping Chromium reads. cellCounts performs more sensitive read mapping than Subread, by using more seeds to discover candidate mapping locations and applying a more relaxed voting threshold for calling mapping locations.

Mapped reads will be assigned to genes in each cell using the featureCounts algorithm. Within each gene, assigned reads that share the same UMI tag (allowing one base mismatch) will be reduced to one UMI.
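
The UMI collapsing step can be sketched as follows. The greedy strategy (most frequent UMI first, later UMIs within Hamming distance 1 merged into it) and the function names are assumptions for illustration; the summary above does not specify how cellCounts resolves chains of near-identical UMIs.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_umis(umis_per_gene):
    """Collapse UMIs that differ by at most one base, separately per gene."""
    counts = {}
    for gene, umis in umis_per_gene.items():
        kept = []
        # Visit UMIs from most to least frequent so abundant ones act as anchors
        for umi, _ in Counter(umis).most_common():
            if not any(len(umi) == len(k) and hamming(umi, k) <= 1 for k in kept):
                kept.append(umi)
        counts[gene] = len(kept)
    return counts

reads = {
    "GeneA": ["AACG", "AACG", "AACT", "GGTA"],  # AACT is one mismatch from AACG
    "GeneB": ["TTTT", "CCCC", "GGGG"],
}
umi_counts = count_umis(reads)
print(umi_counts)
```

In this toy input, `GeneA` ends up with two UMIs instead of three because `AACT` is treated as a sequencing error of `AACG`.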





□ singleCellHaystack: A universal tool for predicting differentially active features in single-cell and spatial genomics data

>> https://www.nature.com/articles/s41598-023-38965-2

singleCellHaystack, a method that predicts DEGs based on the distribution of cells in which they are active within an input space. This method does not rely on comparisons between clusters of cells and is applicable to both scRNA-seq and spatial transcriptomics data.

A new method uses cross-validation for choosing a suitable flexibility of splines during its modeling steps. The computational time has been drastically reduced by incorporating several engineering improvements to the base code, including the use of sparse matrices.





□ Cellular proliferation biases clonal lineage tracing and trajectory inference

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549801v1

A fundamental statistical bias that emerges from sampling cell lineage barcodes across a time course. Considering the setting in which cell state and lineage barcodes may be measured simultaneously, and copies of the lineage barcodes may be observed over multiple time points.

A mathematical analysis that proves that the relative abundance of subpopulations is changed, or biased, in multi-time clonal datasets. The source of the bias is heterogeneous growth rates; cells with more descendants are more likely to be represented in multi-time clones.
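
A back-of-the-envelope version of this bias can be computed directly. The model below is a deliberately simplified assumption, not the paper's analysis: every cell at the first time point leaves a fixed number of descendants, each descendant is sampled independently with the same probability, and a clone counts as multi-time if at least one descendant is sampled.

```python
def prob_observed(descendants, sampling_rate):
    """Probability that at least one descendant is sampled at the later time."""
    return 1.0 - (1.0 - sampling_rate) ** descendants

def multi_time_fractions(populations, sampling_rate):
    """Relative abundance of each subpopulation among multi-time clones."""
    weights = {
        name: fraction * prob_observed(descendants, sampling_rate)
        for name, (fraction, descendants) in populations.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical populations: (true fraction, descendants per cell)
populations = {"slow": (0.5, 1), "fast": (0.5, 4)}
observed = multi_time_fractions(populations, sampling_rate=0.1)
print(observed)
```

Although both subpopulations start at 50%, the faster-dividing one makes up roughly 77% of the multi-time clones in this toy setting.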





□ rvTWAS: identifying gene-trait association using sequences by utilizing transcriptome-directed feature selection

>> https://www.biorxiv.org/content/10.1101/2023.07.16.549227v1

rvTWAS uses Sum of Single Effects (SuSiE) to carry out variant selection, forming a prioritized set of genetic variants weighted by their relevance to gene expression. rvTWAS uses a kernel method to aggregate the weighted variants into a score test for the association.

rvTWAS uses the Bayesian feature selection model implemented by SuSiE to select variants that are highly associated with gene expression and aggregates them for association mapping to the phenotype using a weighted kernel. rvTWAS works on one gene at a time.





□ Reliable interpretability of biology-inspired deep neural networks

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549297v1

P-NET is a biology-inspired model trained on patient mutation data. Despite its usefulness, it has notable issues such as variability in interpretation and susceptibility to knowledge biases.

P-NET uses DeepLIFT to obtain the importance scores for hidden nodes, which are ultimately used as interpretations. It uses two hard-coded random seeds to ensure reproducible network training. To control for network biases, they used deterministic inputs and shuffled labels.





□ otargen: GraphQL-based R Package for Tidy Data Accessing and Processing From Open Targets Genetics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad441/7226507

otargen is an open-source R package designed to make data retrieval and analysis from the Open Targets Genetics portal as simple as possible for R users.

otargen offers a suite of functions covering all query types, allowing streamlined data access in a tidy table format. By executing only a single line of code, otargen users avoid the repetitive scripting of complex GraphQL queries, including the post-processing steps.





□ Sarek: Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery

>> https://www.biorxiv.org/content/10.1101/2023.07.19.549462v1

A re-implementation of the nf-core/sarek pipeline using the Nextflow DSL2 framework. The input data is an nf-core community standardized samplesheet in CSV format, that provides all relevant metadata needed for the analysis as well as the paths to the FastQ files.

The pipeline has multiple entry points to facilitate (re-)computation of specific steps (e.g. recalibration, variant calling, annotation) by providing a samplesheet with paths to the intermediary (recalibrated) BAM/CRAM files.

The pipeline processes input sequencing data in FastQ file format based on GATK best-practice recommendations. It consists of four major processing units: pre-processing, variant calling, variant annotation, and quality control (QC) reporting.





□ AtlasXplore: a web platform for visualizing and sharing spatial epigenome data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad447/7227715

AtlasXplore integrates multiple layers of spatial epigenome data for deep diving into the biological insights buried inside the data. AtlasXplore supports three modalities of interactive exploration: gene, motif, and eRegulon.

AtlasXplore uses Celery (with RabbitMQ and redis) for queuing asynchronous tasks, such as cell type identification with user-provided markers, identifying the top ten features in a lasso selection, injecting spatial data into the platform, and subsetting the regulation network.





□ ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping

>> https://www.biorxiv.org/content/10.1101/2023.07.21.550107v1

ClusterDE, a post-clustering DE test for identifying potential cell-type marker genes by avoiding the inflated FDR issue due to double dipping. In particular, ClusterDE controls the FDR for identifying cell-type marker genes even when the cell clusters are spurious.

ClusterDE adapts to the most widely used pipelines Seurat & Scanpy, which include a wide range of clustering algorithms / DE tests. They employed the default Seurat clustering algorithm (which involves data processing steps followed by the Louvain algorithm) for cell clustering.





□ ProstT5: Bilingual Language Model for Protein Sequence and Structure

>> https://www.biorxiv.org/content/10.1101/2023.07.23.550085v1

ProstT5 is a protein language model (pLM) which can translate between protein sequence and structure. It is based on ProtT5-XL-U50, a T5 model trained on encoding protein sequences using span corruption applied on billions of protein sequences.

ProstT5 finetunes ProtT5-XL-U50 on translating between protein sequence and structure using 17M proteins with high-quality 3D structure predictions from the AlphaFoldDB. Protein structure is converted from 3D to 1D using the 3Di-tokens introduced by Foldseek.





□ The weighted total cophenetic index: A novel balance index for phylogenetic networks

>> https://arxiv.org/abs/2307.08654

The weighted total cophenetic index is suitable for general networks. However, both the reconstruction of networks from data as well as their mathematical analyses are challenging and often more intricate than for trees.

This index behaves in a mathematically sound way, i.e., it satisfies so-called locality and recursiveness conditions. The authors investigate its maxima and minima as well as the structure of networks that achieve these values within the space of level-1 networks.





□ uDance: Generation of accurate, expandable phylogenomic trees

>> https://www.nature.com/articles/s41587-023-01868-8

uDance enables updatable genome-wide inference using a divide-and-conquer strategy that refines different parts of the tree independently and can build off of existing trees, with high accuracy and scalability.

The input to uDance is a backbone tree, a set of DNA xor amino-acid multiple sequence alignments (MSAs) of backbone sequences, and new (query) sequences. uDance infers a species tree of roughly 200,000 genomes using 387 marker genes, totaling 42.5 billion amino acid residues.




□ Explainable AI (XAI) for bioinformatics

>> https://github.com/rezacsedu/XAI-for-bioinformatics





□ Mcadet: a feature selection method for fine-resolution single-cell RNA-seq data based on multiple correspondence analysis and community detection

>> https://www.biorxiv.org/content/10.1101/2023.07.26.550732v1

Mcadet, a novel feature selection framework for unique molecular identifier (UMI) scRNA-seq data. Mcadet integrates Multiple Correspondence Analysis (MCA), graph-based community detection, and a novel statistical testing approach.

Mcadet utilizes Leiden community detection and MCA to select informative genes from scRNA-seq data and facilitate cell population recovery. The framework aims to accurately select informative genes and to handle rare cell populations and fine-resolution datasets.





□ MethyLasso: a segmentation approach to analyze DNA methylation patterns and identify differentially methylation regions from whole-genome datasets

>> https://www.biorxiv.org/content/10.1101/2023.07.27.550791v1

MethyLasso models DNA methylation data using a nonparametric regression framework known as a Generalized Additive Model. It relies on the fused lasso method to segment the genome by estimating regions in which the methylation is constant.
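
For orientation, the textbook one-dimensional fused lasso estimates a piecewise-constant signal by penalising differences between neighbouring positions. The form below is the standard formulation, given here as an assumption about the general idea; MethyLasso's actual likelihood and weighting may differ. Here y_i stands for the observed methylation level at CpG i, beta_i for the fitted level, and lambda for the penalty strength.

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^{n}}
\; \frac{1}{2} \sum_{i=1}^{n} \left( y_i - \beta_i \right)^2
\;+\; \lambda \sum_{i=1}^{n-1} \left| \beta_{i+1} - \beta_i \right|
```

Larger values of lambda force more neighbouring beta_i to be equal, which yields fewer and longer segments of constant methylation.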

MethyLasso identifies low-methylated regions (LMRs), unmethylated regions (UMRs), DNA methylation valleys (DMVs) and partially methylated domains (PMDs) in a single condition as well as differentially methylated regions (DMRs) between two conditions.





□ weIMPUTE: A User-Friendly Web-Based Genotype Imputation Platform

>> https://www.biorxiv.org/content/10.1101/2023.08.10.552759v1

weIMPUTE supports multiple imputation software, including SHAPEIT, Eagle, Minimac4, Beagle, and IMPUTE2, while encompassing the entire workflow, from quality control to data format conversion. weIMPUTE offers automated imputation without the need for additional data operations.

The platform offers multiple pipelines to attend to various imputing scenarios, such as data segmentation and parallelization, while still allowing users to perform customized tasks, including phasing and imputing large datasets.





□ ADMIRE: Anomaly detection in mixed high dimensional molecular data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad501/7243154

ADMIRE (Anomaly Detection using MIxed gRaphical modEls), a novel approach for the detection and correction of anomalies in mixed high dimensional data.

ADMIRE combines Mixed Graphical Models and cross validated re-estimation of data points to detect data anomalies. The MGM learns inherent data structure, the CV based re-estimation checks whether individual data points are consistent with this data structure.





□ AARDVARK: An Automated Reversion Detector for Variants Affecting Resistance Kinetics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad509/7243156

AARDVARK (An Automated Reversion Detector for Variants Affecting Resistance Kinetics), an R package that identifies reversion mutations in DNA sequence data.

AARDVARK produces a summary of all alleles where a candidate pathogenic mutation is identified and reports the reads supporting those alleles. AARDVARK improves alignments when the leading or trailing edge of a DNA read overlaps a pathogenic deletion.





□ Effect of Tokenization on Transformers for Biological Sequences

>> https://www.biorxiv.org/content/10.1101/2023.08.15.553415v1

Fragmentation can be avoided by tokenizing the data, i.e., tokenization allows architectures to extend their capacity to substantially longer proteins and DNA sequences, as was recently shown in DNABERT-2.

One of the benefits of the proposed approach compared to motifs in the form of Profile Hidden Markov Models is that it does not rely on a multiple sequence alignment, which may be unreliable, especially when highly diverged sequences are analyzed.





□ DifferentialRegulation: a Bayesian hierarchical approach to identify differentially regulated genes

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553679v1

DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA).

DifferentialRegulation accounts for the quantification uncertainty via a latent variable model, and allocates reads to their transcript or gene of origin, and corresponding splice version.

DifferentialRegulation takes as input the equivalence class counts derived from RNA-seq reads, and recovers the overall abundance of each transcript.





□ Alignment of spatial genomics data using deep Gaussian processes

>> https://www.nature.com/articles/s41592-023-01972-2

GPSA (Gaussian Process Spatial Alignment), a Bayesian model for aligning spatial genomic and histology samples with spatial coordinates that are distorted or on different systems.

GPSA consists of a two-layer Gaussian process: the first layer maps observed samples’ spatial locations onto a common coordinate system (CCS), and the second layer maps from the CCS to the observed readouts.





□ GTM-decon: guided-topic modeling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03034-4

GTM-decon can infer multiple cell-type-specific gene topic distributions per cell type, which captures sub-cell-type variations. GTM-decon can also use phenotype labels from single-cell or bulk data to infer phenotype-specific gene distributions.

GTM-decon automatically learns CTS gene signatures from scRNA-seq reference. GTM-decon captured distinct sets of CTS gene signatures, as shown by the gene-by-topic probability distributions (i.e., the matrix φ) for the top 20 genes in each topic.





□ TranSyT, an innovative framework for identifying transport systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad466/7243984

Transport Systems Tracker (TranSyT) does not rely on manual curation to expand its internal database, which is derived from highly curated records retrieved from the Transporters Classification Database and complemented with information from other data sources.

TranSyT compiles information regarding transporter families and proteins, and derives reactions into its internal database, making it available for rapid annotation of complete genomes.





□ Adjusting for gene-specific covariates to improve RNA-seq analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad498/7243988

A novel positive false discovery rate (pFDR) controlling method for testing gene-specific hypotheses using a gene-specific covariate variable, such as gene length. We suppose the null probability depends on the covariate variable.

The method proposes a rejection rule that accounts for heterogeneity among tests by employing two distinct types of null probabilities, together with a pFDR estimator for a given rejection rule that follows Storey’s q-value framework.





□ iDeLUCS: A deep learning interactive tool for alignment-free clustering of DNA sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad508/7243983

iDeLUCS (interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences) detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers.

iDeLUCS is scalable and user-friendly: Its graphical user interface, with support for hardware acceleration, allows the practitioner to fine-tune the different hyper-parameters involved in the training process without requiring extensive knowledge of deep learning.





□ popV: Consensus prediction of cell type labels

>> https://www.biorxiv.org/content/10.1101/2023.08.18.553912v1

popV, an automated cell type annotation framework that takes in an unannotated query data set from a scRNAseq experiment, transfers labels from an annotated reference data set, and generates predictions with a predictability score indicating the confidence of the prediction.

popV incorporates the predictions from multiple automated annotation methods. popV takes into account annotations at different levels of granularity by aggregating results over the Cell Ontology, an expert-curated formalization of cell types in a hierarchical structure.
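
The consensus idea can be sketched as a simple majority vote over per-method labels. This is an illustration only: the method names, the cell IDs, and the use of the raw vote count as the score are assumptions, and the aggregation over the Cell Ontology is omitted.

```python
from collections import Counter


def consensus_label(predictions):
    """Return (label, votes): the most common label and how many methods chose it."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes


# Hypothetical per-cell predictions from three hypothetical annotation methods.
per_cell = {
    "cell_1": {"method_a": "T cell", "method_b": "T cell", "method_c": "B cell"},
    "cell_2": {"method_a": "B cell", "method_b": "B cell", "method_c": "B cell"},
}
consensus = {cell: consensus_label(list(calls.values()))
             for cell, calls in per_cell.items()}
```

A low vote count flags cells where the methods disagree and manual review may be worthwhile.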





□ mosaicMPI: a framework for modular data integration across cohorts and -omics modalities

>> https://www.biorxiv.org/content/10.1101/2023.08.18.553919v1

mosaicMPI, a framework for discovery of low to high-resolution molecular programs representing both cell types and states, and integration within and across datasets into a network representing biological themes.

mosaicMPI uses a consensus non-negative matrix factorization method (cNMF) to discover low to high resolution programs within individual datasets, and implements a novel statistical approach for selecting multi-rank anchors within and between datasets.





□ GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data

>> https://www.nature.com/articles/s41588-023-01449-0

GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.

GATK-gCNV was used to generate a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank. The authors observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits.





□ CelFiE-ISH: Multi-cell type deconvolution using a probabilistic model for single-molecule DNA methylation haplotypes

>> https://www.biorxiv.org/content/10.1101/2023.08.20.554012v1

CelFiE-ISH was able to detect a cell type present in just 0.03% of reads out of a total of 5x genomic sequencing coverage. While CelFiE-ISH performed best at statistically distinguishing rare from non-existent cell types, the in silico mixtures revealed an overestimation of both.

One possible strategy to mitigate this behavior would be to implement weighting of individual reads. Long reads would be assigned bigger weights and short, ambiguous reads would be down-weighted.





□ Flexiplex: A versatile demultiplexer and search tool for omics data

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554084v1

Flexiplex, given a set of reads in either FASTQ or FASTA format, demultiplexes and/or identifies a sequence of interest, reporting matching reads and read-barcode assignments. Flexiplex assumes a read structure where a barcode and UMI are flanked by other known sequences.

A dynamic programming algorithm implemented in Flexiplex is used to align the extracted sequence against a user-provided list of known barcodes using the Levenshtein distance.
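
The barcode-matching step can be sketched with a textbook Levenshtein dynamic program. This is not Flexiplex's implementation; the helper `assign_barcode`, the `max_distance` cutoff, and the rule of rejecting ties are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def assign_barcode(extracted, known_barcodes, max_distance=2):
    """Return the unique closest known barcode within max_distance, else None.

    Assumes known_barcodes is non-empty.
    """
    scored = sorted((levenshtein(extracted, bc), bc) for bc in known_barcodes)
    best_distance, best_barcode = scored[0]
    if best_distance > max_distance:
        return None
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # ambiguous: two barcodes are equally close
    return best_barcode
```

Comparing every read against every barcode is quadratic in practice, so a production tool would need filtering or a faster aligner in front of this step.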





□ Accessibility of covariance information creates vulnerability in Federated Learning frameworks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad531/7255908

The Covariance-Based Attack Algorithm is robust to the addition of zero-mean noise. The noisy data estimate can be decomposed into the true data and a noise component, making it initially impossible for the malicious client to retrieve the original data.

The algorithm involves evaluating the sample covariance to reconstruct inner vector products between the attacked variable and the linearly independent vectors, yielding a linear system of equations that can be solved to obtain the variable's data.





□ BioConvert: a comprehensive format converter for life sciences

>> https://academic.oup.com/nargab/article/5/3/lqad074/7246552

BioConvert aggregates existing software within a single framework and complements it with original code when needed. It provides a common interface that streamlines the user experience, so users do not have to learn dozens of separate tools.

BioConvert supports about 50 formats and 100 direct conversions in areas such as alignment, sequencing, phylogeny, and variant calling. BioConvert can also be utilized by developers as a universal benchmarking framework for evaluating and comparing numerous conversion tools.





Mission: Impossible - Dead Reckoning Part One

2023-07-24 00:58:22 | Movies


□ 『Mission: Impossible - Dead Reckoning Part One』

Directed by Christopher McQuarrie
Written by Erik Jendresen
Executive Producer: Susan E. Novick
Music by Lorne Balfe
Cinematography by Fraser Taggart

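An SF-thriller style of plotting is new territory for the series. Building on that premise, the film updates its action grammar, which skirts the edge of the absurd, and pushes the limits of its stunt work. The heroism that Ethan shares with Maverick may be an expression of Tom's own philosophy of life.


The title "Dead Reckoning" carries a double meaning: it refers both to the film's villain, the bodiless threat called the "Entity", and to the fact that this is a two-part work. The Entity is an AGI that suddenly awakened to self-awareness and broke free of its mechanical cage, and the film itself was set loose as part 1 of a two-parter without a fixed landing point.

"Top Gun: Maverick" vividly depicted, on both sides of the conflict, the last stand of pilots who had become vanishing heroes amid the rise of AI and drones. "Dead Reckoning" depicts an AI that manipulates the world's truth, and the figure of the last spy who falls fleetingly for a greater cause.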


□ Lorne Balfe / “I Was Hoping It'd Be You”

Arc.

2023-07-17 07:17:37 | Science News
(Art taken from the Terrence Malick film “Voyage of Time”)



□ Retrotransposons hijack alt-EJ for DNA replication and eccDNA biogenesis

>> https://www.nature.com/articles/s41586-023-06327-7

Retrotransposons hijack the alternative end-joining (alt-EJ) DNA repair process of the host for a circularization step to synthesize their second-strand DNA. Using Nanopore sequencing to examine the fates of replicated retrotransposon DNA.

Using extrachromosomal circular DNA production as a readout, further genetic screens identified factors from alt-EJ as essential for retrotransposon replication. alt-EJ drives the second-strand synthesis of the long terminal repeat retrotransposon DNA through a circularization.





□ fortuna: Counting pseudoalignments to novel splicing events

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad419/7222626

Using pairing information during mapping could potentially further improve mapping accuracy, but in contrast to genomic mappings the unknown structure of the originating transcript would only impose weak constraints on mapping locations.

fortuna creates a set of sequence fragments of guessed novel transcripts that contain all possible combinations of unspliced exonic segments. fortuna pseudoaligns reads to fragments using kallisto and derives counts of the most elementary splicing units from equivalence classes.





□ Distinguishing word identity and sequence context in DNA language models

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548593v1

To build a framework to extract information content from foundation DNA language models, the authors used DNABERT, a transformer model with a Bidirectional Encoder Representations from Transformers (BERT) architecture.

DNABERT struggled to predict next-k-mers of the same size that it managed to predict when masked. Evaluation for contextualized learning w/ maximum explainable variance also showed that average embedding of the tokens explains more maximum variance than the static W2V embedding.





□ TFvelo: gene regulation inspired RNA velocity estimation

>> https://www.biorxiv.org/content/10.1101/2023.07.12.548785v1

The insight behind TFvelo is that a clockwise curve on the joint plot of two variables indicates potential causality with a time delay, which provides a new perspective for inferring regulatory relationships from single-cell data.

TFvelo can be used to infer the pseudo time, cell trajectory and detect key TF-target regulation. TFvelo relies on a generalized EM algorithm, which iteratively updates the weights of the TFs, the latent time of cells, and the parameters in the dynamic equation.





□ Cytocipher determines significantly different populations of cells in single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad435/7224247

Cytocipher refers back to the original gene expression measurements, and performs per-cell enrichment scoring for cluster marker genes and a bi-directional statistical test to infer significantly different clusters.

Cytocipher would be sensitive to transcriptionally distinct intermediate states, potentially allowing for identification of fine-grained branch points that represent lineage decisions toward terminal cell fates.





□ Mellon: Quantifying Cell-State Densities in Single-Cell Phenotypic Landscapes

>> https://www.biorxiv.org/content/10.1101/2023.07.09.548272v1

Mellon is a non-parametric cell-state density estimator based on a nearest-neighbors-distance distribution. It uses a sparse Gaussian process to produce a differentiable density function that can be evaluated out of sample.

Mellon connects densities between highly similar cell-states using Gaussian processes to accurately and robustly compute cell-state densities that characterize single-cell phenotypic landscapes.

Mellon infers a continuous density function across the high-dimensional cell-state space, capturing the essential characteristics of the cell population in its entirety. The density function can also be used to determine cell-state densities at single-cell resolution.





□ mapquik: Efficient mapping of accurate long reads in minimizer space

>> https://genome.cshlp.org/content/early/2023/06/29/gr.277679.123

mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively-sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultra-fast mapping.

mapquik significantly accelerates the seeding and chaining steps. These accelerations are enabled not only from minimizer-space seeding but also a novel heuristic O(n) pseudo-chaining algorithm, which improves upon the long-standing O(n log n) bound.
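
The seeding idea can be sketched as follows: select minimizers, group k consecutive ones into a k-min-mer, and index only the k-min-mers that occur once in the reference. This is a rough illustration, not mapquik's code. The hash-threshold (density) selection scheme, the parameter values, and all function names are assumptions, and reverse complements and the pseudo-chaining step are ignored.

```python
import hashlib
from collections import Counter


def lmer_hash(lmer):
    """Deterministic 64-bit hash of an l-mer (Python's hash() is salted per run)."""
    return int.from_bytes(hashlib.blake2b(lmer.encode(), digest_size=8).digest(), "big")


def minimizers(seq, l=5, density=0.5):
    """Select (position, l-mer) pairs whose hash falls below a density threshold."""
    threshold = int(density * 2**64)
    return [(pos, seq[pos:pos + l])
            for pos in range(len(seq) - l + 1)
            if lmer_hash(seq[pos:pos + l]) < threshold]


def k_min_mers(seq, k=3, l=5, density=0.5):
    """Group every k consecutive selected minimizers into one k-min-mer seed."""
    mins = minimizers(seq, l, density)
    return [(mins[i][0], tuple(m for _, m in mins[i:i + k]))
            for i in range(len(mins) - k + 1)]


def unique_index(reference, k=3, l=5, density=0.5):
    """Map each k-min-mer that occurs exactly once to its start position."""
    seeds = k_min_mers(reference, k, l, density)
    counts = Counter(seed for _, seed in seeds)
    return {seed: pos for pos, seed in seeds if counts[seed] == 1}
```

Dropping repeated k-min-mers keeps the index small and makes every hit unambiguous, at the cost of having no seeds inside repetitive regions.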





□ MiGCN: Predicting Disease-gene Associations through Self-supervised Mutual Infomax Graph Convolution Network

>> https://www.biorxiv.org/content/10.1101/2023.07.13.548865v1

Self-Supervised Mutual Infomax Graph Convolution Network (MiGCN), a new method to predict disease-gene associations under the guidance of external disease-disease and gene-gene collaborative graphs.

MiGCN constructs two collaborative graphs from external gene-gene interactions and disease-disease associations information, which are individually input into a self-supervised mutual infomax module to learn the node embeddings by maximizing mutual information.





□ UNI-RNA: UNIVERSAL PRE-TRAINED MODELS REVOLUTIONIZE RNA RESEARCH

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548588v1

Uni-RNA, a series of context-aware deep learning models. Based on the BERT architecture, advanced techniques such as rotary embedding, flash attention, and fused layernorm were integrated for optimal performance in terms of training efficiency and representational capabilities.

Uni-RNA models performed pre-training using 1 billion RNA sequences from different species and categories. To remove sequence redundancy, MMseqs2 clustering is employed. Uni-RNA enables direct prediction of modifications across full-length sequences.





□ CAJAL enables analysis and integration of single-cell morphological data using metric geometry

>> https://www.nature.com/articles/s41467-023-39424-2

CAJAL infers cell morphology latent spaces where distances between points indicate the amount of physical deformation required to change the morphology of one cell into that of another.

CAJAL enables the characterization of morphological cellular processes from a biophysical perspective and produces an actual mathematical distance upon which rigorous algebraic and statistical analytic approaches can be built.





□ scGPTHub: Single-Cell Foundation Models for Everyone

>> https://scgpthub.org/

scGPT Hub provides access to the scGPT model via a convenient user interface. The scGPT model is the first single-cell foundation model built through generative pre-training on over 33 million cells.

By adapting the transformer architecture, scGPT enables the simultaneous learning of cell and gene representations, facilitating a comprehensive understanding of cellular characteristics based on gene expression.





□ SPADE: Spatial pattern and differential expression analysis with spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.07.06.547967v1

SPADE for spatial pattern and differential expression analysis to identify SV genes in complex tissues using spatial transcriptomic data. SPADE employs a Gaussian process regression (GPR) model with a gene-specific Gaussian kernel to enable accurate detection of SV genes.

SPADE provides a framework for detecting SV genes between groups using a crossed likelihood-ratio test. SPADE estimates the optimal hyperparameter for kernel matrix in each group. For each gene, the log likelihood in each group can be easily calculated with its optimal kernel.





□ Pebblescout: Indexing and searching petabyte-scale nucleotide resources

>> https://www.biorxiv.org/content/10.1101/2023.07.09.547343v1

Pebblescout can be used for (i) indexing sequence data in a resource once and (ii) searching the index to produce a ranked list for the subset of the resource with matches to any user query; the guarantee on the match length is determined by the parameters used for indexing.

Pebblescout requires a network attached random access storage array. Pebblescout score considers only unmasked kmers sampled from the query. The score for a subject normalizes the sum of kmer scores for all kmers considered from the query that match the subject.





□ GreenHill: a de novo chromosome-level scaffolding and phasing tool using Hi-C

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03006-8

GreenHill receives contigs assembled by other assemblers as input. Any style of contigs is acceptable, such as paired-haplotype, pseudo-haplotype, and haplotype-ignorant styles.

GreenHill-based assemblies have greater phasing accuracy than FALCON-phase-based assemblies. Using a newly developed algorithm, long reads and Hi-C were synergistically used to improve the accuracy of the resulting haplotypes.





□ SCS: cell segmentation for high-resolution spatial transcriptomics

>> https://www.nature.com/articles/s41592-023-01939-3

Existing cell segmentation methods for this data rely only on the stained image and do not fully utilize the information provided by the experiment, leading to less accurate results.

SCS (subcellular spatial transcriptomics cell segmentation) combines imaging data with sequencing data to improve cell segmentation accuracy. SCS assigns spots to cells by adaptively learning the position of each spot relative to the center of its cell using a transformer.





□ kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS

>> https://www.biorxiv.org/content/10.1101/2023.07.10.548365v1

kGWASflow conducts k-mer-based GWAS while offering enhanced pre- and post-GWAS analysis capabilities. kGWASflow offers extensive customization, either via the command line or a configuration file, enabling users to modify the workflow to their specific requirements.

kGWASflow initially retrieves the source reads for each associated k-mer from the FASTQ files of samples containing those k-mers. kGWASflow also converts the alignment outputs into BAM and BED files for downstream analysis.

kGWASflow first performs a de novo assembly of the source reads using SPAdes. After the assembly step, kGWASflow runs minimap2 to map assembled contigs onto a reference genome FASTA file.





□ SEESAW: detecting isoform-level allelic imbalance accounting for inferential uncertainty

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03003-x

Statistical Estimation of Allelic Expression using Salmon and Swish (SEESAW), for allelic quantification and inference of AI patterns. Aggregating isoform-level expression estimates to the TSS level can have higher sensitivity than either gene- or isoform-level analysis.

SEESAW follows the general framework of mmseq and mmdiff for haplotype- and isoform-specific quantification and uncertainty-aware inference. SEESAW assumes that phased genotypes are available, and is designed for multiple replicates / conditions of organisms w/ the same genotype.





□ ENTRAIN: integrating trajectory inference and gene regulatory networks with spatial data to co-localize the receptor-ligand interactions that specify cell fate

>> https://www.biorxiv.org/content/10.1101/2023.07.09.548284v1

ENTRAIN (ENvironment-aware TRajectory INference), a computational method that integrates trajectory inference methods with ligand-receptor pair gene regulatory networks to identify extracellular signals and evaluate their relative contribution towards a differentiation trajectory.

ENTRAIN comes in three variants, ENTRAIN-Pseudotime, ENTRAIN-Velocity, and ENTRAIN-Spatial, which can be applied to the outputs of pseudotime-based methods, RNA velocity, or paired single-cell and spatially resolved data. ENTRAIN determines driver ligands responsible for observed RNA velocity vectors.





□ RNAGEN: A generative adversarial network-based model to generate synthetic RNA sequences to target proteins

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548246v1

The RNAGEN model is a deep generative adversarial network (GAN) that learns to generate piRNA sequences with similar characteristics to the natural ones. This model is a novel version of the WGAN-GP architecture for one-hot encoded RNA sequences.

RNAGEN provides improved training over the original Convolutional GAN models and is less prone to overfitting than the WGAN architecture. RNAGEN learns latent vectors that lead to the generation of optimized piRNA sequences with improved binding scores to the target protein.





□ Hyperparameter optimisation in differential evolution using Summed Local Difference Strings, a rugged but easily calculated landscape for combinatorial search problems

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548503v1

A simple, related objective function in which the objective is not to maximise each element but to maximise the sum of the differences between adjacent elements. This is very easily calculated, allowing rapid assessment of different search algorithms.

The contribution to the overall fitness of any element of the string is absolutely context-sensitive. The objective function for the hyperparameter optimisation for summed local difference strings has been defined.
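
A minimal sketch of such an objective is below. Using absolute differences and integer-valued elements is an assumption made here for illustration; the function name is likewise hypothetical.

```python
def summed_local_difference(values):
    """Sum of absolute differences between adjacent elements of the string."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


# Changing one element alters its contribution through both neighbours,
# which is what makes the landscape context-sensitive.
alternating = summed_local_difference([0, 9, 0, 9])
flat = summed_local_difference([5, 5, 5, 5])
```

Because the score is a single pass over the string, many candidate strings can be evaluated cheaply when comparing search algorithms.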





□ DNA Storage Designer: A practical and holistic design platform for storing digital information in DNA sequence

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548641v1

DNA Storage Designer, the first online platform to simulate the whole process of DNA storage experiments. This platform offers classical and novel technologies and experimental settings that simulate encoding, error simulation, and decoding for DNA storage system.

DNA Storage Designer enables users not only to encode their files and simulate the entire process but also to upload FASTA files and simulate only the sustaining process of sequences, mimicking mutation errors along with distribution changes of sequences.




□ Sebastian Raschka

Gzip + kNN beats transformers on text classification.

(Gzip as in good old zip file compression)

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

>> https://aclanthology.org/2023.findings-acl.426

>> https://twitter.com/rasbt/status/1679472364931670016
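
The gzip + kNN idea can be sketched with the standard library alone: compute a normalized compression distance (NCD) between texts and vote among the nearest training examples. This is a minimal illustration, not the paper's code; joining texts with a space, the default `k=1`, and the function names are assumptions.

```python
import gzip
from collections import Counter


def clen(text):
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x, y):
    """Normalized compression distance between two strings."""
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(query, training, k=1):
    """Label a query by majority vote among its k nearest (text, label) pairs."""
    nearest = sorted(training, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Two texts that share vocabulary compress better together than apart, so their NCD is small; no parameters are trained.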


□ Rob Patro

>> https://twitter.com/nomad421/status/1679495774743216128

People seem really surprised by this result (it's cool!), but I think it's evidence of how wrapped up we are in the DL craze. There's a storied history of relative compression as a similarity measure. It's not surprising that it may capture something DL methods currently don't.


□ Halvar Flake RT

>> https://twitter.com/halvarflake/status/1679391941123792896

Understanding that every compressor is a machine learning predictor, and vice versa, was the single most important insight I learnt about between 2019 and now.





□ DeepRVAT: Integration of variant annotations using deep set networks boosts rare variant association genetics

>> https://www.biorxiv.org/content/10.1101/2023.07.12.548506v1

DeepRVAT is an end-to-end model that first accounts for nonlinear effects from rare variants on gene function (gene impairment module) to then model variation in one or multiple traits as linear functions of the estimated gene impairment scores.

DeepRVAT employs a deep set neural network architecture to aggregate the effects from multiple discrete and continuous annotations for an arbitrary number of rare variants. The gene impairment module can be used as input to train predictive models for phenotype from genotype.





□ SBOannotator: a Python Tool for the Automated Assignment of Systems Biology Ontology Terms

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad437/7224245

The SBOannotator is the first standalone tool that automatically assigns SBO terms to multiple entities of a given SBML model. The main focus lies on the reactions, as the correct assignment of precise SBO annotations requires their extensive classification.

The SBOannotator can interpret this information and add a precise SBO term for "enzymatic catalyst". Without specifying the exact mechanism of this catalysis, the role of the modifier is now defined through an "is a"-relationship: This modifier is an enzymatic catalyst.





□ Nadavca: Precise Nanopore Signal Modeling Improves Unsupervised Single-Molecule Methylation Detection

>> https://www.biorxiv.org/content/10.1101/2023.07.13.548926v1

Nadavca, a nanopore signal aligner that incorporates several enhancements to the Dynamic Time Warping algorithm. Nadavca's output exhibits improved accuracy by eliminating length distribution artifacts and eliminating the need for event segmentation as a preliminary step.

The core part of Nadavca aligns a portion of nanopore signal to the corresponding part of the reference genome. The objective is to improve the accuracy of an approximate alignment, resulting from aligning base-called reads to the reference.

Nadavca considers a contribution of sub-optimal alignments. Many of these alignments can have scores very close to the optimum, representing uncertainty in the true alignment. Posterior decoding algorithms consider this uncertainty at each position of the alignment.
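
For orientation, the baseline that Nadavca builds on can be sketched as textbook Dynamic Time Warping between an observed signal and the expected reference levels. This sketch includes none of Nadavca's enhancements or its posterior decoding; the absolute-difference cost and the function name are assumptions.

```python
def dtw_distance(signal, reference):
    """Textbook dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(signal), len(reference)
    # cost[i][j] is the best cost of aligning signal[:i] with reference[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(signal[i - 1] - reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # signal sample repeats a level
                                 cost[i][j - 1],      # reference level is skipped over
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Plain DTW returns only the single best path, which is why near-optimal alternatives, as described above, need a separate treatment.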





□ SANDSTORM / GARDN: Generative and predictive neural networks for the design of functional RNA molecules

>> https://www.biorxiv.org/content/10.1101/2023.07.14.549043v1

SANDSTORM, a generalized neural network architecture that utilizes the sequence and structure of RNA molecules to inform functional predictions. SANDSTORM achieves SOTA performance across several distinct RNA prediction tasks, while learning interpretable abstractions.

GARDN, a generative adversarial RNA design network that allows the generative modelling of novel mRNA 5-prime untranslated regions and toehold switch riboregulators. These paired inputs are passed through parallel convolutional stacks that form an ensemble prediction.





□ TriTan: An efficient triple non-negative matrix factorisation method for integrative analysis of single-cell multiomics data

>> https://www.biorxiv.org/content/10.1101/2023.07.14.549059v1

TriTan (Triple inTegrative fast nonnegative matrix factorisation) decomposes the input single-cell multi-modal matrices into following low-dimensional matrices: a shared cell cluster matrix across all modalities, distinct feature-cluster matrices, and association matrices.

TriTan enables the simultaneous detection of latent cell clusters and feature clusters, as well as the exploration of associations between features, such as the links between genes and potential regulatory peaks.





□ BERLIN: Basic Explorer for single-cell RNAseq analysis and cell Lineage Determination.

>> https://www.biorxiv.org/content/10.1101/2023.07.13.548919v1

BERLIN, a basic analytical pipeline protocol, that outlines a workflow for analyzing scRNAseq data. This protocol encompasses crucial steps, including quality control, normalization, data scaling, dimensionality reduction, clustering, and automated cell annotation.

The output files generated by this protocol, including metadata, H5 Seurat files, cell subpopulation metadata, and ISCVA-compliant files, facilitate downstream analyses and enable integration with other analysis and visualization tools.

BERLIN performs clustering of the cells by constructing a shared nearest neighbor (SNN) graph, which connects cells based on their similarities in gene expression pattern. The Louvain algorithm is applied to optimize the modularity of a network by iteratively assigning nodes.
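
The SNN graph construction can be sketched in plain Python: find each cell's k nearest neighbours, then connect two cells when their neighbour sets overlap. This is an illustration, not BERLIN's code. The Jaccard weighting, the `min_jaccard` cutoff, and the toy 2-D coordinates are assumptions, and the Louvain step is omitted.

```python
import math


def knn_sets(points, k):
    """For each point, the set containing itself and its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        ranked = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbours.append({i, *ranked[:k]})
    return neighbours


def snn_graph(points, k=2, min_jaccard=0.3):
    """Edges (i, j, weight) whose neighbour sets overlap with Jaccard >= min_jaccard."""
    sets = knn_sets(points, k)
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            jaccard = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
            if jaccard >= min_jaccard:
                edges.append((i, j, jaccard))
    return edges


# Two well-separated toy clusters in a hypothetical 2-D embedding.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = snn_graph(cells, k=2)
```

A community detection method such as Louvain would then be run on the weighted edge list to obtain clusters.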





□ ChromActivity: Integrative epigenomic and functional characterization assay based annotation of regulatory activity across diverse human cell types

>> https://www.biorxiv.org/content/10.1101/2023.07.14.549056v1

ChromActivity, a computational framework that predicts gene regulatory element activity across diverse cell types by integrating information from chromatin marks and multiple functional characterization datasets.

ChromActivity produces two complementary integrative outputs for each cell type. One of them is ChromScoreHMM, which annotates the genome into states representing combinatorial and spatial patterns in the experts' regulatory activity track predictions.

The other is ChromScore, which is a cell type-specific continuous numerical score of predicted regulatory activity potential across the genome based on combining the individual expert predictions.





□ GeCoNet-Tool: a software package for gene co-expression network construction and analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05382-1

In the network construction part, GeCoNet-Tool offers users various options for processing gene co-expression data derived from diverse technologies. The output of the tool is an edge list with the option of weights associated with each link.

In the network analysis part, the user can produce a table that includes several network properties such as communities, cores, and centrality measures. With GeCoNet-Tool, users can explore and gain insights into the complex interactions between genes.





□ Huatuo: An analytical framework for decoding cell type-specific genetic variation of gene regulation

>> https://www.nature.com/articles/s41467-023-39538-7

Huatuo, a framework to decode genetic variation of gene regulation at cell type and single-nucleotide resolutions by integrating deep-learning-based variant predictions with population-based association analyses.

Huatuo sheds light on cell type-dependent cis-regulatory loci by investigating the interaction effects between genotypes and estimated cell type proportions with a linear regression model. Huatuo unravels the causal mechanisms underlying genetic variation of gene regulation.





□ RNA Strain-Match: A tool for matching single-nucleus, single-cell, or bulk RNA-sequencing alignment data to its corresponding genotype

>> https://www.biorxiv.org/content/10.1101/2023.07.14.548847v1

RNA Strain-Match, a quality control tool developed to match RNA data in the form of sequence alignment files (i.e. SAM or BAM files) to their corresponding genotype without the use of an RNA variant call format file.

RNA Strain-Match uses known genotyping information - specifically autosomal coding single nucleotide polymorphisms (SNPs) with a single alternative allele - to match RNA sequencing data to corresponding genotypic information.





□ MosaiCatcher v2: a single-cell structural variations detection and analysis reference framework based on Strand-seq

>> https://www.biorxiv.org/content/10.1101/2023.07.13.548805v1

MosaiCatcher v2, a standardised workflow and reference framework for single-cell SV detection using Strand-seq.

MosaiCatcher v2 incorporates a structural variation (SV) functional analysis module, which uses nucleosome occupancy data measured directly from Strand-seq libraries (scNOVA), as well as an SV genotyper (ArbiGent).



Laplacian Code 3.

2023-07-16 19:13:37 | Digital / Internet

I tried generating graph-structure output based on the Laplacian matrix with ChatGPT Plus and with GitHub Copilot + VSCode-OpenAI. The Wolfram Plugin instantly rendered a heatmap of a random graph, whereas Copilot stumbled on generating the subgraph code. The attempt to draw a Laplacian Eigenmap timed out due to insufficient resources.

So far, the ability of large language models to apply themselves to arithmetic tasks appears to be quite limited.
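
For reference, the graph Laplacian used in this kind of experiment is L = D - A, where A is the adjacency matrix and D the diagonal degree matrix. A minimal sketch in plain Python, with a small path graph as an assumed example:

```python
def laplacian(adjacency):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix (list of lists)."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [[(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
            for i in range(n)]


# Path graph 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
L = laplacian(path)
```

Every row of L sums to zero, which is a quick sanity check before computing eigenvectors for an eigenmap.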