lens, align.

Long is the time, but the true comes to pass.

INFINITE.

2023-08-31 20:08:08 | Science News

(Made with Midjourney v5.2)




□ GEARS: Predicting transcriptional outcomes of novel multigene perturbations

>> https://www.nature.com/articles/s41587-023-01905-6

GEARS (graph-enhanced gene activation and repression simulator), a computational method that integrates deep learning with a knowledge graph of gene–gene relationships to simulate the effects of a genetic perturbation.

GEARS initializes a gene embedding vector and a gene perturbation embedding vector. GEARS optimizes model parameters to fit the predicted postperturbation gene expression to true postperturbation gene expression using stochastic gradient descent.





□ multiDGD: A versatile deep generative model for multi-omics data

>> https://www.biorxiv.org/content/10.1101/2023.08.23.554420v1

multiDGD is a generative model of transcriptomics and chromatin accessibility data. It consists of a decoder mapping shared representations of both modalities to data space, and learned distributions defining latent space.

multiDGD employs a Gaussian Mixture Model (GMM) as the distribution over latent space, which increases the ability of the latent distribution to capture clusters compared with the standard Gaussian used in typical VAEs.
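
A minimal sketch of why a mixture prior helps, assuming a one-dimensional latent space and hand-picked component parameters (none of these values come from multiDGD). It evaluates the log density of a GMM so it can be compared with a single standard Gaussian at a point that sits on a cluster centre.

```python
import math

def gmm_log_density(z, weights, means, stds):
    """Log density of a scalar z under a 1-D Gaussian mixture.

    weights, means and stds are illustrative inputs, not multiDGD parameters.
    """
    log_terms = []
    for w, mu, sd in zip(weights, means, stds):
        log_norm = -0.5 * math.log(2 * math.pi) - math.log(sd)
        log_terms.append(math.log(w) + log_norm - 0.5 * ((z - mu) / sd) ** 2)
    # log-sum-exp for numerical stability
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# A latent point on a cluster centre at z = 3
mixture = gmm_log_density(3.0, [0.5, 0.5], [-3.0, 3.0], [1.0, 1.0])
single = gmm_log_density(3.0, [1.0], [0.0], [1.0])
```

Here `mixture` is larger than `single`: a two-component mixture can place mass on both clusters, while one standard Gaussian has to treat both as tail events.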





□ Geniml: Genomic interval machine learning: Methods for evaluating unsupervised vector representations of genomic regions

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555137v1

No method exists for evaluating the quality of genomic region embeddings in the absence of metadata, which makes it difficult to assess the reliability of analyses based on the embeddings and to tune model training for optimal results.

To bridge this gap, they propose four evaluation metrics: the cluster tendency test (CTT), the reconstruction test (RCT), the genome distance scaling test (GDST), and the neighborhood preserving test (NPT).

The GDST and NPT exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings and a set of region embeddings.
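
One way to probe the idea behind the GDST, sketched under assumptions: regress embedding distance on genomic distance over same-chromosome region pairs and read off the slope. The function name, the use of region midpoints, and the plain least-squares fit are illustrative choices, not the Geniml definition.

```python
import math
from itertools import combinations

def distance_scaling_slope(regions, embeddings):
    """Least-squares slope of embedding distance vs genomic distance.

    regions: list of (chrom, start, end); embeddings: list of equal-length vectors.
    Only pairs on the same chromosome are compared.
    """
    xs, ys = [], []
    for i, j in combinations(range(len(regions)), 2):
        (c1, s1, e1), (c2, s2, e2) = regions[i], regions[j]
        if c1 != c2:
            continue
        xs.append(abs((s1 + e1) / 2 - (s2 + e2) / 2))
        ys.append(math.dist(embeddings[i], embeddings[j]))
    if len(xs) < 2:
        raise ValueError("need at least two same-chromosome pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all genomic distances are identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var_x
```

A positive slope would suggest that regions far apart on the genome also sit far apart in embedding space; a slope near zero would suggest the embedding has not captured that tendency.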






□ Aligned Diffusion Schrödinger Bridges

>> https://arxiv.org/abs/2302.11419

Diffusion Schrödinger bridges (DSB) have recently emerged as a powerful framework for recovering stochastic dynamics via their marginal observations at different time points.

SBALIGN is a novel algorithmic framework derived from the Schrödinger bridge theory and Doob's h-transform. SBALIGN recovers a stochastic trajectory from the unbound to the bound structure.





□ Genetics of circulating inflammatory proteins identifies drivers of immune-mediated disease risk and therapeutic targets

>> https://www.nature.com/articles/s41590-023-01588-w

pQTLs provide valuable insights into the molecular basis of complex traits and diseases by identifying proteins that lie between genotype and phenotype. Integration of pQTL data with eQTL and GWAS provided insight into pathogenesis, implicating lymphotoxin-α in multiple sclerosis.

Using Mendelian randomization (MR) to assess causality in disease etiology, they identified both shared and distinct effects of specific proteins across immune-mediated diseases. Two-sided P values are from meta-analysis of linear regression estimates.
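
For orientation, the simplest MR estimator for a single genetic instrument is the Wald ratio. The sketch below assumes summary statistics are already available; the summary above does not say which MR estimator the authors used, and the numbers are made up.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Wald ratio estimate of the causal effect of exposure on outcome.

    beta_exposure: variant effect on the exposure (e.g. a protein level)
    beta_outcome:  variant effect on the outcome (e.g. disease risk)
    se_outcome:    standard error of beta_outcome
    The standard error uses the common first-order approximation that
    ignores uncertainty in beta_exposure.
    """
    if beta_exposure == 0:
        raise ValueError("instrument has no effect on the exposure")
    estimate = beta_outcome / beta_exposure
    standard_error = se_outcome / abs(beta_exposure)
    return estimate, standard_error

# Hypothetical pQTL: 0.5 SD change in protein, 0.1 change in log-odds of disease
estimate, standard_error = wald_ratio(0.5, 0.1, 0.02)
```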





□ Σ-monoids: Categories of sets with infinite addition

>> https://arxiv.org/abs/2308.15183

Σ-monoids are sets with infinite addition. The most general Σ-monoid structure admits additive inverses and generalises partially commutative monoids. Every Hausdorff commutative monoid is an instance of a Σ-monoid, and the corresponding forgetful functor has a left adjoint.

Σ-monoids have well-defined tensor products, unlike topological abelian groups. Thus we may enrich categories over Σ-monoids, where composition respects addition of morphisms. This can be applied to categorical semantics of while loops for (quantum) computer programs.






□ Reverse Physics: Geometric and physical interpretation of the action principle

>> https://www.nature.com/articles/s41598-023-39145-y

Reverse Physics, an approach that examines current theories to find a set of starting physical assumptions that are sufficient to rederive them.

Hamiltonian and Lagrangian mechanics are equivalent to three assumptions: determinism/reversibility, independence of degrees of freedom, and kinematics/dynamics equivalence.





□ Totem: Cell-connectivity-guided trajectory inference from single-cell data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad515/7251030

Totem generates a large number of clustering results with a k-medoids algorithm (CLARA) and constructs a minimum spanning tree (MST) for each clustering. It treats these MSTs as estimates of the topology and uses them to measure the connectivity of the cells.

Totem smoothens the MSTs of the selected clustering results using the simultaneous principal curves algorithm of Slingshot to obtain directed trajectories that include pseudotime.
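
A minimal sketch of the MST step, assuming cluster medoids are given as coordinate tuples and using Prim's algorithm with Euclidean distance. This is an illustration of the technique, not Totem's implementation.

```python
import math

def minimum_spanning_tree(points):
    """Return MST edges (i, j) over points using Prim's algorithm.

    points: list of coordinate tuples, e.g. cluster medoids in a reduced space.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        # Cheapest edge from the current tree to any point outside it
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

# Four hypothetical medoids on a line
medoid_edges = minimum_spanning_tree([(0, 0), (1, 0), (5, 0), (2, 0)])
```

Running this over many clusterings gives one tree per clustering, which is the kind of collection the connectivity measure described above would be computed from.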





□ CASi: A multi-timepoint scRNAseq data analysis framework

>> https://www.biorxiv.org/content/10.1101/2023.08.16.553543v1

CASi provides a full analysis pipeline for scRNA-seq data from multi-timepoint designs, ultimately creating an informative profile of dynamic cellular changes.

CASi uses a neural network classifier to achieve cell annotation across time points, which avoids the overclustering issue. CASi uses the levels of similarity between the known cell types and new cells to identify potential novel cell types that may have appeared at later time points.





□ Sigmoni: classification of nanopore signal with a compressed pangenome index

>> https://www.biorxiv.org/content/10.1101/2023.08.15.553308v1

Sigmoni extends the r-index framework for read classification – first used in SPUMONI – to the problem of classifying raw nanopore electrical signal. Sigmoni uses an ultra-fast signal discretization method to project the current signal into a small alphabet for exact match querying with the r-index.

Sigmoni adapts the r-index classification framework to analysis of nanopore signal data using a combination of picoamp binning and a sampled document array structure for computing co-linearity statistics.

Sigmoni uses a novel classification method that accurately classifies reads using pseudo-matching lengths. By avoiding the complexities of the seed-chain-extend paradigm, Sigmoni's core algorithm consists only of a simple linear-time loop.
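
A rough sketch of picoamp binning, assuming a fixed current range split into equal-width bins that map to letters. The range (60 to 120 pA) and bin count are placeholder values, not Sigmoni's settings.

```python
def discretize_signal(signal_pa, low=60.0, high=120.0, n_bins=6):
    """Map current values (picoamps) to a small alphabet by equal-width binning.

    Values outside [low, high) are clamped to the first or last bin.
    low, high and n_bins are illustrative assumptions (n_bins <= 26).
    """
    width = (high - low) / n_bins
    symbols = []
    for value in signal_pa:
        index = int((value - low) // width)
        index = max(0, min(n_bins - 1, index))
        symbols.append(chr(ord("a") + index))
    return "".join(symbols)

# Hypothetical signal samples
text = discretize_signal([50, 60, 75, 119.9, 130])
```

Once the signal is a string over a small alphabet, it can be queried for exact matches against an index, which is the role the r-index plays in the description above.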





□ scNCL: transferring labels from scRNA-seq to scATAC-seq data with neighborhood contrastive regularization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad505/7243158

scNCL transforms scATAC-seq features into a gene activity matrix based on prior knowledge. Since feature transformation can cause information loss, scNCL introduces neighborhood contrastive learning to preserve the neighborhood structure of scATAC-seq cells in raw feature space.

scNCL uses a feature projection loss and an alignment loss to harmonize embeddings between scRNA-seq and scATAC-seq. scNCL not only realizes accurate and robust label transfer for common types, but also achieves reliable detection of novel types.





□ GEDI: A unified model for interpretable latent embedding of multi-sample, multi-condition single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.08.15.553327v1

GEDI (Gene Expression Decomposition and Integration), a generative model to identify latent space variations in multi-sample, multi-condition single cell datasets and attribute them to sample-level covariates.

GEDI can further project pathway and regulatory network activities onto the cellular state space, enabling the computation of the gradient fields of transcription factor activities and their association with the transcriptomic vector fields of sample covariates.





□ MarsGT: Multi-omics analysis for rare population inference using single-cell graph transformer

>> https://www.biorxiv.org/content/10.1101/2023.08.15.553454v1

MarsGT (Multi-omics analysis for rare population inference using single-cell Graph Transformer) employs a novel probability-based subgraph-sampling method that can highlight rare cell-related genes and peaks in a heterogeneous graph.

MarsGT calculates an entropy score to contrast the differences between the base and predicted cell clustering outcomes. The base cell clusters are ascertained by implementing the Louvain clustering method on the initial cell embeddings.





□ The unphysicality of Hilbert spaces

>> https://arxiv.org/abs/2308.06669

Hilbert spaces should not be considered the “correct” spaces to represent quantum states mathematically. The requirements posited by complex inner product spaces are proved to be physically justified; the problem lies in completeness.

Completeness in the infinite-dimensional case requires the inclusion of states with infinite expectations, coordinate transformations that take finite expectations to infinite ones and vice-versa, and time evolutions that transform finite expectations to infinite ones in finite time.





□ Internal Grothendieck construction for enriched categories

>> https://arxiv.org/abs/2308.14455

Fundamental constructions in algebra, geometry, and topology can be understood as categorical concepts defined by certain universal properties.

The cartesian product of sets, the kernel of a linear map between vector spaces, and the fiber over a point in a topological space are all instances of a universal construction called a limit. The internal Grothendieck construction is closely related to internal discrete fibrations.





□ DeepTRs: Deep Learning Enhanced Tandem Repeat Variation Identification via Multi-Modal Conversion of Nanopore Reads Alignment

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553659v1

DeepTRs, a novel method for identifying TR variations, which enables direct TR variation identification from raw Nanopore sequencing reads and achieves high sensitivity and completeness results through the multi-modal conversion of Nanopore reads alignment and deep learning.

DeepTR aligns the nanopore reads and transforms the alignment into a positional weighted matrix (PWM). Subsequently, DeepTR converts the PWM into transformed similarity matrices (TSM) using modal conversion, which serve as inputs for the DeepTR Predictor.





□ CeLEry: Leveraging spatial transcriptomics data to recover cell locations in single-cell RNA-seq

>> https://www.nature.com/articles/s41467-023-39895-3

CeLEry (Cell Location recovEry) uses a deep neural network to learn the relationships between gene expression and spatial locations by minimizing a loss function that is specified according to the specific problem.

CeLEry generates replicates of the ST data via a variational autoencoder. The generated embedding and the gene cluster embedding are concatenated, which is used as input for a CNN to decode the concatenated embedding into a 2D matrix with the same dimension as the GE input.





□ SCA: recovering single-cell heterogeneity through information-based dimensionality reduction

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02998-7

Surprisal Component Analysis leverages the notion of surprisal, whereby less probable events are more informative when they occur, to assign a surprisal score to each transcript. SCA enables dimensionality reduction that better preserves information from rare defined cell types.

SCA projects the input data to a linear subspace spanned by a set of basis vectors. SCA is highly efficient, requires no information aside from transcript counts, and generalizes to data comprised of discrete cell types or continuous trajectories.
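
The notion of surprisal itself is easy to state in code. The sketch below only shows the information-theoretic definition; how SCA derives the probability for each transcript is not covered here, and the probabilities used are made up.

```python
import math

def surprisal_bits(probability):
    """Surprisal (self-information) of an event, in bits: -log2(p)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# A rarer event carries more information than a common one
common = surprisal_bits(0.5)
rare = surprisal_bits(0.125)
```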





□ expiMap: Biologically informed deep learning to query gene programs in single-cell atlases

>> https://www.nature.com/articles/s41556-022-01072-x

expiMap learns to map cells into biologically understandable components representing known ‘gene programs’. The activity of each gene program (GP) in each cell is learned while the programs are simultaneously refined and de novo programs are learned.

The probabilistic representation learned by expiMap as a Bayesian model allows the performance of hypothesis testing on the integrated latent space of the query.





□ Data-driven discovery of oscillator models using SINDy: Towards the application on experimental data in biology

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554817v1

This study explores the limitations of the SINDy approach in the specific context of oscillatory systems. By directly applying SINDy to experimental data, the authors identify the main limiting aspects: data availability and quality, complexity of interactions, and dimensionality of systems.

SINDy struggles especially when the data resolution is low and the oscillatory behavior is characterized by strong time scale separation. When the variables forming the limit cycle are separated, SINDy identifies important dynamical features of the system from the phase space.





□ The phenotype-genotype reference map: Improving biobank data science through replication

>> https://www.cell.com/ajhg/fulltext/S0002-9297(23)00275-6

The GWAS catalog diseases and traits are annotated with the Experimental Factor Ontology (EFO). They attempted to annotate all EFO terms present in the filtered list of associations with a matching phecode.

The phenotype-genotype reference map (PGRM) is a set of 5,879 genetic associations from 523 GWAS publications. The use of phecodes in the PGRM ensures interoperability with international ICD standards and a familiar context for researchers who work with EHR-linked biobanks.





□ dRFEtools: Dynamic recursive feature elimination for omics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad513/7252233

Recursive feature elimination (RFE) is an iterative process that optimally removes one feature at a time. We can eliminate a substantial number of features; however, it can be difficult to balance computational time and model performance degradation.

dRFEtools implements dynamic RFE, reducing computational time while keeping high accuracy compared to standard RFE, expanding dynamic RFE to regression algorithms, and outputting the subsets of features that hold predictive power with and without peripheral genes.
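
To make the time saving concrete, here is a sketch that only counts elimination steps. It assumes the dynamic variant drops a fixed fraction of the remaining features per step; the function and the 20% figure are illustrative, not the dRFEtools API or its defaults.

```python
def elimination_schedule(n_features, keep, fraction=None):
    """Feature counts visited while eliminating down to `keep` features.

    fraction=None removes one feature per step (standard RFE).
    Otherwise a fraction of the remaining features is removed per step
    (an assumed form of dynamic RFE), with at least one removed.
    """
    counts = [n_features]
    current = n_features
    while current > keep:
        drop = 1 if fraction is None else max(1, int(current * fraction))
        current = max(keep, current - drop)
        counts.append(current)
    return counts

# Each step corresponds to one model refit
standard_steps = len(elimination_schedule(1000, 10)) - 1
dynamic_steps = len(elimination_schedule(1000, 10, fraction=0.2)) - 1
```

With 1000 features and a target of 10, the one-at-a-time schedule needs 990 refits, while the fractional schedule needs far fewer.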




□ sc-fGAIN: A novel f-divergence based generative adversarial imputation method for scRNA-seq data analysis

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555223v1

sc-fGAIN is a novel f-divergence based generative adversarial imputation method for scRNA-seq data. The values imputed by sc-fGAIN have a smaller root-mean-square error, the method is robust to varying missing rates, and it can reduce imputation bias.

Using the sc-fGAIN algorithm, they identified four f-divergence functions (cross-entropy, Kullback-Leibler, reverse KL, and Jensen-Shannon) that can be integrated with GAIN to generate imputed values without any assumptions, and they give a mathematical proof concerning the distribution of the imputed data.
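
For reference, three of the named divergences written out for discrete distributions. This is a textbook sketch with made-up inputs; it assumes q is positive wherever p is, and it says nothing about how sc-fGAIN plugs them into the GAIN objective.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(p, q):
    """Reverse KL divergence, i.e. KL(q || p)."""
    return kl(q, p)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

KL and reverse KL generally differ for the same pair of distributions, while Jensen-Shannon is symmetric, which is one reason the choice of divergence can change what a generator is penalised for.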





□ Hist2Vec: Kernel-Based Embeddings for Biological Sequence Classification

>> https://www.biorxiv.org/content/10.1101/2023.08.24.554699v1

Hist2Vec, a kernel-based embedding generation approach for capturing sequence similarities. Hist2Vec combines the concept of histogram-based kernel matrices and Gaussian kernel functions. It constructs histogram-based representations using the unique k-mers in the sequences.

Hist2Vec transforms the representations into high-dimensional feature spaces, preserving important sequence information. Hist2Vec employs kernel Principal Component Analysis (KPCA) to generate low-dimensional embeddings from the kernel matrix.
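
The first two ingredients can be sketched briefly: a k-mer count histogram over a fixed vocabulary, and a Gaussian kernel between two histograms. The function names, the vocabulary, and the bandwidth `sigma` are assumptions for illustration; the kernel PCA step is omitted.

```python
import math
from collections import Counter

def kmer_histogram(sequence, k, vocabulary):
    """Counts of each vocabulary k-mer in the sequence, in vocabulary order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

# Hypothetical 2-mer vocabulary and two short sequences
vocabulary = ["AC", "CG", "GT"]
h1 = kmer_histogram("ACGT", 2, vocabulary)
h2 = kmer_histogram("ACAC", 2, vocabulary)
similarity = gaussian_kernel(h1, h2)
```

Evaluating the kernel for every pair of sequences yields the kernel matrix that a method such as KPCA would then reduce to low-dimensional embeddings.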





□ Automappa: An interactive interface for metagenome-derived genome bins

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554826v1

Autometa is an automated workflow which aims to scale to the most complex communities that have been assembled. Therefore, Automappa was implemented to handle manual curation of MAGs at the scale of these complex datasets.

Automappa was designed to visualize, verify and refine genome binning results to aid curation of high-quality MAGs. It is composed of interactive and inter-connected tables and figures that support selection with real-time MAG quality updates.





□ SDePER: A hybrid machine learning and regression method for cell type deconvolution of spatial barcoding-based transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.08.24.554722v1

SDePER uses a machine learning approach to remove the systematic difference between ST and scRNA-seq data (platform effects) explicitly and efficiently to ensure the linear relationship between ST data and cell type-specific expression profile.

SDePER considers sparsity of cell types per capture spot and across-spots spatial correlation in cell type compositions. SDePER imputes cell type compositions and gene expression at unmeasured locations in a tissue map with enhanced resolution.





□ Using LLM Models and Explainable ML to Analyse Biomarkers at Single Cell Level for Improved Understanding of Diseases

>> https://www.biorxiv.org/content/10.1101/2023.08.24.554441v1

A novel approach that employs both an LLM-based framework and explainable machine learning to facilitate generalization across single-cell datasets and identify gene signatures to capture disease-driven transcriptional changes.

An approach that combines supervised learning and a large language model. This method, which involves fine-tuning scBERT and utilizing the QLattice, enhances cell type annotation and improves interpretability, generalizability, and scalability for scRNA-seq analysis.





□ VCFshiny: An R/Shiny application for interactively analyzing and visualizing genetic variants

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbad107/7252269

VCFshiny, an interactive R/Shiny application for analysing and visualizing VCF files. It allows non-bioinformatician researchers to upload VCF files to annotate and visualize detailed variant information without requiring any programming code.

VCFshiny accepts annotated VCF files for comparing and visualizing variants between different samples. VCFshiny offers two annotation methods, Annovar and VariantAnnotation, to add annotations such as genes or functional impact.





□ Examining dynamics of three-dimensional genome organization with multi-task matrix factorization

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554883v1

Tree-Guided Integrated Factorization (TGIF), a multi-task learning framework using Non-negative Matrix Factorization (NMF) to enable joint identification of organizational units such as compartments and TADs across multiple conditions.

TGIF recovers ground-truth differential TAD boundaries with higher precision in simulated data and is more robust to calling false positive boundary changes arising due to differences in depth.





□ AutoHiC: a deep-learning method for automatic and accurate chromosome-level genome assembly

>> https://www.biorxiv.org/content/10.1101/2023.08.27.555031v1

AutoHiC harnesses the power of deep learning and Hi-C to automate chromosome-level genome assembly and advance scaffold assembly. AutoHiC automates Hi-C assembly error correction, significantly improving genome assembly continuity and accuracy.

AutoHiC is based on the Swin Transformer architecture, which incorporates self-attention mechanisms. AutoHiC calculates the length of the inversion error based on the area of the peak on the interaction curve and then adjusts the sequence in that area in the opposite direction.
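
A small sketch of what correcting an inversion looks like at the sequence level, assuming that "adjusting in the opposite direction" means reverse-complementing the affected interval. The coordinates are hypothetical and would come from the error detection step.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def fix_inversion(sequence, start, end):
    """Reverse-complement sequence[start:end] and leave the flanks untouched.

    start and end are 0-based, end-exclusive coordinates of the inverted block.
    """
    segment = sequence[start:end]
    flipped = segment.translate(_COMPLEMENT)[::-1]
    return sequence[:start] + flipped + sequence[end:]

# Hypothetical scaffold with an inverted block at positions 3..6
corrected = fix_inversion("AAACCGTTT", 3, 6)
```

Applying the same operation twice to the same interval restores the original sequence, which is a quick sanity check for any inversion fix.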





□ MAGinator enables strain-level quantification of de novo MAGs

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555054v1

MAGinator provides de novo identification of subspecies-level microbes and accurate abundance estimates of metagenome-assembled genomes (MAGs).

MAGinator utilises the information from both gene- and contig-based methods yielding insight into both taxonomic profiles and the origin of genes as well as genetic content, used for inference of functional content of each sample by host organism.

MAGinator facilitates the reconstruction of phylogenetic relationships between the MAGs, providing a framework to identify clade-level differences within subspecies MAGs.





□ Joint-snhmC-seq: Joint single-cell profiling resolves 5mC and 5hmC and reveals their distinct gene regulatory effects

>> https://www.nature.com/articles/s41587-023-01909-2

Existing single-cell bisulfite sequencing methods cannot resolve 5mC and 5hmC, leaving the cell-type-specific regulatory mechanisms of TET and 5hmC largely unknown.

Joint single-nucleus (hydroxy)methylcytosine sequencing (Joint-snhmC-seq) is a scalable and quantitative approach that simultaneously profiles 5hmC and true 5mC in single cells by harnessing the differential deaminase activity of APOBEC3A toward 5mC and chemically protected 5hmC.





□ A multimodal Transformer Network for protein-small molecule interactions enhances drug-target affinity and enzyme-substrate predictions

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554147v1

ProSmith facilitates the exchange of all relevant information between the two molecule types during the calculation of their numerical representations, allowing the model to account for their structural and functional interactions.

ProSmith combines gradient boosting predictions based on the resulting multimodal Transformer Network with independent predictions based on separate deep learning representations of the proteins and small molecules.





□ Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective

>> https://arxiv.org/abs/2308.14085

There is a comparatively good grasp of the parameter regions where traditional sampling methods like Monte Carlo sampling or Langevin dynamics are effective and where they are not.

Disordered models that exhibit a phase diagram of the random-first-order-theory type, called discontinuous one-step replica symmetry breaking, are typical in the mean-field theory of glass transition, but they also appear in a variety of random constraint satisfaction problems.

The tools available for outlining the phase diagrams of these problems turn out to be highly effective in analytically describing the performance of generative techniques such as flow-based, diffusion-based, or autoregressive networks for the respective probability measures.





□ SIMVI reveals intrinsic and spatial-induced states in spatial omics data

>> https://www.biorxiv.org/content/10.1101/2023.08.28.554970v1

SIMVI generates highly accurate SE inferences in synthetic datasets and unveils intrinsic variation in complex real datasets. SIMVI disentangles intrinsic and spatial variations in gene expression. It models the gene expression of each cell by two sets of low-dimensional latent variables. The spatial latent variables are modeled by graph neural network variational posteriors.






Oblivionum.

2023-08-31 20:07:08 | Science News

(Created with Midjourney V5.2)




□ StarSpace: Joint representation learning for retrieval and annotation of genomic interval sets

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554131v1

An application of the StarSpace method to convert annotated genomic interval data into low-dimensional distributed vector representations. A system that solves three related information retrieval tasks using embedding distance computations.

The StarSpace algorithm converts each region set and its corresponding label to a numerical vector / embedding / n-dimensional vector represented in embedding space, putting biologically related region set vectors and their labels close to one another in the shared latent space.
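
Retrieval in a shared embedding space reduces to ranking by a similarity measure. The sketch below assumes cosine similarity and uses toy two-dimensional vectors; the label names and the choice of measure are illustrative, not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_labels(query_vector, label_vectors, top_k=1):
    """Rank labels by similarity of their embedding to the query embedding."""
    ranked = sorted(
        label_vectors,
        key=lambda name: cosine_similarity(query_vector, label_vectors[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy label embeddings and a region-set embedding to annotate
labels = {"H3K27ac": [1.0, 0.0], "CTCF": [0.0, 1.0]}
best = nearest_labels([0.9, 0.1], labels)
```

The same ranking works in the other direction: embed a label and retrieve the region sets closest to it.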





□ ClairS: a deep-learning method for long-read somatic small variant calling

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553778v1

ClairS, a somatic variant caller designed for paired samples and primarily ONT long-read. ClairS uses Clair3 and LongPhase for germline variant calling, phasing and read haplotagging. The processed alignments are used for pileup- / full-alignment based somatic variant calling.

ClairS considers the power of the two neural networks equal. Full-alignment-based calling is performant at mid-range VAFs. However, pileup-based calling requires less evidence than full-alignment calling to draw the same conclusion.





□ ETNA: Joint embedding of biological networks for cross-species functional alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad529/7252232

Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem.

ETNA (Embeddings to Network Alignment) generates individual network embeddings based on network topological structures and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs.





□ DCAlign v1.0: Aligning biological sequences using co-evolution models and informed priors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad537/7255914

DCAlign v1.0 is a new implementation of the DCA-based alignment technique, DCAlign, which, unlike the first implementation, allows for a fast parametrization of the seed alignment.

DCAlign v1.0 uses an approximate message-passing algorithm coupled with an annealing scheme over β (i.e. we iteratively increase β) to get the best alignment for the query sequence.





□ Ariadne: synthetic long read deconvolution using assembly graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03033-5

Ariadne, a novel assembly graph-based algorithm, that can be used to deconvolve a large metagenomic linked-read dataset. Ariadne is intuitive, computationally efficient, and scalable to other large-scale linked-read problems, such as human genome phasing.

Ariadne relies on cloudSPAdes parameters (iterative k-mer sizes) to generate the assembly graph; the program itself has only two parameters: search distance and size cutoff. The maximum search distance determines the maximum path length of the Dijkstra graphs surrounding the focal read.

Ariadne deconvolution generates read clouds that are enhanced up to 37.5-fold, containing only reads from a single fragment. Since each read is modeled as the center of a genomic fragment, the search distance can be thought of as the width of the fragment.
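
The role of the search distance can be sketched as a distance-bounded Dijkstra search. The graph encoding (a dict of adjacency lists with edge lengths) and the node names are assumptions for illustration, not Ariadne's data structures.

```python
import heapq

def reachable_within(graph, source, max_distance):
    """Shortest-path distances from source to nodes within max_distance.

    graph: dict mapping node -> list of (neighbor, edge_length) pairs.
    Paths longer than max_distance are never expanded.
    """
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        distance, node = heapq.heappop(heap)
        if distance > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, length in graph.get(node, []):
            candidate = distance + length
            if candidate <= max_distance and candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return best

# Toy assembly graph around a focal read placed on node "a"
toy_graph = {"a": [("b", 3), ("c", 10)], "b": [("c", 4), ("d", 20)], "c": [], "d": []}
nearby = reachable_within(toy_graph, "a", 8)
```

Reads that land on nodes inside this bounded neighbourhood are candidates for belonging to the same fragment as the focal read; everything beyond the search distance is ignored.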





□ ReDis: efficient metagenomic profiling via assigning ambiguous reads

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555244v1

ReDis combines Kraken2 with Minimap2 to align sequencing reads accurately, and within feasible time, against a reference database hundreds of gigabytes (GB) in size.

ReDis's novel step for assigning ambiguous reads significantly raises the accuracy of abundance estimation for organisms with many multi-mapped reads by establishing a statistical model that includes the unique mapping rate.





□ IsoFrog: a Reversible Jump Monte Carlo Markov Chain feature selection-based method for predicting isoform functions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad530/7255910

IsoFrog adopts a Reversible Jump Monte Carlo Markov Chain (RJMCMC)-based feature selection framework to assess the feature importance to gene functions. A sequential feature selection (SFS) procedure is applied to select a subset of function-relevant features.

IsoFrog screens the relevant features for the specific function while eliminating irrelevant ones. The SFS are input into modified domain-invariant partial least squares, which prioritizes the most likely positive isoform and utilizes diPLS for isoform function prediction.





□ Minmers are a generalization of minimizers that enable unbiased local jaccard estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad512/7246743

The minmer winnowing scheme generalizes the minimizer scheme by using a rolling minhash with multiple sampled k-mers per window. By construction, minmers, unlike minimizers, enable an unbiased estimation of the Jaccard similarity.

The minimizer scheme, by contrast, does not yield an unbiased Jaccard estimator. The density of the [w/s]-minimizer scheme tracks closely with the density of (w, s)-minmer intervals which, while not necessary for the use of minmers, serve as a helpful auxiliary index for improving query performance.
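
A naive sketch of the sampling rule, assuming a minmer is any k-mer whose hash is among the s smallest in at least one window of w consecutive k-mers. CRC32 stands in for the hash function and the windows are re-sorted from scratch, so this shows the definition rather than the rolling implementation.

```python
import zlib

def kmer_hashes(sequence, k):
    """Hash every k-mer of the sequence (CRC32 is an arbitrary stand-in)."""
    return [zlib.crc32(sequence[i:i + k].encode()) for i in range(len(sequence) - k + 1)]

def minmer_sketch(sequence, k, w, s):
    """Set of hashes that rank among the s smallest in some window of w k-mers.

    With s = 1 this reduces to the set of minimizer hashes.
    """
    hashes = kmer_hashes(sequence, k)
    sketch = set()
    for start in range(len(hashes) - w + 1):
        window = sorted(set(hashes[start:start + w]))
        sketch.update(window[:s])
    return sketch

# Toy sequence: minimizers (s = 1) versus minmers with s = 2
minimizers = minmer_sketch("ACGTACGTTGCA", 3, 4, 1)
minmers = minmer_sketch("ACGTACGTTGCA", 3, 4, 2)
```

Every minimizer is also a minmer for larger s, so the minmer sketch is a superset that carries the extra samples per window used for the minhash-style estimate.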





□ R2C2+UMI: Combining concatemeric consensus sequencing with unique molecular identifiers enables ultra-accurate sequencing of amplicons on Oxford Nanopore Technologies sequencers

>> https://www.biorxiv.org/content/10.1101/2023.08.19.553937v1

Libraries are processed into high molecular weight DNA using R2C2. R2C2 circularizes library molecules using Gibson assembly. It then uses rolling circle amplification to generate long, linear concatemers containing multiple tandem repeats of the original library molecule.

After sequencing this concatemeric DNA on ONT sequencers, the computational C3POa and BC1 tools generate consensus sequences for each original library molecule. C3POa parses concatemeric raw reads into subreads and generates accurate R2C2 consensus reads from these subreads.

BC1 parses R2C2 consensus reads using a highly flexible syntax for the locating and parsing of UMI sequences, enabling the detection of fixed bases used as spacers or IUPAC wildcard base codes, which can be used to optimize UMIs for more indel-prone long-reads.





□ ggCaller: Accurate and fast graph-based pangenome annotation and clustering

>> https://genome.cshlp.org/content/early/2023/08/24/gr.277733.123

ggCaller (graph-gene-caller) uses population-frequency information to guide gene prediction, aiding the identification of homologous start codons across orthologues, and consistent scoring and functional annotation of orthologues.

ggCaller incorporates Balrog to filter open reading frames (ORFs), improving the specificity of calls, and Panaroo for clustering. ggCaller includes a query mode, enabling reference-agnostic functional inference for sequences of interest, applicable in pangenome-wide association studies (PGWAS).

ggCaller identifies all stop codons in the DBG and traverses the DBG to identify putative gene sequences. Each stop codon is paired with a downstream stop-codon in the same reading frame using a depth first search, thereby delineating the coordinates of all possible reading frames.
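
The stop-to-stop pairing is easier to picture on a plain linear sequence. The sketch below makes that simplification (forward strand only, no graph traversal), so it illustrates the idea rather than ggCaller's DBG algorithm.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_to_stop_frames(sequence):
    """Intervals between consecutive in-frame stop codons on the forward strand.

    Returns (start, end) pairs, 0-based and end-exclusive, covering the bases
    after one stop codon up to (not including) the next stop in the same frame.
    """
    intervals = []
    for frame in range(3):
        previous_stop_end = None
        for position in range(frame, len(sequence) - 2, 3):
            if sequence[position:position + 3] in STOP_CODONS:
                if previous_stop_end is not None and position > previous_stop_end:
                    intervals.append((previous_stop_end, position))
                previous_stop_end = position + 3
    return intervals

# Toy sequence: TAA | ATG AAA | TAG in frame 0
frames = stop_to_stop_frames("TAAATGAAATAG")
```

Each interval bounds where a gene could lie in that frame; start codon selection and scoring would then operate inside those bounds.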





□ GraphCpG: Imputation of Single-cell Methylomes Based on Locus-aware Neighboring Subgraphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad533/7255916

GraphCpG, a graph-based deep learning method using locus-aware neighboring subgraphs to impute the missing methylation states. GraphCpG generates an optimized representation for the target methylation state, which consolidates follow-up neural networks in prediction.

Without CpG position information and DNA context, the completion of the methylation matrix is transformed into a graph-based link prediction problem in a non-Euclidean space and the computational complexity is also reduced.





□ Factorial state-space modelling for kinetic clustering and lineage inference

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554135v1

The directed signal obtained from RNA velocity enables the estimation of transition probabilities between cell-states. This information can be represented as a directed and asymmetric graph.

A latent state-space Markov model that utilises cell-state transitions to model differentiation as a sequence of latent state transitions and to perform soft kinetic clustering of cell-states that accommodates the transitional nature of cells in a differentiation process.





□ scProjection: Projecting RNA measurements onto single cell atlases to extract cell type-specific expression profiles

>> https://www.nature.com/articles/s41467-023-40744-6

scProjection uses deeply sequenced single cell atlases to improve the precision of individual single-cell-resolution measurements. It does so by jointly performing two tasks: deconvolution (estimating % RNA contributions of each of a set of cell types to a single RNA measurement) and projection.

scProjection can impute the expression levels of genes not directly measured. scProjection can separate RNA contributions of the target neuron from neighboring glial cells when analyzing Patch-seq data, leading to more accurate prediction of one data modality from another.





□ Scan: Scanning sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554111v1

The Scan (Sample-specific miRNA regulation) framework scans sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data. Scan incorporates 27 network inference methods and two strategies to infer tissue-specific or cell-specific miRNA regulation.

Scan adopts two strategies, statistical perturbation and linear interpolation, to infer sample-specific miRNA regulatory networks. Scan can help to cluster samples and construct a sample correlation network.





□ pareg: Coherent pathway enrichment estimation by modeling inter-pathway dependencies using regularized regression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad522/7248907

pareg follows the ideas of GSEA as it requires no stratification of the input gene list, of MGSA as it incorporates term-term relations in a database-agnostic way, and of LRPath as it makes use of the flexibility of the regression approach.

By regressing the differential expression p-values of genes on their membership to multiple gene sets while using LASSO and gene set similarity-based regularization terms, they require no prior thresholding and incorporate term-term relations into the enrichment computation.





□ CellAnn: A comprehensive, super-fast, and user-friendly single-cell annotation web server

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad521/7248909

CellAnn, a reference-based cell annotation web server. CellAnn uses a cluster-to-cluster alignment method to transfer cell labels from the reference to the query datasets, which is superior to the existing methods with higher accuracy and higher scalability.

CellAnn calculates the correlations and estimates correlation cutoffs between the query data and sub-clusters in reference datasets. CellAnn performs the Wilcoxon rank-sum test to determine cell types further if a query cluster is similar to multiple sub-clusters in the reference.





□ GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information

>> https://arxiv.org/abs/2304.09667

GeneGPT, a novel method that prompts Codex to use NCBI Web APIs. GeneGPT consists of a specifically designed prompt comprising documentation and demonstrations of API usage, and an inference algorithm that integrates API calls into the Codex decoding process.

GeneGPT generalizes to longer chains of subquestion decomposition and API calls with simple demonstrations; GeneGPT makes specific errors that are enriched for each task. GeneGPT uses chain-of-thought API calls to answer a multi-hop question in GeneHop.

GeneHop contains three new multi-hop QA tasks based on GeneTuring: SNP gene function, which asks for the function of the gene associated with a given SNP; Disease gene location, where the task is to list the chromosome locations of the genes associated with a given disease; and Sequence gene alias, which asks for the aliases of the gene that contains a specific DNA sequence.





□ CellAgentChat: Harnessing Agent-Based Modeling in CellAgentChat to Unravel Cell-Cell Interactions from Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2023.08.23.554489v1

CellAgentChat presents a unique agent-based perspective on cellular interactions, seamlessly integrating temporal, spatial, and biological data, offering a more precise and comprehensive understanding of cellular interaction dynamics.

CellAgentChat employs individual cell agents guided by simple behavior rules to investigate the arising complexity of cellular interactions. CellAgentChat enables in silico perturbations and in-depth analysis of the effects of cellular interactions on downstream gene expression.





□ SC2Spa: a deep learning based approach to map transcriptome to spatial origins at cellular resolution

>> https://www.biorxiv.org/content/10.1101/2023.08.22.554277v1

SC2Spa identified spatially variable genes and suggested negative regulatory relationships between genes. SC2Spa armored with deep learning provides a new way to map the transcriptome to its spatial location and perform subsequent analyses.

A key feature of SC2Spa is the ability to score the SVGs from their weight space. SC2Spa can choose either polar or Cartesian coordinates. As SC2Spa maps gene expression directly to coordinates, the computational complexity of SC2Spa increases linearly.





□ eGADA: enhanced Genomic Alteration Detection Algorithm, a fast genomic segmentation algorithm

>> https://www.biorxiv.org/content/10.1101/2023.08.20.553622v1

eGADA is an enhanced version of GADA, which is a fast segmentation algorithm utilizing the Sparse Bayesian Learning (or Relevance Vector Machine) technique.

eGADA uses a Red-Black (RB) tree to store all segment breakpoints as nodes in the tree and then eliminates the least significant breakpoint based on the tree. Breakpoints are sorted by their corresponding t-statistic if either t-statistic is below a pre-set threshold.

The segment length of a breakpoint is defined as the length of the shorter flanking segment. A Red-Black tree supports insertion, deletion, and lookup in O(log(n)) time each, so the time complexity of the backward elimination (BE) step is improved from O(n^2) to O(n*log(n)).
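
A rough sketch of backward elimination with an ordered structure follows. It is not eGADA's code: a binary heap with lazy invalidation stands in for the Red-Black tree, the t-statistic assumes a known common noise level `sigma`, and every name here (`backward_elimination`, `t_min`) is hypothetical. It only shows why each elimination can cost O(log(n)) rather than a full rescan.

```python
import heapq
import math

def backward_elimination(means, lengths, sigma, t_min):
    """Merge adjacent segments until every breakpoint has t-statistic >= t_min."""
    n = len(means)
    means, lengths = list(means), list(lengths)
    left = list(range(-1, n - 1))   # previous live segment (-1 means none)
    right = list(range(1, n + 1))   # next live segment (n means none)
    alive = [True] * n
    version = [0] * n               # bumped whenever a segment is modified

    def t_stat(a, b):
        se = sigma * math.sqrt(1 / lengths[a] + 1 / lengths[b])
        return abs(means[a] - means[b]) / se

    # Heap entries: (t, left segment, right segment, versions at push time).
    heap = [(t_stat(i, i + 1), i, i + 1, 0, 0) for i in range(n - 1)]
    heapq.heapify(heap)

    while heap:
        t, a, b, va, vb = heapq.heappop(heap)
        if not (alive[a] and alive[b]) or version[a] != va or version[b] != vb:
            continue                # stale entry: a flanking segment has changed
        if t >= t_min:
            break                   # the weakest remaining breakpoint is significant
        # Remove the breakpoint by merging segment b into segment a.
        total = lengths[a] + lengths[b]
        means[a] = (means[a] * lengths[a] + means[b] * lengths[b]) / total
        lengths[a] = total
        alive[b] = False
        version[a] += 1
        right[a] = right[b]
        if right[b] < n:
            left[right[b]] = a
        # Re-score the breakpoints on both sides of the merged segment.
        if left[a] >= 0:
            l = left[a]
            heapq.heappush(heap, (t_stat(l, a), l, a, version[l], version[a]))
        if right[a] < n:
            r = right[a]
            heapq.heappush(heap, (t_stat(a, r), a, r, version[a], version[r]))

    return [(means[i], lengths[i]) for i in range(n) if alive[i]]

# Four segments collapse into two once the weak breakpoints are removed.
print(backward_elimination([0.0, 0.1, 5.0, 5.2], [10, 10, 10, 10], sigma=1.0, t_min=3.0))
```

Every merge pushes at most two new heap entries, so the total work is bounded by O(n*log(n)) heap operations.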





□ Gonomics: Uniting high performance and readability for genomics with Go

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad516/7251027

Gonomics, an open-source collection of command line programs and bioinformatic libraries implemented in Go that unites readability and performance for genomic analyses.

Gonomics contains packages to read, write, and manipulate a wide array of file formats (e.g. FASTA, FASTQ, BED, BEDPE, SAM, BAM, and VCF), and can convert and interface between these formats.




□ CoFrEE: An Application to Estimate DNA Copy Number from Genome-wide RNA Expression Data

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554898v1

Copy number from Expression Estimation (CoFrEE) is unique in providing an intuitively simple approach appropriate for both RNAseq and array-based expression cohorts. This is also the first such application to focus on facilitating copy number estimates.

The core methodology shares recursive median filtering with CaSpER [6] but employs dedicated by-gene pre-processing and by-sample post-processing to achieve final copy number estimates. The preprocessing step shares similarity to CNV-Kit.





□ scNanoHi-C: a single-cell long-read concatemer sequencing method to reveal high-order chromatin structures within individual cells

>> https://www.nature.com/articles/s41592-023-01978-w

scNanoHi-C applies Nanopore long-read sequencing to explore genome-wide proximal high-order chromatin contacts within individual cells. scNanoHi-C can reliably and effectively profile 3D chromatin structures and distinguish structure subtypes among individual cells.

scNanoHi-C could also be used to detect genomic variations, including copy-number variations and structural variations, as well as to scaffold the de novo assembly of single-cell genomes.

Extensive high-order chromatin structures exist in active chromatin regions across the genome, and multiway interactions between enhancers and their target promoters were systematically identified within individual cells.

scNanoHi-C sequencing data were first demultiplexed to single cells by Nanoplexer using known cell barcodes with default parameters. Adapter sequences were trimmed by Cutadapt, and reads shorter than 500 bp were removed.





□ Sandy: A user-friendly and versatile NGS simulator to facilitate sequencing assay design and optimization

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554791v1

Sandy, a user-friendly and computationally efficient tool with complete computational methods for simulating NGS data from three platforms: Illumina, Oxford Nanopore, and Pacific Biosciences. Sandy generates reads requiring only a FASTA file as input.

Sandy simulates single-end and paired-end reads from both DNA and RNA sequencing. Sandy includes a built-in database of predefined models extracted from real data, including sequencer quality profiles (e.g. Illumina HiSeq, MiSeq, NextSeq) and expression matrices generated from GTEx v8 data.





□ Flow: a web platform and open database to analyse, store, curate and share bioinformatics data at scale

>> https://www.biorxiv.org/content/10.1101/2023.08.22.544179v1

Flow uses established nf-core pipelines, with some custom ones written to nf-core conventions, including demultiplexing and CLIP-Seq pipelines. Once analysed, all stages of data processing can be seamlessly shared with the community via an open database model.





□ Accurate human genome analysis with Element Avidity sequencing

>> https://www.biorxiv.org/content/10.1101/2023.08.11.553043v1

Element whole genome sequencing achieves higher mapping and variant calling accuracy compared to Illumina sequencing at the same coverage, with larger differences at lower coverages (20x-30x).

One new property of Element's AVITI platform is the ability to generate paired-end sequencing data with longer insert sizes (the distance between the paired reads) than is typical with Illumina preparations.





□ RichPathR: a gene set enrichment analysis and visualization tool

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555198v1

RichPathR fills the gap of available tools for rapid mining of pre-annotated data of pathways/terms. A single transcriptomic or epigenetic high-throughput sequencing experiment might generate several gene sets, and mining these gene sets one at a time could be time-consuming.





□ ASTA-P: a pipeline for the detection, quantification and statistical analysis of complex alternative splicing events

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555224v1

ASTA-P, a pipeline for the analysis of arbitrarily complex splice patterns, using ASTALAVISTA to mine complete splicing events of different dimensions, followed by quantification with a custom script, and modelling the event counts using the Dirichlet-multinomial regression.

ASTA-P combines full-length transcript reconstruction for enriching the existing annotation model before assembling the splicing graph for each gene. This is followed by mining and quantification of local splice events, including binary as well as high-dimensional patterns.





□ HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad535/7255913

HAPNEST simulates pairs of synthetic haplotypes, where each haplotype is constructed as a mosaic of segments of various lengths imperfectly copied from a reference set of real haplotypes.

HAPNEST additionally models the coalescence age of segments using an approximate model inspired by the sequential Markovian coalescent model.
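
The mosaic-copying idea can be sketched in a few lines. This is an illustrative toy, not HAPNEST's model: it builds a single haplotype rather than a pair, draws segment lengths from an exponential distribution, ignores coalescence ages, and the names `mosaic_haplotype`, `mean_segment_length`, and `error_rate` are all assumptions made for this example.

```python
import random

def mosaic_haplotype(reference, mean_segment_length=3.0, error_rate=0.0, rng=None):
    """Build one synthetic haplotype as a mosaic of segments copied from `reference`.

    reference: list of equal-length haplotypes with alleles coded as 0/1.
    error_rate: per-site probability of flipping the copied allele, which
    makes the copy imperfect.
    """
    rng = rng or random.Random()
    n_sites = len(reference[0])
    haplotype = []
    site = 0
    while site < n_sites:
        source = rng.choice(reference)  # pick a real haplotype to copy from
        # Segment length is exponential here; this is an assumption of the sketch.
        length = max(1, round(rng.expovariate(1.0 / mean_segment_length)))
        for allele in source[site:site + length]:
            if rng.random() < error_rate:
                allele = 1 - allele     # imperfect copy: flip the allele
            haplotype.append(allele)
        site += length
    return haplotype

reference = [[0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 0, 0, 0, 0, 1, 1]]
print(mosaic_haplotype(reference, error_rate=0.05, rng=random.Random(1)))
```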





□ DosaCNV: Deep multiple-instance learning accurately predicts gene haploinsufficiency and deletion pathogenicity

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555384v1

DosaCNV is a supervised deep multiple-instance learning (MIL) model designed to simultaneously infer the pathogenicity of coding deletions and the haploinsufficiency of genes, based on the assumption that the joint effect of gene haploinsufficiency determines deletion pathogenicity.





□ Galba: genome annotation with miniprot and AUGUSTUS

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05449-z

GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes.

GALBA provides substantially higher accuracy than BRAKER2 in the genomes of large vertebrates because GeneMark-ES within BRAKER2 performs poorly in such genomes when generating seed regions for spliced-alignment of proteins to the genome.





□ DecentTree: Scalable Neighbour-Joining for the Genomic Era

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad536/7257068

DecentTree is designed as a stand-alone application and a header-only library easily integrated with other phylogenetic software (e.g., it is integrated into IQ-TREE). DecentTree shows improved performance over existing software (BIONJ, Quicktree, FastME, and RapidNJ).

DecentTree uses the Vector Class Library and OpenMP multithreading to parallelize the computations. DecentTree accepts either a distance matrix in Phylip format or a multiple sequence alignment in common formats such as Phylip or Fasta.
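
For readers unfamiliar with neighbour-joining, the step that such tools accelerate is the search for the pair of taxa to join. The sketch below shows the textbook selection criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), on a plain distance matrix. It is a naive version and not DecentTree's vectorized code; the function name `nj_pick_pair` is hypothetical.

```python
def nj_pick_pair(dist):
    """Return (q, i, j) for the pair of taxa minimizing the neighbour-joining Q criterion."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best

# Five taxa; the first two are selected for joining with Q = -50.
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(distances))  # (-50, 0, 1)
```

A full neighbour-joining run repeats this selection n - 3 times while shrinking the matrix, which is why the inner loops are the natural target for vectorization and multithreading.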





□ VData: Temporally annotated data manipulation and storage

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555297v1

VData, a solution for storing and manipulating single cell datasets that extends the widely used AnnData format and is designed with synthetic data in mind.

VData adds a third 'time' dimension beyond the usual 'cell' and 'gene' axes to support time-stamped longitudinal data and focuses heavily on a low memory footprint to allow fast and efficient handling of large datasets of tens of gigabytes, even on regular laptops.





CLANN / “Arise”

2023-08-31 01:22:57 | Music20

□ CLANN / “Arise”

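The first new song in six years from CLANN, the legendary unit known for mystical female vocals, church-style choirs, grand orchestration, and dope beat tracks. A masterpiece filled with anguish, yet overflowing with enveloping compassion and strength.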



Stream/Buy: https://idol-io.ffm.to/Arise

Composer/Producer: Sebastian McKinnon
Lyrics: Sebastian McKinnon
Vocals: Charlotte Oleena
Violin: Chloé Picard
Arrangement: Francis Choinière
Conductor: Francis Choinière
Orchestra: Orchestre FILMharmonique
Mixing: Sylvain Lefebvre, Sebastian McKinnon
- - -
Sculpture: Riccardo Tenani
Lighting and Rendering Artist: Henri Hebeisen

Everlasting.

2023-08-31 00:32:16 | Photos


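If you do not deny the past, then you do not affirm it either. Even when I think of the people who have led me to where I am now, the more meaningful the encounter, the greater the pain. Wanting to know what cannot be known is weakness, and deciding the future from only the events you can hold in your hands is the easy way. Strength, after all, is accepting things while they remain not understood.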

Morphing Thru Time

2023-08-27 08:08:08 | Enigma

□ ENIGMA / “Morphing Thru Time” (AI-Video Created with GEN-2 runwayml)

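I tried turning my favorite Enigma song into a video using GEN-2 Image-To-Video automatic movie generation. Note: the source material includes images generated with Midjourney and, in part, works by ekaitza and others.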




Sound source copyright ℗ – Virgin Schallplatten GmbH
Copyright © – Virgin Schallplatten GmbH
Publication – Enigma Songs
Publication – Mambo-Music
Recording – A.R.T. Studios

Instruments – Michael Cretu
Lyrics By – David Fairstein
Music By, Lyrics By – Michael Cretu
Producer, Engineer – Michael Cretu
Vocals [Female Voice] – Louisa Stanley, Sandra Cretu


BELLUSTAR.

2023-08-22 22:22:22 | Hotel

I am staying at BELLUSTAR TOKYO. The golden ratio between the chic, black-toned, right-angled design and the view from the Deluxe Suite is beautiful.























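Dinner at BELLUSTAR TOKYO. Every dish was an original creation evoking Japan's seasons. The smoked and grilled Ezo venison was truly delicious. The fireworks display seen from the 47th floor was exceptional too, and the atmosphere with the lights dimmed was lovely. The fresh-leaf lemongrass tea I selected with dessert was more moving than any herbal tea I have ever had, and I ended up having three refills.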





The view from the upper floors, veiled in rain, is lovely too.



Seven Sisters.

2023-08-22 20:20:20 | Astronomy


□ Bakry

>> https://x.com/bakrybaso/status/1691927701928526160


The Pleiades cluster, or the Seven Sisters | The Pleiades ✨️

This is the image I am proudest of. It means a great deal to me, and I consider it a turning point in my astronomy journey 🌠

It is one of the celestial objects the Arabs used to sing of, and perhaps the most notable thing said about it is the verse of Abu al-Tayyib al-Mutanabbi:
How far are fault and deficiency from my honor!
I am the Pleiades, and those two are gray hair and old age.

He likened himself to the Pleiades in the sense of loftiness 🌌

And the verse of Ibn al-Mu'tazz:
As if the Pleiades were a howdah upon a she-camel,
driven onward by a restless camel driver of the night.

It gleamed among the stars as though it were
glass vials in which quicksilver quivers.

Share yours with us

Missa Solemnis 2.0

2023-08-21 01:46:34 | art music


- [x] "Beethoven: Missa Solemnis 2.0": a collaboration between Beethoven's Missa Solemnis and AI. In sync with the orchestra's performance, AI data sculptures are rendered in real time on a giant display.

Omega Point.

2023-08-16 00:00:00 | Science News

(made with DALL-E 2)




□ ENIGMA: Approximate estimation of cell-type resolution transcriptome in bulk tissue through matrix completion

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad273/7234627

ENIGMA (Deconvolution based on Regularized Matrix Completion) is a method that accurately deconvolutes bulk tissue RNA-seq data into a readout with cell-type resolution by leveraging information from scRNA-seq data.

ENIGMA employs a matrix completion strategy to minimize the distance between the mixture transcriptome obtained with bulk sequencing and a weighted combination of cell-type-specific expression (CSE). ENIGMA reconstructs the latent continuous structure of CSE into a pseudo-trajectory.





□ GROVER: The human genome’s vocabulary as proposed by the DNA language model

>> https://www.biorxiv.org/content/10.1101/2023.07.19.549677v1

GROVER ("Genome Rules Obtained Via Extracted Representations") selects the optimal vocabulary with a custom fine-tuning task of next-k-mer prediction. GROVER has learned these structures purely from the contextual relationships of tokens.

GROVER extracts the information content of the genome, its language structures via token embeddings or through extracting attention from the foundation model. Self-similarity was assessed as the cosine similarity of different embeddings separately for all 12 transformer layers.
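
As a small illustration of the similarity measure mentioned above, the sketch below computes cosine similarity and averages it over all pairs of embeddings of one token. It assumes that self-similarity means the mean pairwise cosine similarity of a token's contextual embeddings within one layer; that reading, and the function names, are assumptions of this example rather than details taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def self_similarity(embeddings):
    """Mean pairwise cosine similarity among embeddings of the same token."""
    pairs = [(i, j) for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))]
    total = sum(cosine_similarity(embeddings[i], embeddings[j]) for i, j in pairs)
    return total / len(pairs)

# Three toy 2-dimensional embeddings of one token in different contexts.
print(self_similarity([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Running the same computation on the embeddings from each of the 12 layers would give one self-similarity value per layer.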





□ biomolecular neuron: Simple and rewireable biomolecular building blocks for DNA machine-learning algorithms

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549967v1

biomolecular neuron, a polymerase-actuated DNA computing unit that serves as a rewireable building block for neural network algorithms. biomolecular neuron generates DNA computing units of longer lengths than is feasible via chemical synthesis.

This scheme combines enzymatic synthesis to encode a greater number of i/o connections on a single DNA strand, solid-phase immobilization to spatially segregate DNA computing units into network layers, and universal addressing to enable the assembly of different circuits.

biomolecular neuron generates computing units from fewer DNA sequences and offers built-in modularity through circuit rewiring. A surface-based DNA computing approach has a unique feature: computation at each layer is synchronized to the timing of fluid transfer.





□ XGDAG: eXplainable Gene–Disease Associations via Graph Neural Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad482/7235567

XGDAG is the first method to use an XAI-based solution in the context of positive-unlabeled learning for disease gene prioritization with GNNs. A graph based on a PPI network and enriched with GDA information and node features is fed into a graph neural network.

XGDAG exploits XAI methods to draw the final ranking of candidate genes. This is a novelty that presents XAI not only as a tool that opens the black box of deep neural networks but also as an analysis component directly incorporated into the GDA discovery pipeline.





□ GENIUS: GEnome traNsformatIon and spatial representation of mUltiomicS data

>> https://www.biorxiv.org/content/10.1101/2023.02.09.525144v3

Integrated Gradients evaluates the trained model relative to the input data and output label, resulting in attribution scores for each input with respect to the output label. Integrated Gradients represents the integral of gradients with respect to inputs along the path from a given baseline to the input.
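
The path integral can be approximated numerically. The sketch below is a generic illustration of Integrated Gradients on a toy function, not code from GENIUS: it uses a midpoint Riemann sum along the straight line from the baseline to the input, and `toy_model`, `toy_grad`, and `steps` are hypothetical names chosen for the example.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the model output with
    respect to each input at `point`.
    """
    n = len(x)
    grad_sum = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # position along the baseline-to-input path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            grad_sum[i] += grad[i]
    # Scale the average gradient by the input's displacement from the baseline.
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sum)]

def toy_model(x):
    """Toy model F(x) = sum of squared inputs."""
    return sum(v * v for v in x)

def toy_grad(x):
    """Analytic gradient of toy_model."""
    return [2 * v for v in x]

x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
# Completeness: attributions sum to F(x) - F(baseline), up to numerical error.
print(attributions, sum(attributions), toy_model(x) - toy_model(baseline))
```

For this toy model each attribution is approximately the square of the corresponding input, so the attributions add up to the difference in model output between the input and the baseline.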

GENIUS (GEnome traNsformatIon and spatial representation of mUltiomicS data) can transform multi-omics data into images with genes displayed as spatially connected pixels and successfully extract relevant information with respect to the desired output.





□ scyan: Biology-driven deep generative model for cell-type annotation in cytometry

>> https://mics-lab.github.io/scyan/

Scyan (Single-cell Cytometry Annotation Network) is a Bayesian probabilistic model composed of a deep invertible neural network called a normalizing flow (the function ). It maps a latent distribution of cell expressions into the empirical distribution of cell expressions.

This cell distribution is a mixture of Gaussian-like distributions representing the sum of a cell-specific and a population-specific term. Also, interpretability and batch effect correction are based on the model latent space.





□ SCLSC: Predicting cell types with supervised contrastive learning on cells and their types

>> https://www.biorxiv.org/content/10.1101/2023.08.08.552379v1

SCLSC (Supervised Contrastive Learning for Single Cell) leverages supervised contrastive learning, which utilizes label information from the training data to provide explicit guidance on the similarity or dissimilarity between samples during the learning process.

SCLSC has two key parameters: the dimension of the input and the dimension of the output of the encoder. For the input dimension, SCLSC has the capability to process input from all genes.





□ scDGD: The Deep Generative Decoder: MAP estimation of representations improves modeling of single-cell RNA data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad497/7241685

scDGD is an application of the encoder-less generative model, the Deep Generative Decoder (DGD), a simple generative model that computes model parameters and representations directly via maximum a posteriori (MAP) estimation.

The DGD handles complex parameterized latent distributions naturally unlike VAEs which typically use a fixed Gaussian distribution, because of the complexity of adding other types.





□ Genome-wide prediction of disease variant effects with a deep protein language model

>> https://www.nature.com/articles/s41588-023-01465-0

ESM1b, a 650-million-parameter protein language model trained on 250 million protein sequences. The model was trained via the masked language modeling task, where random residues are masked from input sequences and the model has to predict the correct amino acid at each position.

ESM1b computes the LLR scores for all possible missense mutations in a protein through a single pass.
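
To make the scoring concrete, the sketch below turns per-position log-probabilities into log-likelihood ratio (LLR) scores for every missense substitution. It assumes the LLR of a variant is the log-probability of the mutant residue minus that of the wild-type residue at the same position, and it takes the log-probabilities as a given input rather than calling any model; `llr_scores` and the input layout are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def llr_scores(log_probs, wild_type):
    """Score all missense substitutions of a protein.

    log_probs[i][aa] is the model's log-probability of amino acid `aa` at
    position i. Returns {(position, wt, mutant): llr} with 1-based positions.
    """
    scores = {}
    for i, wt in enumerate(wild_type):
        wt_log_prob = log_probs[i][wt]
        for aa in AMINO_ACIDS:
            if aa != wt:
                scores[(i + 1, wt, aa)] = log_probs[i][aa] - wt_log_prob
    return scores

# Toy input: the wild-type residue gets probability 0.5, the rest share 0.5.
protein = "MK"
toy_log_probs = [
    {aa: math.log(0.5 if aa == wt else 0.5 / 19) for aa in AMINO_ACIDS}
    for wt in protein
]
scores = llr_scores(toy_log_probs, protein)
print(len(scores), scores[(1, "M", "A")])
```

Because one set of per-position probabilities yields scores for all 19 alternatives at every position, the whole substitution matrix of a protein comes out of a single model evaluation.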





□ BEENE: Deep Learning based Nonlinear Embedding Improves Batch Effect Estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad479/7240486

BEENE uses an autoencoder model to learn the nonlinear embeddings of RNA-seq expression data. The nonlinear embedding learned by the autoencoder is used by both batch and biological variable learner modules.

The autoencoder and these two learning networks are trained in tandem to guide the embedding in such a way that biological heterogeneity in the data as well as variability across batches are preserved.





□ CellGO: A novel deep learning-based framework and webserver for cell type-specific gene function interpretation

>> https://www.biorxiv.org/content/10.1101/2023.08.02.551654v1

CellGO, a VNN-based tool for cell type-specific pathway analysis. CellGO integrates the single-cell RNA expression data and the VNN model that emulates the hierarchy of GO terms to capture cell type-specific signatures, intra-pathway gene connections, and inter-pathway crosstalk.

CellGO can construct the network of cell type-specific active pathways and report top communities enriched with active pathways, by incorporating the random walk with restart algorithm and the community partition algorithm.
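
The random walk with restart (RWR) step can be sketched generically as the iteration p = (1 - r) W p + r p0, where W is the column-normalized adjacency matrix and p0 places the starting mass on the seed nodes. The code below is a plain illustration on a small graph, not the tool's implementation; the restart probability, tolerance, and function name are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p = (1 - restart) * W p + restart * p0 until convergence.

    adjacency: square matrix (list of lists) of non-negative edge weights.
    seeds: set of node indices where the walk restarts.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        new_p = []
        for i in range(n):
            # Probability mass arriving at node i from its neighbours.
            walk = sum(adjacency[i][j] / col_sums[j] * p[j]
                       for j in range(n) if col_sums[j] > 0)
            new_p.append((1 - restart) * walk + restart * p0[i])
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Path graph 0 - 1 - 2 with the walk restarting at node 0.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(random_walk_with_restart(path, seeds={0}))
```

Nodes closer to the seeds receive higher steady-state probability, which is what makes the scores usable for ranking pathways near a set of active ones.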





□ SR2: Sparse Representation Learning for Scalable Single-cell RNA Sequencing Data Analysis

>> https://www.biorxiv.org/content/10.1101/2023.07.31.551228v1

SR2 is based on an ensemble of matrix factorization and sparse representation learning. It decomposes variation from multiple biological conditions and cellular variation across bio-samples into shared low-rank latent spaces.

SR2 employs sparse regularization on embedding of cells to facilitate cell population discovery and norm constraint on each component of gene representations to ensure equal scale.





□ BEDwARS: a robust Bayesian approach to bulk gene expression deconvolution with noisy reference signatures

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03007-7

BEDwARS (Bayesian Expression Deconvolution with Approximate Reference Signatures), which tackles the problem of signature mismatch from a complementary angle.

BEDwARS incorporates the possibility of reference signature mismatch directly into the statistical model used for deconvolution, using the reference to estimate the true cell type signatures underlying the given bulk profiles while simultaneously learning cell type proportions.

BEDwARS assumes that each bulk expression profile is a weighted mixture of cell type-specific profiles (“true signatures”) that are unknown but not very different from given reference signatures.





□ MLNGCF: circRNA-disease associations prediction with multi-layer attention neural graph based collaborative filtering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad499/7240485

MLNGCF first enhances multiple sources of biological information with an autoencoder to form the initial features of circRNAs and diseases. A multi-layer cooperative attention-based message propagation is performed on the central network to obtain the high-order features of circRNAs and diseases.

An interaction function of collaborative filtering is introduced to integrate both matrix factorization and multilayer perceptron and score circRNAs-disease associations.





□ contrastiveVI: Isolating salient variations of interest in single-cell data

>> https://www.nature.com/articles/s41592-023-01955-3

contrastiveVI (contrastive Variational Inference), a framework for deconvolving variations in treatment–control single-cell RNA sequencing (scRNA-seq) datasets into shared and treatment-specific latent variables.

contrastiveVI is a generative model designed to isolate factors of variation specific to a group of "target" cells (e.g. from specimens with a given disease) from those shared with a group of "background" cells.





□ Chrombus-XMBD: A Graph Generative Model Predicting 3D-Genome, ab initio from Chromatin Features

>> https://www.biorxiv.org/content/10.1101/2023.08.02.551072v1

Chrombus-XMBD, a graph generative model capable of predicting chromatin interactions. Chrombus employs dynamic edge convolution with a QKV attention setup, which maps the relevant chromatin features to a learnable embedding space, thereby generating a genome-wide 3D contact map.

Chrombus is adapted from the Graph Auto-Encoder architecture. Each graph consists of 128 vertices, and each vertex represents a chromatin segment derived from CTCF binding peaks. The node (vertex) attributes consist of 14-dimensional chromatin features.





□ Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad487/7236499

Block Aligner, a new SIMD-accelerated algorithm for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. They introduce a new paradigm that uses blocks in the dynamic programming matrix that greedily shift, grow, and shrink.

Block Aligner relies on the COMPUTE_RECT function to efficiently compute the scores for certain rectangular regions of the DP matrix. Block Aligner generally tries to center the maximum scores (likely optimal alignment path) within the computed regions.





□ AAontology: An ontology of amino acid scales for interpretable machine learning

>> https://www.biorxiv.org/content/10.1101/2023.08.03.551768v1

AAontology is a two-level classification of 586 amino acid scales (mainly from AAindex), together with an in-depth analysis of their relations, built using bag-of-words-based classification, clustering, and manual refinement over multiple iterations.

AAontology organizes amino acid property scales into 8 categories and 67 subcategories based on their numerical similarity and physicochemical meaning.

The Energy category comprises around 40 scales organized into 9 specific subcategories, each highlighting different energetic aspects of amino acids including free energy determining conformational stability.





□ PhyloVelo enhances transcriptomic velocity field mapping using monotonically expressed genes

>> https://www.nature.com/articles/s41587-023-01887-5

PhyloVelo, a computational framework that estimates the velocity of transcriptomic dynamics by using monotonically expressed genes (MEGs) or genes with expression patterns that either increase or decrease, but do not cycle, through phylogenetic time.

PhyloVelo identifies MEGs and reconstructs a transcriptomic velocity field. A diffusion process is used to model the dynamics of latent gene expression. This enables the estimation of phylogenetic velocity, which corresponds to the drift coefficients of MEGs in the diffusion process.





□ disperseNN2: a neural network for estimating dispersal distance from georeferenced polymorphism data

>> https://www.biorxiv.org/content/10.1101/2023.07.30.551115v1

The disperseNN2 program uses a deep neural network trained on simulated data to infer the mean, per-generation parent-offspring distance. It aims to infer σ, the root-mean-square displacement along a given axis between a randomly chosen child and one of their parents chosen at random.

disperseNN2 is designed for SNP data obtained from RADseq or whole genome sequencing, with either short-range or full linkage information. disperseNN2 uses a pairwise convolutional network that performs feature extraction on pairs of individuals at a time.

"The extractor" extracts pertinent information from pairs of genotypes, and merges the extracted features from all combinatorial pairs into a summary table for downstream processing.

This strategy allows us to convey spatial information to the network which is accomplished by attaching the geographic distance between each sample-pair directly to the genotype summaries from the corresponding pair.

The first input to disperseNN2 is a genotype matrix consisting of minor allele counts (0s, 1s, and 2s) for m SNPs from n individuals. However, rather than showing the full genotype matrix to the network, it loops through all pairs of individuals and subsets the genotypes of each pair.
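
The pairing described above can be sketched without any neural network. The example below only prepares the per-pair inputs: it loops over all pairs of individuals, subsets their genotype rows, and attaches the geographic distance between the two sampling locations. It is an illustration under assumed data layouts, not disperseNN2's code, and `pairwise_inputs` is a hypothetical name.

```python
from itertools import combinations
import math

def pairwise_inputs(genotypes, locations):
    """Build one record per pair of individuals.

    genotypes: n lists of m minor allele counts (0, 1, or 2).
    locations: n (x, y) sampling coordinates.
    """
    records = []
    for i, j in combinations(range(len(genotypes)), 2):
        records.append({
            "pair": (i, j),
            "genotypes": (genotypes[i], genotypes[j]),  # subset for this pair
            "distance": math.dist(locations[i], locations[j]),
        })
    return records

genotypes = [[0, 1, 2, 0], [1, 1, 0, 0], [2, 0, 0, 1]]
locations = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
for record in pairwise_inputs(genotypes, locations):
    print(record["pair"], record["distance"])
```

With n individuals this produces n(n - 1)/2 records, each of which would be passed through the shared feature extractor before the results are merged.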





□ PACS: Model-based compound hypothesis testing for snATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2023.07.30.551108v1

PACS (Probability model of Accessible Chromatin of Single cells), a zero-adjusted statistical model that can allow complex hypothesis testing of factors that affect accessibility while accounting for sparse and incomplete data.

PACS could detect both linear and quadratic signals, and its power is dependent on the "effect sizes" defined as the log fold change of accessibility between the highest and lowest accessibility.

PACS resolves the issue of sequencing coverage variability in scATAC-seq data by combining a probability model of the underlying group-level accessibility with an independent cell-level capturing probability.





□ ISMI-VAE: A Deep Learning Model for Classifying Disease Cells Using Gene Expression and SNV Data

>> https://www.biorxiv.org/content/10.1101/2023.07.28.550985v1

ISMI-VAE leverages latent variable models that utilize the characteristics of SNV and gene expression data to overcome high noise levels, and uses deep learning techniques to integrate multimodal information, map them to a low-dimensional space, and classify disease cells.

ISMI-VAE combines an attention mechanism with a variational autoencoder. It proposes an attention module that uses the weights of the attention vector to reflect the importance of gene features, as a way to identify genes or SNVs that are highly associated with disease.
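
As a rough illustration of reading feature importance off an attention vector, the sketch below normalizes raw attention scores with a softmax and ranks features by weight. The softmax normalization, the function names, and the gene names and scores are assumptions for the example, not details taken from ISMI-VAE.

```python
from math import exp

def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_features(names, scores, k=2):
    """Return the k features with the largest attention weights."""
    weights = softmax(scores)
    ranked = sorted(zip(names, weights), key=lambda nw: nw[1], reverse=True)
    return ranked[:k]

# Hypothetical features and raw attention scores.
names = ["GENE_A", "GENE_B", "SNV_1", "GENE_C"]
scores = [0.2, 2.5, 1.7, -0.4]
top = top_features(names, scores, k=2)
top_names = [name for name, _ in top]
```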





□ SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks

>> https://www.nature.com/articles/s41592-023-01938-4

SCENIC+, a computational framework that combines single-cell chromatin accessibility and gene expression data with motif discovery to infer enhancer-driven GRNs.

SCENIC+ integrates region accessibility, TF and target gene expression and cistromes to infer eGRNs, in which TFs are linked to their target regions and these to their target genes.

SCENIC+ next uses GRNBoost2 to quantify the importance of both TFs and enhancer candidates for target genes and it infers the direction of regulation (activating/repressing) using linear correlation.
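
The direction-of-regulation step can be illustrated with a plain Pearson correlation between TF and target expression across cells. This is a minimal sketch: it uses only the sign of the correlation, whereas SCENIC+ may apply additional thresholds, and the function names and expression values are made up for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def regulation_direction(tf_expr, target_expr):
    """Label a TF-target link by the sign of the linear correlation."""
    r = pearson(tf_expr, target_expr)
    return ("activating" if r > 0 else "repressing"), r

# Hypothetical expression values across five cells.
tf = [1.0, 2.0, 3.0, 4.0, 5.0]
target_up = [2.0, 4.0, 5.0, 4.0, 6.0]
target_down = [9.0, 7.0, 6.0, 4.0, 2.0]

label_up, r_up = regulation_direction(tf, target_up)
label_down, r_down = regulation_direction(tf, target_down)
```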





□ PCGAN: A Generative Approach for Protein Complex Identification from Protein Interaction Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad473/7235566

PCGAN (Protein Complexes by Generative Adversarial Networks) learns the characteristics of complexes, and generates new complexes. PCGAN trains a generator for generating protein complexes, and a discriminator for distinguishing the generated protein complexes from real ones.

The input data of PCGAN includes a PIN and a gold standard dataset. The adversarial competition between the generator and the discriminator pushes both models to improve until the generated complexes are indistinguishable from the real ones.





□ NetActivity enhances transcriptional signals by combining gene expression into robust gene set activity scores through interpretable autoencoders

>> https://www.biorxiv.org/content/10.1101/2023.07.31.551238v1

NetActivity, a computational framework to define highly representative and interpretable gene set activity scores (GSAS) based on shallow sparsely-connected autoencoders. NetActivity model was trained w/ 1,518 GO biological processes terms and KEGG pathways and all GTEx samples.

NetActivity generates GSAS that are robust to the initialization parameters and representative of the original transcriptome, and assigns higher importance to more biologically relevant genes. NetActivity returns GSAS w/ a more consistent definition and higher interpretability than GSVA.





□ ProjectSVR: Mapping single-cell RNA-seq data to reference atlases by supported vector regression

>> https://www.biorxiv.org/content/10.1101/2023.07.31.551202v1

ProjectSVR, a machine learning-based algorithm for mapping the query cells onto well-constructed reference embeddings using Supported Vector Regression.

ProjectSVR follows a two-step process for reference mapping: (1) Fitting a collection of SVR model ensembles to learn embeddings from feature scores of the reference atlas; (2) Projecting the query cells onto the consistent embeddings of the reference via trained SVR models.





□ BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad062/7233988

BigSeqKit, a parallel toolkit to manipulate FASTA and FASTQ files at scale with speed and scalability at its core.

BigSeqKit takes advantage of IgnisHPC, a computing engine that unifies the development, combination, and execution of high-performance computing (HPC) and Big Data parallel tasks using different languages and programming models.





□ SMAI: Is your data alignable? Principled and interpretable alignability testing and integration of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.08.03.551836v1

SMAI (spectral manifold alignment and inference) provides a statistical test to robustly determine the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. SMAI obtains a symmetric invertible alignment function.

SMAI-align incorporates a high-dimensional shuffled Procrustes analysis, which iteratively searches for the sample correspondence and the best similarity transformation that minimizes the discrepancy between the intrinsic low-dimensional signal structures.





□ demuxmix: Demultiplexing oligonucleotide-barcoded single-cell RNA sequencing data with regression mixture models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad481/7234612

demuxmix’s probabilistic classification framework provides error probabilities for droplet assignments that can be used to discard uncertain droplets and inform about the quality of the HTO data and the success of the demultiplexing process.

demuxmix utilizes the positive association between detected genes in the RNA library and HTO counts to explain parts of the variance in the HTO data resulting in improved droplet assignments.





□ scHumanNet: Construction and analysis of cell-type-specific functional gene network, with SCINET and HumanNetv3

>> https://github.com/netbiolab/scHumanNet

scHumanNet enables the construction of cell-type specific networks from scRNA-seq data. The SCINET framework takes a single cell gene expression profile and the “reference interactome” HumanNet v3, to construct a list of cell-type specific networks.

With the modified version of SCINET source code and the detailed tutorial described below, researchers could take any single-cell RNA sequencing (scRNA-seq) data of any biological context (e.g., disease) and construct their own cell-type specific network for downstream analysis.





□ Lior RT

>> https://twitter.com/alphasignalai/status/1687878483899207680?s=61&t=YtYFeKCMJNEmL5uKc0oPFg

Impressive. MetaGPT is about to reach 10,000 stars on Github.

It's a Multi-Agent Framework that can behave as an engineer, product manager, architect, or project manager.

With a single line of text it can output the entire process of a software company along with carefully orchestrated SOPs:
▸ Data structures
▸ APIs
▸ Documents
▸ User stories
▸ Competitive analysis
▸ Requirements





□ S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure

>> https://www.biorxiv.org/content/10.1101/2023.08.06.552203v1

S-PLM, a 3D structure-aware protein language model developed through multi-view contrastive learning. Unlike the joint-embedding-based methods that rely on both protein structure and sequence for inference, S-PLM encodes the sequence and 3D structure of proteins individually.

S-PLM sequence encoder was fine-tuned based on the pre-trained ESM2 model. S-PLM demonstrates the ability to align sequence and structure embeddings of the same protein effectively while keeping other embeddings from other proteins further apart.
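
The alignment behaviour described above is what a contrastive objective rewards. The sketch below computes an InfoNCE-style loss in the sequence-to-structure direction only, using cosine similarity and a temperature. It is an assumption-laden illustration, not S-PLM's training code: the function names, the temperature value, and the two-dimensional toy embeddings are all hypothetical.

```python
from math import exp, log, sqrt

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def contrastive_loss(seq_embs, struct_embs, temperature=0.1):
    """Mean InfoNCE loss; row i of each list is assumed to be the same protein."""
    n = len(seq_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(seq_embs[i], s) / temperature for s in struct_embs]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + log(sum(exp(l - m) for l in logits))
        # Negative log-probability of picking the matching structure.
        total += log_denom - logits[i]
    return total / n

# Toy embeddings for two proteins.
seq = [[1.0, 0.0], [0.0, 1.0]]
struct_aligned = [[0.9, 0.1], [0.1, 0.9]]
struct_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(seq, struct_aligned)
loss_swapped = contrastive_loss(seq, struct_swapped)
```

The loss is small when each sequence embedding sits closest to its own protein's structure embedding and large when the pairing is swapped, which is the behaviour the training objective pushes toward.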





□ Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad496/7238215

A knowledge-guided instance generation for few-shot BioNER, which generates diverse and novel entities based on similar semantic relations of neighbor nodes.

And by introducing question prompts, we natively formulate BioNER as a QA task, and propose prompt contrastive learning to improve the robustness of the model by measuring the mutual information between query and entity.





□ The Helix Nebula