lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

Stardust

2024-09-07 19:17:37 | Photo

(Created with Midjourney v6.1)




□ Crystal Skies & HALIENE / “Stardust”

Max Richter / “In a Landscape”

2024-09-06 19:23:32 | art music

□ Max Richter / “In a Landscape”

Release Date: 09/06/2024
Label: Decca
Cat.No.: 5882352

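Built around the theme of “reconciling polarities,” the album fuses serene orchestral writing (a string quintet) with transparent electronics. It returns to the dynamism of the “The Blue Notebooks” era, and the jacket art also evokes that classic album.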



□ Max Richter / “In a Landscape: Late and Soon”

Producer, Associated Performer, Synthesizer Programming: Max Richter
Mixer: Rupert Coulson
Mastering Engineer: Cicely Balston
Recording Engineer: Alex Ferguson
Violin: Eloisa-Fleur Thom
Violin: Max Baillie
Viola: Connie Pharoah
Cello: Max Ruisi
Cello: Zara Hudson-Kozdoj


□ Max Richter / “Love Song (After JE)”



□ Max Richter / “Only Silent Words”



Perfection.

2024-09-06 19:22:23 | Photo




Lysis.

2024-08-31 20:08:08 | Science News

(Created with Midjourney v6.1)



□ scCello: Cell-ontology guided transcriptome foundation model https://arxiv.org/abs/2408.12373

scCello (single cell, Cell-ontology guided TFM) learns cell representation by integrating cell type information and cellular ontology relationships into its pre-training framework.

scCello's pre-training framework is structured with three levels of objectives:

Gene level: a masked token prediction loss to learn gene co-expression patterns. Intra-cellular level: an ontology-based cell-type coherence loss to encourage cell representations of the same cell type to aggregate. Inter-cellular level: a relational alignment loss to guide the cell representation learning by consulting the cell-type lineage from the cell ontology graph.





□ scDiffusion: Conditional generation of high-quality single-cell data using diffusion model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae518/7738782

scDiffusion, an in silico scRNA-seq data generation model that combines a latent diffusion model (LDM) w/ a foundation model to generate single-cell gene expression data under given conditions. scDiffusion has 3 parts: an autoencoder, a denoising network, and a condition controller.

scDiffusion employs the pre-trained model SCimilarity as an autoencoder to rectify the raw distribution and reduce the dimensionality of scRNA-seq data, which can make the data amenable to diffusion modeling.

The denoising network was redesigned based on a skip-connected multilayer perceptron (MLP) to learn the reversed diffusion process. scDiffusion uses a new condition control strategy, Gradient Interpolation, to interpolate continuous cell trajectories from discrete cell states.





□ biVI: Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

>> https://www.nature.com/articles/s41592-024-02365-9

biVI combines the variational autoencoder framework of scVI w/ biophysical models describing the transcription and splicing kinetics. Bivariate distributions arising from biVI models can be used in variational autoencoders for principled integration of unspliced and spliced data.

biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.





□ SNOW: Variational inference of single cell time series

>> https://www.biorxiv.org/content/10.1101/2024.08.29.610389v1

SNOW (SiNgle cell flOW map), a deep learning algorithm to deconvolve single cell time series data into time-dependent and time-independent contributions. SNOW enables cell type annotation based on the time-independent dimensions.

SNOW yields a probabilistic model that can be used to discriminate between biological temporal variation and batch effects contaminating individual timepoints, and provides an approach to mitigate batch effects.

SNOW is capable of projecting cells forward and backward in time, yielding time series at the individual cell level. This enables gene expression dynamics to be studied without the need for clustering or pseudobulking, which can be error prone and result in information loss.





□ Cluster Buster: A Machine Learning Algorithm for Genotyping SNPs from Raw Data

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609429v1

Cluster Buster is a system for recovering the genotypes of no-call SNPs on the NeuroBooster array after genotyping with the Illumina GenCall algorithm. It is a genotype-predicting neural network and SNP genotype plotting system.

In the Cluster Buster workflow, SNP metrics files from all available ancestries in GP2 are split into valid GenCall SNPs and no-call SNPs. Valid genotypes are split 80-10-10 for training, validation, and testing of the neural network. The trained neural network is then applied to no-call SNPs.
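The 80-10-10 split described above can be sketched in a few lines. This is only an illustration under assumptions: the record layout, the `split_80_10_10` helper, and the toy SNP IDs are hypothetical and are not taken from the Cluster Buster code.

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle items and split them 80/10/10 into train, validation, and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical records: (snp_id, has_valid_genotype_call)
snps = [(f"snp{i}", i % 5 != 0) for i in range(1000)]
valid = [snp_id for snp_id, ok in snps if ok]        # used to fit and evaluate the model
no_call = [snp_id for snp_id, ok in snps if not ok]  # genotypes to be predicted afterwards

train, val, test = split_80_10_10(valid)
```

Only the SNPs with valid calls enter the split; the no-call SNPs are held aside until a trained model is available.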





□ IVEA: an integrative variational Bayesian inference method for predicting enhancer–gene regulatory interactions

>> https://academic.oup.com/bioinformaticsadvances/article/4/1/vbae118/7737507

IVEA, an integrative variational Bayesian inference of regulatory element activity for predicting enhancer–gene regulatory interactions. Gene expression is modelled by hypothetical promoter/enhancer activities, which reflect the regulatory potential of the promoters/enhancers.

Using transcriptional readouts and functional genomic data of chromatin accessibility, promoter and enhancer activities were estimated through variational Bayesian inference, and the contribution of each enhancer–promoter pair to target gene transcription was calculated.




□ FateNet: an integration of dynamical systems and deep learning for cell fate prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae525/7739702

FateNet, a novel computational model that combines the theory of dynamical systems and deep learning to predict cell fate decision-making using scRNA-seq data. FateNet leverages universal properties of bifurcations such as scaling behavior and normal forms.

FateNet learns to predict and distinguish different bifurcations in pseudotime simulations of a 'universe' of different dynamical systems. The universality of these properties allows FateNet to generalise to high-dimensional gene regulatory network models and biological data.





□ FlowSig: Inferring pattern-driving intercellular flows from single-cell and spatial transcriptomics

>> https://www.nature.com/articles/s41592-024-02380-w

FlowSig, a method that identifies ligand–receptor interactions whose inflows are mediated by intracellular processes and drive subsequent outflow of other intercellular signals.

FlowSig learns a completed partial directed acyclic graph (CPDAG) describing intercellular flows between three types of constructed variables: inflowing signals, intracellular gene modules and outflowing signals.





□ VISTA Uncovers Missing Gene Expression and Spatial-induced Information for Spatial Transcriptomic Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.26.609718v1

VISTA leverages a novel joint probabilistic modeling approach to predict the expression levels of unobserved genes. VISTA jointly models scRNA-seq data and SST data based on variational inference and geometric deep learning, and incorporates uncertainty quantification.

VISTA uses a Multi-Layer Perceptron (MLP) to encode information from the expression domain and a GNN to encode information from the spatial domain. VISTA facilitates RNA velocity analysis and signaling direction inference by imputing dynamic properties of genes.





□ GNNRAI: An explainable graph neural network approach for integrating multi-omics data with prior knowledge to identify biomarkers from interacting biological domains.

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609465v1

GNNRAI (GNN-derived representation alignment and integration) uses graphs to model relationships among modality features (for example, genes in transcriptomics and proteins in proteomics data). This enables us to encode prior biological knowledge as graph topology.

Integrated Hessians was applied to this transformer model to derive interaction scores between its input tokens. The biodomains partition gene functions into distinct molecular endophenotypes.





□ SCellBOW: Pseudo-grading of tumor subpopulations from single-cell transcriptomic data using Phenotype Algebra

>> https://elifesciences.org/reviewed-preprints/98469v1

SCellBOW, a Doc2vec-inspired transfer learning framework for single-cell representation learning, clustering, visualization, and relative risk stratification of malignant cell types within a tumor. SCellBOW intuitively treats cells as documents and genes as words.

SCellBOW learned latent representations capture the semantic meanings of cells based on their gene expression levels. Due to this, cell type or condition-specific expression patterns get adequately captured in cell embeddings.

SCellBOW can replicate this feature in the single-cell phenotype space to introduce phenotype algebra. The query vector was subtracted from the reference vector to calculate the predicted risk score using a bootstrapped random survival forest.





□ QDGP: Disease Gene Prioritization With Quantum Walks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae513/7738783

By encoding self-loops for the seed nodes into the underlying Hamiltonian, the quantum walker was shown to remain more local to the seed nodes, leading to improved performance.

QDGP is a novel method centered around quantum walks on the interactome. Continuous-time quantum walks are the quantum analogues of continuous-time classical random walks, which describe the propagation of a particle over a graph.





□ Chronospaces: An R package for the statistical exploration of divergence times promotes the assessment of methodological sensitivity

>> https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.14404

Chronospaces are low-dimensional graphical representations. They provide novel ways of visualizing, quantifying and exploring the sensitivity of divergence time estimates, contributing to the inference of more robust evolutionary timescales.

By representing chronograms as collections of node ages, standard multivariate statistical approaches can be readily employed on populations of Bayesian posterior timetrees.





□ Normalization of Single-cell RNA-seq Data Using Partial Least Squares with Adaptive Fuzzy Weight

>> https://www.biorxiv.org/content/10.1101/2024.08.18.608507v1

The present approach overcomes biases due to library size, dropout, RNA composition, and other technical factors and is motivated by two different methods: pooling normalization, and scKWARN, which does not rely on specific count-depth relationships.

A partial least squares (PLS) regression was performed to accommodate the variability of gene expression in each condition, and upper and lower quantiles with adaptive fuzzy weights were utilized to correct unwanted biases in scRNA-seq data.





□ Modeling relaxation experiments with a mechanistic model of gene expression

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05816-4

They recently proposed a piecewise deterministic Markov process (PDMP) version of the 2-state model which rigorously approximates the original molecular model.

A moment-based method has been proposed for estimating parameter values from an experimental distribution assumed to arise from the functioning of a 2-state model. They recall the mathematical description of the model through the piecewise deterministic Markov process formalism.
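As a rough illustration of what a PDMP version of the 2-state model looks like, the sketch below lets the promoter switch on and off at random while the mRNA level follows an ODE between switches. It is a one-variable simplification written for this note, not the paper's model: the parameter names (`k_on`, `k_off`, `s`, `d`) and the function name are assumptions.

```python
import math
import random

def simulate_two_state_pdmp(k_on, k_off, s, d, t_end, seed=0):
    """Simulate stochastic promoter switching with deterministic mRNA dynamics.

    Between promoter jumps, m follows dm/dt = s * E - d * m (solved in closed form),
    where E is the promoter state (0 = off, 1 = on).
    Returns a list of (time, promoter_state, mRNA_level) at each jump and at t_end.
    """
    rng = random.Random(seed)
    t, state, m = 0.0, 0, 0.0
    path = [(t, state, m)]
    while True:
        rate = k_on if state == 0 else k_off
        wait = rng.expovariate(rate)      # exponential waiting time to the next jump
        dt = min(wait, t_end - t)
        target = state * s / d            # fixed point of the ODE in the current state
        m = target + (m - target) * math.exp(-d * dt)
        if wait >= t_end - t:
            path.append((t_end, state, m))
            return path
        t += wait
        state = 1 - state                 # promoter flips on <-> off
        path.append((t, state, m))

path = simulate_two_state_pdmp(k_on=0.5, k_off=1.0, s=10.0, d=1.0, t_end=50.0)
```

For this simplified process the stationary mean of m is (s/d) * k_on / (k_on + k_off), which is the kind of first moment a moment-based estimator would match against data.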





□ UnigeneFinder: An automated pipeline for gene calling from transcriptome assemblies without a reference genome

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608648v1

UnigeneFinder converts the raw output of de novo transcriptome assembly software such as Trinity into a set of predicted primary transcripts, coding sequences, and proteins, similar to the gene sequence data commonly available for high-quality reference genomes.

UnigeneFinder achieves better precision and higher F-scores than the individual clustering tools it combines. It fully automates the generation of primary sequences for transcripts, coding regions, and proteins, making it suitable for diverse types of downstream analyses.





□ Approaches to dimensionality reduction for ultra-high dimensional models

>> https://www.biorxiv.org/content/10.1101/2024.08.20.608783v1

The study considers three approaches: the mechanistic approach (SNP tagging) and two approaches that take biological and statistical contexts into account by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA).

MD-SRA (Multi-Dimensional Supervised Rank Aggregation) provides a very good balance between classification quality, computational intensity, and required hardware resources.

SNP selection-based 1D-SRA approach integrates both biological and statistical contexts by assessing the importance of SNPs for the classification by fitting a multiclass logistic regression model and thus adding the biological component to the feature selection process.





□ The Lomb-Scargle periodogram-based differentially expressed gene detection along pseudotime

>> https://www.biorxiv.org/content/10.1101/2024.08.20.608497v1

The Lomb-Scargle periodogram can transform time-series data with non-uniform sampling points into frequency-domain data. This approach involves transforming pseudotime domain data from scRNA-seq and trajectory inference into frequency-domain data using LS.

By transforming complex structured trajectories into the frequency domain, these trajectories can be reduced to a vector-to-vector comparison problem. This versatile method is capable of analyzing any inferred trajectory, including tree structures with multiple branching points.
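To make the transformation concrete, here is a minimal sketch of the classic Lomb-Scargle periodogram for unevenly sampled data, written with the standard library only. The sample times stand in for pseudotime values and the signal stands in for one gene's expression; the toy data, the frequency grid, and the function name are assumptions, and the paper's differential-expression test is not reproduced here.

```python
import math
import random

def lomb_scargle(t, y, omegas):
    """Classic (unnormalized) Lomb-Scargle power at each angular frequency."""
    mean_y = sum(y) / len(y)
    yc = [v - mean_y for v in y]
    powers = []
    for w in omegas:
        # The time offset tau makes the sine and cosine terms orthogonal.
        s2 = sum(math.sin(2 * w * ti) for ti in t)
        c2 = sum(math.cos(2 * w * ti) for ti in t)
        tau = math.atan2(s2, c2) / (2 * w)
        c = [math.cos(w * (ti - tau)) for ti in t]
        s = [math.sin(w * (ti - tau)) for ti in t]
        yc_c = sum(a * b for a, b in zip(yc, c))
        yc_s = sum(a * b for a, b in zip(yc, s))
        cc = sum(v * v for v in c)
        ss = sum(v * v for v in s)
        powers.append(0.5 * (yc_c ** 2 / cc + yc_s ** 2 / ss))
    return powers

# Toy data: 200 cells at non-uniform "pseudotime" points, one oscillating gene.
rng = random.Random(0)
times = sorted(rng.uniform(0.0, 10.0) for _ in range(200))
true_omega = 2 * math.pi * 0.5
expr = [math.sin(true_omega * ti) + 0.1 * rng.gauss(0.0, 1.0) for ti in times]

omegas = [0.1 * k for k in range(1, 100)]
power = lomb_scargle(times, expr, omegas)
best_omega = omegas[power.index(max(power))]
```

Each gene then becomes a fixed-length power vector over the same frequency grid, which is what allows trajectories to be compared vector to vector.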





□ SMeta: a binning tool using single-cell sequences to aid reconstructing metagenome species accurately

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609542v1

SMeta (Segment Tree Based Metagenome Binning Algorithm) takes FASTA files of metagenomic and single-cell sequencing data as input and outputs the binning result for each metagenomic sequence.

Tetranucleotide frequency is the frequency of each pattern of 4 consecutive bases in a DNA sequence. Tetranucleotides taken from a sliding window over a sequence are counted into 136 classes and treated as a vector.
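The number 136 is what remains of the 256 possible tetranucleotides once each 4-mer is merged with its reverse complement (16 of them are their own reverse complement, so (256 - 16) / 2 + 16 = 136). A minimal sketch of such a vector follows; whether SMeta builds its classes exactly this way is an assumption, and the function names are made up for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

# The 256 possible tetranucleotides collapse to 136 canonical classes.
CLASSES = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})
CLASS_INDEX = {kmer: i for i, kmer in enumerate(CLASSES)}

def tetranucleotide_vector(seq):
    """Count canonical 4-mers with a sliding window and return relative frequencies."""
    counts = [0] * len(CLASSES)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):      # skip windows containing N or other symbols
            counts[CLASS_INDEX[canonical(kmer)]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

vector = tetranucleotide_vector("ACGTACGTTAGCCGATAGGCTA")
```

Because of the reverse-complement merge, a sequence and its opposite strand give the same vector, which is convenient when the strand of a contig is unknown.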





□ DIAMOND2GO: A rapid Gene Ontology assignment and enrichment tool for functional genomics

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608700v1

DIAMOND2GO (D2GO) is a new toolset to rapidly assign Gene Ontology (GO) terms to genes or proteins based on sequence similarity searches. D2GO uses DIAMOND for alignment, which is 100-20,000x faster than BLAST.

D2GO leverages GO-terms already assigned to sequences in the NCBI non-redundant database to achieve rapid GO-term assignment on large sets of query sequences.





□ GCphase: an SNP phasing method using a graph partition and error correction algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05901-8

GCphase utilizes the minimum cut algorithm to perform phasing. First, based on alignment between long reads and the reference genome, GCphase filters out ambiguous SNP sites and useless read information.

GCphase constructs a graph in which a vertex represents alleles of an SNP locus and each edge represents the presence of read support; moreover, GCphase adopts a graph minimum-cut algorithm to phase the SNPs.

GCphase uses two error correction steps to refine the phasing results obtained from the previous step, effectively reducing the error rate. Finally, GCphase obtains the phase block.





□ Benchmarking DNA Foundation Models for Genomic Sequence Classification

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608288v1

A benchmarking study of three recent DNA foundation language models, including DNABERT-2, Nucleotide Transformer version-2 (NT-v2), and HyenaDNA, focusing on the quality of their zero-shot embeddings across a diverse range of genomic tasks and species.

DNABERT-2 exhibits the most consistent performance across human genome-related tasks, while NT-v2 excels in epigenetic modification detection. HyenaDNA stands out for its exceptional runtime scalability and ability to handle long input sequences.





□ cytoKernel: Robust kernel embeddings for assessing differential expression of single cell data

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608287v1

cytoKernel, a methodology for generating robust kernel embeddings via a Hilbert Space approach, designed to identify differential patterns between groups of distributions, especially effective in scenarios where mean changes are not evident.

cytoKernel diverges from traditional methods by conceptualizing the cell type-specific gene expression of each subject as a probability distribution, rather than as a mere aggregation of single-cell data into pseudo-bulk measures.





□ Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03363-y

Melon, a new DNA-to-marker taxonomic profiler that capitalizes on the unique attributes of long-read sequences. Melon is able to estimate total prokaryotic genome copies and provide species-level taxonomic abundance profiles in a fast and precise manner.

Melon first extracts reads that cover at least one marker gene using a protein database, and then profiles the taxonomy of these marker-containing reads using a separate, nucleotide database.





□ FindingNemo: A Toolkit for DNA Extraction, Library Preparation and Purification for Ultra Long Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608306v1

The FindingNemo protocol for the generation of high occupancy ultra-long reads on nanopore platforms. This protocol can generate equivalent or more throughput to disc-based methods and may have additional advantages in tissues and non-human cell material.

The FindingNemo protocol can also be tuned to enable extraction from as few as one million human cell equivalents or 5 µg of human ultra-high molecular weight (UHMW) DNA as input, and enables extraction to sequencing in one working day.





□ AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based Optimization

>> https://arxiv.org/abs/2312.14027

AdamMCMC combines the well established Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based optimization using Adam and leverages a prolate proposal distribution, to efficiently draw from the posterior.

The constructed chain admits the Gibbs posterior as an invariant distribution and converges to this Gibbs posterior in total variation distance.
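For orientation, the sketch below shows plain MALA only: a Langevin proposal followed by a Metropolis-Hastings correction, on a one-dimensional standard normal target. It deliberately leaves out what AdamMCMC adds (the Adam momentum update and the prolate proposal), and the function name, step size, and target are assumptions chosen for the example.

```python
import math
import random

def mala(log_p, grad_log_p, x0, step, n_samples, seed=0):
    """Metropolis Adjusted Langevin Algorithm for a one-dimensional target."""
    rng = random.Random(seed)
    x = x0
    samples = []
    accepted = 0
    for _ in range(n_samples):
        # Langevin proposal: drift along the gradient plus Gaussian noise.
        mean_fwd = x + 0.5 * step ** 2 * grad_log_p(x)
        prop = mean_fwd + step * rng.gauss(0.0, 1.0)
        mean_bwd = prop + 0.5 * step ** 2 * grad_log_p(prop)
        # log q(x | prop) - log q(prop | x) for the Gaussian proposal density
        log_q_ratio = ((prop - mean_fwd) ** 2 - (x - mean_bwd) ** 2) / (2 * step ** 2)
        log_alpha = log_p(prop) - log_p(x) + log_q_ratio
        # Metropolis-Hastings correction keeps the target distribution invariant.
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = prop
            accepted += 1
        samples.append(x)
    return samples, accepted / n_samples

# Target: standard normal, log p(x) = -x^2 / 2 up to a constant.
samples, acceptance_rate = mala(
    log_p=lambda x: -0.5 * x * x,
    grad_log_p=lambda x: -x,
    x0=0.0, step=1.0, n_samples=20000,
)
kept = samples[1000:]                     # drop a short burn-in
mean = sum(kept) / len(kept)
variance = sum((v - mean) ** 2 for v in kept) / len(kept)
```

The accept/reject step is what turns the discretized Langevin dynamics into a chain with the target as its invariant distribution; a momentum-based proposal has to preserve that correction in the same way.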





□ Bioinformatics Copilot 2.0 for Transcriptomic Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.15.607673v1

Bioinformatics Copilot 2.0 introduces several new functionalities and an improved user interface compared to its predecessor. A key enhancement is the integration of a module that allows access to an internal server, enabling users to log in and directly access server files.

Bioinformatics Copilot 2.0 broadens the spectrum of figure types that users can generate, including heatmaps, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps, and dimension plots.





□ DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608331v1

DeepSomatic, a short-read and long-read somatic small variant caller, adapted from DeepVariant. DeepSomatic is developed by heavily modifying DeepVariant, in particular by altering the pileup images to contain both tumor and normal aligned reads.

DeepSomatic takes the tensor-like representation of each candidate and evaluates it with the convolutional neural network to classify whether the candidate is a reference or sequencing error, a germline variant, or a somatic variant.





□ Sawfish: Improving long-read structural variant discovery and genotyping with local haplotype modeling

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608674v1

Sawfish is capable of calling and genotyping deletions, insertions, duplications, translocations and inversions from mapped high-accuracy long reads.

The method is designed to discover breakpoint evidence from each sample, then merge and genotype variant calls across samples in a subsequent joint-genotyping step, using a process that emphasizes representation of each SV's local haplotype sequence to improve accuracy.

In a joint-genotyping context, sawfish calls many more concordant SVs than other callers, while providing a higher enrichment for concordance among all calls.





□ VAIV bio-discovery service using transformer model and retrieval augmented generation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05903-6

VAIV Bio-Discovery, a novel biomedical neural search service which supports enhanced knowledge discovery and document search on unstructured text such as PubMed. It mainly handles information related to chemical compounds/drugs, genes/proteins, diseases, and their interactions.

The VAIV Bio-Discovery system offers four search options: basic search, entity search, interaction search, and natural language search.

VAIV Bio-Discovery employs T5slim_dec, which adapts the autoregressive generation task of the T5 (text-to-text transfer transformer) to the interaction extraction task by removing the self-attention layer in the decoder block.

VAIV assists in interpreting research findings by summarizing the retrieved search results for a given natural language query with Retrieval Augmented Generation. The search engine is built with a hybrid method that combines neural search with the probabilistic search, BM25.
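The BM25 half of such a hybrid engine is easy to sketch. The code below scores documents with a common BM25 variant (non-negative IDF, k1 = 1.5, b = 0.75) and then blends the result with dense-retrieval scores by a weighted sum. The blending rule, the `alpha` weight, the toy documents, and the placeholder dense scores are all assumptions; the source does not say how VAIV combines the two.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 on whitespace tokens."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

def hybrid_scores(lexical, dense, alpha=0.5):
    """Weighted sum of max-normalized lexical scores and dense similarity scores."""
    top = max(lexical) or 1.0
    return [alpha * (lex / top) + (1 - alpha) * den for lex, den in zip(lexical, dense)]

docs = [
    "BRCA1 interacts with BARD1 in DNA repair",
    "aspirin inhibits cyclooxygenase",
    "TP53 mutations in cancer",
]
lexical = bm25_scores("aspirin cyclooxygenase", docs)
dense = [0.10, 0.80, 0.20]                # placeholder similarities from a neural encoder
combined = hybrid_scores(lexical, dense)
best_doc = combined.index(max(combined))
```

The lexical score rewards exact term matches such as drug or gene names, while the dense score can still rank a document that paraphrases the query.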





□ Denoiseit: denoising gene expression data using rank based isolation trees

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05899-z

DenoiseIt aims to remove potential outlier genes, yielding a robust gene set with reduced noise. The gene set constructed by DenoiseIt is expected to capture biologically significant genes while pruning irrelevant ones to the greatest extent possible.

DenoiseIt processes the gene expression data and decomposes it into basis and loading matrices using NMF. In the second step, each rank feature from the decomposed result is used to generate isolation trees to compute its outlier score.





□ COATI-LDM: Latent Diffusion For Conditional Generation of Molecules

>> https://www.biorxiv.org/lookup/content/short/2024.08.22.609169v1

COATI-LDM, a novel application of latent diffusion models to the conditional generation of property-optimized, drug-like small molecules. Latent diffusion for molecule generation allows models trained on scarce or non-overlapping datasets to condition generations on a large data manifold.

Partial diffusion allows one to start with a given molecule and perform a partial diffusion propagation to obtain conditioned samples in chemical space. COATI-LDM relies on a large-scale pre-trained encoder-decoder that maps chemical space to a fixed-length latent vector.





□ Smccnet 2.0: a comprehensive tool for multi-omics network inference with shiny visualization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05900-9

SmCCNet (Sparse multiple Canonical Correlation Network Analysis) is a framework designed for integrating one or multiple types of omics data with a quantitative or binary phenotype.

It’s based on the concept of sparse multiple canonical correlation analysis (SmCCA) and sparse partial least squares discriminant analysis (SPLSDA) and aims to find relationships between omics data and a specific phenotype.

SmCCNet uses LASSO for sparsity constraints to identify significant features w/in the data. It has two modes: weighted and unweighted. In the weighted mode, it uses different scaling factors for each data type, while in the unweighted mode, all scaling factors are equal.

Ankylosis.

2024-08-31 20:07:08 | Science News

(Created with Midjourney v6.1)




□ Dynaformer: From Static to Dynamic Structures: Improving Binding Affinity Prediction with Graph-Based Deep Learning

>> https://onlinelibrary.wiley.com/doi/10.1002/advs.202405404

Dynaformer, a graph transformer framework to predict the binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories.

Dynaformer utilizes a roto-translation invariant feature encoding scheme, taking various interaction characteristics into account, including interatomic distances, angles between bonds, and various types of covalent or non-covalent interactions.






□ OmniBioTE: Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

>> https://arxiv.org/abs/2408.16245

OmniBioTE is a large-scale multimodal biosequence transformer model that is designed to capture the complex relationships in biological sequences such as DNA, RNA, and proteins. OmniBioTE pushes the boundaries by jointly modeling nucleotide and peptide sequence.

Multi-omic biosequence transformers emergently learn useful structural information without any prior structural training. OmniBioTE excels in predicting peptide-nucleotide interactions, specifically the Gibbs free energy changes (ΔG) and the effects of mutations (ΔΔG).





□ TIANA: transcription factors cooperativity inference analysis with neural attention

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05852-0

TIANA (Transcription factors cooperativity Inference Analysis with Neural Attention), an MHA-based framework to infer combinatorial TF cooperativities from epigenomic data.

TIANA uses known motif weights to initialize convolution filters to ease the interpretation challenge, allowing convolution filter activations to be directly associated with known TF motifs.

TIANA uses integrated gradients to interpret the TF interdependencies from the attention units. We tested TIANA’s ability to recover TF co-binding pair motifs from ChIP-seq data, demonstrating that TIANA could identify key co-occurring TF motif pairs.





□ Amethyst: Single-cell DNA methylation analysis tool Amethyst reveals distinct noncanonical methylation patterns in human glial cells

>> https://www.biorxiv.org/content/10.1101/2024.08.13.607670v1

Amethyst is capable of efficiently processing data from hundreds of thousands of high-coverage cells in a relatively short time frame by performing initial computationally-intensive steps on a cluster followed by rapid local interaction with the output in RStudio.

By default, Amethyst calculates fast truncated singular values with the implicitly restarted Lanczos bidiagonalization algorithm (IRLBA). Amethyst provides a helper function for estimating how many dimensions are needed to achieve the desired amount of variance explained.
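The idea behind such a helper can be sketched as follows: the variance captured by each dimension is proportional to its squared singular value, so one accumulates those until the target fraction is reached. This is not Amethyst's function (Amethyst is an R package and its helper is not shown in the source); the name `dims_for_variance` and the example values are hypothetical.

```python
def dims_for_variance(singular_values, threshold=0.9):
    """Smallest number of leading dimensions whose squared singular values
    reach `threshold` of the total over the supplied values."""
    squared = [s * s for s in singular_values]
    total = sum(squared)
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(squared)

singular_values = [10.0, 5.0, 2.0, 1.0]   # toy values, sorted in decreasing order
n_dims = dims_for_variance(singular_values, threshold=0.9)
```

One caveat with a truncated decomposition: the total here only covers the singular values that were actually computed, so the fraction is relative to the retained components rather than to the full matrix.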





□ GITIII: Investigation of pair-wise single-cell interactions by statistically interpreting spatial cell state correlation learned by self-supervised graph inductive bias transformer

>> https://www.biorxiv.org/content/10.1101/2024.08.21.608964v1

GITIII (Graph Inductive Transformer for Intercellular Interaction Investigation), an interpretable self-supervised graph transformer-based language model that treats cells as words (nodes) and their cell neighborhood as a sentence to explore the communications among cells.

Enhanced by multilayer perceptron-based distance scaler, physics-informed attention, and graph transformer model, GITIII infers CCI by investigating how the state of a cell is influenced by the spatial organization, ligand expression, cell types and states of neighboring cells.

GITIII employs the Graph Inductive Bias Transformer (GRIT) model which encodes input tensors in a language model manner. It effectively encodes both the graph structure and expression profiles within cellular neighborhoods.





□ LineageVAE: Reconstructing Historical Cell States and Transcriptomes toward Unobserved Progenitors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae520/7738923

LineageVAE is a deep generative model that transforms scRNA-seq observations with identical lineage barcodes into sequential trajectories toward a common progenitor in a latent cell state space.

LineageVAE depicts sequential cell state transitions from simple snapshots and infers cell states over time. It generates transcriptomes at each time point using a decoder. LineageVAE utilizes the property that the progenitors of cells introduced with a shared barcode are identical.

LineageVAE can reconstruct the historical cell states and their expression profiles from the observed time point toward these progenitor cells under the constraint that the cell state of each lineage converges to the progenitor state.





□ tombRaider: improved species and haplotype recovery from metabarcoding data through artefact and pseudogene exclusion.

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609468v1

tombRaider, an open-source software package for improved species and haplotype recovery from metabarcoding data through accurate artefact and pseudogene exclusion.

tombRaider features a modular algorithm capable of evaluating multiple criteria, including sequence similarity, co-occurrence patterns, taxonomic assignment, and the presence of stop codons.
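Of the criteria listed, the stop-codon check is the simplest to illustrate: a protein-coding amplicon should have at least one reading frame without a stop codon. The sketch below checks the three forward frames only; the default stop-codon set is the standard genetic code, which is an assumption (mitochondrial markers use different tables), and the function names are not tombRaider's.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})  # standard genetic code (assumption)

def open_frames(seq, stop_codons=STANDARD_STOPS):
    """Return the forward reading frames (0, 1, 2) that contain no stop codon."""
    seq = seq.upper()
    frames = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in stop_codons for codon in codons):
            frames.append(frame)
    return frames

def looks_like_pseudogene(seq, stop_codons=STANDARD_STOPS):
    """Flag a sequence when every forward frame is interrupted by a stop codon."""
    return not open_frames(seq, stop_codons)

coding_like = "ATGGCCAAAGGG"
interrupted = "TAACTAGCTGA"               # one stop codon in each forward frame
flags = [looks_like_pseudogene(coding_like), looks_like_pseudogene(interrupted)]
```

Passing a different `stop_codons` set changes the verdict for the same sequence, which is why the translation table has to match the marker being analysed.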





□ PICASO: Profiling Integrative Communities of Aggregated Single-cell Omics data

>> https://www.biorxiv.org/content/10.1101/2024.08.28.610120v1

PICASO creates biomedical networks to identify explainable disease-associated gene communities and potential drug targets by using gene-regulatory network modeling on biomedical network representations.

The PICASO architecture can be used to embed single-cell transcriptomics data within a plenitude of available biomedical databases such as OpenTargets, Omnipath, GeneOntology, KEGG, STRING, Reactome and UniProt, and extract condition-specific communities and associations.

The full PICASO network consists of 111032 nodes and 1617389 edges collected from the above 7 disparate resources. PICASO provides an implementation for calculating node and edge scores within the network by the MeanNetworkScorer.





□ LoRNASH: A long context RNA foundation model for predicting transcriptome architecture

>> https://www.biorxiv.org/content/10.1101/2024.08.26.609813v1

LoRNASH, the long-read RNA model with StripedHyena, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture-the relative abundances and molecular structures of mRNA isoforms.

LoRNASH uses causal language modeling and an expanded RNA token set. LoRNASH handles extremely long sequence inputs (~65 kilobase pairs), allowing for zero-shot prediction of all aspects of transcriptome architecture, including isoform structure and the impact of DNA sequence variants.





□ pyVIPER: A fast and scalable Python package for rank-based enrichment analysis of single-cell RNASeq data

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609585v1

pyVIPER, a fast, memory-efficient, and highly scalable Python-based VIPER implementation. The pyVIPER package leverages AnnData objects and is seamlessly integrated with standard single cell analysis packages, such as Scanpy and others from the scverse ecosystem.

pyVIPER can directly interface with scikit-learn and TensorFlow to allow plug-and-play ML analyses that leverage VIPER-assessed protein activity profiles. pyVIPER scales more efficiently with the number of cells, enabling the analysis of 4x as many cells with the same memory allocation.





□ A Bioinformatician, Computer Scientist, and Geneticist lead bioinformatic tool development - which one is better?

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609622v1

Medical Informatics is identified as the top-performing group in developing accurate bioinformatic software tools. The tools include a number of methods for structural variation detection, single-cell profiling, long-read assembly, and multiple sequence alignment.

Bioinformatics and Engineering ranked lower in terms of software accuracy. Tools developed by authors affiliated with "Bioinformatics" typically had slightly lower accuracy than those from other fields. However, this was not a statistically significant finding.





□ TRACS: Enhanced metagenomics-enabled transmission inference

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608527v1

TRACS (TRAnsmission Clustering of Strains), a highly accurate and easy-to-use algorithm for establishing whether two samples are plausibly related by a recent transmission event.

The TRACS algorithm distinguishes the transmission of closely related strains by identifying genetic differences as small as a few single nucleotide polymorphisms (SNPs), which is crucial when considering slow-evolving pathogens.

TRACS was designed to estimate a lower bound of the SNP distance and can incorporate sampling date information. TRACS controls for major sources of error including variable sequencing coverage, within-species recombination and sequencing errors.





□ Pandagma: A tool for identifying pan-gene sets and gene families at desired evolutionary depths and accommodating whole genome duplications

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae526/7740678

Pandagma provides methods for efficiently and sensitively identifying pangene and gene family sets for annotation sets from eukaryotic genomes, with methods for handling polyploidy and for targeting family construction at specified taxonomic depths.

Pandagma is a set of configurable workflows for identifying and comparing pan-gene sets and gene families for annotation sets from eukaryotic genomes, using a combination of homology, synteny, and expected rates of synonymous change in coding sequence.





□ diffGEK: Differential Gene Expression Kinetics

>> https://www.biorxiv.org/content/10.1101/2024.08.21.608952v1

diffGEK assumes that rates can vary over a trajectory, but are smooth functions of the differentiation process. diffGEK initially estimates per-cell and per-gene kinetic parameters using known lineage and pseudo-temporal ordering of cells for a specific condition.

diffGEK integrates a statistical strategy to discern whether a gene exhibits differential kinetics between any two biological conditions, across all possible permutations.





□ GTAM: A Molecular Pretraining Model with Geometric Triangle Awareness

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae524/7739699


Geometric Triangle Awareness Model (GTAM). GTAM aims to maximize the mutual information using contrastive self-supervised learning (SSL) and generative SSL. GTAM uses diffusion generative models for generative SSL which can lead to a more accurate estimation in generative SSL.

GTAM employs the new molecular encoders that incorporate a novel geometric triangle awareness mechanism to enhance edge-to-edge updates in molecular representation learning, in addition to node-to-edge and edge-to-node updates, unlike other molecular graph encoders.





□ sparsesurv: A Python package for fitting sparse survival models via knowledge distillation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae521/7739697

sparsesurv, a Python package that contains a set of teacher-student model pairs, including the semi-parametric accelerated failure time and the extended hazards models as teachers.

sparsesurv also contains in-house survival function estimators, removing the need for external packages. sparsesurv is validated against R-based Elastic Net regularized linear Cox proportional hazards models, based on kernel-smoothing the profile likelihood.





□ GOLDBAR: A Framework for Combinatorial Biological Design

>> https://pubs.acs.org/doi/10.1021/acssynbio.4c00296

GOLDBAR, a combinatorial design framework. GOLDBAR enables synthetic biologists to intersect and merge the rules for entire classes of biological designs to extract common design motifs and infer new ones.

GOLDBAR can refine/validate design spaces for TetR-homologue transcriptional logic circuits, verify the assembly of a partial nif gene cluster, and infer novel gene clusters for the biosynthesis of rebeccamycin.





□ Model-X knockoffs: Transcriptome data are insufficient to control false discoveries in regulatory network inference

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(24)00205-9

This approach centers on a recent innovation in high-dimensional statistics: model-X knockoffs. Model-X knockoffs were originally intended to be applied to individual regression problems, not network inference.

Model-X knockoffs builds a network by regressing each gene on all other genes. If done naively, this process requires time proportional to the fourth power of the number of genes. Model-X uses Gaussian knockoffs with covariance equal to the sample covariance matrix.
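
The Gaussian knockoff construction mentioned above can be sketched roughly as follows. This is a minimal illustration with NumPy, not the paper's implementation: the function name `gaussian_knockoffs` is hypothetical, it assumes more samples than genes so that the sample covariance is invertible, and it uses a deliberately conservative equal-weight choice s = λ_min(Σ) so that the conditional covariance stays positive definite.

```python
import numpy as np

def gaussian_knockoffs(X, seed=0):
    """Sample Gaussian model-X knockoffs for X (samples x genes).

    Assumption: rows of X are treated as draws from N(mean, Sigma), with Sigma
    taken to be the sample covariance matrix (requires more samples than genes).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mean = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)            # sample covariance, p x p
    s = np.linalg.eigvalsh(sigma).min()        # conservative: any s <= 2*lambda_min is valid
    D = s * np.eye(p)
    sigma_inv_D = np.linalg.solve(sigma, D)    # Sigma^{-1} D
    cond_mean = X - (X - mean) @ sigma_inv_D   # E[knockoff | X]
    cond_cov = 2 * D - D @ sigma_inv_D         # Cov[knockoff | X]
    L = np.linalg.cholesky(cond_cov)
    rng = np.random.default_rng(seed)
    return cond_mean + rng.standard_normal((n, p)) @ L.T

# Toy usage with simulated expression values (hypothetical data).
toy = np.random.default_rng(1).standard_normal((200, 3))
knockoffs = gaussian_knockoffs(toy)
print(knockoffs.shape)
```

Each knockoff column is then used as a negative control for its original gene in the regression; that selection step is omitted here.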





□ Seqrutinator: scrutiny of large protein superfamily sequence datasets for the identification and elimination of non-functional homologues

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03371-y

Seqrutinator is an objective, flexible pipeline that removes sequences with sequencing and/or gene model errors and sequences from pseudogenes from complex, eukaryotic protein superfamilies.

Seqrutinator removes Non-Functional Homologues (NFHs) rather than FHs. Pseudogenes have no functional constraint and an elevated evolutionary rate by which they stand out in phylogenies.





□ SQANTI-reads: a tool for the quality assessment of long read data in multi-sample lrRNA-seq experiments.

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609463v1

SQANTI-reads leverages SQANTI3, a tool for the analysis of the quality of transcript models, to develop a quality control protocol for replicated long-read RNA-seq experiments.

The number/distribution of reads, as well as the number/distribution of unique junction chains (transcript splicing patterns), in SQANTI3 structural categories are compiled. Multi-sample visualizations of QC metrics can also be separated by experimental design factors.





□ IL-AD: Adapting nanopore sequencing basecalling models for modification detection via incremental learning and anomaly detection

>> https://www.nature.com/articles/s41467-024-51639-5

IL-AD leverages machine learning approaches to adapt nanopore sequencing basecallers for nucleotide modification detection. It applies incremental learning to improve the basecalling of modification-rich sequences, which are usually of high biological interest.

With sequence backbones resolved, IL-AD further runs anomaly detection on individual nucleotides to determine their modification status. By this means, IL-AD promises the single-molecule, single-nucleotide and sequence context-free detection of modifications.





□ grenedalf: Population genetic statistics for the next generation of Pool sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae508/7741639

grenedalf, a command line tool to compute widely-used population genetic statistics for Pool-seq data. It aims to solve the shortcomings of previous implementations, and is several orders of magnitude faster, scaling to thousands of samples.

The core implementation of the command line tool grenedalf is part of GENESIS, the high-performance software library for working with phylogenetic and population genetic data.





□ Eliater: A Python package for estimating outcomes of perturbations in biomolecular networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae527/7742268

Eliater checks the mutual consistency of the network structure and observational data with conditional independence tests, and checks whether the query is estimable from the available observational data.

Eliater detects and removes nuisance variables unnecessary for causal query estimation, generates a simpler network, and identifies the most efficient estimator of the causal query. Eliater returns an estimated quantitative effect of the perturbation.





□ funkea: Functional Enrichment Analysis in Python

>> https://www.biorxiv.org/content/10.1101/2024.08.24.609502v1

funkea, a Python package containing popular functional enrichment methods, leveraging Spark for effectively infinite scale. All methods have been unified into a single interface, giving users the ability to easily plug-and-play different enrichment approaches.

The variant selection and locus definitions are composed by the user, but each of the enrichment methods provided by funkea provides default configurations. The user can also define their own annotation component, which is required for all enrichment methods.





□ ARGV: 3D genome structure exploration using augmented reality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05882-8

ARGV, an augmented reality 3D Genome Viewer. ARGV contains more than 350 pre-computed and annotated genome structures inferred from Hi-C and imaging data. It offers interactive and collaborative visualization of genomes in 3D space, using standard mobile phones or tablets.

ARGV allows users to overlay multiple annotation tracks onto a 3D chromosome model. ARGV is equipped with a database currently containing 343 whole-genome, high-resolution 3D models and annotations inferred from Hi-C and omics data, as well as several imaging-based structures.





□ NERD-seq: a novel approach of Nanopore direct RNA sequencing that expands representation of non-coding RNAs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03375-8

NERD-seq expands the ncRNA representation in Nanopore direct RNA-seq to include multiple additional classes of ncRNAs genome-wide, while maintaining at the same time the ability to sequence high library complexity mRNA transcriptomes.

NERD-seq enables the generation of reads with higher coverage for the non-coding genome, while still detecting mRNAs and poly(A) ncRNAs. NERD-seq allows the successful detection of snoRNAs, snRNAs, scRNAs, srpRNAs, tRNAs, and other ncRNAs.





□ OrthoBrowser: Gene Family Analysis and Visualization

>> https://www.biorxiv.org/content/10.1101/2024.08.27.609986v1

OrthoBrowser, a static site generator that will index and serve phylogeny, gene trees, multiple sequence alignments, and novel multiple synteny alignments. This greatly enhances the usability of tools like OrthoFinder by making the detailed results much more visually accessible.

OrthoBrowser can scale reasonably up to hundreds of genomes. The multiple synteny alignment method uses a progressive hierarchical alignment approach in the protein space using orthogroup membership to establish orthology.





□ GageTracker: a tool for dating gene age by micro- and macro-synteny with high speed and accuracy

>> https://www.biorxiv.org/content/10.1101/2024.08.28.610050v1

Built on a micro- and macro-synteny algorithm, GageTracker is a one-command tool that searches for orthologous genome alignments across multiple species and traces gene age quickly and accurately with minimal user input.

It achieves alignment quality as high as the optimized LastZ software while significantly reducing running time. GageTracker also showed a slightly higher support rate from the orthoDB, FlyBase, and Ensembl ortholog databases than the Gentree database.





□ Enhancement of network architecture alignment in comparative single-cell studies

>> https://www.biorxiv.org/content/10.1101/2024.08.30.608255v1

scSpecies pre-trains a conditional variational autoencoder-based model and fully re-initializes the encoder input layers and the decoder network during fine-tuning.

scSpecies aligns context scRNA-seq datasets with human target data, enabling the analysis of similarities and differences between the datasets. scSpecies enables nuanced comparisons of gene expression profiles by generating gene expression values for both species from a single latent variable.






□ LexicMap: efficient sequence alignment against millions of prokaryotic genomes

>> https://www.biorxiv.org/content/10.1101/2024.08.30.610459v1

LexicMap, a nucleotide sequence alignment tool for efficiently querying moderate length sequences (over 500 bp) such as a gene, plasmid or long read against up to millions of prokaryotic genomes.

A key innovation is to construct a small set of probe k-mers (e.g. n = 40,000) which "window-cover" the entire database to be indexed, in the sense that every 500 bp window of every database genome contains multiple seed k-mers each with a shared prefix with one of the probes.

Storing these seeds, indexed by the probes with which they agree, in a hierarchical index enables fast and low-memory variable-length seed matching, pseudoalignment, and then full alignment.
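
As a rough illustration of prefix-based seed selection, the sketch below picks, for each probe, the k-mer in a sequence that shares the longest prefix with it. This is a heavily simplified toy and not LexicMap's actual indexing code: the function names are hypothetical, the window-covering guarantee and the hierarchical index are not modelled, and real probes and k-mers are far longer.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_seed_per_probe(sequence, probes, k):
    """For each probe, return (position, k-mer, prefix length) of the best-matching k-mer."""
    kmers = [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]
    seeds = {}
    for probe in probes:
        # max() keeps the first k-mer on ties, i.e. the leftmost position.
        pos, kmer = max(kmers, key=lambda item: shared_prefix_len(probe, item[1]))
        seeds[probe] = (pos, kmer, shared_prefix_len(probe, kmer))
    return seeds

# Toy usage with a made-up sequence and two made-up probes.
print(best_seed_per_probe("ACGTACGGA", ["ACGG", "TTTT"], k=4))
```

Grouping seeds under the probe they agree with is what allows a query seed to be compared only against database seeds with the same probe, with the match length varying by how much of the prefix is shared.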

LexicMap is able to align with higher sensitivity than Blastn as the query divergence drops from 90% to 80% for queries ≥ 1 kb. Alignment of a single gene against 2.34 million prokaryotic genomes from GenBank and RefSeq takes 36 seconds (rare gene) to 15 minutes (16S rRNA gene).





□ Enhlink infers distal and context-specific enhancer–promoter linkages

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03374-9

Enhlink detects biological effects and controls technical effects by incorporating appropriate covariates into a nonlinear modeling framework involving single cells, rather than aggregates.

Enhlink selects a parsimonious set of enhancers associated with a promoter to smooth the sparse representation of any individual enhancer while prioritizing those with the largest effect.

Enhlink uses a random forest-like approach, where cell-level (binary) accessibilities of enhancers and biological and technical factors are features and the cell-level accessibility of a promoter is the response variable.

Enhlink can further prioritize enhancers by associating them with the expression of the promoter’s target gene. Enhlink can predict both proximal and distal enhancer–gene linkages and identify linkages specific to biological covariates.





□ COBRA: Higher-order correction of persistent batch effects in correlation networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae531/7748404

COBRA (Co-expression Batch Reduction Adjustment), a method for computing a batch-corrected gene co-expression matrix based on estimating a conditional covariance matrix.

COBRA estimates a reduced set of parameters expressing the co-expression matrix as a function of the sample covariates, allowing control for continuous and categorical covariates.





Amy Brandon / “LYSIS”

2024-08-30 10:44:01 | art music

□ Amy Brandon / “LYSIS”

>> https://amybrandon.bandcamp.com/album/lysis


□ microchimerisms

Release Date: 16/08/2024
Label: New Focus Recordings
Cat.No: FCR414

01. microchimerisms
02. threads
03. Intermountainous
04. Caduceus
05. Tsiyr
06. Affine
07. Simulacra
08. Lysis

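The duality of sound as "phenomenon and representation." Organically linked sound images ruminate on adhesion and rupture, and their structure undergoes self-dissolution (lysis). Razor-sharp microtones that seem to stimulate even the sense of touch, at times like a beast and at times like an inorganic signal, reveal the primal nature of sound imprinted at the origin.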


□ Affine



□ Intermountainous


Recorded at Music Multimedia Room, CIRMMT, McGill, Montréal
Engineer: John D.S. Adams, Stonehouse Sound
Assistant Engineer: Alexandre Calixte

threads recorded at Wild Sound Studio, Minneapolis
Engineer: Steve Kaul

Simulacra recorded at St. Andrew’s United Church, Halifax, Nova Scotia
Engineers: John D.S. Adams and Rod Sneddon
Assistant engineer: John Janigan-Mills
Producer: Amy Brandon
Editing: Amy Brandon
Production consultant: Jeff Reilly
Microtonal piano tuning, Tsiyr: Alan Whatmough, Pianocraft
Sculpture on cover: Nub 2, © Susan Roston
Courtesy of the artist and Studio Sixty Six, Ottawa, Ontario
Artwork photography: Andrew Rashotte
Music engraving: Matthew Karas
Simulacra: Matthew Karas & Aaron J. Kirschner
Affine: Jawher Matmati
Design: Marc Wolf

KAOS.

2024-08-29 04:10:07 | Film

□ 『KAOS』

>> https://www.netflix.com/jp/title/80997258

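A pop, kitschy dark comedy that transplants Greek mythology into modern society and rebuilds it. It follows the arrogant gods of Olympus up to the downfall of Zeus. Jeff Goldblum's eccentric performance shines. There is also something about it reminiscent of Götz Friedrich's staging of “Der Ring des Nibelungen.”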

Created by Charlie Covell
Directed by Georgi Banks-Davies / Runyararo Mapfumo
Music by Isabella Summers
Cinematography by Kit Fraser / Pau Esteve Birba







□ Isabella Summers / “KAOS I”

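Watched every episode of “KAOS.” Zeus spiraling out of control from paranoia is very much worth watching, and the series also works well as an allegory of religion and theories of the state. And then the primordial god, Chaos. So that is where the curtain falls! From the second half on, a science-fiction reading of Greek mythology drives the episodes.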

JANU.

2024-08-25 10:43:35 | Hotel

□ 『JANU Tokyo』

>> https://www.janu.com/janu-tokyo/ja/

Stayed in a top-floor Corner Suite at Janu Tokyo.




A resort in the AMAN family whose name means "soul" in Sanskrit.
The room has a private balcony and a hallway, balancing design and functionality.



The sweeping view of Azabudai Hills from the panoramic windows is breathtakingly beautiful.



The shoji-style doors that can screen off the bathroom are a welcome touch; Japanese-style hospitality is alive throughout.




The view from the private balcony. There was a patio set too, but it was too hot to make much use of it.





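What struck me most this time was the friendly greetings and attentiveness of every staff member.
The concierge assigned to us treated us with passion and pride until we passed through the gate; it may be the first time in a long while that I have been moved by the people during a stay like this.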






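A wonderful time spent at the pool. As at AMAN, the water is on the warm side, and there are no T-marks on the bottom (a mirror-finish material), so even serious swimmers like me are compelled to settle into a leisurely resort stay. The jacuzzi in front of the fireplace is heaven. The 4,000 m² wellness floor also has a boxing gym. I would like to see all of it someday.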



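The room had a Bang & Olufsen bookshelf speaker, the “Beosound Emerge.” Its bass reproduction is far more impressive than its elegant form suggests, and it carried sound into every corner of the suite's complex layout. I would like it to succeed the Naim MU-SO I currently use.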




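What captivated me most in the suite's dining room was the Aman-brand fragrance tester. A fleeting floral note in which perfumer Jacques Chabert portrays "the cherry blossoms of Tokyo" in scent. Apricot and green tea notes envelop it gently. A work that feels like the crystallization of the ideal I have always sought in a perfume.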



Eventide.

2024-08-25 02:40:18 | Photos






□ Trivecta / “Ocean”

Amen.

2024-08-24 10:38:35 | Art & Culture

(Art by Megs)

Ray of gold.

2024-08-22 00:22:05 | Photos




□ Seven Lions, Tritonal & Kill The Noise Feat. HALIENE / “Horizon”

Echo nomad.

2024-08-18 20:20:20 | Science News

(Art by meg)




□ StaVia: spatially and temporally aware cartography with higher-order random walks for cell atlases

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03347-y

StaVia, an automated end-to-end trajectory inference (TI) framework. StaVia can optionally incorporate any combination of the following data to infer cell transitions: sequential or spatial metadata, RNA-velocity, pseudotime, and lazy or teleporting behaviors.

StaVia exploits a new form of lazy-teleporting random walks (LTRW) with memory to pinpoint end-to-end trajectories. StaVia generates single-cell embeddings with the underlying high-resolution connectivity of the KNN graph. StaVia can create a comprehensive cartographic Atlas.





□ P(all-atom) Is Unlocking New Path For Protein Design

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608235v1

Pallatom, a novel approach for all-atom protein generation. By learning P(all-atom), high-quality all-atom proteins can be successfully generated, eliminating the need to learn marginal probabilities separately.

Pallatom employs a dual-track framework that tokenizes proteins into token-level and atomic-level representations, integrating them through a multi-layer decoding process with “traversing” representations and a recycling mechanism.





□ FEDKEA: Enzyme function prediction with a large pretrained protein language model and distance-weighted k-nearest neighbor

>> https://www.biorxiv.org/content/10.1101/2024.08.12.604109v1

FEDKEA consists of two main parts: determining whether a protein is an enzyme and predicting the enzyme's EC number. For the binary classification task of determining if a protein is an enzyme, we use the ESM-2 model with 33 layers and 650M parameters.

FEDKEA tokenizes the amino acid sequence and then fine-tunes the weights of the last few layers. It was found that fine-tuning four layers yielded the best performance. The embeddings from the model are averaged over the sequence length, resulting in a 1280-dimensional vector.
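
The pooling step and the distance-weighted k-nearest-neighbor vote named in the title can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not FEDKEA's code: the function names are hypothetical, the per-residue embeddings are random stand-ins for language-model output, and inverse Euclidean distance is assumed as the weighting scheme.

```python
import numpy as np

def mean_pool(token_embeddings):
    """Average per-residue embeddings (length x dim) over the sequence length."""
    return np.asarray(token_embeddings, dtype=float).mean(axis=0)

def distance_weighted_knn(query, train_vectors, train_labels, k=3, eps=1e-9):
    """Vote among the k nearest training vectors, weighting each vote by 1 / distance."""
    train_vectors = np.asarray(train_vectors, dtype=float)
    dists = np.linalg.norm(train_vectors - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        label = train_labels[i]
        votes[label] = votes.get(label, 0.0) + 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)

# Toy usage: a 120-residue protein with 1280-dimensional per-residue embeddings.
residues = np.random.default_rng(0).standard_normal((120, 1280))
print(mean_pool(residues).shape)

# Toy 2-D example of the weighted vote; the labels are placeholders, not real EC numbers.
train = [[0, 0], [0, 1], [5, 5], [5, 6]]
labels = ["class_A", "class_A", "class_B", "class_B"]
print(distance_weighted_knn(np.array([0.2, 0.1]), train, labels, k=3))
```

Weighting by inverse distance lets a single very close neighbor outvote several distant ones, which matters when some classes have few training examples.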





□ GENOMICON-Seq: A comprehensive tool for the simulation of mutations in amplicon and whole exome sequencing

>> https://www.biorxiv.org/content/10.1101/2024.08.14.607907v1

GENOMICON-Seq is designed to simulate both amplicon sequencing and whole exome sequencing (WES), providing a robust platform for users to experiment with virtual genetic samples. It outputs sequencing reads compatible with mutation detection tools and a report on mutation origin.

GENOMICON-Seq generates samples with varying mutation frequencies, which are then subjected to a simulated library preparation process. GENOMICON-Seq supports the simulation of amplicon sequencing and WES with PCR and probe-capturing biases, and sequencing errors.





□ DeepSME: De Novo Nanopore Basecalling of Motif-insensitive DNA Methylation and Alignment-free Digital Information Decryptions at Single-Molecule Level

>> https://www.biorxiv.org/content/10.1101/2024.08.15.606762v1

DeepSME (Deep-learning based Single-Molecule Encryption) tackles the basecalling bottleneck of the modified dataset by expanding the k-mer dictionary from scratch. DeepSME provides independent k-mer tables and exploits the properties of signal disruptions at the single-molecule level.

DeepSME’s scheme underpinned the potential for secure DNA-based data storage and communication with high information density, addressing the increasing demand for robust information security in an era of evolving biotechnological threats.





□ scParser: sparse representation learning for scalable single-cell RNA sequencing data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03345-0

scParser is based on an ensemble of matrix factorization and sparse representation learning. scParser summarizes the expression patterns of thousands of genes to a few metagenes/gene modules, which provides a high-level summary of the gene activities.

scParser models the variation caused by biological conditions via gene modules, which bridge gene expression with the phenotype. The gene modules in scParser are learned adaptively from the data and encode the biological processes that are affected by these biological conditions.





□ DeepAge: Harnessing Deep Neural Network for Epigenetic Age Estimation From DNA Methylation Data of human blood samples

>> https://www.biorxiv.org/cgi/content/short/2024.08.12.607687v1

DeepAge utilizes Temporal Convolutional Networks (TCNs), which are particularly adept at handling sequence data, to model the sequential nature of CpG sites across the genome.

DeepAge allows for an effective capture of long-range dependencies and interactions between CpG sites, which are essential for understanding the complex biological processes underlying aging.

By integrating layers of temporal blocks that include dilated convolutions, DeepAge can access a broader context of the input sequence, thus enhancing its ability to discern pertinent aging signals from the methylation patterns.
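
To make the "broader context" point concrete, the receptive field of a stack of dilated convolutions can be computed directly. The sketch below assumes one stride-1 convolution per layer and a hypothetical kernel size and dilation schedule; DeepAge's actual block layout and hyperparameters are not given here.

```python
def receptive_field(kernel_size, dilations):
    """Input positions visible to one output of stacked stride-1 dilated 1D convolutions."""
    field = 1
    for dilation in dilations:
        # Each layer extends the visible span by (kernel_size - 1) * dilation positions.
        field += (kernel_size - 1) * dilation
    return field

doubling = [2 ** i for i in range(4)]        # dilations 1, 2, 4, 8
print(receptive_field(3, doubling))          # 31
print(receptive_field(3, [1, 1, 1, 1]))      # 9
```

With doubling dilations the span grows exponentially with depth, whereas the undilated stack grows only linearly, which is why a few temporal blocks can cover long stretches of CpG sites.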





□ CauFinder: Steering cell-state and phenotype transitions by causal disentanglement learning

>> https://www.biorxiv.org/content/10.1101/2024.08.16.607277v1

CauFinder, an advanced deep learning-based causal model designed to identify a subset of master regulators that collectively exert a significant causal impact during cell-state or phenotype transitions from the observed data.

CauFinder elucidates state transitions by identifying causal factors within a latent space and quantifying causal information flow from latent features to state predictions. It can theoretically identify and circumvent confounders using the backdoor adjustment formula.





□ seq2squiggle: End-to-end simulation of nanopore sequencing signals with feed-forward transformers

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607296v1

seq2squiggle, a novel transformer-based, non-autoregressive model designed to generate nanopore sequencing signals from nucleotide sequences. seq2squiggle learns sequential contextual information from the signal data.

By leveraging feed-forward transformer blocks, seq2squiggle effectively captures broader sequential contexts, enabling the generation of artificial signals that closely resemble experimental observations.

seq2squiggle calculates event levels using pre-defined pore models, samples event durations from random distributions, and adds Gaussian noise with fixed parameters across all input sequences.






□ noSpliceVelo infers gene expression dynamics without separating unspliced and spliced transcripts

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607261v1

noSpliceVelo leverages its underlying biophysical model to infer key kinetic parameters of gene regulation: burst frequency and burst size.

Burst frequency quantifies the rate at which a promoter actively transcribes mRNA, serving as an aggregate parameter for multiple upstream processes, including chromatin remodeling, transcription activator binding, and transcription initiation complex assembly.

The noSpliceVelo architecture consists of two VAEs. The first VAE infers gene-cell specific mean and variance. The second VAE encodes these estimates into a latent cellular representation, which further encodes the transcriptional state assignment for each cell in all genes.





□ Transformers in single-cell omics: a review and new perspectives

>> https://www.nature.com/articles/s41592-024-02353-z

Geneformer reveals cellular regulatory mechanisms. Because attention values are context specific, incorporating ATAC-seq and RNA-seq data may reveal context-specific gene regulation based on the expression of co-binding transcription factors and chromatin accessibility.

TOSICA operates on pathway attention scores as cell representations that capture cellular trajectories and link changes in the trajectory to specific pathways or regulons, highlighting the regulatory networks driving disease progression.

scGPT uses gene attention scores not only to infer GRNs, but also to analyze the impact of genetic perturbations on these networks, showcasing the variety of insights that can be extracted from attention scores in single-cell transformers.




□ DeepCSCN: Deep Learning Driven Cell-Type-Specific Embedding for Inference of Single-Cell Co-expression Networks

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607542v1

DeepCSCN, an unsupervised deep-learning framework, to infer gene co-expression modules from single-cell RNA sequencing (scRNA-seq) data. DeepCSCN accurately infers cell-type-specific co-expression networks from large samples by employing feature decoupling of cell types.

DeepCSCN first trains on all samples to extract gene embeddings, then selects cell-type-specific dimensions from these embeddings based on feature disentanglement. This approach enables the inference of co-expression networks from a whole-sample level to a specific cell type level.





□ Allocator: Advancing mRNA subcellular localization prediction with graph neural network and RNA structure

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae504/7731719

Allocator incorporates various networks in its architecture, including multilayer perceptron (MLP), self-attention, and graph isomorphism network (GIN).

Allocator employs a parallel deep learning framework to learn two views of mRNA representations including sequence-based features and structural features. Then these learned features are combined and used to predict six subcellular localization categories of mRNA.





□ ctyper: High-resolution global diversity copy number variation maps and association

>> https://www.biorxiv.org/content/10.1101/2024.08.11.607269v1

ctyper, an alignment-free approach to genotype sequence-resolved copy-number variation and overcome the limitations of alignments on repetitive DNA in pangenomes.

The ctyper method traces individual gene copies in NGS data to their nearest alleles in the database and identifies allele-specific copy numbers using multivariate linear regression on k-mer counts and phylogenetic clustering.
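
The regression step can be illustrated with ordinary least squares on a toy k-mer matrix. This is only a sketch under stated assumptions: the function name is hypothetical, counts are assumed to be already normalized by sequencing depth, and the phylogenetic clustering and any constraints used by ctyper itself are left out.

```python
import numpy as np

def estimate_copy_numbers(kmer_by_allele, observed_counts):
    """Least-squares copy number per allele, rounded to non-negative integers.

    kmer_by_allele[i][j] is how many times k-mer i occurs in allele j;
    observed_counts[i] is the depth-normalized count of k-mer i in the sample.
    """
    A = np.asarray(kmer_by_allele, dtype=float)
    y = np.asarray(observed_counts, dtype=float)
    solution, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.clip(np.rint(solution), 0, None).astype(int)

# Toy usage: 4 k-mers, 2 alleles, sample carrying 2 copies of allele 0 and 1 of allele 1.
matrix = [[1, 0],
          [0, 1],
          [1, 1],
          [2, 0]]
print(estimate_copy_numbers(matrix, [2, 1, 3, 4]))
```

Because every k-mer contributes one equation, k-mers shared between paralogs still help: they constrain the sum of the copy numbers even when they cannot tell the alleles apart.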

This entails two challenges: annotating the sequences of orthologous and paralogous copies of a given gene and organizing them into functionally equivalent groups, and genotyping sequence composition with estimated copy number on these groups.





□ DREAMIT: Associating transcription factors to single-cell trajectories

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03368-7

DREAMIT (Dynamic Regulation of Expression Across Modules in Inferred Trajectories) aims to analyze dynamic regulatory patterns along trajectory branches, implicating transcription factors (TFs) involved in cell state transitions within scRNAseq datasets.

DREAMIT uses pseudotime ordering within a robust subrange of a trajectory branch to group individual cells into bins. It aggregates the cell-based expression data into a set of robust pseudobulk measurements containing gene expression averaged within bins of neighboring cells.
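
The binning into pseudobulk measurements can be sketched as follows. This assumes equal-sized bins of consecutive cells and a hypothetical function name; how DREAMIT chooses the robust subrange and the bin sizes is not reproduced here.

```python
import numpy as np

def pseudobulk_by_pseudotime(expression, pseudotime, n_bins):
    """Average expression (cells x genes) within bins of cells ordered by pseudotime."""
    expression = np.asarray(expression, dtype=float)
    order = np.argsort(pseudotime, kind="stable")   # cells sorted along the branch
    groups = np.array_split(order, n_bins)          # near-equal groups of neighboring cells
    return np.vstack([expression[idx].mean(axis=0) for idx in groups])

# Toy usage: 4 cells, 2 genes, 2 bins.
cells = [[1, 10], [3, 30], [5, 50], [7, 70]]
time = [0.9, 0.1, 0.5, 0.3]
print(pseudobulk_by_pseudotime(cells, time, n_bins=2))
```

The number of bins should not exceed the number of cells, otherwise some bins would be empty. Averaging within bins trades temporal resolution for lower dropout noise, so transcription factor and target profiles can be compared along the branch.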





□ SEACON: Improved allele-specific single-cell copy number estimation in low-coverage DNA-sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae506/7731720

SEACON (Single-cell Estimation of Allele-specific COpy Numbers) employs a Gaussian Mixture Model (GMM) to identify latent copy number states and breakpoints between contiguous segments across cells, and filters the segments for high-quality breakpoints.

SEACON adopts several strategies for tolerating noisy read-depth and allele frequency measurements. SEACON minimizes the distance between segment means and allele-specific copy number states.







□ BEROLECMI: a novel prediction method to infer circRNA-miRNA interaction from the role definition of molecular attributes and biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05891-7

BEROLECMI, a CMI prediction method which defines role attributes for each molecule through molecular attribute features, molecular self-similarity networks, and molecular network features for advanced prediction tasks.

Specifically, BEROLECMI first uses the pre-trained Bidirectional Encoder Representations from the Transformers model for DNA language in genome (DNABERT) to extract attribute features from RNA sequence.

BEROLECMI constructs RNA self-similarity networks through Gaussian kernel function and sigmoid kernel function respectively, and the high-level representation is learned by SAE - sparse autoencoder.
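
The two kernels named above have standard definitions, sketched here with NumPy. The feature matrix, the `gamma` and `coef0` values, and the function names are illustrative assumptions; the settings used by BEROLECMI are not stated here.

```python
import numpy as np

def gaussian_kernel_similarity(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) for rows x_i of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-gamma * sq_dists)

def sigmoid_kernel_similarity(X, gamma=1.0, coef0=0.0):
    """K[i, j] = tanh(gamma * <x_i, x_j> + coef0) for rows x_i of X."""
    X = np.asarray(X, dtype=float)
    return np.tanh(gamma * X @ X.T + coef0)

# Toy usage: two RNAs described by 2-dimensional feature vectors.
features = [[0.0, 0.0], [1.0, 0.0]]
print(gaussian_kernel_similarity(features))
print(sigmoid_kernel_similarity(features))
```

Each resulting matrix can be read as a weighted self-similarity network over the molecules, which is the kind of input a sparse autoencoder would then compress.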





□ NLSExplorer: Discovering nuclear localization signal universe through a novel deep learning model with interpretable attention units

>> https://www.biorxiv.org/content/10.1101/2024.08.10.606103v1

NLSExplorer leverages large-scale protein language models to capture crucial biological information with a novel attention-based deep network. NLSExplorer is able to detect various kinds of segments highly correlated with nuclear transport, such as nuclear export signals.

NLSExplorer involves the Search and Collect NLS (SCNLS) algorithm for post-analysis of recommended segments. This algorithm is primarily designed to detect NLS patterns, demonstrating capabilities for mining discontinuous NLS patterns.





□ RGAST: Relational Graph Attention Network for Spatial Transcriptome Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.09.607420v1

RGAST (Relational Graph Attention network for Spatial Transcriptome analysis), constructs a relational graph attention network to learn the representation of each spot in the ST data.

RGAST considers both gene expression similarity and spatial neighbor relationships to construct a heterogeneous graph network. RGAST learns low-dimensional latent embeddings with both spatial information and gene expressions.

The expression after dimensionality reduction by PCA of each spot is first transformed into a d-dimensional latent embedding by an encoder and then reversed back into a reconstructed expression profile via a linear decoder.





□ PLSKO: a robust knockoff generator to control false discovery rate in omics variable selection

>> https://www.biorxiv.org/content/10.1101/2024.08.06.606935v1

Partial Least Squares Knockoff (PLSKO) is an efficient and assumption-free knockoff generator that is robust to varying types of biological omics data. The authors compare PLSKO with a wide range of existing methods.

PLSKO is the only method that controls FDR with sufficient statistical power in complex non-linear cases. In semi-simulation studies based on real data, the authors show that PLSKO generates valid knockoff variables for different types of biological data.
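
For context on how knockoff variables lead to FDR control, here is a sketch of the generic knockoff+ selection step. It is not PLSKO's generator: it assumes the feature statistics `w_stats` (large positive values favour the real variable over its knockoff) have already been computed, and the function names are hypothetical.

```python
def knockoff_plus_threshold(w_stats, fdr_level):
    """Smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= fdr_level."""
    candidates = sorted({abs(w) for w in w_stats if w != 0})
    for t in candidates:
        false_proxy = 1 + sum(1 for w in w_stats if w <= -t)
        selected = sum(1 for w in w_stats if w >= t)
        if false_proxy / max(selected, 1) <= fdr_level:
            return t
    return float("inf")  # no threshold meets the target: select nothing


def select_features(w_stats, fdr_level):
    """Return indices of variables whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w_stats, fdr_level)
    return [j for j, w in enumerate(w_stats) if w >= t]


# Hypothetical statistics for ten variables.
w_stats = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -0.5, 0.3, 1.5, 1.0]
chosen = select_features(w_stats, fdr_level=0.2)
```

The count of large negative statistics acts as an estimate of false positives among the large positive ones, which is why the quality of the knockoff generator matters.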





□ Maptcha: an efficient parallel workflow for hybrid genome scaffolding

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05878-4

Maptcha addresses the hybrid genome scaffolding problem, which involves combining contigs and long reads to create a more complete and accurate genome assembly. Maptcha constructs a contig graph from the mapping information between long reads and contigs to generate scaffolds.

Maptcha uses a sketching-based, alignment-free mapping step to build and refine the graph. Maptcha employs a vertex-centric heuristic called wiring to generate ordered walks of contigs as partial scaffolds.





□ Genomic reproducibility in the bioinformatics era

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03343-2

One approach to create synthetic replicates is randomly shuffling the order of the reads reported from a sequencer, which reflects the randomness of events in a sequencing experiment, such as DNA hybridization on the flow cell.

Another technique is to take the reverse complement of each read to assess strand bias when the reference genome is double-stranded. Strand bias arises from a pronounced overabundance of NGS reads in one direction, either forward or reverse, compared to the opposite direction.
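
The two synthetic-replicate strategies above can be sketched as follows. Reads are represented here as hypothetical `(name, sequence, quality)` tuples; the function names and the fixed `seed` are assumptions for illustration.

```python
import random

# Complement table for DNA bases, including N and lower-case letters.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def shuffle_reads(reads, seed=0):
    """Return a copy of the reads in a random order (same content, new order)."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def reverse_complement_reads(reads):
    """Reverse-complement each sequence and reverse its quality string to match."""
    return [
        (name, seq.translate(_COMPLEMENT)[::-1], qual[::-1])
        for name, seq, qual in reads
    ]


reads = [("r1", "ACGT", "IIII"), ("r2", "AACC", "ABCD"), ("r3", "GGTN", "!!#I")]
replicate_shuffled = shuffle_reads(reads)
replicate_revcomp = reverse_complement_reads(reads)
```

A reproducible tool should report the same results for the original reads and for both replicates; any difference points to order dependence or strand bias in the tool.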





□ BEASTIE: Bayesian Estimation of Allele-Specific Expression in the Presence of Phasing Uncertainty

>> https://www.biorxiv.org/content/10.1101/2024.08.09.607371v1

BEASTIE makes use of an external phasing algorithm, but accounts for possible phasing errors in a locus-specific and variant-specific manner by studying local phasing error rates and using those to statistically marginalize over all possible phasings when estimating ASE.

BEASTIE builds upon previous studies by integrating information across exonic sites and incorporating additional information such as population allele frequencies, inter-SNP pair distance, and linkage disequilibrium.





□ Prevalence of and gene regulatory constraints on transcriptional adaptation in single cells

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03351-2

The study builds stochastic mathematical models of biallelic gene regulation and simulates them over tens of millions of cells.

Even a relatively parsimonious model of transcriptional adaptation can recapitulate paralog upregulation after mutation and diverse population-level gene expression distributions of downstream effectors qualitatively similar to those observed in real data.





□ fastkqr: A Fast Algorithm for Kernel Quantile Regression

>> https://arxiv.org/abs/2408.05393

The core of fastkqr is a finite smoothing algorithm that magically produces exact regression quantiles, rather than approximations. fastkqr uses a novel spectral technique that builds upon the accelerated proximal gradient descent.

The fastkqr algorithm operates at a complexity of only O(n^2) after an initial eigen-decomposition of the kernel matrix. fastkqr is scalable for the KQR computation. fastkqr significantly advances the computation of quantile regression in reproducing kernel Hilbert spaces.
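
To show what smoothing the quantile (check) loss can look like, the sketch below replaces the kink at zero with a quadratic piece of half-width `gamma`. This particular piecewise form is one common choice and is an assumption here, not a statement of fastkqr's exact formula; the function names are hypothetical.

```python
def check_loss(t, tau):
    """Standard quantile check loss: tau * t for t >= 0, (tau - 1) * t otherwise."""
    return t * (tau - (1.0 if t < 0 else 0.0))


def smoothed_check_loss(t, tau, gamma):
    """Check loss with the kink on [-gamma, gamma] replaced by a quadratic."""
    if t < -gamma:
        return (tau - 1.0) * t
    if t > gamma:
        return tau * t
    return t * t / (4.0 * gamma) + t * (tau - 0.5) + gamma / 4.0


tau, gamma = 0.3, 0.5
outside = smoothed_check_loss(2.0, tau, gamma)   # equals check_loss(2.0, tau)
at_kink = smoothed_check_loss(0.0, tau, gamma)   # equals gamma / 4
```

Outside the band the two losses agree exactly, and the gap inside the band shrinks as `gamma` goes to zero, which is what makes a differentiable surrogate usable with gradient-based solvers.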





□ SynGAP: a synteny-based toolkit for gene structure annotation polishing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03359-8

SynGAP (Synteny-based Gene structure Annotation Polisher) uses gene synteny information to accomplish precise and automated polishing of the gene structure annotation of genomes.

SynGAP dual is a module designed for the mutual gene structure annotation correction of two species. Given the genome sequences and genome annotations of the two species, synteny blocks are first identified using the MCscan pipeline in the JCVI toolkit.





□ Squigualiser: Interactive visualisation of nanopore sequencing signal data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae501/7732912

Squigualiser (Squiggle visualiser) builds upon existing methodology for signal-to-sequence alignment in order to anchor raw signal data points to their corresponding positions within basecalled reads or within a reference genome/transcriptome sequence.

Squigualiser uses a new encoding technique (the ss tag) that enables efficient, flexible representation of signal alignments and normalises outputs from alternative alignment tools.

Squigualiser employs a new method for k-mer-to-base shift correction that addresses ambiguity in signal alignments to enable visualisation of genetic variants, modified bases, or other features at single-base resolution.





□ AFFECT: an R package for accelerated functional failure time model with error-contaminated survival times and applications to gene expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05831-5

AFFECT refers to Accelerated Functional Failure time model with Error-Contaminated survival Times. Here "functional" reflects nonlinear functions between the failure time and the covariates.

AFFECT is based on the estimation function derived by the Buckley-James method, which does not require specifying the distribution of the noise term.





□ How Transformers Learn Causal Structure with Gradient Descent

>> https://arxiv.org/abs/2402.14735

Gradient descent on a simplified two-layer transformer learns to solve an in-context learning task with latent causal structure by encoding the latent causal graph in the first attention layer. The key insight of the proof is that the gradient of the attention matrix encodes the mutual information between tokens.

As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, transformers learn an induction head.





□ Seq2Topt: a sequence-based deep learning predictor of enzyme optimal temperature

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607600v1

Seq2Topt can accurately predict enzyme optimal temperature values just from protein sequences. Seq2Topt can predict the shift of enzyme optimal temperature caused by point mutations.

Residue attention weights of Seq2Topt can reveal important sequence regions for enzyme thermoactivity. The architecture of Seq2Topt can be used to build predictors of other enzyme properties.





□ scatterbar: an R package for visualizing proportional data across spatially resolved coordinates

>> https://www.biorxiv.org/content/10.1101/2024.08.14.606810v1

scatterbar is an open-source R package that extends ggplot to visualize proportional data across many spatially resolved coordinates using scatter stacked bar plots.

scatterbar uses stacked bar charts instead of pie charts. Given a set of (x,y) coordinates and matrix of associated proportional data, scatterbar creates a stacked bar chart, where bars are stacked based on the proportions of different categories centered at each (x, y) location.
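
The geometry behind such a plot can be sketched independently of R. The function below is a hypothetical illustration, not the scatterbar API: it turns each (x, y) location and its proportions into stacked rectangles centred on that location. Vertical stacking and the `width` / `height` parameters are assumptions.

```python
def scatterbar_rectangles(coords, proportions, width=1.0, height=1.0):
    """Return one rectangle per (location, category), stacked bottom to top."""
    rects = []
    for (x, y), props in zip(coords, proportions):
        total = sum(props)
        if total == 0:
            continue  # nothing to draw at this location
        bottom = y - height / 2.0
        for category, p in enumerate(props):
            segment = height * p / total  # normalise so each bar has full height
            rects.append({
                "category": category,
                "xmin": x - width / 2.0,
                "xmax": x + width / 2.0,
                "ymin": bottom,
                "ymax": bottom + segment,
            })
            bottom += segment
    return rects


# Hypothetical input: two locations, two categories each.
coords = [(0.0, 0.0), (2.0, 3.0)]
proportions = [[0.5, 0.5], [0.25, 0.75]]
rects = scatterbar_rectangles(coords, proportions)
```

The resulting rectangles can be handed to any plotting layer that draws filled boxes, with the fill colour mapped to `category`.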





□ Autoencoders with shared and specific embeddings for multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2024.08.14.607979v1

The authors propose a novel autoencoder (AE) architecture for multi-omics data integration, in which the joint component is derived from the concatenated data sources and each individual component comes from the corresponding individual data source.

To encourage the model to separate and extract the joint/shared information contained between different omic data and the specific information contained in each data source, an additional orthogonal penalty is applied between the joint and the individual embedding layers.
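
One simple way to express an orthogonality penalty between two embeddings is the squared Frobenius norm of their cross-product matrix. The sketch below assumes this form for illustration; the exact penalty used in the paper is not stated here, and the function name is hypothetical.

```python
def orthogonal_penalty(joint, individual):
    """Squared Frobenius norm of joint^T @ individual (rows are samples)."""
    d_joint = len(joint[0])
    d_individual = len(individual[0])
    penalty = 0.0
    for a in range(d_joint):
        for b in range(d_individual):
            # Inner product of joint dimension a with individual dimension b.
            cross = sum(j_row[a] * i_row[b] for j_row, i_row in zip(joint, individual))
            penalty += cross ** 2
    return penalty


# Hypothetical embeddings for two samples.
joint = [[1.0], [0.0]]
specific = [[0.0], [1.0]]
no_overlap = orthogonal_penalty(joint, specific)   # orthogonal columns
full_overlap = orthogonal_penalty(joint, joint)    # identical columns
```

Adding this term to the reconstruction loss pushes the shared and source-specific embeddings toward encoding non-overlapping information.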





Metáfora.

2024-08-18 19:16:03 | Cosmetics & Fashion

□ 『Metáfora (II-XVII)』 (Fueguia 1833)


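A mysterious work that projects the worldview of the poet Borges. It comes from a lot that could be called semi-vintage, and it is one of the works that form the core of early Fueguia. Its atmosphere is the closest to a luxury-brand perfume. A gradation of pink pepper, jasmine, and ginger moves back and forth between the layers of "Metáfora" (the concept).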

Jacarandá

2024-08-18 19:09:59 | Cosmetics & Fashion

□ 『Jacarandá (I-XXIII)』 (Fueguia 1833)


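With jacaranda blossom as the tonic note, rosewood and benzoin trace the texture and contours of a "19th-century guitar." Among Fueguia's wood and leather scents, many of which are temperamental, as typified by "The Library of Babel," this one is comparatively simple and unclouded.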



A Lux night in Azabudai Hills.

2024-08-16 01:06:12 | Photos