lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

CLEAN.

2024-05-30 01:22:44 | Cosmetics & Fashion

□ 『CLEAN -CLASSIC- Shower Fresh』 (eau de parfum)

>> https://www.cleanbeauty.com/

Clean Beauty Collective Inc.
New York, NY 10036

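Of all the perfumes that claim an 'after-bath' scent, this may be the first that fits me so well I thought, 'I've finally found it!' Citrus dazzles in the top note, followed by the dewy sweetness of jasmine, which gently envelops it.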

Celestia.

2024-05-25 17:25:35 | Science News




□ STT: Spatial transition tensor of single cells

>> https://www.nature.com/articles/s41592-024-02266-x

STT, a spatial transition tensor approach to reconstruct cell attractors in spatial transcriptome data using unspliced and spliced mRNA counts, to allow quantification of transition paths between spatial attractors as well as analysis of individual transitional cells.

STT assumes the coexistence of multiple attractors in the joint unspliced (U)–spliced (S) counts space. A 4-dimensional transition tensor across cells, genes, splicing states and attractors is constructed, with attractor-specific quantities associated with each attractor basin.

By iteratively refining the tensor estimation and decomposing the tensor-induced and spatial-constrained cellular random walk, STT connects the scales between local gene expression and splicing dynamics as well as the global state transitions among attractors.






□ D3 - DNA Discrete Diffusion: Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595630v1

DNA Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted functional activity levels. D3 can accept a conditioning signal, a scalar or vector, alongside the data as input to the score network.

D3 generates DNA sequences that better capture the diversity of cis-regulatory grammar. D3 employs a similar method with a different function for Bregman divergence.





□ PHOENIX: Biologically informed NeuralODEs for genome-wide regulatory dynamics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03264-0

PHOENIX (Prior-informed Hill-like ODEs to Enhance Neuralnet Integrals with eXplainability), an innovative NeuralODE architecture that inherits the universal function approximation property (and thus the flexibility) of neural networks while resembling Hill-Langmuir kinetics.

PHOENIX operates on the original gene expression space and performs without any dimensional reduction. PHOENIX plausibly predicted continued periodic oscillations in gene expression, even though the training data consisted of only two full cell cycles.

PHOENIX incorporates two levels of back-propagation to parameterize the neural network while inducing domain knowledge-specific properties. PHOENIX estimates the local derivative, and an ODE solver integrates this value to predict expression at subsequent time points.
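
The "estimate the local derivative, then let an ODE solver integrate it" loop can be sketched as follows. This is a minimal illustration, not PHOENIX's code: it assumes a toy one-gene system in which a hand-written Hill-type production term with linear decay stands in for the learned network, and a fixed-step Euler solver stands in for the ODE solver. The names `hill`, `derivative`, and `euler_integrate` are hypothetical.

```python
def hill(x, k=1.0, n=2.0):
    """Hill-Langmuir activation: x^n / (k^n + x^n)."""
    return x ** n / (k ** n + x ** n)


def derivative(x, production=2.0, decay=1.0):
    """Toy stand-in for the learned local derivative dx/dt (an assumption)."""
    return production * hill(x) - decay * x


def euler_integrate(x0, dt, steps):
    """Integrate dx/dt with fixed-step Euler and return the whole trajectory."""
    trajectory = [x0]
    for _ in range(steps):
        current = trajectory[-1]
        trajectory.append(current + dt * derivative(current))
    return trajectory


# Expression starts above the fixed point x = 1 and relaxes toward it.
trajectory = euler_integrate(x0=2.0, dt=0.1, steps=50)
```

With these toy parameters the derivative vanishes at x = 1, so a trajectory started above that value decreases toward it without crossing it.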





□ Spatial Coherence of DNA Barcode Networks

>> https://www.biorxiv.org/content/10.1101/2024.05.12.593725v1

"Spatial Coherence" follows Euclidean geometric laws. Spatial coherence is a feature of well-behaved spatial networks, and is reduced by introducing random, non-spatially-correlated edges b/n nodes in the network and is impacted by sparse or incomplete sampling of the network.

Spatial coherence is a measurable, ground-truth agnostic property that can be used to assess how well spatial information is captured in sequencing-based microscopy networks, and could aid in benchmark comparison, or provide a metric of confidence in reconstructed images.






□ LiftOn: Combining DNA and protein alignments to improve genome annotation

>> https://www.biorxiv.org/content/10.1101/2024.05.16.593026v1

LiftOn implements a two-step protein-maximization algorithm to find the best annotations at protein-coding gene loci. LiftOn uses a chaining algorithm to find the exon-intron boundaries of protein-coding transcripts.

LiftOn combines both DNA and protein sequence alignment to generate protein-coding gene annotations that maximize similarity to the reference proteins. LiftOn resolves issues such as overlapping gene loci and multi-mapping for genes.





□ HERRO: Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594796v1

HERRO, a framework based on a deep learning model capable of correcting Simplex nanopore regular and ultra-long reads. Combined with Hifiasm and Verkko for diploid genomes and La Jolla Assembler for haploid genomes, it achieves phased genomes with many chromosomes reconstructed telomere-to-telomere (T2T).

HERRO is optimised for both R9.4.1 and R10.4.1 pores and chemistry. HERRO achieves up to 100-fold improvement in read accuracy while keeping intact the most important sites, including haplotype-specific variation and variations between segments in tandem duplications.





□ TRAPT: A multi-stage fused deep learning framework for transcriptional regulators prediction via integrating large-scale epigenomic data

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594242v1

By leveraging two-stage self-knowledge distillation to extract the activity embedding of regulatory elements, TRAPT (Transcription Regulator Activity Prediction Tool) can predict key regulatory factors for sets of query genes through a fusion strategy.

TRAPT calculates the epigenomic regulatory potential (Epi-RP) and the transcriptional regulator regulatory potential. It then predicts the downstream regulatory element activity of each TR and the context-specific upstream regulatory element activity of the queried gene set.





□ Gene2role: a role-based gene embedding method for comparative analysis of signed gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594807v1

Gene2role, a gene embedding method for signed GRNs, employing the frameworks from SignedS2V and struc2vec. Gene2role leverages multi-hop topological information from genes within signed GRNs.

Gene2role efficiently captures the intricate topological nuances of genes using GRNs inferred from four distinct data sources. Then, applying Gene2role to integrated GRNs allowed us to identify genes with significant topological changes across cell types or states.





□ scDecorr: Feature decorrelation representation learning with domain adaptation enables self-supervised alignment of multiple single-cell experiments

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594763v1

scDecorr takes as input single-cell gene-expression matrices from different studies (domains) and uses a self-supervised feature decorrelation approach with a siamese twin model to obtain an optimal data representation.

scDecorr learns cell representations in a self-supervised fashion via a joint embedding of distorted gene profiles of a cell. It accomplishes this by optimizing an objective function that maximizes similarity among the distorted embeddings while also decorrelating their components.
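
An objective of this shape, similar across two distorted views while decorrelated across components, can be sketched with a cross-correlation loss in the style of Barlow Twins. That this matches scDecorr's exact loss is an assumption; the function names and the `off_diag_weight` value are hypothetical, and plain Python lists stand in for the network embeddings.

```python
import math


def standardize(samples):
    """Give every embedding dimension zero mean and unit variance over the batch."""
    n, d = len(samples), len(samples[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in samples]
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n) or 1.0
        for i in range(n):
            out[i][j] = (column[i] - mean) / std
    return out


def decorrelation_loss(view_a, view_b, off_diag_weight=0.005):
    """Push the cross-correlation matrix of the two views toward the identity."""
    a, b = standardize(view_a), standardize(view_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[s][i] * b[s][j] for s in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2            # same component: agree
            else:
                loss += off_diag_weight * c_ij ** 2  # different components: decorrelate
    return loss


# Two identical views whose two components are uncorrelated give zero loss.
independent = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
# Two identical views whose components are copies of each other are penalized.
redundant = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
```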

scDecorr learns batch-invariant representations using the domain adaptation (DA) framework. It is responsible for projecting samples from multiple domains to a common manifold such that similar cell samples from all the domains lie close to each other.





□ DeepDive: estimating global biodiversity patterns through time using deep learning

>> https://www.nature.com/articles/s41467-024-48434-7

DeepDive (Deep learning Diversity Estimation), a framework to estimate biodiversity trajectories consisting of two main modules: 1) a simulation module that generates synthetic biodiversity and fossil datasets and 2) a deep learning framework that uses fossil data.

The simulator generates realistic diversity trajectories, encompassing a broad spectrum of regional heterogeneities. Simulated data also include fossil occurrences and their distribution across discrete geographic regions and through time.





□ CellWalker2: multi-omic discovery of hierarchical cell type relationships and their associations with genomic annotations

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594770v1

CellWalker2 is a graph diffusion-based method for single-cell genomics data integration. It takes count matrices as inputs, specifically gene-by-cell and/or peak-by-cell matrices from scRNA-Seq and scATAC-Seq, respectively.

CellWalker2 builds a graph that integrates these inputs, plus a cell type ontology and optionally genome coordinates for regions of interest. The algorithm then conducts a random walk with restarts on this graph and computes an influence matrix.

From sub-blocks of the influence matrix, CellWalker2 learns relationships between different nodes. CellWalker2 can map genomic regions to cell ontologies, enabling precise annotation of elements derived from bulk data, such as enhancers, genetic variants, and sequence motifs.
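
A random walk with restarts and the resulting influence matrix can be sketched on a toy graph as below. This is a generic power-iteration sketch, not CellWalker2's implementation; the three-node path graph, the restart probability of 0.5, and the function names are all assumptions made for illustration.

```python
def row_normalize(adjacency):
    """Turn edge weights into transition probabilities, row by row."""
    transition = []
    for row in adjacency:
        total = sum(row)
        transition.append([w / total if total else 0.0 for w in row])
    return transition


def random_walk_with_restart(adjacency, restart=0.5, iterations=200):
    """Return influence[i][j]: steady-state probability of node j for a walker restarting at node i."""
    n = len(adjacency)
    transition = row_normalize(adjacency)
    influence = []
    for seed in range(n):
        p = [1.0 if j == seed else 0.0 for j in range(n)]
        for _ in range(iterations):
            walked = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
            p = [(1 - restart) * walked[j] + (restart if j == seed else 0.0)
                 for j in range(n)]
        influence.append(p)
    return influence


# Toy path graph 0 - 1 - 2; in a real setting nodes would be cells, labels, regions.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
influence = random_walk_with_restart(adjacency)
```

In this toy case the row for node 0 converges to 7/12, 1/3, and 1/12, so influence falls off with graph distance from the restart node.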







□ bulk2sc: Generating Synthetic Single Cell Data from Bulk RNA-seq Using a Pretrained Variational Autoencoder

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594837v1

bulk2sc, a bulk to single cell framework which utilizes a Gaussian mixture variational autoencoder (GMVAE) to generate representative, synthetic single cell data from bulk RNA-seq data by learning the cell type-specific means, variances, and proportions.

bulk2sc is composed of three parts: a single-cell GMVAE (scGMVAE) that learns cell type-specific Gaussian parameters; a bulk RNA-seq VAE (Bulk VAE) that learns the cell type-specific means, variances, and proportions (passed from the scGMVAE) using bulk RNA-seq data as input; and a bulk-to-single-cell encoder-decoder (genVAE).

bulk2sc reconstructs the scRNA data using the genVAE, which is composed of the encoder-decoder components from Bulk VAE and generates synthetic, representative scRNA-seq data from bulk RNA-seq data.





□ StarFunc: fusing template-based and deep learning approaches for accurate protein function prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.15.594113v1

StarFunc, a composite approach that integrates state-of-the-art deep learning models seamlessly with template information from sequence homology, protein-protein interaction partners, proteins with similar structures, and protein domain families.

StarFunc’s structure-based component adds a fast Foldseek-based structure prefiltering stage to select the subset of related templates for full length TM-align alignment, providing both the efficiency of Foldseek and the sensitivity of TM-align for structural template detection.





□ CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.05.13.593861v1

CellAgent, a zero-code LLM-driven multi-agent collaborative framework for scRNA-seq data analysis. CellAgent can directly comprehend natural language task descriptions, autonomously completing complex tasks with high quality through effective collaboration.

CellAgent introduces a hierarchical decision-making mechanism, with upper-level task planning via Planner, and lower-level task execution via Executor.

CellAgent uses a self-iterative optimization mechanism, encouraging Executors to autonomously optimize the planning process by incorporating automated evaluation results and accounting for potential code execution exceptions.






□ ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling

>> https://www.biorxiv.org/content/10.1101/2024.03.04.583284v2.full.pdf

ESM-AA (ESM All-Atom), which achieves multi-scale unified molecular modeling through pre-training on multi-scale code-switch protein sequences and describing relationships among residues and atoms using a multi-scale position encoding.

ESM-AA generates multi-scale code-switch protein sequences by randomly unzipping partial residues. ESM-AA uses 12 stacked Transformer layers, each with 20 attention heads. The model dimension and feed-forward dimension of each Transformer layer are 480 and 1920.





□ COCOA: A Framework for Fine-scale Mapping Cell-type-specific Chromatin Compartmentalization Using Epigenomic Information

>> https://www.biorxiv.org/content/10.1101/2024.05.11.593669v1

COCOA (mapping chromatin compartmentalization with epigenomic information), a method that predicts the cell-type-specific correlation matrix (CM) using six types of accessible epigenomic modification signals.

COCOA employs the cross attention fusion module to fuse bi-directional epigenomic track features. The cross attention fusion module mainly contains two attention feature fusion layers. Each AFF layer has: global feature extraction, local feature extraction and attention fusion.





□ CLEAN-Contact: Contrastive Learning-enabled Enzyme Functional Annotation Prediction with Structural Inference

>> https://www.biorxiv.org/content/10.1101/2024.05.14.594148v1

CLEAN-Contact framework harnesses the power of ESM-2, a pretrained protein language model responsible for encoding amino acid sequences, and ResNet, a convolutional neural network utilized for encoding contact maps.

Sequence and structure representations are combined and projected into high-dimensional vectors using the projector. Positive samples are those with the same EC number as the anchor sample and negative samples are chosen from EC numbers with cluster centers close to the anchor.





□ CellSNAP: Cross-domain information fusion for enhanced cell population delineation in single-cell spatial-omics data

>> https://www.biorxiv.org/content/10.1101/2024.05.12.593710v1

CellSNAP (Cell Spatio- and Neighborhood-informed Annotation and Patterning), an unsupervised information fusion algorithm, broadly applicable to different single-cell spatial-omics data modalities, for learning cross-domain integrative single-cell representation vectors.

CellSNAP uses SNAP-GNN-duo, which trains a pair of graph neural networks with an overarching multi-layer perceptron (MLP) head to predict each cell's neighborhood-composition-plus-cell-cluster vector, using both its feature expressions and its local tissue image encoding.





□ MetaGraph: Indexing All Life's Known Biological Sequences

>> https://www.biorxiv.org/content/10.1101/2020.10.01.322164v3

MetaGraph can index biological sequences of all kinds, such as raw DNA/RNA sequencing reads, assembled genomes, and protein sequences. The MetaGraph index consists of an annotated sequence graph that has two main components:

The first is a k-mer dictionary representing a De Bruijn graph. The k-mers stored in this dictionary serve as elementary tokens in all operations on the MetaGraph index. The second is a representation of the metadata encoded as a relation between k-mers and any categorical features.
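
The two components can be pictured with a toy index. This is only a sketch under strong assumptions: a plain Python dict stands in for the k-mer dictionary, sets of labels stand in for the annotation relation, and the function names are hypothetical. It says nothing about how MetaGraph actually stores either component.

```python
from collections import defaultdict


def build_annotated_index(sequences, k):
    """Map each k-mer to the set of labels of the sequences that contain it."""
    index = defaultdict(set)
    for label, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(label)
    return index


def query_labels(index, query, k):
    """Count, per label, how many k-mers of the query carry that label."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for label in index.get(query[i:i + k], ()):
            hits[label] += 1
    return dict(hits)


sequences = {"sampleA": "ACGTAC", "sampleB": "GTACCA"}
index = build_annotated_index(sequences, k=3)
hits = query_labels(index, "CGTAC", k=3)
```

Here the query shares three 3-mers with sampleA and two with sampleB, which is the kind of per-label evidence a k-mer index can return.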





□ Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA

>> https://www.nature.com/articles/s41592-024-02273-y

Metabuli is a metagenomic classifier that jointly analyzes both DNA and amino acid (AA) sequences. DNA-based classifiers can make specific classifications, exploiting point mutations to distinguish close taxa.





□ IFDlong: an isoform and fusion detector for accurate annotation and quantification of long-read RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.05.11.593690v1

IFDlong, an Isoform Fusion Detector that was tailored for long-RNA-seq data for the annotation and quantification of isoform and fusion transcripts.

IFDlong employs multiple selection criteria to control false positives (FP) in the detection of novel isoforms and fusion transcripts. IFDlong enhances the accuracy of fusion detection by filtering out fusion candidates involving pseudogenes, genes from the same family, and readthrough events.





□ Parallel maximal common subgraphs with labels for molecular biology

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593525v1

Parallel algorithms to compute the Maximal Common Connected Partial Subgraphs (MCCPS) over shared memory, distributed memory, and a hybrid approach.

A novel memory-efficient distributed algorithm that exhaustively enumerates all Maximal Common Connected Partial Subgraphs when considering backbones, canonical and noncanonical contacts, and stackings.





□ MR-GGI: accurate inference of gene–gene interactions using Mendelian randomization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05808-4

MR-GGI requires gene expression and the genotype of the data. MR-GGI identifies gene–gene interaction by inferring causality between two genes, where one gene is used as an exposure, the other gene is used as an outcome, and causal cis-SNP(s) for the genes are used as IV(s).
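
The exposure / outcome / instrument setup can be illustrated with the textbook Wald ratio, the simplest Mendelian randomization estimator with a single instrument. Whether MR-GGI uses this particular estimator is not stated here, so treat it as an assumption; the function names and the toy genotype and expression values are made up for illustration.

```python
def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def wald_ratio(snp, exposure, outcome):
    """Causal effect of exposure on outcome, using the SNP as the instrument."""
    return covariance(snp, outcome) / covariance(snp, exposure)


# Toy data: genotype dosage of a cis-SNP, expression of gene A (exposure),
# and expression of gene B (outcome), which is exactly twice gene A.
snp = [0, 1, 2, 0, 1, 2]
gene_a = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0, 2.0, 4.0, 6.0]
effect = wald_ratio(snp, gene_a, gene_b)
```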





□ Readsynth: short-read simulation for consideration of composition-biases in reduced metagenome sequencing approaches

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05809-3

Readsynth first reads each input genome assembly individually to capture the set of possible fragments and calculate the probability of each sequence fragment surviving to the final library.

Fragments resulting from any combination of palindromic restriction enzyme motifs are modeled probabilistically to account for partial enzyme digestion.

The probability of a fragment remaining at the end of digestion is calculated based on the probability of an enzyme cut producing the necessary forward and reverse adapter-boundary sites, adjusted accordingly for fragments harboring internal cut sites.
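
One way to read this calculation is as a product of independent events: both boundary cuts must happen, and every internal site must stay uncut. The sketch below encodes that reading. The independence assumption, the function name, and the example probabilities are mine, not taken from Readsynth.

```python
def fragment_survival_probability(p_forward_cut, p_reverse_cut, internal_cut_probs=()):
    """Probability that a fragment is cut at both ends and nowhere in between.

    Assumes every cut event is independent (an illustrative simplification).
    """
    probability = p_forward_cut * p_reverse_cut
    for p_cut in internal_cut_probs:
        probability *= 1.0 - p_cut  # the internal site must remain uncut
    return probability


no_internal_site = fragment_survival_probability(0.9, 0.8)
one_internal_site = fragment_survival_probability(0.9, 0.8, [0.5])
```

Under this reading a single internal site that is cut half of the time halves the chance that the fragment reaches the final library.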





□ Cluster efficient pangenome graph construction with nf-core/pangenome

>> https://www.biorxiv.org/content/10.1101/2024.05.13.593871v1

nf-core/pangenome, an easy-to-install, portable, and cluster-scalable pipeline for the unbiased construction of pangenome variation graphs. It is the first pangenomic nf-core pipeline enabling the comparative analysis of gigabase-scale pangenome datasets.

nf-core/pangenome can distribute the quadratic all-to-all base-level alignments across nodes of a cluster by splitting the approximate alignments into problems of equal size using the whole-chromosome pairwise sequence aligner WFMASH.





□ SANGO: Deciphering cell types by integrating scATAC-seq data with genome sequences

>> https://www.nature.com/articles/s43588-024-00622-7

SANGO, a method for accurate single-cell annotation by integrating genome sequences around the accessibility peaks. The genome sequences of peaks are encoded into low-dimensional embeddings, which are then used to iteratively reconstruct the peak statistics through a fully connected network.

SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer.





□ Flawed machine-learning confounds coding sequence annotation

>> https://www.biorxiv.org/content/10.1101/2024.05.16.594598v1

An assessment of nucleotide sequence and alignment-based de novo protein-coding detection tools. The controls we use exclude any previous training dataset and include coding exons as a positive set and length-matched intergenic and shuffled sequences as negative sets.




□ Telogator2: Characterization of telomere variant repeats using long reads enables allele-specific telomere length estimation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05807-5

Telogator2, a method for reporting ATL and TVR sequences from long read sequencing data. Telogator2 can identify distinct telomere alleles in the presence of sequencing errors and alignments where reads may be mapped to chromosome arms different from where they originated.

Telogator2 extracts a subset of reads containing a minimum number of canonical repeats. Telomere region boundaries are estimated based on the density of telomere repeats, and reads that terminate in telomere sequence on one end and non-telomere sequence on the other are selected.
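
The first filtering step, keeping reads with a minimum number of canonical repeats, can be sketched as below. This is a simplified illustration, not Telogator2's code: it counts exact, non-overlapping copies of the human canonical repeat TTAGGG or its reverse complement CCCTAA, and the function names and threshold are hypothetical.

```python
def count_canonical_repeats(read):
    """Count exact copies of TTAGGG or CCCTAA, whichever strand has more."""
    return max(read.count("TTAGGG"), read.count("CCCTAA"))


def select_telomere_reads(reads, min_repeats):
    """Keep reads carrying at least min_repeats canonical repeats."""
    return [read for read in reads if count_canonical_repeats(read) >= min_repeats]


reads = [
    "ACGT" + "TTAGGG" * 5,      # forward-strand telomere read
    "CCCTAA" * 3 + "GATTACA",   # reverse-strand read with few repeats
    "ACGTACGTACGT",             # no telomere sequence
]
selected = select_telomere_reads(reads, min_repeats=4)
```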





□ PQSDC: a parallel lossless compressor for quality scores data via sequences partition and Run-Length prediction mapping

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae323/7676123

PQSDC (Parallel QSD Compressor), a novel parallel lossless QSD-dedicated compression algorithm. PQSDC is robust when compressing QSD with varying data distributions. This is attributed to the proposed PRPM model, which integrates the strengths of mapping and dynamic run-length coding.
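
For readers unfamiliar with run-length coding, here is plain run-length encoding applied to a quality string. It is a textbook sketch only: it is not the PRPM model, and the function names and the example string are assumptions. What it shows is that the encoding is lossless, since decoding returns the input exactly.

```python
from itertools import groupby


def rle_encode(quality):
    """Collapse each run of identical quality characters into (char, length)."""
    return [(char, len(list(run))) for char, run in groupby(quality)]


def rle_decode(runs):
    """Rebuild the original quality string from its runs."""
    return "".join(char * length for char, length in runs)


quality = "IIIIFF#I"
runs = rle_encode(quality)
restored = rle_decode(runs)
```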





□ mosGraphGen: a novel tool to generate multi-omic signaling graphs to facilitate integrative and interpretable graph AI model development

>> https://www.biorxiv.org/content/10.1101/2024.05.15.594360v1

mosGraphGen (multi-omics signaling graph generator), a novel computational tool that generates multi-omics signaling graphs of individual samples by mapping the multi-omics data onto a biologically meaningful multi-level background signaling network.





□ iSeq: An integrated tool to fetch public sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.16.594538v1

iSeq automatically detects the accession format and fetches metadata from the appropriate source, prioritizing ENA among the partner organizations of INSDC or GSA due to their extensive data availability.

iSeq can merge multiple FASTQ files from the same experiment into a single file for single-end (SE) sequencing data, or maintain the order and consistency of read names in two files for paired-end (PE) sequencing data.
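
Detecting the accession format usually comes down to matching prefixes. The sketch below is an illustration and not iSeq's actual logic: it covers only run and project accessions, using the standard prefixes SRR/ERR/DRR and PRJNA/PRJEB/PRJDB for INSDC and CRR and PRJCA for GSA. The function name and the returned labels are hypothetical.

```python
import re

# Ordered list of (pattern, description); the first match wins.
ACCESSION_PATTERNS = [
    (re.compile(r"^[SED]RR\d+$"), "INSDC run"),
    (re.compile(r"^CRR\d+$"), "GSA run"),
    (re.compile(r"^PRJ(NA|EB|DB)\d+$"), "INSDC project"),
    (re.compile(r"^PRJCA\d+$"), "GSA project"),
]


def classify_accession(accession):
    """Return a coarse label for an accession string, or 'unknown'."""
    for pattern, kind in ACCESSION_PATTERNS:
        if pattern.match(accession):
            return kind
    return "unknown"


labels = [classify_accession(a) for a in ("SRR1234567", "CRR311377", "PRJNA211801")]
```

Once the source is known, a tool can decide which archive to query for metadata first.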





□ SCIITensor: A tensor decomposition based algorithm to construct actionable TME modules with spatially resolved intercellular communications

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595103v1

SCIITensor, a framework that decomposes the patterns of TME units and the spatial interaction modules based on NTD, an unsupervised method that can identify spatial patterns and modules from multidimensional matrices.

SCIITensor constructs a three-dimensional matrix by stacking intensity matrices of interactions in each TME unit, and it is decomposed by NTD. The decomposed patterns in each dimension indicate events related to specific cellular and molecular function modules within TME modules.





□ SpatialDiffusion: Predicting Spatial Transcriptomics with Denoising Diffusion Probabilistic Models

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595094v1

stDiffusion adapts the principles of Denoising Diffusion Probabilistic Models. stDiffusion learns ST data from a single slice and predicts held-out slices, effectively interpolating between a finite set of ST slices.

stDiffusion incorporates an embedding layer for cell types and a linear transformation for spatial coordinates. An embedding layer for cell type classification allows the model to interpret cell types as dense vectors of a specified dimension.





□ BioInformatics Agent (BIA): Unleashing the Power of Large Language Models to Reshape Bioinformatics Workflow

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595240v1

BIA is operationalized via textual interactions with Large Language Models (LLMs). Overall, the engagement with the LLM is orchestrated via four structured narrative segments: the Thought segment instigates a reflective assessment of the task's progression;

the Action and Action Input segments direct the LLM to invoke a particular tool and specify its required inputs, thereby promoting instrumental engagement; finally, the Observation phase permits the LLM to interpret the result from the executed tool.
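
The Thought / Action / Action Input / Observation cycle can be sketched as a small parse-and-dispatch step. This is a generic illustration of that prompt structure, not BIA's code; the regular expression, the function names, and the `count_reads` tool are hypothetical.

```python
import re

# One labelled segment per line, e.g. "Action: count_reads".
SEGMENT = re.compile(r"^(Thought|Action Input|Action):[ \t]*(.*)$", re.MULTILINE)


def parse_llm_reply(reply):
    """Split an LLM reply into its named segments."""
    return {name: value.strip() for name, value in SEGMENT.findall(reply)}


def run_step(reply, tools):
    """Invoke the requested tool and format its result as the Observation."""
    segments = parse_llm_reply(reply)
    tool = tools[segments["Action"]]
    observation = tool(segments["Action Input"])
    return f"Observation: {observation}"


reply = (
    "Thought: I need the read count.\n"
    "Action: count_reads\n"
    "Action Input: sample.fastq"
)
tools = {"count_reads": lambda path: f"{path} has 42 reads"}  # stand-in tool
observation = run_step(reply, tools)
```

The returned Observation line would be appended to the conversation so the LLM can interpret the tool result in its next Thought.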

The Zone of Interest.

2024-05-24 22:10:10 | Film

□ 『The Zone of Interest』(関心領域)

>> https://happinet-phantom.com/thezoneofinterest/

2024
Directed by Jonathan Glazer
Based on the novel by Martin Amis
Field Recording / Sound effects by Maximillian Behrens
Cinematography by Lukasz Zal
Music by Mica Levi

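A piece of spatial theater in which ten fixed cameras capture the image of an ideal family amid the ambient sound of the Auschwitz camp. If it looks grotesque, that is above all because it is a mirror that cuts deep into the conscience of people living today. Every dividing wall crumbles with time. Are we prepared to be exposed to the eyes of the age?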



□ Mica Levi / “The Zone of Interest”






Giraffes.

2024-05-17 20:08:08 | Science

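When new systems are rolled out to comply with laws and regulations, the real environment is so hard to predict that, however many proof-of-concept trials have been run, the rough spots only show up during deployment, and more often than not problems of human operation account for much of the difficulty.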

Pleni sunt caeli et terra gloria tua.

2024-05-15 22:50:55 | Science News

(Art by Samuel Krug)




□ Wasserstein Wormhole: Scalable Optimal Transport Distance with Transformers

>> https://arxiv.org/abs/2404.09411

Wasserstein Wormhole, an algorithm that represents each point cloud as a single embedded point, such that the Euclidean distance in the embedding space matches the OT distance between point clouds. The problem solved by Wormhole is analogous to multidimensional scaling.

In Wormhole space, they compute Euclidean distance in O(d) time for an embedding space with dimension d, which acts as an approximate OT distance and enables Wasserstein-based analysis without expensive Sinkhorn iterations.

Wormhole minimizes the discrepancy between the embedding pairwise distances and the pairwise Wasserstein distances of the batch point clouds. The Wormhole decoder is a second transformer trained to reproduce the input point clouds from the embedding by minimizing the OT distance.
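
The encoder objective, matching embedding distances to OT distances over a batch, can be sketched as follows. Two simplifications are assumed: the point clouds are one-dimensional and of equal size, where the 1-Wasserstein distance has a closed form via sorting, and fixed vectors stand in for the transformer's output. The function names are hypothetical.

```python
import math


def wasserstein_1d(a, b):
    """Exact 1-Wasserstein distance between two equal-sized 1-D point clouds."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def euclidean(u, v):
    """Euclidean distance between two embedding vectors, O(d) in the dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def wormhole_style_loss(embeddings, clouds):
    """Mean squared gap between embedding distances and OT distances over all pairs."""
    total, pairs = 0.0, 0
    for i in range(len(clouds)):
        for j in range(i + 1, len(clouds)):
            gap = euclidean(embeddings[i], embeddings[j]) - wasserstein_1d(clouds[i], clouds[j])
            total += gap ** 2
            pairs += 1
    return total / pairs


clouds = [[0.0, 1.0], [1.0, 2.0], [3.0, 4.0]]
perfect = wormhole_style_loss([[0.0], [1.0], [3.0]], clouds)    # distances match exactly
collapsed = wormhole_style_loss([[0.0], [0.0], [0.0]], clouds)  # all embeddings identical
```

An embedding that reproduces the pairwise OT distances drives the loss to zero, while a collapsed embedding is penalized by the squared OT distances it fails to represent.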





□ Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation

>> https://arxiv.org/abs/2311.16199

Symphony, an autoregressive generative model that uses higher-degree equivariant features and spherical harmonic projections to build molecules while respecting the E(3) symmetries of molecular fragments.

Symphony builds molecules sequentially by predicting and sampling atom types and locations of new atoms based on conditional probability distributions informed by previously placed atoms.

Symphony stands out by using spherical harmonic projections to parameterize the distribution of new atom locations. This approach enables predictions to be made using features from a single 'focus' atom, which serves as the chosen origin for that step of the generation process.





□ Distributional Graphormer: Predicting equilibrium distributions for molecular systems with deep learning

>> https://www.nature.com/articles/s42256-024-00837-3

Distributional Graphormer (DiG) can generalize across molecular systems and propose diverse structures that resemble observations. DiG draws inspiration from simulated annealing, which gradually transforms a uniform distribution into a complex one.

DiG enables independent sampling of the equilibrium distribution. The diffusion process can also be biased towards a desired property for inverse design and allows interpolation between structures that passes through high-probability regions.





□ Pathformer: a biological pathway informed transformer for disease diagnosis and prognosis using multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae316/7671099

Pathformer transforms various modalities into distinct gene-level features using a series of statistical methods, such as the maximum value method, and connects these features into a novel compacted multi-modal vector for each gene.

Pathformer employs a sparse neural network based on the gene-to-pathway mapping to transform gene embedding into pathway embedding. Pathformer enhances the fusion of information between various modalities and pathways by combining pathway crosstalk networks with a Transformer encoder.





□ RNAErnie: Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning

>> https://www.nature.com/articles/s42256-024-00836-4

RNAErnie is built upon the Enhanced Representation through Knowledge Integration (ERNIE) framework and incorporates multilayer and multihead transformer blocks, each having a hidden state dimension of 768.

RNAErnie model consists of 12 transformer layers. In the motif-aware pretraining phase, RNAErnie is trained on a dataset of approximately 23 million sequences extracted from the RNAcentral database using self-supervised learning with motif-aware multilevel random masking.

RNAErnie first predicts the possible coarse-grained RNA types using output embeddings and then leverages the predicted types as auxiliary information for fine-tuning. RNAErnie leverages an RNAErnie basic block to predict the top-K most possible coarse-grained RNA types.





□ LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language

>> https://www.biorxiv.org/content/10.1101/2024.05.10.592927v1

LucaOne possesses the capability to interpret biological signals and, as a foundation model, can be guided through input data prompts to perform a wide array of specialized tasks in biological computation.

LucaOne leverages a multifaceted computational training strategy that concurrently processes nucleic acids (DNA / RNA) and protein data from 169,861 species. LucaOne comprised 20 transformer-encoder blocks with an embedding dimension of 2560 and a total of 1.8 billion parameters.





□ BIMSA: Accelerating Long Sequence Alignment Using Processing-In-Memory

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593513v1

BIMSA (Bidirectional In-Memory Sequence Alignment), a PIM-optimized implementation of the state-of-the-art sequence alignment algorithm BiWFA (Bidirectional Wavefront Alignment), incorporating hardware-aware optimizations for a production-ready PIM architecture (UPMEM).

BIMSA follows a coarse-grain parallelization scheme, assigning one or more sequence pairs to each DPU thread. This parallelization scheme is the best fit when targeting the UPMEM platform, as it removes the need for thread synchronization or data sharing across compute units.





□ MrVI: Deep generative modeling of sample-level heterogeneity in single-cell genomics

>> https://www.biorxiv.org/content/10.1101/2022.10.04.510898v2

MrVI (Multi-resolution Variational Inference) identifies sample groups without requiring a priori clustering of the cells. It allows for different sample groupings to be conferred by different subsets of cells that are detected automatically.

MrVI enables both DE and DA in an annotation-free manner and at high resolution while accounting for uncertainty and controlling for undesired covariates, such as the experimental batch.

MrVI provides a principled methodology for estimating the effects of sample-level covariates on gene expression at the level of an individual cell. MrVI leverages the optimization procedures included in scvi-tools, allowing it to scale to multi-sample studies with millions of cells.





□ DeChat: Repeat and haplotype aware error correction in nanopore sequencing reads

>> https://www.biorxiv.org/content/10.1101/2024.05.09.593079v1

DeChat corrects sequencing errors in ONT R10 long reads in a manner that is aware of repeats, haplotypes or strains. DeChat combines the concepts of de Bruijn graphs (dBG) and variant-aware multiple sequence alignment via partial order alignment algorithm.

DeChat divides raw reads into small kmers and eliminates those with extremely low frequencies. Subsequently, it constructs a compacted de Bruijn graph (dBG). Each raw read is then aligned to the compacted dBG to identify the optimal alignment path.
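
The k-mer filtering step can be sketched in a few lines. This is an illustration only: the value of k, the frequency cutoff, and the function name are assumptions, and the graph construction and alignment steps are not shown.

```python
from collections import Counter


def solid_kmers(reads, k, min_count):
    """Split reads into k-mers and keep those seen at least min_count times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


# The third read carries a likely error, so its unique k-mers are dropped.
reads = ["ACGTA", "ACGTA", "ACGAA"]
kept = solid_kmers(reads, k=3, min_count=2)
```

Only the retained k-mers would then go into the compacted de Bruijn graph.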





□ CELLama: Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593094v1

CELLama (Cell Embedding Leverage Language Model Abilities), a framework that leverages a language model to transform cell data into 'sentences' that encapsulate gene expressions and metadata, enabling universal cellular data embedding for various analyses.

CELLama transforms scRNA-seq data into natural language sentences. CELLama can utilize pretrained models that cover general NLP processes for embedding, and it can also be fine-tuned using large-scale cellular data by generating sentences and their similarity metrics.





□ scBSP: A fast and accurate tool for identifying spatially variable genes from spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.05.06.592851v1

scBSP (single-cell big-small patch), a significantly enhanced version of BSP, to address computational challenges in the identification of SVGs from large-scale two/three-dimensional SRT data.

scBSP selects a set of neighboring spots within a certain distance to capture the regional means and filters the SVGs using the velocity of changes in the variances of local means with different granularities.
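The big-small patch idea can be sketched as follows: compute local means at two radii and compare how much variance survives at the coarser scale. This is only a toy illustration; the radii, the ratio statistic and the function names are assumptions, and scBSP's actual test statistic and significance calculation are not reproduced here.

```python
from math import dist
from statistics import pvariance

def local_means(coords, values, radius):
    """Mean expression over all spots within `radius` of each spot."""
    means = []
    for center in coords:
        neigh = [v for p, v in zip(coords, values) if dist(center, p) <= radius]
        means.append(sum(neigh) / len(neigh))
    return means

def patch_variance_ratio(coords, values, small_r, big_r):
    """Variance of big-patch means relative to variance of small-patch means."""
    var_small = pvariance(local_means(coords, values, small_r))
    var_big = pvariance(local_means(coords, values, big_r))
    return var_big / var_small if var_small > 0 else 0.0

coords = [(i, 0) for i in range(8)]          # eight spots on a line
structured = [0, 0, 0, 0, 1, 1, 1, 1]        # spatially structured gene
scattered = [0, 1, 0, 1, 0, 1, 0, 1]         # no large-scale structure
r_structured = patch_variance_ratio(coords, structured, 1, 2)
r_scattered = patch_variance_ratio(coords, scattered, 1, 2)
```

A spatially structured gene retains more of its variance as the patch grows, so its ratio is higher than that of the scattered gene.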





□ EpiTrace: Tracking single-cell evolution using clock-like chromatin accessibility loci

>> https://www.nature.com/articles/s41587-024-02241-z

EpiTrace counts the fraction of opened clock-like loci from scATAC-seq data to perform lineage tracing. The measurement is performed using a hidden Markov model-mediated diffusion-smoothing approach, borrowing information from similar single cells to reduce noise.

The EpiTrace algorithm simply leverages the fact that heterogeneity of given reference ClockDML reduces during cell replication and then uses such information as an intermediate tool variable to infer cell age.





□ SYNY: a pipeline to investigate and visualize collinearity between genomes

>> https://www.biorxiv.org/content/10.1101/2024.05.09.593317v1

Collinear segments, also known as syntenic blocks, can be inferred from sequence alignments and/or from the identification of genes arrayed in the same order and relative orientations between investigated genomes.

SYNY investigates gene collinearity (synteny) between genomes by reconstructing clusters from conserved pairs of protein-coding genes identified from DIAMOND homology searches. It also infers collinearity from pairwise genome alignments with minimap2.
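A minimal sketch of reconstructing collinear clusters from conserved gene pairs is shown below. It assumes homologous pairs have already been reduced to gene-order indices on the two genomes; the `max_gap` and `min_size` parameters and the function name are hypothetical and not SYNY's options.

```python
def collinear_clusters(pairs, max_gap=1, min_size=2):
    """Group homologous gene pairs into runs that are adjacent on both genomes.

    pairs: list of (index_in_genome_a, index_in_genome_b).
    A run must keep one orientation (forward or inverted) on genome B.
    """
    clusters, current, direction = [], [], 0
    for pa, pb in sorted(pairs):
        if current:
            qa, qb = current[-1]
            step_a, step_b = pa - qa, pb - qb
            step_dir = 1 if step_b > 0 else -1
            adjacent = 0 < step_a <= max_gap and 0 < abs(step_b) <= max_gap
            if adjacent and direction in (0, step_dir):
                current.append((pa, pb))
                direction = step_dir
                continue
            clusters.append(current)
        current, direction = [(pa, pb)], 0
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_size]

pairs = [(0, 10), (1, 11), (2, 12), (5, 30), (6, 29), (7, 28), (9, 50)]
blocks = collinear_clusters(pairs)
```

The example yields one forward block and one inverted block; the isolated pair `(9, 50)` is discarded because it does not reach `min_size`.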





□ seismic: Disentangling associations between complex traits and cell types

>> https://www.biorxiv.org/content/10.1101/2024.05.04.592534v1

seismic, a framework that enables robust and efficient discovery of cell type-trait associations and provides the first method to simultaneously identify the specific genes and biological processes driving each association.

seismic eliminates the need to select arbitrary thresholds to characterize trait or cell-type association. seismic calculates the statistical significance of a cell type-trait association using a regression-based framework with the gene specificity scores and MAGMA z-scores.





□ Fairy: fast approximate coverage for multi-sample metagenomic binning

>> https://www.biorxiv.org/content/10.1101/2024.04.23.590803v1

fairy, a much faster, k-mer-based alignment-free method of computing multi-sample coverage for metagenomic binning. fairy is built on top of the authors' metagenomic profiler sylph, but fairy is specifically adapted for metagenomic binning of contigs.

Fairy indexes (or sketches) the reads into subsampled k-mer-to-count hash tables. K-mers from contigs are then queried against the hash tables to estimate coverage. Finally, fairy's output is used for binning and is compatible with several binners (e.g. MetaBAT2, MaxBin2).
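The sketch-then-query flow can be illustrated with a toy example. The CRC32-based subsampling rule, the use of the median as the coverage estimate and all function names are assumptions for illustration; canonical (reverse-complement aware) k-mers are ignored, and this is not fairy's actual estimator.

```python
import zlib
from collections import Counter
from statistics import median

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def sampled(kmer, c):
    """Keep roughly 1 in c k-mers, chosen deterministically by hash value."""
    return zlib.crc32(kmer.encode()) % c == 0

def sketch_reads(reads, k, c):
    """Build a subsampled k-mer -> count table from the reads."""
    table = Counter()
    for read in reads:
        for km in kmers(read, k):
            if sampled(km, c):
                table[km] += 1
    return table

def contig_coverage(contig, table, k, c):
    """Estimate coverage as the median count of the contig's sampled k-mers."""
    hits = [table.get(km, 0) for km in kmers(contig, k) if sampled(km, c)]
    return median(hits) if hits else 0.0

contig = "ACGTACGGTC"
table = sketch_reads([contig] * 3, k=4, c=1)   # c=1 keeps every k-mer
covered = contig_coverage(contig, table, k=4, c=1)
absent = contig_coverage("TTTTTTTT", table, k=4, c=1)
```

Because the same hash rule selects k-mers on both the read side and the contig side, the lookup stays consistent for any subsampling rate `c`.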





□ Causal K-Means Clustering

>> https://arxiv.org/abs/2405.03083

Causal k-Means Clustering harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Their problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions.

They present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence.
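One way to picture a plug-in approach: first estimate the counterfactual mean outcome under each treatment for every unit, then run ordinary k-means on those estimated vectors. The sketch below assumes a single discrete covariate, stratum means as the regression estimator and a plain Lloyd's iteration; these are all simplifying assumptions, not the estimator studied in the paper.

```python
from collections import defaultdict
from math import dist

def fit_stratum_means(X, A, Y):
    """Estimate E[Y | X=x, A=a] by the sample mean within each (x, a) cell."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, a, y in zip(X, A, Y):
        sums[(x, a)][0] += y
        sums[(x, a)][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm, initialised with the first k distinct points."""
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    if len(centers) < k:
        raise ValueError("need at least k distinct points")
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

X = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4          # discrete covariate
A = [0, 0, 1, 1] * 4                               # binary treatment
Y = [0.0, 0.2, 1.0, 1.2, 0.1, 0.3, 0.9, 1.1,
     0.0, 0.2, 5.0, 5.2, 0.1, 0.1, 4.9, 5.1]
mu = fit_stratum_means(X, A, Y)
# Each unit is represented by its estimated counterfactual means (mu_0, mu_1).
points = [(mu[(x, 0)], mu[(x, 1)]) for x in X]
labels, centers = kmeans(points, k=2)
```

Units in strata 0 and 1 (small treatment effect) end up in one cluster and units in strata 2 and 3 (large effect) in the other. The sketch requires every stratum to contain both treated and untreated units.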

They also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models.





□ GoT–ChA: Mapping genotypes to chromatin accessibility profiles in single cells

>> https://www.nature.com/articles/s41586-024-07388-y

GoT–ChA (genotyping of targeted loci with single-cell chromatin accessibility) links genotypes to chromatin accessibility at single-cell resolution across thousands of cells within a single assay.

Integration of mitochondrial genome profiling and cell-surface protein expression measurement allowed expansion of genotyping onto DOGMA-seq through imputation, enabling single-cell capture of genotypes, chromatin accessibility, RNA expression and cell-surface protein expression.





□ stDyer enables spatial domain clustering with dynamic graph embedding

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593252v1

stDyer employs a Gaussian Mixture Variational AutoEncoder (GMVAE) with graph attention networks (GAT) and graph embedding in the latent space. stDyer enables deep representation learning and clustering from Gaussian Mixture Models (GMMs) simultaneously.

stDyer also introduces dynamic graphs to add more edges to a KNN spatial graph. Dynamic graphs can increase the likelihood that units at the domain boundaries establish connections with others belonging to the same spatial domain.

stDyer introduces mini-batch neighbor sampling to enable its application to large-scale datasets. stDyer is the first method that could enable multi-GPU training for spatial domain clustering.





□ xLSTM: Extended Long Short-Term Memory

>> https://arxiv.org/abs/2405.04517


Enhancing LSTM to xLSTM by exponential gating with memory mixing and a new memory structure. xLSTM models perform favorably on language modeling when compared to state-of-the-art methods like Transformers and State Space Models.

xLSTM is based on a matrix memory. LSTMs lack parallelizability because of memory mixing, i.e., the hidden-hidden connections between hidden states from one time step to the next, which enforce sequential processing.

An xLSTM architecture is constructed by residually stacking building blocks. An xLSTM block should non-linearly summarize the past in a high-dimensional space. Separating histories is the prerequisite to correctly predict the next sequence element such as the next token.





□ COEXIST: Coordinated single-cell integration of serial multiplexed tissue images

>> https://www.biorxiv.org/content/10.1101/2024.05.05.592573v1

COEXIST, a novel algorithm that synergistically combines shared molecular profiles with spatial information to seamlessly integrate serial sections at the single-cell level.

COEXIST not only elevates MTI platform validation but also overcomes the constraints of MTI's panel size and the limitation of full nuclei on a single slide, capturing more intact nuclei in consecutive sections and enabling deeper profiling of cell lineages and functional states.





□ Streamlining remote nanopore data access with slow5curl

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae016/7644676

Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelized data access requests to maximize download speeds.

The initiative is inspired by the SAM/BAM alignment data format and its many associated utilities, such as the remote client feature in samtools/htslib, which slow5curl emulates for nanopore signal data.





□ MerCat2: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from omics database

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae061/7657691

MerCat2 ("Mer - Catenate2") performs k-mer frequency counting for any length k on assembled contigs as nucleotide fasta, raw or trimmed reads (e.g., fastq), and translated protein-coding open reading frames (ORFs) as a protein fasta.

MerCat2 has two analysis modes utilizing nucleotide or protein files. In nucleotide mode, outputs include %G+C and %A+T content, contig assembly statistics, and raw/trimmed read quality reports. For protein mode, nucleotide files can be translated into ORFs.





□ Comparative Genome Viewer: whole-genome eukaryotic alignments

>> https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002405

Comparative Genome Viewer (CGV), a new visualization tool for analysis of whole-genome assembly-assembly alignments. CGV visualizes pairwise same-species and cross-species alignments provided by NCBI.

The main view of CGV takes the “stacked linear browser” approach—chromosomes from 2 assemblies are laid out horizontally with colored bands connecting regions of sequence alignment.

These sequence-based alignments can be used to analyze gene synteny conservation but can also expose similarities in regions outside known genes, e.g., ultraconserved regions that may be involved in gene regulation.





□ DiSMVC: a multi-view graph collaborative learning framework for measuring disease similarity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae306/7666859

DiSMVC is a supervised graph collaborative framework that includes two major modules. The first is a cross-view graph contrastive learning module, which aims to enrich disease representations by considering their underlying molecular mechanisms from both genetic and transcriptional views.

The second is an association pattern joint learning module, which can capture deep association patterns by incorporating phenotypically interpretable multimorbidities in a supervised manner.

DiSMVC can identify molecularly interpretable similar diseases, and the synergies gained from DiSMVC contributed to its superior performance in measuring disease similarity.






□ scDAPP: a comprehensive single-cell transcriptomics analysis pipeline optimized for cross-group comparison

>> https://www.biorxiv.org/content/10.1101/2024.05.06.592708v1

scDAPP (single-cell Differential Analysis and Processing Pipeline) implements critical options for using replicates to generate pseudobulk data automatically, which are more appropriate for cross-group comparisons, for both gene expression and cell composition analysis.

scDAPP uses DoubletFinder to predict doublets for removal from further analysis. DoubletFinder hyperparameters such as the homotypic doublet rate are automatically estimated for each sample using the number of cells and the empirical multiplet rate provided by 10X Genomics.
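The kind of arithmetic involved can be sketched as below. The figure of roughly 0.8% multiplets per 1,000 recovered cells is an assumed approximation of the 10X guidance, and the homotypic proportion as the sum of squared cluster fractions is a common heuristic; neither is confirmed here as scDAPP's exact computation, and the function name is hypothetical.

```python
def expected_doublets(n_cells, cluster_sizes, rate_per_1000=0.008):
    """Estimate the multiplet rate and the expected number of heterotypic doublets.

    rate_per_1000 is an assumed multiplet rate per 1,000 recovered cells.
    """
    multiplet_rate = rate_per_1000 * n_cells / 1000
    n_expected = round(multiplet_rate * n_cells)
    total = sum(cluster_sizes)
    # Chance that two randomly paired cells come from the same cluster.
    homotypic = sum((size / total) ** 2 for size in cluster_sizes)
    n_heterotypic = round(n_expected * (1 - homotypic))
    return multiplet_rate, homotypic, n_heterotypic

rate, homotypic, n_het = expected_doublets(5000, [2500, 1500, 1000])
```

For 5,000 cells this gives a multiplet rate of about 4%, a homotypic proportion of about 0.38 and 124 expected heterotypic doublets to remove.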





□ Direct transposition of native DNA for sensitive multimodal single-molecule sequencing

>> https://www.nature.com/articles/s41588-024-01748-0

SAMOSA by tagmentation (SAMOSA-Tag), which adds a concurrent channel for mapping chromatin structure. In SAMOSA-Tag, nuclei were methylated using the non-specific EcoGII m6dAase and tagmented in situ with hairpin-loaded transposomes.

DNA was purified, gap-repaired and sequenced, resulting in molecules where the ends resulted from Tn5 transposition, the m6dA marks represented fiber accessibility and computationally defined unmethylated ‘footprints’ captured protein–DNA interactions.





□ CAREx: context-aware read extension of paired-end sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05802-w

CAREx—a new read extension algorithm for Illumina PE data based on indel-free multiple-sequence-alignment (MSA). The key idea is to build MSAs of reads sequenced from the same genomic region.

CAREx gains efficiency by applying a variant of minhashing to quickly find a set of candidate reads which are similar to a query read with high probability and aligning with fast bit-parallel algorithms.
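Candidate lookup by minhashing can be sketched with the textbook scheme below. CAREx uses its own variant, so the hash function, `k`, the number of hash functions and the "share at least one minimum" rule are assumptions for illustration only; reads are assumed to be at least `k` bases long.

```python
import hashlib

def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_kmer(kmer, seed):
    """64-bit hash of a k-mer; the seed selects one of several hash functions."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(seq, k, num_hashes):
    """For each hash function, keep the smallest hash over the read's k-mers."""
    kms = kmer_set(seq, k)
    return [min(hash_kmer(km, seed) for km in kms) for seed in range(num_hashes)]

def candidates(query, reads, k=8, num_hashes=16):
    """Indices of reads sharing at least one signature entry with the query."""
    qsig = minhash_signature(query, k, num_hashes)
    found = []
    for idx, read in enumerate(reads):
        sig = minhash_signature(read, k, num_hashes)
        if any(a == b for a, b in zip(qsig, sig)):
            found.append(idx)
    return found

query = "ACGTACGTTGCAAGCT"
reads = [query, "TTTTTTTTTTTTTTTT"]
hits = candidates(query, reads)
```

Reads that share many k-mers with the query are likely to share a minimum under some hash function, so only those few candidates need the more expensive alignment step.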





□ wgbstools: A computational suite for DNA methylation sequencing data representation, visualization, and analysis

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593132v1

wgbstools is an extensive computational suite tailored for bisulfite sequencing data. It allows fast access and ultra-compact data representation, as well as machine learning and statistical analysis, and visualizations, from fragment-level to locus-specific representations.

wgbstools converts data from standard formats (e.g., bam, bed) into tailored compact yet useful and intuitive formats (pat, beta). These can be visualized in terminal, or analyzed in different ways - subsample, merge, slice, mix, segment and more.





□ fastCCLasso: a fast and efficient algorithm for estimating correlation matrix from compositional data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae314/7668443

FastCCLasso solves a penalized weighted least squares problem with the sparse assumption of the covariance matrix. Instead of the alternating direction method of multipliers, fastCCLasso introduces an auxiliary vector and provides a simple updating scheme in each iteration.

FastCCLasso only involves the calculation of multiplications between matrices and vectors and avoids the eigenvalue decomposition and multiplications of large dense matrices in CCLasso. The computational complexity of fastCCLasso is O(p^2) per iteration.





□ SCIPAC: quantitative estimation of cell-phenotype associations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03263-1

SCIPAC enables quantitative estimation of the strength of association between each cell in an scRNA-seq dataset and a phenotype, with the help of bulk RNA-seq data with phenotype information. SCIPAC enables the estimation of association between cells and an ordinal phenotype.

SCIPAC identifies cells in single-cell data that are associated with a given phenotype. This phenotype can be binary, ordinal, continuous, or survival. The association strength and its p-value between a cell cluster and the phenotype are given to all cells in the cluster.





□ Bayesian modelling of time series data (BayModTS) - a FAIR workflow to process sparse and highly variable data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae312/7671098

BayModTS, a FAIR workflow for processing time series data that incorporates process knowledge. BayModTS is designed for sparse data with low temporal resolution, a small number of replicates and high variability between replicates.

BayModTS is based on a simulation model, representing the underlying data generation process. This simulation model can be an Ordinary Differential Equation (ODE), a time-parameterised function, or any other dynamic modelling approach.

BayModTS infers the dynamics of time series data via Retarded Transient Functions. BayModTS uses Markov Chain Monte Carlo (MCMC) sampling. Parameter ensembles are simulated from the posterior distribution to transfer the uncertainty from the parameter to the data space.





□ Giraffe: a tool for comprehensive processing and visualization of multiple long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593289v1

Giraffe stands out by offering features that allow for the assessment of read quality, sequencing bias, and genomic regional methylation proportions of DNA reads and direct RNA sequencing reads.





□ RESHAPE: A resampling-based approach to share reference panels

>> https://www.nature.com/articles/s43588-024-00630-7

RESHAPE (Recombine and Share Haplotypes), a method that enables the generation of a synthetic haplotype reference panel by simulating hypothetical descendants of reference panel samples after a user-defined number of meioses.

This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation.
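The resampling idea can be sketched as a toy simulation in which haplotypes are paired at random and recombined over several generations. The uniform per-site crossover probability, the pairing scheme and the function names are assumptions for illustration; a real tool would presumably use a genetic map, and this is not RESHAPE's implementation.

```python
import random

def recombine(hap_a, hap_b, rate, rng):
    """Copy alleles from one parent haplotype, switching parent at crossovers."""
    child, source = [], rng.choice((0, 1))
    for a, b in zip(hap_a, hap_b):
        if rng.random() < rate:      # assumed uniform crossover probability
            source = 1 - source
        child.append(a if source == 0 else b)
    return child

def synthetic_panel(haplotypes, generations, rate=0.1, seed=0):
    """Simulate descendants of the panel after a given number of meioses."""
    rng = random.Random(seed)
    panel = [list(h) for h in haplotypes]
    for _ in range(generations):
        rng.shuffle(panel)
        nxt = []
        for i in range(0, len(panel) - 1, 2):
            a, b = panel[i], panel[i + 1]
            nxt.append(recombine(a, b, rate, rng))
            nxt.append(recombine(b, a, rate, rng))
        if len(panel) % 2:           # an unpaired haplotype passes through
            nxt.append(panel[-1])
        panel = nxt
    return panel

haps = [[0] * 6, [1] * 6, [0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]]
panel = synthetic_panel(haps, generations=3)
```

The synthetic panel keeps the original size and site count, while each synthetic haplotype becomes a mosaic of several source haplotypes, which is what weakens the link to any single original sample.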







iPad Pro (M4) 13”

2024-05-15 22:10:10 | Digital / Internet

『iPad Pro (M4) 13-inch Wi-Fi + Cellular』 (Ultra Retina XDR - Tandem OLED, Space Black, 2024)

>> https://www.apple.com/ipad-pro/

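The new 13-inch iPad Pro with the M4 chip has arrived! The emissive rendering that only OLED can deliver is stunningly beautiful, and it is a monster that stays lightweight and highly portable despite all this power. Data transfer over Thunderbolt is comfortable too.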






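iPad Pro 13” Ultra Retina XDR (Tandem OLED): the peak brightness of up to 1,600 nits is impressive, and each star twinkles so brilliantly that it seems you could reach out and touch it. I use a large-screen OLED TV in my living room, so I never doubted the effect, but being able to do actual work at this image quality feels almost too luxurious.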


iPad Pro (M4) 13-inch (2024)
M4 chip
10-core CPU
10-core GPU
16-core Neural Engine

Ultra Retina XDR display
ProMotion technology
P3 wide color
True Tone
Antireflective coating
Nano-texture display glass option on 1TB and 2TB models
12MP Wide camera
4K video, ProRes
Landscape 12MP Ultra Wide front camera
TrueDepth camera system

Brad Mehldau / “Après Fauré”

2024-05-14 20:41:54 | art music

□ Brad Mehldau / “Après Fauré”

>> https://www.bradmehldaumusic.com/apres-faure


□ Brad Mehldau / “Après Fauré: Nocturne No. 4 in E-Flat Major, Op. 36”
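Following "After Bach II", Mehldau adapts Gabriel Fauré in "After Fauré". In contrast to After Bach, where bold arrangements stand out, here he elevates Fauré's delicate nocturnes, in which one can feel the black keys, into contemporary piano music of rich fullness.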


□ Brad Mehldau / “Après Fauré: Caprice”


Après Fauré

Gabriel Fauré:
1. Nocturne No. 13 in B Minor, Op. 119 (1921)
2. Nocturne No. 4 in E-Flat Major, Op. 36 (c. 1884)
3. Nocturne No. 12 in E Minor, Op. 107 (1915)

Brad Mehldau:
4. Prelude
5. Caprice
6. Nocturne
7. Vision

Gabriel Fauré:
8. Nocturne No. 7 in C-Sharp Minor, Op. 74 (1898)
9. Extract from Piano Quartet No. 2, Opus 45 (c. 1887): III. Adagio non troppo

Piano: Brad Mehldau
Unknown: Tom Lazarus
Composer: Gabriel Fauré

Koyamame Roastery.

2024-05-14 20:30:30 | Art / Culture


□ Koyamame Roastery

>> https://www.koyamameroastery.com











GPT-4 Omni.

2024-05-14 20:08:08 | Science


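Tried GPT-4o (Omni) on repository analysis, speaker diarization, HTML generation and more. As a Plus user I could feel the dramatic speed-up, but the degradation in task performance is fatal, and as some have said, the hype (≈ overrating) stands out. The iOS version is highly polished as a conversational AI interface, so I look forward to further refinement.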






Brad Mehldau / "After Bach II"

2024-05-12 16:10:25 | art music

□ Brad Mehldau / "After Bach II"


□ Brad Mehldau / “Between Bach | Fugue No. 20 in A Minor”

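The sequel to Mehldau's "After Bach", probably one of the most important works on this century's jazz scene. Built around Bach's Well-Tempered Clavier as well as the Goldberg Variations, it voices poetry through a singular sensibility and cutting-edge interpretation.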

The Bach album comprises four preludes and one fugue from the Well-Tempered Clavier, as well as the Allemande from the fourth Partita, interspersed with seven compositions or improvisations by Mehldau inspired by the complementary works of Bach—including Mehldau’s Variations on Bach’s Goldberg Theme.


01. Prelude to Prelude
02. Prelude No. 9 in E Major from The Well-Tempered Clavier, Book I, BWV 854
03. Prelude No. 6 in D Minor from The Well-Tempered Clavier Book I, BWV 851
04. After Bach: Toccata
05. Partita for Keyboard No. 4 in D Major, BWV 828: II. Allemande
06. After Bach: Cavatina
07. Prelude No. 20 in A Minor from The Well-Tempered Clavier Book I, BWV 865
08. Between Bach
09. Fugue No. 20 in A Minor from The Well-Tempered Clavier Book I, BWV 865
10. Intermezzo

Variations on Bach’s Goldberg Theme:
11. Aria-like
12. Variation I, Minor 5/8 a
13. Variation II, Minor 5/8 b
14. Variation III, Major 7/4
15. Variation IV, Breakbeat
16. Variation V, Jazz
17. Variation VI, Finale
18. Prelude No. 7 in E-Flat Major from The Well-Tempered Clavier Book I, BWV 852
19. Postlude

Release Date: 10/05/2024
Label: Nonesuch
Brad Mehldau, piano
Recorded April 18-20, 2017 and June 21, 2023 at Mechanics Hall, Worcester, MA
Engineered, mixed, and mastered by Tom Lazarus
Additional mixing by Brian Montgomery
Piano technician: Barbara Renner
Production coordinator: Tom Korkidis

>> https://www.bradmehldaumusic.com/



After Bach: Cavatina


Prelude No. 7 in E-Flat Major from The Well-Tempered Clavier Book I, BWV 852