lens, align.

Long is the time, but the true comes to pass.

Celestial Longing for Flesh.

2024-09-19 21:19:39 | Science News

(Art by Gavin BIC)




□ Prophet: Scalable and universal prediction of cellular phenotypes

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607533v2.full.pdf

Prophet (Predictor of phenotypes), a transformer-based regression model that learns the relationships between cellular states, treatments, and phenotypic readouts. Its design enables pretraining on 4.7 million experiments across a broad spectrum of phenotypes from multiple independent datasets.

Prophet's architecture consists of 8 transformer encoder units with 8 attention heads per layer and a feed-forward network with a hidden dimensionality of 1,024 to generate a 512-dimensional embedding of each experiment.

Prophet leverages knowledge of cellular states and treatments by projecting prior knowledge-based representations into a common token space using neural networks as tokenizers. The readout representations are modeled as learnable embeddings, directly projected in the token space.






□ Genes2Genes: Gene-level alignment of single-cell trajectories

>> https://www.nature.com/articles/s41592-024-02378-4

Genes2Genes, a new framework for aligning single-cell pseudotime trajectories of a reference and query system at single-gene resolution. G2G utilizes a Dynamic Programming algorithm that handles matches and mismatches in a formal way.

Genes2Genes captures sequential matches and mismatches of individual genes between a reference and query trajectory, highlighting distinct clusters of alignment patterns. G2G computes a pairwise Levenshtein distance matrix across all five-state alignment strings.

Genes2Genes combines Gotoh’s algorithm with Dynamic Time Warping (DTW) and employs a Bayesian information-theoretic scoring scheme to quantify distances between gene expression distributions. G2G infers individual alignments for all genes.
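
The pairwise Levenshtein distance matrix over alignment strings mentioned above can be sketched as follows. This is a minimal illustration, not G2G's own code: the state letters (M, W, V, I, D) and the example strings are assumptions made for the demo.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit-cost substitution, insertion, deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[-1]


def distance_matrix(strings):
    """Symmetric matrix of pairwise Levenshtein distances."""
    n = len(strings)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = levenshtein(strings[i], strings[j])
    return dist


# Hypothetical per-gene alignment strings (one letter per aligned step).
alignment_strings = ["MMMMMM", "MMMMID", "IDMMMM"]
matrix = distance_matrix(alignment_strings)
```

Such a matrix could then be passed to any hierarchical clustering routine to group genes with similar alignment patterns.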





□ DeepPolisher: Highly accurate assembly polishing

>> https://www.biorxiv.org/content/10.1101/2024.09.17.613505v1

DeepPolisher, an encoder-only transformer model for assembly polishing. DeepPolisher predicts corrections to the underlying sequence using PacBio HiFi read alignments to a diploid assembly.

DeepPolisher introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions.






□ Biological arrow of time: Emergence of tangled information hierarchies and self-modelling dynamics

>> https://arxiv.org/abs/2409.12029

When macro-scale patterns are encoded within micro-scale components, it creates fundamental tensions between what is encodable at a particular evolutionary stage and what is potentially realisable in the environment.

A resolution of these tensions triggers an evolutionary transition which expands the problem-space, at the cost of generating new tensions in the expanded space, in a continual process. Biological complexification can be interpreted computation-theoretically, within the Gödel--Turing--Post recursion-theoretic framework.





□ CRAK-Velo: Chromatin Accessibility Kinetics integration improves RNA Velocity estimation

>> https://www.biorxiv.org/content/10.1101/2024.09.12.612736v1

CRAK-Velo (ChRomatin Accessibility Kinetics integration in RNA Velocity), a simpler model which directly integrates chromatin accessibility data in the estimation of individual gene transcription rates.

CRAK-Velo employs the PAGA graph approach. CRAK-Velo correctly recognises the cell states as independent terminally differentiated states. It achieves accurate reconstruction of complex dynamic flows, and superior capabilities in cell-type deconvolution.





□ OTVelo: Optimal transport reveals dynamic gene regulatory networks via gene velocity estimation

>> https://www.biorxiv.org/content/10.1101/2024.09.12.612590v1

OTVelo can predict past and future states of individual cells via an optimal-transport plan, which then allows us, via a finite-difference scheme, to calculate gene velocities for each cell at each time point.

OTVelo infers gene-to-gene interactions across consecutive time points by computing, and thresholding, time-lagged correlation or Granger causality of the gene velocities. OTVelo employs fused Gromov-Wasserstein optimal transport in cell space.
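
As a rough sketch of the finite-difference and time-lagged correlation steps described above: the code below works on one velocity series per gene, which is a strong simplification (OTVelo works per cell via the transport plan). All function names, the lag of 1, and the threshold are assumptions for illustration.

```python
from math import sqrt


def finite_difference(values, times):
    """Forward-difference velocity between consecutive time points."""
    return [(values[i + 1] - values[i]) / (times[i + 1] - times[i])
            for i in range(len(values) - 1)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_correlation(source, target, lag=1):
    """Correlate source velocity at time t with target velocity at t + lag."""
    if lag < 1 or lag >= len(source) or len(source) != len(target):
        raise ValueError("series must have equal length greater than lag")
    return pearson(source[:-lag], target[lag:])


def infer_edges(velocities, threshold, lag=1):
    """Keep directed gene pairs whose |lagged correlation| passes the threshold."""
    edges = []
    for src, v_src in velocities.items():
        for tgt, v_tgt in velocities.items():
            if src != tgt and abs(lagged_correlation(v_src, v_tgt, lag)) >= threshold:
                edges.append((src, tgt))
    return edges


# Toy velocities: gene B follows gene A with a delay of one time step.
velocities = {"A": [1, 2, 4, 3, 5], "B": [0, 1, 2, 4, 3]}
edges = infer_edges(velocities, threshold=0.9)
```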





□ CodonTransformer: a multispecies codon optimizer using context-aware neural networks

>> https://www.biorxiv.org/content/10.1101/2024.09.13.612903v1

CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all kingdoms of life.

CodonTransformer demonstrates context-awareness thanks to the attention mechanism and bidirectionality of its Transformer architecture, and to a novel sequence representation that combines organism, amino acid, and codon encodings.

CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimal negative cis-regulatory elements. This work also introduces a novel strategy, STREAM (Shared Token Representation and Encoding with Aligned Multi-masking).





□ Pangenome-Informed Language Models for Privacy-Preserving Synthetic Genome Sequence Generation

>> https://www.biorxiv.org/content/10.1101/2024.09.18.612131v1

Synthetic DNA data generation based on pangenomes in combination with pretrained language models. Pangenome-based Node Tokenization tokenizes the DNA sequences directly based on the nodes of the pangenome graph: each node is treated as a token.

Pangenome-based k-mer Tokenization tokenizes the DNA sequences based on the k-mers that are connected by the nodes in the pangenome graph. Instead of directly using the node IDs as the tokens, it tokenizes the sequences that they represent as non-overlapping k-mers.
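
The two tokenization schemes can be contrasted with a small sketch. The graph representation (a dict from node ID to sequence and a haplotype path of node IDs) is hypothetical, and cutting k-mers within each node, keeping a shorter trailing chunk, is an assumption rather than the paper's confirmed behavior.

```python
def node_tokenize(path):
    """Node tokenization: each visited node ID becomes one token."""
    return [f"node_{node_id}" for node_id in path]


def kmer_tokenize(path, node_sequences, k):
    """k-mer tokenization: split each node's sequence into non-overlapping k-mers.

    Assumption: k-mers do not span node boundaries, and a trailing chunk
    shorter than k is kept as its own token.
    """
    tokens = []
    for node_id in path:
        seq = node_sequences[node_id]
        tokens.extend(seq[i:i + k] for i in range(0, len(seq), k))
    return tokens


# Toy pangenome: three nodes, one haplotype walking through nodes 1 and 3.
node_sequences = {1: "ACGTAC", 2: "GG", 3: "TTTA"}
haplotype_path = [1, 3]

node_tokens = node_tokenize(haplotype_path)
kmer_tokens = kmer_tokenize(haplotype_path, node_sequences, k=3)
```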





□ ESCHR: a hyperparameter-randomized ensemble approach for robust clustering across diverse datasets.

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03386-5

ESCHR (EnSemble Clustering with Hyperparameter Randomization) performs ensemble clustering using randomized hyperparameters to obtain a set of base partitions.

This set of base partitions is represented using a bipartite graph where one type of node consists of all data points and one type of node consists of all clusters from all base partitions.

ESCHR performs Leiden community detection on a kNN graph using a randomly selected value for the required resolution-determining hyperparameter.
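
The bipartite representation of base partitions described above can be illustrated with a short sketch. The input format (one label list per base partition) and the node naming are assumptions; ESCHR's actual data structures may differ.

```python
def bipartite_edges(base_partitions):
    """Connect each data point to the cluster it belongs to in every base partition.

    base_partitions: list of label lists, one per base clustering run.
    Returns edges (point_index, (partition_index, cluster_label)).
    """
    edges = []
    for p_idx, labels in enumerate(base_partitions):
        for point, label in enumerate(labels):
            edges.append((point, (p_idx, label)))
    return edges


# Two toy base partitions over three data points.
base_partitions = [
    [0, 0, 1],  # run 0: points 0 and 1 together
    [0, 1, 1],  # run 1: points 1 and 2 together
]
edges = bipartite_edges(base_partitions)
cluster_nodes = {cluster for _, cluster in edges}
```

Each data point has exactly one edge per base partition, so points that are repeatedly co-clustered share many cluster-node neighbours.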





□ PangeBlocks: customized construction of pangenome graphs via maximal blocks

>> https://www.biorxiv.org/content/10.1101/2024.09.17.613426v1

By leveraging the notion of maximal block in a Multiple Sequence Alignment, they reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC).

pangeblocks, an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph.

pangeblocks generally produces graphs with fewer nodes, and in particular with significantly fewer nodes that are used by only a small percentage of the input genome sequences.





□ scCAFE: Unveiling multi-scale architectural features in single-cell Hi-C data

>> https://biorxiv.org/cgi/content/short/2024.09.10.611762v1

scCAFE (Calling Architectural FeaturEs at the single-cell level) utilizes multi-task learning techniques to predict 3D architectural elements from scHi-C data w/o relying on dense imputation. scCAFE can predict chromatin loops and reconstruct sparse contact maps.

In the scCAFE architecture, each input contact map is treated as a graph and passed through a GraphSAGE encoder to generate latent variables. These latent features are decoded by two decoders, Φ and Θ, to reconstruct the original contact maps and classify the loops, respectively.

Subsequently, the latent features are treated as an ordered sequence. They are input to a connectivity-constrained hierarchical clustering model for TLD predictions and fed to a hidden Markov model (HMM) for compartment predictions.





□ CREME: Interpreting cis-regulatory interactions from large-scale deep neural networks

>> https://www.nature.com/articles/s41588-024-01923-3

CREME (cis-regulatory element model explanations), an in silico perturbation toolkit that interprets the rules of gene regulation learned by a genomic DNN. CREME provides interpretations at various scales, incl. at a coarse-grained CRE level as well as a fine-grained motif level.

CREME is based on the notion that by fitting experimental data, the DNN essentially approximates the underlying function. It can be treated as a surrogate for the experimental assay, enabling in silico measurements for any sequence, assuming generalization under covariate shifts.





□ A Comparison of Tokenization Impact in Attention Based and State Space Genomic Language Models

>> https://www.biorxiv.org/content/10.1101/2024.09.09.612081v1

The authors propose a new definition of the tokenization metric of fertility, the token-per-word ratio, in the context of gLMs, and introduce the concept of tokenization parity to measure how consistently a tokenizer parses homologous sequences.

When using attention-based models, tokenization methods that compress the input, thereby increasing the total information per sample given to a model and significantly reducing the computational cost to train, are preferred.

In state-space models, where a limited context window is not a concern, the results indicate that character-based tokenization is the best choice for genomic language modeling. A slight increase in the depth of the model can improve performance when using character-based tokenization.





□ Novae: a graph-based foundation model for spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2024.09.09.612009v1

Novae, a self-supervised graph attention network that encodes local environments into spatial representations. Novae can operate with multiple gene panels, allowing for the application across diverse technologies and tissues.

Novae can compute relevant representations via zero-shot or fine-tuning on any new slide from any tissue. Novae provides a nested organization of spatial domains for different resolutions, and natively corrects batch effect across slides.





□ Carta: Inferring cell differentiation maps from lineage tracing data

>> https://www.biorxiv.org/content/10.1101/2024.09.09.611835v1

Carta employs a mixed-integer linear program (MILP) to solve a constrained maximum parsimony problem to infer (i) a cell differentiation map and (ii) an ancestral cell type labeling for a set of cell lineage trees.

Carta represents a cell differentiation map by a directed acyclic graph whose vertices are cell types and whose edges represent transitions (differentiation events) between cell types that occur during development.





□ Celcomen: spatial causal disentanglement for single-cell and tissue perturbation modeling

>> https://arxiv.org/abs/2409.05804

Celcomen leverages a mathematical causality framework to disentangle intra- and intercellular gene regulation programs in spatial transcriptomics and single-cell data through a generative graph neural network.

Simcomen leverages learned gene-gene relationships from CCC to model tissue behavior after cellular or genetic perturbation. It possesses generative properties to create tissue-condition representative spatial data given an established matrix of gene-gene relationships.





□ Doblin: Inferring dominant clonal lineages from DNA barcoding time-series

>> https://www.biorxiv.org/content/10.1101/2024.09.08.611892v1

Doblin, an R-based pipeline designed to extract meaningful insights from complex DNA barcoding time series data obtained through longitudinal sampling.

Doblin employs a clustering approach to group relative abundance trajectories based on their shape. This method effectively clusters lineages with similar relative abundance patterns, thereby reflecting comparable fitness levels.






□ scBubbletree: computational approach for visualization of single cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05927-y

scBubbletree identifies clusters of cells of similar transcriptomes and visualizes such clusters as “bubbles” at the tips of dendrograms. scBubbletree can cluster scRNA-seq data in two ways, namely by graph-based community detection algorithms (Louvain or Leiden) and by k-means.

scBubbletree relies on the R-package ggplot2. scBubbletree provides three functions for visualization of numeric cell attributes. Categorical cell attributes are visualized using a matrix of tiles in which columns represent specific attribute categories.





□ genomesizeR: An R package for genome size prediction

>> https://www.biorxiv.org/content/10.1101/2024.09.08.611926v1

genomesizeR uses statistical modelling on data from NCBI databases and provides three statistical methods for genome size prediction of a given taxon, or group of taxa. A frequentist random effect model uses nested genus and family information to output genome size estimates.

A straightforward weighted mean method identifies the closest taxa with available genome size information in the taxonomic tree and averages their genome sizes using weights based on taxonomic distance.
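
A minimal sketch of the weighted mean idea: the weighting function `1 / (1 + distance)` and the input format are assumptions chosen for illustration, and genomesizeR itself is an R package whose exact weighting may differ.

```python
def weighted_mean_genome_size(neighbours):
    """Average genome sizes of related taxa, down-weighting distant ones.

    neighbours: list of (genome_size_bp, taxonomic_distance) pairs, where
    distance 0 means the same taxon. Weight = 1 / (1 + distance) is an
    assumed, illustrative choice.
    """
    if not neighbours:
        raise ValueError("need at least one taxon with a known genome size")
    weights = [1.0 / (1.0 + distance) for _, distance in neighbours]
    total = sum(w * size for w, (size, _) in zip(weights, neighbours))
    return total / sum(weights)


# Two hypothetical reference taxa: one at distance 0, one at distance 1.
estimate = weighted_mean_genome_size([(100.0, 0), (200.0, 1)])
```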





□ m6AConquer: a Data Resource for Unified Quantification and Integration of m6A Detection Techniques

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612173v1

m6AConquer (Consistent Quantification of External m6A RNA Modification Data) establishes a consistent multi-omics data-sharing standard, summarizing quantitative m6A data from 10 detection techniques using a unified reference feature set.

m6AConquer standardizes site calling and m6A count matrix normalization procedures across platforms through a computational framework that accounts for over-dispersion in m6A levels.





□ YupanaNet: Brownian motion data augmentation: a method to push neural network performance on nanopore sensors

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612270v1

The Brownian motion data augmentation method and YupanaNet, a novel neural network architecture with residual connections and a self-attention block. The Brownian motion augmentation method, while simple, showcases enhanced results in the mentioned barcode classification task.

Although further refinements could consider factors like nanopore capacitance filtering effects and accurate thermal noise models on instantaneous velocity, this method presents a viable and accessible means of enhancing neural network performance in DNA-based nanopore sensing.





□ ScReNI: single-cell regulatory network inference through integrating scRNA-seq and scATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612385v1

ScReNI initially integrates unpaired scRNA-seq and scATAC-seq datasets by aligning them in a shared analytical space. It then establishes the association between genes and peaks across all cells.

ScReNI uses k-nearest neighbors and random forest algorithms to infer gene regulatory relationships for individual cells by modeling the integrated scRNA-seq and scATAC-seq data.





□ SVbyEye: A visual tool to characterize structural variation among whole genome assemblies

>> https://www.biorxiv.org/content/10.1101/2024.09.11.612418v1

SVbyEye, a data visualization R package, to facilitate direct observation of structural differences between two or more sequences. SVbyEye provides several visualization modes depending on application.

SVbyEye uses as input DNA sequence alignments in PAF format which can be easily generated with minimap2. SVbyEye has the ability to break PAF alignments at the positions of insertions and deletions and thereby delineate their breakpoints.





□ easybio: an R Package for Single-Cell Annotation with CellMarker2.0

>> https://www.biorxiv.org/content/10.1101/2024.09.14.609619v1

easybio, an R package designed to streamline single-cell annotation using the CellMarker2.0 database in conjunction with Seurat. easybio provides a suite of functions for querying the CellMarker2.0 database locally, offering insights into potential cell types for each cluster.

easybio operates independently of external reference datasets, thereby reducing the time and expertise required compared to manual annotation processes.





□ Colora: A Snakemake Workflow for Complete Chromosome-scale De Novo Genome Assembly

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612003v1

Colora requires PacBio HiFi and Hi-C reads as mandatory inputs, and ONT reads can be optionally integrated into the process. With Colora, it is possible to obtain a scaffolded primary assembly or a phased assembly with separate haplotypes.





□ DeepFuseNMF: Interpretable high-resolution dimension reduction of spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2024.09.12.612666v1

DeepFuseNMF (deep learning fused with NMF), a multi-modal dimension reduction framework to generate interpretable high-resolution representations of the ST data by leveraging histology images.

In DeepFuseNMF, a two-modal encoder is developed to identify the interpretable high-resolution representations by integrating the low-resolution spatial gene expression from ST data with the high-resolution histological features from histology images.

Then, a two-modal decoder uses the representations to recover the spatial gene expression and the histology image. Similar to NMF, the learnable loading matrix in the expression decoder lends interpretability to the high-resolution representation.





□ The Precise Basecalling of Short-Read Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.09.12.612746v1

The BioRNA complex is engineered from specific human tRNA, with the RNAi precursor (pre-miRNA) replacing the anticodon sequence. They prepared BioRNA nanopore sequencing libraries following the Nano-RNAseq protocol.

The scheme resolves the widespread 3' and 5' basecalling artifacts by balancing training reads to cover both the 3' and 5' ends. These artifacts can affect > 50 RNA nucleotides (> 15% of the length of a 0.3 kb molecule) and may therefore significantly compromise downstream bioinformatic analyses.





□ metagWGS: a comprehensive workflow to analyze metagenomic data using Illumina or PacBio HiFi reads

>> https://www.biorxiv.org/content/10.1101/2024.09.13.612854v1

metagWGS, a workflow implemented in Nextflow DSL2 that is able to analyze whole shotgun sequence metagenomic data. metagWGS is able to deal with Illumina short reads or PacBio HiFi reads. It is comprehensive as it analyzes contigs, genes and MAGs.

metagWGS produces a taxonomic abundance table from the contigs / MAGs. A list of non-binned contigs is provided. It produces a functional abundance table from the catalogue of genes found in the contigs. metagWGS includes an improved algorithm for automatic bin refinement.





□ Multi-pass, single-molecule nanopore reading of long protein strands

>> https://www.nature.com/articles/s41586-024-07935-7

A technique to reversibly thread long protein strands into a CsgG pore using electrophoresis, and then enzymatically pull them back out of the pore using the protein unfoldase and translocase activity of ClpX.

Unlike the rapid initial stage of threading the protein into the pore using electrophoretic force, the unfoldase-mediated translocation of proteins back out of the pore leads to slow, reproducible ionic current signals.

This method has resulted in the processive translocation of long proteins, enabling the detection of single amino acid substitutions and PTMs across protein strands up to hundreds of amino acids in length.

They have also developed an approach to rereading the same protein strand multiple times. Furthermore, this method enables the unfolding and translocation of a model folded protein domain for linear, end-to-end analysis.





□ MUSTARD: Trajectory-guided dimensionality reduction for multi-sample single-cell RNA-seq data reveals biologically relevant sample-level heterogeneity

>> https://www.biorxiv.org/content/10.1101/2024.09.14.613024v1

MUSTARD (MUlti-Sample Trajectory-Assisted Reduction of Dimensions), a trajectory-guided method for the dimension reduction of multi-sample scRNA-seq data.

MUSTARD utilizes single-cell resolution information to provide unsupervised low-dimensional representation of samples while simultaneously connecting the sample-level heterogeneity with gene modules and pseudotemporal patterns.

MUSTARD requires three inputs: a gene expression matrix for all cells, a categorical vector indicating which sample each cell belongs to, and the pseudotime values for each cell constructed based on the multi-sample scRNA-seq data.

MUSTARD formats the data into an order-3 temporal tensor with sample, gene, and pseudotime as its 3 dimensions. The tensor is decomposed into a sum of low-dimensional components, each consisting of a sample loading vector, a gene loading vector, and a temporal loading function.





□ QuickEd: High-performance exact sequence alignment based on bound-and-align

>> https://www.biorxiv.org/content/10.1101/2024.09.13.612714v1

QuickEd, a sequence alignment algorithm based on a bound-and-align strategy. First, QuickEd effectively bounds the maximum alignment-score using efficient heuristic strategies. Then, QuickEd utilizes this bound to reduce the computations required to produce the optimal alignment.

QuickEd's bound-and-align strategy reduces the O(n^2) complexity of traditional dynamic programming algorithms to O(ns), where n is the sequence length and s is an estimated upper bound of the alignment-score between the sequences.
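
The bound-and-align idea can be sketched for unit-cost edit distance. This is not QuickEd's implementation: the bounding step here is a deliberately simple stand-in (the score of a gapless alignment plus the length difference), whereas QuickEd uses its own heuristics. The align step is a standard banded dynamic program that only fills cells within s of the main diagonal, giving O(ns) work.

```python
def upper_bound(a: str, b: str) -> int:
    """Score of one valid alignment (gapless, then trailing indels).

    Any valid alignment's score is an upper bound on the edit distance.
    """
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def banded_edit_distance(a: str, b: str, s: int) -> int:
    """Edit distance restricted to cells with |i - j| <= s.

    Exact whenever the true distance is at most s.
    """
    n, m = len(a), len(b)
    if abs(n - m) > s:
        raise ValueError("band too narrow to reach the final cell")
    inf = float("inf")
    prev = {j: j for j in range(0, min(m, s) + 1)}  # row 0 inside the band
    for i in range(1, n + 1):
        curr = {}
        for j in range(max(0, i - s), min(m, i + s) + 1):
            if j == 0:
                curr[j] = i
                continue
            best = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])  # (mis)match
            best = min(best,
                       prev.get(j, inf) + 1,       # deletion from a
                       curr.get(j - 1, inf) + 1)   # insertion into a
            curr[j] = best
        prev = curr
    return prev[m]


def bound_and_align(a: str, b: str) -> int:
    """Bound first, then align inside the band implied by the bound."""
    return banded_edit_distance(a, b, upper_bound(a, b))
```

The tighter the bound from the first step, the narrower the band in the second, which is where the savings over the full O(n^2) table come from.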





□ CELEBRIMBOR: Core and accessory genes from metagenomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae542/7762100

CELEBRIMBOR (Core ELEment Bias Removal In Metagenome Binned ORthologs), an alternative method for core frequency threshold adjustment using genome completeness.

CELEBRIMBOR uses genome completeness, jointly with gene frequencies, to adjust the core frequency threshold in a single step by modelling the number of gene observations with a true frequency.





□ ArchMap: A web-based platform for reference-based analysis of single-cell datasets

>> https://www.biorxiv.org/content/10.1101/2024.09.19.613883v1

ArchMap is a free, no-code query-to-reference mapping framework that extends to Python-based mapping methods. ArchMap enables query-to-reference mapping and out-of-the-box cell type annotation for new data using existing references from a multitude of tissues.

ArchMap automatically calculates various performance metrics, including uncertainty quantification to evaluate mapping quality and identify novel or diseased cells. A CellGene plug-in allows for easy post-mapping visualization and marker gene identification.






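I tried reprogramming genomesizeR with 『OpenAI: o1』. It used ggplot2 for the histogram and kernel density estimation, and ran a Shapiro-Wilk test. My impression is that the coding ability that seemed to have degraded in 4o has recovered. Can it win users back from Claude?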


Luke Howard & Nadje Noordhuis / “Ten Sails”

2024-09-19 18:19:53 | art music

□ Luke Howard & Nadje Noordhuis / “Ten Sails”

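Poetic instrumentals by contemporary composer Luke Howard, paired with the folky, somewhat lonely voice of trumpeter Nadje Noordhuis. It brings to mind the silhouette of a yacht bobbing gently against the backlight of a glittering water surface.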



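□ Luke Howard & Nadje Noordhuis / “Bluebird”

A classic album I want to hear when the air turns chilly from the end of summer into autumn. Piano tones drifting on the water's surface, and a trumpet that seems to mourn the passing summer.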

Beyond Bach.

2024-09-18 00:31:21 | art music


'Beyond Bach' is the second single from @KsenijaSidorova's upcoming album, 'Crossroads' with @sinfoniettariga and Normunds Šnē.

Beyond Bach (Arr. for Accordion by George Morton and Ksenija Sidorova) · Ksenija Sidorova

Crossroads

℗ Alpha Classics / Outhere Music France & Latvijas Koncerti

Released on: 2024-10-18

Arranger: George Morton
Arranger: Ksenija Sidorova
Producer: Louise Burel
Composer: Gabriela Montero


□ Ksenija Sidorova / “Beyond Bach (Arr. for Accordion by George Morton and Ksenija Sidorova)”

Executor.

2024-09-13 21:19:39 | Science News

(Created with Midjourney v6.1)




□ Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold

>> https://arxiv.org/abs/2408.14608

Meta Flow Matching (MFM) is the amortization of the Flow Matching generative modeling framework. By integrating along vector fields of the Wasserstein manifold, MFM allows for a more comprehensive model of dynamical systems with interacting particles.


MFM leverages graph neural networks to embed the initial population. Meta Flow Matching learns to integrate a vector field for every starting density. It defines a push-forward measure that integrates along the underlying vector field.





□ DeepKINET: a deep generative model for estimating single-cell RNA splicing and degradation rates

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03367-8

DeepKINET uses a deep generative model of mature and immature transcripts based on an RNA velocity equation. This enables optimization in which the splicing and degradation rates are adjusted according to the cell state.

DeepKINET assumes that the kinetic parameters for each cell are obtained from transformation of the latent cell state by the neural network. DeepKINET provides biologically meaningful insights by accounting for cellular heterogeneity in kinetic rates.





□ CAP-seq: High-coverage, massively parallel sequencing of single-cell genomes

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612220v1

CAP-seq (single-cell genomic sequencing using compartments with adjusted permeability) employs semi-permeable compartments that allow reagent exchange while retaining large DNA fragments, enabling efficient genome processing.

Once the genomic DNA is processed, the CAPs, now containing single-cell genomes, are co-encapsulated with DNA barcode beads in droplets (second microfluidic step). This step assigns each genome a unique cell barcode.

Afterward, the CAPs are extracted from the droplets, washed, and dissolved to release the barcoded DNA fragments (~1 kb). These fragments are then further amplified and prepared for nanopore sequencing.

Finally, the sequenced reads are categorized into individual SAGs based on their cell barcodes, yielding high-coverage genomes with significantly improved throughput and resolution.





□ A near-tight lower bound on the density of forward sampling schemes

>> https://www.biorxiv.org/content/10.1101/2024.09.06.611668v1

Proving a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, the authors find optimal schemes and observe that this bound is tight when k ≡ 1 (mod w). For large w + k, the bound can be approximated by ⌈(w+k)/w⌉ / (w+k).

With the default minimap2 HiFi settings w = 19 and k = 19, the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al., is at most 3% denser than optimal, compared to the previous gap of at most 50%.

Furthermore, when k ≡ 1 (mod w) and the alphabet size σ → ∞, mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching the lower bound.
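
The large-(w + k) approximation quoted above, ⌈(w+k)/w⌉ / (w+k), is easy to evaluate exactly. The helper below is only a sketch of that approximation, not the paper's full bound, and its name is made up for this example.

```python
from fractions import Fraction


def approx_density_lower_bound(w: int, k: int) -> Fraction:
    """Evaluate ceil((w + k) / w) / (w + k) exactly.

    w: window size (number of k-mers per window), k: k-mer length.
    """
    if w < 1 or k < 1:
        raise ValueError("w and k must be positive")
    ceil_ratio = -(-(w + k) // w)  # integer ceiling of (w + k) / w
    return Fraction(ceil_ratio, w + k)


# minimap2 HiFi defaults mentioned in the text.
bound_hifi = approx_density_lower_bound(w=19, k=19)  # Fraction(1, 19)
```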





□ Personalized pangenome references

>> https://www.nature.com/articles/s41592-024-02407-2

A personalized pangenome reference by sampling haplotypes. It works directly w/ assembled haplotypes and maintains phasing w/in 10 kbp blocks. The sampled graph is a subgraph of the original graph. Therefore, any alignments in the sampled graph are valid in the original graph.

This approach is tailored for Giraffe, as the indexes it needs for read mapping can be built quickly. It assumes a graph with a linear high-level structure, such as graphs built using the Minigraph-Cactus pipeline.

It further assumes that read coverage is high enough (at least 20x) that we can reliably classify k-mers into absent, heterozygous and homozygous according to k-mer counts.
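
To make the k-mer classification step concrete, here is a toy rule that assigns each k-mer to the class whose expected count is closest. The expected counts (0, half the coverage, full coverage) and the nearest-value rule are assumptions for illustration; the actual method may use different thresholds.

```python
def classify_kmer(count: int, coverage: float) -> str:
    """Label a k-mer by the nearest expected count.

    Assumed expectations: absent ~ 0, heterozygous ~ coverage / 2,
    homozygous ~ coverage.
    """
    expected = {
        "absent": 0.0,
        "heterozygous": coverage / 2.0,
        "homozygous": coverage,
    }
    return min(expected, key=lambda label: abs(count - expected[label]))


# Hypothetical k-mer counts from a 30x read set.
labels = {kmer: classify_kmer(c, coverage=30.0)
          for kmer, c in {"ACGTA": 2, "CGTAC": 14, "GTACG": 31}.items()}
```

At low coverage the three count distributions overlap, which is why the text requires roughly 20x or more for reliable classification.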





□ Distinguishing word identity and sequence context in DNA language models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05869-5

A method to interrogate model learning, relevant both for the interpretability of the model and for evaluating its potential for specific tasks. The authors used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens.

Through the design of using tokens from overlapping k-mers, unmasked sequence partially shares sequence with the masked tokens. The central nucleotide of the combined masked tokens is the only nucleotide that is completely masked.
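
The claim that only the central nucleotide is completely masked can be checked with a small sketch. It assumes stride-1 overlapping k-mer tokens and a mask covering k consecutive tokens; the function names are hypothetical.

```python
def kmer_tokens(seq: str, k: int):
    """Overlapping k-mer tokens with stride 1; token t covers seq[t:t + k]."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fully_masked_positions(seq_len: int, k: int, mask_start: int, mask_len: int):
    """Nucleotide positions that appear only in masked tokens."""
    n_tokens = seq_len - k + 1
    masked = set(range(mask_start, mask_start + mask_len))
    hidden = []
    for pos in range(seq_len):
        # Tokens whose span [t, t + k) contains this nucleotide.
        covering = range(max(0, pos - k + 1), min(pos, n_tokens - 1) + 1)
        if all(t in masked for t in covering):
            hidden.append(pos)
    return hidden


# 20-nt sequence, 6-mers, masking tokens 5..10 (k consecutive tokens).
hidden = fully_masked_positions(seq_len=20, k=6, mask_start=5, mask_len=6)
```

Every other nucleotide under the mask still appears in at least one unmasked neighbouring token, so the model can partly read it from context.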

Next-k-mer prediction is a task that requires learning of context beyond token identity. It can thus serve as a measure of the potential for models to be used to discover new genome biology that goes beyond mechanisms associated with recurrent motifs and sequence content.





□ NetID: Scalable identification of lineage-specific gene regulatory networks from metacells

>> https://www.biorxiv.org/content/10.1101/2024.09.08.611796v1

The NetID algorithm builds on the metacell concept applied to pruned KNN graphs. NetID preserves biological covariation of gene expression, and outperforms GRN inference with imputation-based methods.

NetID integrates GENIE for GRN inference from the Granger causal model. By incorporating cell fate probability, it enables the inference of cell-lineage specific GRNs, which permit the recovery of ground-truth network motifs driven by lineage-determining transcription factors.





□ Methven: Predicting the effect of non-coding mutations on single-cell DNA methylation using deep learning

>> https://www.biorxiv.org/content/10.1101/2024.09.03.611114v1

Methven can predict the effects of non-coding mutations on DNA methylation at single-cell resolution. Methven supports dual tasks: classification to determine the direction of methylation change and regression to quantify its magnitude, enhancing predictive accuracy.

Methven integrates DNA sequences with ATAC-seq data using a divide-and-conquer strategy that addresses SNP-CpG interactions across variable distances up to 100kbp with a lightweight architecture.





□ GenoM7GNet: An Efficient N7-methylguanosine Site Prediction Approach Based on a Nucleotide Language Model

>> https://www.biorxiv.org/content/10.1101/2024.09.03.610976v1

GenoM7GNet, an efficient deep learning prediction model utilizing a nucleotide language model. GenoM7GNet primarily comprises two parts: a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and a CNN model.

GenoM7GNet utilizes a DNABERT model pre-trained on human genomic data as an embedding layer to embed tokens into real-valued vectors. GenoM7GNet employs a one-dimensional CNN to learn from the vectors output by the BERT embedding layer, thereby achieving the identification of m7G sites.





□ μFormer: Accelerating protein engineering with fitness landscape modeling and reinforcement learning

>> https://www.biorxiv.org/content/10.1101/2023.11.16.565910v3

μFormer can handle a variety of challenging scenarios, including a limited number of measurements, orphan proteins with few homologs, complicated variants with multiple-point mutations, insertions and deletions, and mutants exhibiting hyperactivation.

μFormer exploits the pairwise masked language model (PMLM) which considers the dependency among masked tokens, taking into account the joint probability of a token pair. μFormer effectively identifies high-functioning variants with multi-point mutations.





□ LevSeq: Rapid Generation of Sequence-Function Data for Directed Evolution and Machine Learning

>> https://www.biorxiv.org/content/10.1101/2024.09.04.611255v1

LevSeq (Long-read every variant Sequencing), a pipeline that combines a dual barcoding strategy with nanopore sequencing to rapidly generate sequence-function data for entire protein-coding genes.

LevSeq reduces screening burden by enabling removal of sequences with no mutations, stop codons, and deletions. The pipeline facilitates data-driven protein engineering by consolidating sequence-function data to inform directed evolution.





□ SINUM: Inference of single-cell network using mutual information for scRNA-seq data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05895-3

SINUM (a SIngle-cell Network Using Mutual information) integrates a measure of MI with the hypotheses of various dependent relations used in CSN to determine whether any given two genes are dependent or independent in a specific cell and further builds the undirected network.

SINUM SCNs can be transformed into a network degree matrix (DM) by counting and normalizing the number of edges connected to every gene in each SCN. The DM has the same dimensions as the original gene expression matrix.
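The degree-matrix step can be sketched in a few lines. This is only an illustration: the function name `degree_matrix`, the edge-list input format, and the normalization (dividing by the maximum possible degree, `n_genes - 1`) are assumptions, not SINUM's actual code or normalization.

```python
def degree_matrix(networks, n_genes):
    """Build a genes x cells matrix of normalized degrees.

    networks: one entry per cell; each entry is a collection of undirected
              edges (i, j) between gene indices, each edge listed once.
    Normalizing by (n_genes - 1) is an assumption made for this sketch.
    """
    n_cells = len(networks)
    dm = [[0.0] * n_cells for _ in range(n_genes)]
    for cell, edges in enumerate(networks):
        for i, j in edges:
            # An undirected edge adds one to the degree of both endpoints.
            dm[i][cell] += 1
            dm[j][cell] += 1
    max_degree = max(n_genes - 1, 1)
    return [[d / max_degree for d in row] for row in dm]


# Toy input: 3 genes, 2 cells (one single-cell network per cell).
toy_networks = [
    [(0, 1), (1, 2)],  # cell 0
    [(0, 2)],          # cell 1
]
toy_dm = degree_matrix(toy_networks, n_genes=3)
```

The result has one row per gene and one column per cell, matching the shape of the expression matrix as the text describes.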





□ Ultrack: pushing the limits of cell tracking across biological scales

>> https://www.biorxiv.org/content/10.1101/2024.09.02.610652v1

Ultrack leverages information from adjacent time points to resolve large-scale cell segmentation and tracking ambiguities. Ultrack can track cells (or nuclei) in 2D, 3D, and multichannel datasets, accommodating a wide range of biological contexts.

Ultrack employs temporal consistency to select the most accurate segments. Ultrack builds segmentation hypotheses between frames for tracking and solves an Integer Linear Programming (ILP) problem to identify cell segments and their trajectories.





□ Ropebwt3: BWT construction and search at the terabase scale

>> https://arxiv.org/abs/2409.00613

ropebwt3 computes the partial multi-string Burrows-Wheeler Transform (BWT) of a subset of sequences with libsais and merges the partial BWT into the existing BWT run-length encoded as a B+-tree. It repeats this procedure until all input sequences are processed.
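For readers unfamiliar with the transform itself, here is a naive sketch of a BWT followed by run-length encoding. It handles a single string by sorting rotations, which is quadratic and nothing like what ropebwt3 does (no libsais, no multi-string merging, no B+-tree); the function names and the `$` sentinel are choices made for this example.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform via sorted rotations.

    The sentinel must not occur in the text and must sort before
    every other character ('$' does for A/C/G/T and lowercase letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse runs of identical characters into (char, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


transformed = bwt("banana")
runs = run_length_encode(transformed)
```

Repetitive inputs such as pangenomes produce long runs in the BWT, which is why a run-length encoded representation stays compact.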

The BWT by default includes input sequences on both strands. This enables forward-backward search required by accelerated long MEM finding. Ropebwt3 could index 100 assembled human genomes in 21 hours and index 7.3 terabases of commonly studied bacterial assemblies in 26 days.

Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties using a revised BWA-SW algorithm, and can retrieve all distinct local haplotypes matching a query sequence.





□ SCIntRuler: Guiding the integration of multiple single-cell RNA-seq datasets with a novel statistical metric

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae537/7748406

SCIntRuler, a hypothesis-based testing framework that evaluates within-sample and cross-sample similarities of cell groups. The inputs of SCIntRuler include an scRNA-seq gene expression matrix and the study or batch information.

SCIntRuler outputs a numeric ratio that represents the level of information sharing across datasets and a figure illustrating the permutation test-based p value versus the relative between-within cluster distances.





□ AIGS: Interpretable scRNA-seq Analysis with Intelligent Gene Selection

>> https://www.biorxiv.org/content/10.1101/2024.09.01.610665v1

AIGS distinguishes itself from other frameworks by utilizing an intelligent gene selection algorithm that targets genes which indicate cell types, a minority of all genes that provide the most informative data on cell types.

AIGS systematically identifies class-indicating genes based on the normalized mutual information (NMI) between the learned pseudo-labels and quantified genes, effectively reducing data dimensionality and mitigating the negative impact of dropouts.
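A minimal sketch of NMI-based gene ranking, assuming expression has already been discretized per gene. The helper names, the binarized toy data, and the choice of normalizing by the geometric mean of the two entropies are all assumptions for illustration; AIGS may discretize and normalize differently.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a discrete label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def nmi(x, y):
    """MI normalized by sqrt(H(x) * H(y)); 0.0 if either entropy is zero."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0
    return mutual_information(x, y) / sqrt(hx * hy)


def rank_genes(pseudo_labels, discretized_genes):
    """Return (gene, NMI) pairs sorted from most to least informative."""
    scores = {g: nmi(pseudo_labels, v) for g, v in discretized_genes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: 4 cells, 2 pseudo-label clusters, 2 binarized genes.
pseudo_labels = [0, 0, 1, 1]
genes = {"geneA": [0, 0, 1, 1], "geneB": [0, 1, 0, 1]}
ranking = rank_genes(pseudo_labels, genes)
```

Here `geneA` tracks the pseudo-labels perfectly and ranks first, while `geneB` is independent of them and scores zero; keeping only the top-ranked genes is the dimensionality reduction step.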





□ HBIcloud: An Integrative Multi-Omics Analysis Platform

>> https://www.biorxiv.org/content/10.1101/2024.08.31.607334v1

HBIcloud offers a suite of 94 tools covering various omics disciplines. For genomics, it includes tools for sequence alignment, variant calling, genome assembly, and annotation.

HBIcloud also provides tools for differential gene expression analysis, transcript assembly, and functional annotation. It offers tools for phenotype data analysis. The platform includes tools for multi-omics integration, such as clustering, dimensionality reduction, and network analysis.





□ VCF observer: a user-friendly software tool for preliminary VCF file analysis and comparison

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05860-0

VCF Observer is a web tool for VCF file analysis and comparison. It can calculate similarity between VCF files and benchmark them against user-provided validation sets.

VCF Observer supports the dynamic grouping of multiple VCF files based on user supplied metadata, facilitating the interpretation of relations between different sets of VCF files. It can also filter VCF files based on genomic regions and the filter status of variants.





□ scPS: A distribution-free and analytic method for power and sample size calculation in single-cell differential expression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae540/7749386

scPS utilizes the distribution-free generalized estimating equations (GEE) approach. This method begins with normalized pilot data, allowing flexibility in normalization methods and making no assumptions about data distributions.

scPS is distribution-free and only learns the mean-variance relationship from pilot data. A given data distribution defines a specific mean-variance relationship, but a given mean-variance relationship does not define a specific distribution.

scPS accounts for cell-cell correlations within individual samples rather than assuming cell independence. If there is no intra-sample correlation, scPS simplifies to a cell-cell independence model.





□ Genotype inference from aggregated chromatin accessibility data reveals genetic regulatory mechanisms

>> https://www.biorxiv.org/content/10.1101/2024.09.04.610850v1

Genotypes are called with a pipeline that applies Gencove's low-pass sequencing methods to ATAC-seq reads in accessible chromatin. It uses imputation to infer genotypes for variants located outside the regions covered by observed reads.

Based on comparisons across various peak-calling approaches, they finalized a pipeline based on Genrich, an ATAC-seq specific method for collectively calling peaks across large, diverse data sets and quantifying accessibility in each peak.





□ If we built a neural network where the weights were lenses instead of vectors, for instance, and the input was light-shaped, the inference cost would be zero.

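The concept of a laser neural network. Leaving aside whether it is feasible, wouldn't the speed of light become the bottleneck as convolution circuits are integrated at scale...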





□ CSV-Filter: a deep learning-based comprehensive structural variant filtering method for both short and long reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae539/7750355

CSV-Filter, a deep learning-based SV filtering tool for both short / long reads. CSV-Filter uses a multi-level grayscale image encoding method based on the CIGAR string in the sequence alignment information, which ensures the robust applicability to both short / long reads.

CSV-Filter employs transfer learning by fine-tuning a self-supervised pre-trained model, which boosts the model's accuracy and generalization ability and significantly reduces the need for the large amounts of annotated data that traditional supervised CNN models require.





□ KegAlign: Optimizing pairwise alignments with diagonal partitioning

>> https://www.biorxiv.org/content/10.1101/2024.09.02.610839v1

lastZ is a very sensitive and yet equally slow tool. KegAlign is an optimized GPU-enabled pairwise aligner that incorporates a new parallelization strategy, diagonal partitioning, with the latest features of modern GPUs.

With KegAlign a typical human/mouse alignment can be computed in under 6 hours on a machine containing a single NVidia A100 GPU and 80 CPU cores without the need for any pre-partitioning of input sequences: a ~150x improvement over lastZ.

While other pairwise aligners can complete this task in a fraction of that time, none achieves the sensitivity of KegAlign's main alignment engine, lastZ, and thus may not be suitable for comparing divergent genomes.





□ DcjComm: Dimension reduction, cell clustering, and cell–cell communication inference for single-cell transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03385-6

DcjComm takes a single-cell gene expression matrix as input and then processes it through a preprocessing step to obtain the preprocessed matrix.

DcjComm performs dimension reduction by projected matrix decomposition and cell clustering by non-negative matrix factorization. DcjComm then uses a statistical inference model to infer cell-cell communications (CCCs) by integrating intercellular and related intracellular signals.





□ Scywalker: Scalable end-to-end data analysis workflow for long-read single-cell transcriptome sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae549/7754485

scywalker, an innovative and scalable package developed to comprehensively analyze long-read sequencing data of full-length single-cell or single-nuclei cDNA.

scywalker uses novel scalable methods for cell barcode demultiplexing and single-cell isoform calling and quantification, and incorporates these in an easily deployable package.

Scywalker streamlines the entire analysis process, from sequenced fragments in FASTQ format to demultiplexed pseudobulk isoform counts, into a single command suitable for execution on either server or cluster.





□ CellMATE: Unlocking cross-modal interplay of single-cell and spatial joint profiling

>> https://www.biorxiv.org/content/10.1101/2024.09.06.610031v1

CellMATE utilizes a multi-head adversarial training module to enable nonlinear early-integration of sc-multiomics. The input multimodal data, concatenation of features from all modalities, is simultaneously used to learn a modal-free low-dimensional stochastic latent space.

CellMATE adeptly captures both the additive and synergistic advantages of joint profiling. CellMATE is robust across diverse paired sc-multimodal scenarios, showcasing its unparalleled capability to elucidate synergistic strength even amidst modal discrepancies.





□ Uncertainty quantification in high-dimensional linear models incorporating graphical structures with applications to gene set analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae541/7754484

GCDL (the graph-constrained desparsified LASSO), a new procedure that makes use of auxiliary network information in a high-dimensional linear model.

GCDL combines the LASSO and the Laplacian quadratic as the penalty function. GCDL uses the Laplacian quadratic penalty to encourage smoothness among coefficients associated with the correlated predictors.
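Written out, a penalty of this kind typically takes the following form. This is a sketch of the general graph-constrained LASSO objective, not a formula copied from the paper: the symbols $\lambda_1$, $\lambda_2$, $w_{jk}$, the $1/(2n)$ scaling, and the use of the unnormalized Laplacian $L = D - W$ are assumptions.

```latex
\hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^p}
    \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
    + \lambda_1 \lVert \beta \rVert_1
    + \lambda_2\, \beta^{\top} L \beta ,
\qquad
\beta^{\top} L \beta
  = \sum_{(j,k) \in E} w_{jk}\,(\beta_j - \beta_k)^2 .
```

The second identity shows why the Laplacian term encourages smoothness: it grows whenever two predictors connected in the network receive very different coefficients.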





□ HiCMC: High-Efficiency Contact Matrix Compressor

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05907-2

HiCMC achieves better performance by exploiting the underlying properties of contact matrices, such as their symmetry and correlations between genomic distance and interactions, as well as further hierarchical structures of chromosomal organization reflected in the matrices.

The HiCMC compression pipeline consists of splitting the genome-wide contact matrix into intra- and inter-chromosomal contact matrices, row/column masking, model-based transformation, row binarization, and entropy coding.





□ A semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data

>> https://www.biorxiv.org/content/10.1101/2024.09.05.611521v1

Treating all zeros as missing values would not significantly alter analysis results if the proportion of structural zeros is similar for all taxa, and they propose a semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data.

The random-selection-and-amalgamation approach implemented in MIC avoids the problems of high sparsity and high dimensionality while capturing some of the dependence structure among taxa. It also allows for multiple imputations.





□ PASSAGE: Learning phenotype associated signature in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.09.06.611564v1

PASSAGE (Phenotype Associated Spatial Signature Analysis with Graph-based Embedding) combines graph attention auto-encoder (GATE)-based cell/spot-level spatial encoding with slice-level information aggregation through a dedicated attention pooling strategy.

PASSAGE introduces a dedicated attention pooling layer that aggregates the embeddings of all cells/spots within each slice into a single slice-level embedding, which functions as a learnable dynamic averaging process capable of focusing on specific spatial regions.
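The "learnable dynamic averaging" idea can be illustrated with a generic softmax attention pooling. This is a sketch only: the dot-product scoring against a single query vector is an assumption, the query is fixed here rather than learned, and PASSAGE's actual pooling layer may be parameterized differently.

```python
from math import exp


def attention_pool(embeddings, query):
    """Pool spot embeddings into one slice embedding.

    embeddings: list of equal-length vectors, one per cell/spot.
    query: vector of the same length (learned in a real model; fixed here).
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(e_k * q_k for e_k, q_k in zip(e, query)) for e in embeddings]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    pooled = [
        sum(w * e[k] for w, e in zip(weights, embeddings))
        for k in range(len(query))
    ]
    return pooled, weights


spots = [[1.0, 0.0], [0.0, 1.0]]
# A zero query gives uniform weights, i.e. plain averaging.
mean_pooled, uniform_weights = attention_pool(spots, [0.0, 0.0])
# A query aligned with the first spot focuses the pooling on it.
focused_pooled, focused_weights = attention_pool(spots, [10.0, 0.0])
```

With a zero query the layer reduces to an ordinary mean; as the query changes, the weights shift toward the spots it scores highly, which is how the pooling can focus on specific regions.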





□ mgikit: Demultiplexing toolkit for MGI fastq files

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae554/7755041

mgikit is a tool collection to demultiplex MGI fastq data, reformat it effectively and produce visual quality reports. mgikit overcomes several limitations of the standard MGI demultiplexer.

mgikit generates all possible indices from the indices in the sample sheet allowing 0 to m mismatches and assigning these indices to the relevant samples.
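The index expansion can be sketched as follows. The function names, the A/C/G/T alphabet, and the policy of dropping variants that match more than one sample are assumptions for this example, not mgikit's actual implementation.

```python
from itertools import combinations, product


def index_variants(index, max_mismatches, alphabet="ACGT"):
    """All sequences within 0..max_mismatches substitutions of `index`."""
    variants = {index}
    for m in range(1, max_mismatches + 1):
        for positions in combinations(range(len(index)), m):
            # At each chosen position, try every base except the original.
            choices = [[c for c in alphabet if c != index[p]] for p in positions]
            for substitution in product(*choices):
                seq = list(index)
                for p, c in zip(positions, substitution):
                    seq[p] = c
                variants.add("".join(seq))
    return variants


def build_lookup(sample_sheet, max_mismatches):
    """Map every allowed index variant to its sample.

    sample_sheet: dict of sample name -> index sequence.
    Variants reachable from more than one sample are dropped
    (an assumed policy for this sketch).
    """
    lookup, ambiguous = {}, set()
    for sample, index in sample_sheet.items():
        for variant in index_variants(index, max_mismatches):
            if variant in lookup and lookup[variant] != sample:
                ambiguous.add(variant)
            lookup[variant] = sample
    for variant in ambiguous:
        del lookup[variant]
    return lookup


sheet = {"sample1": "AAAA", "sample2": "TTTT"}
lookup_1mm = build_lookup(sheet, max_mismatches=1)
lookup_2mm = build_lookup(sheet, max_mismatches=2)
```

Precomputing the table turns demultiplexing into a dictionary lookup per read, at the cost of a table that grows quickly with index length and the number of allowed mismatches.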





□ NucBalancer: Streamlining Barcode Sequence Selection for Optimal Sample Pooling for Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.09.06.611747v1

NucBalancer is a versatile tool designed to assist in optimizing nucleotide pooling strategies for high-throughput genomic analyses. The tool evaluates nucleotide distribution uniformity across positions and allows users to set customizable red flag thresholds.

NucBalancer ensures optimal results while accommodating variability. NucBalancer employs a comprehensive assessment mechanism to gauge the adherence of a nucleotide pooling set to the desired nucleotide distribution range.
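A rough sketch of a per-position balance check of this kind. The function names and the threshold defaults (`low=0.125`, `high=0.625`) are hypothetical values chosen for illustration; they are not NucBalancer's defaults or interface.

```python
from collections import Counter


def position_fractions(barcodes):
    """Per-position fraction of A/C/G/T across equal-length barcodes."""
    length = len(barcodes[0])
    if any(len(b) != length for b in barcodes):
        raise ValueError("all barcodes must have the same length")
    fractions = []
    for pos in range(length):
        counts = Counter(b[pos] for b in barcodes)
        fractions.append(
            {nt: counts.get(nt, 0) / len(barcodes) for nt in "ACGT"}
        )
    return fractions


def red_flags(barcodes, low=0.125, high=0.625):
    """Positions where any nucleotide fraction is outside [low, high].

    The thresholds are hypothetical defaults for this sketch.
    Returns a list of (position, {nucleotide: fraction}) pairs.
    """
    flags = []
    for pos, fracs in enumerate(position_fractions(barcodes)):
        offending = {nt: f for nt, f in fracs.items() if f < low or f > high}
        if offending:
            flags.append((pos, offending))
    return flags


balanced_pool = ["ACGT", "CGTA", "GTAC", "TACG"]
skewed_pool = ["AAAA", "AAAC", "AAAG", "AAAT"]
balanced_flags = red_flags(balanced_pool)
skewed_flags = red_flags(skewed_pool)
```

The balanced pool has every base at 25% in every cycle and raises no flags, while the skewed pool is flagged at the first three positions, where only A is present.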





□ CNValidatron, automated validation of CNV calls using computer vision

>> https://www.biorxiv.org/content/10.1101/2024.09.09.612035v1

A novel machine vision-based solution for validating CNV calls. It automates the visual inspection of CNVs with accuracy and precision comparable to (if not better than) those of a human analyst, and is distributed as an R package.

They also developed a method to group CNVs into biologically plausible CNV regions (CNVRs) based on network analysis, and demonstrate its function in a selected set of well-characterised loci.



ALIEN: ROMULUS

2024-09-09 01:14:04 | Film

□ 『ALIEN: ROMULUS』

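The relationship between the violated human and the parasite that tears through its outer shell is inverted. A tough, solid sci-fi horror set in a research facility floating in an asteroid belt. The lighting and sound design create an atmosphere steeped in sci-fi mood. In 4DX, the venting scenes send wind raging through the theater. I enjoyed it enough that when I left the cinema I was told, "Your face is completely pale!"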






2024
Directed by Fede Alvarez
Produced by Ridley Scott / Michael Pruss / Walter Hill
Production Design by Naaman Marshall
Cinematography by Galo Olivares
Music by Benjamin Wallfisch


□ Benjamin Wallfisch / “He's Glitchy”



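"METAPHYSIC" is listed in huge letters in the VFX credits of "Alien: Romulus", and it made me vividly feel how far deepfake technology has penetrated Hollywood. The painter Michel Serre's "View of the (Marseille) Town Hall during the Plague of 1720" appears while Wagner's "Entry of the Gods into Valhalla" plays in the background.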

Max Richter / “In a Landscape”

2024-09-06 19:23:32 | art music

□ Max Richter / “In a Landscape”

Release Date: 09/06/2024
Label: Decca
Cat.No.: 5882352

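On the theme of "reconciling polarities", the album fuses a serene orchestra (string quintet) with transparent electronics. It returns to the dynamism of the "The Blue Notebooks" era, and the cover art also recalls that classic album.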



□ Max Richter / “In a Landscape: Late and Soon”

Producer, Associated Performer, Synthesizer Programming: Max Richter
Mixer: Rupert Coulson
Mastering Engineer: Cicely Balston
Recording Engineer: Alex Ferguson
Violin: Eloisa-Fleur Thom
Violin: Max Baillie
Viola: Connie Pharoah
Cello: Max Ruisi
Cello: Zara Hudson-Kozdoj


□ Max Richter - Love Song (After JE)



□ Max Richter / “Only Silent Words”