List of blog posts for February 2024 - lens, align.

SHOGUN.

2024-02-28 08:20:20 | Film

□ 『SHŌGUN』

>> https://www.fxnetworks.com/shows/shogun/viewers-guide

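Fiction set in the world of feudal Japan: an epic period saga on a grand scale, rendered in the visual idiom of a Hollywood-financed production. At the close of the Sengoku period, the arrival of a heretic from abroad upsets a delicate balance of power, and the series turns that Japan into a multilayered drama in which politics, bloodlines, foreign religious conflict, and every kind of far-sighted scheming are intricately entangled. A masterpiece.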

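Hiroyuki Sanada himself is credited as a producer and poured himself into the production design of an "authentic Japan", and it shows: the blood-soaked Japan of the late Sengoku period, grimy with dust and with a sea air that seems to rise out of the dim lighting, is depicted with unprecedented realism. That realism brings erotic and grotesque scenes with it, so I am surprised it could be streamed on Disney+.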

2024
Created by Rachel Kondo / Justin Marks
Based on the novel by James Clavell
Series Directed by Frederick E.O. Toye / Jonathan van Tulleken / Charlotte Brändström / Takeshi Fukunaga / Hiromi Kamata / Emmanuel Osei-Kuffour

Restricted Area.

2024-02-28 00:10:46 | Art & Culture

(Created with Midjourney v6.0 ALPHA)

□ Max Richter - Path 19 (Yet Frailest)

The air stood still.

2024-02-25 15:17:26 | Travel

Icefall.

2024-02-25 11:31:33 | Travel

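Joined the 『奥入瀬渓流氷瀑ナイトツアー』 (a night tour of the icefalls along the Oirase mountain stream). The icefalls glimpsed along the stream at night were lit up, with the moonlight joining in, and the scenery was so fantastical it felt like dreaming. The riverbank is especially cold and also very humid, enough to freeze your eyelashes!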

桜楽

2024-02-25 11:17:13 | Travel

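Staying at 『十和田湖畔桜楽』. Every room has a lake view, so I spent a leisurely dusk gazing at the midwinter lake. After joining the outdoor night activity, the large communal bath, which uses radium stones, warmed me through in body and mind. The half-guesthouse concept of "self-luxury" is very comfortable.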

Iteration 257.

2024-02-22 22:22:22 | Science News

(“A Generative Odyssey - iteration 257” by HAL)

□ Mapping Cell Fate Transition in Space and Time

>> https://www.biorxiv.org/content/10.1101/2024.02.12.579941v1

TopoVelo (Topological Velocity inference) jointly infers the dynamics of cell fate transition over time and space. TopoVelo extends the RNA velocity framework to model single-cell gene expression dynamics of an entire tissue with spatially coupled differential equations.

TopoVelo models the differentiation of all cells using spatially coupled differential equations, formulates a principled Bayesian latent variable model that describes the data generation process, and derives an approximate Bayesian estimation using autoencoding variational Bayes.

□ NuPose: Genome-wide Nucleosome Positioning and Associated Features uncovered with Interpretable Deep Residual Networks

>> https://www.biorxiv.org/content/10.1101/2024.02.09.579668v1

NuPose is an interpretable framework based on the concepts of deep residual networks. NuPose is able to learn sequence and structural patterns and their dependencies associated with nucleosome organization in the human genome.

NuPoSe can be used to identify nucleosomal regions not covered by experiments and can be applied to unseen data from different cell types. Their findings point to 43 informative DNA sequence features, most of which are tri-nucleotides, di-nucleotides and one tetra-nucleotide.

□ Scywalker: scalable end-to-end data analysis workflow for nanopore single-cell transcriptome sequencing

>> https://www.biorxiv.org/content/10.1101/2024.02.22.581508v1

Scywalker is an integrated workflow for analyzing nanopore long-read single-cell sequencing data, currently tailored to the 10x Genomics platform. Scywalker orchestrates a complete workflow from FASTQ to cell-type demultiplexed gene and isoform discovery and quantification.

Scywalker supports scalable parallelization. Most steps are subdivided into smaller jobs, which are efficiently distributed over different processing cores, either on the same computer or over different computers in a cluster.

□ ConvNet-VAE: Integrating single-cell multimodal epigenomic data using 1D-convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2024.02.16.580655v1

ConvNet-VAE is a convolutional variational autoencoder based upon a Bayesian generative model. To apply Conv1D, the input multimodal data are transformed into 3-dimensional arrays (cell x modality x bin), following window-based genome binning at 10 kilobase resolution.

The encoder efficiently extracts latent factors, which are then mapped back to the input feature space by the decoder network. ConvNet-VAE uses a discrete data likelihood (Poisson distribution) to directly model the observed raw counts.

In this model, the categorical variables (e.g., batch information) are one-hot encoded and then concatenated with the flattened convolutional layer outputs, instead of being combined directly with the multimodal fragment count data over the sorted genomic bins.
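
The binning step can be pictured with a small sketch. This is not the ConvNet-VAE code: the modality names, the use of fragment start coordinates, and the single-chromosome toy input are all assumptions made for illustration. It only shows how per-cell fragments could become a (cell x modality x bin) array of raw counts at 10 kb resolution.

```python
BIN_SIZE = 10_000  # 10 kilobase windows, as in the description above


def bin_counts(fragment_starts, n_bins, bin_size=BIN_SIZE):
    """Count fragments per genomic bin, using each fragment's start coordinate."""
    counts = [0] * n_bins
    for start in fragment_starts:
        idx = start // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def build_tensor(cells, modalities, n_bins):
    """Return a nested list shaped (cell x modality x bin)."""
    return [
        [bin_counts(cell.get(m, []), n_bins) for m in modalities]
        for cell in cells
    ]


# Toy input (hypothetical): fragment start coordinates per modality for two cells.
cells = [
    {"mod_a": [120, 9_999, 10_000], "mod_b": [25_000]},
    {"mod_a": [31_000], "mod_b": [500, 15_000, 15_500]},
]
modalities = ["mod_a", "mod_b"]
tensor = build_tensor(cells, modalities, n_bins=4)
```

Each `tensor[cell][modality]` row is a vector of raw counts over sorted bins, which is the kind of input a 1D convolution slides over.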

□ Discrete Probabilistic Inference as Control in Multi-path Environments

>> https://arxiv.org/abs/2402.10309

While Maximum Entropy Reinforcement Learning (MaxEnt RL) can solve this sampling problem for some distributions, it has been shown that, in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object.

Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across a finite-horizon Markov Decision Process.

□ Proformer: a hybrid macaron transformer model predicts expression values from promoter sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05645-5

Proformer is an over-parametrized Transformer architecture for large-scale regression tasks on DNA sequences. Proformer includes a new design named multiple expression heads (MEH) to stabilize the convergence, compared with the conventional average pooling heads.

In Proformer, two half-step feed-forward (FFN) layers are placed at the beginning and the end of each encoder block, and a separable 1D convolution layer is inserted after the first FFN layer and in front of the multi-head attention layer.

The sliding k-mers from one-hot encoded sequences were mapped onto a continuous embedding, combined with the learned positional embedding and strand embedding (forward strand vs. reverse complemented strand) as the sequence input.
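
A rough sketch of the input side, under assumptions: k-mers are mapped to integer ids instead of going through a one-hot step, k = 3 is arbitrary, and the (k-mer id, position, strand id) triples simply stand for the three indices that learned embedding tables would look up. None of the function names come from Proformer.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_vocab(k):
    """Map every possible k-mer to an integer id (lexicographic order)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def sliding_kmer_ids(seq, k, vocab):
    """Ids of all overlapping k-mers, moving one base at a time."""
    return [vocab[seq[i:i + k]] for i in range(len(seq) - k + 1)]


def encode(seq, k=3):
    """Return (kmer_id, position, strand_id) triples for both strands."""
    vocab = kmer_vocab(k)
    tokens = []
    for strand_id, s in enumerate((seq, reverse_complement(seq))):
        ids = sliding_kmer_ids(s, k, vocab)
        tokens.append([(kid, pos, strand_id) for pos, kid in enumerate(ids)])
    return tokens


forward, reverse = encode("ACGTA")
```

In a model, the three indices of each triple would be embedded separately and summed to form the sequence input.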

□ LineageVAE: Reconstructing Historical Cell States and Transcriptomes toward Unobserved Progenitors

>> https://www.biorxiv.org/content/10.1101/2024.02.16.580598v1

LineageVAE utilizes deep learning based on the property that cells sharing barcodes have identical progenitors. LineageVAE transforms scRNA-seq observations with an identical lineage barcode into sequential trajectories toward a common progenitor in a latent cell state space.

LineageVAE depicts sequential cell state transitions from simple snapshots and infers cell states over time. Moreover, LineageVAE can generate transcriptomes at each time point using a decoder.

□ scGIST: gene panel design for spatial transcriptomics with prioritized gene sets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03185-y

scGIST (single-cell Gene-panel Inference for Spatial Transcriptomics), a deep neural network with a custom loss function that casts sc-ST panel design as a constrained feature selection problem.

scGIST learns to classify the individual cells given their gene expression values. Its custom loss function aims at maximizing both cell type classification accuracy and the number of genes included from a given gene set of interest while staying within the panel’s size constraint.

□ CADECT: Evaluating the Benefits and Limits of Multiple Displacement Amplification with Whole-Genome Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.02.09.579537v1

CADECT (Concatemer Detection Tool) enables the identification and removal of putative inverted chimeric concatemers, thus improving the accuracy and contiguity of the genome assembly.

CADECT effectively mitigates the impact of concatemeric sequences, enabling the assembly of contiguous sequences even in cases where the input genomic DNA was degraded.

Annealing of random hexamer primers and addition of phi29 DNA polymerase lead to concatemer-mediated multiple displacement amplification from linear and circular concatemers, respectively.

□ NCLUSION: Scalable nonparametric clustering with unified marker gene selection for single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.02.11.579839v1

NCLUSION, an infinite mixture model that leverages Bayesian sparse priors to identify marker genes while simultaneously performing clustering on single-cell expression data. NCLUSION works directly on normalized count data, bypassing the need to perform dimensionality reduction.

Based on a sparse hierarchical Dirichlet process normal mixture model, NCLUSION learns the optimal number of clusters based on the variation observed between expression profiles and uses sparse prior distributions to identify genes that significantly influence cluster definitions.

□ Proteus: pioneering protein structure generation for enhanced designability and efficiency

>> https://www.biorxiv.org/content/10.1101/2024.02.10.579791v1

Proteus surpasses the designability of RFdiffusion by utilizing a graph-based triangle technique and a multi-track interaction network with great enhancement of the dataset.

The graph triangle block is applied to update the edge representation and employs a graph-based attention mechanism on edge representation with a sequence representation-gated structure bias.

Proteus transfers triangle techniques into the integration of latent representations of residue edges by constructing a KNN graph and building multi-track interaction networks. Proteus even largely surpasses RFdiffusion on the generation of longer monomers (over 400 amino acids).

□ BMTC: The De Bruijn Mapping Problem with Changes in the Graph

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580401v1

Reformulating the Graph Sequence Mapping Problem, this work introduced concepts such as the s-transformation of a De Bruijn graph and the Bipartition and matching between two sets of k-mers.

BMTC, an algorithm which utilizes the Hungarian algorithm to find a maximum-cost minimum matching in a bipartite graph, resulting in a modified set of vertices for the De Bruijn graph.

The theorem demonstrates that the cost of the maximum matching found in the bipartite graph is equal to the Hamming distance between the given sequence and the original graph. BMTC allows changes in the De Bruijn graph, proving advantageous for finding polynomial-time solutions.
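
To make the matching idea concrete, here is a toy sketch that pairs the k-mers of a sequence with k-mers of a graph so that the total Hamming cost is as small as possible. It is an illustration under assumptions, not BMTC itself: a brute-force search over assignments stands in for the Hungarian algorithm (fine only for tiny inputs), and the cost definition and example k-mers are made up.

```python
from itertools import permutations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_cost_matching(seq_kmers, graph_kmers):
    """Exhaustively assign each sequence k-mer to a distinct graph k-mer.

    Assumes len(seq_kmers) <= len(graph_kmers). The Hungarian algorithm
    solves the same assignment problem in polynomial time.
    """
    best_cost, best_pairs = None, None
    for perm in permutations(graph_kmers, len(seq_kmers)):
        cost = sum(hamming(s, g) for s, g in zip(seq_kmers, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, list(zip(seq_kmers, perm))
    return best_cost, best_pairs


seq_kmers = ["ACG", "CGT", "GTT"]           # hypothetical k-mers of a sequence
graph_kmers = ["ACG", "CGA", "GAT", "TTT"]  # hypothetical graph vertices
cost, pairs = min_cost_matching(seq_kmers, graph_kmers)
```

The matched graph k-mers that differ from their partners indicate which vertices would have to change for the sequence to map onto the graph.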

□ RUDEUS: a machine learning classification system to study DNA-Binding proteins

>> https://www.biorxiv.org/content/10.1101/2024.02.19.580825v1

RUDEUS is a Python library for DNA-binding classification systems and for recognising single-stranded and double-stranded interactions.

RUDEUS incorporates a generalizable pipeline that combines protein language models, supervised learning algorithms, and hyperparameter tuning guided by Bayesian approaches to train predictive models.

RUDEUS collects the protein sequences by incorporating length filters and removing non-canonical residues. Numerical representation strategies are applied to obtain encoded vectors through protein language models, using all the different pre-trained models in the bio-embedding library.

□ scSemiGCN: boosting cell-type annotation from noise-resistant graph neural networks with extremely limited supervision

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae091/7609673

scSemiGCN, a robust cell-type annotation method based on graph convolutional networks. Built upon a denoised network structure that characterizes reliable cell-to-cell connections, scSemiGCN generates pseudo labels for unannotated cells.

scSemiGCN projects raw features onto a discriminative representation space by supervised contrastive learning. Finally, message passing with the refined features over the denoised network structure is conducted for semi-supervised cell-type annotation.

□ ChemGLaM: Chemical-Genomics Language Models for Compound-Protein Interaction Prediction

>> https://www.biorxiv.org/content/10.1101/2024.02.13.580100v1

ChemGLaM is based on two independent language models, MoLFormer for compounds and ESM-2 for proteins, and is fine-tuned on the CPI datasets using an interaction block with a cross-attention mechanism.

ChemGLaM is capable of predicting interactions between unknown compounds and proteins with higher accuracy.

Combining the independently pre-trained foundation models is effective for obtaining a sophisticated representation of compound-protein interactions. Furthermore, ChemGLaM visualizes the learned cross-attention map.

□ SSBlazer: a genome-wide nucleotide-resolution model for predicting single-strand break sites

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03179-w

SSBlazer is a novel computational framework for predicting single-strand break (SSB) sites within local genomic windows. This method utilizes advanced deep learning techniques such as residual blocks and self-attention mechanisms to enhance the accuracy of predictions.

SSBlazer is capable of quantifying the contribution of each nucleotide to the final prediction, thereby aiding in the identification of SSB-associated motifs, such as the GGC motif and regions with a high frequency of CpG sites.

□ HairSplitter: haplotype assembly from long, noisy reads

>> https://www.biorxiv.org/content/10.1101/2024.02.13.580067v1

HairSplitter first calls variants using a custom process to distinguish actual variants from alignment or sequencing artefacts, clusters the reads into an unspecified number of haplotypes, creates the new separated contigs and finally untangles the assembly graph.

HairSplitter takes as input an assembly (obtained by any means) and the long reads (including high-error-rate long reads) used to build this assembly. For each contig, it checks whether the contig was built using reads from different haplotypes/regions.

HairSplitter separates the reads into as many groups as necessary and computes the different versions (e.g. alleles) of the contig actually present in the genome. It outputs a new assembly, where different versions of contigs are not collapsed into one but assembled separately.

□ DeepMod2: A signal processing and deep learning framework for methylation detection using Oxford Nanopore sequencing

>> https://www.nature.com/articles/s41467-024-45778-y

DeepMod2 takes ionic current signal from POD5/FAST5 files and read sequences from a BAM file as input and makes 5mC methylation prediction for each read independently using a BiLSTM or Transformer model.

DeepMod2 combines per-read predictions to estimate overall methylation level for each CpG site in the reference genome. It additionally provides haplotype-specific methylation counts if the input BAM file is phased.
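
The aggregation step is simple enough to sketch. This is not DeepMod2's code or output format: the tuple layout, the 0.5 probability cutoff, and the toy calls are assumptions. The sketch turns per-read probabilities into a per-site methylation level and, where a read carries a haplotype tag, into haplotype-specific counts.

```python
from collections import defaultdict


def aggregate_calls(per_read_calls, threshold=0.5):
    """Combine per-read 5mC calls into per-site levels.

    per_read_calls: iterable of (chrom, pos, haplotype, probability);
    haplotype is None for unphased reads. The threshold is an assumed cutoff.
    """
    site_counts = defaultdict(lambda: [0, 0])  # site -> [methylated, covered]
    hap_counts = defaultdict(lambda: [0, 0])   # (site, haplotype) -> same pair
    for chrom, pos, haplotype, prob in per_read_calls:
        methylated = 1 if prob >= threshold else 0
        site = (chrom, pos)
        site_counts[site][0] += methylated
        site_counts[site][1] += 1
        if haplotype is not None:
            hap_counts[(site, haplotype)][0] += methylated
            hap_counts[(site, haplotype)][1] += 1
    levels = {site: m / n for site, (m, n) in site_counts.items()}
    return levels, {key: tuple(v) for key, v in hap_counts.items()}


# Toy per-read predictions (hypothetical values).
calls = [
    ("chr1", 100, 1, 0.9),
    ("chr1", 100, 1, 0.8),
    ("chr1", 100, 2, 0.1),
    ("chr1", 100, None, 0.7),
    ("chr1", 250, 2, 0.2),
]
levels, hap_counts = aggregate_calls(calls)
```

Here the site at position 100 is covered by four reads, three of which are called methylated, and the two haplotypes disagree.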

□ Graphasing: Phasing Diploid Genome Assembly Graphs with Single-Cell Strand Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580432v1

Graphasing, a Strand-seq alignment-to-graph-based phasing and scaffolding workflow that assembles telomere-to-telomere (T2T) human haplotypes using data from a single sample.

Graphasing leverages a robust cosine similarity clustering approach to synthesize global phase signal from Strand-seq alignments with assembly graph topology, producing accurate haplotype calls and end-to-end scaffolds.

□ Pasa: leveraging population pangenome graph to scaffold prokaryote genome assemblies

>> https://academic.oup.com/nar/article/52/3/e15/7469957

Pasa, a graph-based algorithm that utilizes the pangenome graph and the assembly graph information to improve scaffolding quality. Pasa is able to utilize the linkage information of the gene families of the species to resolve the contig graph of the assembly.

Pasa orients the gene-level genomes such that they have the most common consecutive gene pairs. The orientations of the gene-level genomes are determined by the following procedure: The algorithm begins with the first genome, and its orientation is chosen arbitrarily.

Pasa identifies an orientation of the second genome that maximizes the number of common pairs of consecutive genes with the first genome.

Similarly, Pasa finds an orientation of the third genome that has the largest number of common pairs of consecutive genes with the first two genomes, and the procedure is repeated for the remaining genomes.
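
The orientation procedure reads like a greedy loop, which can be sketched as follows. The simplifications are mine: a gene-level genome is a plain list of gene-family names, flipping a genome only reverses the list (gene strands are ignored), and "common pairs" are counted against the set of pairs seen in all genomes oriented so far.

```python
def consecutive_pairs(genome):
    """Set of ordered pairs of adjacent genes in one orientation."""
    return set(zip(genome, genome[1:]))


def orient_genomes(genomes):
    """Greedily pick, for each genome, the orientation sharing the most
    consecutive gene pairs with the genomes already oriented."""
    oriented = [genomes[0]]  # the first orientation is chosen arbitrarily
    seen = consecutive_pairs(genomes[0])
    for genome in genomes[1:]:
        candidates = (genome, genome[::-1])
        # On ties, max() keeps the first candidate (the given orientation).
        choice = max(candidates, key=lambda c: len(consecutive_pairs(c) & seen))
        oriented.append(choice)
        seen |= consecutive_pairs(choice)
    return oriented


# Hypothetical gene-level genomes.
genomes = [
    ["a", "b", "c", "d"],
    ["d", "c", "b", "e"],
    ["e", "b", "c", "a"],
]
oriented = orient_genomes(genomes)
```

The second genome gets flipped because its reverse shares two adjacent pairs with the first, while the third is kept as given.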

□ TERRACE: Accurate Assembly of Circular RNAs

>> https://www.biorxiv.org/content/10.1101/2024.02.09.579380v1

TERRACE (accuraTe assEmbly of circRNAs using bRidging and mAChine lEarning), a new tool for assembling full-length circRNAs from paired-end total RNA-seq data. TERRACE stands out by assembling circRNAs accurately without relying on annotations.

TERRACE identifies back-spliced reads, which will be assembled into a set of candidate, full-length circular paths. The candidate paths, augmented by the annotated transcripts, are subjected to a selection process followed by a merging procedure to produce the resultant circRNAs.

□ TopoQual polishes circular consensus sequencing data and accurately predicts quality scores

>> https://www.biorxiv.org/content/10.1101/2024.02.08.579541v1

TopoQual, a tool utilizing partial order alignments (POA), topologically parallel bases, and deep learning to polish consensus sequences and more accurately predict base qualities.

TopoQual can find the alternative, or parallel, bases of the calling base in the POA graph. The parallel bases, in conjunction with the trinucleotide sequence of the read and the target base's quality score, are input to the deep learning model treating mismatch bases.

□ Motif Interactions Affect Post-Hoc Interpretability of Genomic Convolutional Neural Networks

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580353v1

Since multiple regulatory elements can be involved in a regulatory mechanism, interactions between motifs complicate the prediction task. Motif interactions can occur in multiple forms, including additive effects as well as multiplicative interactions.

Genomic sequences have to be transformed into numerical matrices so they can be processed by CNNs. Each column of this matrix stands for one sequence position where the base at this position is represented by a one-hot-encoding vector.
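
A minimal sketch of that encoding, with one assumption: bases outside A/C/G/T (such as N) become an all-zero column, which is a common convention but not something stated above.

```python
BASES = "ACGT"


def one_hot(seq):
    """Return a 4 x len(seq) matrix; each column encodes one sequence position.

    Rows follow the order A, C, G, T. Unknown bases yield an all-zero column.
    """
    seq = seq.upper()
    return [[1 if base == b else 0 for base in seq] for b in BASES]


matrix = one_hot("ACGTN")
column_sums = [sum(row[i] for row in matrix) for i in range(len("ACGTN"))]
```

Every column sums to one except the column for the unknown base, which sums to zero.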

They obtain real transcription factor binding motifs from the JASPAR database for the evaluation. They distinguish here between homologous and heterologous motif subsets to investigate if motif similarity influences interpretability.

Many approaches to interpreting genomic LLMs focus on the analysis of the attention scores or the output with post-hoc methods that mostly offer interpretations on the input token level.

One ongoing challenge is to uncover the grammar between interacting motifs so that interpreting genomic LLMs beyond those approaches could give better explanations of underlying biological processes.

□ SomaScan Bioinformatics: Normalization, Quality Control, and Assessment of Pre-Analytical Variation

>> https://www.biorxiv.org/content/10.1101/2024.02.09.579724v1

Pre-analytical variation (PAV) due to sample collection, handling, and storage is known to affect many analyses in molecular biology.

By implementing data modeling techniques similar to those previously developed to find SomaScan signatures associated with clinical phenotypes, SomaLogic has developed a novel set of so-called SomaSignal Tests (SSTs) to assess pre-analytical variation due to different sample processing factors, including fed-fasted time, number of freeze-thaw cycles, time-to-decant, time-to-spin, and time-to-freeze.

□ ELATUS: Uncovering functional lncRNAs by scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2024.01.26.577344v2

ELATUS, a computational framework based on the pseudoaligner Kallisto that enhances the detection of functional lncRNAs previously undetected and exhibits higher concordance with the ATAC-seq profiles in single-cell multiome data.

The ELATUS workflow uncovers biologically important lncRNAs. It starts by importing the raw count matrices obtained after preprocessing with both Cell Ranger and Kallisto.

ATAC-seq data from the high-quality nuclei were normalized using a Latent Semantic Indexing approach. "Weighted nearest neighbour" (WNN) analysis was then performed to integrate the ATAC-seq with the gene expression obtained by Cell Ranger and Kallisto.

□ LAVASET: Latent variable stochastic ensemble of trees. An ensemble method for correlated datasets with spatial, spectral, and temporal dependencies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae101/7612229

LAVASET derives latent variables based on the distance characteristics of each feature and thereby incorporates the correlation factor in the splitting step. LAVASET inherently groups correlated features and ensures similar importance assignment for these.

LAVASET operates given a number of prerequisites and hyperparameters that can be optimized. LAVASET produces non-inferior performance results to traditional Random Forests in all but one of the examples, and in both simulated and real datasets.

□ SHARE-Topic: Bayesian interpretable modeling of single-cell multi-omic data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03180-3

SHARE-Topic extends the cisTopic model of single-cell chromatin accessibility by coupling the epigenomic state with gene expression through latent variables (topics) which are associated to regions and genes within an individual cell.

SHARE-Topic extracts a latent space representation of each cell informed by both the epigenome and the transcriptome and, crucially, also models the joint variability of individual genes and regions, providing an interpretable analysis tool which can help in generating novel hypotheses.

□ ScRAT: Phenotype prediction from single-cell RNA-seq data using Attention-Based neural networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae067/7613064

ScRAT, a phenotype prediction framework that can learn from limited numbers of scRNA-seq samples with minimal dependence on cell-type annotations. ScRAT utilizes the attention mechanism to measure interactions between cells as their correlations, or attention weights.

ScRAT establishes the connection between the input (cells) and the output (phenotypes) of the Transformer model simply using the attention weights. ScRAT hence selects cells containing the most discriminative information to specific phenotypes, or critical cells.

□ SpaCCC: Large language model-based cell-cell communication inference for spatially resolved transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.02.21.581369v1

spaCCC first relies on a fine-tuned single-cell LLM and a functional gene interaction network to embed ligand and receptor genes expressed in interacting individual cells into a unified latent space.

Second, the ligand-receptor pairs with a significantly closer distance in latent space are taken to be more likely to interact with each other.

Third, molecular diffusion and permutation test strategy were respectively employed to calculate the communication strength and filter out communications with low specificities.

□ Large-scale characterization of cell niches in spatial atlases using bio-inspired graph learning

>> https://www.biorxiv.org/content/10.1101/2024.02.21.581428v1

NicheCompass is a generative graph deep learning method designed based on the principles of cellular communication, enabling interpretable and scalable modeling of spatial omics data.

NicheCompass has a unique in-built capability for spatial reference mapping based on fine-tuning, thereby empowering computationally efficient integration and contextualization of a query dataset with a large-scale spatial reference atlas.

□ MaskGraphene: Advancing joint embedding, clustering, and batch correction for spatial transcriptomics using graph-based self-supervised learning

>> https://www.biorxiv.org/content/10.1101/2024.02.21.581387v1

MaskGraphene, a graph neural network with both self-supervised and self-contrastive training strategies designed for aligning and integrating ST data with gene expression and spatial location information while generating batch-corrected joint node embeddings.

MaskGraphene integrates node-to-node matching links from a local alignment algorithm. MaskGraphene selects spots across slices as triplets based on their embeddings, with the goal of bringing similar spots closer and pushing different spots further apart in an iterative manner.

Fragment - II.

2024-02-22 22:11:22 | Science News

□ MuSiCal: Accurate and sensitive mutational signature analysis

>> https://www.nature.com/articles/s41588-024-01659-0/figures/1

MuSiCal (Mutational Signature Calculator) decomposes a mutation count matrix into a signature matrix and an exposure matrix through four main modules: preprocessing, de novo discovery, matching and refitting, and in silico validation/optimization.

MuSiCal leverages several new methods, including minimum-volume nonnegative matrix factorization (mvNMF), likelihood-based sparse nonnegative least squares (NNLS) and a data-driven approach for systematic parameter optimization and in silico validation.

□ CompSeed: A compressive seeding algorithm in conjunction with reordering-based compression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae100/7611649

CompSeed, in collaboration with the reordering-based compression tools, finishes the BWA-MEM seeding in about half the time by caching all intermediate seeding results in compact trie structures to directly answer repetitive inquiries that frequently cause random memory accesses.

CompSeed demonstrates better performance as sequencing coverage increases, as it focuses solely on the small informative portion of sequencing reads after compression.

CompSeed fully utilizes the redundancy information provided from upstream compressors using trie structures, and avoids ~50% of the redundant time-consuming FM-index operations during the BWA-MEM seeding process.

□ Finimizers: Variable-length bounded-frequency minimizers for k-mer sets

>> https://www.biorxiv.org/content/10.1101/2024.02.19.580943v1

Finimizers (frequency-bounded minimizers) use an order relation < for minimizer comparison that depends on the frequency of the minimizers within the indexed k-mers.

With finimizers, the length m of the m-mers is not fixed, but is allowed to vary depending on the context, so that the length can increase to bring the frequency down below a user-specified threshold t.

Setting a maximum frequency solves the issue of very frequent minimizers and gives us a worst-case guarantee for the query time. They show how to implement a particular finimizer scheme using the Spectral Burrows-Wheeler Transform augmented with longest common suffix information.
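
A toy sketch of the variable-length idea, under loudly stated assumptions: "frequency" is taken to be the number of indexed k-mers containing a substring, ties at the shortest admissible length are broken lexicographically, and a naive scan replaces the Spectral Burrows-Wheeler Transform. The preprint's exact ordering may differ.

```python
def substring_frequency(sub, kmers):
    """Number of indexed k-mers that contain `sub` (a simplified frequency)."""
    return sum(sub in kmer for kmer in kmers)


def finimizer(kmer, kmers, t):
    """Shortest substring of `kmer` whose frequency is at most t.

    The length m grows until some m-mer is rare enough; among the rare
    m-mers of that length, the lexicographically smallest is returned.
    """
    for m in range(1, len(kmer) + 1):
        candidates = {kmer[i:i + m] for i in range(len(kmer) - m + 1)}
        rare = [c for c in candidates if substring_frequency(c, kmers) <= t]
        if rare:
            return min(rare)
    return None  # only reachable if the k-mer itself is too frequent


# Hypothetical indexed k-mer set.
kmers = {"ACGT", "ACGA", "CCGT", "TTTT"}
```

With t = 1, `finimizer("TTTT", kmers, 1)` settles on a 2-mer, while `finimizer("ACGT", kmers, 1)` has to grow to the full k-mer because all of its shorter substrings are shared with other k-mers.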

□ stMMR: accurate and robust spatial domain identification from spatially resolved transcriptomics with multi-modal feature representation

>> https://www.biorxiv.org/content/10.1101/2024.02.22.581503v1

stMMR utilizes spatial location information as a bridge to establish adjacency relationships between spots. It encodes gene expression data and morphological features extracted from histological images using Graph Convolutional Networks.

stMMR achieves joint learning of intra-modal and inter-modal features. stMMR employs self-attention mechanisms to learn the relationships of different spots. stMMR utilizes similarity contrastive learning along with the reconstruction of GE features and adjacency information.

□ SVarp: pangenome-based structural variant discovery

>> https://www.biorxiv.org/content/10.1101/2024.02.18.580171v1

SVarp addresses the gap by calling SVs on graph genomes using third generation long sequencing reads. It enables us to find additional SVs that are currently missing, including SVs on top of alternative sequences present in the pangenome but not in a linear reference.

SVarp calls novel phased variant sequences, which they call ‘svtigs’. The variant representation is not tied to a single linear reference and allows for flexible downstream workflows that derive variant calls. The svtigs can serve as a basis to amend a pangenome graph.

□ CoCoPyE: feature engineering for learning and prediction of genome quality indices

>> https://www.biorxiv.org/content/10.1101/2024.02.07.579156v1

CoCoPyE is a fast tool based on a novel two-stage feature extraction and transformation scheme. CoCoPyE identifies genomic markers and then refines the marker-based estimates with a machine learning approach.

The original feature space comprises more than 10,000 dimensions which correspond to different protein domain families. Large-scale machine learning within such a high-dimensional space is burdensome.

CoCoPyE maps the original profile space to a lower dimensional histogram space. A count ratio histogram (CRH) arises from the comparison of a candidate profile with a reference profile in terms of the observed ratios between the corresponding protein domain counts.

□ T-S2Inet: Transformer-based sequence-to-Image network for accurate nanopore sequence recognition

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae083/7609038

T-S2Inet is a transformer-based model for accurate nanopore sequence recognition. T-S2Inet uses a Sequence-to-Image (S2I) module that applies transformation rules to convert the unequal length sequence to a fixed-size image.

The objective of the S2I module is to convert sequences of unequal lengths into images of uniform dimensions. T-S2Inet utilizes GASF/GADF for nanopore sequence transformation, and performs training and prediction with a subsequent deep neural network.
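
For reference, the Gramian Angular Summation/Difference Fields can be sketched in a few lines. This covers only the standard GASF/GADF definitions on a single numeric series and yields an n x n image for a series of length n; how the S2I module reaches a fixed image size is not reproduced here.

```python
import math


def rescale(series):
    """Min-max scale a series into [-1, 1]; a constant series maps to zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [2 * (x - lo) / (hi - lo) - 1 for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists.

    Each value x becomes an angle phi = arccos(x); then
    GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j).
    """
    x = rescale(series)
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]  # clamp for safety
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
```

The summation field is symmetric and the difference field is antisymmetric with a zero diagonal, so the pair keeps complementary information about the series.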

□ BootCellNet, a resampling-based procedure, promotes unsupervised identification of cell populations via robust inference of gene regulatory networks.

>> https://www.biorxiv.org/content/10.1101/2024.02.06.579236v1

BootCellNet employs smoothing and resampling to infer GRNs. Using the inferred GRNs, BootCellNet further infers the minimum dominating set (MDS), a set of genes that determines the dynamics of the entire network.

In BootCellNet, GRN reconstruction is performed with the ARACNe method. NestBoot utilizes a nested bootstrap to control FDR in GRN inference, and they showed that the bootstrapping procedure improved the accuracy of the GRN inference by various inference methods such as GENIE3.
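
To illustrate what a dominating set is, here is a greedy sketch on a small undirected network. It is an approximation written for illustration, not the procedure BootCellNet uses, and it does not guarantee a minimum-size set; the gene names and edges are hypothetical.

```python
def greedy_dominating_set(adjacency):
    """Greedy approximation of a dominating set.

    adjacency: dict mapping each node to the set of its neighbours
    (symmetric, with every node present as a key). A node dominates
    itself and its neighbours.
    """
    undominated = set(adjacency)
    chosen = []
    while undominated:
        # Pick the node covering the most still-undominated nodes;
        # sorting makes tie-breaking deterministic.
        best = max(
            sorted(adjacency),
            key=lambda n: len(({n} | adjacency[n]) & undominated),
        )
        chosen.append(best)
        undominated -= {best} | adjacency[best]
    return chosen


# Hypothetical gene regulatory network.
network = {
    "g1": {"g2", "g3", "g4"},
    "g2": {"g1"},
    "g3": {"g1"},
    "g4": {"g1", "g5"},
    "g5": {"g4"},
}
dominators = greedy_dominating_set(network)
```

Every gene in the toy network is either in the returned set or adjacent to one of its members.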

□ MultiXrank: Random walk with restart on multilayer networks: from node prioritisation to supervised link prediction and beyond

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05683-z

MultiXrank is a Random Walk with Restart algorithm able to explore multilayer networks. MultiXrank outputs scores reflecting the proximity between an initial set of seed node(s) and all the other nodes in the multilayer network.

In this multilayer framework, all the networks can also be weighted and/or directed.
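
The core iteration can be sketched on a single weighted layer. This is a simplification, not the MultiXrank package: there are no inter-layer jumps, the restart probability of 0.7 is just a value picked for the example, and every node is assumed to have at least one outgoing edge.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Power iteration for p = restart * p0 + (1 - restart) * W p.

    adjacency: dict node -> {neighbour: weight}; seeds: nodes sharing the
    restart mass equally. Returns a dict of proximity scores.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            total = sum(adjacency[u].values())
            for v, w in adjacency[u].items():
                # Move along edge u -> v in proportion to its weight.
                nxt[v] += (1 - restart) * p[u] * w / total
        if max(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p


# Toy path graph a - b - c with unit weights.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0},
}
scores = random_walk_with_restart(graph, seeds={"a"})
```

Scores sum to one and fall off with distance from the seed, which is the sense in which they measure proximity.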

□ 123VCF: an intuitive and efficient tool for filtering VCF files

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05661-5

123VCF filters input variants in accordance with a predefined filter sequence applied to the input variants. Users are provided the flexibility to define various filtering parameters, such as quality, coverage depth, and variant frequency within the populations.

123VCF can generate a Tab-Separated Values (TSV) file containing all passed variants, which can be easily imported into spreadsheet-based programs for further analysis. 123VCF can also generate another TSV file specifically for variants that overlap with a user-provided BED file.
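
The kind of filter sequence described here can be sketched with the standard library alone. This does not use or mirror 123VCF's interface: the thresholds, the choice of the `DP` and `AF` INFO keys, and the single-ALT assumption are all mine.

```python
import csv
import io


def parse_info(info):
    """Split a VCF INFO string such as 'DP=20;AF=0.001' into a dict."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def filter_vcf(lines, min_qual=30.0, min_depth=10, max_af=0.01):
    """Apply quality, depth, and population-frequency filters in sequence.

    Assumes one ALT allele per record, so AF holds a single number.
    """
    passed = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt, qual, _flt, info = line.rstrip("\n").split("\t")[:8]
        fields = parse_info(info)
        if qual == "." or float(qual) < min_qual:
            continue
        if int(fields.get("DP", 0)) < min_depth:
            continue
        if float(fields.get("AF", 0.0)) > max_af:
            continue
        passed.append((chrom, pos, ref, alt, qual, fields.get("DP", ""), fields.get("AF", "")))
    return passed


def to_tsv(rows):
    """Render passed variants as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["CHROM", "POS", "REF", "ALT", "QUAL", "DP", "AF"])
    writer.writerows(rows)
    return buf.getvalue()


# Toy records (hypothetical).
vcf_lines = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20;AF=0.001",
    "chr1\t200\t.\tC\tT\t10\tPASS\tDP=30;AF=0.001",
    "chr1\t300\t.\tG\tA\t60\tPASS\tDP=5;AF=0.001",
    "chr1\t400\t.\tT\tC\t60\tPASS\tDP=40;AF=0.2",
]
passed = filter_vcf(vcf_lines)
tsv_text = to_tsv(passed)
```

Only the first record survives all three filters; the others fail on quality, depth, and population frequency respectively.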

□ KRANK: Memory-bound k-mer selection for large evolutionary diverse reference libraries

>> https://www.biorxiv.org/content/10.1101/2024.02.12.580015v1

KRANK (K-mer RANKer) combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy.

KRANK is centered around a hierarchical traversal of the taxonomy, constructing hash tables separately for each taxon, and merging these to represent the parent taxa. Thus, instead of constructing a global hash table once at the root, it builds the library gradually.

□ Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs

>> https://www.biorxiv.org/content/10.1101/2024.02.14.580330v1

Klumpy is a bioinformatic tool designed to detect genome misassemblies, misannotations, and incongruities in long-read-based genome assemblies and their constituent raw reads.

Klumpy scans through a genome assembly and provides users with a list of potentially misassembled regions, and it annotates sequences of interest (e.g., an assembled genome or its underlying raw reads) given a query of interest.

These two modes of operation can work synergistically to annotate an assembly and the constituent raw reads together, based on a supplied, specific query (defined as any nucleotide sequence including, e.g., genes, regulatory motifs, or transposable elements).

□ RUBICON: a framework for designing efficient deep learning-based genomic basecallers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03181-2

RUBICON is the first framework for specializing and optimizing a machine learning-based basecaller. It uses two machine learning techniques, both designed specifically for basecalling, to develop hardware-optimized basecallers.

RUBICON uses QABAS, an automatic architecture search for computation blocks and optimal bit-width precision, and SkipClip, a dynamic skip connection removal module. QABAS uses neural architecture search to evaluate millions of different basecaller architectures.

RUBICALL, the first hardware-optimized basecaller, demonstrates fast, accurate, and efficient basecalling, achieving a 6.88× reduction in model size with 2.94× fewer neural network parameters.

□ Fasta2Structure: a user-friendly tool for converting multiple aligned FASTA files to STRUCTURE format

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05697-7

Fasta2Structure is a graphical user interface (GUI) application designed to simplify the process of converting multiple sequence alignments into a single, cohesive file that is compatible with the STRUCTURE software.

Fasta2Structure incorporates all variable sites present in the alignments. It exhibits a higher degree of robustness in converting a wider array of data types, including those with significant genetic variation.

□ pipesnake : Generalized software for the assembly and analysis of phylogenomic datasets from conserved genomic loci

>> https://www.biorxiv.org/content/10.1101/2024.02.13.580223v1

ausarg/pipesnake is a bioinformatics best-practice analysis pipeline for phylogenomic reconstruction starting from short-read 'second-generation' sequencing data.

The pipesnake workflow generates a number of output files that are stored in process-specific directories. This allows the user to store and inspect intermediate files such as individual sample PRGs, alignment files, and locus trees.

□ PIMENTA: PIpeline for MEtabarcoding through Nanopore Technology used for Authentication

>> https://www.biorxiv.org/content/10.1101/2024.02.14.580249v1

PIMENTA (PIpeline for MEtabarcoding through Nanopore Technology used for Authentication) is a pipeline for rapid taxonomic identification in samples using MinION metabarcoding sequencing data.

The PIMENTA pipeline consists of eight linked tools, and data analysis passes through four phases:

1) pre-processing the MinION data through read calling, demultiplexing, trimming sequencing adapters, quality trimming, and filtering the reads;
2) clustering the reads, followed by MSA and consensus building per cluster;
3) reclustering of consensus sequences, followed by another MSA and consensus building per cluster;
4) taxonomy identification using a BLAST analysis.

□ EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae092/7609674

EvoAug-TF adapts the functionality of the PyTorch-based EvoAug framework in TensorFlow, including the augmentation techniques (e.g., random transversion, insertion, translocation, deletion, mutation, and noise).

EvoAug-TF employs the same two-stage training curriculum, where stochastic augmentations are applied online to each mini-batch during training, followed by a finetuning step on the original, unperturbed data.

Since EvoAug-TF imposes transformations on the input data while maintaining the same labels as the wildtype sequence, in its current form, EvoAug-TF only supports DNNs that output scalars in single-task or multi-task settings.
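
The label-preserving nature of these augmentations can be illustrated with a plain-Python sketch on DNA strings. This is not the EvoAug-TF API, which operates on TensorFlow tensors: the function names, the mutation rate, and the choice to pad deletions with random bases so the length stays fixed are assumptions made here.

```python
import random

BASES = "ACGT"


def random_mutation(seq, rate, rng):
    """Substitute each base, with probability `rate`, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def random_deletion(seq, del_len, rng):
    """Delete a random window and pad with random bases to keep the length."""
    start = rng.randrange(len(seq) - del_len + 1)
    kept = seq[:start] + seq[start + del_len:]
    pad = "".join(rng.choice(BASES) for _ in range(del_len))
    return kept + pad


def augment_batch(batch, rng, rate=0.1, del_len=3):
    """Perturb every sequence in a mini-batch; labels are passed through unchanged."""
    return [
        (random_deletion(random_mutation(seq, rate, rng), del_len, rng), label)
        for seq, label in batch
    ]


# Hypothetical mini-batch of (sequence, scalar label) pairs.
rng = random.Random(0)
batch = [("ACGTACGTACGTACGT", 0.8), ("TTTTGGGGCCCCAAAA", 0.1)]
augmented = augment_batch(batch, rng)
```

In the two-stage curriculum described above, a function like `augment_batch` would be called on each mini-batch during the first stage and skipped during the finetuning stage.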

□ K2R: Tinted de Bruijn Graphs for efficient read extraction from sequencing datasets

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580442v1

K2R is a highly scalable index that efficiently finds the reads containing a queried k-mer, built on tinted de Bruijn graphs. K2R consistently outperforms contemporary solutions in most metrics and is the only tool capable of scaling to larger datasets.

K2R's performance, in terms of index size, memory footprint, throughput, and construction time, is benchmarked against leading methods, including hashing techniques (e.g., Short Read Connector) and full-text indexing (e.g., Spumoni and Movi), across various datasets.

□ Delineating the Effective Use of Self-Supervised Learning in Single-Cell Genomics

>> https://www.biorxiv.org/content/10.1101/2024.02.16.580624v1

Central to this framework is the use of fully connected autoencoder architectures, selected for their ubiquitous application in SCG tasks and for minimizing architectural influences on our study, yet still large enough to capture underlying biological variations.

In this framework, they integrate key SSL pretext tasks based on masked autoencoders and contrastive learning to benchmark their performance. The framework operates in two stages. The first stage is pre-training on a pretext task, where the model learns from unlabeled data.

They call the resulting model 'SSL-zero-shot' for its zero-shot evaluation. The second stage is optional fine-tuning: the resulting model, called the 'SSL' model, is further trained on specific downstream tasks such as cell type annotation.

This SSL framework leverages a Masked Autoencoder with Random Masking and Gene Program (GP) Masking strategies, along with the Isolated Masked Autoencoder (iMAE) approaches of GP-to-GP and GP-to-Transcription-Factor masking, which consider isolated sets of genes.

The strategies entail leveraging different degrees of biological insight, from random masking with a minimal inductive bias to isolated masking that intensively utilizes known gene functions, emphasizing targeted biological relationships.
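
The difference between random masking and gene program masking can be sketched on a single cell's expression profile. The gene names, program names, and zero-filling of masked values are assumptions for illustration, not the authors' implementation.

```python
import random


def random_mask(genes, fraction, rng):
    """Random masking: hide a fixed fraction of genes, chosen uniformly."""
    n_masked = max(1, round(fraction * len(genes)))
    return set(rng.sample(genes, n_masked))


def gene_program_mask(programs, rng):
    """Gene program masking: hide every gene of one randomly chosen program."""
    name = rng.choice(sorted(programs))
    return name, set(programs[name])


def apply_mask(expression, masked):
    """Zero out the masked genes; a model would be trained to reconstruct them."""
    return {g: (0.0 if g in masked else v) for g, v in expression.items()}


# Hypothetical cell and gene programs.
expression = {"GATA1": 5.0, "KLF1": 3.0, "CD3E": 0.5, "CD19": 0.0}
programs = {"erythroid": ["GATA1", "KLF1"], "lymphoid": ["CD3E", "CD19"]}

rng = random.Random(0)
hidden_random = random_mask(list(expression), fraction=0.5, rng=rng)
program_name, hidden_program = gene_program_mask(programs, rng)
masked_cell = apply_mask(expression, hidden_program)
```

Random masking uses no prior knowledge, whereas the program-based mask encodes which genes are known to act together.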

□ The Backpack Quotient Filter: a dynamic and space-efficient data structure for querying k-mers with abundance.

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580441v1

The Backpack Quotient Filter (BQF) is an indexing data structure that also records abundance. Although the data can be anything, it was designed to index genomic datasets. The BQF is a dynamic structure: given a suitable hash function, it can add, delete, and enumerate elements.

BQF relies on a hash-table-like structure called Quotient Filter. Part of the information inserted is stored implicitly within the address in the table where it is written.

BQF inserts and queries s-mers but virtualizes the presence of k-mers at query time. In other words, a query sequence is broken down into k-mers, and each k-mer is virtually queried through all of its s-mers.
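
The virtual k-mer query can be sketched with an ordinary `Counter` standing in for the quotient filter. The class and method names are hypothetical, and reporting a k-mer's abundance as the minimum over its s-mers is an assumption of this sketch.

```python
from collections import Counter


def smers(seq, s):
    """All s-mers of a sequence, in order, with repeats."""
    return [seq[i:i + s] for i in range(len(seq) - s + 1)]


class SmerIndex:
    """Stores s-mer counts only; k-mers (k >= s) are answered virtually."""

    def __init__(self, k, s):
        if s > k:
            raise ValueError("s must not exceed k")
        self.k, self.s = k, s
        self.counts = Counter()

    def insert(self, seq):
        self.counts.update(smers(seq, self.s))

    def query_kmer(self, kmer):
        # A k-mer is reported present only if all of its s-mers are present;
        # its abundance is taken as the smallest s-mer count.
        return min(self.counts[x] for x in smers(kmer, self.s))

    def query_sequence(self, seq):
        # Break the query into k-mers and query each one virtually.
        return [self.query_kmer(seq[i:i + self.k])
                for i in range(len(seq) - self.k + 1)]


index = SmerIndex(k=5, s=3)
index.insert("ACGTACGT")
present = index.query_kmer("ACGTA")
absent = index.query_kmer("ACGTT")
profile = index.query_sequence("ACGTACG")
```

A k-mer that was never inserted can still be reported if all of its s-mers happen to occur elsewhere, so answers of this kind may include false positives.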

□ Identifying Reproducible Transcription Regulator Coexpression Patterns with Single Cell Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580581v1

The study adopts a "TR-centric" approach towards aggregating single cell coexpression networks, with the primary goal of learning reproducible TR interactions. It assembles a diverse range of scRNA-seq data to better understand the coexpression range of all measurable TRs.

The key aim was to prioritize the genes that are most frequently coexpressed with each TR, hypothesizing that this prioritization can facilitate the identification of direct TR-target interactions.

□ Marsilea: An intuitive generalized visualization paradigm for complex datasets

>> https://www.biorxiv.org/content/10.1101/2024.02.14.580236v1

Marsilea is a Python library designed for creating complex visualizations with ease. It is notable for its modularity, diverse plot types, and compatibility with various data formats, and it is also available through a coding-free web-based interface for users of all experience levels.

For datasets with a categorical axis, the paradigm allows incorporation of data-driven structure, for example, through hierarchical clustering showcasing similarities within and between data groups, adding a deeper analytical dimension.

Additionally, the paradigm offers versatility through concatenation and recursion: secondary plots can transform into central plots of new cross-layouts that are connected to the initial one, allowing for intricate and detailed visual representations of the data.

□ SNVstory: inferring genetic ancestry from genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05703-y

SNVstory incorporates samples/variants from three different curated datasets, expanding the number of labels and the granularity of the model classification beyond the main continental divisions.

Drawing upon the gnomAD database produces a much larger number of variants on which the models were trained, providing the opportunity to classify ancestry on a wider (or more diverse) range of features.

SNVstory excludes consanguineous samples, ensuring that the overrepresentation of closely related individuals does not bias the model. This implementation is optimized for individualized results rather than clustering large cohorts of samples into shared ancestral groups.

□ SF-Relate: Secure Discovery of Genetic Relatives across Large-Scale and Distributed Genomic Datasets

>> https://www.biorxiv.org/content/10.1101/2024.02.16.580613v1

SF-Relate is a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing approach.

SF-Relate constructs an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection.

SF-Relate uses a novel encoding scheme that splits and subsamples genotypes into k-SNPs (similar to k-mers, but non-contiguous), such that the similarity between k-SNPs reflects extended runs of identical genotypes, typically indicative of relatedness.
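
The bucketing idea can be sketched without any of the cryptographic machinery: sample several non-contiguous position sets, use the genotypes at each set as a bucket key, and compare only individuals that share a bucket. All names, the number of tables, and the toy genotypes are assumptions. SF-Relate's actual encoding, bucketing strategy, and secure computation are considerably more involved.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_ksnp_positions(n_sites, k, n_tables, rng):
    """Draw `n_tables` sorted, non-contiguous position sets of size k."""
    return [tuple(sorted(rng.sample(range(n_sites), k))) for _ in range(n_tables)]


def candidate_pairs(genotypes, position_sets):
    """Return pairs of individuals that share at least one k-SNP bucket."""
    pairs = set()
    for positions in position_sets:
        buckets = defaultdict(list)
        for name, genotype in genotypes.items():
            key = tuple(genotype[i] for i in positions)  # the k-SNP
            buckets[key].append(name)
        for members in buckets.values():
            pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical genotypes coded as 0/1/2 alternate-allele counts.
genotypes = {
    "A": [0, 1, 2, 0, 1, 2, 0, 1],
    "B": [0, 1, 2, 0, 1, 2, 0, 1],  # identical to A: a stand-in for a close relative
    "C": [1, 2, 0, 1, 2, 0, 1, 2],  # differs from A at every site
}
rng = random.Random(0)
position_sets = make_ksnp_positions(n_sites=8, k=3, n_tables=4, rng=rng)
pairs = candidate_pairs(genotypes, position_sets)
```

Only the pairs returned here would need a full relatedness comparison, which is how hashing reduces the number of pairs to compare.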

□ Flexiplex: a versatile demultiplexer and search tool for omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae102/7611801

Flexiplex is a versatile and fast sequence searching and demultiplexing tool based on the Levenshtein distance. Given a set of reads as either .fastq or .fasta, it will demultiplex and/or identify target sequences, reporting matching reads and read-barcode assignment.

Flexiplex first uses edlib to search for a left and right flanking sequence within each read. For the best match with an edit distance of “f” or less, it will trim to the barcode + UMI sequence +/- 5 bp either side, and search for the barcode against a known list.

Occasionally reads are chimeric, meaning two or more molecules get sequenced together in the same read. Flexiplex will repeat the search with the previously found primer-to-polyT sequence masked out. This is repeated until no new barcodes are found in the read.
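
The barcode lookup step can be sketched in pure Python with a textbook Levenshtein distance. Flexiplex itself relies on edlib. The function names, the distance threshold, and the rule of rejecting ties as ambiguous are assumptions of this sketch.

```python
def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def assign_barcode(observed, known_barcodes, max_dist):
    """Return the closest known barcode, or None if too distant or ambiguous."""
    scored = sorted((levenshtein(observed, bc), bc) for bc in known_barcodes)
    best_dist, best = scored[0]
    if best_dist > max_dist:
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # two barcodes are equally close: do not guess
    return best


# Hypothetical barcode list and observed sequences.
known = ["ACGTAGCA", "TTTTCCCC"]
hit = assign_barcode("ACGTTGCA", known, max_dist=2)
miss = assign_barcode("GGGGGGGG", known, max_dist=2)
```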

□ Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae104/7612230

SPIRES is demonstrated with a model of chemical to disease (CTD) associations based on the Biolink Model. Biolink extends the simple triple model of associations to include qualifiers on the predicate, subject, and object.

SPIRES performs grounding and normalization with the Ontology Access Kit library (OAKlib), which provides interfaces for multiple annotation tools, including the Gilda entity normalization tool, the BioPortal annotator, and the Ontology Lookup Service.

For identifier normalization a number of services can be used, including OntoPortal mappings, with the default being the NCATS Biomedical Translator Node Normalizer.

□ Squigualiser: Interactive visualisation of raw nanopore signal data

>> https://www.biorxiv.org/content/10.1101/2024.02.19.581111v1

Squigualiser builds upon existing methodology for signal-to-sequence alignment in order to anchor raw signal data points to their corresponding positions within basecalled reads or within a reference genome/transcriptome sequence.

Squigualiser enables efficient representation of signal alignments and normalises outputs. A new method for k-mer-to-base shift correction addresses ambiguity in signal alignments to enable visualisation of genetic variants and modified bases at single-base resolution.

□ SLIDE: Significant Latent Factor Interaction Discovery and Exploration across biological domains

>> https://www.nature.com/articles/s41592-024-02175-z

Significant Latent factor Interaction Discovery and Exploration (SLIDE) is a first-in-class interpretable machine learning technique for identifying significant interacting latent factors underlying outcomes of interest from high-dimensional omic datasets.

SLIDE makes no assumptions regarding data-generating mechanisms, comes with theoretical guarantees regarding identifiability of the latent factors and corresponding inference, and performs at least as well as, and often better than, state-of-the-art approaches in terms of prediction.

□ A tractable tree distribution parameterized by clade probabilities and its application to Bayesian phylogenetic point estimation

>> https://www.biorxiv.org/content/10.1101/2024.02.20.581316v1

The paper presents a new tractable tree distribution and associated point estimator that can be constructed from a posterior sample of trees. This point estimator performs at least as well as, and often better than, standard methods of producing Bayesian posterior summary trees.

□ Fast and accurate short read alignment with hybrid hash-tree data structure

>> https://www.biorxiv.org/content/10.1101/2024.02.20.581311v1

The actual sequencer should be able to generate many small fasta files for the data of one human genome, since the actual reading process is highly parallel. They assume that input data are available as a number of small fasta files.

This new hybrid hash-tree algorithm requires a fairly large (around 100GB) table to express the reference genome. Therefore, this table must be shared by the processes which handle the reads in parallel.

The SWG program performs the match through the Smith-Waterman-Gotoh algorithm and calculates the matching score; it does not require large tables. It processes one file and generates SAM-format output. For parallel processing they just run a fixed number of instances of this program in parallel.
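
For reference, the Smith-Waterman-Gotoh recurrence (local alignment with affine gap penalties) can be sketched as a score-only routine. The scoring parameters are illustrative assumptions, not those of the SWG program, and the sketch returns only the best score rather than a SAM record.

```python
def swg_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score of a and b with affine gap penalties.

    A gap of length L costs gap_open + (L - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)        # H: best score ending at (i-1, j)
    f_prev = [neg_inf] * (m + 1)  # F: best score ending with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h = [0] * (m + 1)
        f = [neg_inf] * (m + 1)
        e = neg_inf               # E: best score ending with a gap in a
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            e = max(e - gap_extend, h[j - 1] - gap_open)
            f[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            diag = h_prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            h[j] = max(0, diag, e, f[j])
            best = max(best, h[j])
        h_prev, f_prev = h, f
    return best


# Hypothetical read and reference fragment; the reference has two extra bases.
read = "ACGTACGT"
reference = "ACGTTTACGT"
score = swg_score(read, reference)
```

Only two rows of each matrix are kept, so memory grows with the length of one sequence, which fits the remark that this step does not require large tables.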

Constellation.

2024-02-22 22:01:10 | Movies

□ 『Constellation』 (Apple TV+)

>> https://www.apple.com/tv-pr/originals/constellation/

2024
Created by Peter Harness
Based on a concept by Sean Jablonski
Music by Ben Salisbury and Suvi-Eeva Äikäs

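『Constellation』 (Apple TV+): a next-level SF thriller starring Noomi Rapace, the actress who currently looks best in a spacesuit. The early highlight is the tense survival drama in episodes 1-2, an escape from a catastrophically damaged ISS. Lyrical, Villeneuve-like shots, such as dawn in orbit and a convoy of vehicles crossing the desert, also leave an impression.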

□ Surrogate Sibling / “Tellur”
(Constellation Opening | Opening theme | APPLE TV+)

Luca d'Alberto / “In Our Hearts”

2024-02-22 22:00:10 | art music

□ Luca d'Alberto / “In Our Hearts”

>> https://luca-dalberto.com/

Release Date; 23/02/2024
Label; Decca

1. Adore
2. Beautiful As A Memory
3. We Fall In
4. Tomorrow
5. Malinconica (6/5/1917)
6. Silence
7. Ready For Life
8. A New Dress
9. Anima

Studio Personnel, Mixer, Producer, Associated Performer, Recording Arranger, Pianoforte, Cello, Viola, Violin, Electric 6- String, Programming, Synthesizer: Luca D'Alberto
Studio Personnel, Mixer: Martyn Heyne
Composer: Luca D'Alberto

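The new album from Italian contemporary composer Luca D'Alberto. Electronica elements add vivid shading to chamber-style, post-classical performances. Many tracks begin with melancholic melodies reminiscent of Clint Mansell and rise toward a soaring second half, fully voicing the album's theme of 'hope'.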

□ Luca d'Alberto / “Flowers & Thorns”

□ Luca d'Alberto feat. Tom Smith / “We Fall In”

Lavinia Meijer / “WINTER”

2024-02-22 21:13:52 | art music

□ Lavinia Meijer / “WINTER”

Released on: 2024-02-02

Producer, Recording Engineer: Joris Wolff

Lavinia Meijer, a world-renowned soloist who continues to explore the expressive possibilities of the harp, performs here with the Amsterdam-based Alma Quartet and bassist Reyer Zwart.

□ Lavinia Meijer · Alma Quartet Amsterdam · Reyer Zwart / “Amethyst”

THE DAYS

2024-02-21 21:30:44 | Movies

□ 『THE DAYS』

>> https://www.netflix.com/jp/title/81233755

Netflix (2023)

8 episodes

Directed by Masaki Nishiura / Hideo Nakata
Based on the novel by Ryusho Kadota
Screenplay by Hun Masumoto
Music by Brian D'Oliveira
Cinematography by Gen Kobayashi
Production design by Yuko Iizuka

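What happened at the Fukushima Daiichi Nuclear Power Plant that day? Based on the records and testimony meticulously left by the late plant manager Masao Yoshida, the series depicts the events with overwhelming realism and tension. Hats off to the anguished performances of people forced into tightrope decisions while at the mercy of a runaway reactor, organizational management, and politics. Brian D'Oliveira's weighty score, which conveys an 'invisible fear', is the highlight.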

□ nemureruongakunooto / “眠る時に聞く音楽”

‘Til the End of Time

2024-02-15 00:00:04 | Art & Culture

(Created with Midjourney v6.0 ALPHA)

□ Delerium / “‘Til The End Of Time” https://youtu.be/ygn3PqOXRD0


lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.