lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

Oblivion.

2023-03-13 03:12:03 | Science News




□ InClust+: the multimodal version of inClust for multimodal data integration, imputation, and cross modal generation

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532376v1

inClust+ extends inClust by adding two new modules: an input-mask module in front of the encoder and an output-mask module behind the decoder. It can integrate multimodal data profiled from different cells in similar populations or from a single cell.

inClust+ encodes the scRNA and MERFISH data into latent space separately. After the covariates (modalities) are removed by vector subtraction, samples from different modalities are mixed together and clustered according to their cell types.
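The vector subtraction step can be illustrated with a minimal sketch. It assumes each modality shifts the latent vectors by a constant offset and estimates that offset as the per-modality mean, which is a stand-in for the covariate vector that inClust+ learns; the function name and the toy data are hypothetical.

```python
import numpy as np

def remove_modality_offset(latent, modality):
    """Subtract each modality's mean latent vector (stand-in for a learned covariate vector)."""
    latent = np.asarray(latent, dtype=float)
    modality = np.asarray(modality)
    corrected = latent.copy()
    for m in np.unique(modality):
        mask = modality == m
        corrected[mask] -= latent[mask].mean(axis=0)
    return corrected

# Toy data: the same two cell states seen through two modalities with a constant shift.
latent = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 5.0], [12.0, 5.0]])
modality = np.array(["rna", "rna", "merfish", "merfish"])
corrected = remove_modality_offset(latent, modality)
```

Mean-centering only behaves like this when both modalities contain similar cell populations, which matches the setting described above.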





□ RNA-MSM: Multiple sequence-alignment-based RNA language model and its application to structural inference

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532863v1

While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved.

The RNA MSA-transformer language model (RNA-MSM) takes multiple aligned sequences as input and outputs the corresponding embeddings and attention maps. These outputs can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities.






□ Quantum computing algorithms: getting closer to critical problems in computational biology

>> https://academic.oup.com/bib/article/23/6/bbac437/6758194

QiBAM basically extends Grover’s search algorithm to allow for errors in the alignment between reads and the reference sequence stored in a quantum memory. The qubit complexity is O(M · log₂ A + log₂(N − M)).

The longest diagonal patterns in the matrix, possibly not perfectly shaped owing to mismatches and short insertions/deletions, highlight the regions of highest similarity and can be detected with quantum pattern recognition. The overall time complexity of the method is O(log₂(NM)).

Quantum solutions for the de novo assembly problems are based on strategies for efficiently solving the Hamiltonian path in OLC graphs.

The iterative application of the time evolution operators relative to the cost and mixing Hamiltonian approximates the adiabatic transition between the ground state of the mixing Hamiltonian and the ground state of the cost Hamiltonian that represents the optimal solution.





□ On quantum computing and geometry optimization

>> https://www.biorxiv.org/content/10.1101/2023.03.16.532929v1

This work attempts to explore a few ways in which classical data, relating to the Cartesian space representation of biomolecules, can be encoded for interaction with empirical quantum circuits not demonstrating quantum advantage.

Using the quantum circuit for random state generation in a variational arrangement together with a classical optimizer, this work deals with the optimization of spatial geometries with potential application to molecular assemblies.

Dihedral data is used with a quantum support vector classifier to introduce machine learning capabilities. Additionally, empirical rotamer sampling is demonstrated using quantum Monte Carlo simulations for side-chain conformation sampling.





□ DTWax: GPU-accelerated Dynamic Time Warping for Selective Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2023.03.05.531225v1

Subsequence Dynamic Time Warping (sDTW) is a two-dimensional dynamic programming algorithm tasked with finding the best mapping of the whole input query squiggle within the longer target reference.

DTWax, a GPU-accelerated sDTW software for nanopore Read Until that saves the time and cost of nanopore sequencing and compute. DTWax uses floating point operations and Fused-Multiply-Add operations. DTWax achieves ∼1.92X sequencing speedup and ∼3.64X compute speedup.
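For reference, the sDTW recurrence itself fits in a few lines. This is a plain CPU sketch, not DTWax's GPU kernel; the absolute-difference cost and the function name are assumptions made for illustration.

```python
import numpy as np

def sdtw(query, reference):
    """Best cost of aligning the whole query to any subsequence of reference, and where it ends."""
    q = np.asarray(query, dtype=float)
    r = np.asarray(reference, dtype=float)
    # Row 0: the alignment may start at any reference position at no extra cost.
    prev = np.abs(q[0] - r)
    for i in range(1, len(q)):
        cur = np.empty(len(r))
        cur[0] = prev[0] + abs(q[i] - r[0])
        for j in range(1, len(r)):
            # Extend from the cell above, the diagonal cell, or the cell to the left.
            cur[j] = abs(q[i] - r[j]) + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    end = int(np.argmin(prev))  # the alignment may also end anywhere
    return float(prev[end]), end

cost, end = sdtw([1, 2, 3], [9, 9, 1, 2, 3, 9])
```

Only two rows of the matrix are kept in memory, since each cell depends on the previous row and the current one.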





□ Quantum algorithm for position weight matrix matching

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531403v1

PWM matching is applied to a long genomic DNA sequence of millions of bases such that every segment i in the DNA sequence is assigned a score W_M(u_i ... u_{i+m−1}), and they search for P_sol, the segments with scores higher than the threshold w_th.

The PWM matching quantum algorithm is based on the naive iteration method. For any sequence of length n and any K PWMs for sequence motifs of length m, given oracles that return a specified entry, it can find n matches with high probability by making queries to the oracles.
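As a classical baseline for what the quantum algorithm speeds up, the scan below scores every length-m segment and keeps those above the threshold. It is a minimal sketch: representing the PWM as one dictionary per motif position and the toy values are assumptions, and nothing here is quantum.

```python
def pwm_matches(sequence, pwm, threshold):
    """Return start positions whose segment score is higher than the threshold."""
    m = len(pwm)
    hits = []
    for i in range(len(sequence) - m + 1):
        # Segment score: sum of the PWM entries selected by each base of the segment.
        score = sum(pwm[k][sequence[i + k]] for k in range(m))
        if score > threshold:
            hits.append(i)
    return hits

# Hypothetical 2-base motif that favours "AC".
toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
hits = pwm_matches("ACGAC", toy_pwm, threshold=1.5)
```

This scan touches every segment once per PWM, which is the cost the oracle-based algorithm aims to reduce.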





□ scMCs: a framework for single cell multi-omics data integration and multiple clusterings

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad133/7079796

scMCs uses the omics-independent deep autoencoders to learn the low-dimensional representation of each omics. scMCs utilizes the contrastive learning strategy, and fuses the individuality and commonality features into a compact co-embedding representation for data imputation.

scMCs applies a multi-head attention mechanism to the co-embedding representation to generate multiple salient subspaces and reduce the redundancy between subspaces. scMCs optimizes a Kullback-Leibler (KL) divergence based clustering loss in each salient subspace.





□ CLASSIC: Ultra-high throughput mapping of genetic design space

>> https://www.biorxiv.org/content/10.1101/2023.03.16.532704v1

CLASSIC (combining long- and short-range sequencing to investigate genetic complexity), a generalizable genetic screening platform that combines long- and short-read NGS modalities to quantitatively assess pooled libraries of DNA constructs of arbitrary length.

Due to the random assignment of barcodes to assembled constructs, each variant in a CLASSIC library is associated with multiple unique barcodes that generate independent phenotypic measurements, leading to greater accuracy than a one-to-one construct-to-barcode library.





□ EnsembleTR: A deep population reference panel of tandem repeat variation

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531600v1

EnsembleTR, which takes TR genotypes output by existing tools (currently ExpansionHunter, adVNTR, HipSTR, and GangSTR) as input, and outputs a consensus TR callset by converting TR genotypes to a consistent internal representation and using a voting-based scheme.

They apply EnsembleTR to genotype 1.7 million TRs based on the hg38 reference genome across deep PCR-free WGS for 3,202 individuals from the 1000GP2 and PCR+ WGS data for 348 individuals from H3Africa Project.

EnsembleTR then identifies overlapping TR regions genotyped by two or more tools, infers a mapping between alternate allele sets reported by each method, and outputs a consensus genotype and quality score for each call.





□ Direct Estimation of Parameters in ODE Models Using WENDy: Weak-form Estimation of Nonlinear Dynamics

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10002818/

WENDy is a highly robust and efficient method for parameter inference in differential equations. Without relying on any numerical differential equation solvers, WENDy computes accurate estimates and is robust to large (biologically relevant) levels of measurement noise.

WENDy is competitive with conventional forward solver-based nonlinear least squares methods in terms of speed and accuracy. For both higher dimensional systems and stiff systems, WENDy is typically both faster and more accurate than forward solver-based approaches.
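The solver-free idea can be shown on the simplest possible case, u' = −θu. Multiplying by a compactly supported test function φ and integrating by parts gives ∫φ'u dt = θ∫φu dt, so θ follows from integrals of the data alone. This is a sketch of the weak-form core only, with ordinary least squares; WENDy's treatment of measurement noise is not reproduced, and the test functions, names and grid are assumptions.

```python
import numpy as np

def trapezoid(y, t):
    """Composite trapezoidal rule on the sample grid t."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)

def bump(t, center, radius, p=2):
    """Test function (1 - s^2)^p on |s| < 1 and its time derivative (requires p >= 2)."""
    s = (t - center) / radius
    base = np.clip(1.0 - s**2, 0.0, None)  # zero outside the support
    phi = base**p
    dphi = -2.0 * p * s / radius * base**(p - 1)
    return phi, dphi

def weak_form_decay_rate(t, u, centers, radius):
    """Estimate theta in u' = -theta * u without calling an ODE solver."""
    a, b = [], []
    for c in centers:
        phi, dphi = bump(t, c, radius)
        a.append(trapezoid(phi * u, t))   # integral of phi * u
        b.append(trapezoid(dphi * u, t))  # integral of phi' * u
    a, b = np.array(a), np.array(b)
    return float(a @ b / (a @ a))  # least-squares solution of b = theta * a

t = np.linspace(0.0, 4.0, 401)
u = np.exp(-1.5 * t)
theta_hat = weak_form_decay_rate(t, u, centers=[1.0, 2.0, 3.0], radius=1.0)
```

Because the derivative is moved onto the smooth test function, the data are never differentiated, which is where the robustness to noise comes from.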





□ miloDE: Sensitive cluster-free differential expression testing.

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531744v1

miloDE exploits the notion of overlapping neighborhoods of homogeneous cells, constructed from graph-representation of scRNA-seq data, and performs testing within each neighborhood. Multiple testing correction is performed either across neighborhoods or across genes.

As input, the algorithm takes a set of samples with given labels (case or control) alongside a joint latent embedding. Next, miloDE generates a graph recapitulating the distances between cells and defines neighborhoods using the 2nd-order kNN graph.





□ GPMeta: a GPU-accelerated method for ultrarapid pathogen identification from metagenomic sequences

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad092/7077155

GPMeta can rapidly and accurately remove host contamination, isolate microbial reads, and identify potential disease-causing pathogens. GPMeta is much faster than existing CPU-based tools, being 5-40x faster than Kraken2 and Centrifuge and 25-68x faster than Bwa and Bowtie2.

GPMeta offers the GPMetaC clustering algorithm, a statistical model for clustering and rescoring ambiguous alignments to improve the discrimination of highly homologous sequences.





□ SpaSRL: Spatially aware self-representation learning for tissue structure characterization and spatial functional genes identification

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532390v1

Spatially aware self-representation learning (SpaSRL), a novel method that achieves spatial domain detection and dimension reduction in a unified framework while flexibly incorporating spatial information.

SpaSRL enhances and decodes the shared expression between spots, simultaneously optimizing the low-dimensional spatial components (i.e., spatial meta genes) and spot-spot relations through a joint learning model in which each can pass spatial information constraints to the other.

SpaSRL can improve the performance of each task and fill the gap between the identification of spatial domains and functional (meta) genes accounting for biological and spatial coherence on tissue.





□ compare_genomes: a comparative genomics workflow to streamline the analysis of evolutionary divergence across genomes

>> https://www.biorxiv.org/content/10.1101/2023.03.16.533049v1

compare_genomes, a transferable and extendible comparative genomics workflow built using the Nextflow framework and Conda package management system.

compare_genomes provides a wieldy pipeline to test for non-random evolutionary patterns which can be mapped to evolutionary processes to help identify the molecular basis of specific features or remarkable biological properties of the species analysed.





□ LBConA: a medical entity disambiguation model based on Bio-LinkBERT and context-aware mechanism

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05209-z

LBConA first uses Bio-LinkBERT, which is capable of learning cross-document dependencies, to obtain embedding representations of mentions and candidate entities. Then, cross-attention is used to capture the interaction information of mention-to-entity and entity-to-mention.

It then encodes the context of mentions using ELMo, which captures lexical information, and computes the context score using a self-attention mechanism to obtain contextual cues for disambiguation.





□ nPoRe: n-polymer realigner for improved pileup-based variant calling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05193-4

Defining copy number INDELs as n-polymers (3+ exact copies of the same repeat unit), with a differing number of copies from the expected reference. For example, AAAA→AAAAA and ATATAT→ATAT meet this definition, but ATAT→ATATAT, AATAATAAAT→AATAAT, and ATATAT→ATATA do not.

nPoRe’s algorithm is directly designed to reduce alignment penalties for n-polymer copy number INDELs and improve alignment in low-complexity regions. It extends Needleman-Wunsch affine gap alignment with new gap penalties for more accurately aligning repeated n-polymer sequences.
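The n-polymer definition can be checked mechanically against the five examples given with it. The sketch below encodes one reading of the definition: the reference allele must be 3+ exact copies of its shortest repeat unit and the alternate allele must be a different whole number of copies of that same unit. The function names are hypothetical, and this is not nPoRe's own code.

```python
def primitive_unit(seq):
    """Shortest unit u such that seq is an exact whole number of copies of u."""
    for size in range(1, len(seq) + 1):
        if len(seq) % size == 0 and seq[:size] * (len(seq) // size) == seq:
            return seq[:size]
    return seq

def is_npolymer_indel(ref, alt, min_copies=3):
    """True if ref -> alt only changes the copy number of an exact repeat unit."""
    if not ref or not alt:
        return False
    unit = primitive_unit(ref)
    ref_copies = len(ref) // len(unit)
    if ref_copies < min_copies:
        return False  # too few exact copies in the reference
    alt_copies, remainder = divmod(len(alt), len(unit))
    return remainder == 0 and alt_copies != ref_copies and unit * alt_copies == alt
```

Under this reading, ATAT→ATATAT fails because the reference holds only two copies, AATAATAAAT→AATAAT fails because the reference copies are not exact, and ATATAT→ATATA fails because the alternate allele ends in a partial unit.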





□ PhyloSophos: a high-throughput scientific name mapping algorithm augmented with explicit consideration of taxonomic science

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533059v1

PhyloSophos, a high-throughput scientific name processor designed to provide connections between scientific name inputs and a specific taxonomic system. PhyloSophos is conceptually a mapper that returns the corresponding taxon identifier from a reference of choice.

PhyloSophos can refer to multiple available references to search for synonyms and recursively map them into a chosen reference. It also corrects common Latin variants and vernacular names, then returns the proper scientific names and their corresponding taxon identifiers.





Singular Genomics RT

>> https://singulargenomics.com/g4/reagents/

We’ve designed a selection of kits for the G4 with multiple configurations depending on read length and size requirements for maximum system flexibility and cost efficiency.

Explore the capabilities of the F2, F3, and Max Read Kits for your application





□ Robust classification using average correlations as features (ACF)

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05224-0

In contrast to the KNN classifier, ACF intrinsically considers all cross-correlations between classes, without limiting itself to certain elements of CTrain. DBC incorporates cross-correlations but relies on a fixed claiming-scheme and weighted Kullback–Leibler decision rules.

For ACF, the baseline classifier may instead be chosen depending on the data and can be further adapted, e.g. by increasing the depth of decision trees. The modularity of ACF allows deep-learning based methods, such as a Multi-Layer Perceptron, to be integrated as the baseline classifier.





□ aenmd: Annotating escape from nonsense-mediated decay for transcripts with protein-truncating variants

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533185v1

aenmd predicts escape from NMD for combinations of transcripts and PTC-generating variants by applying a set of NMD-escape rules, which are based on where the PTC is situated within the mutant transcript.

Variant-transcript pairs with a PTC conforming to any of the above rules will be annotated to escape NMD, but results for all rules are reported individually by aenmd; this allows users to focus on subsets of rules.





□ seqspec: A machine-readable specification for genomics assays

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533215v1

seqspec, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays.

seqspec defines a machine-readable file format, based on YAML. Reads are annotated by Regions which can be nested and appended to create a seqspec. Regions are annotated with a variety of properties that simplify the downstream identification of sequenced elements.





□ C.Origami: Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening

>> https://www.nature.com/articles/s41587-022-01612-8

C.Origami, a multimodal deep neural network that performs de novo prediction of cell-type-specific chromatin organization using DNA sequence and two cell-type-specific genomic features—CTCF binding and chromatin accessibility.

C.Origami enables in silico experiments to examine the impact of genetic changes on chromatin interactions. The accuracy of C.Origami allows systematic identification of cell-type-specific mechanisms of genomic folding through in silico genetic screening (ISGS).





□ Seqpac: A framework for sRNA-seq analysis in R using sequence-based counts

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad144/7082956

Seqpac is designed to preserve sequence integrity by avoiding a feature-based alignment strategy that normally disregards sequences that fail to align to a target genome.

Using an innovative targeting system, Seqpac processes, analyzes and visualizes sample or sequence group differences using the PAC object. Seqpac uses a strategy for sRNA-seq analysis that preserves the integrity of the raw sequence, making the data lineage fully traceable.





□ The hidden factor: accounting for covariate effects in power and sample size computation for a binary trait

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad139/7082519

When performing power estimation or replication sample size calculation for a continuous trait through linear regression, covariate effects are implicitly accounted for through residual variance.

When analyzing a binary trait through logistic regression, covariate effects must be explicitly specified and included in power and sample size computation, in addition to the genetic effect of interest.

SPCompute is used for accurate and efficient power and sample size computation for a binary trait that takes into account different types of non-genetic covariates E, and allows for different types of G-E relationship.





□ OutSingle: A Novel Method of Detecting and Injecting Outliers in RNA-seq Count Data Using the Optimal Hard Threshold for Singular Values

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad142/7083276

OutSingle (Outlier detection using Singular Value Decomposition), an almost instantaneous way of detecting outliers in RNA-Seq GE data. It uses a simple log-normal approach for count modeling.

OutSingle uses the Optimal Hard Threshold (OHT) method for noise detection, which itself is based on Singular Value Decomposition (SVD). Due to its use of SVD and OHT, OutSingle’s model is straightforward to understand and interpret.
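A generic version of the OHT rule is sketched below, assuming the unknown-noise variant of Gavish and Donoho: singular values above ω(β) times the median singular value are treated as signal, with ω(β) given by their cubic approximation. How OutSingle applies this split to its count model is not shown here; the function names and the synthetic matrix are illustrative.

```python
import numpy as np

def omega(beta):
    """Polynomial approximation of the optimal threshold coefficient (unknown noise level)."""
    return 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43

def oht_denoise(x):
    """Keep only the singular values above the optimal hard threshold."""
    x = np.asarray(x, dtype=float)
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    beta = min(x.shape) / max(x.shape)  # aspect ratio in (0, 1]
    tau = omega(beta) * np.median(s)    # threshold scaled by the median singular value
    keep = s > tau
    return (u[:, keep] * s[keep]) @ vt[keep], int(keep.sum())

# Synthetic example: a strong rank-2 signal plus unit-variance noise.
rng = np.random.default_rng(0)
low_rank = 10.0 * rng.normal(size=(100, 2)) @ rng.normal(size=(2, 50))
noisy = low_rank + rng.normal(size=(100, 50))
denoised, rank = oht_denoise(noisy)
```

The threshold needs no tuning parameter beyond the matrix shape, which is part of why the resulting model is easy to interpret.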





□ ReConPlot – an R package for the visualization and interpretation of genomic rearrangements

>> https://www.biorxiv.org/content/10.1101/2023.02.24.529890v2

ReConPlot (REarrangement and COpy Number PLOT), an R package that provides functionalities for the joint visualization of SCNAs and SVs across one or multiple chromosomes.

ReConPlot is based on the popular ggplot2 package, thus allowing customization of plots and the generation of publication-quality figures with minimal effort. ReConPlot facilitates the exploration, interpretation, and reporting of complex genome rearrangement patterns.





□ MetaLLM: Residue-wise Metal ion Prediction Using Deep Transformer Model

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533488v1

MetaLLM, a metal binding site prediction technique, by leveraging the recent progress in self-supervised attention-based (e.g. Transformer) large language models (LLMs) and a considerable amount of protein sequences.

MetaLLM uses a transformer pre-trained on an extensive database of protein sequences and later fine-tuned on metal-binding proteins for multi-label metal ions prediction. A 10-fold cross-validation shows more than 90% precision for the most prevalent metal ions.





□ escheR: Unified multi-dimensional visualizations with Gestalt principles

>> https://www.biorxiv.org/content/10.1101/2023.03.18.533302v1

Existing visualization methods create cognitive gaps on how to associate the disparate information or how to interpret the biological findings of this multi-dimensional information regarding their (micro-)environment or colocalization.

escheR leverages Gestalt principles to improve the design and interpretability of multi-dimensional data in 2D data visualizations, layering aesthetics to display multiple variables.





□ RExPRT: a machine learning tool to predict pathogenicity of tandem repeat loci

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533484v1

RExPRT is designed to distinguish pathogenic from benign TR expansions. Leave-one-out cross-validation results supported an ensemble approach comprising an SVM and an extreme gradient boosted decision tree (XGB).

RExPRT uses GridSearchCV to fine-tune the SVM and XGB models. RExPRT incorporates information on the genetic architecture of a TR locus, such as its proximity to regulatory regions, TAD boundaries, and evolutionary constraints.





□ Cue: a deep-learning framework for structural variant discovery and genotyping

>> https://www.nature.com/articles/s41592-023-01799-x

Cue, a novel generalizable framework for SV calling and genotyping, which can effectively leverage deep learning to automatically discover the underlying salient features of different SV types and sizes.

Cue calls and genotypes SVs by learning complex SV abstractions directly from the data. Cue converts alignments to images that encode SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype and genomic locus of the SVs captured in each image.





□ FLONE: fully Lorentz network embedding for inferring novel drug targets

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533432v1

FLONE, a novel hyperbolic Lorentz space embedding-based method to capture the hierarchical structural information in the DDT network. FLONE generates more accurate candidate target predictions given the drug and disease than the Euclidean translation-based counterparts.

FLONE enables a hyperbolic similarity calculation based on FuLLiT (fully Lorentz linear transformation), which essentially calculates the Lorentzian distance (i.e., similarity) between the hyperbolic embeddings of candidate targets and the hyperbolic representation.
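For orientation, distances in the Lorentz (hyperboloid) model can be computed as below. This is a sketch of the standard geodesic distance at curvature −1, arccosh(−⟨x, y⟩_L); whether FLONE uses this exact form or a squared Lorentzian variant is an assumption here, and the FuLLiT transformation is not reproduced.

```python
import numpy as np

def lift(v):
    """Map a Euclidean vector onto the unit hyperboloid by adding a time-like coordinate."""
    v = np.asarray(v, dtype=float)
    return np.concatenate(([np.sqrt(1.0 + v @ v)], v))

def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time-like term plus the space-like terms."""
    return -x[0] * y[0] + x[1:] @ y[1:]

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid; the clamp guards against rounding below 1."""
    return float(np.arccosh(max(1.0, -lorentz_inner(x, y))))

# In one spatial dimension the point (cosh a, sinh a) sits at hyperbolic position a.
x = lift([np.sinh(0.3)])
y = lift([np.sinh(1.1)])
d = lorentz_distance(x, y)
```

A smaller distance would correspond to a higher similarity between a candidate target and the hyperbolic representation it is compared against.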





□ Flexible parsing and preprocessing of technical sequences with splitcode

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533521v1

splitcode can simultaneously trim adapter sequences, parse combinatorial barcodes that are variable in length and inconsistent in location within a read, and extract UMIs that are defined in location with respect to other technical sequences rather than at a set position within a read.

splitcode can seamlessly interface with other command-line tools, including other sequencing read preprocessors as well as read mappers, by streaming the pre-processed reads into those tools.





□ Inference of single cell profiles from histology stains with the Single-Cell omics from Histology Analysis Framework (SCHAF)

>> https://www.biorxiv.org/content/10.1101/2023.03.21.533680v1

SCHAF discovers the common latent space from both modalities across different samples. SCHAF then leverages this latent space to construct an inference engine mapping a histology image to its corresponding (model-generated) single-cell profiles.





Oxford Nanopore RT

>> https://newstimes18.com/how-ai-is-transforming-genomics/

Analysing sequencing data requires accelerated compute & #datascience to read and understand the genome. Read why #AI, #deeplearning, #RNN- and CNN-based models are essential for #genomics.





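□ My current role has shifted from analysis and planning toward development. GPT-4 brings its greatest benefit to the core of strategy; in existing integrated environments where requirement definitions pile up on one another, the efficiency of generating substitute code is limited. The options are either to have it design the environment under specific cost constraints, or to build diagnostic functions between interfaces.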



Infinite Improbability Drive.

2023-03-12 22:22:22 | Art & Culture

(“Planets collectors.” by Andrei)




Segment - It is difficult to deploy policy, strategy, scientific evidence, and social consensus consistently along the same axis on any given issue; moving from their intersection to an equilibrium takes a certain process. The problem is that conflicts across three layers, 'policy and strategy level,' 'policy and social consensus,' and 'strategy and scientific evidence,' can lead to serious alienation of the masses.







隠入塵煙: Return to Dust (小さき麦の花)

2023-03-12 18:44:06 | Movies

□ 『小さき麦の花(隠入塵煙: Return to Dust)』

>> https://moviola.jp/muginohana/

Directed by Ruijun Li
Music by Peyman Yazdanian
Cinematography by Weihua Wang

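A miraculous film that won support in spite of the Chinese government's censorship. Villagers who live rooted in the earth, far removed from any romantic notion of honest poverty, and a potent dose of scathing social criticism. The privileged and the marginalized alike all return to the soil. People and animals draw close to one another amid injustice, and the fruit of their bond comes into bud.

The sound of the donkey's bell brings a lump to my throat.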


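Apart from its politically and ethnically critical message, it is enough simply to watch the life of villagers in China's hinterland, which hardly seems to be set in the present day. The changing seasons, the temperature swing between day and night, bundled wheat, and the scent of dust clinging to the skin: a film in which you feel the breathing of nature, people, and animals up close.

Interpretations of the ending will probably differ somewhat (in fact, the version released in China was replaced with one that differs from the version shown at the Berlin International Film Festival), but I have already watched it three times and cried my eyes out all three times. Honestly… I wish I had lived a life with a donkey in it…

The programme booklet is also beautifully designed, and since there are few Japanese-language sources on this film, I highly recommend buying it.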


Peyman Yazdanian / “The Snow”


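The score is by the Iranian composer Peyman Yazdanian. The simple seasoning of guitar and ambient textures fits the mood perfectly.

That said, for me the world of this film is exactly that of the music video for ENIGMA's ”Return to Innocence”, and it makes me feel deeply nostalgic.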





Delerium / “Signs”

2023-03-10 22:13:37 | delerium


□ Delerium / “Signs”

>> https://www.metropolis-records.com/product/11858/signs

Release Date; 10/03/2023
Label; Metropolis Records

Writer: Bill Leeb
Writer: Rhys Fulber


>> tracklisting.

01. Falling Back to You (with Mimi Page)
02. Rain
03. Coast to Coast (feat. Phildel)
04. Sun Storm
05. In The Deep (feat. KANGA)
06. Esque
07. Remember Love (with Mimi Page)
08. Amebedo
09. Streetcar (feat. Inna Walters)
10. Glimmer (with Emily Haines) [Delerium Remix]
11. The Astronomer
12. Absolution (with Mimi Page)


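Delerium / “SIGNS”: the first new album in seven years from Delerium, the dreamlike Canadian electronica duo. The track-making leans toward Conjure One throughout, but Bill Leeb's poeticism sits at its core, and with Mimi Page's layered voice so prominent, it strongly carries on the direction of the previous album, ”Mythologie”.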

Delerium has explored any number of aspects of electronic music, ceaselessly evolving and exploring, seemingly traversing genres in search of the exquisite.

Signs is a masterwork of hypnotic rhythms and enveloping ambience, with stunning vocal contributions from Mimi Page, Phildel, Inna Walters, and KANGA. Each singer's unique voice elevates Signs, adding levels of aching beauty and romanticism.


□ Rain


□ The Astronomer





EDNE / ENDE

2023-03-10 18:12:06 | Art & Culture


□ 『EDNE』 by junaida

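An homage to Michael ENDE / “Der Spiegel im Spiegel” (Mirror in the Mirror). A picture book made up of pairs of paintings, drawn in slightly differing mirror symmetry across the boundary of each spread, each accompanied by a quoted passage. A classic that could be called the origin of the multiverse. We dissolve into the story, can become anyone, and search for a someone who is no one.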



Who went through this door.
When, and from which side.
And, why did they walk inside.

“Who passed through this door? From which side did they pass? When was it? And why?”

Why go through this door.
When, and from which side.
And, who is it that actually walks inside.

“Why pass through this door? When? From which side? And who is it?”


My Best Favourite Movies.

2023-03-09 03:09:03 | Movies
(“STALKER”)


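I usually watch films I like several times. But under the narrow condition of films that I acknowledge as astonishing works (apart from their public reputation) and yet could watch only once because the awe was too strong, the list narrows down to “STALKER” (A. Tarkovsky), “The Dark Knight” (C. Nolan), and “Dogville” (L. Trier).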



(“The Dark Knight”)

(“Dogville”)





III.

2023-03-03 03:03:03 | Science News




□ CLAIRE: contrastive learning-based batch correction framework for better balance between batch mixing and preservation of cellular heterogeneity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad099/7055295

CLAIRE, a dynamical construction strategy by exploiting inter-batch mutual nearest neighbors (MNN) and intra-batch k-nearest neighbors (KNN). CLAIRE uses inter-batch MNN pairs as seeds of positive pairs and augments these seeds with intra-batch KNN to generate positive pairs.

CLAIRE directly removes some MNNs within only one iteration. CLAIRE’s integrated embeddings can accurately transfer labels between scRNA-seq datasets and across omics. CLAIRE can preserve the contiguous structure among cells after removing batch effect.

CLAIRE randomly samples cells from the whole dataset to generate negative samples for each positive pair. CLAIRE pushes positive pairs closer in the latent space while pushing each sample away from its negative keys.
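The MNN seeds mentioned above can be found with a brute-force search. This is a minimal sketch with Euclidean distances on small arrays; the function names are hypothetical, and CLAIRE's KNN augmentation and contrastive training are not included.

```python
import numpy as np

def knn_indices(a, b, k):
    """For each row of a, the indices of its k nearest rows in b."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return np.argsort(dist, axis=1)[:, :k]

def mutual_nearest_neighbors(batch1, batch2, k=1):
    """Pairs (i, j) where j is among i's neighbors in batch2 and i is among j's in batch1."""
    b1 = np.asarray(batch1, dtype=float)
    b2 = np.asarray(batch2, dtype=float)
    nn12 = knn_indices(b1, b2, k)
    nn21 = knn_indices(b2, b1, k)
    return sorted(
        (i, int(j)) for i in range(len(b1)) for j in nn12[i] if i in nn21[j]
    )

batch1 = [[0.0, 0.0], [5.0, 5.0], [20.0, 20.0]]
batch2 = [[0.1, 0.0], [5.0, 5.2]]
pairs = mutual_nearest_neighbors(batch1, batch2)
```

The third cell of the first batch has a nearest neighbor in the second batch, but the relation is not mutual, so it yields no seed pair.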





□ HexSE: Simulating evolution in overlapping reading frames

>> https://academic.oup.com/ve/article/9/1/vead009/7023538

HexSE is a Python module designed to simulate sequence evolution along a phylogeny while considering the coding context of the nucleotides. The ultimate purpose of HexSE is to account for multiple selection pressures on Overlapping Reading Frames.

HexSE uses an exact stochastic algorithm of discrete events. Traversing the event probability tree resolves the shared characteristics for a subset of substitution events. The tip stores references to the nucleotide substitution events that have the same probability of occurring.





□ PHOENIX: Biologically informed NeuralODEs for genome-wide regulatory dynamics

>> https://www.biorxiv.org/content/10.1101/2023.02.24.529835v1

PHOENIX, a modeling framework based on neural ordinary differential equations (NeuralODEs) and Hill-Langmuir kinetics, that can flexibly incorporate prior domain knowledge and biological constraints to promote sparse, biologically interpretable representations of ODEs.

PHOENIX operates on the original gene expression space and does not require any dimensionality reduction, thus preventing information loss. PHOENIX encodes an extractable GRN that captures key mechanistic properties of regulation such as activating edges.

PHOENIX incorporates two levels of back-propagation to parameterize the neural network while inducing domain knowledge-specific properties; the first aims to match the observed data, while the second uses simulated (ghost) expression vectors.






□ GFAse: Phased nanopore assembly with Shasta and modular graph phasing with GFAse

>> https://www.biorxiv.org/content/10.1101/2023.02.21.529152v1

GFAse relies on conventional mappings for phasing information. HiC, PoreC, or other proximity-ligated reads are mapped to the GFA contigs using whichever mapper is most appropriate for the sequence type.

GFAse employs transparent and reusable data structures, and similar to Shasta, produces comprehensive outputs that describe the homology, proximity linkage, and inferred haplotype chains. GFAse is capable of using any data type for phasing which can be aligned to the assembly.

GFAse loads the GFA using the VG HandleGraph and identifies tractable regions as anything which follows a strict diploid bubble chain topology. Chains are identified by traversing contiguous subgraphs of labeled nodes. Haplotypes are labeled with paths in the GFA formalism.





□ BioTranslator: Multilingual translation for zero-shot biomedical classification

>> https://www.nature.com/articles/s41467-023-36476-2

BioTranslator learns a cross-modal translation to bridge text data and non-text biological data. BioTranslator is a multilingual translation framework, where different modalities of biomedical data are all mapped to a shared latent space.

BioTranslator fine-tunes large-scale pretrained language models on existing biomedical ontologies with a contrastive learning loss, which enables it to perform zero-shot classification.





□ AGC: Compact representation of assembled genomes with fast queries and updates

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad097/7067744

AGC (Assembled Genomes Compressor), a highly efficient compression method for the collection of assembled genome sequences of the same species. The compressed collection can be easily extended by new samples.

AGC offers fast access to the requested contigs or samples without the need to decompress other sequences. AGC decompresses the reference segments and, partially, also the necessary blocks.





□ seqArchR: Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation

>> https://www.biorxiv.org/content/10.1101/2023.03.02.530868v1

seqArchR, a chunking-based iterative algorithm using NMF for de novo identification of architectural elements. The input to seqArchR is a (0, 1)-matrix which is a one-hot encoded representation of dinucleotide profiles of a gapless alignment of DNA sequences.

seqArchR processes the whole collection of input sequences one chunk (subset of sequences) at a time. The (0, 1)-matrix for each chunk of sequences is processed with NMF. NMF decomposes the matrix into two low-rank matrices - the basis matrix and the coefficients matrix.

seqArchR finds the appropriate number of basis vectors suitable to represent the set of sequences in a lower-dimensional space. Columns of the basis matrix represent the different potential architectures, and along its rows are the loadings for the features per architecture.
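The encoding and factorisation steps can be sketched as follows. The assumptions: dinucleotides are taken at adjacent positions, the matrix is laid out as features × sequences, and NMF is run with plain Lee-Seung multiplicative updates rather than whatever solver seqArchR uses. The chunking and iteration logic is not reproduced.

```python
import numpy as np
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def one_hot_dinucleotides(seqs):
    """(0, 1)-matrix with 16 rows per position pair and one column per sequence."""
    length = len(seqs[0])
    x = np.zeros((16 * (length - 1), len(seqs)))
    for col, seq in enumerate(seqs):
        for pos in range(length - 1):
            x[16 * pos + DINUCLEOTIDES.index(seq[pos:pos + 2]), col] = 1.0
    return x

def nmf(x, rank, iters=200, seed=0, eps=1e-9):
    """Factorise x ~ w @ h with non-negative w (basis) and h (coefficients)."""
    rng = np.random.default_rng(seed)
    w = rng.random((x.shape[0], rank)) + eps
    h = rng.random((rank, x.shape[1])) + eps
    for _ in range(iters):
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    return w, h

seqs = ["ACGT", "ACGT", "TTTT", "TTTT"]
x = one_hot_dinucleotides(seqs)
w, h = nmf(x, rank=2)
```

Each column of `w` plays the role of one candidate architecture, and its rows hold the per-feature loadings, in line with the description above.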





□ Axioms for the category of sets and relations

>> https://arxiv.org/pdf/2302.14153.pdf

Axioms for the dagger category of sets and relations that recall recent axioms for the dagger category of Hilbert spaces and bounded operators.

No infinite-dimensional Hilbert space has a dagger dual. Let (C, ⊗, I, †) be a dagger symmetric monoidal category. Every morphism has a kernel that is dagger monic, and k and k⊥ are jointly epic for every dagger kernel k. Morphisms I → X form a complete Boolean algebra.






□ ANIE: Neural Integral Equations

>> https://arxiv.org/abs/2209.15190

Neural Integral Equations (NIE), a method that learns an unknown integral operator from data through an IE solver. Attentional Neural Integral Equations (ANIE), where the integral is replaced by self-attention, which improves scalability and model capacity.

ANIE permits modeling the system purely from the observations. Via the learned integral operator, the model can be used to generate dynamics and to infer spatiotemporal relations. ANIE allows dynamics to be learned continuously with arbitrary time resolution.





□ Categorical magnitude and entropy

>> https://arxiv.org/abs/2303.00879

The paper connects magnitude and entropy by extending Shannon entropy to finite categories endowed with probability, in such a way that the magnitude is recovered when a certain choice of "uniform" probability is made.

The entropy becomes the logarithm of the cardinality of the set when the uniform probability is used. Leinster introduced a notion of Euler characteristic for certain finite categories, also known as magnitude, that can be seen as a categorical generalization of cardinality.
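
The set-level statement is easy to check numerically. The sketch below covers only the classical case (a finite set with the uniform distribution), not the categorical extension; the function name `shannon_entropy` is chosen for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


n = 8
uniform = [1 / n] * n
# For the uniform distribution the entropy equals log(cardinality).
print(shannon_entropy(uniform), math.log(n))
```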





□ AAMB: Adversarial and variational autoencoders improve metagenomic binning

>> https://www.biorxiv.org/content/10.1101/2023.02.27.527078v1

VAMB uses a VAE to integrate input contig abundances and tetranucleotide frequencies (TNF) into a common latent representation. The latent space is regularised using the Kullback-Leibler divergence with respect to a prior distribution, here the unit Gaussian.
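
For a diagonal Gaussian encoder, this regulariser has a well-known closed form. The sketch below evaluates it in plain Python; parameterising the encoder by a mean and a log-variance per latent dimension is an assumption for illustration, not a detail taken from VAMB's code.

```python
import math


def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


# The penalty vanishes exactly when the encoder output matches the prior.
print(kl_to_unit_gaussian([0.0, 0.0], [0.0, 0.0]))
print(kl_to_unit_gaussian([1.0, -0.5], [0.2, -0.3]))
```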

AAMB encodes a continuous and a categorical latent space, and reconstructs the input from these two. AAMB leverages adversarial autoencoders (AAEs) to yield more accurate bins than VAMB. AAMB integrates sequence co-abundances and tetranucleotide frequencies into a common denoised space.





□ scTEP: A robust and accurate single-cell data trajectory inference method using ensemble pseudotime

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05179-2

scTEP (the single-cell data Trajectory inference method using Ensemble Pseudotime inference) utilizes multiple clustering results to infer robust pseudotime. scTEP utilizes pathway information and generates latent representations for all pathways.

scTEP uses a non-negative kernel autoencoder and a VAE. scTEP then applies a minimum spanning tree (MST) algorithm followed by fine-tuned trajectory inference, which takes the pseudotime inferred in the previous step and fine-tunes the constructed graph by sorting the vertices according to their average pseudotime.
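
A minimal sketch of that last step, under stated assumptions: clusters are represented by centroids, the MST is built with Prim's algorithm on Euclidean distances, and "fine-tuning" is reduced to orienting each edge from lower to higher average pseudotime. All function names are hypothetical and this is not scTEP's implementation.

```python
import math


def mean_pseudotime(labels, pseudotime):
    """Average pseudotime of the cells in each cluster."""
    sums, counts = {}, {}
    for lab, t in zip(labels, pseudotime):
        sums[lab] = sums.get(lab, 0.0) + t
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


def minimum_spanning_tree(centroids):
    """Prim's algorithm on Euclidean distances between cluster centroids."""
    nodes = list(centroids)
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Cheapest edge leaving the current tree.
        u, v = min(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda e: math.dist(centroids[e[0]], centroids[e[1]]),
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges


def orient_by_pseudotime(edges, avg_time):
    """Direct every edge from the earlier cluster to the later one."""
    return [(u, v) if avg_time[u] <= avg_time[v] else (v, u) for u, v in edges]


centroids = {"C": (2.5, 0.0), "A": (0.0, 0.0), "B": (1.0, 0.0)}
avg_time = mean_pseudotime(["B", "A", "C", "A"], [0.5, 0.1, 0.9, 0.3])
print(orient_by_pseudotime(minimum_spanning_tree(centroids), avg_time))
```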





□ GraphST: Spatially informed clustering, integration, and deconvolution of spatial transcriptomics

>> https://www.nature.com/articles/s41467-023-36796-3

GraphST can transfer scRNA-seq-derived sample phenotypes onto ST. GraphST combines graph neural networks with augmentation-based self-supervised contrastive learning to learn representations of spots for spatial clustering by encoding both gene expression and spatial proximity.

GraphST learns a mapping matrix to project the scRNA-seq data into the ST space based on learned features via an augmentation-free contrastive learning where the similarities of spatially neighboring spots are maximized, and those of spatially non-neighboring spots are minimized.





□ scPrisma infers, filters and enhances topological signals in single-cell data using spectral template matching

>> https://www.nature.com/articles/s41587-023-01663-5

scPrisma, a general spectral framework for the reconstruction, enhancement and filtering of signals in single-cell data based on their topology and inference of topologically informative genes.

scPrisma is versatile and enables topological signal manipulation without low-dimensional embedding. scPrisma can be used to manipulate diverse template types, enhance the separation between clusters, identify multiple cyclic processes and enhance spatial signals.





□ scSTAR reveals hidden heterogeneity with a real-virtual cell pair structure across conditions in single-cell RNA sequencing data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbad062/7050908

scSTAR (single-cell State Transition Across-samples of Rna-seq data), a paired-cell model where for each real cell in one sample/condition, scSTAR estimates its virtual projection in the other.

scSTAR estimates individual cell state transitions by generating real-virtual cell pairs across samples/conditions. The cell state dynamics can be obtained by maximising the covariance between cell states from different samples, which is the partial least squares solution.





□ scGCL: an imputation method for scRNA-seq data based on Graph Contrastive Learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad098/7056638

scGCL integrates graph contrastive learning and the Zero-inflated Negative Binomial (ZINB) distribution to estimate dropout values. scGCL introduces an autoencoder based on the ZINB distribution, which reconstructs the scRNA-seq data based on the prior distribution.

scGCL summarizes global and local semantic information through contrastive learning and selects positive samples to enhance the representation of target nodes.





□ maxATAC: Genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010863

maxATAC deep neural network models use DNA sequence and ATAC-seq signal to predict TFBS in new cell types. The maxATAC architecture is based on a “peak-centric, pan-cell” training approach.

maxATAC inputs are a 1,024 bp one-hot encoded DNA sequence with the ATAC-seq signal for the corresponding region, while the output is an array of 32 TFBS predictions at 32 bp resolution, spanning the 1,024 bp input interval. Inputs pass through a total of 5 convolutional blocks.
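
The output geometry follows from simple arithmetic: 32 predictions × 32 bp = 1,024 bp. The sketch below maps each output index back to a genomic interval; the function name `output_bins` and the half-open coordinate convention are assumptions for illustration, not part of maxATAC.

```python
def output_bins(start, input_len=1024, n_bins=32):
    """Half-open genomic intervals covered by each of the n_bins predictions."""
    if input_len % n_bins != 0:
        raise ValueError("input length must be divisible by the number of bins")
    width = input_len // n_bins  # 1024 // 32 = 32 bp per prediction
    return [(start + i * width, start + (i + 1) * width) for i in range(n_bins)]


bins = output_bins(10_000)
print(bins[0], bins[-1])  # first and last 32 bp bin of the input window
```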





□ scMDC: Clustering of single-cell multi-omics data with a multimodal deep learning method

>> https://www.nature.com/articles/s41467-022-35031-9

scMDC is an end-to-end deep model that explicitly characterizes different data sources and jointly learns latent features of deep embedding for clustering analysis. scMDC can correct batch effects when analyzing multi-batch data.

scMDC employs a multimodal autoencoder, which applies one encoder to the concatenated data from different modalities and two decoders to separately decode the data from each modality. The whole model, including the KL loss and the deep K-means clustering, is optimized simultaneously.






□ GENECI: A novel evolutionary machine learning consensus-based approach for the inference of gene regulatory networks

>> https://www.sciencedirect.com/science/article/pii/S001048252300118X

GENECI, an evolutionary algorithm that acts as an organizer for constructing ensembles to process the results of the main inference techniques and to optimize the consensus network derived from them, according to their confidence levels and topological characteristics.

GENECI takes up the idea of weight assignment. The weight vectors are iteratively subjected to evaluation (depending on the quality and topology of the consensus networks), selection, crossover, mutation and finally an additional repair step to keep the sum of values at unity.
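
A minimal sketch of the weight idea, assuming each inference technique yields a dictionary of edge confidences and that the consensus is their weighted sum. The function names, the clipping of negative weights and the uniform fallback are illustrative assumptions, not GENECI's actual code.

```python
def repair(weights):
    """Rescale weights so that they are non-negative and sum to one."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0:
        # Degenerate vector: fall back to equal weights (an assumption).
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in clipped]


def consensus(networks, weights):
    """Weighted sum of edge confidences over all techniques."""
    edges = set().union(*networks)
    return {
        e: sum(w * net.get(e, 0.0) for w, net in zip(weights, networks))
        for e in edges
    }


nets = [
    {("g1", "g2"): 1.0},
    {("g1", "g2"): 0.5, ("g2", "g3"): 1.0},
]
w = repair([2, 2])  # e.g. after crossover or mutation broke the unit sum
print(consensus(nets, w))
```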





□ LuxHMM: DNA methylation analysis with genome segmentation via hidden Markov model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05174-7

LuxHMM uses a hidden Markov model (HMM) to segment the genome into regions and a Bayesian regression model, which allows handling of multiple covariates, to infer differential methylation of regions. In LuxHMM, the HMM segmentation provides the candidate hypo- and hypermethylated regions.

Hamiltonian Monte Carlo (HMC) was used to sample from the posterior distribution with four chains, 1000 iterations for warmup for each chain and a total of 1000 iterations.





□ FitMultiCell: Simulating and parameterizing computational models of multi-scale and multi-cellular processes

>> https://www.biorxiv.org/content/10.1101/2023.02.21.528946v1

FitMultiCell, a computationally efficient and user-friendly open-source pipeline that can handle the full workflow of modeling, simulating, and parameterizing for multi-scale models of multi-cellular processes.

FitMultiCell integrates Morpheus and pyABC for parameter estimation. pyABC provides two parallelization strategies. FitMultiCell yields a wall-time reduction of several ten-fold compared to a single-node execution and several hundred-fold compared to single-core execution.





□ SnapCCESS: Ensemble deep learning of embeddings for clustering multimodal single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2023.02.22.529627v1

SnapCCESS, an ensemble clustering framework that uses VAE and the snapshot ensemble learning to learn multiple embeddings each encoding multiple data modalities, and subsequently generate consensus clusters for multimodal omics data by combining clusters from each embedding.

SnapCCESS is based on the snapshot ensemble deep-learning model, which uses learning rate annealing cycles: the model converges to and then escapes from multiple local minima, and snapshots are taken at these minima to create a multi-view of embeddings.
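
A learning rate schedule of this kind can be sketched as follows. The cosine shape, the function names and the choice to snapshot at the last step of each cycle are assumptions for illustration; the summary above only states that annealing cycles are used.

```python
import math


def cyclic_cosine_lr(step, total_steps, n_cycles, lr_max):
    """Learning rate at a 0-based step; it restarts at lr_max in every cycle."""
    cycle_len = math.ceil(total_steps / n_cycles)
    pos = step % cycle_len  # position inside the current cycle
    return 0.5 * lr_max * (math.cos(math.pi * pos / cycle_len) + 1.0)


def snapshot_steps(total_steps, n_cycles):
    """Last step of each cycle, where the learning rate is at its lowest."""
    cycle_len = math.ceil(total_steps / n_cycles)
    return [min((c + 1) * cycle_len, total_steps) - 1 for c in range(n_cycles)]


for s in snapshot_steps(100, 4):
    print(s, cyclic_cosine_lr(s, 100, 4, 0.1))
```

Saving the model at each of these steps gives one embedding per cycle, which is the raw material for an ensemble.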

SnapCCESS consists of multimodality-specific encoders and decoders for data integration and dimension reduction. The encoders in the VAE component include one learnable point-wise parameters layer and one fully connected layer to the input layer.





□ Longcell: Single cell and spatial alternative splicing analysis with long read sequencing

>> https://www.biorxiv.org/content/10.1101/2023.02.23.529769v1

Longcell, a statistical framework for accurate isoform quantification for single cell and spatial spot barcoded long read sequencing data. Longcell performs computationally efficient cell/spot barcode extraction, UMI recovery / truncation- and mapping-error correction.

Longcell rigorously quantifies the level of inter-cell/spot vs. intra-cell/spot diversity in exon usage and detects changes. Longcell improves expression quantification, and its scattering-reduction algorithm yields a significant improvement in quantification accuracy.





□ Cellograph: A Semi-supervised Approach to Analyzing Multi-condition Single-cell RNA-sequencing Data Using Graph Neural Networks

>> https://www.biorxiv.org/content/10.1101/2023.02.24.528672v1

Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable visualization and clustering. The gene weight matrix learned during training reveals pertinent genes driving the differences between conditions.

Cellograph uses a two-layer GCN to learn a latent representation according to how representative each cell is of its ground truth sample label. This latent space can be clustered to derive groups of cells associated with similar treatment response and transcriptomics.





□ Automatic Detection of Cell-cycle Stages using Recurrent Neural Networks

>> https://www.biorxiv.org/content/10.1101/2023.02.28.530432v1

The aim is to find the phase of mitosis of a cell in each time frame, i.e. a temporal segmentation of a video sequence of cell data. Class labels are assigned to each frame of the video sequence to classify the mitotic phases.

The feature space has time continuity in the high-dimensional space. This approach uses transfer learning on a ResNet18, which has eighteen deep layers with eight residual block connections. The time-encoded ResNet18 model has the highest frame-to-frame accuracy.





□ scGAD: a new task and end-to-end framework for generalized cell type annotation and discovery

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad045/7068949

scGAD builds the intrinsic correspondences on seen and novel cell types by retrieving geometrically and semantically mutual nearest neighbors as anchor pairs.

A soft anchor-based self-supervised learning module is then designed to transfer the known label information from reference data to target data and aggregate the new semantic knowledge within target data in the prediction space.

scGAD uses a confidential prototype self-supervised learning paradigm to implicitly capture the global topological structure of cells in the embedding space. A bidirectional dual alignment mechanism between the embedding space and the prediction space can handle batch effects and cell type shift.





□ biolord: Biological representation disentanglement of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.05.531195v1

Biolord exposes the distinct effects of different biological processes or tissue structure on cellular gene expression. Based on that, biolord allows generating experimentally-inaccessible cell states by virtually shifting cells across time, space, and biological state.

The disentangled representation is obtained by inducing information constraints; the loss attempts to maximize the accuracy of the reconstruction (enforcing completeness) while minimizing the information encoded in the unknown attributes.

biolord finds a decomposed latent space, encompassing informative embeddings for each known attribute and an embedding for the remaining unknown attributes. The generative module can use the decomposed latent space to predict single-cell measurements for different cell states.





□ The motif composition of variable-number tandem repeats impacts gene expression

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484784v2

Extending the application danbing-tk to examine the association between each path in the graph, or VNTR “motif”, and gene expression using the complete read-mapping output i.e. the coverages of all k-mers.

Estimating the dosages of VNTR motifs using a locus-RPGG. A locus-RPGG is built from haplotype-resolved assemblies by first annotating the orthology mapping of VNTR boundaries and then encoding the VNTR alleles with a de Bruijn graph (dBG), or locus-RPGG.

A compact dBG is constructed by merging nodes on a non-branching path into a unitig, denoted as a motif in this context. Motif dosages of a VNTR can be computed by aligning short reads to an RPGG and averaging the coverage of nodes corresponding to the same motif.
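
The averaging step can be sketched in a few lines. The input structures here (a per-node coverage dictionary and a node-to-motif mapping) and the name `motif_dosages` are hypothetical stand-ins for the read-mapping output, not danbing-tk's actual data format.

```python
def motif_dosages(node_coverage, node_to_motif):
    """Average the coverage of all nodes (k-mers) that belong to the same motif."""
    totals, counts = {}, {}
    for node, cov in node_coverage.items():
        motif = node_to_motif[node]
        totals[motif] = totals.get(motif, 0.0) + cov
        counts[motif] = counts.get(motif, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}


coverage = {"ACG": 10, "CGT": 14, "GTA": 30}          # toy k-mer coverages
mapping = {"ACG": "m1", "CGT": "m1", "GTA": "m2"}     # toy unitig membership
print(motif_dosages(coverage, mapping))
```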





□ VIsoQLR: an interactive tool for the detection, quantification and fine-tuning of isoforms in selected genes using long-read sequencing

>> https://link.springer.com/article/10.1007/s00439-023-02539-z

VIsoQLR is designed to characterize aberrant mRNAs detected by functional assays targeting a single locus linked to specific phenotypes. VIsoQLR demonstrates an accurate isoform automatic detection using LRS data.

VIsoQLR has built-in options for mapping reads using GMAP or minimap2 aligners. Next, mapped reads are uploaded, and consensus exon coordinates (CECs) are defined based on the frequency of the reads' exon coordinates.
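
One way to picture a frequency-based definition of CECs is the sketch below: exon start and end coordinates are counted across reads and kept when they occur in a large enough fraction of reads. The function name, the data layout and the `min_fraction` threshold are all assumptions for illustration; VIsoQLR's actual rule may differ.

```python
from collections import Counter


def consensus_coordinates(read_exons, min_fraction=0.05):
    """Keep exon starts and ends seen in at least min_fraction of the reads.

    read_exons: one list of (start, end) exon tuples per mapped read.
    """
    n_reads = len(read_exons)
    starts = Counter(s for exons in read_exons for s, _ in exons)
    ends = Counter(e for exons in read_exons for _, e in exons)

    def keep(counts):
        return sorted(c for c, n in counts.items() if n / n_reads >= min_fraction)

    return keep(starts), keep(ends)


reads = [
    [(100, 200), (300, 400)],
    [(100, 200), (300, 400)],
    [(100, 200), (305, 400)],  # rare alternative start
    [(101, 200), (300, 400)],  # likely mapping noise
]
print(consensus_coordinates(reads, min_fraction=0.5))
```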






□ Matrix and analysis metadata standards (MAMS) to facilitate harmonization and reproducibility of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531314v1

Feature and observation matrices (FOMs) contain biological data at different stages of processing, including reduced dimensional representations. The Observation Neighborhood Graph (ONG) classes store information related to the correlation, similarity, or distance between pairs of observations.

Matrix and Analysis Metadata Standards (MAMS) defines fields that describe what type of data is contained within a matrix, relationships between matrices, and provenance related to the tool or algorithm that created the matrix.





□ Modelling capture efficiency of single cell RNA-sequencing data improves inference of transcriptome-wide burst kinetics

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531327v1

This model captures burst kinetics, and appropriately accounts for the extrinsic variability introduced by cell-to-cell variations in scRNA-seq capture efficiency and cell size. The telegraph model satisfies the so-called stochastic concentration homeostasis condition.





□ ELITE: Expression deconvoLution using lInear optimizaTion in bulk transcriptomics mixturEs

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531002v1

ELITE, a new digital cytometry method that utilizes linear programming to solve the deconvolution problem. ELITE uses as inputs a mixture matrix representing bulk measurements, and a signature matrix representing molecular fingerprints of the cell types to be identified.

ELITE calculates the pseudobulk mixture matrix by multiplying 100 vectors representative of the obtained fractions by the columns of the signature matrix. The signature matrix can be obtained from relevant single-cell data, purified cell populations, or predefined signature matrices.
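
The pseudobulk step amounts to a linear combination of the signature columns. The sketch below shows it for a single fraction vector in plain Python; the function name, the genes-by-cell-types layout and the check that fractions sum to one are assumptions for illustration, and the linear-programming solve itself is not shown.

```python
def pseudobulk(signature, fractions):
    """Mix signature columns (cell types) according to the given fractions.

    signature: one row per gene, one value per cell type.
    fractions: one value per cell type, assumed to sum to one.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("cell type fractions are assumed to sum to one")
    return [sum(s * f for s, f in zip(row, fractions)) for row in signature]


signature = [
    [10.0, 0.0],   # gene expressed only in cell type 1
    [0.0, 20.0],   # gene expressed only in cell type 2
    [4.0, 4.0],    # gene expressed equally in both
]
print(pseudobulk(signature, [0.25, 0.75]))
```

Repeating this for each fraction vector gives one column of the pseudobulk mixture matrix per vector.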




Everything Everywhere All At Once.

2023-03-03 03:02:03 | Film


□ 『Everything Everywhere All At Once』

>> https://a24films.com/films/everything-everywhere-all-at-once

A24
Directors: Daniel Kwan / Daniel Scheinert
Music by Son Lux
Cinematography by Larkin Seiple


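『Everything Everywhere All At Once』 Rather than letting infinite possibility bind your life, celebrate the one impossible life that was excluded from infinite possibility. Many works have depicted the synchronicity of love and the laws of the universe, but this is a hymn to life that weaves in chaos, breakdown and everything else, and rejects despair with the force of a black hole. It was a visual revolution that demands an update of cinema, or rather of human imagination itself.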

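The infinite possibilities of merely "what I myself could become" fall far short of the possibilities that other people give us. And whatever happens here can surely be made to happen in the other possible worlds too.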


The rock scene, "I'm gonna catch you!": it looks like nothing but a surreal gag, and yet I could not stop the tears.

□ Son Lux / “Very busy”

□ Son Lux / “Opera Fight”


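『Everything Everywhere All At Once』 The score by Son Lux is strongly influenced by the avant-garde and experimental music that has continued since the mid-20th century. The high-speed montage recalls the work of John Oswald, and the prog-style arrangement of elements of Chinese, African folk and Western church music recalls the work of Terry Riley. https://youtu.be/KV01m8qzP9k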


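The vague difficulty of living that one feels from childhood becomes something you can view objectively, in quantitative and analytical terms such as "chaos theory" or evolutionary psychology, as you acquire scholarship and science. What #エブエブ achieved was to compress that into the context of an entertainment film and re-import it; there is even a line like "everything is an event traced by the trajectories of quanta."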



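The heart, like a stone, cannot move from inside its own cage. Even so, to love someone and to try to stay gently by their side is nothing other than an erosion of impossibility, like moving a stone.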

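When I was 18 or 19, during a period when I shut myself away in the library, I read a book with a passage along the lines of "Thai ascetic monks keep gazing at pebbles and cracks in the asphalt until they become the road itself." It left a mysterious impression on me, and as a result I learned at a young age how to "become a stone". You probably have no idea what I am saying, but I hope it conveys how moved I was by the rock scene in #エブエブ.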


Also, the Czech filmmaker Jan Švankmajer has a work called 『A Game with Stones』 (“Spiel mit Steinen” (1965) / Jan Švankmajer), you see… (rabbit-hole talk follows) #エブエブ



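The antagonist, Jobu Tupaki, is the worst kind of villain: having "experienced every possibility" across all of the multiverse, she alters reality at will (a random rearrangement of elementary particles). What I find funny is that the opposing faction's countermeasure for when they end up facing her is "ignore her" and "do not engage."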


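Many people say 『Everything Everywhere All At Once』 feels like 『Cloud Atlas』 (2012), and all I can say is "exactly!" It is a style of direction in which the dramas of separate characters, timelines and places converge simultaneously in one direction. 『Magnolia』 (1999) is similar, and if you trace it back to the limit you arrive at 『Intolerance』 (1916). #エブエブ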

It expresses, in a super comical way, what Terrence Malick wanted to do in “The Tree of Life” and “Voyage of Time”, and it actually works far better. #エブエブ



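『Everything Everywhere All At Once』 Fourth viewing. I finally got the super cute program booklet too 🪨👀 There are plenty of other films I want to see, and yet I have literally tripped over a "stone".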



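『Everything Everywhere All at Once』 won Best Picture at the 95th Academy Awards! It was the bravest attempt ever etched into film history. To celebrate, here is “Come Recover (Empathy Fight)” from the OST by Son Lux: a track that opens with Debussy and builds to a psychedelic, Goa-trance-style climax. The greatest weapon is not force but "empathy".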





Substance.

2023-03-03 03:01:03 | Photography


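I moved from the strategy division to development because, as I have said before, the analysis methods for the data on which our measures rest were 30 years behind, and because changing this policy caused friction with the current analysts; both motivated me to rework things starting from the in-house information-processing infrastructure. At present, the SEs and I share the same awareness of the problems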