lens, align.

Long is the time, but the true comes to pass.

We were Once Kings.

2023-03-31 03:33:33 | Science News

(Photo by Joanne Hollings)




□ TXGNN: Zero-shot prediction of therapeutic use with geometric deep learning and clinician centered design

>> https://www.medrxiv.org/content/10.1101/2023.03.19.23287458v1

TXGNN is a graph neural network pre-trained on a comprehensive knowledge graph of 17,080 clinically-recognized diseases and 7,957 therapeutic candidates. The model can process various therapeutic tasks, such as indication and contraindication prediction, in a unified formulation.

TXGNN can perform zero-shot inference on new diseases without additional parameters or fine-tuning on ground truth labels. TXGNN uses a metric learning module that operates on the latent representation space.

TXGNN transforms points in the latent space representing the candidate and the disease into predictions about their relationship. TXGNN obtains a disease signature vector for each disease from the set of neighboring proteins, exposures, and other biomedical entities.
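A minimal numpy sketch of the signature idea described above, not TXGNN's actual implementation: the disease signature is the average embedding of the disease's knowledge-graph neighbors, and a candidate is scored by similarity to it in the shared latent space. All names and the random embeddings are illustrative.

```python
import numpy as np

def disease_signature(disease_id, neighbors, embeddings):
    """Average the latent vectors of the entities adjacent to a disease."""
    neigh_vecs = np.stack([embeddings[n] for n in neighbors[disease_id]])
    return neigh_vecs.mean(axis=0)

def indication_score(drug_id, disease_id, neighbors, embeddings):
    """Cosine similarity between the drug embedding and the disease signature."""
    sig = disease_signature(disease_id, neighbors, embeddings)
    drug = embeddings[drug_id]
    return float(drug @ sig / (np.linalg.norm(drug) * np.linalg.norm(sig) + 1e-12))

# toy usage with random embeddings
rng = np.random.default_rng(0)
embeddings = {e: rng.normal(size=32) for e in ["drugA", "geneX", "geneY", "exposureZ"]}
neighbors = {"rare_disease": ["geneX", "geneY", "exposureZ"]}
print(indication_score("drugA", "rare_disease", neighbors, embeddings))
```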





□ NeuLay: Accelerating network layouts using graph neural networks

>> https://www.nature.com/articles/s41467-023-37189-2

The NeuLay algorithm, a Graph Neural Network (GNN) developed to parameterize node features, significantly improves both the speed and the quality of graph layouts, opening up the possibility to quickly and reliably visualize large networks.

NeuLay allows for the use of GNN architectures other than GCN, such as Graph Attention. NeuLay encodes the graph structure with graph neural networks that map the adjacency matrix to node positions. NeuLay-2, w/ two GCN layers, has the fastest convergence of the energy.





□ Con-AAE: Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad162/7091469

Con-AAE (Contrastive Cycle Adversarial Autoencoders) aims at integrating and aligning multi-omics data at the single-cell level. The contrastive loss minimizes the distance between positive pairs and maximizes the distance between negative pairs.

Con-AAE uses two autoencoders to map the two modalities into two low-dimensional manifolds under the constraint of an adversarial loss, developing representations for each modality that are learned separately but cannot be told apart by the adversarial network in a coordinated subspace.
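A small numpy sketch of the contrastive term described above (a margin-based pair loss over the embeddings produced by the two modality encoders). It assumes the i-th row of each matrix corresponds to the same cell; this is an illustration, not the authors' code.

```python
import numpy as np

def contrastive_loss(z_rna, z_atac, margin=1.0):
    n = z_rna.shape[0]
    # Euclidean distances between every RNA/ATAC embedding pair
    d = np.linalg.norm(z_rna[:, None, :] - z_atac[None, :, :], axis=-1)
    pos = np.eye(n, dtype=bool)                # same cell in both modalities
    loss_pos = (d[pos] ** 2).mean()            # pull positive pairs together
    loss_neg = (np.clip(margin - d[~pos], 0, None) ** 2).mean()  # push negatives apart
    return loss_pos + loss_neg

rng = np.random.default_rng(1)
z_rna, z_atac = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
print(contrastive_loss(z_rna, z_atac))
```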





□ Phenonaut: multiomics data integration for phenotypic space exploration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad143/7082955

Phenonaut is a framework for applying workflows to multi-omics data. It originally targeted high-content imaging and the exploration of phenotypic space with different visualisations and metrics.

Phenonaut runs are accompanied by cryptographic hashes proving the reported inputs. Phenonaut now operates in a data-agnostic manner, allowing users to describe their data (multi-view/multi-omics) and apply a series of generic or specialised data-centric transforms.





□ Accurate Flow Decomposition via Robust Integer Linear Programming

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533019v1

A new ILP formulation of the flow decomposition problem for dealing with edge weights that do not form a flow. It enables macroscopic management of errors by attaching an error to each solution path instead of to each edge.

This formulation defines the minimum path-error flow decomposition problem as the problem of finding a set of weighted paths with associated error variables, such that the difference between each edge weight and the superposition of the paths is within the sum of the error variables of the paths using that edge.





□ multiWGCNA: an R package for deep mining gene co-expression networks in multi-trait expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05233-z

multiWGCNA, a WGCNA-based procedure that can leverage the multidimensionality of experimental designs to study co-expression networks across variable conditions, such as space or time.

multiWGCNA generates a network for each condition separately, and subsequently maps these modules across designs, and performs relevant downstream analyses, incl. module-trait correlation and module preservation.





□ GVC: efficient random access compression for gene sequence variations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05240-0

The Genomic Variant Codec (GVC), a novel approach for compressing gene sequence variations with random access capability. The genotypes are extracted from a VCF file and divided into blocks. Each block represents genotypes of all samples in a certain range of loci in a chromosome.

GVC uses two alternative binarization approaches to decompose the allele matrix into a binary representation: bit plane binarization and row binarization. GVC uses the Hamming distance to measure the similarity b/n adjacent rows/columns. Each binary matrix is entropy-encoded.
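A numpy sketch of the bit-plane binarization and adjacent-row Hamming distance mentioned above, on a toy allele matrix (rows = loci, columns = sample haplotypes). The matrix values and the downstream use of the distances are illustrative; GVC's actual block layout and entropy coder are not shown.

```python
import numpy as np

def bit_planes(allele_matrix):
    """Decompose a small-integer allele matrix into binary bit planes."""
    n_bits = max(int(allele_matrix.max()).bit_length(), 1)
    return [(allele_matrix >> b) & 1 for b in range(n_bits)]

def adjacent_row_hamming(binary_matrix):
    """Hamming distance between each pair of adjacent rows, a similarity
    measure that can guide row/column reordering before entropy coding."""
    return (binary_matrix[1:] != binary_matrix[:-1]).sum(axis=1)

alleles = np.array([[0, 1, 2, 1],
                    [0, 1, 2, 1],
                    [1, 0, 0, 3]])
planes = bit_planes(alleles)
print(len(planes), adjacent_row_hamming(planes[0]))
```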





□ SoCube: an innovative end-to-end doublet detection algorithm for analyzing scRNA-seq data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad104/7081128

Several doublet detection algorithms are currently available, but their generalization performance could be further improved due to the lack of effective feature-embedding strategies with suitable model architectures.

SoCube proposed a novel 3D composite feature-embedding strategy that embedded latent gene information and constructed a multikernel, multichannel CNN-ensembled architecture in conjunction with the feature-embedding strategy.





□ OASIS: An interpretable, finite sample valid alternative to Pearson's X2 for scientific discovery

>> https://www.biorxiv.org/content/10.1101/2023.03.16.533008v1

OASIS (Optimized Adaptive Statistic for Inferring Structure) constructs a test-statistic which is linear in the normalized data matrix, providing closed form p-value bounds through classical concentration inequalities.

OASIS computes a bilinear form of residuals. OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. The finite-sample bounds correctly characterize the p-value bound derived up to a variance term.





□ AIM: A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad155/7087101

Alignment-in-Memory (AIM), a framework for PIM-based sequence alignment that targets the UPMEM system. AIM dispatches a large number of sequence pairs across different memory modules and aligns each pair using compute cores within the memory module where the pair resides.

AIM supports multiple alignment algorithms including NW, SWG, GenASM, WFA, and WFA-adaptive. Each algorithm has alternate implementations that manage the UPMEM memory hierarchy differently and are suitable for different read lengths.





□ scQA: Clustering scRNA-seq data via qualitative and quantitative analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.25.534232v1

scQA (an architecture for clustering single-cell RNA-seq data based on qualitative and quantitative analysis), which can efficiently cluster cells at various scales based on so-called landmarks, each of which indicates the consensus of genes with similar expression patterns.

scQA constructs the consensus vector of genes whose qualitative expressions under certain cells follow a similar trend: quasi-trend-preserved genes. After scQA identifies distinct cell types, it proceeds to analyze the characteristics of the identified landmarks both internally and externally.





□ SpaceWalker: Interactive Gradient Exploration for Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2023.03.20.532934v1

The intrinsic dimensionality can serve to guide the user to anatomically distinct regions: changes in local intrinsic dimensionality in many cases mirror transitions between cell subclasses.

SpaceWalker consists of two key innovations: an interactive, real-time flood-fill and spatial projection of the local topology of the high-dimensional space, and a gradient gene detector.
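A minimal sketch of the flood-fill idea (not the SpaceWalker implementation): starting from a seed cell, expand across a kNN graph of the high-dimensional profiles as long as the visited profiles stay within a distance threshold of the seed. The value of k, the threshold, and the random data are assumptions for illustration.

```python
from collections import deque
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flood_fill(X, seed, k=15, max_dist=2.0):
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, knn = nn.kneighbors(X)                 # kNN indices per cell
    selected, queue = {seed}, deque([seed])
    while queue:
        i = queue.popleft()
        for j in knn[i]:
            if j not in selected and np.linalg.norm(X[j] - X[seed]) <= max_dist:
                selected.add(j)
                queue.append(j)
    return selected

X = np.random.default_rng(2).normal(size=(500, 50))
print(len(flood_fill(X, seed=0)))
```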





□ exFINDER: identify external communication signals using single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.03.24.533888v1

exFINDER analyzes the exSigNet by predicting signaling strength, calculating the maximal signal flow, clustering different ligand-target signaling paths, quantifying the signaling activities using the activation index, and evaluating the GO analysis outputs of exSigNet.
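The maximal-signal-flow step can be pictured as a standard max-flow computation on a small ligand-to-target path network; the sketch below uses networkx with edge capacities standing in for predicted signaling strengths. The node names and capacities are made up; this is not exFINDER's own routine.

```python
import networkx as nx

# toy exSigNet-like path network: ligand -> receptor -> TF -> target
G = nx.DiGraph()
G.add_edge("LigandA", "ReceptorB", capacity=3.0)
G.add_edge("ReceptorB", "TF_C", capacity=2.0)
G.add_edge("TF_C", "TargetD", capacity=1.5)
G.add_edge("LigandA", "ReceptorE", capacity=1.0)
G.add_edge("ReceptorE", "TargetD", capacity=0.8)

flow_value, flow_dict = nx.maximum_flow(G, "LigandA", "TargetD")
print(flow_value)   # total signal that can reach the target
```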





□ NOMAD2 provides ultra-efficient, scalable, and unsupervised discovery on raw sequencing reads

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533189v1

NOMAD2 rapidly identifies candidate RNA editing de novo, including detecting potentially hyperedited events, filling a gap in existing bioinformatic tools. Anchors are classified as “mismatch” when the two most abundant targets differ by single-base mismatches.

NOMAD2 enumerates all (a+g+t)-mers; these sequences are sorted lexicographically with KMC-tools. All occurrences of a unique anchor are then adjacent, which enables efficient gap removal and collapsing of unique targets in the third step via a linear traversal over the (a+g+t)-mers.
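A toy sketch of the sort-then-scan principle: once (anchor, target) pairs are sorted lexicographically, all occurrences of an anchor are adjacent, so target counts can be collapsed in a single linear traversal. On real data KMC does this at scale; the sequences below are invented.

```python
from itertools import groupby
from collections import Counter

pairs = [("ACGT", "TTTA"), ("ACGT", "TTTA"), ("ACGT", "TTTG"),
         ("GGCA", "AAAC"), ("GGCA", "AAAC")]
pairs.sort()  # lexicographic sort groups identical anchors together

for anchor, grp in groupby(pairs, key=lambda p: p[0]):
    target_counts = Counter(t for _, t in grp)      # collapse unique targets
    print(anchor, target_counts.most_common())
```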





□ PWN: enhanced random walk on a warped network for disease target prioritization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05227-x

PWN (Prioritization with a Warped Network) uses the Forman–Ricci curvature instead of the Ollivier–Ricci curvature. PWN can be used to identify targets when prior knowledge and gene scores are properly supplied.

PWN is designed to be an efficient variant of random walk with restart (RWR). PWN uses a weighted asymmetric network that is generated from an unweighted and undirected network. The weights come from two distinct features.

PWN is designed to manage the proportion of information circulating in and flowing out of certain regions by controlling the internal feature. PWN warps the network by assigning higher weights to prior knowledge-related edges.
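Below is a generic random-walk-with-restart iteration, the operation PWN is a variant of, on a toy weighted asymmetric matrix. The curvature-derived "warping" of the weights is not shown; W and the seed scores p0 are assumptions for illustration.

```python
import numpy as np

def rwr(W, p0, restart=0.3, tol=1e-10, max_iter=10_000):
    """Random walk with restart on a weighted (possibly asymmetric) network."""
    # column-normalise so each column of P sums to 1 (transition probabilities)
    P = W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)
    seed = p0 / p0.sum()
    p = seed.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * P @ p + restart * seed
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

W = np.array([[0, 2, 1], [1, 0, 0], [1, 3, 0]], dtype=float)  # warped weights
p0 = np.array([1.0, 0.0, 0.0])                                # prior gene scores
print(rwr(W, p0))
```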





□ Multi-Omics Integration For Disease Prediction Via Multi-Level Graph Attention Network And Adaptive Fusion

>> https://www.biorxiv.org/content/10.1101/2023.03.19.533326v1

This framework involves constructing co-expression and co-methylation networks for each subject, followed by applying multi-level graph attention to incorporate biomolecule interaction information.

The true-class-probability strategy is employed to evaluate omics-level confidence for classification, and the loss is designed using an adaptive mechanism to leverage both within- and across-omics information.

The initial feature is generated by the multi-level Graph Attention Network for each type of omics data respectively. The decision feature of each type of omics data is generated by the TCP module. The decision features of each omics are concatenated into one fusion feature.





□ QADD: De Novo Drug Design by Iterative Multi-Objective Deep Reinforcement Learning with Graph-based Molecular Quality Assessment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad157/7085596

QADD designs a multi-objective deep reinforcement learning pipeline to generate molecules w/ multiple desired properties iteratively, where a graph neural network-based model for accurate molecular quality assessment on drug potentials is introduced to guide molecule generation.

QADD uses the Deep Q-Network, a value-based reinforcement learning method, to estimate the action-value function under different action selection strategies. Since it does not require a fixed-dimensional action space, it is particularly suitable for discontinuous space search.





□ Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs)

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533737v1

They recommend selection of a distance measure for SNP genotype data that does not give differing outcomes depending on the arbitrary choice, and consideration of which state should be considered as zero when applying binary distance measures to fragment presence-absence data.





□ BSP: Dimension-agnostic and granularity-based spatially variable gene identification

>> https://www.biorxiv.org/content/10.1101/2023.03.21.533713v1

BSP (big-small patch), a spatial granularity-guided and non-parametric model to identify spatially variable genes (SVGs) from two- or three-dimensional spatial transcriptomics data in a fast and robust manner.

BSP selects a set of neighboring spots within a certain distance to capture the regional means with different granularities. The variances of the expression mean across all spots are then calculated under different scales, and genes with high ratios are identified as the SVGs.
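A rough numpy/scipy sketch of the big-small patch idea: for every gene, average expression over small and big neighbourhoods around each spot, then score the gene by the ratio of variances of those regional means. The radii, the direction of the ratio, and the random data are assumptions; BSP's exact statistic and normalisation differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def bsp_score(coords, expr, r_small=1.0, r_big=3.0):
    tree = cKDTree(coords)
    def regional_means(radius):
        idx = tree.query_ball_point(coords, r=radius)
        return np.array([expr[i].mean(axis=0) for i in idx])  # spots x genes
    var_small = regional_means(r_small).var(axis=0)
    var_big = regional_means(r_big).var(axis=0)
    return var_big / (var_small + 1e-12)      # high ratio -> candidate SVG

coords = np.random.default_rng(3).uniform(0, 20, size=(400, 2))
expr = np.random.default_rng(4).poisson(2.0, size=(400, 5)).astype(float)
print(bsp_score(coords, expr))
```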





□ Capturing Spatiotemporal Signaling Patterns in Cellular Data with Geometric Scattering Trajectory Homology

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533807v1

GSTH, a general framework that encapsulates time-lapse signals on a cell adjacency graph in a low-dimensional trajectory. GSTH integrates geometric scattering and topological data analysis (TDA) to provide a comprehensive understanding of complex cellular interactions.

Geometric scattering employs wavelet-based transformations to extract multiscale representations of the signaling data, capturing the intricate hierarchical structures present in the spatial organization of cells and the temporal evolution of signaling events.





□ Ensemble-GNN: federated ensemble learning with graph neural networks for disease module discovery and classification

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533772v1

Ensemble-GNN allows users to quickly build predictive models utilizing PPI networks whose nodes carry features such as gene expression and/or DNA methylation.

The ensemble GNNs are combined into a global federated model. In the federated case, each client trains a GNN classifier on its own dedicated data; the trained models of the ensembles are shared among all clients, and predictions are again made via majority vote.





□ Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad151/7085594

Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm.

GenASM-DC uses only cheap bitwise operations to calculate the edit distance between two strings text and pattern. It builds an (n+1)×(k+1) dynamic programming (DP) table R, where n=length(text) and k is the maximum number of edits considered.





□ Estimation of a treatment effect based on a modified covariates method with L0 norm

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533735v1

New treatment effect estimation approaches based on the modified covariates method: one uses lasso regression and the other ridge regression, combined with the L0 norm.

A modified covariate method based on the L0 norm and Lq norm (q = 1, 2). The first method estimates treatment effects using lasso regression with the L0 norm. The second method uses ridge regression with the L0 norm.





□ PENCIL: Supervised learning of high-confidence phenotypic subpopulations from single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.23.533712v1

PENCIL can perform gene selection during the training process, which allows learning proper gene spaces that facilitate accurate subpopulation identifications from single-cell data.

PENCIL has the flexibility to address various phenotypes such as binary, multi-category and continuous phenotypes. PENCIL can order cells to reveal the subpopulations undergoing continuous transitions between conditions.





□ xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2023.03.24.534055v1

xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today.

xTrimoGene proposes an asymmetric encoder-decoder framework that takes advantage of the sparse gene expression matrix, and establishes the projection strategy of continuous values with a higher resolution.





□ EnsInfer: a simple ensemble approach to network inference outperforms any single method

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05231-1

EnsInfer, an ensemble approach to the network inference problem: each individual network inference method will work as a first level learning algorithm that gives a set of predictions from the gene expression input.

EnsInfer uses a combination of state-of-the-art inference approaches and combines them using a simple Naive Bayes ensemble model. EnsInfer essentially turns all the predictions from different inference algorithms into priors about each edge in the network.
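A sketch of that stacking idea with scikit-learn: each base inference method yields a confidence score per candidate edge, and those scores become features for a Naive Bayes meta-classifier trained on known edges. The method names, scores, and train/test split are invented for illustration and do not reproduce EnsInfer's exact setup.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
n_edges = 1000
scores = np.column_stack([
    rng.random(n_edges),   # e.g. a tree-ensemble importance score
    rng.random(n_edges),   # e.g. a correlation-based score
    rng.random(n_edges),   # e.g. a mutual-information score
])
labels = rng.integers(0, 2, size=n_edges)   # gold-standard edge yes/no

meta = GaussianNB().fit(scores[:800], labels[:800])    # second-level learner
edge_posteriors = meta.predict_proba(scores[800:])[:, 1]
print(edge_posteriors[:5])                             # ensemble edge probabilities
```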





□ Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02899-9

Enformer was not trained on the GTEx / Cardoso-Moreira et al. data specifically and does not directly give predictions for many human tissues. To match CAGE tracks to tissues and stages of development in a simple yet data-driven way, they fitted a ridge regression.

Enformer can predict endogenous RNA abundance very well and consistently outperforms previous models. Enformer substantially outperformed Basenji2 even when restricted to the latter model's input window and even on tasks where the receptive field size is irrelevant.





□ ElasticBLAST: accelerating sequence search via cloud computing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05245-9

ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs.

ElasticBLAST leverages the cloud to provide multiple worker nodes to parallelize the computation by breaking the queries into query batches. ElasticBLAST relies on BLAST DB metadata that is automatically generated to determine the amount of main memory needed for that database.





□ SiPSiC: A novel method to accurately estimate pathway activity in single cells for clustering and differential analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534310v1

SiPSiC, a novel method for inferring pathway scores from scRNA-seq data. It has a high sensitivity, accuracy, and consistency with existing knowledge across different data types, including findings often missed by the original conventional analyses.

SiPSiC scores can be used to cluster the cells and compute their UMAP projections in a manner that better captures the biological underpinnings of tissue heterogeneity.





□ cnnLSV: detecting structural variants by encoding long-read alignment information and convolutional neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05243-x

cnnLSV can automatically adjust the images from different variants to a uniform size according to the length of each variant and the coverage of the dataset for training the filtering model.

cnnLSV converts the images in training set into one-dimensional arrays, and executes the principal component analysis and k-means clustering to eliminate the incorrectly labeled images to improve the filtering performance of the model.





□ KGETCDA: an efficient representation learning framework based on knowledge graph encoder from transformer for predicting circRNA-disease associations

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534642v1

Knowledge Graph Encoder from Transformer for predicting CDA (KGETCDA) integrates more than 10 databases to construct a large heterogeneous non-coding RNA dataset, which contains multiple relationships between circRNA, miRNA, lncRNA and disease.

A biological knowledge graph is created based on this dataset and Transformer-based knowledge representation learning and attentive propagation layers are applied to obtain high-quality embeddings with accurately captured high-order interaction information.





□ C-DEPP: Scaling deep phylogenetic embedding to ultra-large reference trees: a tree-aware ensemble approach

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534201v1

Clustered-DEPP (C-DEPP) uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing twenty million 16S fragments on the GG2 reference tree in 41 hours of computation.

C-DEPP trains a separate model for each of several overlapping subtrees; for each query, C-DEPP uses a 2-level classifier to select one or more subtrees, computes distances using those subtrees, and uses these distances as input to APPLES-II, leaving the other distances blank.





□ simpleaf: A simple, flexible, and scalable framework for single-cell transcriptomics data processing using alevin-fry

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534653v1

simpleaf, a program that simplifies the processing of single-cell data using tools from the alevin-fry ecosystem, and adds new functionality and capabilities, while retaining the flexibility and performance of the underlying tools.

simpleaf quant will automatically recruit and parameterize the correct mapper, and will automatically locate and provide the file containing the transcript-to-gene mapping information to later quantification stages where appropriate.





□ Sequencing accuracy and systematic errors in nanopore direct RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2023.03.29.534691v1

The presence of the same systematic error patterns in RODAN points to more fundamental causes of errors in the raw signal data, necessitating further development of better pore chemistry to produce higher quality dRNA-seq data.

Clearly, further development of dRNA-seq protocols, pore chemistry and basecalling algorithms are desirable. Appropriate quality control and error correction methods are needed to mitigate the effects of high error rates and systematic biases in downstream analyses.



Dihedral.

2023-03-31 02:22:22 | Science News

(Art by ekaitza)

GPT models are affected by various factors, such as the size of the training dataset and the architecture, which may influence their Kolmogorov complexity. Simpler algorithms can compress complex data, and the performance of a GPT model is expected to improve as its complexity increases.






□ Split-Transformer Impute (STI): Genotype Imputation Using a Transformer-Based Model

>> https://www.biorxiv.org/content/10.1101/2023.03.05.531190v3

The model utilizes attention to capture correlations among the SNPs/SNVs in the data. It achieves high imputation accuracy at a modest memory consumption cost by dividing the data into chunks, enabling efficient application to long sequences.

STI uses a Cat-Embedding layer to capture allele information per SNV, which, in conjunction with multi-headed attention layers, enables STI to model correlations among SNVs and impute missing values based on the known and missing values at each position.






□ Dagger Linear Logic and Categorical Quantum Mechanics

>> https://arxiv.org/abs/2303.14231

The existing frameworks of Categorical Quantum Mechanics (CQM) are categorical proof theories of compact dagger linear logic, and are motivated by the interpretation of quantum systems in the category of finite dimensional Hilbert spaces.

Mixed Unitary Categories is a novel non-compact framework. MUC is built upon linearly distributive categories and ∗-autonomous categories, which serve as categorical proof theories of non-compact multiplicative linear logic and can be applied to infinite dimensional systems.





□ AIBMD: Artificial Intelligence Boosted Molecular Dynamics

>> https://www.biorxiv.org/content/10.1101/2023.03.25.534210v1

In AIBMD, probabilistic Bayesian neural network models were used to construct boost potentials that exhibit Gaussian distribution with minimized anharmonicity for accurate energetic reweighting and enhanced sampling.

AIBMD has been demonstrated on model systems of the alanine dipeptide in explicit and implicit solvent, the chignolin fast-folding protein, and three hairpin RNAs with the GCAA, GAAA, and UUCG tetraloops.





□ Boolean Network Sketches: A Unifying Framework for Logical Model Inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad158/7099622

A Boolean network sketch starts with an initial sketch that corresponds to the prior, literature-based knowledge only. Subsequently, it is extended by adding restrictions representing experimental data, resulting in the data-informed sketch.

The sketch integrates partial knowledge about the network’s topology and update logic, as well as dynamical restrictions representing knowledge or assumptions about the properties of the network’s transitions (e.g., the attractor landscape), and restrictions on the model dynamics.





(Art by jaanus03)

People are born of people, in the likeness of people, and crowd together on a ship they happened to board by chance. As if reading the stars, we give meaning to similar signs and come to feel that only the direction the wind drives us is certain. Whatever we find, whatever we feel, whatever we try to accomplish, we learn that none of it can be held back from the wind that sweeps it away. Everyone, forgetting everyone's name, dissolves.



□ A warning from a top StackOverflow engineer: keep depending on GPT-4 and we risk "drinking from a dried-up riverbed." The question is whether a phase of knowledge reproduction remains possible. Indeed, there are also reports that Google's traffic is falling. Incidentally, the quoted image is Norway's Hotel Juvet, used for filming the AI-themed movie "Ex Machina".

Peter Nixey

I'm in the top 2% of users on StackOverflow. My content there has been viewed by over 1.7M people. And it's unlikely I'll ever write anything there again.

Which may be a much bigger problem than it seems. Because it may be the canary in the mine of our collective knowledge.

A canary that signals a change in the airflow of knowledge: from human-human via machine, to human-machine only. Don’t pass human, don’t collect 200 virtual internet points along the way.

StackOverflow is *the* repository for programming Q&A. It has 100M users & saves man-years of time & wig-factories-worth of grey hair every single day.

It is driven by people like me who ask questions that other developers answer. Or vice-versa. Over 10 years I've asked 217 questions & answered 77. Those questions have been read by millions of developers & had tens of millions of views.

But since GPT4 it looks less & less likely any of that will happen; at least for me. Which will be bad for StackOverflow. But if I'm representative of other knowledge-workers then it presents a larger & more alarming problem for us as humans.

What happens when we stop pooling our knowledge with each other & instead pour it straight into The Machine? Where will our libraries be? How can we avoid total dependency on The Machine? What content do we even feed the next version of The Machine to train on?

When it comes time to train GPTx it risks drinking from a dry riverbed. Because programmers won't be asking many questions on StackOverflow. GPT4 will have answered them in private. So while GPT4 was trained on all of the questions asked before 2021 what will GPT6 train on?

This raises a more profound question. If this pattern replicates elsewhere & the direction of our collective knowledge alters from outward to humanity to inward into the machine then we are dependent on it in a way that supercedes all of our prior machine-dependencies.

Whether or not it "wants" to take over, the change in the nature of where information goes will mean that it takes over by default.

Like a fast-growing Covid variant, AI will become the dominant source of knowledge simply by virtue of growth. If we take the example of StackOverflow, that pool of human knowledge that used to belong to us - may be reduced down to a mere weighting inside the transformer.

Or, perhaps even more alarmingly, if we trust that the current GPT doesn't learn from its inputs, it may be lost altogether. Because if it doesn't remember what we talk about & we don't share it then where does the knowledge even go?

We already have an irreversible dependency on machines to store our knowledge. But at least we control it. We can extract it, duplicate it, go & store it in a vault in the Arctic (as Github has done).

So what happens next? I don't know, I only have questions.

None of which you'll find on StackOverflow.





□ CONGAS+: A Bayesian method to infer copy number clones from single-cell RNA and ATAC sequencing

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535197v1

CONGAS+, a Bayesian model to map single-cell RNA and ATAC profiles generated from independent or multimodal assays onto the latent space of copy number clones. CONGAS+ is equipped with a shrinkage hyperparameter that can be used to weigh the evidence differently across RNA/ATAC.

CONGAS+ retrieved complex subclonal architectures while providing a coherent mapping between ATAC and RNA, facilitating the study of genotype-phenotype mapping.






□ Reconstruction of Gene Regulatory Networks using sparse graph recovery models

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535294v1

The authors categorize graph recovery methods into four main types based on the underlying formulations: regression-based, Graphical Lasso, Markov networks, and directed acyclic graphs, and incorporate transcription factor information as a prior to ensure successful reconstruction of GRNs.

They modified the uGLAD algorithm to take TF information into account (uGLAD-GRN) by using a post-hoc masking operation that only retains edges having at least one TF node. The masking can be applied to most algorithms that recover conditional independence graphs.
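The post-hoc masking step is simple to picture; a numpy sketch is below, with a toy gene list, TF labels, and recovered adjacency matrix as assumptions. Only edges with at least one TF endpoint survive.

```python
import numpy as np

genes = ["TF1", "TF2", "G3", "G4"]
is_tf = np.array([g.startswith("TF") for g in genes])   # which nodes are TFs

# a recovered conditional-independence adjacency matrix (illustrative)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)

mask = is_tf[:, None] | is_tf[None, :]   # True if either endpoint is a TF
grn = adj * mask                         # drop gene-gene edges without a TF
print(grn)
```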





□ STGRNS: An interpretable Transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad165/7099621

STGRNS, a Transformer-based model, provides a fast and accurate tool to infer gene regulatory networks from a single-cell RNA-seq profile. By leveraging the newly designed neural network structure, STGRNS especially obtains an outperformance on GRN inference.

STGRNS has certain transferability on the TF-gene prediction task. STGRNS can accurately infer GRNs based on known relationships between genes, irrespective of whether the data is static, pseudo-time, or time-series.





□ Sequence vs. Structure: Delving Deep into Data-Driven Protein Function Prediction

>> https://www.biorxiv.org/content/10.1101/2023.04.02.534383v1

The difference between the RGC TN and RG AT methods is that the former employs a transformer network and incorporates direction, orientation, and distance distribution information in the edge features, while the latter only includes distance and dihedral angle information.

The first fusion method directly splices the output of the ESM-1b model and the GAT model and feeds it to the classifier for the final prediction. The second fusion method involves taking the output of the ESM-1b model as the initialization characteristics of nodes in the graph.





□ Single-cell RNA-seq differential expression tests within a sample should use pseudo-bulk data of pseudo-replicates

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534443v1

The results of the simulation experiments showed that bulk methods that use pseudo-bulk raw count data from pseudo-replicates ranked highest and were most effective in controlling the false discovery rate (FDR) for highly expressed genes.

For real scRNA-seq data, the top-performing pipelines were also dominated by the same kind of pipelines, but the differences between single-cell and pseudo-replicate methods were less clear.





□ sciPENN: A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation

>> https://www.nature.com/articles/s42256-022-00545-w

sciPENN is a flexible method that supports completion of multiple CITE-seq references (by imputing missing proteins for each reference) as well as protein expression prediction in an scRNA-seq test set, all in one framework.

sciPENN can transfer cell type labels from a training set to a test set, and can also integrate cells from the multiple datasets into a common latent space.

sciPENN’s model architecture comprises an input block, followed by a sequence of feed-forward (FF) blocks interleaved with updates to an internally maintained hidden state via an RNN cell.

The final hidden state is passed through three dense layers to compute protein predictions, protein prediction bounds and cell type class probability vectors.





□ Bayesian Multi-Study Non-Negative Matrix Factorization for Mutational Signatures

>> https://www.biorxiv.org/content/10.1101/2023.03.28.534619v1

A Bayesian multi-study NMF method that jointly decomposes multiple studies or conditions to identify signatures that are common, specific, or partially shared by any subset.

A “discovery-only” model that estimates de novo signatures in a completely unsupervised manner, and a “recovery-discovery” model that builds informative priors from previously known signatures to both update the estimates of these signatures and identify any novel signatures.





□ The impact of FASTQ and alignment read order on structural variation calling from long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.03.27.534439v1

Comparisons of variant call format (VCF) files generated from the original and permutated FASTQ files demonstrated that the order of input data had a large impact on SV prediction, particularly for pbsv. The type of variant most affected by read order varied by caller.

For pbsv, most differences occurred for deletions and duplications, while for Sniffles, permutating the read order had a stronger impact on insertions. For SVIM, inversions and deletions accounted for most differences.





□ Spatial Transcriptomics Analysis of Gene Expression Prediction using Exemplar Guided Graph Neural Network

>> https://www.biorxiv.org/content/10.1101/2023.03.30.534914v1

The authors propose a graph exemplar bridging (GEB) block to update window features using the exemplars and their gene expression. Allowing dynamic information propagation, the exemplar features also receive and are updated with the status of the window features.

Semantically, the former update corresponds w/ ‘the known gene expression’, and the latter corresponds w/ ‘the gene expression the model wants to be known’. Finally, it has an attention-based prediction block to aggregate the exemplars of each window and the exemplar-revised window features.





□ CellTrackVis: interactive browser-based visualization for analyzing cell trajectories and lineages

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05218-y

CellTrackVis visualizes tracking results, e.g., cell trajectories, segmentation, raw or processed image sequences, cell lineages, or quantified information, on interconnected views. These generally include the number of cell divisions or appearances/disappearances at each time step.

Distinct time-series data are plotted using line graphs, and exact values appear with a vertical bar moved by a mouse pointer. The statistics dataset is not a mandatory input, and thus the tool supports its visual analysis while retaining flexibility in the input data.





□ A self-propagating, barcoded transposon system for the dynamic rewiring of genomic networks

>> https://www.embopress.org/doi/full/10.15252/msb.202211398

A modular, combinatorial assembly pipeline for the functionalization of transposons with synthetic or endogenous gene regulatory elements as well as DNA barcodes.

The continuous mobilization of transposons throughout the host genome yields multi-site adaptive mutations and growth phenotypes in both static and dynamic selective environments.

The first design mimics a natural transposon, with the transposase acting in cis from within the region flanked by the inverted repeat sequences, while the second uses a medium-copy helper plasmid (pHelper) to provide the transposase acting in trans.





□ Sparse clusterability: testing for cluster structure in high dimensions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05210-6

Clusterlab generates clusters of a user-provided dimension by a linear projection of two-dimensional Gaussian principal components into the desired higher-dimensional space. The clusterlab manual highlights 12 example two-dimensional structures to project into higher dimensions.

Methods with the dip test and either sparse PCA or traditional PCA detected known cluster structure in high dimensional-omics based data and had high power in simulations. Type I error was controlled at or below the nominal level across all dimensions.





□ MBE: Model-based differential sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.29.534803v1

Model-based enrichment (MBE) is based on sound theoretical principles, is easy to implement, and can trivially make use of advances in modern-day machine learning classification architectures or related innovations.

Increasingly, log-enrichment estimates are also being used as supervised labels for training machine learning models so that one may predict enrichment for unobserved sequences, or probe the model to gain further insights.





□ PanKmer: k-mer based and reference-free pangenome analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.31.535143v1

PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence-absence values in each genome. These are stored in an efficient k-mer index data format that encodes all forms of variation within the pangenome, including SNPs, INDELs, and SVs.

PanKmer includes functions for downstream analysis, such as calculating sequence similarity statistics b/n individuals at whole-genome or local scales. k-mers can be “anchored” in any individual genome to quantify sequence variability or conservation at a specific locus.
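The presence-absence decomposition and a pairwise similarity statistic can be sketched in a few lines; the toy genomes, k, and Jaccard similarity below are illustrative, and a real index like PanKmer's additionally handles canonical k-mers, compression, and genome-scale data.

```python
from itertools import combinations

def kmers(seq, k=5):
    """Set of all k-mers observed in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

genomes = {
    "g1": "ACGTACGTTACGGA",
    "g2": "ACGTACGATACGGA",
    "g3": "TTTTACGTACGTTA",
}
sets = {name: kmers(seq) for name, seq in genomes.items()}   # presence/absence

for a, b in combinations(sets, 2):
    inter = len(sets[a] & sets[b])
    union = len(sets[a] | sets[b])
    print(a, b, round(inter / union, 3))   # whole-genome k-mer similarity
```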





□ MOGAT: An Improved Multi-Omics Integration Framework Using Graph Attention Networks

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535195v1

MOGAT, a novel multi-omics integration-based cancer subtype prediction leveraging a graph attention network (GAT) model that incorporates graph-based learning with an attention mechanism for analyzing multi-omics data.

MOGAT utilizes a multi-head attention mechanism that can efficiently extract information for a specific patient by assigning unique attention coefficients to its neighboring patients, i.e., getting the relative influence of neighboring patients in the patient similarity graph.





□ mlf-core: a framework for deterministic machine learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad164/7099608

mlf-core, a machine learning framework that enables building fully deterministic and therefore also reproducible machine learning projects. mlf-core is based on MLflow for machine learning experiment tracking, visualization and model deployment.

mlf-core provides project templates and static code analysis (linting) functionality that ensures the sole usage of deterministic algorithms for GPU computing as well as setting all necessary random seeds for deterministic results.
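The seed-setting and determinism switches that such linting enforces look roughly like the generic PyTorch/NumPy pattern below; this is not mlf-core's own API or template, just the standard calls it checks for.

```python
import os
import random
import numpy as np
import torch

def set_deterministic(seed: int = 0):
    """Seed all common RNGs and force deterministic GPU algorithms."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some CUDA ops
    torch.use_deterministic_algorithms(True)           # error on non-deterministic ops
    torch.backends.cudnn.benchmark = False

set_deterministic(42)
```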





□ Discovering motifs and genomic patterns with SMT: a high-performance data structure for counting kmers

>> https://www.biorxiv.org/content/10.1101/2023.04.01.535163v1

The Sparse Motif Tree (SMT), an innovative tool specifically designed to store and count kmers efficiently. The SMT optimizes memory usage and computation.

The SMT provides advanced features, such as exact search in constant time, retrieval of the most abundant kmers, and approximate search in linear time to find fragments with up to d mutations uniformly distributed across their bases.
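A toy dict-based trie shows the core idea of exact k-mer counting with one step per base; the real SMT is a far more memory-efficient sparse structure, and the sequence below is invented.

```python
class KmerTrie:
    """Naive motif-tree sketch: one dict node per base, counts at the leaves."""
    def __init__(self):
        self.root = {}

    def add(self, kmer):
        node = self.root
        for base in kmer:
            node = node.setdefault(base, {})
        node["count"] = node.get("count", 0) + 1

    def count(self, kmer):                 # exact search, one step per base
        node = self.root
        for base in kmer:
            if base not in node:
                return 0
            node = node[base]
        return node.get("count", 0)

trie = KmerTrie()
seq, k = "ACGTACGTAC", 4
for i in range(len(seq) - k + 1):
    trie.add(seq[i:i + k])
print(trie.count("ACGT"), trie.count("GGGG"))   # 2 0
```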





□ PanGraphViewer: A Versatile Tool to Visualize Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2023.03.30.534931v1

PanGraphViewer targets pangenome graphs and allows the viewing of pangenome graphs built from multiple genomes in either the graphical fragment assembly format or the VCF. PanGraphViewer also integrates genome annotations with graph nodes to analyze insertions / deletions.

The graph node shapes in PanGraphViewer can represent different types of genomic variations when a VCF file is used. Notably, PanGraphViewer displays subgraphs from a chromosome or sequence segment based on any given coordinates.





□ ScRAT: Clinical Phenotype Prediction From Single-cell RNA-seq Data using Attention-Based Neural Networks

>> https://www.biorxiv.org/content/10.1101/2023.03.31.532253v1

ScRAT, a clinical phenotype prediction framework that can learn from limited numbers of scRNA-seq samples with minimal dependence on cell-type annotations.

ScRAT establishes the connection between the input (cells) and the output (phenotypes) of the Transformer model simply using the attention weights.





□ NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad161/7099619

NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. Its specific features include annotation of nested named entities, and it can be used as a benchmark for cross-domain and cross-language transfer.

Transferability of trained models across two datasets with completely different contexts can be limited due to domain shift, while sequential training can cause complete retraining of model weights.





□ Dipwmsearch: a python package for searching di-PWM motifs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad141/7100340

dipwmsearch provides an easy and efficient procedure to find occurrences of di-PWMs in nucleotide sequences, along with well-documented snippets. It offers practical advantages compared to an existing solution (such as processing IUPAC codes, or an adaptable output).

dipwmsearch uses an original enumeration-based search algorithm that handles di-PWMs. Coping with non-selective positions was necessary to make the search effective for some di-PWMs, which questions their information content, and in turn their construction process.





□ FRASER 2.0: Improved detection of aberrant splicing using the Intron Jaccard Index

>> https://www.medrxiv.org/content/10.1101/2023.03.31.23287997v1

As FRASER’s autoencoder works with values in the logit space, which is defined for values greater than 0 and less than 1, a pseudocount needs to be added to both the numerator and denominator when calculating each metric on raw read counts.

FRASER 2.0, a method to detect aberrant splicing using a novel intron-centric metric, the Intron Jaccard Index. In a single metric, the Intron Jaccard Index captures former metrics of splicing efficiency as well as alternative donor and acceptor site choice.

FRASER 2.0 decreases the number of reported splicing outliers by one order of magnitude, recovers splicing outliers associated with candidate splice-disrupting rare variants more accurately than competitor methods, and is more robust to variations in sequencing depth.





□ catchSalmon / catchKallisto: Dividing out quantification uncertainty allows efficient assessment of differential transcript expression

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535231v1

Bootstrap samples generated by lightweight aligners can be used to accurately estimate the mapping ambiguity overdispersion which, in turn, can be used to scale down estimated transcript counts so that the resulting effective library sizes reflect their true precision.

As a result, standard methods designed for the differential expression analyses at the gene-level can be applied to transformed transcript counts for DTE analyses.

Functions catchSalmon and catchKallisto from edgeR import transcript-specific estimated counts (including bootstrap resamples) from Salmon and kallisto, respectively, and estimate the associated mapping ambiguity overdispersion.





□ HTOreader: A hybrid single-cell demultiplexing strategy that increases both cell recovery rate and calling accuracy

>> https://www.biorxiv.org/content/10.1101/2023.04.02.535299v1

HTOreader, an improved algorithm for cell hashing that distinguishes true positive from background for each individual hashtag at higher accuracy. This hybrid strategy increases cell recovery and calling accuracy while lowering experimental cost.

HTOreader uses a hybrid demultiplexing strategy for single-cell sample pooling and super-loading. By integrating the results of both cell hashing and SNP profiling, the two approaches complement each other and their respective weaknesses are largely mitigated.




Air.

2023-03-30 03:03:03 | Diary / Essay / Column

You don't have to wait. But you don't have to give up, either.


Where does a person's "true heart" reside? Nothing proves or defines another's intentions, and whenever we try to peer into them, that very uncertainty becomes the target of hostility. Fear and anxiety live within the wavering, kindness and a moment's hesitation are two sides of the same coin, and only "what exists here and now" was ever real.





Coco Moon.

2023-03-26 03:03:03 | Music20

□ Owl City / “Coco Moon”

>> https://www.owlcitymusic.com/

Release Date: 24/03/2023
Label: Sky Harbor Records.


1. Adam, Check Please
2. Under the Circus Lights
3. Kelly Time
4. Field Notes
5. Sons of Thunder
6. The Tornado
7. Vitamin Sea
8. Dinosaur Park
9. Learn How to Surf
10. The Meadow Lark
11. My Muse


Owl City's new album “Coco Moon” is just too good!
Every track on the album is packed full of bliss:
glittering tropical pop tunes sparkling under a moonlit night 🏝🌙!
Let those chill vibes soar! ૮^˶•̀д•́˶^ა


□ Under the Circus Lights


□ Vitamin Sea


□ Learn How to Surf







L’ Énigme.

2023-03-24 19:33:34 | Book

───Whoever speaks removes the light. (Pascal Quignard)

『De procedure - Harry Mulisch』
『ZÉRO - Denis Guedj』
『Désert - J.M.G. Le Clézio』
『L’ Énigme - Pascal Quignard collection』

French literature is hard for me to read in the original, so I rely on Japanese translations. In Quignard's thought I strongly sense the philosophy of Emmanuel Lévinas.





Peyman Yazdanian / “Pulse”

2023-03-24 19:33:14 | art music

□ Peyman Yazdanian / “Pulse”

>> https://music.apple.com/jp/album/pulse/1633126972

Release Year: 2020
Label: Hermes Records
Cat.No.: HER-091

>> tracklisting

1. Pulse 1 (For Prepared Piano)
2. Pulse 2

Graphics – علی بوستان
Photography By – ارمغان بوستان, نسترن فتوحی
Piano, Composed By – Peyman Yazdanian
Producer – رامین صدیقی
Production Manager – بیژن پاسبان حضرت, پوریا پور وزیری
Sound Designer – Reza Asgarzadeh


The Iranian composer Peyman Yazdanian. A crystal of sound filled with limpid stillness, weaving the vibe of Persian music and "Zen" into meditative melodies for prepared piano. Improvisation-like performances, with glimpses of santur and upright bass, that capture the languid yet beautiful shadows of everyday life.






BENEDETTA.

2023-03-19 21:44:23 | Film


□ 『Benedetta』

>> https://cdn-media.festival-cannes.com/film_film/0002/66/09eb8592ee20c0f58994cdadef7b486f0ff24c69.pdf

Directed by Paul Verhoeven
Cast: Virginie Efira / Charlotte Rampling / Daphne Patakia

Writing by David Birke & Paul Verhoeven
Based on the book by Judith C. Brown

Music by Anne Dudley
Song text by Hildegard von Bingen

Cinematography by Jeanne Lapoirie
Art Direction by Eric Bourges


A 17th-century nun in Italy suffers from disturbing religious and erotic visions. She is assisted by a companion, and the relationship between the two women develops into a romantic love affair.

『BENEDETTA』: the "veil" that separates miracle from perjury, domination from submission, divinity from corporeality. When the boundary between its inside and outside is torn open, people reveal their destructiveness in trying to cast out the contradictions they carry within. "Cleverness is dangerous, for it bares its fangs even at yourself." A woman who had no choice but to turn her back on love and walk a path of thorns. Hildegard von Bingen's chants echo mournfully.


The film also follows the typical structure of nunsploitation, and while it contains pornographically explicit depictions of sex, the skill of the master Paul Verhoeven lies in making every one of them function as an inseparable element that converges on the outcome. A deep insight into the interpretation of the two-sidedness of "established fact" and "unknowability".






□ Anne Dudley / “Beata Viscera” (Benedetta)


□ Anne Dudley / “The Bride of Christ”

The score for 『BENEDETTA』 was written by Anne Dudley. It incorporates modern arrangements of chants composed by Hildegard von Bingen, the abbess of a medieval German Benedictine convent who, like Benedetta, experienced visions.



The transgressive worldview and eroticism that run through this work like a basso continuo also have something in common with ENIGMA's "Principles of Lust", and I sometimes listen to it while recalling the film.





εν αρχη ην ο λογος.

2023-03-13 03:13:13 | Science News

(Art by joeryba.eth)

The problems we face come in two kinds: "our own limits" and "the cages of others." Folding in the process by which every subject "repeats", the two problems always stand back to back. The problems we solve for ourselves keep holding others captive, and, as in a mirror, the reverse holds as well. What lies beyond the cage is another cage, circulating like nested shells.




□ Φ-SO: Deep symbolic regression for physics guided by units constraints: toward the automated discovery of physical laws

>> https://arxiv.org/abs/2303.03192

Φ-SO, a Physical Symbolic Optimization framework for recovering analytical symbolic expressions from physics data using deep reinforcement learning techniques by learning units constraints.

Φ-SO restricts the freedom of the equation generator, and balanced units are proposed by construction, thus greatly reducing the search space. It enables the algorithm to zero-out the probability of forbidden symbols that would result in expressions that violate units rules.
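The zeroing-out step can be pictured as masking a categorical distribution before sampling the next symbol; the sketch below is generic, with a toy symbol library and a stand-in is_allowed check instead of a real units-consistency rule.

```python
import numpy as np

def masked_sample(logits, symbols, is_allowed, rng):
    """Zero out forbidden symbols, renormalise, then sample the next token."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    mask = np.array([is_allowed(s) for s in symbols], dtype=float)
    probs = probs * mask
    if probs.sum() == 0:
        raise ValueError("no unit-consistent symbol available")
    probs /= probs.sum()
    return rng.choice(symbols, p=probs)

symbols = ["+", "*", "v", "t", "m"]          # toy library: velocity, time, mass
is_allowed = lambda s: s != "m"              # pretend mass is dimensionally forbidden here
rng = np.random.default_rng(6)
print(masked_sample(np.zeros(5), symbols, is_allowed, rng))
```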





□ scPheno: Extraction of biological signals by factorization enables the reliable analysis of single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.03.04.531126v1

scPheno, a deep auto-regressive factor model that is used to extract the biological signals embedded in the transcriptome, identify gene expression variations associated with each of the phenotypes, and re-build the accumulative effect of multiple phenotypes on cell states.

scPheno will factorize gene expression pertaining to a phenotypic factor and project cells onto a latent variable space, where the latent variable specifies a hidden cell state and cells of the same hidden states will cluster together.

The deep factor model will infer the factorized latent variable spaces. The factorization neural networks and the reconstruction neural network can be coupled to predict gene expression in relation to any factor combination.





□ INSnet: a method for detecting insertions based on deep learning network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05216-0

INSnet divides the reference genome into continuous sub-regions and takes five features for each locus through alignments between long reads and the reference genome. Next, INSnet uses a depthwise separable convolutional network.

INSnet uses two attention mechanisms, the convolutional block attention module (CBAM) and efficient channel attention (ECA) to extract key alignment features in each sub-region. INSnet uses a gated recurrent unit (GRU) network to further extract more important SV signatures.





□ LEMUR: Analysis of multi-condition single-cell data with latent embedding multivariate regression

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531268v1

A new statistical model for differential expression analysis (or ANOVA) of multi-condition single-cell data that combines the ideas of linear models and principal component analysis (PCA).

Latent embedding multivariate regression (LEMUR) is based on a parametric mapping of latent space representations into each other and uses a design matrix to encode categorical and continuous covariates.





□ The Network Zoo: a multilingual package for the inference and analysis of gene regulatory networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02877-1

The Network Zoo, a platform that harmonizes the codebase for these methods, in line with recent similar efforts, and provides implementations in R, Python, MATLAB, and C. The netZoo codebase has helped develop an ecosystem of online resources for GRN inference and analysis.

netZoo integrates PANDA, LIONESS, and MONSTER to infer TF-gene targeting to explore how regulatory changes affect disease phenotype, and used DRAGON to integrate nine types of genomic information and find multi-omic markers that are associated with drug sensitivity.





□ RGT: a toolbox for the integrative analysis of high throughput regulatory genomics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05184-5

Regulatory Genomics Toolbox (RGT) was programmed in an object-oriented fashion and its core classes provide functionalities to handle typical regulatory genomics data: regions and signals.

RGT built distinct regulatory genomics tools, i.e., HINT for footprinting analysis, TDF for finding DNA–RNA triplex, THOR for ChIP-seq differential peak calling, motif analysis for TFBS matching and enrichment, and RGT-viz for regions association tests and data visualization.

THOR is a Hidden Markov Model-based approach to detect and analyze differential peaks in two sets of ChIP-seq data from distinct biological conditions with replicates. Triplex Domain Finder (TDF) characterizes the triplex-forming potential between RNA and DNA regions.





□ phytools 2.0: An updated R ecosystem for phylogenetic comparative methods (and other things)

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531791v1

The phytools library has now grown to be very large – consisting of hundreds of functions, a documentation manual that’s over 200 pages in length, and tens of thousands of lines of computer code.

For the Mk model-fitter (here the phytools function fitMk), and for the other discrete character methods of the phytools R package, the input phenotypic trait data typically takes the form of a character or factor vector.





□ NextDenovo: An efficient error correction and accurate assembly tool for noisy long reads

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531669v1

NextDenovo, a highly efficient error correction and CTA-based assembly tool for noisy long reads. NextDenovo can rapidly correct reads; these corrected reads contain fewer errors than other comparable tools and are characterized by fewer chimeric alignments.

NextDenovo uses the BOG algorithm to remove edges for non-repeat nodes. The graph usually contained some linear paths connecting some complex subgraphs. All paths were broken at the node connecting with multi-paths, and contigs were outputted from these broken linear paths.





□ vcfdist: Accurately benchmarking phased small variant calls in human genomes

>> https://www.biorxiv.org/content/10.1101/2023.03.10.532078v1

vcfdist, an alignment-based small variant calling evaluator that standardizes query and truth VCF variants to a consistent representation, requires local phasing of both input VCFs, and gives partial credit to variant calls which are mostly (but not exactly) correct.

A novel variant clustering algorithm reduces downstream computation while discovering long-range variant dependencies. Novel alignment-distance-based metrics are independent of variant representation and measure the distance b/n the final diploid truth and query sequences.





□ scEvoNet: a gradient boosting-based method for prediction of cell state evolution

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05213-3

scEvoNet, a method that builds a cell type-to-gene network using the Light Gradient Boosting Machine (LGBM) algorithm overcoming different domain effects (different species/different datasets) and dropouts that are inherent for the scRNA-seq data.

ScEvoNet builds the confusion matrix of cell states and a bipartite network connecting genes and cell states. It allows a user to obtain a set of genes shared by the characteristic signature of two cell states even between distantly-related datasets.
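A rough LightGBM sketch of the core step: train a one-vs-rest booster per cell state and use its feature importances to weight gene-to-cell-state edges of a bipartite network. The random data, parameters, and edge construction are illustrative; the actual method also addresses cross-dataset effects and dropouts.

```python
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(7)
X = rng.poisson(1.0, size=(300, 50)).astype(float)    # cells x genes
y = rng.integers(0, 3, size=300)                      # three cell states

edges = {}
for state in np.unique(y):
    # one-vs-rest classifier for this cell state
    clf = LGBMClassifier(n_estimators=50).fit(X, (y == state).astype(int))
    importance = clf.feature_importances_             # one weight per gene
    top = np.argsort(importance)[::-1][:5]
    edges[state] = [(f"gene_{g}", int(importance[g])) for g in top]

print(edges[0])   # strongest gene -> cell-state edges for state 0
```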





□ NGenomeSyn: an easy-to-use and flexible tool for publication-ready visualization of syntenic relationships across multiple genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad121/7072460

NGenomeSyn, an easy-to-use and flexible tool, for publication-quality visualization of syntenic relationships (user-defined or generated by our custom script) and genomic features (e.g. repeats, structural variations, genes) on tens of genomes with high customization.

NGenomeSyn allows its user to adjust default options for genome and link styles defined in the configuration file and simply adjusts options of moving, scaling, and rotation of target genomes, yielding a rich layout and publication-ready figure.





□ containX: Coverage-preserving sparsification of overlap graphs for long-read assembly

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad124/7074174

ContainX heuristics are promising in terms of improving assembly quality by avoiding coverage gaps. The string graph model filters out contained reads during graph construction.

containX is a prototype implementation of an algorithm that decides which contained reads can be dropped during overlap graph sparsification. Reads which are substrings of longer reads are typically referred to as contained reads.

Hifiasm retained fewer contained reads than containX but failed to resolve the majority of coverage gaps. The Hifiasm unitig graph has the fewest junction reads because it performs additional graph pruning, which is necessary for computing longer unitigs.





□ LoMA: Localized assembly for long reads enables genome-wide analysis of repetitive regions at single-base resolution in human genomes

>> https://pubmed.ncbi.nlm.nih.gov/36895025/

LoMA constructs a consensus sequence (CS) spanning a target region. The process starts by finding overlaps of raw reads using minimap2's pairwise all-to-all alignment, followed by a layout of the overlapped reads. The layout is divided into multiple blocks to build partial consensus sequences.

LoMA captures haplotype structures based on SVs and produces haplotype-resolved CSs. LoMA predicts heterozygous loci in the region based on the extent of deviation from the binomial distribution and gathers the reads derived from each estimated haplotype.
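
The heterozygous-locus idea can be sketched with a binomial test on allele counts (expecting a roughly 50:50 balance in a diploid region); the thresholds below are illustrative assumptions, not LoMA's actual statistic.

from scipy.stats import binomtest

def is_heterozygous(alt_count: int, depth: int, alpha: float = 0.05) -> bool:
    """Call a locus heterozygous if the allele balance is consistent with 0.5
    and clearly inconsistent with a homozygous site (toy thresholds)."""
    if depth == 0:
        return False
    balanced = binomtest(alt_count, depth, p=0.5).pvalue >= alpha   # 50:50 not rejected
    not_homozygous = 0.15 < alt_count / depth < 0.85                # crude homozygosity filter
    return balanced and not_homozygous

print(is_heterozygous(12, 25), is_heterozygous(2, 25))   # True False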





□ HiFiCNV : Copy number variant caller and depth visualization utility for PacBio HiFi reads

>> https://www.pacb.com/blog/hificnv/

HiFiCNV can generate several CNV-related track files that can be loaded into IGV for visualization and assessment of its variant calls. On the evaluated dataset, HiFiCNV detected all large CNVs, and 90% of those calls had high overlap accuracy when compared to the reported CNVs.

Segmentation is performed by a Viterbi parse of the depth bins, assuming each bin's depth is a Poisson sample with a mean derived from the haploid depth. The haploid depth is computed from the zero-excluded mean depth of the chromosome set.
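
A minimal sketch of Viterbi segmentation over depth bins with Poisson emissions, assuming a simple shared "stay" probability for the transition model; the parameters, state model, and example depths are illustrative, not HiFiCNV's.

import numpy as np
from scipy.stats import poisson

def viterbi_cn(depths, haploid_depth, max_cn=4, stay_logp=np.log(0.95)):
    """Most likely copy-number path for binned depths under Poisson emissions.
    Toy transition model: a single stay probability shared by all states."""
    states = np.arange(max_cn + 1)
    means = np.maximum(states * haploid_depth, 1e-3)             # CN0 gets a tiny mean
    switch_logp = np.log((1 - np.exp(stay_logp)) / max_cn)
    trans = np.where(np.eye(len(states), dtype=bool), stay_logp, switch_logp)
    emit = poisson.logpmf(np.asarray(depths)[:, None], means)    # bins x states
    score, back = emit[0].copy(), np.zeros((len(depths), len(states)), dtype=int)
    for i in range(1, len(depths)):
        cand = score[:, None] + trans                            # previous state x next state
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[i]
    path = [int(score.argmax())]
    for i in range(len(depths) - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

# depth bins with a CN3-like bump over a haploid depth of 15 (expect 2,2,2,3,3,3,3,2,2,2)
print(viterbi_cn([28, 30, 29, 44, 46, 45, 47, 30, 31, 29], haploid_depth=15))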





□ ReCo: automated NGS read-counting of single and combinatorial CRISPR gRNAs.

>> https://www.biorxiv.org/content/10.1101/2023.03.09.530923v1

ReCo finds gRNA read counts (ReCo) in FASTQ files and runs as a standalone script or a Python package. It can be used for single and combinatorial CRISPR-Cas libraries sequenced with single-end or paired-end strategies.

ReCo works with conventionally cloned CRISPR-Cas libraries and 3Cs/3Cs-MPX libraries. ReCo can process multiple samples in a single run. It automatically determines the constant regions flanking the gRNAs, and utilizes Cutadapt to trim the fastq files.
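
Once the constant regions have been trimmed, counting reduces to tallying exact matches against the library. The sketch below assumes already-trimmed reads and a spacer-to-name table (the file name is hypothetical); ReCo itself handles trimming via Cutadapt and supports combinatorial, paired-end designs.

import gzip
from collections import Counter

def count_grnas(trimmed_fastq_gz, library):
    """library: dict mapping gRNA spacer sequence -> gRNA name.
    Counts exact matches in a trimmed (adapter-free) FASTQ."""
    counts = Counter()
    with gzip.open(trimmed_fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                       # sequence lines of the FASTQ records
                name = library.get(line.strip())
                if name is not None:
                    counts[name] += 1
    return counts

# usage sketch (hypothetical file and spacer):
# counts = count_grnas("sample_trimmed.fastq.gz", {"ACGTACGTACGTACGTACGT": "gene1_sg1"})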





□ StonPy: a tool to parse and query collections of SBGN maps in a graph database

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad100/7075543

The StonPy library allows users to store SBGN-ML maps in a running Neo4j database and retrieve them back as SBGN-ML. StonPy includes a completion module that automatically builds valid SBGN maps from query results representing parts of maps.

SBGN arcs are optionally modelled using additional Neo4j relationships that mimic the structure of the SBGN map. StonPy brings new capabilities for storing and analyzing large collections of CellDesigner and SBGN maps using Neo4j and Cypher.





□ SLEMM: million-scale genomic predictions with window-based SNP weighting

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad127/7075542

SLEMM (Stochastic-Lanczos-Expedited Mixed Models) uses stochastic Lanczos REML to estimate variance components and SNP effects on large datasets, combined with window-based SNP weighting. SLEMM is fast enough for million-scale genomic predictions.

SLEMM with SNP weighting had the best overall predictive ability among a variety of genomic prediction methods, including GCTA’s empirical BLUP, BayesR, KAML, and LDAK’s BOLT and BayesR models.





□ scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531861v1

scDeepInsight can directly annotate a query dataset based on a model trained on a reference dataset. It performs preprocessing of the scRNA-seq data, including quality control and integration through batch normalization.

scDeepInsight is a single-cell labeling model based on supervised learning, so a reference dataset is also required. DeepInsight is utilized to convert the processed non-image data into images.





□ A general minimal perfect hash function for canonical k-mers on arbitrary alphabets with an application to DNA sequences

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531845v1

A minimal perfect hash function of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σ^k/2 − 1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation.

The encoding is based on the observation that there are fewer canonical k-mers than there are k-mers in general. A mapping is only required if k-mer x is canonical, i.e., x is lexicographically smaller than or equal to x^−1.
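
The canonicalization step can be illustrated as follows; the dictionary-based ranking is a plain stand-in for the paper's minimal perfect hash into [0, σ^k/2 − 1], which is the actual contribution and is not reproduced here.

_COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return kmer if kmer <= rc else rc

def naive_rank(kmers):
    """Dense ranks of canonical k-mers; a dictionary stand-in for a minimal perfect hash."""
    canon = sorted({canonical(k) for k in kmers})
    return {k: i for i, k in enumerate(canon)}

print(canonical("ACGT"), canonical("TTTT"))            # ACGT AAAA
print(naive_rank(["ACGT", "ACGT", "TTTT", "AAAA"]))    # {'AAAA': 0, 'ACGT': 1}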





□ scBubbletree: quantitative visualization of single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531263v1

scBubbletree, a new scalable method for visualization of scRNA-seq data. The method identifies clusters of cells of similar transcriptomes and visualizes such clusters as “bubbles” at the tips of dendrograms, corresponding to quantitative summaries of cluster properties.

scBubbletree stacks bubble trees with further cluster-associated information. scBubbletree relies on the gap statistic method, and can cluster scRNA-seq data in two ways: by graph-based community detection (GCD) algorithms such as Louvain or Leiden, or by k-means.
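
A minimal sketch of the gap statistic with k-means, using scikit-learn rather than scBubbletree's R implementation and simplifying the reference-set procedure.

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_ref=10, seed=0):
    """Gap statistic for a given k: compare the log within-cluster dispersion of the
    data to its expectation under uniform reference data (simplified sketch)."""
    rng = np.random.default_rng(seed)

    def log_wk(data):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        return np.log(km.inertia_)

    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [log_wk(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_ref)]
    return float(np.mean(ref) - log_wk(X))

# toy usage: three well-separated 2D clusters; k=3 should give the largest gap
X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 0], [0, 6])])
print({k: round(gap_statistic(X, k), 2) for k in (2, 3, 4)})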





□ Panpipes: a pipeline for multiomic single-cell data analysis.

>> https://www.biorxiv.org/content/10.1101/2023.03.11.532085v1

Panpipes, a set of workflows designed to automate the analysis of multimodal single-cell datasets by incorporating widely used Python-based tools to efficiently perform QC, preprocessing, integration, clustering, and reference mapping at scale in the multiomic setting.

Panpipes generates a cluster-matching metric, the Adjusted Rand Index, for global concordance evaluation. Panpipes can aid in building unimodal or multimodal references and enables the user to query multiple references simultaneously using scArches.
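
The concordance metric itself is standard; for example, with scikit-learn (a generic illustration, not Panpipes' code):

from sklearn.metrics import adjusted_rand_score

# Concordance between two clusterings of the same cells (e.g. unimodal vs multimodal)
labels_rna = [0, 0, 1, 1, 2, 2, 2]
labels_multi = [1, 1, 0, 0, 2, 2, 2]
print(adjusted_rand_score(labels_rna, labels_multi))   # 1.0: identical partitions up to relabeling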





□ plasma: Partial LeAst Squares for Multiomics Analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.10.532096v1

plasma, a novel two-step algorithm to find models that predict time-to-event outcomes from multiomics data sets, even in the presence of incomplete data. The resulting components are automatically associated with the outcome.

plasma uses partial least squares (PLS) for both steps, coupling it with Cox regression to learn the single-omics models and with linear regression to combine them. The plasma components are learned in a way that maximizes the covariance between the predictors and the response.
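
The PLS step can be illustrated with scikit-learn's PLSRegression, which learns components maximizing the covariance between an omics block and a response; the Cox-regression step is replaced here by a stand-in risk vector, so this is only a sketch of the idea, not the plasma algorithm.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 200))                       # one omics block: samples x features
risk = X[:, :5].sum(axis=1) + rng.normal(size=80)    # stand-in for Cox-derived linear predictors

pls = PLSRegression(n_components=3)
pls.fit(X, risk)
components = pls.transform(X)                        # sample-level "plasma-like" components
print(components.shape)                              # (80, 3)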





□ eOmics: an R package for improved omics data analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.11.532240v1

eOmics combines an ensemble framework with limma, improving its performance on imbalanced data. It couples a mediation model with WGCNA, so the causal relationship among WGCNA modules, module features, and phenotypes can be found.

eOmics offers novel functional enrichment methods that capture the influence of topological structure on gene set functions. It contains multi-omics clustering and classification functions to facilitate ML tasks, and basic functions such as ANOVA are also available.





□ Biomappings: Prediction and Curation of Missing Biomedical Identifier Mappings

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad130/7077133

Biomappings, a framework for semi-automatically creating and maintaining mappings in a public, version-controlled repository.

Biomappings combines multiple contributions: (i) a "curation cycle" workflow for creating mappings, and (ii) an extensible pipeline for automatically predicting missing mappings between resources and automatically detecting inconsistencies.

Biomappings currently makes available 9,274 curated mappings and 40,691 predicted ones, providing previously missing mappings between widely used identifier resources covering small molecules, cell lines, diseases, and other concepts.





□ fraguracy: overlapping bases in read-pairs from a fragment indicate accuracy.

>> https://github.com/brentp/fraguracy

Many factors can be predictive of the likelihood of an error. Dimensionality is a consideration because, if the data are too sparse, prediction is less reliable. For each combination of factors, fraguracy stores, while iterating over the BAM, the number of errors and the total number of bases in each bin.

fraguracy calculates empirical error rates using the overlapping portions of paired-end reads from the same fragment. This avoids some biases, but it is limited to the (potentially) small percentage of bases that overlap, and it samples fewer bases at the beginning of read 1 and the end of read 2.
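
The core comparison can be sketched on simplified read tuples (start position plus aligned sequence, no indels); fraguracy itself works directly from BAM records and bins errors by combinations of factors.

def overlap_errors(r1, r2):
    """Each read is (reference_start, aligned_sequence) with no indels, for simplicity.
    Returns (mismatching_bases, total_overlapping_bases) within the mate overlap."""
    start = max(r1[0], r2[0])
    end = min(r1[0] + len(r1[1]), r2[0] + len(r2[1]))
    errors = total = 0
    for pos in range(start, end):
        b1 = r1[1][pos - r1[0]]
        b2 = r2[1][pos - r2[0]]
        total += 1
        errors += b1 != b2          # disagreement implies an error on one of the mates
    return errors, total

print(overlap_errors((100, "ACGTACGT"), (104, "ACGAACGT")))   # (1, 4)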





□ Genes2Genes: Gene-level alignment of single cell trajectories informs the progression of in vitro T cell differentiation

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531713v1

Genes2Genes overcomes current limitations and is able to capture sequential matches and mismatches between a reference and a query at single gene resolution, highlighting distinct clusters of genes with varying patterns of gene expression dynamics.

Genes2Genes utilizes a Bayesian, information-theoretic dynamic programming alignment algorithm that accounts for matches, warps, and indels by combining Gotoh's classical biological sequence alignment algorithm with Dynamic Time Warping.





□ GenoPipe: identifying the genotype of origin within (epi)genomic datasets

>> https://www.biorxiv.org/content/10.1101/2023.03.14.532660v1

GenoPipe's three core modules (EpitopeID, DeletionID, and StrainID) were developed to identify major genotypic determinants of cellular identity. GenoPipe can detect genotype perturbations at realistic and practical sequencing depths as defined by ENCODE.

The DeletionID module models the background of a genomic experiment to identify depleted regions of the genome and predict genomic deletions. The StrainID module uses existing SNP/variant-call databases of common cell lines to match the genetic identity inherent to each dataset.

The EpitopeID module identifies the presence and approximate location of specific DNA sequences within the genome. The algorithm functions by first aligning the raw sequencing data (i.e., FASTQ) against a curated DNA sequence database (tagDB) of common protein epitopes.





□ BioConvert: a comprehensive format converter for life sciences

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532455v1

BioConvert aggregates existing software within a single framework, complementing it with original code when needed. It provides a common interface that streamlines the user experience instead of requiring users to learn tens of different tools.

BioConvert supports about 50 formats and 100 direct conversions in areas such as alignment, sequencing, phylogeny, and variant calling. Developers can also use BioConvert as a universal benchmarking framework for evaluating and comparing conversion tools.





□ Fast Approximate IsoRank for Scalable Global Alignment of Biological Networks

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532445v1

A new IsoRank approximation, which exploits the mathematical properties of IsoRank's linear system to solve the problem in quadratic time with respect to the maximum size of the two PPI networks.

A computationally cheaper refinement is proposed to this initial approximation so that the updated result is even closer to the original IsoRank formulation.
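
For reference, the classical IsoRank recurrence that the approximation targets can be written as a power iteration; the sketch below uses dense matrices and a uniform prior, so it illustrates the linear system rather than the paper's fast approximation.

import numpy as np

def isorank(A1, A2, E, alpha=0.8, n_iter=50):
    """Power-iteration sketch of the classical IsoRank recurrence
    R <- alpha * W1 @ R @ W2.T + (1 - alpha) * E,
    where W1, W2 are degree-normalized adjacency matrices and E holds prior
    node-to-node similarities."""
    W1 = A1 / np.maximum(A1.sum(axis=0, keepdims=True), 1)   # column-normalize by degree
    W2 = A2 / np.maximum(A2.sum(axis=0, keepdims=True), 1)
    R = np.full_like(E, 1.0 / E.size, dtype=float)
    for _ in range(n_iter):
        R = alpha * W1 @ R @ W2.T + (1 - alpha) * E
        R /= R.sum()                                          # keep the scores normalized
    return R

# toy example: aligning a 3-node path graph with itself under a uniform prior
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(isorank(A, A, np.full((3, 3), 1 / 9)).round(3))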

In synthetic experiments, they create random graphs using the Erdős–Rényi and Barabási–Albert models, and ask IsoRank to recover the graph isomorphism between the graphs and a random node permutation.





□ IntLIM 2.0: identifying multi-omic relationships dependent on discrete or continuous phenotypic measurements

>> https://academic.oup.com/bioinformaticsadvances/article-abstract/3/1/vbad009/7022005

IntLIM 2.0 uncovers phenotype-dependent linear associations between two types of analytes. IntLIM 2.0 extends IntLIM 1.0 to support generalized analyte measurement data types, continuous phenotypic measurement, covariate correction, model validation and unit testing.

IntLIM 2.0 supports model validation using cross-validation and random permutation models.
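
The phenotype-dependent association idea can be sketched as a linear model with an interaction term, testing whether the analyte-analyte slope changes with phenotype; the data below are simulated, and IntLIM's full model additionally handles covariate correction and model validation (a sketch, assuming statsmodels).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy question: does the gene ~ metabolite slope depend on phenotype?
rng = np.random.default_rng(2)
n = 120
pheno = rng.integers(0, 2, size=n)                    # e.g. tumor vs normal
metab = rng.normal(size=n)
gene = 0.2 * metab + 0.9 * metab * pheno + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"gene": gene, "metabolite": metab, "phenotype": pheno})

fit = smf.ols("gene ~ metabolite * phenotype", data=df).fit()
print(fit.pvalues["metabolite:phenotype"])            # small p-value: phenotype-dependent association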





□ NanoSquiggleVar: A method for direct analysis of targeted variants based on nanopore sequencing signals

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532860v1

NanoSquiggleVar can directly identify targeted variants from the nanopore sequencing electrical signal without requiring base calling, sequence alignment, or downstream variant detection.

In each sequencing iteration, the signal is sliced into fragments by a moving window with a 1-unit step size. Dynamic time warping is then used to compare these signal fragments with the squiggles of the targeted variants. NanoSquiggleVar can only determine the presence of a mutation, not its frequency.
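
The window-and-compare step can be sketched with a plain DTW distance over signal fragments; this is a generic illustration assuming numpy, not NanoSquiggleVar's implementation.

import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two 1-D signals."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def scan(signal, query, step=1):
    """Slide a window of the query's length over the signal (1-unit step, as in the paper)
    and report the best-matching offset by DTW distance."""
    w = len(query)
    dists = [dtw_distance(signal[i:i + w], query) for i in range(0, len(signal) - w + 1, step)]
    return int(np.argmin(dists)), float(min(dists))

sig = np.concatenate([np.zeros(20), np.array([1, 3, 2, 4, 2, 3, 1]), np.zeros(20)])
print(scan(sig, [1, 3, 2, 4, 2, 3, 1]))   # best offset 20, distance 0.0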





□ HiDecon: Accurate estimation of rare cell type fractions from tissue omics data via hierarchical deconvolution

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532820v1

HiDecon, a penalized approach with constraints from both “parent” and “children” cell types to make full use of a hierarchical tree structure. The hierarchical tree is readily available from well-studied cell lineages or can be learned from hierarchical clustering of scRNA-seq.

The basic intuition of HiDecon is that a summation relationship exists between the estimation results of adjacent layers. HiDecon implements sum-constraint penalties from the upper and lower layers to aggregate estimates across layers for more accurate cellular fraction estimates.






□ Implementing Dynamic Time Warping (DTW) with neural networks and analyzing single-cell RNA data involves creating a custom model architecture, here drafted with GPT-4.




Yubais RT

In the old view of AI, the idea was to first build something like "intelligence itself" inside a computer and then separately build an interface for humans to converse with it. Looking at the current situation, though, I suspect that language, which was supposed to be merely the interface, already had something intelligence-like embedded in it.