lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

melancholic.

2023-11-29 20:47:09 | Photo

(iPhone 15 Pro Max)




□ Galileo Galilei × Porter Robinson『Circle Game (ANOHANA Ver.)』

Porter Robinson, one of the top DJs in the American Future Bass/Electronica scene, re-recorded a famous song by a Japanese rock band that also served as the theme of a popular anime. In a live mash-up with his own track "Something Comforting", the closing chorus rings out poignantly.

Godzilla Minus One

2023-11-28 20:58:58 | Film


『Godzilla Minus One』

TOHO (2023)
Directed by Takashi Yamazaki
Music by Naoki Satô
Cinematography by Kôzô Shibasaki

The vector of brutality: it is a karma that every creature and human being inherently carries, and there is no escaping that circle. Self-sacrifice cannot bring anything to an end. Terror and slaughter regenerate in ever-changing guises, and there is nothing to do but confront them again and again and keep moving forward. The heat ray has Death Star-level destructive power, and the sense of despair is overwhelming.

Linkage.

2023-11-23 23:26:24 | Diary / Essays / Columns


Generalization is the act of extrapolating the vector of an argument onto the attributes and dynamic structure of an issue, and it is only effective when considerable probability and objectivity are involved. The key is to clarify who is speaking to whom and what is being related to what, and to quantify the sample size.

Clandestine.

2023-11-22 22:22:22 | Science News




□ scLKME: A Landmark-based Approach for Generating Multi-cellular Sample Embeddings from Single-cell Data

>> https://www.biorxiv.org/content/10.1101/2023.11.13.566846v1

scLKME, a landmark-based approach that uses kernel mean embedding to compute vector representations for samples profiled with single-cell technologies. scLKME sketches or sub-selects a limited set of cells across samples as landmarks.

scLKME then maps each sample's cell distribution into a reproducing kernel Hilbert space (RKHS) using kernel mean embedding. The final embeddings are generated by evaluating these transformed distributions at the sampled landmarks, yielding a sample-by-landmark matrix.
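
Below is a minimal NumPy/scikit-learn sketch of this landmark-based kernel mean embedding idea, not the scLKME implementation itself; the RBF kernel, gamma value, and function names are assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def landmark_kme(samples, landmarks, gamma=1.0):
    """Embed each sample (a cells-by-features matrix) as the mean RBF-kernel
    evaluation between its cells and a shared set of landmark cells,
    yielding a sample-by-landmark matrix."""
    rows = []
    for X in samples:                                  # X: (n_cells_i, n_features)
        K = rbf_kernel(X, landmarks, gamma=gamma)      # (n_cells_i, n_landmarks)
        rows.append(K.mean(axis=0))                    # kernel mean embedding at landmarks
    return np.vstack(rows)                             # (n_samples, n_landmarks)

# Toy usage: 3 samples with 50 cells and 10 markers each, 20 landmark cells.
rng = np.random.default_rng(0)
samples = [rng.normal(size=(50, 10)) for _ in range(3)]
landmarks = rng.normal(size=(20, 10))
print(landmark_kme(samples, landmarks).shape)          # (3, 20)
```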





□ Cellsig plug-in enhances CIBERSORTx signature selection for multi-dataset transcriptomes with sparse multilevel modelling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad685/7413172

cellsig is a Bayesian multilevel generalised linear model tailored to RNA sequencing data. It uses joint hierarchical modelling to preserve the uncertainty of the mean-variability association of the gene-transcript abundance.

cellsig estimates the heterogeneity for cell-type transcriptomes, modelling population and group effects. They organised cell types into a differentiation hierarchy.

For each node of the hierarchy, cellsig allows for missing information due to partial gene overlap across samples (e.g. missing gene-sample pairs). The generated dataset is then input to CIBERSORTx to generate the transcriptional signatures.





□ RabbitKSSD: accelerating genome distance estimation on modern multi-core architectures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad695/7424710

RabbitKSSD adopts the Kssd algorithm for estimating the similarities between genomes. In order to accelerate time-consuming sketch generation and distance computation, RabbitKSSD relies on a highly-tuned task partitioning strategy for load balancing and efficiency.

In the RabbitKSSD pipeline, the genome files undergo parsing to extract k-mers, which are subsequently used to generate sketches. Following this, the integrated pipeline computes pairwise distances among these genome sketches by retrieving the unified indexed dictionary.





□ MIXALIME: Statistical framework for calling allelic imbalance in high-throughput sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565968v1

MIXALIME, a versatile framework for identifying ASEs from different types of high-throughput sequencing data. MIXALIME provides an end-to-end workflow from read alignments to statistically significant ASE calls, accounting for copy-number variation and read mapping biases.

MIXALIME offers multiple scoring models, from the simplest binomial to the beta negative binomial mixture, can incorporate background allelic dosage, and account for read mapping bias.

MIXALIME estimates the distribution parameters from the dataset itself, can be applied to sequencing experiments of various designs, and does not require dedicated control samples.





□ Mdwgan-gp: data augmentation for gene expression data based on multiple discriminator WGAN-GP

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05558-9

MDWGAN-GP, a generative adversarial network model with multiple discriminators, is proposed. In addition, a novel method is devised for enriching training samples based on a linear graph convolutional network.

MDWGAN-GP-C (resp. MDWGAN-GP-E) denotes the model adopting only the cosine distance (resp. Euclidean distance). Multiple discriminators are adopted to prevent mode collapse by providing more feedback signals to the generator.





□ ReGeNNe: Genetic pathway-based deep neural network using canonical correlation regularizer for disease prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad679/7420211

ReGeNNe, an end-to-end deep learning framework incorporating the biological clustering of genes through pathways and further capturing the interactions between pathways sharing common genes through Canonical Correlation Analysis.

ReGeNNe’s Canonical Correlation based neural network modeling captures linear/nonlinear dependencies between pathways, projects the features from genetic pathways into the kernel space, and ultimately fuses them together in an efficient manner for disease prediction.





□ WFA-GPU: Gap-affine pairwise read-alignment using GPUs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad701/7425447

WFA-GPU, a GPU-accelerated implementation of the Wavefront Alignment algorithm for exact gap-affine pairwise sequence alignment. It combines inter-sequence and intra-sequence parallelism to speed up the alignment computation.

A heuristic variant of WFA-GPU further improves its performance. WFA-GPU uses a bit-packed encoding of DNA sequences with 2 bits per base, which reduces execution divergence and the total number of instructions executed, translating into faster execution times.
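
As an aside, the 2-bit encoding itself is simple; the Python sketch below (CPU-side, purely illustrative and unrelated to the WFA-GPU code) packs four bases per byte.

```python
# Illustrative 2-bit DNA packing (A=0, C=1, G=2, T=3), four bases per byte.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_2bit(seq: str) -> bytearray:
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= CODE[base] << (2 * (i % 4))
    return packed

def unpack_2bit(packed: bytearray, n: int) -> str:
    return "".join(BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n))

seq = "GATTACA"
assert unpack_2bit(pack_2bit(seq), len(seq)) == seq
```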





□ BELB: a Biomedical Entity Linking Benchmark

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad698/7425450

BELB, a Biomedical Entity Linking Benchmark providing access in a unified format to 11 corpora linked to 7 knowledge bases and 6 entity types: gene, disease, chemical, species, cell line and variant. BELB reduces preprocessing overhead in testing BEL systems on multiple corpora.

Using BELB they perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models.

Results of neural approaches do not transfer across entity types, with specialized rule-based systems still being the overall best option on entity types not explored by neural approaches, namely genes and variants.





□ A new paradigm for biological sequence retrieval inspired by natural language processing and database research

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565984v1

This benchmarking study comparing the quality of sequence retrieval between BLAST and the HYFT methodology shows that BLAST is able to retrieve more distant homologous sequences with low percent identity than the HYFT-based search.

HYFT synonyms increase the recall. The HYFT methodology is extremely scalable, as it does not rely on sequence alignment to find similar sequences but uses a parsing-sorting-matching scheme. HYFT-based indexing is a solution to biological sequence retrieval in a Big Data context.





□ ORCA: OmniReproducibleCellAnalysis: a comprehensive toolbox for the analysis of cellular biology data.

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565961v1

OmniReproducibleCellAnalysis (ORCA), a new Shiny Application based in R, for the semi-automated analysis of Western Blot (WB), Reverse Transcription-quantitative PCR (RT-qPCR), Enzyme-Linked ImmunoSorbent Assay (ELISA), Endocytosis and Cytotoxicity experiments.

ORCA allows users to upload raw data and results directly to the Harvard Dataverse data repository, a valuable tool for promoting transparency and data accessibility in scientific research.





□ TBtools-II: A “one for all, all for one” bioinformatics platform for biological big-data mining

>> https://www.cell.com/molecular-plant/fulltext/S1674-2052(23)00281-2

TBtools-II has a plugin mode to better meet personalized data-analysis needs. Although there are methods for quickly packaging command-line tools, such as PyQt, wxPython, and Perl/Tk, they often require users to be proficient in a programming language.

TBtools-II simplifies this process with its plugin “CLI Program Wrapper Creator”, making it easy for users to develop plugins in a standardized manner.

TBtools-II uses SSR Miner for the rapid identification of SSR (Simple Sequence Repeat) loci at the whole-genome level. To compare two genome sequences of two species or two haploids, users can also apply the “Genome VarScan” plugin to quickly identify structural variation regions.





□ A model-based clustering via mixture of hierarchical models with covariate adjustment for detecting differentially expressed genes from paired design

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05556-x

A novel mixture of hierarchical models with covariate adjustment for identifying differentially expressed transcripts using high-throughput whole-genome data from a paired design. In their models, the three gene groups are allowed to have different covariate coefficients.

In the future, they plan to try a hybrid algorithm combining DPSO (Discrete Particle Swarm Optimization) and the EM approach to improve global search performance.





□ WIMG: WhatIsMyGene: Back to the Basics of Gene Enrichment

>> https://www.biorxiv.org/content/10.1101/2023.10.31.564902v1

The WhatIsMyGene database (WIMG) is likely the single largest compendium of transcriptomic and microRNA perturbation data. The database also houses voluminous proteomic, cell-type clustering, lncRNA, and epitranscriptomic (etc.) data.

WIMG generally outperforms in the simple task of reflecting back to the user known aspects of the input set (cell type, the type of perturbation, species, etc.), enhancing confidence that unknown aspects of the input may also be revealed in the output.

The WIMG database contains 160 lists based on WGCNA clustering. Typically, studies that utilize this procedure involve single-cell analysis, requiring large matrices to generate reliable gene-gene co-expression patterns.






□ SillyPutty: Improved clustering by optimizing the silhouette width

>> https://www.biorxiv.org/content/10.1101/2023.11.07.566055v1

SillyPutty is a heuristic algorithm based on the concept of silhouette widths. Its goal is to iteratively optimize the cluster assignments to maximize the average silhouette width.

SillyPutty starts with any given set of cluster assignments, either randomly chosen, or obtained from other clustering methods. SillyPutty enters a loop where it iteratively refines the clustering. The algorithm calculates the silhouette widths for the current clustering.

SillyPutty identifies the data point with the lowest silhouette width. The algorithm reassigns this data point to the cluster to which it is closest. The loop continues until all data points have non-negative silhouette widths, or an early termination condition is reached.
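
A minimal sketch of that loop using scikit-learn's silhouette widths; the rule for "closest cluster" (smallest mean distance) and the iteration cap are assumptions, and this is not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_samples

def silly_putty_like(D, labels, max_iter=1000):
    """Iteratively reassign the point with the lowest silhouette width to its
    nearest cluster (smallest mean distance) until all widths are >= 0.
    D: square pairwise-distance matrix; labels: initial cluster assignment."""
    labels = labels.copy()
    for _ in range(max_iter):
        s = silhouette_samples(D, labels, metric="precomputed")
        worst = int(np.argmin(s))
        if s[worst] >= 0:
            break
        dists = {}
        for c in np.unique(labels):
            mask = labels == c
            mask[worst] = False                 # exclude the point itself
            if mask.any():
                dists[c] = D[worst, mask].mean()
        labels[worst] = min(dists, key=dists.get)
    return labels

# Usage with random data and a random initial assignment.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
D = squareform(pdist(X))
print(silly_putty_like(D, rng.integers(0, 3, size=100)))
```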





□ flowVI: Flow Cytometry Variational Inference

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566661v1

flowVI, Flow Cytometry Variational Inference, an end-to-end multimodal deep generative model designed for the comprehensive analysis of multiple MPC panels from various origins.

flowVI learns a joint probabilistic representation of the multimodal cytometric measurements, marker intensity and light scatter, that effectively captures and adjusts for individual noise variances, technical biases inherent to each modality, and potential batch effects.





□ GeneToCN: an alignment-free method for gene copy number estimation directly from next-generation sequencing reads

>> https://www.nature.com/articles/s41598-023-44636-z

GeneToCN counts the frequencies of gene-specific k-mers in FASTQ files and uses this information to infer copy number of the gene. GeneToCN allows estimating copy numbers for individual samples without the requirement of cohort data.

The GeneToKmer script has the flexibility either to treat gene copies separately or to define all 3 copies as a single gene. In the first case, GeneToCN uses the k-mers specific to each copy, whereas in the latter case it uses only k-mers that are present in all 3 copies.
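
A rough Python sketch of the underlying k-mer-frequency idea (count gene-specific k-mers in reads, then scale their median frequency by the per-copy coverage from a region of known copy number); the diploid baseline and function names are assumptions, not GeneToCN's code.

```python
import statistics
from collections import Counter

def kmer_frequencies(reads, kmers, k=25):
    """Count occurrences of a set of gene-specific k-mers in sequencing reads."""
    kmers = set(kmers)
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in kmers:
                counts[km] += 1
    return counts

def estimate_copy_number(gene_freqs, baseline_freqs, baseline_copies=2):
    """Median gene k-mer frequency divided by the per-copy coverage estimated
    from k-mers of a region assumed present in `baseline_copies` copies."""
    per_copy = statistics.median(baseline_freqs) / baseline_copies
    return statistics.median(gene_freqs) / per_copy
```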





□ ENCODE-rE2G: An encyclopedia of enhancer-gene regulatory interactions in the human genome

>> https://www.biorxiv.org/content/10.1101/2023.11.09.563812v1

ENCODE-rE2G, a new predictive model that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation.

Using the ENCODE-rE2G model, they build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks and identifies differences in the functions of genes that have more or less complex regulatory landscapes.





□ SEAMoD: A fully interpretable neural network for cis-regulatory analysis of differentially expressed genes

>> https://www.biorxiv.org/content/10.1101/2023.11.09.565900v1

SEAMoD (Sequence-, Expression-, and Accessibility-based Motif Discovery), implements a fully interpretable neural network to relate enhancer sequences to differential gene expression.

SEAMoD can make use of epigenomic information provided in the form of candidate enhancers for each gene, with associated scores reflecting local chromatin accessibility, and automatically search for the most promising enhancer among the candidates.

SEAMoD is a multi-task learner capable of examining DE associated with multiple biological conditions, such as several differentiated cell types compared to a progenitor cell type, thus sharing information across the different conditions in its search for underlying TF motifs.





□ Optimal control of gene regulatory networks for morphogen-driven tissue patterning

>> https://www.sciencedirect.com/science/article/pii/S2405471223002922

An alternative framework using optimal control theory to tackle the problem of morphogen-driven patterning: intracellular signaling is derived as the control strategy that guides cells to the correct fate while minimizing a combination of signaling levels and time.

This approach recovers observed properties of patterning strategies and offers insight into design principles that produce timely, precise, and reproducible morphogen patterning. The framework can be combined with dynamical Waddington-like landscape models of cell-fate decisions.





□ OrthoRep: Continuous evolution of user-defined genes at 1-million-times the genomic mutation rate

>> https://www.biorxiv.org/content/10.1101/2023.11.13.566922v1

OrthoRep, a new orthogonal DNA replication system that durably hypermutates chosen genes at a rate of over 10⁻⁴ substitutions per base in vivo.

Using OrthoRep, the authors obtained thousands of unique multi-mutation sequences, with many pairs over 60 amino acids apart (over 15% divergence), revealing known and new factors influencing enzyme adaptation.

The fitness of evolved sequences was not predictable by advanced machine learning models trained on natural variation. OrthoRep systems would take 100 generations (8-12 days for the yeast host of OrthoRep) just to sample an average of 1 new mutation in a typical 1 kb gene.





□ RUBic: rapid unsupervised biclustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05534-3

RUBic converts the expression data into binary form using a left-truncated mixture Gaussian (LTMG) model, finds biclusters using a novel encoding and template-searching strategy, and generates biclusters in two modes, base and flex.

RUBic generates maximal biclusters in base mode, while flex mode yields fewer but more biologically significant biclusters. The average maximum match scores of all biclusters generated by RUBic with respect to the BiBit algorithm, and vice versa, are exactly the same.





□ EUGENe: Predictive analyses of regulatory sequences

>> https://www.nature.com/articles/s43588-023-00544-w

EUGENe (Elucidating the Utility of Genomic Elements with Neural nets) transforms sequence data from many common file formats, trains diverse model architectures, and evaluates and interprets model behavior.

EUGENe provides flexible functions for instantiating common blocks and towers that are composed of heterogeneous sets of layers. EUGENe supports customizable fully connected, convolutional, recurrent, and hybrid architectures that can be instantiated from single function calls.





□ pUMAP: Robust parametric UMAP for the analysis of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.11.14.567092v1

pUMAP is capable of efficiently projecting future data onto the same space as the training data. The authors examine the effect of negative-sample strength on the overall structure of the low-dimensional embedding produced by the trained pUMAP on pancreatic data.

pUMAP uses neural networks to parameterize complex functions that map gene expression data onto a lower-dimensional space of arbitrary dimension. pUMAP constructs a KNN graph in the high-dimensional space and computes a weight for each edge between points that scales with their local distance.





□ Hierarchical annotation of eQTLs enables identification of genes with cell-type divergent regulation

>> https://www.biorxiv.org/content/10.1101/2023.11.16.567459v1

A network-based hierarchical model to identify cell-type-specific eQTLs in complex tissues with closely related and nested cell types. This model extends the existing CellWalkR model to take a cell-type hierarchy as input in addition to cell-type labels and scATAC-seq data.

Briefly, the cell type hierarchy is taken as prior knowledge, and it is implemented as edges between leaf nodes that represent specific cell types and internal nodes that represent broader cell types higher in the hierarchy.

The cell type nodes are then connected to nodes representing cells based on how well marker genes correspond to each cell's chromatin accessibility, and cells are connected to each other based on the similarity of their genome-wide chromatin accessibility.

A random-walk-with-restarts model then quantifies the influence of each node on every other node. In particular, this includes the probability that a walk starting at each cell node ends at each cell-type node, as well as at each internal node representing portions of the cell-type hierarchy.





□ A Method for Calculating the Least Mutated Sequence in DNA Alignment Based on Point Mutation Sites

>> https://www.biorxiv.org/content/10.1101/2023.11.14.567125v1

The least-mutated-sequence method calculates the transition/transversion ratio for each sequence in a DNA alignment. This ratio can be used as a rough measure for estimating the selection pressure on, and evolutionary stability of, a sequence.

By the parsimony principle, the least mutated sequence should be the phylogenetic root of all the other sequences in an alignment. The method is non-parametric and uses the point-mutation sites in a DNA alignment for its calculation.

Under random sampling, the method needs only a very small proportion of sequences to find the root sequence, is quite robust against reverse and saturation mutations, and its accuracy rises as the number of sampled sequences increases.
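
A small sketch of the transition/transversion computation against a majority-rule consensus; the choice of consensus as the reference is an assumption made for illustration, not necessarily the paper's exact procedure.

```python
from collections import Counter

PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def consensus(alignment):
    """Majority-rule consensus of an equal-length DNA alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))

def titv_ratio(seq, ref):
    """Transition/transversion ratio of one aligned sequence vs. a reference."""
    ti = tv = 0
    for a, b in zip(ref, seq):
        if a == b or "-" in (a, b):
            continue
        if ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES):
            ti += 1          # transition: purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1          # transversion
    return ti / tv if tv else float("inf")

aln = ["ACGTACGT", "ACGTACGA", "ACGCACGT", "GCGTACGT"]
ref = consensus(aln)
print({i: titv_ratio(s, ref) for i, s in enumerate(aln)})
```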





□ HapHiC: Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes

>> https://www.biorxiv.org/content/10.1101/2023.11.18.567668v1

HapHiC, a Hi-C-based scaffolding tool that enables allele-aware chromosome scaffolding of autopolyploid assemblies without reference genomes. They conducted a comprehensive investigation into the factors that may impede the allele-aware scaffolding of genomes.

HapHiC conducts contig ordering and orientation by integrating algorithms from 3D-DNA and ALLHiC, and employs a "divide-and-conquer" strategy to isolate the negative impacts of these factors between the two steps.





□ DisCoPy: the Hierarchy of Graphical Languages in Python

>> https://arxiv.org/abs/2311.10608

DisCoPy is a Python toolkit for computing w/ monoidal categories. It comes w/ two flexible data structures for string diagrams: the first one for planar monoidal categories based on lists of layers, the second one for symmetric monoidal categories based on cospans of hypergraphs.

Algorithms for functor application then make it possible to translate string diagrams into code for numerical computation, be it differentiable, probabilistic, or quantum.





□ SpaGRN: investigating spatially informed regulatory paths for spatially resolved transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.11.19.567673v1

SpaGRN, a statistical framework for predicting the comprehensive intracellular regulatory network underlying spatial patterns by integrating spatial expression profiles with prior knowledge on regulatory relationships and signaling paths.

SpaGRN identifies spatiotemporal variations in specific regulatory patterns, delineating the cascade of events from receptor stimulation to downstream transcription factors and targets, and revealing synergistic regulation mechanisms during organogenesis.





□ Snapper: high-sensitive detection of methylation motifs based on Oxford Nanopore reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad702/7429397

Snapper, a new highly sensitive approach to extract methylation motif sequences based on a greedy motif selection algorithm. It collects normalized signal levels for each k-mer from multi-fast5 files for both native and WGA samples.

The algorithm directly compares the collected signal distributions using the Kolmogorov-Smirnov test to select k-mers that most likely contain a modified base. The result of the first stage is an exhaustive set of all potentially modified k-mers.

Next, the greedy motif enrichment algorithm implemented in Snapper iteratively extracts potential methylation motifs and calculates corresponding motif confidence levels.
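
The k-mer filtering step can be pictured with a two-sample Kolmogorov-Smirnov test from SciPy, as in the sketch below; the significance cutoff and data layout are assumptions, not Snapper's actual code.

```python
from scipy.stats import ks_2samp

def candidate_modified_kmers(native_signals, wga_signals, alpha=1e-3):
    """native_signals / wga_signals: dict mapping k-mer -> list of normalized
    current levels. Keep k-mers whose two distributions differ significantly."""
    hits = {}
    for kmer, native in native_signals.items():
        wga = wga_signals.get(kmer)
        if not wga:
            continue
        stat, p = ks_2samp(native, wga)
        if p < alpha:
            hits[kmer] = (stat, p)
    return hits
```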





□ Centre: A gradient boosting algorithm for Cell-type-specific ENhancer-Target pREdiction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad687/7429396

CENTRE is a machine learning framework that predicts enhancer target interactions in a cell-type-specific manner, using only gene expression and ChIP-seq data for three histone modifications for the cell type of interest.

CENTRE extracts all cCRE-ELS within 500 kb of target genes and computes cell-type-specific and generic features for all potential enhancer-target (ET) pairs. ET feature vectors are then fed to a pre-trained XGBoost classifier, which assigns an interaction probability to each ET pair.
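
The scoring step amounts to something like the hedged sketch below: feeding ET feature vectors to a pre-trained XGBoost classifier. The file names, feature layout, and 0.5 threshold are illustrative assumptions, not CENTRE's own artifacts.

```python
import numpy as np
import xgboost as xgb

# Hypothetical paths; the real CENTRE model and feature matrix are not shown here.
clf = xgb.XGBClassifier()
clf.load_model("centre_pretrained.json")

et_features = np.load("et_pair_features.npy")     # (n_pairs, n_features)
proba = clf.predict_proba(et_features)[:, 1]      # P(interaction) for each ET pair
calls = proba > 0.5                               # illustrative decision threshold
```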





□ CellSAM: A Foundation Model for Cell Segmentation

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567630v1

CellSAM, a foundation model for cell segmentation that generalizes across diverse cellular imaging data. CellSAM builds on top of the Segment Anything Model (SAM) by developing a prompt engineering approach to mask generation.

CellFinder, a transformer-based object detector built on the Anchor DETR framework, automatically detects cells and prompts SAM to generate segmentations.





□ Extraction and quantification of lineage-tracing barcodes with NextClone and CloneDetective

>> https://www.biorxiv.org/content/10.1101/2023.11.19.567755v1

NextClone and CloneDetective, an integrated highly scalable Nextflow pipeline and R package for efficient extraction and quantification of clonal barcodes from scRNA-seq data and DNA sequencing data tagged with lineage-tracing barcodes.

NextClone is particularly engineered for high scalability to take full advantage of the vast computational resources offered by HPC platforms. CloneDetective is an R package for interrogating clonal abundance data generated using lineage-tracing protocols.




Desiderio.

2023-11-22 21:09:09 | Science News




□ DARDN: Identifying transcription factor binding motifs from long DNA sequences using multi-CNNs and DeepLIFT

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567502v1

DARDN (DNAResDualNet), a computational method that utilizes convolutional neural networks (CNNs) coupled with feature discovery using DeepLIFT, for identifying DNA sequence features that can differentiate two sets of lengthy DNA sequences.

DARDN employs two CNNs with distinct initial kernel sizes for DNA sequence classification, with residual connections to preserve complex relationships between distant DNA regions. DARDN computes the binary cross-entropy (BCE) loss between the predicted probability and the true sequence label.
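
For reference, the BCE objective mentioned above, written out in NumPy (a generic definition, not DARDN's training code):

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# e.g. predicted class probabilities for three sequences vs. their true labels
print(binary_cross_entropy(np.array([0.9, 0.2, 0.7]), np.array([1, 0, 1])))
```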





□ Lamian: A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples

>> https://www.nature.com/articles/s41467-023-42841-y

Lamian uses the harmonized data to construct a pseudotemporal trajectory and then quantifies the uncertainty of tree branches using bootstrap resampling. The cluster-based minimum spanning tree (cMST) approach described in TSCAN is used to construct a pseudotemporal trajectory.

Lamian will automatically enumerate all pseudotemporal paths and branches. Lamian first identifies variation in tree topology across samples and then assesses if there are differential topological changes associated with sample covariates.

Lamian estimates tree topology stability and accurately detects differential tree topology. Lamian uses repeated bootstrap sampling of cells along the branches to calculate a detection rate. Lamian comprehensively detects differential pseudotemporal GE and cell density.





□ GraphHiC: Improving Hi-C contact matrices using genome graphs

>> https://www.biorxiv.org/content/10.1101/2023.11.08.566275v1

A novel problem objective formalizes the inference problem: choosing the best source-to-sink path in the directed acyclic graph that optimizes the confidence of TAD inference. Optimizing the objective is NP-complete, a complexity that persists even with directed acyclic graphs.

They propose a novel greedy heuristic for the problem and theoretically show that, under a set of relaxed assumptions, the heuristic finds the optimal path with high probability. They also develop the first complete graph-based Hi-C processing pipeline.






□ GraphTar: applying word2vec and graph neural networks to miRNA target prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05564-x

GraphTar, a new target prediction method that uses a novel graph-based representation to reflect the spatial structure of the miRNA–mRNA duplex. Unlike existing approaches, GraphTar uses the word2vec method to accurately encode RNA sequence information.

GraphTar uses a graph neural network classifier that can accurately predict miRNA–mRNA interactions based on graph representation learning. GraphTar segments the sequences of both the mRNA and the miRNA’s Minimal Binding Site (MBS) into triplets.





□ RNAkinet: Deep learning and direct sequencing of labeled RNA captures transcriptome dynamics

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567581v1

RNAkinet, a computationally efficient, convolutional, and recurrent neural network (NN) that identifies individual 5EU-modified RNA molecules following direct RNA-Seq.

RNAkinet generalizes to sequences from unique experimental settings, cell types, and species, and accurately quantifies RNA kinetic parameters from single-time-point experiments.

RNAkinet can analyze entire experiments in hours rather than the days required by nano-ID, and predicts the modification status of RNA molecules directly from the raw nanopore signal without basecalling or reference-sequence alignment.





□ Med-PaLM 2: Genetic Discovery Enabled by A Large Language Model

>> https://www.biorxiv.org/content/10.1101/2023.11.09.566468v1

Med-PaLM 2 is a recently developed medically aligned LLM that was fine-tuned using high quality biomedical text corpora and was aligned using clinician feedback.

Despite these advances and the large volume of biomedical and scientific knowledge encoded within LLMs, it remains to be determined if LLMs can be used to generate novel hypotheses that facilitate genetic discovery.

Med-PaLM 2 uncovers gene-phenotype associations: it correctly responded to free-text queries about potential sets of candidate genes and could identify a novel causative genetic factor for an important biomedical trait.





□ ESICCC as a systematic computational framework for evaluation, selection, and integration of cell-cell communication inference methods

>> https://genome.cshlp.org/content/33/10/1788.full

ESICCC, a systematic benchmark framework to evaluate 18 ligand-receptor (LR) inference methods and five ligand/receptor-target inference methods.

Regarding accuracy evaluation, RNAMagnet, CellChat, and scSeqComm emerge as the three best-performing methods for intercellular ligand-receptor inference based on scRNA-seq data, whereas stMLnet and HoloNet are the best methods for predicting ligand/receptor-target regulation.





□ EPIK: Precise and scalable evolutionary placement with informative k-mers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad692/7425449

IPK (Inference of Phylo-K-mers), a tool for efficient computation of phylo-k-mers. IPK improves the running times of the phylo-k-mer construction step by up to two orders of magnitude. It reduces large phylo-k-mer collections with little or no loss in placement accuracy.

EPIK (Evolutionary Placement with Informative K-mers), an optimized parallel implementation of placement with filtered phylo-k-mers. EPIK substantially outperforms its predecessor. EPIK can place millions of short queries on a single thread in a matter of minutes or hours.





□ syntenyPlotteR: a user-friendly R package to visualize genome synteny, ideal for both experienced and novice bioinformaticians

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbad161/7382206

syntenyPlotteR, an R package specifically designed to plot syntenic relationships between genomes, allowing the clear identification of both inter- and intra-chromosomal rearrangements.

As with the Evolution Highway plots, regions that either do not align or were not assembled in the comparative species are depicted as uncoloured regions of the reference chromosomes.





□ BELMM: Bayesian model selection and random walk smoothing in time-series clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad686/7420213

BELMM (Bayesian Estimation of Latent Mixture Models): a flexible framework for analyzing, clustering, and modelling time-series data in a Bayesian setting. The framework is built on mixture modelling.

BELMM selects the most plausible model and the number of mixture components using reversible-jump Markov chain Monte Carlo. It assigns the time series to clusters based on their similarity to cluster-specific trend curves determined by the latent random-walk process.





□ EMVC-2: An efficient single-nucleotide variant caller based on expectation maximization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad681/7420212

EMVC-2 employs a multi-class ensemble classification approach based on the expectation-maximization (EM) algorithm that infers at each locus the most likely genotype from multiple labels provided by different learners.

EMVC-2 uses a Decision Tree Classifier (DTC) to filter the untrue SNV candidates identified in the first step. A DTC is chosen because models based on decision trees have been shown to discriminate well between true and false variant calls in similar settings.






□ GexMolGen: Cross-modal Generation of Hit-like Molecules via Foundation Model Encoding of Gene Expression Signatures

>> https://www.biorxiv.org/content/10.1101/2023.11.11.566725v1

GexMolGen (Gene Expression-based Molecule Generator), built on the foundation model scGPT, generates hit-like molecules from gene expression differences. GexMolGen designs molecules that can induce the required transcriptome profile.

The molecules generated by GexMolGen exhibit a high similarity to known gene inhibitors. GexMolGen outperforms the cosine similarity method. This indicates that the model generates more molecular fragments and feature keys that are similar to the target molecules.





□ Methyl-TWAS: A powerful method for in silico transcriptome-wide association studies (TWAS) using long-range DNA methylation

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566586v1

Methyl-TWAS predicts epigenetically regulated expression (eGReX), which incorporates genetically regulated (GReX) and environmentally regulated expression, trait-altered expression, and tissue-specific expression to identify DEGs that could not be identified by genotype-based methods.

Methyl-TWAS incorporates both cis- and trans- CpGs, including enhancers, promoters, transcription factors, and miRNA regions to identify DEGs that would be missed using cis-DNA methylation-based methods.





□ GTExome: Modeling commonly expressed missense mutations in the human genome

>> https://www.biorxiv.org/content/10.1101/2023.11.14.567143v1

GTExome greatly simplifies the process of studying the three-dimensional structures of proteins containing missense mutations that are critical to understanding human health.

In contrast to current state-of-the-art methods, users with no external software or specialized training can rapidly produce three-dimensional structures of any possible mutation in nearly any protein in the human exome.





□ Nunchaku: Optimally partitioning data into piece-wise contiguous segments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad688/7421911

Nunchaku, a statistically rigorous, Bayesian approach to infer the optimal partitioning of a data set not only into contiguous piece-wise linear segments, but also into contiguous segments described by linear combinations of arbitrary basis functions.

Nunchaku provides a general solution to the problem of identifying discontinuous change points. The nunchaku algorithm identifies the linear range using basis functions that generate straight lines and an unknown measurement error.

Two linear segments are optimal, and the one of interest, where optical density (OD) is proportional to the number of cells, is the segment beginning at the smallest OD. This segment also has the highest coefficient of determination, R².
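
Not nunchaku's Bayesian machinery, but a least-squares toy version of the same idea: choose the changepoint between two straight-line segments and report each segment's R². All names and data here are illustrative.

```python
import numpy as np

def r_squared(x, y):
    coef = np.polyfit(x, y, 1)
    resid = y - np.polyval(coef, x)
    return 1.0 - resid.var() / y.var()

def best_two_segment_split(x, y, min_pts=3):
    """Exhaustively pick the changepoint minimizing the total residual sum of
    squares of two independent straight-line fits."""
    best_c, best_sse = None, np.inf
    for c in range(min_pts, len(x) - min_pts):
        sse = 0.0
        for xs, ys in ((x[:c], y[:c]), (x[c:], y[c:])):
            coef = np.polyfit(xs, ys, 1)
            sse += float(np.sum((ys - np.polyval(coef, xs)) ** 2))
        if sse < best_sse:
            best_c, best_sse = c, sse
    return best_c

# Toy calibration-like data: linear at low x, flattening at high x.
x = np.linspace(0, 1, 40)
y = np.where(x < 0.5, 2 * x, 1.0 + 0.2 * (x - 0.5)) + np.random.default_rng(0).normal(0, 0.02, 40)
c = best_two_segment_split(x, y)
print(c, r_squared(x[:c], y[:c]), r_squared(x[c:], y[c:]))
```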





□ Benchmarking multi-omics integration algorithms across single-cell RNA and ATAC data

>> https://www.biorxiv.org/content/10.1101/2023.11.15.564963v1

Benchmarking 12 methods in three categories: integration methods designed for paired datasets (scMVP, MOFA+); paired-guided integration methods (MultiVI, Cobolt); and methods for both paired and unpaired datasets (scDART, UnionCom, MMD-MA, scJoint, Harmony, Seurat v3, LIGER, and GLUE).

GLUE would be the best choice, followed by MultiVI, and these two methods are also the best choices for trajectory conservation. If one focuses on omics mixing, scDART, LIGER, and Seurat are worth a try; for cell-type conservation, MOFA+ and scMVP could be considered.





□ DeepLocRNA: An Interpretable Deep Learning Model for Predicting RNA Subcellular Localization with domain-specific transfer-learning

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567519v1

DeepLocRNA, an RNA localization prediction tool based on fine-tuning of a multi-task RBP-binding prediction method, which was trained to predict the signal of a large cohort of eCLIP data at single nucleotide resolution.

DeepLocRNA transfers the learned RBP-binding information to the downstream localization prediction and robustly predicts localization. Functional motifs can be extracted for model interpretation from the integrated gradients (IG) scores across the four nucleotide dimensions.





□ PQVD: Diffusion in a quantized vector space generates non-idealized protein structures and predicts conformational distributions

>> https://www.biorxiv.org/content/10.1101/2023.11.18.567666v1

PVQD (protein vector quantization and diffusion) uses a graph-based Geometry Vector Perceptron (GVP) to encode and transform the structural context of a central residue surrounded by its 30 nearest-neighbor residues. Each node of the graph corresponds to a residue.

PVQD models the joint distribution of the latent space vectors encoding backbone structures with a denoising diffusion probabilistic model (DDPM).

In DDPMs, a forward Markovian diffusion process of T time steps is used to gradually introduce Gaussian noise into the true data, while a network is trained to perform the reverse denoising process to recover the true data.

PVQD uses the denoising network architecture of Diffusion Transformers. The module is composed of 24 repeated Transformer blocks, and the time-step embedding is incorporated through adaptive Layer Norm (AdaLN) modules.

Through denoising diffusion from Gaussian random noise, a sequence of the latent space vectors is generated by the diffusion module, which is subsequently mapped to a sequence of the quantized vectors, and decoded into a 3-dimensional backbone structure as in the auto-encoder.
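
For context, the standard DDPM forward (noising) step on a latent vector looks like the NumPy sketch below; the linear beta schedule and T are generic textbook choices, not PVQD's actual settings.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # generic linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative product of (1 - beta_t)

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps                        # the denoiser is trained to predict eps

x0 = np.random.default_rng(1).normal(size=64)   # a latent-space vector
x_t, eps = q_sample(x0, t=500)
```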






□ regioneReloaded: evaluating the association of multiple genomic region sets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad704/7439591

RegioneReloaded is a package that allows simultaneous analysis of associations between genomic region sets, enabling clustering of data and the creation of ready-to-publish graphs.

RegioneReloaded takes over and expands on all the features of its predecessor regioneR. It also incorporates a strategy to improve p-value calculations and to normalize z-scores coming from multiple analyses to allow their direct comparison.





□ MAJIQ-L: Contrasting and Combining Transcriptome Complexity Captured by Short and Long RNA Sequencing Reads

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568046v1

MAJIQ-L, an extension of MAJIQ that enables a unified view of transcriptome variation from both technologies and demonstrates its benefits. It can be used to assess any future long-read algorithm and can be combined with short-read data for improved transcriptome analysis.

MAJIQ-L constructs unified gene splice graphs with all isoforms and all LSVs visible for analysis. This unified view is implemented in a new visualization package (VOILA v3), allowing users to inspect each gene of interest where the three sources agree or differ.





□ Improved quality metrics for association and reproducibility in chromatin accessibility data using mutual information

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05553-0

A random subsampling strategy generates synthetic replicates with varying proportions of shared peaks as a proxy for reproducibility. Across these simulations, they apply Pearson's r and Spearman's ρ and monitor their behavior, including the effect of removing co-zeros.

Removing co-zero values had a similar effect on the association metrics, attenuating them and improving the average AUC across the proportion of shared peaks between synthetic replicates.
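
The co-zero handling can be sketched in a few lines with SciPy; this is a generic illustration rather than the paper's exact evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def replicate_association(a, b, drop_cozeros=True):
    """Pearson's r and Spearman's rho between two peak-count vectors, optionally
    removing bins where both replicates are zero ("co-zeros")."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    if drop_cozeros:
        keep = ~((a == 0) & (b == 0))
        a, b = a[keep], b[keep]
    return pearsonr(a, b)[0], spearmanr(a, b)[0]
```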





□ AOPWIKI-EXPLORER: An Interactive Graph-based Query Engine leveraging Large Language Models

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568076v1

Unveiling the capacity of the Labeled Property Graph (LPG) data-modelling paradigm to serve as a natural data structure for Adverse Outcome Pathways (AOPs). In an LPG, data is organized into nodes and relationships, in contrast with RDF triples, which consist of subject-predicate-object statements.

AOPWIKI-EXPLORER provides a unified full-stack graph-data solution that encompasses the essential components: data structure, query generator, and interactive interpretation.





□ Design of Worst-Case-Optimal Spaced Seeds

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567826v1

For any mask, integer linear programs are used to (1) minimize the number of unchanged windows and (2) minimize the number of positions covered by unchanged windows. Then, among all masks of a given shape (k, w), the set of best masks that maximize these minima is determined.

The optimal mask(s) unsurprisingly depend on the model parameters, but at least for simple Bernoulli models, where a change can appear at each sequence position independently with some small probability p, the problem has been comprehensively solved:

The probability of at least one hit can be computed as a parameterized polynomial in p, from which one can identify the small set of masks that are optimal for some value of p, or optimal when integrated over a certain p-interval.

In essence, one uses dynamic programming to count binary change sequences (or accumulate their probabilities) that do not contain the mask as a substring; these calculations can be carried out symbolically.
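
For tiny instances, the quantity in question can be checked by brute force, as in the sketch below (exhaustive enumeration rather than the paper's symbolic dynamic programming; the mask and parameters are arbitrary examples).

```python
from itertools import product

def prob_at_least_one_hit(mask, n, p):
    """mask: string of '1' (match required) / '0' (don't care); n: sequence length;
    p: independent per-position change probability. Exhaustive over all patterns."""
    k = len(mask)
    care = [i for i, c in enumerate(mask) if c == "1"]
    total = 0.0
    for changes in product((0, 1), repeat=n):          # 1 = position changed
        hit = any(all(changes[s + i] == 0 for i in care)
                  for s in range(n - k + 1))
        if hit:
            weight = 1.0
            for c in changes:
                weight *= p if c else (1.0 - p)
            total += weight
    return total

print(prob_at_least_one_hit("1101", n=8, p=0.1))
```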





□ PyCoGAPS: Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS

>> https://www.nature.com/articles/s41596-023-00892-x

A generalized discussion of NMF covering its benefits, limitations, and open questions in the field is followed by three vignettes for the Bayesian NMF algorithm CoGAPS (Coordinated Gene Activity across Pattern Subsets).

PyCoGAPS, a new Python interface for CoGAPS to enhance accessibility of this method. Their three protocols then demonstrate step-by-step NMF analysis across distinct software platforms.





□ A genome-wide segmentation approach for the detection of selection footprints

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568282v1

Reformulating the problem of detecting regions with abnormally high Fst levels as a multiple changepoint detection or segmentation problem. The procedure relies on statistically grounded and computationally efficient approaches for multiple changepoint detection.

The time complexity of the FPOP algorithm is on average O(n log n), and its space complexity is O(n). Therefore, not storing the two matrices while running the pDPA and using FPOP to recover the segmentation into D segments yields an average O(D_max n log n) time and O(n) space complexity.





□ MiREx: mRNA levels prediction from gene sequence and miRNA target knowledge

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05560-1

miREx, a Convolutional Neural Network (CNN) model for predicting mRNA expression levels from gene sequence and miRNA post-transcriptional information. miREx’s architecture is inspired by Xpresso, a SOTA model for mRNA level prediction that exploits DNA sequence and gene features.

miREx exploits the Xpresso CNN architecture as a backbone, consisting of convolutional and max-pooling layers applied to the one-hot-encoded DNA sequence. miRNA expression levels are concatenated to the DNA-sequence and half-life features.





□ MLN-O: analysis of multiple phenotypes for extremely unbalanced case-control association studies using multi-layer network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad707/7441501

MLN-O (Multi-Layer Network with Omnibus) uses the score test to test the association of each merged phenotype in a cluster and a SNP and then uses the Omnibus test to obtain an overall test statistic to test the association between all phenotypes and a SNP.

MLN-O is designed for dimension reduction of correlated and extremely unbalanced case-control phenotypes.

MLN enhances the connectivity of phenotypes. It only considers individuals with at least one case status and does not consider individuals without any disease, because such individuals do not carry information that reveals the clustering structure among phenotypes.





□ Efficient construction of Markov state models for stochastic gene regulatory networks by domain decomposition

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568127v1

The state space is decomposed via a Voronoi tessellation, and transition probabilities are estimated using adaptive sampling strategies. Robust Perron cluster analysis (PCCA+) is then applied to construct the final Markov state models.

They provide a proof-of-concept by applying the approach to two different networks of mutually inhibiting gene pairs with different mechanisms of self-activation. These are frequently occurring motifs in transcriptional regulatory networks to control cell fate decisions.





□ ChromaX: a fast and scalable breeding program simulator

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad691/7441500

ChromaX is based on the high-performance numerical computing library JAX. Using JAX, ChromaX functions are compiled in XLA (Accelerated Linear Algebra), a compiler for linear algebra that accelerates function execution according to the domain and hardware available.

ChromaX simulates the genetic recombinations that take place during meiosis to create new haplotypes. ChromaX computes the genomic value by performing a tensor contraction of the marker effect with the input population array of markers.
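
The genomic-value step is essentially a contraction like the one below (plain NumPy for illustration; the assumed shapes ignore the ploidy axis that the real ChromaX arrays carry, and all values are random).

```python
import numpy as np

rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(200, 5000)).astype(float)   # (individuals, markers) dosages
effects = rng.normal(scale=0.01, size=(5000, 2))                 # (markers, traits) additive effects

# Tensor contraction over the marker axis -> genomic values per individual and trait.
genomic_values = np.einsum("nm,mt->nt", genotypes, effects)       # shape (200, 2)
print(genomic_values.shape)
```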





□ Taxometer: Improving taxonomic classification of metagenomics contigs

>> https://www.biorxiv.org/content/10.1101/2023.11.23.568413v1

Taxometer, a neural network based method that improves the annotations and estimates the quality of any taxonomic classifier by combining contig abundance profiles and tetra-nucleotide frequencies.

Taxometer improves taxonomic annotations of any contig-level metagenomic classifier. Taxometer both filled annotation gaps and deleted incorrect labels. Additionally, Taxometer provides a metric for evaluating the quality of annotations in the absence of ground truth.





□ Charm is a flexible pipeline to simulate chromosomal rearrangements on Hi-C-like data.

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568374v1

Charm, a novel simulator for Hi-C maps, also referred to as Chromosome rearrangement modeler. Charm captures different aspects of the Hi-C data structure, encompassing aspects like coverage bias and compartment patterns.

Charm employs Hi-C maps simulating different SV types to benchmark the EagleC deep-learning framework. EagleC predicts each SV breakpoint as a pair of genomic coordinates and provides four probability scores for each SV depending on the genomic orientation of the rearranged loci.






multi-collinearity.

2023-11-21 18:52:22 | Science

To avoid multicollinearity in NumPy, the common step is to perform an eigenvalue decomposition of the covariance matrix. An alternative, however, is to use the variance_inflation_factor function (from statsmodels) instead of linalg.eig() and drop variables whose VIF exceeds a threshold.
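
A minimal sketch of that VIF-based alternative with statsmodels, iteratively dropping the feature with the largest VIF above a threshold; the threshold of 10 and the greedy one-at-a-time strategy are common conventions, not fixed rules.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively remove the column with the highest VIF until every VIF is
    below the threshold (an alternative to eigendecomposing the covariance matrix)."""
    cols = list(df.columns)
    while len(cols) > 1:
        X = sm.add_constant(df[cols]).values
        vifs = {c: variance_inflation_factor(X, i + 1)   # i + 1 skips the constant
                for i, c in enumerate(cols)}
        worst, worst_vif = max(vifs.items(), key=lambda kv: kv[1])
        if worst_vif < threshold:
            break
        cols.remove(worst)
    return df[cols]
```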