List of blog posts for July 2024 - lens, align.

EKPHRASIS.

2024-07-31 19:17:37 | Science News

(Art by Nikita Kolbovskiy)

□ scPRINT: pre-training on 50 million cells allows robust gene network predictions

>> https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1

scPRINT, a foundation model designed for gene network inference. scPRINT outputs cell type-specific genome-wide gene networks but also generates predictions on many related tasks, such as cell annotations, batch effect correction, and denoising, without fine-tuning.

scPRINT is trained with a novel weighted random sampling method over 50 million cells from the cellxgene database from multiple species, diseases, and ethnicities, representing around 80 billion tokens.

□ biVI: Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

>> https://www.nature.com/articles/s41592-024-02365-9

biVI combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. biVI successfully fits single-cell neuron data and suggests the biophysical basis for expression differences.

biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.

biVI consists of the three generative models (bursty, constitutive, and extrinsic) and scVI with negative binomial likelihoods. biVI models can be instantiated with single-layer linear decoders to directly link latent variables with gene mean parameters via layer weights.

□ Tiberius: End-to-End Deep Learning with an HMM for Gene Prediction

>> https://www.biorxiv.org/content/10.1101/2024.07.21.604459v1

Tiberius, a novel deep learning-based ab initio gene structure prediction tool that end-to-end integrates convolutional and long short-term memory layers with a differentiable HMM layer. The HMM layer computes posterior probabilities or complete gene structures.

Tiberius employs a parallel variant of Viterbi, which can run in parallel on segments of a sequence. The Tiberius model has approximately eight million trainable parameters and it was trained with sequences of length T = 9999 and a length of T = 500,004 was used for inference.

□ WarpDemuX: Demultiplexing and barcode-specific adaptive sampling for nanopore direct RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.22.604276v1

WarpDemuX, an ultra-fast and highly accurate adapter-barcoding and demultiplexing approach. WarpDemuX operates directly on the raw signal and does not require basecalling. It uses novel signal preprocessing and a fast machine learning algorithm for barcode classification.

WarpDemuX integrates a Dynamic Time Warping Distance (DTWD) kernel into a Support Vector Machine (SVM) classifier. This DTWD-based kernel function captures the essential spatial and temporal signal information by quantifying how similar an unknown barcode is to known patterns.
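
To make the idea of a DTW-based kernel concrete, here is a minimal sketch in plain Python. It is not WarpDemuX's implementation: the exponential form of the kernel, the `gamma` value, the toy reference signals, and the nearest-reference `classify` helper (a stand-in for the SVM decision function) are all assumptions chosen for illustration.

```python
import math

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D signals."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def dtw_kernel(a, b, gamma=0.1):
    """Map the DTW distance to a similarity in (0, 1]; gamma is illustrative."""
    return math.exp(-gamma * dtw_distance(a, b))

def classify(signal, references, gamma=0.1):
    """Pick the most similar reference pattern (stand-in for an SVM)."""
    return max(references, key=lambda label: dtw_kernel(signal, references[label], gamma))

# Hypothetical barcode signal patterns.
references = {
    "barcode_A": [0.0, 1.0, 2.0, 1.0, 0.0],
    "barcode_B": [2.0, 1.0, 0.0, 1.0, 2.0],
}
# A time-stretched copy of barcode_A: same shape, different speed.
query = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
print(classify(query, references))
```

Because DTW absorbs the stretching, the query has zero distance to `barcode_A` even though the two signals differ in length, which is the property that makes such a kernel attractive for raw nanopore signal.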

□ STORIES: Learning cell fate landscapes from spatial transcriptomics using Fused Gromov-Wasserstein

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605241v1

STORIES (SpatioTemporal Omics eneRgIES), a novel trajectory inference method capable of learning a causal model of cellular differentiation from spatial transcriptomics through time using Fused Gromov-Wasserstein (FGW).

STORIES learns a potential function that defines each cell's stage of differentiation. STORIES allows one to predict the evolution of cells at future time points. Indeed, STORIES learns a continuous model of differentiation, while Moscot uses FGW to connect adjacent time points.

□ MultiMIL: Multimodal weakly supervised learning to identify disease-specific changes in single-cell atlases

>> https://www.biorxiv.org/content/10.1101/2024.07.29.605625v1

MultiMIL employs a multiomic data integration strategy using a product-of-expert generative model, providing a comprehensive multimodal representation of cells.

MultiMIL accepts paired or partially overlapping single-cell multimodal data across samples with varying phenotypes and consists of pairs of encoders and decoders, where each pair corresponds to a modality.

Each encoder outputs a unimodal representation for each cell, and the joint cell representation is calculated from the unimodal representations. The joint latent representations are then fed into the decoders to reconstruct the input data.

Cells from the same sample are combined with the multiple-instance learning (MIL) attention pooling layer, where cell weights are learned with the attention mechanism, and the sample representations are calculated as a weighted sum of cell representations.
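
The attention pooling step can be sketched in a few lines of plain Python. This is a simplified illustration, not MultiMIL's code: the linear scoring vector `w` stands in for a learned attention network, and the toy embeddings are made up.

```python
import math

def attention_pool(cell_embeddings, w):
    """Pool the cells of one sample into a single sample representation.

    Each cell gets a score (dot product with w), the scores are
    softmax-normalized into attention weights, and the sample
    representation is the weighted sum of the cell embeddings.
    """
    scores = [sum(x * wi for x, wi in zip(cell, w)) for cell in cell_embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(cell_embeddings[0])
    pooled = [
        sum(wt * cell[d] for wt, cell in zip(weights, cell_embeddings))
        for d in range(dim)
    ]
    return pooled, weights

# Three cells from one sample, each with a 2-D joint latent representation.
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [2.0, 0.0]  # hypothetical learned scoring vector
sample_vec, attn = attention_pool(cells, w)
print(sample_vec, attn)
```

The attention weights double as an interpretability signal: cells with large weights are the ones the model relies on when predicting the sample-level phenotype.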

□ scCross: a deep generative model for unifying single-cell multi-omics with seamless integration, cross-modal generation, and in silico exploration

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03338-z

scCross employs modality-specific variational autoencoders to capture cell latent embeddings for each omics type. scCross leverages biological priors by integrating gene set matrices as additional features for each cell.

scCross harmonizes these enriched embeddings into shared embeddings z using further variational autoencoders and, critically, bidirectional aligners. Bidirectional aligners are pivotal for the cross-modal generation.

□ MultiMM: Multiscale Molecular Modelling of Chromatin: From Nucleosomes to the Whole Genome

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605260v1

MultiMM (Multiscale Molecular Modelling) employs a multi-scale energy minimization strategy with a large choice of numerical integrators. MultiMM adapts the provided loop data to match the simulation's granularity, downgrading the data accordingly.

MultiMM consolidates loop strengths by summing those associated with the same loop after downgrading and retains only statistically significant ones, applying a threshold value. Loop strengths are then transformed to equilibrium distances.

MultiMM constructs a Hilbert curve structure. MultiMM employs a multi-scale molecular force-field. It encompasses strong harmonic bond and angle forces between adjacent beads, along with harmonic spring forces of variable strength to model the imported long-range loops.

□ GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

>> https://arxiv.org/abs/2407.16940

GV-Rep, a large-scale dataset of functionally annotated genomic variants (GVs), which could be used for deep learning models to learn meaningful genomic representations. GV-Rep aggregates data from seven leading public GV databases and a clinician-validated set.

The dataset organizes GV records into a standardized format, consisting of a (reference, alternative, annotation) triplet, and each record is tagged with a label that denotes attributes like pathogenicity, gene expression influence, or cell fitness impact.

These annotated records are utilized to fine-tune genomic foundation models (GFMs). These fine-tuned GFMs generate meaningful vectorized representations, enabling the training of smaller models for classifying unknown GVs or for search and indexing within a vectorized space.

□ ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach

>> https://www.biorxiv.org/content/10.1101/2024.07.25.605219v1

ChromBERT, a model specifically designed to detect distinctive patterns within chromatin state annotation data sequences. By adapting the BERT algorithm as utilized in DNABERT, the authors pretrained the model on the complete set of genic regions using 4-mer tokenization.

ChromBERT extends the concept fundamentally to the adaptation of chromatin state-annotated human genome sequences by combining it with Dynamic Time Warping.

□ Nucleotide dependency analysis of DNA language models reveals genomic functional elements

>> https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1

DNA language models are trained to reconstruct nucleotides, providing nucleotide probabilities given their surrounding sequence context. The probability of a particular nucleotide to be a guanine depends on whether it is intronic or located at the third base of a start codon.

The method mutates a nucleotide in the sequence context (the query nucleotide) into all three possible alternatives and records the change in predicted probabilities at a target nucleotide in terms of odds ratios.

This procedure, which can be repeated for all possible query-target combinations, quantifies the extent to which the language model prediction of the target nucleotide depends on the query nucleotide, all else equal.
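
The query-target procedure can be illustrated with a small sketch. Everything model-related here is hypothetical: `toy_model` is a stand-in for a real DNA language model, and summarizing the odds ratios by their largest absolute log value is an assumption made for this example.

```python
import math

NUCLEOTIDES = "ACGT"

def toy_model(seq, target):
    """Hypothetical stand-in for a DNA language model.

    Returns P(nucleotide) at position `target`. The prediction depends
    only on the base immediately upstream, which is enough to show the idea.
    """
    prev = seq[target - 1] if target > 0 else "N"
    if prev == "A":
        return {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
    return {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def odds(p):
    return p / (1.0 - p)

def dependency(model, seq, query, target):
    """Largest |log odds ratio| at `target` over all substitutions at `query`."""
    ref = model(seq, target)
    best = 0.0
    for alt in NUCLEOTIDES:
        if alt == seq[query]:
            continue  # only the three alternative bases
        mutated = seq[:query] + alt + seq[query + 1:]
        new = model(mutated, target)
        for nt in NUCLEOTIDES:
            log_or = math.log(odds(new[nt]) / odds(ref[nt]))
            best = max(best, abs(log_or))
    return best

seq = "CCAGT"
print(dependency(toy_model, seq, query=2, target=3))  # position 3 depends on position 2
print(dependency(toy_model, seq, query=0, target=3))  # no dependency in the toy model
```

Running `dependency` over every query-target pair would fill in a dependency map for the whole sequence, at the cost of three model evaluations per query position.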

□ The Genomic Code: The genome instantiates a generative model of the organism

>> https://arxiv.org/abs/2407.15908

The genome encodes a generative model of the organism. In this scheme, by analogy with variational autoencoders, the genome does not encode either organismal form or developmental processes directly, but comprises a compressed space of "latent variables".

These latent variables are the DNA sequences that specify the biochemical properties of encoded proteins and the relative affinities between trans-acting regulatory factors and their target sequence elements.

Collectively, these comprise a connectionist network, with weights that get encoded by the learning algorithm of evolution and decoded through the processes of development.

The latent variables collectively shape an energy landscape that constrains the self-organising processes of development so as to reliably produce a new individual of a certain type, providing a direct analogy to Waddington's famous epigenetic landscape.

□ AIVT: Inferring turbulent velocity and temperature fields and their statistics from Lagrangian velocity measurements using physics-informed Kolmogorov-Arnold Networks

>> https://arxiv.org/abs/2407.15727

Artificial Intelligence Velocimetry-Thermometry (AIVT) method to infer hidden temperature fields from experimental turbulent velocity data. It enables us to infer continuous temperature fields using only sparse velocity data, hence eliminating the need for direct temperature measurements.

AIVT is based on physics-informed Kolmogorov-Arnold Networks (not neural networks) and is trained by optimizing a combined loss function that minimizes the residuals of the velocity data, boundary conditions, and the governing equations.

AIVT can be applied to a unique set of experimental volumetric and simultaneous temperature and velocity data of Rayleigh-Bénard convection (RBC) that we acquired by combining Particle Image Thermometry and Lagrangian Particle Tracking.

□ Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations

>> https://www.nature.com/articles/s41467-024-49780-2

Stability Oracle uses a graph-transformer architecture that treats atoms as tokens and utilizes their pairwise distances to inject a structural inductive bias into the attention mechanism. Stability Oracle also uses a data augmentation technique—thermodynamic permutations.

Stability Oracle consists of the local chemistry surrounding a residue w/ the residue deleted and two amino acid embeddings. Stability Oracle generates all possible point mutations from a single environment, circumventing the need for computationally generated mutant structures.

□ TEA-GCN: Constructing Ensemble Gene Functional Networks Capturing Tissue/condition-specific Co-expression from Unlabled Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2024.07.22.604713v1

TEA-GCN (Two-Tier Ensemble Aggregation - GCN) leverages unsupervised partitioning of publicly derived transcriptomic data and utilizes three correlation coefficients to generate ensemble CGNs in a two-step aggregation process.

TEA-GCN uses the k-means clustering algorithm to divide gene expression data into partitions before gene co-expression determination. Expression data must be provided in the form of an expression matrix where expression abundances are in Transcripts per Million (TPM).

□ MultiOmicsAgent: Guided extreme gradient-boosted decision trees-based approaches for biomarker-candidate discovery in multi-omics data

>> https://www.biorxiv.org/cgi/content/short/2024.07.24.604727v1

MOAgent can directly handle molecular expression matrices - including proteomics, metabolomics, transcriptomics, as well as combinations thereof. The MOAgent-guided data analysis strategy is compatible with incomplete matrices and limited replicate studies.

The core functionality of MOAgent can be accessed via the "RFE++" section of the GUI. At its core, their selection algorithm has been implemented as a Monte-Carlo-like sampling of recursive feature elimination procedures.

□ LatentDAG: Representing core gene expression activity relationships using the latent structure implicit in bayesian networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae463/7720781

LatentDAG, a Bayesian network that can summarize the core relationships between gene expression activities. LatentDAG is substantially simpler than conventional co-expression and ChIP-seq networks. It provides clearer clusters, without extraneous cross-cluster connections.

LatentDAG iterates over all the genes in the network's main component and selects a gene if its removal results in at least two separate components, each having at least seven genes.
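
The gene selection rule reads naturally as a graph procedure. Below is a minimal sketch under two assumptions not stated above: the network is treated as undirected when checking connectivity, and every resulting component must meet the size threshold. The toy graph and function names are illustrative only.

```python
from collections import deque

def components_without(graph, removed):
    """Connected components of an undirected graph after deleting one node."""
    seen = {removed}
    comps = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        comp = []
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        comps.append(comp)
    return comps

def select_genes(graph, min_size=7):
    """Genes whose removal leaves >= 2 components, each with >= min_size genes."""
    selected = []
    for gene in graph:
        comps = components_without(graph, gene)
        if len(comps) >= 2 and all(len(c) >= min_size for c in comps):
            selected.append(gene)
    return selected

def build_toy_graph():
    """A hub gene joining two chains of seven genes each."""
    graph = {"hub": set()}
    for side in ("L", "R"):
        prev = "hub"
        for i in range(7):
            node = f"{side}{i}"
            graph.setdefault(node, set())
            graph[prev].add(node)
            graph[node].add(prev)
            prev = node
    return graph

graph = build_toy_graph()
print(select_genes(graph))
```

In the toy graph only the hub qualifies: removing any chain gene either leaves a single component or strands fewer than seven genes on one side.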

□ ASSMEOA: Adaptive Space Search-based Molecular Evolution Optimization Algorithm

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae446/7718495

A strategy to construct a molecule-specific fragment search space to address the limited and inefficient exploration of chemical space.

Each molecule-specific fragment library initially includes the decomposition fragments of molecules with satisfactory properties in the database, and is then enlarged in each iteration by adding fragments from newly generated molecules with satisfactory properties.

ASSMEOA is a molecule optimization algorithm to optimize molecules efficiently. They also propose a dynamic mutation strategy by replacing the fragments of a molecule with those in the molecule-specific fragment search space.

□ Gencube: Efficient retrieval, download, and unification of genomic data from leading biodiversity databases

>> https://www.biorxiv.org/content/10.1101/2024.07.18.604168v1

Gencube, an open-source command-line tool designed to streamline programmatic access to metadata and diverse types of genomic data from publicly accessible leading biodiversity repositories. gencube fetches metadata and FASTA format files for genome assemblies.

Gencube crossgenome fetches comparative genomics data, such as homology or codon / protein alignment of genes from different species. Gencube seqmeta generates a formal search query, retrieves the relevant metadata, and integrates it into experiment-level and study-level formats.

□ Pangene: Exploring gene content with pangene graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae456/7718494

Pangene takes a set of protein sequences and multiple genome assemblies as input, and outputs a graph in the GFA format. It aligns the set of protein sequences to each input assembly w/ miniprot, and derives a graph from the alignment with each contig encoded as a walk of genes.

Pangene provides utilities to classify genes into core genes that are present in most of the input genomes, or accessory genes. Pangene identifies generalized bubbles in the graph, which represent local gene order, gene copy-number or gene orientation variations.

□ QUILT2: Rapid and accurate genotype imputation from low coverage short read, long read, and cell free DNA sequence

>> https://www.biorxiv.org/content/10.1101/2024.07.18.604149v1

QUILT2, a novel scalable method for rapid phasing and imputation from lc-WGS and cfDNA using very large haplotype reference panels. QUILT2 uses a memory-efficient version of the positional Burrows-Wheeler transform (PBWT), which they call the multi-symbol PBWT (msPBWT).

QUILT2 uses msPBWT in the imputation process to find haplotypes in the haplotype reference panel that share long matches to imputed haplotypes with constant computational complexity, and with a very low memory footprint.

QUILT2 employs a two-stage imputation process, where it first samples read labels and finds an optimal subset of the haplotype reference panel using information at common SNPs, and then uses these to initialize a final imputation at all SNPs.

□ MENTOR: Multiplex Embedding of Networks for Team-Based Omics Research

>> https://www.biorxiv.org/content/10.1101/2024.07.17.603821v1

MENTOR is a software extension to RWRtoolkit, which implements the random walk with restart (RWR) algorithm on multiplex networks. The RWR algorithm traverses a random walker across a monoplex / multiplex network using a single node, called the seed, as an initial starting point.

As an abstraction of the edge density of these networks, a topological distance matrix is created and hierarchical clustering used to create a dendrogram representation of the functional interactions. MENTOR can determine the topological relationships among all genes in the set.
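
The random walk with restart described above can be sketched on a single-layer (monoplex) network. This is a generic illustration, not RWRtoolkit or MENTOR code: the restart probability of 0.7, the toy gene network, and the function name are assumptions.

```python
def random_walk_with_restart(adj, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that jumps back to
    `seed` with probability `restart` at every step."""
    nodes = sorted(adj)
    p = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(max_iter):
        # Restart mass goes to the seed; the rest is spread over neighbors.
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * p[n] / len(adj[n])
            for nb in adj[n]:
                nxt[nb] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Hypothetical undirected gene network as adjacency lists.
adj = {
    "g1": ["g2", "g3"],
    "g2": ["g1", "g3"],
    "g3": ["g1", "g2", "g4"],
    "g4": ["g3"],
}
scores = random_walk_with_restart(adj, seed="g1")
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

Repeating the walk with each gene of a set as the seed gives one score vector per gene; comparing those vectors is one way to arrive at a pairwise distance matrix that hierarchical clustering can then turn into a dendrogram.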

□ SGS: Empowering Integrative and Collaborative Exploration of Single-Cell and Spatial Multimodal Data

>> https://www.biorxiv.org/content/10.1101/2024.07.19.604227v1

SGS offers two modules: SC (single-cell and spatial visualization module) and SG (single-cell and genomics visualization module), w/ adaptable interface layouts and advanced capabilities.

Notably, the SG module incorporates a novel genome browser framework that significantly enhances the visualization of epigenomic modalities, including scATAC, scMethylC, sc-eQTL, and scHiC.

□ Pseudovisium: Rapid and memory-efficient analysis and quality control of large spatial transcriptomics datasets

>> https://www.biorxiv.org/content/10.1101/2024.07.23.604776v1

Pseudovisium, a Python-based framework designed to facilitate the rapid and memory-efficient analysis, quality control and interoperability of high-resolution spatial transcriptomics data. This is achieved by mimicking the structure of 10x Visium through hexagonal binning of transcripts.

Pseudovisium increased data processing speed and reduced dataset size by more than an order of magnitude. At the same time, it preserved key biological signatures, such as spatially variable genes, enriched gene sets, cell populations, and gene-gene correlations.
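
Hexagonal binning of transcripts can be sketched as follows. This is an illustration of the general technique rather than Pseudovisium's code: the pointy-top hexagon orientation, the axial coordinate scheme, the `size` parameter (center-to-corner distance), and the toy transcripts are all assumptions.

```python
import math
from collections import Counter

def hex_bin(x, y, size):
    """Axial (q, r) index of the pointy-top hexagon containing point (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    # Cube rounding: round all three coordinates, then repair the one
    # with the largest rounding error so that q + r + s == 0 still holds.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr

def pseudo_spots(transcripts, size):
    """Count transcripts per (hexagon, gene), like a spot-by-gene matrix."""
    counts = Counter()
    for x, y, gene in transcripts:
        counts[(hex_bin(x, y, size), gene)] += 1
    return counts

# Hypothetical transcript detections: (x, y, gene).
transcripts = [
    (1.0, 1.0, "GeneA"),
    (2.0, -1.0, "GeneA"),
    (17.0, 0.5, "GeneB"),
    (16.0, 1.0, "GeneA"),
]
counts = pseudo_spots(transcripts, size=10.0)
print(counts)
```

Collapsing millions of transcript coordinates into a few thousand hexagon counts is what yields the reduction in dataset size, at the price of losing subcellular resolution.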

□ SAVANA: reliable analysis of somatic structural variants and copy number aberrations in clinical samples using long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.25.604944v1

SAVANA is a somatic SV caller for long-read data. It takes aligned tumour and normal BAM files, examines the reads for evidence of SVs, clusters adjacent potential SVs together, and finally calls consensus breakpoints, classifies somatic events, and outputs them in BEDPE and VCF.

SAVANA also identifies copy number aberrations and predicts purity and ploidy. SAVANA provides functionalities to assign sequencing reads supporting each breakpoint to haplotype blocks when the input sequencing reads are phased.

□ GW: ultra-fast chromosome-scale visualisation of genomics data

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605272v1

Genome-Wide (GW) is an interactive genome browser that expedites analysis of aligned sequencing reads and data tracks, and introduces novel interfaces for exploring, annotating and quantifying data.

GW's high-performance design enables rapid rendering of data at speeds approaching the file reading rate, in addition to removing the memory constraints of visualizing large regions. GW explores massive genomic regions or chromosomes without requiring additional processing.

□ ConsensuSV-ONT - a modern method for accurate structural variant calling

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605267v1

ConsensuSV-ONT, a novel meta-caller algorithm, along with a fully automated variant detection pipeline and a high-quality variant filtering algorithm based on variant encoding for images and convolutional neural network models.

ConsensuSV-ONT-core is used for getting the consensus (by CNN model) out of the already-called SVs; it takes VCF files as input and returns a high-quality VCF file. ConsensuSV-ONT-pipeline is the complete out-of-the-box solution using as the input raw ONT fast files.

□ A fast and simple approach to k-mer decomposition

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605312v1

An intuitive integer representation of a k-mer, which at the same time acts as minimal perfect hash. This is accompanied by a minimal perfect hash function (MPHF) that decomposes a sequence into these hash values in constant time with respect to k.

It provides a simple way to give these k-mer hashes a pseudorandom ordering, a desirable property for certain k-mer based methods, such as minimizers and syncmers.
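
A minimal sketch of the idea, assuming the common 2-bit encoding A=0, C=1, G=2, T=3: every k-mer maps to a distinct integer in [0, 4^k), and sliding the window by one base costs a shift, an OR, and a mask regardless of k. The `scramble` step uses a linear congruential map as one possible way to obtain a pseudorandom ordering; its constants are illustrative and not taken from the paper.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_codes(seq, k):
    """Yield the base-4 integer of every k-mer in seq.

    Each step reuses the previous value, so the per-position cost
    does not depend on k.
    """
    mask = (1 << (2 * k)) - 1
    value = 0
    for i, base in enumerate(seq):
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            yield value

def scramble(code, k, multiplier=0x9E3779B97F4A7C15, increment=1):
    """Pseudorandomly permute k-mer codes with a linear congruential map.

    An odd multiplier modulo a power of two is a bijection, so distinct
    k-mers keep distinct values. The constants are illustrative choices.
    """
    mask = (1 << (2 * k)) - 1
    return (code * multiplier + increment) & mask

print(list(kmer_codes("ACGTAC", 3)))
```

With a scrambled ordering in hand, a minimizer is simply the k-mer with the smallest scrambled value in each window, which avoids the bias toward poly-A k-mers that plain lexicographic order would introduce.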

□ SCCNAInfer: a robust and accurate tool to infer the absolute copy number on scDNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae454/7721932

SCCNAInfer calculates the pairwise distance among cells, and clusters the cells by a novel and sophisticated cell clustering algorithm that optimizes the selection of the cell cluster number.

SCCNAInfer automatically searches for the optimal subclonal ploidy that minimizes an objective function that not only incorporates the integer copy number approximation algorithm, but also considers the distances within clusters and those between different clusters.

□ scASfind: Mining alternative splicing patterns in scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03323-6

scASfind uses a similar data compression strategy as scfind to transform the cell pool-to-node differential PSI matrix into an index. This enables rapid access to cell type-specific splicing events and allows an exhaustive approach for pattern searches across the entire dataset.

scASfind does not involve any imputation or model fitting; instead, cells are pooled to avoid the challenges presented by sparse coverage. Moreover, there is no restriction on the number of exons, or the inclusion/exclusion events involved in the pattern of interest.

□ HAVAC: An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05879-3

HAVAC (The Hardware Accelerated single-segment Viterbi Additional Coprocessor), an FPGA-accelerated implementation of the Single-segment Ungapped Viterbi algorithm for use in nucleotide sequence with profile hidden Markov models.

HAVAC concatenates all sequences in a fasta file and all models in an hmm file before transferring the data to the accelerator for processing. The HAVAC kernel represents a 227× matrix calculation speedup over nhmmer with one thread and a 92× speedup over nhmmer with 4 threads.

Splash!

2024-07-31 19:07:07 | Photos

(Created with Midjourney V6.1)

Midjourney V6.1 is awesome.

The Map of Tiny Perfect Things

2024-07-27 17:49:19 | Movies

□ 『The Map of Tiny Perfect Things』 (Japanese title: 明日への地図を探して)

Directed by Ian Samuels
Based on the short story by Lev Grossman
Music by Tom Bromley
Cinematography by Andrew Wehde

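You are endlessly enjoying the same midsummer day when you meet a girl who is living through the same phenomenon, and love begins. It happens, right? "In the fourth dimension, you can see everything." The two set out to witness every miracle that happens around town, but... A faint, bittersweet time-loop coming-of-age film. Yet the decision that awaits them is far too heavy.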

Tears fluxing.

2024-07-25 20:57:28 | Photos

(Created with Midjourney V6.0 ALPHA)

One Without.

2024-07-25 20:50:34 | Diary, Essays, Columns

(Created with Midjourney V6.0 ALPHA)

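Memory seems to run continuously, uniformly and linearly, from "now" back to some point in the past. But in truth it may be something like a signal that can exist only while it is intermittently lit by a flickering strobe. What exists only in this instant. What existed only in that instant. Even if I reach out into the light that has dropped away...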

□ Oliver Coates / “One Without”

Rebirth.

2024-07-23 19:42:48 | Music20

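□ Bahramji & Justin Rezvani / “Rebirth”

A fusion of Persian Sufi music and electronica. Bahramji is an Iranian mystic musician deeply influenced by Rumi, and is also renowned as a master of the santur. The bold deep-house-style arrangement by entrepreneur and DJ Justin Rezvani shines.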

2024/05/24
Here be dragons
Bahram Pourmand / Justin Rezvani

Sundown lane.

2024-07-22 01:28:40 | Photos

□ Avenue One / “Rio”

Aftersun.

2024-07-21 03:08:08 | Movies

□ 『Aftersun』

2022
Directed by Charlotte Wells
Music by Oliver Coates
Cinematography by Gregory Oke

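A dangerous film, like shards of glass being driven into your chest again and again. Despair flickers in and out of the folds of a faint happiness. Fate takes away, all too easily, the very thing you can no longer live without. Sometimes you do not even notice that you have lost it. And yet people cannot live on fleeting happiness alone.

When I saw it in the theater, I was so overwhelmed by the uneasy emptiness and sense of loss pervading the whole film that I was left speechless. The tragic feeling of an ending closing in even though no one is at fault does not fade on a second viewing; if anything it cuts even more sharply, so it really is a dangerous film.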

Beachcomber.

2024-07-21 01:01:01 | Photos

(Created with Midjourney V6 ALPHA)

□ Blank & Jones / “Bottled Sunshine”

Vectorum.

2024-07-17 19:07:07 | Science News

(Art by megs)

God made everything out of nothing. But the nothingness shows through.
─── Paul Valéry (1871–1945)

□ STARS AS SIGNALS / “We Are Stars”

□ HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae452/7714688

HyperGen is a Rust library used to sketch genomic files and boost genomic Average Nucleotide Identity (ANI) calculation. HyperGen combines FracMinHash and hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector) in high-dimensional space.

HyperGen adds a key step - Hyperdimensional Encoding for k-mer Hash. This step essentially converts the discrete and numerical hashes in the k-mer hash set to a D-dimensional and nonbinary vector, called the sketch hypervector. HyperGen relies on recursive random bit generation.
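
The two stages, FracMinHash sampling followed by hyperdimensional encoding, can be sketched in plain Python. This is a conceptual illustration and not HyperGen's Rust implementation: the BLAKE2b hash, seeding Python's `random.Random` with each hash, the +1/-1 encoding, and the parameter values are all assumptions.

```python
import hashlib
import random

def kmer_hash(kmer):
    """64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def frac_minhash(seq, k, scale):
    """Keep the hashes that fall in the lowest 1/scale of the hash space."""
    threshold = (1 << 64) // scale
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return {h for h in hashes if h < threshold}

def sketch_hypervector(hashes, dim=1024):
    """Sum one pseudorandom +1/-1 vector per hash into a nonbinary vector."""
    hv = [0] * dim
    for h in hashes:
        rng = random.Random(h)  # the hash value seeds the bit generator
        bits = rng.getrandbits(dim)
        for d in range(dim):
            hv[d] += 1 if (bits >> d) & 1 else -1
    return hv

seq = "ACGTACGTGG"
hashes = frac_minhash(seq, k=4, scale=1)  # scale=1 keeps every k-mer hash
hv = sketch_hypervector(hashes, dim=64)
print(len(hashes), hv[:8])
```

Because each hash expands to a nearly orthogonal random vector, the summed hypervector has a fixed size no matter how many hashes the FracMinHash step retains.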

□ ENGRAM: Symbolic recording of signalling and cis-regulatory element activity to DNA

>> https://www.nature.com/articles/s41586-024-07706-4

ENGRAM, a multiplex strategy for biologically conditional genomic recording in which signal-specific CREs drive the insertion of signal-specific barcodes to a common DNA Tape.

ENGRAM is a recorder assay in which measurements are written to DNA, and an MPRA is a reporter assay in which measurements are made from RNA.

All components would be genomically encoded by a recorder locus within the millions to billions of cells of a model organism, capturing biology as it unfolds over time, and collectively read out at a single endpoint.

□ scGFT: single-cell RNA-seq data augmentation using generative Fourier transformer

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602768v1

scGFT (single-cell Generative Fourier Transformer), a cell-centric generative model built upon the principles of the Fourier Transform. It employs a one-shot transformation paradigm to synthesize GE profiles that reflect the natural biological variability in authentic datasets.

scGFT eschews the reliance on identifying low-dimensional data manifolds, focusing instead on capturing the intricacies of cell expression profiles into a complex space via the Discrete Fourier Transform and reconstruction of synthetic profiles via the Inverse Fourier Transform.

□ scKEPLM: Knowledge enhanced large-scale pre-trained language model for single-cell transcriptomics

>> https://biorxiv.org/cgi/content/short/2024.07.09.602633v1

scKEPLM is a knowledge-enhanced single-cell foundation model. scKEPLM covers over 41 million single-cell RNA sequences and 8.9 million gene relations. scKEPLM is based on a Masked Language Model (MLM) architecture. It leverages MLMs to predict missing or masked elements in the sequences.

scKEPLM consists of two parallel encoders. scKEPLM employs a Gaussian attention mechanism within the transformer architecture to model the complex high-dimensional interaction. scKEPLM precisely aligns cell semantics with genetic information.

□ HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602403v1

HERMES, a 3D rotation equivariant neural network with a more efficient architecture than Holographic Convolutional Neural Network (HCNN), pre-trained on amino-acid propensity, and computationally-derived mutational effects using their open-source code.

HERMES refers to the resulting Fourier encoding of the data as a holographic encoding, as it presents a superposition of 3D spherical holograms. The resulting holograms are then fed to a stack of SO(3)-equivariant layers, which convert the holograms to an SO(3)-equivariant embedding.

□ FoldToken3: Fold Structures Worth 256 Words or Less

>> https://www.biorxiv.org/content/10.1101/2024.07.08.602548v1

FoldToken3 re-designs the vector quantization module. FoldToken3 uses a 'partial gradient' trick to allow the encoder and quantifier to receive stable gradients no matter how small the temperature is.

Compared to ESM3, whose encoder and decoder have 30.1M and 618.6M parameters with 4096 code space, FoldToken3 has 4.31M and 4.92M parameters with 256 code space.

FoldToken3 uses only 256 code vectors. FoldToken3 replaces the 'argmax' operation with sampling from a categorical distribution, making the code selection process stochastic.

□ RNAFlow: RNA Structure & Sequence Design via Inverse Folding-Based Flow Matching

>> https://arxiv.org/pdf/2405.18768

RNAFlow, a flow matching model for RNA sequence-structure design. In each iteration, RNAFlow first generates an RNA sequence given a noisy protein-RNA complex and then uses RF2NA to fold it into a denoised RNA structure.

RNAFlow has three advantages. First, it generates an RNA sequence and its structure simultaneously. Second, it is much easier to train because the authors do not fine-tune a large structure prediction network. Third, it enables modeling of the dynamic nature of RNA structures for inverse folding.

□ Mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies

>> https://www.biorxiv.org/content/10.1101/2024.07.11.603040v1

Mettannotator - a comprehensive Nextflow pipeline for prokaryotic genome annotation that identifies coding and non-coding regions, predicts protein functions, including antimicrobial resistance, and delineates gene clusters.

The Mettannotator pipeline parses the results of each step and consolidates them into a final valid GFF file per genome. The ninth column of the file contains carefully chosen key-value pairs to report the salient conclusions from each tool.

□ Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05862-y

A linear reference sequence index that takes into account known genetic variants using the features of the internal representation of the reference sequence index of the minimap2 tool.

The possibility of modifying the minimap2 tool index is provided by the fact that the hash table does not impose any restrictions on the number of minimizers at a given position of the linear reference sequence.

Adding information about genetic variants does not affect the subsequent alignment algorithm. The linear reference sequence index allows the addition of branches induced by the addition of genetic variants, similar to a genomic graph.

□ GeneBayes: Bayesian estimation of gene constraint from an evolutionary model with gene features

>> https://www.nature.com/articles/s41588-024-01820-9

GeneBayes is an Empirical Bayes framework that can be used to improve estimation of any gene property that one can relate to available data through a likelihood function.

GeneBayes trains gradient-boosted trees to predict the parameters of the prior distribution by maximizing the likelihood. GeneBayes computes a per-gene posterior distribution for the gene property of interest, returning a posterior mean and 95% credible interval for each gene.

□ METASEED: a novel approach to full-length 16S rRNA gene reconstruction from short read data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05837-z

METASEED is an alternative approach that uses amplicon 16S rRNA data and shotgun sequencing data from the same samples, helping the pipeline determine what the original 16S region would look like.

METASEED eliminates undesirable noise and produces high-quality, reasonable-length 16S sequences. The method is designed to broaden the repertoire of sequences in 16S rRNA reference databases by reconstructing novel near-full-length sequences.

□ Floria: fast and accurate strain haplotyping in metagenomes

>> https://academic.oup.com/bioinformatics/article/40/Supplement_1/i30/7700908

Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model.

Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly.

□ CLADES: Unveiling Clonal Cell Fate and Differentiation Dynamics: A Hybrid NeuralODE-Gillespie Approach

>> https://www.biorxiv.org/content/10.1101/2024.07.08.602444v1

CLADES (Clonal Lineage Analysis with Differential Equations and Stochastic Simulations), a model estimator, namely a NeuralODE based framework, to delineate meta-clone specific trajectories and state-dependent transition rates.

CLADES is a data generator via the Gillespie algorithm, that allows a cell, for a randomly extracted time interval, to choose either a proliferation, differentiation, or apoptosis process in a stochastic manner.
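A bare-bones Gillespie simulation of one clone gives a feel for the data-generating side. The rates, the two-compartment state (progenitors and differentiated cells) and the function name are assumptions for this sketch, not CLADES parameters or code.

```python
import random


def gillespie_clone(rates, n_progenitors, t_max, seed=0):
    """Simulate proliferation, differentiation and apoptosis of progenitors.

    rates: per-cell rates keyed by "proliferation", "differentiation", "apoptosis".
    Returns a list of (time, progenitors, differentiated) snapshots.
    """
    rng = random.Random(seed)
    t = 0.0
    progenitors, differentiated = n_progenitors, 0
    history = [(t, progenitors, differentiated)]
    while progenitors > 0:
        propensities = {event: rate * progenitors for event, rate in rates.items()}
        total = sum(propensities.values())
        if total <= 0:
            break
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_max:
            break
        event = rng.choices(list(propensities), weights=list(propensities.values()))[0]
        if event == "proliferation":
            progenitors += 1
        elif event == "differentiation":
            progenitors -= 1
            differentiated += 1
        else:  # apoptosis
            progenitors -= 1
        history.append((t, progenitors, differentiated))
    return history


# Illustrative rates only.
history = gillespie_clone(
    {"proliferation": 1.0, "differentiation": 0.5, "apoptosis": 0.2},
    n_progenitors=5,
    t_max=2.0,
    seed=1,
)
print(history[-1])
```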

CLADES can estimate the summary of the divisions between progenitors and progeny, and showed that the fate bias between all progenitor-fate pairs can be inferred probabilistically.

□ scRL: Reinforcement learning guides single-cell sequencing in decoding lineage and cell fate decisions

>> https://www.biorxiv.org/content/10.1101/2024.07.04.602019v1

scRL utilizes a grid world created from a UMAP two-dimensional embedding of high-dimensional data, followed by an actor-critic architecture to optimize differentiation strategies and assess fate decision strengths.

The effectiveness of scRL is demonstrated through its ability to closely align pseudotime with distance trends in the two-dimensional manifold and to correlate lineage potential with pseudotime trends.

□ scMaSigPro: Differential Expression Analysis along Single-Cell Trajectories

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae443/7709407

scMaSigPro adapts maSigPro, a method initially developed for serial analysis of transcriptomics data, to the analysis of scRNA-seq trajectories. scMaSigPro detects genes that change their expression along pseudotime and between branching paths.

scMaSigPro establishes the polynomial model by assigning dummy variables to each branch, following the approach of the original maSigPro method for the Generalized Linear Model. scMaSigPro is therefore suited for diverse topologies and cell state compositions.
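To make "dummy variables per branch" concrete, here is a small sketch of one row of such a design matrix: polynomial pseudotime terms for a reference branch, plus a dummy and dummy-by-polynomial interactions for every other branch. The column layout and the default degree are assumptions for illustration, not scMaSigPro's exact parameterisation.

```python
def design_row(pseudotime, branch, branches, degree=2):
    """Build one design-matrix row for a cell on a given branch.

    Columns: intercept, t, t^2, ..., then for each non-reference branch
    a dummy followed by dummy * t, dummy * t^2, ...
    """
    poly = [pseudotime ** d for d in range(1, degree + 1)]
    row = [1.0] + poly
    for other in branches[1:]:  # the first branch is the reference
        dummy = 1.0 if branch == other else 0.0
        row.append(dummy)
        row.extend(dummy * term for term in poly)
    return row


branches = ["A", "B"]
print(design_row(2.0, "A", branches))
print(design_row(2.0, "B", branches))
```

A gene whose branch-specific coefficients differ from zero in a GLM fitted on such rows would be one that behaves differently between the branches.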

□ spASE: Detection of allele-specific expression in spatial transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03317-4

spASE detects ASE in spatial transcriptomics while accounting for cell type mixtures. spASE can estimate the contribution from each cell type to maternal and paternal allele counts at each spot, calculated based on cell type proportions and differential expression.

spASE enables modeling of the maternal allele probability spatial function both across and within cell types. spASE generates high-resolution spatial maps of X-chromosome ASE and identifies a set of genes escaping XCI.

□ Tuning Ultrasensitivity in Genetic Logic Gates using Antisense RNA Feedback

>> https://www.biorxiv.org/content/10.1101/2024.07.03.601968v1

The antisense RNAs (asRNAs) are expressed with the existing messenger RNA (mRNA) of a logic gate in a single transcript and target mRNAs of adjacent gates, creating a feedback of the protein-mediated repression that implements the core function of the logic gates.

A gate with multiple inputs logically consistent with the single-transcript RNA feedback connection must implement a generalized inverter structure on the molecular level.

□ GS-LVMOGP: Scalable Multi-Output Gaussian Processes with Stochastic Variational Inference

>> https://arxiv.org/abs/2407.02476

The Latent Variable MOGP (LV-MOGP) models the covariance between outputs using a kernel applied to latent variables, one per output, leading to a flexible MOGP model that allows efficient generalization to new outputs with few data points.

GS-LVMOGP, a generalized latent variable multi-output Gaussian process model within a stochastic variational inference framework. By conducting variational inference for latent variables and inducing values, GS-LVMOGP manages large-scale datasets with Gaussian/non-Gaussian likelihoods.

□ scTail: precise polyadenylation site detection and its alternative usage analysis from reads 1 preserved 3' scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602174v1

scTail, an all-in-one stepwise computational method. scTail takes an aligned bam file from STARsolo (with higher tolerance of low-quality mapping) as input and returns the detected PASs and a PAS-by-cell expression matrix.

scTail embedded a pre-trained sequence model to remove the false positive clusters, which enabled us to further evaluate the reliability of the detection by examining the supervised performance metrics and learned sequence motifs.

□ MaxComp: Prediction of single-cell chromatin compartments from single-cell chromosome structures

>> https://www.biorxiv.org/content/10.1101/2024.07.02.600897v1

MaxComp, an unsupervised method to predict single-cell compartments using graph-based programming. MaxComp determines single-cell A/B compartments from geometric considerations in 3D chromosome structures.

Segregation of chromosomal regions into two compartments can then be modeled as the Max-cut problem, a semidefinite graph programming method, which optimizes a cut through a set of edges such that the total weights of the cut edges will be maximized.
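The Max-cut objective itself is easy to show on a toy graph. The sketch below uses a simple local-search heuristic (flip a node whenever that increases the cut), not the semidefinite relaxation mentioned above, and the edge weights are invented rather than derived from 3D chromosome structures.

```python
def cut_weight(weights, side):
    """Total weight of edges whose endpoints lie in different compartments."""
    return sum(w for (a, b), w in weights.items() if side[a] != side[b])


def max_cut_local_search(weights, n):
    """Greedy local search for Max-cut; returns a 0/1 label per node."""
    side = [0] * n

    def gain(v):
        # Flipping v cuts its same-side edges and un-cuts its crossing edges.
        total = 0.0
        for (a, b), w in weights.items():
            if v in (a, b):
                other = b if a == v else a
                total += w if side[other] == side[v] else -w
        return total

    improved = True
    while improved:  # every flip strictly increases the cut, so this terminates
        improved = False
        for v in range(n):
            if gain(v) > 0:
                side[v] ^= 1
                improved = True
    return side


# Hypothetical weights: regions 0 and 1 are "far" from regions 2 and 3.
weights = {(0, 2): 1.0, (0, 3): 1.0, (1, 2): 1.0, (1, 3): 1.0, (0, 1): 0.1}
side = max_cut_local_search(weights, n=4)
print(side, cut_weight(weights, side))
```

Local search only guarantees a local optimum; on larger graphs a relaxation-based method can find better cuts.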

□ REGLE: Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

>> https://www.nature.com/articles/s41588-024-01831-6

REGLE (Representation Learning for Genetic Discovery on Low-Dimensional Embeddings) is based on the variational autoencoder (VAE) model. REGLE learns a nonlinear, low-dimensional, disentangled representation.

REGLE performs GWAS on all learned coordinates. Finally, it trains a small linear model to learn weights for each latent coordinate polygenic risk scores to obtain the final disease-specific polygenic risk scores.

□ GALEON: A Comprehensive Bioinformatic Tool to Analyse and Visualise Gene Clusters in Complete Genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae439/7709405

GALEON identifies gene clusters by studying the spatial distribution of pairwise physical distances among gene family members along with the genome-wide gene density.

GALEON can also be used to analyse the relationship between physical and evolutionary distances. It allows the simultaneous study of two gene families to explore putative co-evolution.

GALEON implements the Cst statistic, which measures the proportion of the genetic distance attributable to unclustered genes. Cst values are estimated separately for each chromosome (or scaffold), as well as for the whole genome data.

□ DNA walk of specific fused oncogenes exhibit distinct fractal geometric characteristics in nucleotide patterns

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602166v1

Fractal geometry and DNA walk representation were employed to investigate the geometric features i.e., self-similarity and heterogeneity in DNA nucleotide coding sequences of wild-type and mutated oncogenes, tumour-suppressor, and other unclassified genes.

The mutation-facilitated self-similar and heterogenous features were quantified by the fractal dimension and lacunarity coefficient measures. The geometrical orderedness and disorderedness in the analyzed sequences were interpreted from the combination of the fractal measures.
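For readers unfamiliar with the DNA walk representation, a minimal version looks like this. The purine-up, pyrimidine-down rule used here is one common convention and is an assumption; the digest does not say which mapping the authors used, and the fractal measures are not computed.

```python
def dna_walk(sequence):
    """Cumulative walk: step +1 for purines (A, G), -1 for pyrimidines (C, T)."""
    steps = {"A": 1, "G": 1, "C": -1, "T": -1}
    position = 0
    walk = [0]
    for base in sequence.upper():
        position += steps.get(base, 0)  # ambiguous bases such as N leave the walk flat
        walk.append(position)
    return walk


print(dna_walk("AGCT"))
```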

□ Mutational Constraint Analysis Workflow for Overlapping Short Open Reading Frames and Genomic Neighbours

>> https://www.biorxiv.org/content/10.1101/2024.07.07.602395v1

sORFs show a similar mutational background to canonical genes, yet they can contain a higher number of high impact variants.

This can have multiple explanations. It might be that these regions are not intolerant against loss-of-function variants or that these non-constrained sORFs do not encode functional microproteins.

This similarity in distribution does not provide sufficient evidence for a potential coding effect in sORFs, as it may be fully explainable probabilistically, given that synonymous and protein truncating variants have fewer opportunities to occur compared to missense variants.

sORFs are mostly embedded into a moderately constrained genomic context, but within the GENCODE dataset they identified a subset of highly constrained sORFs comparable to highly constrained canonical genes.

□ SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05853-z

SimSpliceEvol2 generates an output that comprises the gene sequences located at the leaves of the guide gene tree. The output also includes the transcript sequences associated with each gene at each node of the guide gene tree, by providing details about their exon content.

SimSpliceEvol2 also outputs all groups of orthologous transcripts. Moreover, SimSpliceEvol2 outputs the phylogeny for all the transcripts at the leaves of the guide tree. This phylogeny consists of a forest of transcript trees, describing the evolutionary history of transcripts.

□ d-Fulgor: Where the patterns are: repetition-aware compression for colored de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602727v1

The algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers.

d-Fulgor is a "horizontal" compression method which performs a representative/differential encoding of the color sets. The other scheme, m-Fulgor, is a "vertical" compression method which instead decomposes the color sets into meta and partial color sets.
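The representative/differential idea can be sketched with plain sets: store one representative color set in full, and for each similar set store only the colors in which it differs. How d-Fulgor actually picks representatives and encodes the differences is not described here, so the functions below are illustrative only.

```python
def encode_differential(color_set, representative):
    """Keep only the colors that differ from the representative (symmetric difference)."""
    return sorted(set(color_set) ^ set(representative))


def decode_differential(diff, representative):
    """Recover the original color set from the representative and the difference."""
    return sorted(set(diff) ^ set(representative))


representative = [1, 2, 3, 4, 5, 8]
color_set = [1, 2, 3, 5, 8, 9]
diff = encode_differential(color_set, representative)
print(diff)  # two integers stored instead of six
print(decode_differential(diff, representative) == color_set)
```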

□ MAGA: a contig assembler with correctness guarantee

>> https://www.biorxiv.org/content/10.1101/2024.07.10.602853v1

MAGA (Misassembly Avoidance Guaranteed Assembler), a model for structural correctness in de Bruijn graph based assembly. MAGA estimates the probability of misassembly for each edge in the de Bruijn graph.

When k-mer coverage is high enough for computing accurate estimates, MAGA produces assemblies as contiguous as those of a state-of-the-art assembler based on heuristic correction of the de Bruijn graph, such as tip and bulge removal.

□ SDAN: Supervised Deep Learning with Gene Annotation for Cell Classification

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603527v1

SDAN encodes gene annotations using a gene-gene interaction graph and incorporates gene expression as node attributes. It then learns gene sets such that the genes in a set share similar expression and are located close to each other in the graph.

SDAN combines gene expression data and gene annotations (gene-gene interaction graph) to learn a gene assignment matrix, which specifies the weights of each gene for all latent components.

SDAN uses the gene assignment matrix to reduce the gene expression data of each cell to a low-dimensional space and then makes predictions in the low-dimensional space using a feed-forward neural network.

□ Orthanq: transparent and uncertainty-aware haplotype quantification with application in HLA-typing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05832-4

Orthanq relies on the statistically accurate determination of posterior variant allele frequency (VAF) distributions of the known genomic variation each haplotype (HLA allele) is made of, while still enabling the use of local phasing information.

Orthanq can directly utilize existing pangenome alignments and type all HLA loci. By combining the posterior VAF distributions in a Bayesian latent variable model, Orthanq can calculate the posterior probability of each possible combination of haplotypes.

□ R2Dtool: Integration and visualization of isoform-resolved RNA features

>> https://www.biorxiv.org/content/10.1101/2022.09.23.509222v3

R2Dtool exploits the isoform-resolved mapping of RNA features, such as those obtained from long-read sequencing, to enable simple, reproducible, and lossless integration, annotation, and visualization of isoform-specific RNA features.

R2Dtool's core function liftover transposes the transcript-centric coordinates of the isoform-mapped sites to genome-centric coordinates.
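The coordinate arithmetic behind such a liftover can be sketched as follows. This is not R2Dtool's code or command-line interface; the 0-based, half-open exon intervals and the function name are assumptions made for the example.

```python
def transcript_to_genome(tx_pos, exons, strand):
    """Map a 0-based transcript position to a 0-based genomic position.

    exons: (start, end) genomic intervals, half-open, sorted by ascending start.
    strand: "+" or "-"; on "-" the transcript runs from the last exon backwards.
    """
    ordered = exons if strand == "+" else list(reversed(exons))
    remaining = tx_pos
    for start, end in ordered:
        length = end - start
        if remaining < length:
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length  # skip this exon and continue into the next one
    raise ValueError("position lies beyond the end of the transcript")


exons = [(100, 110), (200, 220)]  # hypothetical two-exon isoform
print(transcript_to_genome(10, exons, "+"))  # first base of the second exon
print(transcript_to_genome(0, exons, "-"))   # transcript start on the minus strand
```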

R2Dtool introduces isoform-aware metatranscript plots and metajunction plots to study the positional distribution of RNA features around annotated RNA landmarks.

□ Composite Hedges Nanopores: A High INDEL-Correcting Codec System for Rapid and Portable DNA Data Readout

>> https://www.biorxiv.org/content/10.1101/2024.07.12.603190v1

The Composite Hedges Nanopores (CHN) coding algorithm is tailored for rapid readout of digital information stored in DNA. CHN could independently accelerate the readout of stored DNA data with less physical redundancy.

The core of CHN's encoding process features constructing DNA sequences that are synthesis-friendly and highly resistant to indel errors, launching a different hash function to generate discrete values about the encoding message bits, previous bits, and index bits.

□ Genome-wide analysis and visualization of copy number with CNVpytor in igv.js

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae453/7715874

The CNVpytor track in igv.js provides enhanced functionality for the analysis and inspection of copy number variations across the genome.

CNVpytor and its corresponding track in igv.js provide a certain degree of standardization for inspecting raw data. In the future, developing a standard format for inspecting raw signals and converting outputs from various callers into such a format would be ideal.

□ Festem: Directly selecting cell-type marker genes for single-cell clustering analyses

>> https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(24)00173-5

Festem (feature selection by expectation maximization [EM] test) can accurately select clustering-informative genes before the clustering analysis and identify marker genes.

For each gene, Festem performs a statistical test to determine whether its expression is homogeneously distributed (not a marker gene) or heterogeneously distributed (a marker gene) and assigns a p value based on the chi-squared distribution.

Momentum.

2024-07-17 19:06:05 | Science News

(Art by megs)

□ COSMOS+: Modeling causal signal propagation in multi-omic factor space

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603538v1

COSMOS+ (Causal Oriented Search of Multi-Omics Space) connects data-driven analysis of multi-omic data with systematic integration of mechanistic prior knowledge interactions with factor weights resulting from the variance decomposition.

MOON (Meta-fOOtprint aNalysis for COSMOS) can generate mechanistic hypotheses, effectively connecting perturbations observed at the level of cells kinase receptors. Any receptor/kinase that shows a sign incoherence between its MOON score and the input score/measurement is pruned out.

□ Delphi: Deep Learning for Polygenic Risk Prediction

>> https://www.medrxiv.org/content/10.1101/2024.04.19.24306079v3

Delphi employs a transformer architecture to capture non-linear interactions. Delphi uses genotyping and covariate information to learn perturbations of mutation effect estimates.

Delphi can integrate up to hundreds of thousands of SNPs as input. Covariates were included as the first embedding in the sequence, and zero padding was used when necessary. The transformer's output was then mapped back into a vector the size of the number of input SNPs.

□ A BLAST from the past: revisiting blastp's E-value

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603405v1

Via extensive simulated draws from the null we show that, while generally reasonable, blastp's E-values can at times be overly conservative, while at others, alarmingly, they can be too liberal, i.e., blastp is inflating the significance of the reported alignments.

A significance analysis using a sample of size from the distribution of the maximal alignment score. Assessing how unlikely it is that their original maximal alignment score came from the same null sample, assuming that all scores were generated by a Gumbel distribution.
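The Gumbel assumption can be illustrated with a small sketch: fit a Gumbel distribution to null maximal scores by the method of moments, then ask how far into the tail an observed score falls. The null scores and the observed score below are invented, and the method-of-moments fit is a convenience for this example, not necessarily what the authors did.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Method-of-moments estimates (mu, beta) for a Gumbel (maximum) distribution."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.fmean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_tail(x, mu, beta):
    """P(S >= x) under Gumbel(mu, beta), i.e. 1 - exp(-exp(-(x - mu) / beta))."""
    return -math.expm1(-math.exp(-(x - mu) / beta))


null_scores = [31.0, 28.5, 35.2, 30.1, 29.4, 33.8, 27.9, 32.6]  # made-up null maxima
mu, beta = fit_gumbel_moments(null_scores)
print(gumbel_tail(45.0, mu, beta))  # tail probability of a hypothetical observed score
```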

□ RWRtoolkit: multi-omic network analysis using random walks on multiplex networks in any species

>> https://www.biorxiv.org/content/10.1101/2024.07.17.603975v1

RWRtoolkit wraps the RandomWalkRestartMH R package, which provides the core functionality to generate multiplex networks from a set of input network layers, and implements the Random Walk Restart algorithm on a supra-adjacency matrix.
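Random walk with restart on a single adjacency matrix can be sketched in a few lines of plain Python. This is not the RandomWalkRestartMH API (which is an R package) and it ignores the multiplex, supra-adjacency construction; the restart probability of 0.7 and the toy network are assumptions for the example.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 with W the column-normalised adjacency."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            walk = sum(
                adjacency[i][j] / col_sums[j] * p[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            nxt.append((1 - restart) * walk + restart * p0[i])
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:
            break
    return p


# Toy path network 0 - 1 - 2 - 3 with gene 0 as the only seed.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adjacency, seeds={0})
print([round(s, 4) for s in scores])
```

Nodes closer to the seed receive higher stationary probabilities, which is what makes the scores usable for ranking genes.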

RWRtoolkit provides commands to rank all genes in the overall network according to their connectivity, use cross-validation to assess the network's predictive ability or determine the functional similarity of a set of genes, and find shortest paths between sets of seed genes.

□ Unsupervised evolution of protein and antibody complexes with a structure-informed language model

>> https://www.science.org/doi/10.1126/science.adk8946

Inverse folding can interrogate protein fitness landscapes indirectly, without needing to explicitly model individual functional tasks or properties.

A hybrid autoregressive model integrates amino acid values and backbone structural information to evaluate the joint likelihood over all positions in a sequence.

Amino acids from the protein sequence are tokenized, combined with geometric features extracted from a structural encoder, and modeled with an encoder-decoder transformer. Sequences assigned high likelihoods represent high confidence in folding into the input backbone structure.

□ SmartImpute: A Targeted Imputation Framework for Single-cell Transcriptome Data

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603649v1

SmartImpute focuses on a predefined set of marker genes, enhancing the biological relevance and computational efficiency of the imputation process while minimizing the risk of model misspecification.

Utilizing a modified Generative Adversarial Imputation Network architecture, SmartImpute accurately imputes the missing gene expression and distinguishes between true biological zeros and missing values, preventing overfitting and preserving biologically relevant zeros.

□ Genomics-FM: Universal Foundation Model for Versatile and Data-Efficient Functional Genomic Analysis

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603653v1

Genomics-FM, a foundation model driven by genomic vocabulary tailored to enhance versatile and label-efficient functional genomic analysis. Genomic vocabulary, analogous to a lexicon in linguistics, defines the conversion of continuous genomic sequences into discrete units.

Genomics-FM constructs an ensemble genomic vocabulary that includes multiple vocabularies during pretraining, and selectively activates specific genomic vocabularies for the fine-tuning of different tasks via masked language modeling.

□ Nanotiming: telomere-to-telomere DNA replication timing profiling by nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602252v1

Nanotiming eliminates the need for cell sorting to generate detailed Replication Timing maps. It leverages the possibility of unambiguously aligning long nanopore reads at highly repeated sequences to provide complete genomic RT profiles, from telomere to telomere.

Nanotiming reveals that yeast telomeric RT regulator Rif1 does not directly delay the replication of all telomeres, as previously thought, but only of those associated with specific subtelomeric motifs.

□ MARCS: Decoding the language of chromatin modifications

>> https://www.nature.com/articles/s41576-024-00758-2

MARCS (Modification Atlas of Regulation by Chromatin States) offers a set of visualization tools to explore intricate chromatin regulatory circuits from either a protein-centred perspective or a modification-centred perspective.

The MARCS algorithm also identifies proteins with symmetrically opposite binding profiles, thereby expanding the selection to include factors with contrasting modification-driven responses. MARCS provides the complete set of co-regulated protein clusters.

□ Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03322-7

Panpipes is based on scverse. Panpipes has a modular design and performs ingestion, preprocessing, integration and batch correction, clustering, reference mapping, and spatial transcriptomics deconvolution with custom visualization of outputs.

Panpipes can process any single-cell dataset containing RNA, cell-surface proteins, ATAC, and immune repertoire modalities, as well as spatial transcriptomics data generated through the 10x Genomics’ Visium or Vizgen’s MERSCOPE platforms.

□ UCS: a unified approach to cell segmentation for subcellular spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.07.08.601384v1

UCS integrates accurate nuclei segmentation results from nuclei staining with the transcript data to predict precise cell boundaries, thereby significantly improving the segmentation accuracy. It offers a comprehensive perspective that enhances cell segmentation.

UCS employs a scaled softmask to maintain shape consistency with the nuclei, thereby preserving the morphological integrity of cells. UCS integrates marker gene information to enhance segmentation, ensuring that each nucleus is associated with the correct cell-type specific markers.

□ MPAQT: Accurate isoform quantification by joint short- and long-read RNA-sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.11.603067v1

MPAQT, a generative model that combines the complementary strengths of different sequencing platforms to achieve state-of-the-art isoform-resolved transcript quantification, as demonstrated by extensive simulations and experimental benchmarks.

MPAQT connects the latent abundances of the transcripts to the observed counts of the "observation units" (OUs). MPAQT infers the transcript abundances by Maximum A Posteriori estimation given the observed OU counts across all platforms, and experiment-specific model parameters.

□ HySortK: High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

>> https://arxiv.org/abs/2407.07718

HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. HySortK uses an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios.

HySortK uses flexible hybrid MPI and OpenMP parallelization. HySortK was integrated into a de novo long-read genome assembly workflow. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes.

HySortK significantly reduces the memory footprint, making a Bloom filter superfluous. HySortK switches to a more efficient radix sort algorithm that requires an auxiliary array for counting.
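Sorting-based k-mer counting is simple to demonstrate on one machine: encode each k-mer as an integer, radix-sort the integers using an auxiliary counting array, and count runs of equal values. The sketch below is a single-process toy with none of HySortK's MPI/OpenMP machinery, and the 2-bit encoding and function names are assumptions for the example.

```python
from itertools import groupby

ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(seq, k):
    """Pack every k-mer of seq into an integer, 2 bits per base (ACGT only)."""
    codes = []
    for i in range(len(seq) - k + 1):
        code = 0
        for base in seq[i:i + k]:
            code = (code << 2) | ENCODING[base]
        codes.append(code)
    return codes


def radix_sort(values, bits, radix_bits=8):
    """Least-significant-digit radix sort using a counting array per pass."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, bits, radix_bits):
        counts = [0] * (mask + 2)  # auxiliary array for counting
        for v in values:
            counts[((v >> shift) & mask) + 1] += 1
        for i in range(1, len(counts)):
            counts[i] += counts[i - 1]  # prefix sums give each digit's start offset
        output = [0] * len(values)
        for v in values:
            digit = (v >> shift) & mask
            output[counts[digit]] = v
            counts[digit] += 1
        values = output
    return values


def count_kmers(seq, k):
    """Count k-mers by sorting their codes and measuring runs of equal codes."""
    sorted_codes = radix_sort(encode_kmers(seq, k), bits=2 * k)
    return {code: sum(1 for _ in run) for code, run in groupby(sorted_codes)}


print(count_kmers("ACGACG", 3))
```

Because equal k-mers end up adjacent after sorting, no hash table is needed for the counting step.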

□ GPS-Net: discovering prognostic pathway modules based on network regularized kernel learning

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603645v1

Genome-wide Pathway Selection with Network Regularization (GPS-Net) extends bi-network regularization model to multiple-network and employs multiple kernel learning (MKL) for pathway selection.

GPS-Net reconstructs each network kernel with one Laplacian matrix, thereby transforming the pathway selection problem into a multiple kernel learning (MKL) process. By solving the MKL problem, GPS-Net identifies and selects kernels corresponding to specific pathways.

□ SIGURD: SIngle cell level Genotyping Using scRna Data

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603737v1

SIGURD (SIngle cell level Genotyping Using scRna Data), an R package designed to combine the genotyping information from both sVar and mtVar analysis from distinct genotyping tools and integrative analysis across distinct samples.

SIGURD provides a pipeline with all necessary steps for the analysis of genotyping data: candidate variant acquisition, pre-processing and quality analysis of scRNA-seq, cell-level genotyping, and representation of genotyping data in conjunction with the RNA expression data.

□ WeightedKgBlend: Weighted Ensemble Approach for Knowledge Graph completion improves performance

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603664v1

WeightedKgBlend, a weighted ensemble method for link prediction in knowledge graphs which combines the predictive capabilities of two types of Knowledge Graph completion methods: knowledge graph embedding and path-based reasoning.

WeightedKgBlend fuses the predictive capabilities of various embedding algorithms and a case-based reasoning model. WeightedKgBlend assigns zero weight to low-performing algorithms like TransE, DistMult, ComplEx and simple CBR.
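A weighted blend of per-model link scores can be sketched as below. The model names, scores and weights are made up, and how WeightedKgBlend actually learns its weights is not covered; the point is only that a zero weight removes a model from the final ranking.

```python
def blend_scores(model_scores, weights):
    """Weighted average of candidate scores across models.

    model_scores: {model: {candidate: score}}; weights: {model: non-negative weight}.
    """
    total = sum(weights[m] for m in model_scores)
    if total <= 0:
        raise ValueError("at least one model needs a positive weight")
    candidates = set()
    for scores in model_scores.values():
        candidates.update(scores)
    return {
        c: sum(weights[m] * scores.get(c, 0.0) for m, scores in model_scores.items()) / total
        for c in candidates
    }


# Hypothetical scores for two candidate links from three models.
model_scores = {
    "embedding_model": {"link_x": 0.8, "link_y": 0.2},
    "path_model": {"link_x": 0.4, "link_y": 0.6},
    "weak_model": {"link_x": 0.0, "link_y": 1.0},
}
weights = {"embedding_model": 0.5, "path_model": 0.5, "weak_model": 0.0}
blended = blend_scores(model_scores, weights)
print(max(blended, key=blended.get))
```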

□ TRGT-denovo: accurate detection of de novo tandem repeat mutations

>> https://www.biorxiv.org/content/10.1101/2024.07.16.600745v1

TRGT-denovo, a novel method for detecting DNMs in TR regions by integrating TRGT genotyping results with read-level data from family members. This approach significantly reduces the number of likely false positive de novo candidates compared to genotype-based de novo TR calling.

TRGT-denovo analyzes both the genotyping outcomes and reads spanning the TRs generated by TRGT. TRGT-denovo enables the quantification of variations exclusive to the child's data as potential DNMs. TRGT-denovo can detect both changes in TR length and compositional variations.

□ lr-kallisto: Long-read sequencing transcriptome quantification

>> https://www.biorxiv.org/content/10.1101/2024.07.19.604364v1

lr-kallisto demonstrates the feasibility of pseudoalignment for long reads; we show via a series of results on both biological and simulated data that lr-kallisto retains the efficiency of kallisto thanks to pseudoalignment, and is accurate on long-read data.

lr-kallisto is compatible with translated pseudoalignment. lr-kallisto can be used for transcript discovery. In particular, reads that do not pseudoalign with lr-kallisto can be assembled to construct contigs from unannotated, or incompletely annotated transcripts.

□ SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03298-4

SonicParanoid2 performs de novo orthology inference using a novel graph-based algorithm that halves the execution time with an AdaBoost classifier that avoids unnecessary alignments.

SonicParanoid2 conducts domain-based orthology inference using Doc2Vec neural network models. The clusters of orthologous genes from each species pair predicted by these algorithms are merged and input into the Markov cluster algorithm to infer the multi-species ortholog groups.

□ SpatialQC: automated quality control for spatial transcriptome data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae458/7720780

SpatialQC provides a one-click solution for automating quality assessment, data cleaning, and report generation. SpatialQC calculates a series of quality metrics, the spatial distribution of which can be inspected, in the QC report, for spatial anomaly detection.

SpatialQC performs quality comparison between tissue sections, allowing for efficient identification of questionable slices. It provides a set of adjustable parameters and comprehensive tests to facilitate informed parameterization.

□ ClusterMatch aligns single-cell RNA-sequencing data at the multi-scale cluster level via stable matching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae480/7723481

ClusterMatch, a stable match optimization model to align scRNA-seq data at the cluster level. ClusterMatch leverages the mutual correspondence by canonical correlation analysis (CCA) and multi-scale Louvain clustering algorithms to identify clusters with optimized resolutions.

ClusterMatch utilizes stable matching framework to align scRNA-seq data in the latent space while maintaining interpretability with overlapped marker gene set. ClusterMatch successfully balances global and local information, removing batch effects while conserving biological variance.
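Stable matching itself is the classic Gale-Shapley procedure, sketched here for clusters from two batches. The preference lists are invented; in practice they would be derived from some similarity between clusters, and how ClusterMatch builds them is not reproduced. Complete preference lists on both sides are assumed.

```python
def stable_matching(proposer_prefs, receiver_prefs):
    """Gale-Shapley: returns {proposer: receiver} with no blocking pair."""
    rank = {
        r: {p: i for i, p in enumerate(prefs)} for r, prefs in receiver_prefs.items()
    }
    next_choice = {p: 0 for p in proposer_prefs}
    engaged = {}  # receiver -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue  # p has proposed to everyone and stays unmatched
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p
        elif rank[r][p] < rank[r][current]:
            engaged[r] = p  # r prefers the new proposer
            free.append(current)
        else:
            free.append(p)
    return {p: r for r, p in engaged.items()}


# Hypothetical clusters a1, a2 (batch A) and b1, b2 (batch B).
proposer_prefs = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}
receiver_prefs = {"b1": ["a2", "a1"], "b2": ["a1", "a2"]}
matching = stable_matching(proposer_prefs, receiver_prefs)
print(matching)
```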

□ RawHash2: Mapping Raw Nanopore Signals Using Hash-Based Seeding and Adaptive Quantization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae478/7723993

RawHash2 uses a new quantization technique, adaptive quantization. RawHash2 improves the accuracy of chaining and subsequently read mapping. RawHash2 implements a more sophisticated chaining algorithm that incorporates penalty scores.

RawHash2 provides a filter that removes seeds frequently appearing in the reference genome. RawHash2 utilizes multiple features for making mapping decisions based on their weighted scores to eliminate the need for manual and fixed conditions to make decisions.

RawHash2 extends the hash-based mechanism to incorporate and evaluate the minimizer sketching technique, aiming to reduce storage requirements without significantly compromising accuracy.

□ GRIEVOUS: Your command-line general for resolving cross-dataset genotype inconsistencies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae489/7723992

GRIEVOUS (Generalized Realignment of Innocuous and Essential Variants Otherwise Utilized as Skewed), a command-line tool designed to ensure cross-cohort consistency and maximal feature recovery of biallelic SNPs across all summary statistic and genotype files of interest.

GRIEVOUS harmonizes an arbitrary number of user-defined genomic datasets. Each dataset is passed through realign, sequentially, and passed to merge to generate composite dataset level reports of all identified biallelic / inverted variants resulting from the realignment process.
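The kind of inconsistency being resolved can be shown with a toy check for one biallelic SNP: if a dataset lists the two alleles in the opposite order to the chosen reference orientation, the alleles are swapped and the effect sign flipped. This is a hedged illustration of the general idea, not GRIEVOUS's realign logic or its command-line interface; the function name and status labels are invented.

```python
def realign_variant(ref, alt, effect, target_ref, target_alt):
    """Orient one biallelic SNP to a target (ref, alt) pair.

    Returns (ref, alt, effect, status), with status one of
    "matched", "inverted" or "unresolved".
    """
    if (ref, alt) == (target_ref, target_alt):
        return ref, alt, effect, "matched"
    if (ref, alt) == (target_alt, target_ref):
        # Alleles are swapped: flip them and negate the effect estimate.
        return target_ref, target_alt, -effect, "inverted"
    return ref, alt, effect, "unresolved"


print(realign_variant("G", "A", 0.12, target_ref="A", target_alt="G"))
```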

□ Poincaré and SimBio: a versatile and extensible Python ecosystem for modeling systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae465/7723995

Poincaré and SimBio are novel Python packages for simulation of dynamical systems and chemical reaction networks (CRNs). Poincaré serves as a foundation for dynamical systems modelling, while SimBio extends this functionality to CRNs, including support for the Systems Biology Markup Language.

Poincaré allows one to define differential equation systems using variables, parameters and constants, and to assign rate equations to variables. For defining CRNs, SimBio builds on top of Poincaré, providing species and reactions that keep track of stoichiometries.

□ SAFER: sub-hypergraph attention-based neural network for predicting effective responses to dose combinations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05873-9

SAFER is a Sub-hypergraph Attention-based graph model that incorporates complex relationships among biological knowledge networks and considers dosing effects on subject-specific networks.

SAFER uses two-layer feed-forward neural networks to learn the inter-correlation between these data representations along with dose combinations and synergistic effects at different dose combinations.

□ Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05819-1

Multioviz integrates various variable selection methods to give users a wide choice of statistical approaches that they can use to generate relevant multi-level genomic signatures for their analyses.

Multioviz provides an intuitive approach to in silico hypothesis testing, even for individuals with less coding experience. Here, a user starts by inputting molecular data along with an associated phenotype to graphically visualize the relationships between significant variables.

□ Logan: Planetary-Scale Genome Assembly Surveys Life's Diversity

>> https://www.biorxiv.org/content/10.1101/2024.07.30.605881v1

Logan is a dataset of DNA and RNA sequences. It has been constructed by performing genome assembly over a December 2023 freeze of the entire NCBI Sequence Read Archive, which at the time contained 50 petabases of public raw data.

Two related sets of assembled sequences are released: unitigs and contigs. Unitigs preserve nearly all the information present in the original sample, whereas contigs get rid of sequencing errors and biological variation for the benefit of increased sequence length.

□ MAMS: matrix and analysis metadata standards to facilitate harmonization and reproducibility of single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03349-w

MAMS (the matrix and analysis metadata standards) captures the relevant information about the data matrices and annotations that are produced during common and complex analysis workflows for single-cell data.

MAMS defines fields that describe what type of data is contained within a matrix, relationships between matrices, and provenance related to the tool or algorithm that created the matrix.

An Emperor's Jewel: The Making of the Bulgari Hotel Roma

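2024-07-17 01:01:01 | Film

『An Emperor's Jewel: The Making of the Bulgari Hotel Roma』 A documentary on the opening of the Bulgari Hotel Roma, built alongside the restoration of the Mausoleum of Augustus in Rome. It conveys the excellence of a top global brand. From the building materials to the smallest interior details, one senses a respect for the artisanal craftsmanship handed down from ancient Rome.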

With the film AN EMPEROR'S JEWEL @Bulgariofficial celebrates its new hotel in Rome https://t.co/yHDW5M400N
The film draws parallels between hotel construction and #Bvlgari’s craftsmanship, emphasizing Italian heritage
Patronage: #Altagamma, @mimit_gov and @Roma @priyankachopra pic.twitter.com/gOVjrH4gJ9
— ALTAGAMMA (@Altagamma_it) July 15, 2024

2024 Atomic
Directed by Andrea Rovetta
Artist Consultant by Claudio Prizio
Cinematography by Antonio De Rosa / Valeiro Martorelli
Music by Valerio Vigliar / Roberto Procaccini
Cast: Priyanka Chopra Jonas

PRAANA / “Summer Solstice”

2024-07-16 20:09:57 | Music20

□ PRAANA / “Summer Solstice”

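Deep progressive house that feels like floating through a summer sunset sky. This is the fifth of Praana's seasonal mixes; the BPM is higher than before, but its strong new-age coloring means it also works as chill-out.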

□ Caes / “One More Day”

Release Date: 12/07/2024
Label: Colorize / Enhanced Music

2024 July Mix.

2024-07-14 03:11:34 | Music20

□ 2024 July Mix (Calm, Ambient, Electronica, Post-Classic.)

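For July's playlist, I chose tracks to evoke mornings where dazzling midsummer sunlight mingles with a cool breeze, and evenings just after the rain.

□ Apple Music version

□ YouTube version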

>> tracklisting.

Aerian / “Lost in the Clouds”
Blank & Jones / “Azure Blue”
AETSRAL / “Freedom”
STARS AS SIGNALS / “We are Stars”
The Ambientalist / “Like Heaven”
Spark030 / “Afterlife”
Kiasmos / “Grown”
De-Tü & Congi / “Shallow Waters”
Enigma / “Traces (Light And Weight)”
Vince Staples / “Nothing Matters”
bt / “Good Evening Mrs. Lovelace”
Goldmund / “Light and Shade”
Max Richter / “The Poetry of Earth (Geophony)”

Swim Cap.

2024-07-13 16:57:39 | Photo

□ Swim Cap.

PANTONE
16-4526 TCX
Swim Cap
FHI Cotton TCX
Lab 62.33 -21.67 -32.84
SRGB 56 164 208
HEX 38A4D0

□ Orange Tiger.

PANTONE
16-1358 TCX
Orange Tiger
FHI Cotton TCX
Lab 63.52 55.14 70.25
SRGB 249 104 21
HEX F96815

Color palettes created by PANTONE Studio.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.