lens, align.

Long is the time, but the true comes to pass.

Titanium.

2023-09-30 21:19:39 | Science News




□ starTracer: An Accelerated Approach for Precise Marker Gene Identification in Single-Cell RNA-Seq Analysis

>> https://www.biorxiv.org/content/10.1101/2023.09.21.558919v1

starTracer seamlessly accepts input in various formats, including Seurat objects, sparse expression matrices with annotation tables, or an average expression matrix for each cell type.

starTracer provides an option to search for marker genes among highly variable genes only, which further increases the calculation speed. A non-redundant matrix of marker genes is presented according to the number of marker genes in each cluster.





□ veloVI: Deep generative modeling of transcriptional dynamics for RNA velocity analysis in single cells

>> https://www.nature.com/articles/s41592-023-01994-w

veloVI (velocity variational inference), a deep generative model for estimating RNA velocity. veloVI reformulates the inference of RNA velocity via a model that shares information between all cells and genes, while learning the same quantities, namely kinetic parameters and latent time.

veloVI returns an empirical posterior distribution: a matrix of cells by genes by posterior samples. veloVI highlights cell states that are estimated with high uncertainty, which adds a notion of confidence to the velocity stream and flags the corresponding regions of the phenotypic manifold.





□ Divide-and-conquer quantum algorithm for hybrid de novo genome assembly of short and long reads

>> https://www.biorxiv.org/content/10.1101/2023.09.19.558544v1

Due to the path conflicts brought by repetitive sequences and sequencing errors, it is not feasible to directly determine an Eulerian path within the de Bruijn graph that faithfully reconstructs the original sequences.

A hybrid assembly quantum algorithm using high-accuracy short reads and error-prone long reads. It integrates short reads from next-generation sequencing technology and long reads from third-generation sequencing technology to address assembly path conflicts.

Using simulations of 10-qubit quantum computers, the algorithm addresses problems as large as 140 qubits, yielding optimal assembly results. The convergence speed is significantly improved via the problem-inspired ansatz based on the known information about the assembly problem.

This algorithm builds upon the variational quantum eigensolver (VQE) and utilizes divide-and-conquer strategies to approximate the ground state of a larger Hamiltonian while conserving quantum resources.
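The classical problem underneath can be sketched in a few lines. This is a minimal illustration, assuming error-free reads and no repeated (k-1)-mers, of building a de Bruijn graph and walking an Eulerian path with Hierholzer's algorithm; it is not the quantum algorithm from the preprint, and the function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Each distinct k-mer becomes one edge from its prefix to its suffix (k-1)-mer."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in sorted(nodes)
                  if len(graph.get(n, [])) - in_deg.get(n, 0) == 1), min(nodes))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

print(assemble(["ATGCG", "GCGTAC"], 3))
```

As soon as a (k-1)-mer repeats, several Eulerian paths exist and the walk above is no longer unique, which is the path conflict that the long reads are meant to resolve.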





□ CellPolaris: Decoding Cell Fate through Generalization Transfer Learning of Gene Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2023.09.25.559244v1

CellPolaris, a computational system that leverages transfer learning algorithms. It diverges from conventional GRN inference models, which rely heavily on integrating epigenomic data with transcriptomic information or adopt causal strategies through gene co-expression networks.

CellPolaris uses the transfer network to analyze single-cell transcriptomic data in the development or differentiation process and a Probabilistic Graphical Model (PGM) to predict the impact of TF perturbations on cell fate.





□ Finding related sequences by a simple sum over alignments

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559458v1

The simplest possible change to standard alignment sums the probabilities of alternative alignments. It is easy to use in typical sequence-search software. It is also easy to calculate the probability of an equal or higher score between random sequences, based on a clear conjecture.

This method is a variant of "hybrid alignment". Hybrid alignment has been neglected; the model produces different alignments with different probabilities. The method generalizes to different kinds of alignment, e.g. DNA-versus-protein with frameshifts.
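A toy sketch of the "max becomes sum" idea follows. It assumes a single-state local alignment recursion with a linear gap cost and made-up likelihood ratios (match 4.0, mismatch 0.25, gap 0.1); the preprint's actual model and parameters may differ.

```python
def local_alignment_ratio(a, b, combine, match=4.0, mismatch=0.25, gap=0.1):
    """Combine likelihood ratios of local alignments of a and b.

    combine=max keeps only the best single alignment (standard alignment);
    combine=sum adds up all alternative alignments.
    """
    rows, cols = len(a) + 1, len(b) + 1
    x = [[0.0] * cols for _ in range(rows)]
    ends = []  # values of alignments that end with an aligned pair
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Either start a new alignment here (1.0) or extend the diagonal.
            ends_in_pair = pair * combine([1.0, x[i - 1][j - 1]])
            x[i][j] = combine([ends_in_pair, gap * x[i - 1][j], gap * x[i][j - 1]])
            ends.append(ends_in_pair)
    return combine(ends) if ends else 0.0

best = local_alignment_ratio("ACGT", "ACGT", max)
total = local_alignment_ratio("ACGT", "ACGT", sum)
print(best, total)
```

The only change between the two calls is the combining function, which is why the approach slots into existing alignment code so easily.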





□ scBridge embraces cell heterogeneity in single-cell RNA-seq and ATAC-seq data integration

>> https://www.nature.com/articles/s41467-023-41795-5

scBridge models the discriminability and confidence of scATAC-seq cells with a Gaussian Mixture. scBridge achieves accurate scRNA-seq and scATAC-seq data integration, as well as label transfer with heterogeneous transfer learning.

scBridge uses the deep neural encoder and classifier. scBridge computes the ATAC prototypes as the weighted average of scATAC-seq cells with the same predicted cell type and aligns them with the RNA prototypes to achieve integration.





□ Uncertainty-aware single-cell annotation with a hierarchical reject option

>> https://www.biorxiv.org/content/10.1101/2023.09.25.559294v1

Hierarchical annotation, in comparison to flat annotation, leads to fewer label rejections under full rejection, and the rejections are less severe under partial rejection. Consequently, if a reject option is implemented, hierarchical annotation proves to be the superior method.

With greedy label assignment, only the path with the highest probability scores in the hierarchy is followed. With non-greedy label assignment, all possible prediction paths are traversed and only the end score is considered for the final label assignment.
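The difference between the two assignment strategies can be shown on a tiny label tree. This sketch assumes each edge carries a conditional probability and that the "end score" of a path is the product of its edge probabilities; the tree, labels, and numbers are invented for illustration.

```python
def greedy_path(tree):
    """Follow the most probable child at every level."""
    path, score = [], 1.0
    while tree:
        label, (p, subtree) = max(tree.items(), key=lambda kv: kv[1][0])
        path.append(label)
        score *= p
        tree = subtree
    return path, score

def best_full_path(tree, prefix=(), score=1.0):
    """Traverse every root-to-leaf path and keep the one with the best end score."""
    if not tree:
        return list(prefix), score
    candidates = [best_full_path(sub, prefix + (label,), score * p)
                  for label, (p, sub) in tree.items()]
    return max(candidates, key=lambda c: c[1])

# Hypothetical hierarchy: label -> (conditional probability, children)
tree = {
    "immune": (0.6, {"T cell": (0.55, {}), "B cell": (0.45, {})}),
    "stromal": (0.4, {"fibroblast": (0.9, {}), "endothelial": (0.1, {})}),
}
print(greedy_path(tree))
print(best_full_path(tree))
```

Here the greedy walk commits to "immune" at the first level, while the exhaustive traversal prefers "stromal / fibroblast" because its end score is higher.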





□ PG-SGD: Pangenome graph layout by Path-Guided Stochastic Gradient Descent

>> https://www.biorxiv.org/content/10.1101/2023.09.22.558964v1

PG-SGD (Path-Guided Stochastic Gradient Descent) moves pairs of nodes in parallel, applying a modified HOGWILD! strategy. The algorithm computes the pangenome graph layout that best reflects the nucleotide sequences in the graph.

PG-SGD stores node coordinates in a vector of atomic doubles. PG-SGD can be extended to any number of dimensions. It can be seen as a graph embedding algorithm that converts high-dimensional, sparse pangenome graphs into low-dimensional, dense, and continuous vector spaces.
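A single-threaded, one-dimensional sketch of the pairwise update follows. It assumes one path through the graph, a weight of 1/d² per pair, and a simple linear annealing schedule; the real method works in 2D on node endpoints and updates pairs in parallel, so treat the names and schedule here as hypothetical.

```python
import random

def pair_update(pos, i, j, target, eta):
    """Move nodes i and j so their layout distance approaches `target`."""
    diff = pos[i] - pos[j]
    dist = abs(diff)
    unit = diff / dist if dist else 1.0
    mu = min(eta / (target * target), 1.0)  # weight 1/d^2, step size capped at 1
    shift = mu * (dist - target) / 2.0 * unit
    pos[i] -= shift
    pos[j] += shift

def layout_1d(node_lengths, iterations=2000, seed=0):
    """Lay out the nodes of one path so distances follow nucleotide offsets."""
    rng = random.Random(seed)
    offsets, total = [], 0
    for length in node_lengths:
        offsets.append(total)
        total += length
    pos = [rng.uniform(0, total) for _ in node_lengths]
    for t in range(iterations):
        eta = total * total * (1.0 - t / iterations)  # assumed annealing schedule
        i, j = rng.sample(range(len(pos)), 2)
        pair_update(pos, i, j, abs(offsets[i] - offsets[j]), eta)
    return pos

print(layout_1d([3, 5, 2, 7, 4]))
```

Each update touches only two coordinates, which is what makes lock-free parallel updates in the HOGWILD! style plausible.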





□ DeepCCI: a deep learning framework for identifying cell-cell interactions from single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad596/7281356

DeepCCI, a graph convolutional network (GCN) based deep learning framework for CCI identification. DeepCCI learns an embedding function that jointly projects cells into a shared embedding space using Autoencoder (AE) and GCN.

DeepCCI predicts intercellular crosstalk between any pair of clusters. It captures the essential hidden information of cells and makes full use of the topological relationships. DeepCCI determines the number of clusters before clustering, using the Louvain algorithm.





□ GPFN: Prior-Data Fitted Networks for Genomic Prediction

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558648v1

A Genomic Prior-Data Fitted Network (GPFN), a new paradigm for GP. GPFNs perform amortized Bayesian inference by drawing hundreds of thousands or millions of synthetic breeding populations during the prior fitting phase.

GPFN fits the prior using a transformer model with 12 layers, an internal dimensionality of 2048, a hidden layer size of 2048, and a single attention head. Overfitting is no longer an issue, as training data is practically infinite.





□ LEOPARD: Missing view completion for multi-timepoints omics data via representation disentanglement and temporal knowledge transfer

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559302v1

LEOPARD (missing view completion for multi-timepoints omics data via representation disentanglement and temporal knowledge transfer) extends representation disentanglement and style transfer techniques to the application of missing view completion in longitudinal omics data.

LEOPARD factorizes omics data from different timepoints into omics-specific content and timepoint-specific knowledge via contrastive learning. The generator learns mappings between two views, while temporal knowledge is injected into the content representation via the AdaIN operation.





□ Spacia: Mapping Cell-to-cell Interactions from Spatially Resolved Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2023.09.18.558298v1

Spacia, a Bayesian framework to detect cell-cell communication (CCC) from SRT data, by fully exploiting their unique spatial modality, which dramatically increased the accuracy of the detection of CCC.

Spacia uses cell-cell proximity as a constraint and prioritizes cell-cell interactions that cause a downstream change. Spacia employs multi-instance learning (MIL) to assess CCC. Relying on spatial information lets Spacia minimize the number of assumptions and arbitrary parameters.





□ Spectra: Supervised discovery of interpretable gene programs from single-cell data

>> https://www.nature.com/articles/s41587-023-01940-3

Spectra (supervised pathway deconvolution of interpretable gene programs) receives a gene expression count matrix with cell-type labels for each cell as well as predefined gene sets, which it converts to a gene–gene graph.

Spectra fits a factor analysis model using a loss function that optimizes reconstruction of the count matrix and guides factors to support the input gene–gene graph. Spectra provides factor loadings and gene programs corresponding to cell types and cellular processes.





□ TAGET: a toolkit for analyzing full-length transcripts from long-read sequencing

>> https://www.nature.com/articles/s41467-023-41649-0

TAGET uses polished high-quality transcripts in fasta format as input (Fig. 1) for full-length transcriptome analysis. Following the Iso-seq data analysis protocol, TAGET only considers transcripts supported by at least two circular consensus sequences (CCS).

TAGET aligns transcripts to the reference genome by integrating alignment results from long and short reads and improves splice site prediction using a convolutional neural network. TAGET annotates transcripts by comparing them with reference databases and classifies them into seven classes.





□ HQAlign: Aligning nanopore reads for SV detection using current-level modeling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad580/7280145

HQAlign is a hybrid mechanism with two steps of alignment. In the initial alignment step, the reads are aligned onto the genome in the nucleotide space using minimap2 to determine the region of interest where a read can possibly align.

In the hybrid step, the read is realigned to the region of interest on the genome in the quantized space. The read-to-genome alignment is maintained without dropping the frequently occurring seed matches, while the error biases are taken into account through quantized sequences.





□ skani: Fast and robust metagenomic sequence comparison through sparse chaining

>> https://www.nature.com/articles/s41592-023-02018-3

skani is a program for calculating average nucleotide identity (ANI) from DNA sequences (contigs/MAGs/genomes). skani uses an approximate mapping method without base-level alignment to get ANI. This allows for sequence identity estimation using k-mers on only the shared regions between two genomes avoiding the pitfalls of alignment-ignorant sketching methods.





□ scPipe: An extended preprocessing pipeline for comprehensive single-cell ATAC-Seq data integration in R/Bioconductor

>> https://www.biorxiv.org/content/10.1101/2023.09.25.559230v1

scPipe takes FASTQ files as input. The reads are demultiplexed based on quality and Ns, aligned to the reference genome, and filtered based on various quality metrics such as mapping rate, fraction of reads mapping and the number of duplicate or high-quality reads.





□ CellOT: Learning single-cell perturbation responses using neural optimal transport

>> https://www.nature.com/articles/s41592-023-01969-x

CellOT, a new approach that predicts perturbation responses of single cells by directly learning and uncovering maps between control and perturbed cell states, thus explicitly accounting for heterogeneous subpopulation structures in multiplexed molecular readouts.

CellOT models cell responses as deterministic trajectories. CellOT learns an optimal transport map for each perturbation in a fully parameterized and highly scalable manner. CellOT parameterizes a pair of dual potentials with input convex neural networks.





□ imply: improving cell-type deconvolution accuracy using personalized reference profiles

>> https://www.biorxiv.org/content/10.1101/2023.09.27.559579v1

imply can utilize personalized reference panels to precisely deconvolute cell type proportions using longitudinal or repeatedly measured data. It borrows information across the repeatedly measured transcriptome samples within each subject to recover personalized reference panels.

imply utilizes support vector regression within a mixed-effect modeling framework to retrieve personalized reference panels, based on subjects’ phenotypical information. Then, it uses the recovered personalized reference panels to estimate cell type proportions.





□ MuDCoD: Multi-Subject Community Detection in Personalized Dynamic Gene Networks from Single Cell RNA Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad592/7281355

MuDCoD (Multi-subject Dynamic Community Detection), infers gene communities per subject and per time point by extending the temporal smoothness assumption to the subject dimension.

MuDCoD builds on spectral clustering and promotes information sharing among the networks of the subjects as well as the networks at different time points. It clusters genes in the personalized dynamic gene networks and reveals gene communities that vary across time and subjects.





□ Bering: joint cell segmentation and annotation for spatial transcriptomics with transferred graph embeddings

>> https://www.biorxiv.org/content/10.1101/2023.09.19.558548v1

Bering, a graph deep learning model that leverages transcript colocalization relationships for joint noise-aware cell segmentation and molecular annotation in 2D and 3D spatial transcriptomics data.

The prediction outcome is binary classification, indicating whether the edges connect intercellular or intracellular spots. Molecular connectivity graphs are then constructed, and community detection algorithms such as Leiden Clustering are employed to identify cell borders.





□ CoalNN: Inference of coalescence times and variant ages using convolutional neural networks

>> https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msad211/7279051

CoalNN uses a simulation-trained convolutional neural network (CNN) to jointly predict pairwise TMRCAs and recombination breakpoints, and further utilizes these predictions to estimate the age of genomic variants.

CoalNN is trained through simulation and can be adapted to varying parameters, such as demographic history, using transfer learning. CoalNN remains computationally efficient when applied to pairwise TMRCA inference, improving upon optimized coalescent Hidden Markov Models.





□ HAVAC: An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558701v1

Hardware Accelerated single segment Viterbi Additional Coprocessor (HAVAC), an FPGA-based hardware accelerator. The core HAVAC kernel calculates the SSV matrix at 1739 GCUPS on a Xilinx Alveo U50 FPGA accelerator card, ~227x faster than the optimized SSV implementation in nhmmer.





□ GammaGateR: semi-automated marker gating for single-cell multiplexed imaging

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558645v1

GammaGateR provides estimates of marker-positive cell proportions and soft clustering of marker-positive cells. The model incorporates user-specified constraints that provide a consistent but slide-specific model fit.





□ Phylociraptor - A unified computational framework for reproducible phylogenomic inference

>> https://www.biorxiv.org/content/10.1101/2023.09.22.558970v1

Phylociraptor (the rapid phylogenomic tree calculator) performs all steps of typical phylogenomic workflows from orthology inference, MSA, trimming, and concatenation, to gene tree, supermatrix- and species tree reconstructions, complemented with various filtering steps.

Phylociraptor is organised into separate modules, which are executed consecutively, following the principal stages of phylogenomic analyses. phylociraptor align creates MSAs using MAFFT, Clustal Omega and MUSCLE for each gene.





□ BaRDIC: robust peak calling for RNA-DNA interaction data

>> https://www.biorxiv.org/content/10.1101/2023.09.21.558815v1

BaRDIC (Binomial RNA-DNA Interaction Caller), a tool that utilizes a binomial model to identify genomic regions significantly enriched in RNA-chromatin interactions, or "peaks", in All-To-All and One-To-All data.





□ GIA: A genome interval arithmetic toolkit for high performance interval set operations

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558707v1

GIA (Genomic Interval Arithmetic) and BEDRS, a novel command-line tool and a Rust library that significantly enhance the performance of genomic interval analysis.

Internally, both forms are treated as numeric, but during named serialization, gia calculates and stores a thin mapping of chromosome names to numeric indices - drastically reducing memory requirements and runtimes in most genomic interval contexts.
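The name-to-index idea can be illustrated with a small interval merge. This is a sketch in Python rather than Rust, assuming half-open BED-style intervals; it does not reproduce gia's command line or the BEDRS API, and the function names are hypothetical.

```python
def build_name_map(names):
    """Assign each chromosome name a small integer in order of first appearance."""
    mapping = {}
    for name in names:
        mapping.setdefault(name, len(mapping))
    return mapping

def merge_intervals(records):
    """Merge overlapping or touching intervals given as (chrom, start, end)."""
    records = list(records)
    name_to_idx = build_name_map(r[0] for r in records)
    idx_to_name = {i: n for n, i in name_to_idx.items()}
    # All sorting and comparison happens on integers only.
    numeric = sorted((name_to_idx[c], s, e) for c, s, e in records)
    merged = []
    for chrom, start, end in numeric:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    # Names are restored only when the result is serialized.
    return [(idx_to_name[c], s, e) for c, s, e in merged]

bed = [("chr1", 10, 20), ("chr2", 5, 8), ("chr1", 15, 30), ("chr1", 40, 50)]
print(merge_intervals(bed))
```

Keeping the map thin (one entry per chromosome) means the per-interval records never carry strings, which is where the memory saving comes from.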





□ GET: a foundation model of transcription across human cell types

>> https://www.biorxiv.org/content/10.1101/2023.09.24.559168v1

GET (the general expression transformer) shows remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions, and uncovering universal and cell type specific transcription factor interaction networks.

GET learns transcriptional regulatory syntax from chromatin accessibility data across hundreds of diverse cell types. GET offers zero-shot prediction of reporter assay readout in new cell types, which makes it a candidate prescreening tool for cell type specific regulatory elements.





□ Modeling and interpretation of single-cell proteogenomic data

>> https://arxiv.org/abs/2308.07465

Single-cell proteogenomics will help connect single-cell genomics with the numerous post-transcriptional mechanisms - such as dynamically regulated protein synthesis, degradation, translocation, and post-translational modifications - that shape cellular phenotypes.





□ Single-cell lineage capture across genomic modalities with CellTag-multi reveals fate-specific gene regulatory changes

>> https://www.nature.com/articles/s41587-023-01931-4

An in situ reverse transcription (isRT) step is used to selectively reverse transcribe CellTag barcodes inside intact nuclei. The CellTag construct is modified to flank the random barcode with Nextera Read 1 and Read 2 adapters.

Direct lineage reprogramming presents a unique paradigm of cell identity conversion, with cells often transitioning through progenitor-like states or acquiring off-target identities. CellTag-multi identifies the distinct iEP reprogramming trajectories.





□ intNMF: Scalable joint non-negative matrix factorisation for paired single cell gene expression and chromatin accessibility data

>> https://www.biorxiv.org/content/10.1101/2023.09.25.559293v1

intNMF implements the accelerated hierarchical alternating least squares (acc-HALS) method, which they modified to jointly factorise two matrices. HALS is a block coordinate descent method where the optimisation problem is broken up into smaller sub-problems.

For the RNA modality, preprocessing typically involves library-size normalisation followed by a log(x + 1) transformation; for the ATAC modality, the data are transformed with Term Frequency-Inverse Document Frequency (TF-IDF). The TF-IDF transformation is implemented in the intNMF package.
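Both transformations are short enough to write out. The sketch below assumes a scale factor of 10,000 for the RNA counts and one common TF-IDF variant (term frequency times log(1 + N / document frequency)); the intNMF package may use a different variant, so the formulas are assumptions.

```python
import math

def log_normalise(counts, scale=1e4):
    """Library-size normalisation followed by log(x + 1), one row per cell."""
    out = []
    for cell in counts:
        total = sum(cell)
        out.append([math.log1p(x / total * scale) if total else 0.0 for x in cell])
    return out

def tf_idf(counts):
    """TF-IDF for a cells x peaks count matrix (assumed variant)."""
    n_cells = len(counts)
    n_peaks = len(counts[0])
    doc_freq = [sum(1 for cell in counts if cell[j] > 0) for j in range(n_peaks)]
    idf = [math.log(1 + n_cells / d) if d else 0.0 for d in doc_freq]
    out = []
    for cell in counts:
        total = sum(cell)
        out.append([(x / total) * idf[j] if total else 0.0
                    for j, x in enumerate(cell)])
    return out

atac = [[1, 0, 3], [0, 0, 4]]
print(tf_idf(atac))
```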





□ compleasm: a faster and more accurate reimplementation of BUSCO

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad595/7284108

compleasm, an efficient tool for assessing the completeness of genome assemblies. Compleasm utilizes the miniprot protein-to-genome aligner and the conserved orthologous genes from BUSCO.

A complete gene is considered to have a single-copy in the assembly if it only has one alignment, or duplicated if it has multiple alignments. Compleasm reports the proportion of genes falling into each of the four categories as the assessment of assembly completeness.
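The classification step can be sketched as follows, assuming the four categories follow the BUSCO convention (single-copy, duplicated, fragmented, missing) and that an alignment counts as complete when it covers at least 90% of the protein. Both the category names and the threshold are assumptions, not values taken from the paper.

```python
from collections import Counter

CATEGORIES = ("single", "duplicated", "fragmented", "missing")

def classify_gene(alignment_fractions, complete_threshold=0.9):
    """Classify one gene from the covered fraction of each of its alignments."""
    complete = [f for f in alignment_fractions if f >= complete_threshold]
    if len(complete) == 1:
        return "single"
    if len(complete) > 1:
        return "duplicated"
    if alignment_fractions:
        return "fragmented"
    return "missing"

def completeness_report(gene_alignments):
    """Proportion of genes in each category."""
    counts = Counter(classify_gene(a) for a in gene_alignments.values())
    n = len(gene_alignments)
    return {c: counts.get(c, 0) / n for c in CATEGORIES}

genes = {"g1": [0.98], "g2": [0.95, 0.97], "g3": [0.4], "g4": []}
print(completeness_report(genes))
```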





□ GeneCompass: Deciphering Universal Gene Regulatory Mechanisms with Knowledge-Informed Cross-Species Foundation Model

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559542v1

GeneCompass, a knowledge-informed cross-species foundation model pre-trained on scCompass-126M, currently the largest corpus of its kind, encompassing over 120 million single-cell transcriptomes.

Inspired by self-supervised learning in natural language processing (NLP) domain, GeneCompass employs the masked language modeling (MLM) strategy to randomly mask gene tokens.
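Random masking of gene tokens is simple to sketch. The 15% mask rate and the "[MASK]" token string are borrowed from the BERT convention and are assumptions here; the preprint's actual settings may differ.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; return the masked list and the targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for p in positions:
        labels[p] = masked[p]   # what the model has to predict
        masked[p] = mask_token  # what the model gets to see
    return masked, labels

cell = ["GENE%d" % i for i in range(20)]  # hypothetical gene tokens for one cell
masked, labels = mask_tokens(cell)
print(masked)
print(labels)
```

During pre-training the model sees `masked` and is scored on how well it recovers the entries of `labels`.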





□ SeqVerify: A quality-assurance pipeline for whole-genome sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.09.27.559766v1

SeqVerify, a computational pipeline designed to take raw WGS data and a list of intended edits, and verify that the edits are present and that there are no abnormalities.

SeqVerify operates on three main types of input for the majority of its results. These are paired-end short-read sequencing data, a reference genome to align this data to, and a list of "markers" - untargeted or targeted sequences.





□ StableMate: a new statistical method to select stable predictors in omics data

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559658v1

StableMate, a flexible regression and variable selection framework based on the recent theoretical development of stabilised regression. Stabilised regression considers data collected from different 'environments', i.e. technical or biological conditions.





□ Integrated DNA Technologies Reveals Launch of xGen™ Products for Ultima Genomics

>> https://sg.idtdna.com/pages/about/news/2023/09/26/integrated-dna-technologies-reveals-launch-of-xgen-products-for-ultima-genomics

xGen Universal Blocking Oligos for Ultima Genomics: proprietary blockers designed specifically for the platform’s native adapters, to reduce non-specific adapter interaction during probe hybridization and increase on-target capture performance.





□ Invest in Estonia

>> https://investinestonia.com/estonian-space-startup-kappazetta-leaps-forward-with-additional-funding/

🛰️ Estonian space technology startup KappaZeta has secured a new funding round to further help farmers and has set its sights on forest carbon stock assessment.






John Wick: Chapter 4

2023-09-27 00:06:36 | Movies

□ 『John Wick: Chapter 4』

(Lions Gate / 2023)
Directed by Chad Stahelski
Written by Shay Hatten / Michael Finch
Based on characters created by Derek Kolstad
Music by Tyler Bates / Joel J. Richard
Cinematography by Dan Laustsen

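Code and bonds, rules and consequences. What separates the two is not "rightness" but "power". An order that has lost its control shifts toward equilibrium by mechanical necessity. Killing is a process carried out in solemn silence, and its outcome is always inevitable. "Die, and you are a martyr; live, and you are a traitor." "Then what are you?" "One who grieves."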

STARDUST.

2023-09-19 21:37:39 | Science News

We are born of stardust, gather stardust, and return to stardust. We are a furnace stoked with the remaining embers, fragments of sooty information, with no means to decipher the context in which we ourselves are written.

Yet, we can see. Just as rubbing bones transport blood and flesh, amidst the silence of rocks and scorched gases beyond the canopy, there exists a single chain that binds us.





□ CodonBERT: Large Language Models for mRNA Design and Optimization

>> https://www.biorxiv.org/content/10.1101/2023.09.09.556981v1

CodonBERT, an LLM which extends the BERT model and applies it to the language of mRNAs. CodonBERT uses a multi-head attention transformer architecture framework. The pre-trained model can also be generalized to a diverse set of supervised learning tasks.

CodonBERT is pre-trained using 10 million mRNA coding sequences spanning an evolutionarily diverse set of organisms. CodonBERT takes the coding region as input using codons as tokens, and outputs an embedding that provides contextual codon representations.





□ scAce: an adaptive embedding and clustering method for single-cell gene expression data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad546/7261512

scAce constructs a VAE network to learn smoother low-dimensional embeddings compared with those methods based on traditional autoencoders. It utilizes a data-adaptive clustering approach based on the idea of cluster merging.

scAce iteratively performs network update and cluster merging based on the initial VAE network. scAce decides if a pair of clusters should be merged into a single cluster by comparing inter-cluster and intra-cluster distances.
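One way to picture the merge decision is a simple distance test. The rule below (merge when the centroid distance is smaller than the sum of the two mean within-cluster radii) is a hypothetical stand-in, not the criterion defined in the scAce paper.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def mean_radius(points, center):
    """Average distance of the points to their centroid (intra-cluster spread)."""
    return sum(math.dist(p, center) for p in points) / len(points)

def should_merge(cluster_a, cluster_b, ratio=1.0):
    """Merge when the inter-cluster distance is small relative to the spread."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    inter = math.dist(ca, cb)
    intra = mean_radius(cluster_a, ca) + mean_radius(cluster_b, cb)
    return inter < ratio * intra

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.5, 0.1), (1.5, 0.1)]
c = [(10.0, 10.0), (11.0, 10.0)]
print(should_merge(a, b), should_merge(a, c))
```

In an iterative scheme, this test would be re-evaluated after every network update until no pair of clusters qualifies for merging.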





□ scEval: Evaluating the Utilities of Large Language Models in Single-cell Data Analysis

>> https://www.biorxiv.org/content/10.1101/2023.09.08.555192v1

scEval (Single-cell Large Language Model Evaluation), a systematic evaluation of the effects of hyper-parameters, initial settings, and stability for training single-cell LLMs. Evaluating the performance of single-cell LLMs - scGPT, Geneformer, scBERT, CellLM and tGPT.

scGPT is capable of performing zero-shot learning tasks. For the Cell Lines dataset, the zero-shot learning approach even achieved the highest score, indicating that it can be an effective method for certain datasets.

GEARS was generally better than scGPT. For the data simulation task, scGPT did not perform very well, which suggests that LLMs are remembering things rather than making inferences or generating enough novel information.





□ Autoturbo-DNA: Turbo-Autoencoders for the DNA data storage channel

>> https://www.biorxiv.org/content/10.1101/2023.09.15.557887v1

Autoturbo-DNA, an end-to-end autoencoder framework that combines the TurboAE principles with an additional pre-processing decoder, DNA data storage channel simulation, and constraint adherence check.

Autoturbo-DNA supports various neural network architectures. Autoturbo-DNA trains encoder-transcoder-decoder models for DNA data storage. Autoturbo-DNA reaches reconstruction performance close to that of single-sequence non-NN error correction and constrained codes for DNA data storage.





□ On chaotic dynamics in transcription factors and the associated effects in differential gene regulation

>> https://www.nature.com/articles/s41467-018-07932-1

All deterministic simulations were performed by numerically integrating the dynamical equations using the Runge–Kutta fourth-order method, and for optimisation reasons, some of the equations were simulated using Euler integration.

Chaotic dynamics has so far been underestimated as a means for controlling genes. They tested for chaos by calculating the divergence of trajectories that started at almost identical initial points. NF-κB driven by sufficiently large TNF amplitudes will exhibit deterministic chaos.
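Both the integration scheme and the divergence test can be sketched on a textbook system. The Lorenz equations stand in for the NF-κB model here, which is an assumption made purely for illustration; the step size and perturbation are arbitrary choices.

```python
import math

def euler_step(f, state, dt):
    """First-order Euler update."""
    return [s + dt * k for s, k in zip(state, f(state))]

def rk4_step(f, state, dt):
    """Classical fourth-order Runge-Kutta update."""
    k1 = f(state)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = f([s + dt * k for s, k in zip(state, k3)])
    return [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Stand-in chaotic system, not the NF-kB equations."""
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

def separation(f, start, eps, dt, steps):
    """Distance between two trajectories that start eps apart."""
    a = list(start)
    b = [start[0] + eps] + list(start[1:])
    for _ in range(steps):
        a = rk4_step(f, a, dt)
        b = rk4_step(f, b, dt)
    return math.dist(a, b)

print(separation(lorenz, [1.0, 1.0, 1.0], 1e-8, 0.01, 3000))
```

If the final separation is many orders of magnitude larger than the initial perturbation, the trajectories have diverged in the way expected of a chaotic system.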





□ ZINBMM: a general mixture model for simultaneous clustering and gene selection using single-cell transcriptomic data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03046-0

ZINBMM, a zero-inflated negative binomial mixture model for scRNA-seq data clustering that can comprehensively account for the unique problems of batch effects, dropout events, and high dimensionality. ZINBMM directly applies to the raw counts without any transformation.

The mixture model with biological effects of genes being modelled using cell type-specific mean parameters is developed to accommodate heterogeneity, which achieves soft clustering and has the advantage of more meaningful probabilistic interpretations.

ZINBMM can accommodate zero-expressed gene counts and correct the confounding batch effects by introducing corresponding parameterisation. ZINBMM performs feature selection by imposing penalisation on the differences between cluster-specific and global mean values.
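For reference, the zero-inflated negative binomial probability mass function that such a mixture builds on can be written directly. This sketch uses the common mean/dispersion parameterisation (mu, theta) with zero-inflation probability pi; batch effects, cluster-specific means, and the penalty are left out, and the parameter names are assumptions.

```python
import math

def zinb_pmf(y, mu, theta, pi):
    """P(Y = y) for a zero-inflated negative binomial.

    mu: mean of the NB part, theta: dispersion, pi: extra-zero probability.
    """
    log_nb = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
              + theta * math.log(theta / (theta + mu))
              + y * math.log(mu / (theta + mu)))
    nb = math.exp(log_nb)
    return (pi if y == 0 else 0.0) + (1.0 - pi) * nb

print(zinb_pmf(0, mu=2.0, theta=1.0, pi=0.4))
print(sum(zinb_pmf(y, 2.0, 1.0, 0.4) for y in range(1000)))
```

The first term handles dropout-style excess zeros, while the second term models the raw counts themselves, so no prior transformation of the data is needed.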





□ Borzoi: Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation

>> https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1

Borzoi learns to predict cell- and tissue-specific RNA-seq coverage from DNA sequence. Borzoi isolates and accurately scores variant effects across multiple layers of regulation, including transcription, splicing, and polyadenylation.

Borzoi uses the core Enformer architecture, which includes a tower of convolution and subsampling blocks followed by a series of self-attention blocks operating on embedding vectors at 128 bp resolution.





□ scover: Predicting the impact of sequence motifs on gene regulation using single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03021-9

scover infers regulatory motifs that are predictive of the signal associated with a set of sequences using a neural network consisting of a single convolutional layer, an exponential linear unit, global max pooling, and a linear layer with bias term.

scover takes as input a set of one-hot encoded sequences, e.g., promoters or distal enhancers, along with measurements of their activity, e.g., expression levels of the associated genes or accessibility levels of the enhancers.
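The forward pass of such a shallow network fits in a few lines of plain Python. The sketch below uses a single hand-written filter and a scalar output instead of learned weights and a vector of activities, so the filter, weights, and bias are illustrative assumptions.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def conv1d(x, kernel):
    """Slide one motif filter (width x 4) along the encoded sequence."""
    width = len(kernel)
    return [sum(x[i + k][c] * kernel[k][c] for k in range(width) for c in range(4))
            for i in range(len(x) - width + 1)]

def elu(v, alpha=1.0):
    """Exponential linear unit."""
    return v if v > 0 else alpha * (math.exp(v) - 1.0)

def forward(seq, kernels, weights, bias):
    """Convolution -> ELU -> global max pooling -> linear layer with bias."""
    x = one_hot(seq)
    pooled = [max(elu(v) for v in conv1d(x, k)) for k in kernels]
    return sum(w * p for w, p in zip(weights, pooled)) + bias

tata = one_hot("TATA")  # a filter that matches the motif exactly
print(forward("GGTATACC", [tata], [0.5], 1.0))
print(forward("GGGGGGGG", [tata], [0.5], 1.0))
```

Because of the global max pooling, each filter reports only its best match in the sequence, which is what lets the learned filters be read as motifs.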





□ GENIX: Comparative Analysis of Association Networks Using Single-Cell RNA Sequencing Data Reveals Perturbation-Relevant Gene Signatures

>> https://www.biorxiv.org/content/10.1101/2023.09.11.556872v1

GENIX (Gene Expression Network Importance eXamination), a novel platform for constructing gene association networks, equipped with an innovative network-based comparative model to uncover condition-relevant genes.

By leveraging this probabilistic graphical model, GENIX faithfully differentiates between direct and indirect connections while remaining immune to neglecting novel interactions, a common downside of reference-guided network construction methods.

GENIX uses a systematic module identification and analysis approach, and a two-dimensional quantitative metric, providing a more comprehensive understanding of changes in gene essentiality within the network upon perturbation.





□ NetAn: A Python Toolbox Leveraging Network Topology for Comprehensive Gene Annotation Enrichments

>> https://www.biorxiv.org/content/10.1101/2023.09.05.556339v1

NetAn (the Network Annotation Enrichment package), which takes a list of genes and uses network-based approaches such as network clustering and inference of closely related genes to include local neighbours.

NetAn draws the adjacency matrix of the input gene set from the loaded network, and applies either K-means clustering, maximal clique identification, or the extraction of separated network components to sort genes into individual sets.

NetAn has a functionality where the average shortest path length between all gene cluster pairs is computed and compared to the average path length of the loaded network. NetAn randomly samples pairs in batches until the mean converges.
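The batch-wise sampling of path lengths might look like the sketch below. It assumes an unweighted, undirected network stored as an adjacency dictionary and a simple stopping rule (the running mean changes by less than a tolerance between batches); batch size, tolerance, and function names are hypothetical and not taken from the NetAn code.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_mean_path_length(adj, batch_size=50, tol=0.01, max_batches=200, seed=0):
    """Sample node pairs in batches until the running mean stabilises."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    total, n, previous = 0, 0, None
    for _ in range(max_batches):
        for _ in range(batch_size):
            u, v = rng.sample(nodes, 2)
            d = bfs_distances(adj, u).get(v)
            if d is not None:  # skip pairs in different components
                total += d
                n += 1
        mean = total / n if n else None
        if previous is not None and mean is not None and abs(mean - previous) < tol:
            return mean
        previous = mean
    return previous

path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sampled_mean_path_length(path_graph))
```

The same routine can be run on the gene clusters and on the whole network, and the two means compared.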





□ PAN-GWES: Pangenome-spanning epistasis and co-selection analysis via de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556769v1

PAN-GWES, a phenotype- and alignment-free method for discovering co-selected and epistatically interacting genomic variation from genome assemblies covering both core and accessory parts of genomes.

PAN-GWES uses a compact coloured de Bruijn graph to approximate the intra-genome distances between pairs of loci. PAN-GWES leverages the computational efficiency of the SpydrPick algorithm to rapidly calculate the pairwise MI values of millions of unitig pairs.
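The pairwise quantity itself is plain mutual information between two presence/absence vectors. The sketch below computes the unsmoothed empirical MI in bits; SpydrPick's actual estimator and any weighting or pseudocounts it applies are not reproduced here.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (bits) of two equal-length vectors."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

# Presence (1) or absence (0) of two unitigs across four genomes
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # perfectly linked
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent
```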





□ PhaseDancer: a novel targeted assembler of segmental duplications unravels the complexity of the human chromosome 2 fusion going from 48 to 46 chromosomes in hominin evolution

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03022-8

PhaseDancer, a novel, fast, and robust assembler that follows a locally-targeted approach to resolve SD-rich complex genomic regions. The tool is designed to work with long-reads (ONT, PacBio) and tuned for error-prone data.

PhaseDancer enables the extension of a user-provided initial sequence contig even from complex genomic regions. PhaseDancer generates contigs with the fragments repeated up to several dozen times in the genome with at least 0.1% divergence.





□ Omix: A Multi-Omics Integration Pipeline

>> https://www.biorxiv.org/content/10.1101/2023.08.30.555486v1

Omix is built on four consecutive blocks: (1) preparation of the multimodal container, (2) processing and quality control, (3) single omic analyses, and (4) multi-omics vertical integration.

The modular framework of Omix enables the storage of analysis parameters and results from different algorithms within the same object, facilitating easy comparison of outputs. This design also allows for the incorporation of additional integrative models as the field progresses.





□ CellsFromSpace: A versatile tool for spatial transcriptomic data analysis with reference-free deconvolution and guided cell type/activity annotation

>> https://www.biorxiv.org/content/10.1101/2023.08.30.555558v1

CellsFromSpace decomposes spatial transcriptomic data into components that represent distinct cell types or activities. The direct annotation of components allows users to identify and isolate cell populations in the latent space, even when they overlap.

CellsFromSpace overcomes some of the limitations of Latent Dirichlet Allocation. CFS is based on independent component analysis (ICA), a blind source separation technique that attempts to extract sources from a mixture of these sources.





□ Scoring alignments by embedding vector similarity

>> https://www.biorxiv.org/content/10.1101/2023.08.30.555602v1

The E-score project focuses on computing Global-regular and Global-end-gap-free alignment between any two protein sequences using their embedding vectors computed by state-of-the-art pre-trained models.

Instead of a fixed score for a pair of amino acids, they use the cosine similarity between the embedding vectors of the two amino acids as the context-dependent score.
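A minimal sketch of this scoring idea, assuming per-residue embedding vectors are already available as plain lists of floats. The linear gap penalty and the function names are illustrative assumptions, not the E-score implementation; the recurrence is the standard Needleman-Wunsch global alignment with the substitution matrix swapped for cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def global_align_score(emb_a, emb_b, gap=-1.0):
    """Global alignment score where the match score is the cosine similarity
    of the two residues' embeddings (gap penalty is an assumed constant)."""
    n, m = len(emb_a), len(emb_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

# Toy 2-dimensional "embeddings" (assumption); real ones come from a pre-trained model.
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identical_score = global_align_score(seq_a, seq_b)
```

Because the embedding of a residue depends on its sequence context, the same amino acid pair can receive different scores at different positions, which a fixed substitution matrix cannot express.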





□ AliSim-HPC: parallel sequence simulator for phylogenetics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad540/7258693

AliSim-HPC is highly efficient and scalable, which reduces the runtime to simulate 100 large gap-free alignments (30,000 sequences of one million sites) from over one day to 11 minutes using 256 CPU cores from a cluster with 6 computing nodes, a 153-fold speedup.

AliSim-HPC parallelizes the simulation process at both multi-core and multi-CPU levels using the OpenMP and MPI libraries. AliSim-HPC employs The Scalable Parallel Random Number Generators Library (SPRNG) and requires users to specify a random number generator seed.





□ MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad054/7230465

MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine.

Each of the employed language models is adapted to the task using the “pretrain and fine-tune” paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages. Fine-tuning aims at predicting the DNA methylation status of each type.





□ SpatialDDLS: An R package to deconvolute spatial transcriptomics data using neural networks

>> https://www.biorxiv.org/content/10.1101/2023.08.31.555677v1

SpatialDDLS leverages single-cell RNA sequencing data to simulate mixed transcriptional profiles with predefined cellular composition, which are subsequently used to train a fully-connected neural network to uncover cell type diversity within each spot.

SpatialDDLS offers the option to keep only those genes present in a specified number of slides. These steps aim to expedite subsequent steps by avoiding the consideration of the entire noisy expression matrix.





□ spaTrack: Inferring cell trajectories of spatial transcriptomics via optimal transport analysis

>> https://www.biorxiv.org/content/10.1101/2023.09.04.556175v1

spaTrack, a trajectory inference method incorporating both expression and distance cost of cell transition. spaTrack utilizes Optimal Transport (OT) as a foundation to infer the transition probability between cells of ST data in a single sample.

spaTrack models the fate of a cell as a function of expression profile along temporal intervals driven by TF. spaTrack can construct a dynamic map of cell migration and differentiation across all tissue sections, providing a comprehensive view of transition behavior over time.





□ SCGP: Characterizing tissue structures from spatial omics with spatial cellular graph partition

>> https://www.biorxiv.org/content/10.1101/2023.09.05.556133v1

Spatial Cellular Graph Partitioning (SCGP) is a fast and flexible method designed to identify the anatomical and functional units in human tissues. It can be effectively applied to both spatial proteomics and transcriptomics measurements.

SCGP-Extension generalizes a set of reference tissue structures to previously unseen query samples. SCGP-Extension can address challenges ranging from experimental artifacts and batch effects to disease condition differences.





□ A novel interpretable deep transfer learning combining diverse learnable parameters for improved prediction of single-cell gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556481v1

In terms of the TFt-based models, they keep weights of the bottom layers in the feature extraction part of pre-trained models unchanged while modifying weights in the subsequent layers, including the densely connected classifier, according to the Adam optimizer.

The densely connected classifier was altered to deal w/ the binary class classification problem pertaining to distinguishing between healthy controls and T2D SCGRN images. It can be seen that updating model weight parameters is done through the training w/ the Adam optimizer.





□ CS-CORE: Cell-type-specific co-expression inference from single cell RNA-sequencing data

>> https://www.nature.com/articles/s41467-023-40503-7

CS-CORE (cell-type-specific co-expressions) models the unobserved true gene expression levels as latent variables, linked to the observed UMI counts through a measurement model that accounts for both sequencing depth variations and measurement errors.

CS-CORE implements a fast and efficient iteratively re-weighted least squares approach for estimating the true correlations between underlying expression levels, together with a theoretically justified statistical test to assess whether two genes are independent.





□ μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad552/7265394

μ-PBWT, introducing a lightweight index for the PBWT data structure. It leverages the run-length encoding paradigm to significantly reduce the space requirements for solving two major problems: the SMEMs-finding (i.e. computing maximal matches) and SMEMs-location (i.e. finding occurrences).

μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file.
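The space saving rests on run-length encoding: PBWT columns tend to contain long runs of identical alleles, so storing runs instead of individual bits is compact. The sketch below is a generic run-length encoder for one binary column, written for illustration only; it is not the μ-PBWT data structure, and the function names are hypothetical.

```python
def run_length_encode(column):
    """Collapse a sequence of alleles into (allele, run_length) pairs."""
    runs = []
    for bit in column:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(bit, length) for bit, length in runs]

def run_length_decode(runs):
    """Expand (allele, run_length) pairs back into the original column."""
    column = []
    for bit, length in runs:
        column.extend([bit] * length)
    return column

# Toy permuted column (assumption): six haplotypes at one biallelic site.
column = [0, 0, 0, 1, 1, 0]
encoded = run_length_encode(column)
```

The number of runs, rather than the number of haplotypes, then drives the size of the stored column.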





□ Local read haplotagging enables accurate long-read small variant calling

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556731v1

An approximate haplotagging method that can locally haplotag long reads without having to generate variant calls. This approach uses local candidates to haplotag the reads and then the deep neural network model uses the haplotag approximation to generate high-quality variants.

This approach eliminates the requirement for having the first two steps for haplotagging the reads and reduces the overhead for extending support to newer platforms. Approximate haplotagging with candidate variants has comparable accuracy to haplotagging with WhatsHap.





□ BAGO: Bayesian optimization of separation gradients to maximize the performance of untargeted LC-MS

>> https://www.biorxiv.org/content/10.1101/2023.09.08.556930v1

BAGO, a Bayesian optimization method for autonomous and efficient LC gradient optimization. BAGO is an active learning strategy that discovers the optimal gradient using limited experimental data.

BAGO evaluates the retention of all detected features in an unbiased manner regardless of ion abundance and identity, providing a robust index representing global compound separation.

Multiple optimizations of general Bayesian optimization framework were applied to ensure the high efficiency of BAGO on a diverse range of gradient optimization problems.





□ Automated Bioinformatics Analysis via AutoBA

>> https://www.biorxiv.org/content/10.1101/2023.09.08.556814v1

Auto Bioinformatics Analysis (AutoBA), the first autonomous AI agent meticulously crafted for conventional bioinformatics analysis. AutoBA streamlines user interactions by soliciting just three inputs: the data path, the data description, and the final objective.

AutoBA possesses the capability to autonomously generate analysis plans, write codes, execute codes, and perform subsequent data analysis. In essence, AutoBA marks the pioneering application of LLMs and automated AI agents in the realm of bioinformatics.





□ cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad561/7271182

cloneRate provides accessible methods for estimating the growth rate of clones. The input should either be an ultrametric phylogenetic tree with edge lengths corresponding to time, or a non-ultrametric phylogenetic tree with edge lengths corresponding to mutation counts.

This package provides the internal lengths and maximum likelihood methods for ultrametric trees and the shared mutations method for mutation-based trees. A fast way to simulate the coalescent (tree) of a sample from a birth-death branching process.





□ Hierarchical heuristic species delimitation under the multispecies coalescent model with migration

>> https://www.biorxiv.org/content/10.1101/2023.09.10.557025v1

Alternatively heuristic criteria based on population parameters under the MSC model (such as population/species divergence times, population sizes, and migration rates) estimated from genomic sequence data may be used to delimit species.

Extending the approach of species delimitation using the genealogical divergence index (gdi) to develop hierarchical merge and split algorithms for heuristic species delimitation, and implement them in a python pipeline called hhsd.
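For orientation, the gdi is usually written as gdi = 1 - exp(-2τ/θ), where τ is the population divergence time and θ the population size parameter, both measured in expected mutations per site. The sketch below computes it and applies the commonly cited rule-of-thumb cut-offs of 0.2 and 0.7; these thresholds and the function names are assumptions for illustration, and hhsd's own defaults and merge/split logic may differ.

```python
import math

def gdi(tau, theta):
    """Genealogical divergence index: probability that two sequences from the
    same population coalesce before reaching the divergence time tau."""
    return 1.0 - math.exp(-2.0 * tau / theta)

def classify(value, lower=0.2, upper=0.7):
    """Heuristic reading of a gdi value (thresholds are assumed defaults)."""
    if value < lower:
        return "same species"
    if value > upper:
        return "distinct species"
    return "ambiguous"

# Toy parameter values (assumption), not estimates from any dataset.
shallow = classify(gdi(tau=0.0005, theta=0.01))
deep = classify(gdi(tau=0.01, theta=0.01))
```

A hierarchical merge algorithm would repeat such a test on sister populations along a guide tree, merging pairs that fall below the lower threshold.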





□ EvoDiff: Protein generation with evolutionary diffusion: sequence is all you need

>> https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1

EvoDiff uses a discrete diffusion framework in which a forward process iteratively corrupts a protein sequence by changing its amino acid identities, and a learned reverse process, parameterized by a neural network, predicts the changes made at each iteration.

The reverse process can then be used to generate new protein sequences starting from random noise. EvoDiff's discrete diffusion formulation is mathematically distinct from continuous diffusion formulations previously used for protein structure design.
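The forward (corruption) half of such a process is easy to illustrate. The sketch below replaces each residue with a uniformly random amino acid with a fixed per-step probability; the uniform replacement scheme, the constant rate `beta`, and the function names are assumptions made for illustration and are not EvoDiff's actual corruption schedule. The learned reverse process requires a trained network and is not shown.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def corrupt_step(seq, beta, rng):
    """One forward step: each position is resampled with probability beta."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < beta else aa for aa in seq
    )

def forward_process(seq, steps=10, beta=0.2, seed=0):
    """Return the trajectory of progressively corrupted sequences."""
    rng = random.Random(seed)
    trajectory = [seq]
    for _ in range(steps):
        trajectory.append(corrupt_step(trajectory[-1], beta, rng))
    return trajectory

# Toy input sequence (assumption).
trajectory = forward_process("MKTAYIAKQR", steps=5)
```

After enough steps the sequence is indistinguishable from random noise, which is the starting point from which a reverse model would generate new sequences.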





□ CRUSTY: a versatile web platform for the rapid analysis and visualization of high-dimensional flow cytometry data

>> https://www.nature.com/articles/s41467-023-40790-0

CRUSTY, an interactive, user-friendly webtool incorporating the most popular algorithms for FCM data analysis, and capable of visualizing graphical and tabular results and automatically generating publication-quality figures within minutes.





□ LIT: Identifying latent genetic interactions in genome-wide association studies using multiple traits

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557155v1

LIT (Latent Interaction Testing) leverages multiple related traits for detecting latent genetic interactions. LIT is motivated by the observation that latent genetic interactions induce not only a differential variance pattern, but also a differential covariance pattern.

Combining the p-values from both approaches in aLIT maximized the number of discoveries while controlling the type I error. LIT increased the power to detect latent genetic interactions compared to marginal testing, and the difference was drastic for certain genetic architectures.





□ The Interplay Between Sketching and Graph Generation Algorithms in Identifying Biologically Cohesive Cell-Populations in Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2023.09.15.557825v1

Combining a principled sketching approach with a simple k-nearest neighbor graph representation of the data can identify meaningful subsets of cells as robustly as, and sometimes better than, more sophisticated graph generation approaches.

Cell-similarity graphs are generally weighted, undirected, and simple. A weighted graph is one where each edge has a value assigned to it; large edge weights indicate strong connections between nodes.

Graph mining approaches perform better on sparse graphs than they do on dense graphs, and graph density varies significantly from the ultra-sparse GRASPEL to the 8-NN graph. Label propagation is more robust to noise and sparsity in the edges of a graph than Leiden clustering.
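As a concrete picture of a weighted, undirected, simple k-nearest neighbor graph, the sketch below builds one from a handful of points. The similarity weight 1 / (1 + distance) and the function name are illustrative assumptions; the paper's own graph construction and sketching steps are not reproduced here.

```python
import math

def knn_graph(points, k):
    """Build an undirected k-NN graph as {(i, j): weight} with i < j.
    Storing each edge once under a sorted key keeps the graph simple
    (no self-loops, no duplicate edges)."""
    edges = {}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            # Larger weight means a stronger connection (assumed weighting).
            edges[(min(i, j), max(i, j))] = 1.0 / (1.0 + d)
    return edges

# Toy 2-D "cells" (assumption); real inputs would be expression embeddings.
cells = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0)]
graph = knn_graph(cells, k=1)
```

Raising `k` densifies the graph, which is the knob that separates an ultra-sparse representation from something like an 8-NN graph.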







Astropath.

2023-09-19 21:09:09 | Science News




□ MaxFuse: Integration of spatial and single-cell data across modalities with weakly linked features

>> https://www.nature.com/articles/s41587-023-01935-0

MaxFuse (matching X-modality via fuzzy smoothed embedding), a cross-modal data integration method that, through iterative coembedding, data smoothing and cell matching, uses all information in each modality to obtain high-quality integration even when features are weakly linked.

MaxFuse is modality-agnostic. MaxFuse computes distances between all cross-modal cell pairs based on the smoothed, linked features and applies linear assignment on the cross-modal pairwise distances of the fuzzy-smoothed joint embedding coordinates.





□ Autometa 2: A versatile tool for recovering genomes from highly-complex metagenomic communities

>> https://www.biorxiv.org/content/10.1101/2023.09.01.555939v1

Autometa first performs pre-processing tasks where assembled contiguous sequences (contigs) are filtered by length and taxon. The latter process assigns contigs to kingdom-level taxonomies, effectively separating eukaryotic host-associated genomes from prokaryotic symbionts.

Contigs are recursively binned using nucleotide composition and read coverage, with successive rounds first splitting the remaining contigs into groups from less to more specific canonical ranks (i.e. kingdom, phylum, class, order, family, genus, species).

Autometa attempts to recruit any remaining unclustered sequences into one of the recovered putative metagenome-assembled genomes (MAGs) through classification by a decision tree classifier, or optionally, a random forest classifier.





□ EpiSegMix: A Flexible Distribution Hidden Markov Model with Duration Modeling for Chromatin State Discovery

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556549v1

EpiSegMix first estimates the parameters of a hidden Markov model, where each state corresponds to a different combination of epigenetic modifications and thus represents a functional role, such as enhancer, transcription start site, active or silent gene.

The spatial relations are captured via the transition probabilities. After the parameter estimation, each region in the genome is annotated w/ the most likely chromatin state. The implementation allows a different distributional assumption to be chosen for each histone modification.





□ Xenomake: a pipeline for processing and sorting xenograft reads from spatial transcriptomic experiments

>> https://www.biorxiv.org/content/10.1101/2023.09.04.556109v1

Xenomake is a xenograft reads sorting and processing pipeline. It consists of the following steps: read tagging/trimming, alignment, annotation of genomic features, xenograft read sorting, subsetting bam, filtering multi mapping reads, and gene quantifications.

Xenomake contains a policy regarding handling reads classified as both and ambiguous by Xengsort. Xenomake differs from others in that it adopts a flexible strategy to resolve both/ambiguous categories to make reads in these categories usable, rather than removing them.

Xenomake uses the genomic location (exonic, intronic, intergenic, or pseudogene) to determine the best aligned location of a multimapping read. A multimapper favors the exonic alignment over intergenic, pseudogenic, and any other secondary alignments.
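The preference for exonic alignments amounts to a priority lookup. In the sketch below, only "exonic first" comes from the description above; the relative order of the other categories, the tie-break on alignment score, and the record layout are assumptions for illustration and do not reflect Xenomake's internals.

```python
# Lower rank wins. Only the top rank for "exonic" is taken from the text;
# the remaining order is an assumption.
PRIORITY = {"exonic": 0, "intronic": 1, "intergenic": 2, "pseudogene": 3}

def pick_alignment(alignments):
    """Choose one alignment for a multimapping read by feature priority,
    breaking ties with the higher alignment score."""
    return min(
        alignments,
        key=lambda a: (PRIORITY.get(a["feature"], len(PRIORITY)), -a["score"]),
    )

# Hypothetical candidate alignments for a single read.
candidates = [
    {"pos": "chr1:100", "feature": "intergenic", "score": 60},
    {"pos": "chr2:500", "feature": "exonic", "score": 55},
    {"pos": "chr3:900", "feature": "pseudogene", "score": 60},
]
best = pick_alignment(candidates)
```

Here the exonic hit is kept even though an intergenic hit has a higher alignment score.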





□ Multimodal learning of noncoding variant effects using genome sequence and chromatin structure

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad541/7260506

A multimodal deep learning scheme that incorporates both data of 1D genome sequence and 3D chromatin structure for predicting noncoding variant effects.

Specifically, they have integrated convolutional and recurrent neural networks for sequence embedding and graph neural networks for structure embedding despite the resolution gap between the two types of data, while utilizing recent DNA language models.

Numerical results show that their models outperform competing sequence-only models in predicting epigenetic profiles, and that their use of long-range interactions complements sequence-only models in extracting regulatory motifs.

They prove to be excellent predictors for noncoding variant effects in gene expression and pathogenicity, whether in unsupervised “zero-shot” learning or supervised “few-shot” learning.





□ PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

>> https://arxiv.org/abs/2302.04265

PFGM++ unifies diffusion models and Poisson Flow Generative Models. These models realize generative trajectories for N dimensional data by embedding paths in N+D dimensional space while still controlling the progression with a simple scalar norm of the D additional variables.

PFGM++ models reduce to PFGM when D=1 and to diffusion models when D→∞. PFGM++ uses an alignment method that enables a "zero-shot" transfer of hyper-parameters across different Ds.





□ GWAS of random glucose in 476,326 individuals provide insights into diabetes pathophysiology, complications and treatment stratification

>> https://www.nature.com/articles/s41588-023-01462-3

While random glucose (RG) is inherently more variable than standardized measures, they reasoned that, across a very large number of individuals, it gives a more comprehensive representation of complex glucoregulatory processes occurring in different organ systems.

In the near future, larger well-phenotyped datasets will enable high-dimensional GWAS investigations, disentangling the role of diet composition, physical activity and lifestyle on RG level variability in relation to genetic effects.





□ phyloGAN: Phylogenetic inference using Generative Adversarial Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad543/7260504

phyloGAN is a Generative Adversarial Network (GAN) that infers phylogenetic relationships. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and then infers a phylogenetic tree either considering or ignoring gene tree heterogeneity.

phyloGAN heuristically explores phylogenetic tree space to find a tree topology that produces generated data that are similar to observed data. The generator generates a tree topology and branch lengths, which are used as input into an evolutionary simulator (AliSim).

At each iteration, new topologies are proposed using nearest neighbor interchange (NNI) and subtree pruning and regrafting (SPR). The discriminator is a CNN trained to differentiate real and generated data.





□ CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra

>> https://arxiv.org/abs/2309.03060

CoLA (Compositional Linear Algebra) combines a linear operator abstraction with compositional dispatch rules. CoLA automatically constructs memory and runtime efficient numerical algorithms.

CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra.





□ evopython: a Python package for feature-focused, comparative genomic data exploration

>> https://www.biorxiv.org/content/10.1101/2023.09.02.556042v1

evopython is a modular, object-oriented Python package, specifically designed for parsing features at genome-scale and resolving their alignments from whole-genome alignment data.

The fundamental capabilities of evopython are encapsulated within two key class-level functionalities: Parser and Resolver. The Parser class provides a dictionary-like interface for interacting with feature-storing formats, such as GTF or BED.

The Resolver class then resolves these features from within the context of the whole-genome alignment. It performs the task of mapping the features onto the alignment and returns a nested dictionary representation that reflects the alignment structure.





□ ChromGene: gene-based modeling of epigenomic data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03041-5

ChromGene models the set of epigenomic data across genes with a mixture of Hidden Markov Models. The set of epigenomic data for each gene, along with a flanking region at each end, is binarized at fixed-width bins, indicating observations of each epigenomic mark.

ChromGene does not directly model gene position information. The prior probability that a gene belongs to a specific mixture component, that is, an individual HMM, corresponds to the sum of initial probabilities of the states of that component.
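The relationship between state initial probabilities and component priors can be shown with a short calculation. The state names, probabilities, and function name below are made up for illustration; only the rule "component prior = sum of the initial probabilities of its states" comes from the description above.

```python
def component_priors(initial_probs, state_to_component):
    """Sum each state's initial probability into its mixture component."""
    priors = {}
    for state, prob in initial_probs.items():
        component = state_to_component[state]
        priors[component] = priors.get(component, 0.0) + prob
    return priors

# Hypothetical model: four states split across two component HMMs.
initial_probs = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}
state_to_component = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
priors = component_priors(initial_probs, state_to_component)
```

Since states of different components never transition into each other in a mixture of HMMs, a gene that starts in a component stays in it, which is why the initial probabilities alone determine the prior.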





□ Regulatory Transposable Elements in the Encyclopedia of DNA Elements

>> https://www.biorxiv.org/content/10.1101/2023.09.05.556380v1

TE-derived cCREs are enriched for GWAS variants, albeit to a lesser extent than non-TE cCREs. While this could indicate that TEs are less likely to be physiologically relevant, it could also reflect technical shortcomings associated with genotyping within TE sequences.

Genotyping arrays, which use short oligonucleotide probes to discern SNPs, are designed to avoid repetitive regions of the genome.





□ SPEAQeasy: a scalable pipeline for expression analysis and quantification for R/bioconductor-powered RNA-seq analyses

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04142-3

SPEAQeasy (a Scalable Pipeline for Expression Analysis and Quantification) ultimately generates RangedSummarizedExperiment R objects that are the foundation block for many Bioconductor R packages and the statistical methods they provide.

SPEAQeasy produces information that, coupled with DNA genotyping information, can be used for detecting and fixing sample swaps, as well as RNA-seq processing quality metrics that are helpful for statistically adjusting for quality differences across samples.





□ SpatialPrompt: spatially aware scalable and accurate tool for spot deconvolution and clustering in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556641v1

SpatialPrompt, a spatially aware and scalable method for spot deconvolution as well as domain identification for spatial transcriptomics. SpatialPrompt integrates gene expression, spatial location, and scRNA-seq reference data to infer cell-type proportions of spatial spots accurately.

At the core, SpatialPrompt uses non-negative ridge regression and an iterative approach inspired by graph neural network (GNN) to capture the local microenvironment information in the spatial data.

SpatialPrompt takes a spatial matrix with coordinate information and an scRNA-seq matrix with cell type annotations as input for spot deconvolution and clustering.

The spatial spot simulation pipeline utilises scRNA-seq expression matrix and cell type annotations to generate simulated expression matrix with known cell type mixture.





□ A Quantitative Genetic Model of Background Selection in Humans

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556762v1

A statistical method based on a quantitative genetics view of linked selection, that models how polygenic additive fitness variance distributed along the genome increases the rate of stochastic allele frequency change.

By jointly predicting the equilibrium fitness variance and substitution rate due to both strong and weakly deleterious mutations, they estimate the distribution of fitness effects (DFE) and mutation rate across three geographically distinct human samples.

While the model can accommodate weaker selection, they find evidence of strong selection operating similarly across all human samples. Although the model fits better than previous models, substitution rates of the most constrained sites disagree w/ observed divergence levels.





□ An Extensive Benchmark Study on Biomedical Text Generation and Mining with ChatGPT

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad557/7264174

Typical NLP tasks like named entity recognition, relation extraction, sentence similarity, question answering, and document classification are included. Overall, ChatGPT got a BLURB score of 58.50 while the state-of-the-art model had a score of 84.30.

Among all task types, QA is the only one where ChatGPT is comparable to the baselines. In this case, ChatGPT (82.5) outperforms PubMedBERT (71.7) and BioLinkBERT-Base (80.8) and is very close to the BioLinkBERT-Large (83.5).





Nicholas Larus-Stone

>> https://sphinxbio.com/post/introducing-sphinx

🧬🛠 Introducing @sphinx_bio: Empowering Scientists to Make Better Decisions, Faster 🛠🧬

"What is #techbio apart from an anagram of #biotech?"

Read on below or see the full post here: sphinxbio.com/post/introduci…

I’m excited to share more about our vision for Sphinx. 👩‍🔬👨‍💻





□ Trackplot: A flexible toolkit for combinatorial analysis of genomic data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011477

Trackplot, a comprehensive tool that delivers high-quality plots via a programmable and interactive web-based platform.

Trackplot seamlessly integrates diverse data sources and utilizes a multi-threaded process, enabling users to explore genomic signal in large-scale sequencing datasets.





□ COLLAGENE enables privacy-aware federated and collaborative genomic data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03039-z

COLLAGENE integrates components of MPC, HE, and matrix masking that is motivated by matrix-level differential privacy for performing complex operations (e.g., matrix inversion) efficiently while preserving privacy.

COLLAGENE provides ready-to-run implementations for encryption, collective decryption, matrix masking, a suite of secure matrix arithmetic operations, and network file input/output tools for sharing encrypted intermediate datasets among collaborating sites.





□ scDECAF: Identification of cell types, states and programs by learning gene set representations

>> https://www.biorxiv.org/content/10.1101/2023.09.08.556842v1

scDECAF (Single-cell disentanglement by canonical factors) enables reference-free automated annotation of cells with either discrete labels, such as cell types and states, or continuous phenotype scores for gene expression programs.

scDECAF can learn disentangled representations of gene expression profiles and select the most relevant subset of gene programs among a collection of gene sets. scDECAF constructs a shared lower-dimensional space b/n binarised gene lists and unlabelled gene expression profiles.

scDECAF provides vector representations of gene sets and gene expression profiles while simultaneously maximizing the correlation between the two. The association between individual cells and phenotype is determined based on the similarity of their representations in CCA space.





□ DelSIEVE: joint inference of single-nucleotide variants, somatic deletions, and cell phylogeny from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.09.09.556903v1

DelSIEVE (somatic Deletions enabled SIngle-cell EVolution Explorer), a statistical phylogenetic model that includes all features of SIEVE, namely correcting branch lengths of the cell phylogeny for the acquisition bias, incorporating a trunk to model the establishment of the tumor clone.

DelSIEVE employs a Dirichlet-multinomial distribution to model the raw read counts for all nucleotides, as well as modeling the sequencing coverage using a negative binomial distribution, and extends them with the more versatile capacity of calling somatic deletions.





□ MUSTANG: MUlti-sample Spatial Transcriptomics data ANalysis with cross-sample transcriptional similarity Guidance

>> https://www.biorxiv.org/content/10.1101/2023.09.08.556895v1

MUSTANG (MUlti-sample Spatial Transcriptomics data ANalysis with cross-sample transcriptional similarity Guidance) simultaneously derives the spot cellular deconvolution of multiple tissue samples without the need for reference cell type expression profiles.

MUSTANG adjusts for potential batch effects, a crucial consideration in multi-sample experiments, and enables cross-sample transcriptional information sharing to aid in parameter estimation.

MUSTANG is designed based on the assumption that the same or similar cell types exhibit consistent gene expression profiles across samples. MUSTANG allows both intra-sample and inter-sample information sharing by introducing a new spot similarity graph.





□ BiocMAP: a Bioconductor-friendly, GPU-accelerated pipeline for bisulfite-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05461-3

The BiocMAP workflow consists of a set of two modules—alignment and extraction, which together process raw WGBS reads in FASTQ format into Bioconductor-friendly R objects containing DNA methylation proportions essentially as a cytosine-by-sample matrix.

The first BiocMAP module performs speedy alignment to a reference genome by Arioc, and requires GPU resources. Methylation extraction and remaining steps are performed in the second module, optionally on a different computing system where GPUs need not be available.





□ Cell4D: A general purpose spatial stochastic simulator for cellular pathways

>> https://www.biorxiv.org/content/10.1101/2023.09.10.557076v1

Cell4D is a C++-based graphical spatial stochastic cell simulator capable of simulating a wide variety of cellular pathways. Molecules are simulated as particles w/in a user-defined simulation space under a Smoluchowski-based reaction-diffusion system on a static time-step basis.

At each timestep, particles will diffuse under Brownian-like motion and any potential reactions between molecules will be resolved.
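One timestep of such a particle scheme can be sketched as follows. The Gaussian displacement with standard deviation sqrt(2·D·dt) per axis is the textbook discretisation of Brownian motion; the distance-based check for candidate reaction partners, the parameter values, and the function names are simplifying assumptions and not Cell4D's actual reaction resolution.

```python
import math
import random

def diffuse(positions, diff_coeff, dt, rng):
    """Move every particle by an independent Gaussian step on each axis."""
    sigma = math.sqrt(2.0 * diff_coeff * dt)
    return [tuple(c + rng.gauss(0.0, sigma) for c in pos) for pos in positions]

def close_pairs(positions, radius):
    """Return index pairs of particles within the (assumed) reaction radius."""
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= radius:
                pairs.append((i, j))
    return pairs

# Toy state (assumption): three particles in 3-D space.
rng = random.Random(1)
particles = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (5.0, 5.0, 5.0)]
moved = diffuse(particles, diff_coeff=1.0, dt=0.01, rng=rng)
candidates = close_pairs(particles, radius=1.0)
```

A full simulator would loop these two steps over a static time step, decide which candidate pairs actually react, and apply any compartment rules.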

Simulation space is divided into cubic sub-partitions called c-voxels. Groups of these c-voxels can be used to define spatial compartments with optional rules that govern particle permeability, and reactions can be compartment-specific as well.





□ INTEGRATE-Circ and INTEGRATE-Vis: Unbiased Detection and Visualization of Fusion-Derived Circular RNA

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad569/7273782

INTEGRATE-Circ is an open-source software tool capable of integrating both RNA and whole genome sequencing data to perform unbiased detection of novel gene fusions and report the presence of splice variants in gene fusion transcripts, including backsplicing events.

Recurrent gene fusions were identified from COSMIC, and theoretical backsplice junctions were randomly introduced to the selected fusions. Linear fusion transcripts and linearized versions of the regions that spanned the simulated backsplices were used to simulate reads.





□ SingleCellMultiModal: Curated single cell multimodal landmark datasets for R/Bioconductor

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011324

Collecting publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T.

SingleCellMultiModal is an R/Bioconductor package that provides single-command access to landmark datasets from seven different technologies, storing datasets using HDF5 and sparse arrays for memory efficiency and integrating data modalities via the MultiAssayExperiment class.





□ BioThings Explorer: a query engine for a federated knowledge graph of biomedical APIs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad570/7273783

BioThings Explorer (BTE) is an engine for autonomously querying a distributed knowledge graph. The distributed knowledge graph is made up of biomedical APIs that have been annotated with semantically-precise descriptions of their inputs and outputs in the SmartAPI registry.

BioThings Explorer leverages semantically precise annotations of the inputs and outputs for each resource, and automates the chaining of web service calls to execute multi-step graph queries.





□ The tidyomics ecosystem: Enhancing omic data analyses

>> https://www.biorxiv.org/content/10.1101/2023.09.10.557072v1

tidyomics, an interoperable software ecosystem that bridges Bioconductor and the tidyverse. tidyomics is easily installable with a single homonymous meta-package.

This ecosystem includes three new R packages: tidySummarizedExperiment, tidySingleCellExperiment, and tidySpatialExperiment, and five that are publicly available: plyranges, nullranges, tidyseurat, tidybulk, and tidytof.





□ EHE: Dissecting the high-resolution genetic architecture of complex phenotypes by accurately estimating gene-based conditional heritability

>> https://www.cell.com/ajhg/fulltext/S0002-9297(23)00282-3

EHE (the effective heritability estimator) can use p values from genome-wide association studies (GWASs) for local heritability estimation by directly converting marginal heritability estimates of SNPs to a non-redundant heritability estimate of a gene or a small genomic region.

EHE estimates the conditional heritability of nearby genes, where redundant heritability among the genes can be removed further. The conditional estimation can be guided by tissue-specific expression profiles to quantify more functionally important genes of complex phenotypes.





□ BG2: Bayesian variable selection in generalized linear mixed models with nonlocal priors for non-Gaussian GWAS data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05468-w

A novel Bayesian method to find SNPs associated with non-Gaussian phenotypes. The method uses generalized linear mixed models (GLMMs) and is thus called Bayesian GLMMs for GWAS (BG2). This is the first time that nonlocal priors are proposed for regression coefficients in GLMMs.

BG2 uses a two-step procedure: first, BG2 screens for candidate SNPs; second, BG2 performs model selection that considers all screened candidate SNPs as possible regressors.

BG2 uses a pseudo-likelihood approach to facilitate integrating out the random effects. This pseudo-likelihood approach leads to a Gaussian approximation for adjusted observations, which allows the random effects to be integrated out analytically.





□ DNA sequencing at the picogram level to investigate life on Mars and Earth

>> https://www.nature.com/articles/s41598-023-42170-6

This research assumes that if the returned Mars Sample Collection contains a living organism with the possibility to replicate, and thus the type of organism that backward planetary protection protocols need to contain and control, it relies on the same chemical processes as terrestrial organisms.

Such an organism is also assumed to code its genetic information with the known bases (ATGC for DNA, and AUGC for RNA) that are ubiquitously used by life on Earth.





□ cdsBERT - Extending Protein Language Models with Codon Awareness

>> https://www.biorxiv.org/content/10.1101/2023.09.15.558027v1

cdsBERT (CoDing Sequence Bidirectional Encoder Representation Transformer) was seeded with ProtBERT and further trained on 4 million CoDing Sequences (CDS) compiled from the NIH and Ensembl databases.

MELD (Masked Extended Language Distillation) is a vocabulary extension pipeline that was trained w/ Knowledge Distillation. The hypothesis was that a shift in synonymous codon embeddings w/in the TEM would indicate a nontrivial addition of protein information after applying MELD.




ZENITH.

2023-09-19 21:08:09 | Science News

(Guanyin de la mer du Sud, dynastie Liao ou Jin (1115-1234) by Ariste85)




□ scEGOT: Single-cell trajectory inference framework based on entropic Gaussian mixture optimal transport

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557102v1

scEGOT provides comprehensive outputs from multiple perspectives, incl. cell state graphs, velocity fields of cell differentiation, time interpolations of single-cell data, space-time continuous GE analysis, GRN, and reconstructions of Waddington’s epigenetic landscape.

scEGOT is formulated by an entropic regularization of the discrete optimal transport, which is a coarse-grained model derived by taking each Gaussian distribution as a single point.

scEGOT constructs the time interpolations of cell populations and the time-continuous gene expression dynamics using the entropic displacement interpolation, and has identified the bifurcation time.
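
As a rough illustration of entropic regularization of discrete optimal transport between cluster-level points, here is a plain Sinkhorn iteration. It is a generic sketch and not scEGOT's code: the weights, the cost matrix, `eps`, and the iteration count are assumed values, and each Gaussian component is reduced to a single weighted point, as the coarse-grained description above suggests.

```python
import math

def sinkhorn(a, b, cost, eps=1.0, n_iter=500):
    """Entropic-regularized discrete OT between weight vectors a and b.

    Returns the coupling P with P[i][j] = u[i] * K[i][j] * v[j], K = exp(-cost/eps).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale so row sums match a, then column sums match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical example: two clusters at time t, three clusters at time t+1.
a = [0.6, 0.4]                 # cluster weights at time t
b = [0.3, 0.3, 0.4]            # cluster weights at time t+1
cost = [[0.0, 1.0, 4.0],       # e.g. squared distances between cluster means
        [4.0, 1.0, 0.0]]
P = sinkhorn(a, b, cost)
```

Each entry `P[i][j]` can be read as the mass flowing from cluster `i` to cluster `j`; a larger `eps` spreads that mass more evenly, a smaller one approaches unregularized transport.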





□ Cell2Sentence: Teaching Large Language Models the Language of Biology

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557287v1

Cell2Sentence transforms each cell's GE profile into a plaintext sequence of gene names ordered by expression level. This rank transformation can be reverted w/ minimal loss of information. C2S allows any pretrained causal language model (LLM) to be further fine-tuned on cell sequences.

C2S enables forward / reverse transformation with minimal information loss. Inference is done by generating cells via autoregressive cell completion, generating cells from text, or generating text from cells. The generated cell sentences can be converted back to gene expression.
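
The forward rank transformation can be sketched in a few lines. This is an illustrative reading of the description above, not the C2S code: the function names, the dropping of zero-count genes, the alphabetical tie-breaking, and the toy counts are all assumptions.

```python
def cell_to_sentence(expression, top_k=None):
    """Turn {gene: expression} into a space-separated 'cell sentence'.

    Genes are sorted by decreasing expression; zero-count genes are dropped and
    ties are broken alphabetically. Both choices are assumptions of this sketch.
    """
    expressed = [(g, x) for g, x in expression.items() if x > 0]
    expressed.sort(key=lambda gx: (-gx[1], gx[0]))
    genes = [g for g, _ in expressed]
    return " ".join(genes[:top_k] if top_k else genes)

def sentence_to_ranks(sentence):
    """Recover each gene's rank (1 = most expressed) from a cell sentence."""
    return {gene: rank for rank, gene in enumerate(sentence.split(), start=1)}

# Toy cell with made-up counts.
cell = {"GAPDH": 120, "ACTB": 95, "CD3E": 7, "MS4A1": 0, "MALAT1": 300}
sentence = cell_to_sentence(cell)
print(sentence)  # MALAT1 GAPDH ACTB CD3E
```

Only ranks are recovered here; mapping ranks back to expression values needs an additional fitted relationship between rank and expression, which this sketch omits.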







□ FrameD: Framework for DNA-based Data Storage Design, Verification, and Validation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad572/7274858

FrameD, a software framework for designing, verifying, and validating DNA storage system designs. FrameD is not a library of every conceivable error correction algorithm; instead, it provides a fault-injection-based test bed in which DNA storage systems can be evaluated.

FrameD can be configured to allocate compute resources in the form of MPI ranks to both fault injection iterations and work done during fault injection simulations like decoding individual strands, packet outer codes, and sequence alignment.





□ DPGA: DNA-based programmable gate arrays for general-purpose DNA computing

>> https://www.nature.com/articles/s41586-023-06484-9

DPGAs, a DNA integrated circuit (DIC) system built by integrating multilayer DNA-based programmable gate arrays. The use of generic single-stranded oligonucleotides as a uniform transmission signal can reliably integrate large-scale DICs with minimal leakage and high fidelity for general-purpose computing.

Reconfiguration of a single DPGA with 24 addressable dual-rail gates can be programmed with wiring instructions to implement over 100 billion distinct circuits.

They designed DNA origami registers to provide the directionality for asynchronous execution of cascaded DPGAs. A quadratic equation-solving DIC was assembled from three layers of cascaded DPGAs comprising 30 logic gates and around 500 DNA strands.





□ ARES: Geometric deep learning of RNA structure

>> https://www.science.org/doi/10.1126/science.abe5650

The Atomic Rotationally Equivariant Scorer (ARES) predicts the model’s root mean square deviation (RMSD) from the unknown true structure. ARES takes as input a structural model, specified by each atom’s element type and 3D coordinates.

Atom features are repeatedly updated based on the features of nearby atoms. Each feature is then averaged across all atoms, and the resulting averages are fed into additional neural network layers, which output the predicted RMSD of the structural model from the true structure.





□ Allo: Accurate allocation of multi-mapped reads enables regulatory element analysis at repeats

>> https://www.biorxiv.org/content/10.1101/2023.09.12.556916v1

Allo combines probabilistic mapping based on uniquely mapped read (UMR) counts with a convolutional neural network (CNN) that has been trained to identify the appearance of peak-containing regions.

Allo loops through the alignment file and parses uniquely and multi-mapped reads. Alignment files can contain locations that do not have the highest alignment score and thus require extra parsing. Allo identifies the correct pairs when using paired-end sequencing data.

Allo analyzes one read at a time by grouping it with its possible locations. The score vector contains the total read count and the output of the sigmoid function. The final score vector is normalized by dividing all entries by the sum of the vector, giving the final probabilities.
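
The normalization step can be sketched as below. This is not Allo's implementation: how the read count and the CNN output are combined (here, a simple product) and the random draw used to pick a location are assumptions of the sketch; the text only states that both quantities enter the score vector and that the vector is divided by its sum.

```python
import random

def allocation_probabilities(read_counts, cnn_scores):
    """Combine per-location evidence and normalize to probabilities.

    Multiplying the uniquely mapped read count by the CNN's sigmoid output is an
    assumption of this sketch, not a documented formula.
    """
    scores = [c * s for c, s in zip(read_counts, cnn_scores)]
    total = sum(scores)
    if total == 0:
        # No evidence anywhere: fall back to a uniform distribution.
        return [1.0 / len(scores)] * len(scores)
    return [x / total for x in scores]

def allocate(read_counts, cnn_scores, rng=random):
    """Pick one candidate location index for a multi-mapped read (hypothetical rule)."""
    probs = allocation_probabilities(read_counts, cnn_scores)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Three candidate locations for one multi-mapped read (made-up numbers).
probs = allocation_probabilities([10, 30, 0], [0.5, 0.5, 0.9])
print(probs)  # [0.25, 0.75, 0.0]
```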





□ SnapATAC2: a fast, scalable and versatile tool for single-cell omics analysis

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557221v1

SnapATAC2 uses a nonlinear dimensionality reduction algorithm that achieves both computational efficiency and accuracy in discerning cellular composition of complex tissues from a broad spectrum of single-cell omics data types.

SnapATAC2 uses a matrix-free spectral embedding algorithm to project single-cell omics data into a low-dimensional space that preserves the intrinsic geometric properties. SnapATAC2 utilizes the Lanczos algorithm to derive eigenvectors while implicitly using the Laplacian matrix.





□ GraffiTE: a Unified Framework to Analyze Transposable Element Insertion Polymorphisms using Genome-graphs

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557209v1

GraffiTE is a pipeline that finds polymorphic transposable elements (pMEs) in genome assemblies or long read datasets and genotypes the discovered polymorphisms in read sets using a pangenomic approach.

Each pME detected can be further genotyped by mapping short or long reads against a TE graph-genome. It represents each identified ME as a bubble, i.e. providing alternate paths in the graph, where both presence and absence alleles are available for read mapping and genotyping.





□ Cellatlas: Universal preprocessing of single-cell genomics data

>> https://www.biorxiv.org/content/10.1101/2023.09.14.543267v1

Cellatlas is based on parsing of machine-readable seqspec assay specifications to customize inputs for kb-python, which uses kallisto and bustools to catalog reads, error correct barcodes, and count reads.

Cellatlas requires sequencing reads, genomic references, and a seqspec file. It leverages seqspec functionality to auto-generate the kallisto string that specifies the 0-indexed positions of the cellular / molecular barcodes and of genomic features such as cDNA or genomic DNA.





□ Metaphor: A workflow for streamlined assembly and binning of metagenomes

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad055/7233990

Metaphor, a fully automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data and by combining multiple binning algorithms with a bin refinement step.

Metaphor produces genome bins generated w/ Vamb / MetaBAT2 / CONCOCT that are refined w/ the DAS Tool. Metaphor processes multiple datasets in a single execution, performing assembly and binning in separate batches for each dataset, and avoiding the need for repeated executions.





□ Bayesian Maximum Entropy Ensemble Refinement

>> https://www.biorxiv.org/content/10.1101/2023.09.12.557310v1

A fully Bayesian treatment of the estimation of maximum entropy coupling parameters. It tackles the problem head on that the partition function of the maximum entropy ensemble is not tractable analytically.

This approach uses the generated MD trajectories to estimate the partition function using the weighted histogram analysis method (WHAM) algorithm. This achieves an approximation of the maximum entropy Boltzmann probability density, which can be used for MCMC parameter estimation.

This method converges to the maximum entropy ensemble, similar to replica averaging, in the limit of infinitely many iterations, and the approach can be systematically improved by simply increasing the run time of the algorithm.





□ HILAMA: High-dimensional multi-omic mediation analysis with latent confounding

>> https://www.biorxiv.org/content/10.1101/2023.09.15.557839v1

HILAMA (HIgh-dimensional LAtent-confounding Mediation Analysis) addresses two critical challenges in applying mediation analysis (or any causal inference method) to multi-omics studies: (1) accommodating both high-dimensional exposures and mediators, and (2) handling latent confounding.

HILAMA was applied to a real multi-omic dataset collected by the ADNI. This data analysis should be viewed as at most exploratory rather than confirmatory in nature. It is highly likely that the linearity assumption imposed in the Structural Equation Model is not a good approximation of reality.





□ UNNT: A novel Utility for comparing Neural Net and Tree-based models

>> https://www.biorxiv.org/content/10.1101/2023.09.12.557300v1

UNNT (A novel Utility for comparing Neural Net and Tree-based models), a robust framework that trains and compares deep learning methods such as CNNs and tree-based methods such as XGBoost on the user's input dataset.

Grid search trains a new model for every combination of hyperparameters, while cross-validation uses a different subset as test data each time to get an average across five subsets. The best set of hyperparameters found was ETA: 0.1, Max depth: 10, Subsample: 0.5, N estimators: 500.
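
The combination of grid search and five-fold cross-validation can be sketched in plain Python. This is a generic illustration, not UNNT's code: `fit` and `score` are hypothetical callables supplied by the caller, and the contiguous fold split is an assumption (real pipelines usually shuffle first).

```python
from itertools import product
from statistics import mean

def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds; yields (train_idx, test_idx)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def grid_search_cv(grid, fit, score, X, y, k=5):
    """Train one model per hyperparameter combination and average its k-fold score.

    `fit(params, X, y)` returns a model and `score(model, X, y)` returns a number
    where higher is better; both are caller-supplied, hypothetical callables.
    """
    best_params, best_score = None, float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, test in k_fold_indices(len(X), k):
            model = fit(params, [X[i] for i in train], [y[i] for i in train])
            fold_scores.append(score(model, [X[i] for i in test], [y[i] for i in test]))
        avg = mean(fold_scores)
        if avg > best_score:  # keep the first combination that reaches the best average
            best_params, best_score = params, avg
    return best_params, best_score
```

With a grid such as `{"eta": [0.1, 0.3], "max_depth": [6, 10]}` this trains 4 x 5 = 20 models, which is why the number of grid values per hyperparameter is usually kept small.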





□ InterDiff: Guided Diffusion for molecular generation with interaction prompt

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557141v1

InterDiff, an interaction prompt guided diffusion model. InterDiff is a graph neural network in which atoms are the nodes and the Euclidean distances between atoms define the edges.

InterDiff consists of 6 equivariant blocks, and each block has three modules with a transformer-like structure. Atoms in the ligand and protein are initially represented by one-hot vectors and then transformed by a linear layer.
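
The input representation described above (atoms as nodes, distances on edges, one-hot initial features) can be illustrated with a small sketch. It is not InterDiff's code: the element vocabulary, the fully connected edge set, and the toy coordinates are assumptions, and the linear layer and equivariant blocks are omitted.

```python
import math
from itertools import combinations

ELEMENTS = ["C", "N", "O", "S"]  # hypothetical element vocabulary for the example

def build_graph(atoms):
    """Build node features and distance-labelled edges from (element, (x, y, z)) atoms.

    Nodes are one-hot element vectors; every atom pair gets an edge whose feature
    is the Euclidean distance. Fully connecting the graph is an assumption here.
    """
    nodes = [[1 if element == e else 0 for e in ELEMENTS] for element, _ in atoms]
    edges = {
        (i, j): math.dist(atoms[i][1], atoms[j][1])
        for i, j in combinations(range(len(atoms)), 2)
    }
    return nodes, edges

# Three toy atoms with made-up coordinates.
atoms = [("C", (0.0, 0.0, 0.0)), ("O", (3.0, 4.0, 0.0)), ("N", (0.0, 0.0, 2.0))]
nodes, edges = build_graph(atoms)
```

Because the edge features are distances only, they do not change when the whole structure is rotated or translated, which is the property equivariant blocks are designed to respect.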





□ MAVEN: compound mechanism of action analysis and visualisation using transcriptomics and compound structure data in R/Shiny

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05416-8

MAVEN (Mechanism of Action Visualisation and Enrichment), an R/Shiny app which allows for GUI-based prediction of drug targets based on chemical structure, combined with causal reasoning based on causal protein–protein interactions and transcriptomic perturbation signatures.

MAVEN is designed to be scalable and flexible to the needs of the user by taking advantage of parallel processing available in PIDGINv4 and CARNIVAL for the two bottleneck steps, and depending on the available resources can handle large networks and gene expression signatures.






□ NAPU: Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation

>> https://www.nature.com/articles/s41592-023-01993-x

Napu (Nanopore Analysis Pipeline) is a collection of WDL workflows for variant calling and de novo assembly of ONT data, optimized for a single-flowcell ONT sequencing protocol. It includes a new Hapdup method that generates de novo diploid assemblies from ONT sequencing only.

Outside of centromeres and segmental duplications, these assemblies are structurally highly concordant with the HPRC de novo assemblies that were produced from the more expensive combination of multiple sequencing technologies.





□ EMMA: Computing Multiple Sequence Alignments given a Constraint Subset Alignment

>> https://www.biorxiv.org/content/10.1101/2023.06.12.544642v2

EMMA (Extending Multiple alignments using MAFFT--add) addresses the problem of adding a set of unaligned sequences into a multiple sequence alignment (i.e., a constraint alignment).

EMMA builds on MAFFT--add, which is also designed to add sequences into a given constraint alignment. EMMA improves on MAFFT--add methods by using a divide-and-conquer framework to scale its most accurate version, MAFFT-linsi--add, to constraint alignments with many sequences.






□ Current and future directions in network biology

>> https://arxiv.org/abs/2309.08478

Distinct scientific communities may all analyze biological network data, or they may address identical computational challenges across various application domains, such as biological versus social networks. However, they often do not attend the same research forums.

An algorithmic solution to handling different approach categories is to design hybrid methods that employ techniques from all associated disciplines. For example, deep learning methods can be combined w/ a network propagation approach to improve the embedding of multiple networks.





□ DeepCAC: a deep learning approach on DNA transcription factors classification based on multi-head self-attention and concatenate convolutional neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05469-9

DeepCAC (Deep Concatenate Attention Augmented Convolution) employs a multi-unit attention mechanism with a convolutional module in the feature extraction layer to form high-dimensional features. DeepCAC can automatically capture heterogeneous hidden features in DNA sequences.

DeepCAC is not designed to apply the Transformer model directly as DNABERT does. The organization of these modules forms a complete feature vector by concatenating the feature vector of convolution and the feature vector of multi-head self-attention.





□ General encoding of canonical k-mers

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531845v2

A general minimal perfect hash function for canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σ^k/2 − 1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation.

It is formulated recursively, where in the i-th step of the recursion, substring x[i, k−i+1] is processed. The encoding of a palindromic k-mer solely consists of unspecific pairs until reaching the middle of the k-mer, which is either the empty string or a single character.
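
To make the target of such an encoding concrete, the sketch below canonicalizes DNA k-mers under reverse complementation and assigns each canonical k-mer a distinct integer by brute-force enumeration. It only illustrates what a minimal perfect hash must achieve; it is not the recursive, table-free encoding of the paper, and the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(kmer):
    """Reverse the k-mer and complement every base."""
    return "".join(COMPLEMENT[c] for c in reversed(kmer))

def canonical(kmer):
    """Canonical form under reverse complementation: the lexicographically smaller strand."""
    return min(kmer, reverse_complement(kmer))

def brute_force_encoding(k):
    """Map every canonical DNA k-mer to a distinct integer 0..N-1 by enumeration.

    This table-based ranking needs memory exponential in k, which is exactly
    what a closed-form encoding avoids.
    """
    canon = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    return {kmer: idx for idx, kmer in enumerate(canon)}

codes = brute_force_encoding(3)
print(len(codes))  # 32, i.e. 4**3 / 2 canonical 3-mers
```

For odd k no DNA k-mer equals its own reverse complement, so exactly half of the 4^k k-mers are canonical, matching the interval size quoted above for σ = 4.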





□ DeepLOF: An unsupervised deep learning framework for predicting human essential genes from population and functional genomic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05481-z

DeepLOF can integrate genomic features and population genomic data to predict LOF-intolerant genes without human-labeled training data. DeepLOF may not suffer from label leakage and other pitfalls of supervised machine learning.

DeepLOF is outperformed by a missense intolerance score, UNEECON-G, in the prioritization of dominant-negative disease genes, possibly because many dominant-negative mutations are missense mutations.





□ GeneSetR: A web server for gene set analysis based on genome-wide Perturb-Seq data

>> https://www.biorxiv.org/content/10.1101/2023.09.18.558211v1

Perturb-Seq based Gene Set Analyzer (GeneSetR), a user-friendly web-server that can analyze user-defined gene lists based on the data from a recently published genome-wide Perturb-Seq study, which targeted 9,866 genes with 11,258 sgRNAs in the K562 cell line.

The GeneSetR encompasses a diverse array of modules, each specifically designed to provide powerful functionalities utilizing the high-dimensional data derived from Perturb-Seq studies.





□ Biastools: Measuring, visualizing and diagnosing reference bias

>> https://www.biorxiv.org/content/10.1101/2023.09.13.557552v1

Biastools, a tool for measuring and diagnosing reference bias in datasets from diploid individuals such as humans. biastools enables users to set up and run simulation experiments to compare different alignment programs and reference representations in terms of the bias they yield.

Biastools categorizes instances of reference bias according to their cause, which might be primarily due to genetic differences, repetitiveness, local coordinate ambiguity due to gaps, or other causes.





□ DISCERN: deep single-cell expression reconstruction for improved cell clustering and cell subtype and state detection

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03049-x

DISCERN, a novel deep generative neural network for directed single-cell expression reconstruction. DISCERN allows for the realistic reconstruction of gene expression information by transferring the style of high-quality (hq) data onto low-quality (lq) data, in latent and gene space.

DISCERN is based on a modified Wasserstein Autoencoder. DISCERN transfers the “style” of hq onto lq data to reconstruct missing gene expression, which sets it apart from other batch correction methods, which operate in a lower-dimensional representation of the data.





□ CLEAN: Targeted decontamination of sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.08.05.552089v2

CLEAN, an easy-to-use all-in-one decontamination pipeline for short reads and long reads. CLEAN automatically combines different user-defined FASTA reference sequences, built-in spike-in controls, and downloadable host species into one mapping index for decontamination.

CLEAN concatenates all specified contaminations, e.g., to clean reads of the host and the spike-in in one step. Each input file (FASTQ and/or FASTA) is mapped against the contamination reference with minimap2.





□ meK-Means: Biophysically Interpretable Inference of Cell Types from Multimodal Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.09.17.558131v2

meK-Means (mechanistic K-Means), a method to cluster cells from multimodal single-cell data under a self-consistent, biophysical model. Given a set of cell-by-gene count matrices, meK-Means learns clusters of cells which demonstrate shared transcriptional kinetics across genes of interest.

meK-Means infers cluster-specific biophysical parameters which describe transcriptional bursting and rates of mRNA splicing and degradation, alongside learning the partitions of cells into clusters as distinguished by the parameters.





□ Critical assessment of on-premise approaches to scalable genome analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05470-2

A comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability.

GEMINI utilizes a Python indexing package called bcolz to speed up queries targeting genotype fields. Genotype columns in the GEMINI database are indexed, and adding the argument “--use-bcolz” to the same genotype filtering query gives a quicker query response.





□ geNomad: Identification of mobile genetic elements

>> https://www.nature.com/articles/s41587-023-01953-y

geNomad employs a hybrid approach to plasmid and virus identification that combines an alignment-free classifier (sequence branch) and a gene-based classifier (marker branch) to improve classification performance by capitalizing on the strengths of each classifier.

geNomad processes user-provided nucleotide sequences through two branches. In the sequence branch, the inputs are one-hot encoded and fed to an IGLOO neural network, which scores inputs based on the detection of non-local sequence motifs.
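
The one-hot encoding step of the sequence branch can be illustrated as follows. This is a generic sketch rather than geNomad's code: the A, C, G, T column order and the all-zero row for ambiguous bases are assumptions, and the IGLOO network itself is not shown.

```python
ALPHABET = "ACGT"

def one_hot(sequence):
    """One-hot encode a nucleotide string as a list of length-4 rows (A, C, G, T).

    Ambiguous bases such as N become an all-zero row; that convention is an
    assumption of this sketch.
    """
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows

encoded = one_hot("ACGTN")
print(encoded)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
```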





□ ARA: a flexible pipeline for automated exploration of NCBI SRA datasets

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad067/7243537

The ARA (Automated SRA Records Analysis) tool is implemented in Perl and designed to be used from the shell prompt. It employs the NCBI SRA toolkit to download the raw data in FASTQ format from the SRA database.

ARA provides a full or partial SRA record analysis mode and a choice of the sequence screening method (BLAST and BOWTIE2) and taxonomic profiling (Kraken2). The modular design of the pipeline allows easy further expansion of the sequence analysis toolbox.





□ ANS: Adjusted Neighborhood Scoring to improve assessment of gene signatures in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558114v1

ANS (Adjusted Neighbourhood Scoring) is robust with regard to most influencing factors and returns comparable scores for multiple signatures.

Although all scoring methods demonstrate resilience against variations in data composition and variability in signature qualities, they do not exhibit, except for ANS, comparable score ranges for gene signatures designed to discriminate similar cell types.




□ Low-input and single-cell methods for Infinium DNA methylation BeadChips

>> https://www.biorxiv.org/content/10.1101/2023.09.18.558252v1

A new signal detection framework to address the computational challenge of processing data from limited DNA. This new method significantly improved array detection rates while effectively masking probes whose readings are dominated by background signals.

The Infinium BeadChip is compatible with samples of low input down to single cells. The modified detection p-values calculation achieved higher sensitivities for low-input datasets and was validated in over 100,000 public datasets with diverse methylation profiles.





□ MANOCCA: A multivariate outcome test of covariance

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558234v1

MANOCCA (Multivariate Analysis of Conditional CovAriance) enables the identification of both categorical and continuous predictors associated with changes in the covariance matrix of a multivariate outcome while allowing for covariates adjustment.

MANOCCA outperforms existing covariance methods and, given the appropriate parametrization, can maintain a calibrated type I error in a range of realistic scenarios when analysing highly multidimensional data.





□ Fast and sensitive validation of fusion transcripts in whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05489-5

A pipeline to validate gene fusions found in RNA-Seq data at the WGS level. The pipeline consists of extracting, processing and filtering discordant read pairs from specific areas of the genome defined by the detected fusion junctions of fusion transcripts.

The regions to search for discordant read pairs are defined by the junction coordinates of the observed fusion transcript.

Genomic evidence for a fusion will theoretically be found downstream of the sequences observed in the fusion transcript for the 5′ partner and upstream for the 3′ partner, thereby limiting the region that needs to be searched for discordant reads.
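
The rule for limiting the search region can be written down as a small helper. This is a sketch of the logic as described above, not the pipeline's code: the function name, the `"5p"`/`"3p"` labels, and the default window size are hypothetical.

```python
def search_region(junction, strand, partner, window=100_000):
    """Return the (start, end) genomic interval to scan for discordant read pairs.

    `junction` is the genomic coordinate of the fusion junction in one partner gene,
    `strand` is "+" or "-", and `partner` is "5p" or "3p". For the 5' partner the
    breakpoint lies downstream of the junction (in transcript direction); for the
    3' partner it lies upstream. The window size is a hypothetical default.
    """
    if strand not in ("+", "-") or partner not in ("5p", "3p"):
        raise ValueError("strand must be '+' or '-', partner must be '5p' or '3p'")
    downstream = partner == "5p"
    # Downstream means increasing coordinates on "+" and decreasing on "-".
    toward_larger = downstream == (strand == "+")
    if toward_larger:
        return junction, junction + window
    return max(1, junction - window), junction
```

For example, a 5′ partner on the minus strand with its junction at position 1,000 and a 500 bp window is searched over (500, 1000), because downstream on the minus strand means smaller genomic coordinates.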





Aquamarine.

2023-09-13 01:17:36 | Photos




AURORA.

2023-09-13 01:01:01 | Music20



□ AURORA / “The Seed”

>> https://youtu.be/_Mc_OM5oNA8

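Aurora's style and harmonic writing are often compared to Bjork and Adiemus (Karl Jenkins), but she is, if anything, closer to Nordic neo-folk acts such as Valravn, with a richer and more primal backbone that draws on the folk music of Norway, Africa, Japan, and elsewhere.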



□ AURORA / “Gentle Earthquakes”

The Adiemus-like chorus plus electronica-style arrangement is most pronounced on the album “Infections of a Different Kind”.



□ AURORA / “The Blood in the Wine”

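This one is from the latest album, “The Gods We Can Touch”. A track that leans toward Enigma. For a longtime new-age fan like me, it is heartening that this kind of genre-crossing world music still holds a place in the mainstream.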



□ AURORA / “Animal”

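From the album “A Different Kind of Human”, one half of a two-album pair. Like “Infections...”, introduced the other day, it is a work with strong ethnic-music elements such as Adiemus-style choruses and Bulgarian chant. Its forward-looking EDM arrangements also stand out, giving it a near-future, sci-fi atmosphere. The art concept, music videos included, is good too.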



Ahsoka

2023-09-13 00:00:00 | Movies

□ 『Star Wars: Ahsoka』

>> https://ondisneyplus.disney.com/show/ahsoka

(Disney+, 2023)
Directed by Dave Filoni
Produced by Carrie Beck / Jon Favreau
Music by Kevin Kiner
Cinematography by Eric Steelberg


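A bittersweet yet grand story of longing for a comrade who, at the end of a past war, vanished to the edge of the universe along with the threat to the galaxy. It feels like “the ultimate Star Wars as I imagined it”, and it is more “sci-fi fantasy” than any previous entry in the franchise. The giant hyperdrive ring looks cool. The solemn, melancholy ending theme is superb.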


□ Kevin Kiner - Ahsoka - End Credits (From "Ahsoka"/Visualizer Video)






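『Ahsoka』 EP5 - The episode revered by Star Wars fans around the world with reactions like “Am I dreaming?” and “Director Filoni is a god!” It was also screened in theaters in the United States. Ancient ruins, and whales crossing hyperspace. While carrying on the Star Wars canon, it fully realizes “sci-fi fantasy” with unrestrained imagination and visuals.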

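The way Anakin reawakens the adventurous spirit of Ahsoka's younger days is moving to the point of tears, and the grand music lifts it higher than ever before!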



□ Kevin Kiner / “The Hyperspace Jump” | Star Wars: Ahsoka OST

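From the soundtrack of episode 5 of 『Ahsoka』. The scene where the space fleet and a pod of “space whales” pass each other. Probably one of the most beautiful score cues, for one of the most beautiful sequences, in Star Wars, or indeed in the history of science fiction.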