lens, align.

Long is the time, yet what is true comes to pass.

Iris.

2022-11-22 23:22:33 | Science News




□ ÉCOLE: Learning to call copy number variants on whole exome sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.17.516880v1

Based on a variant of the transformer architecture, ÉCOLE learns to call CNVs per exon, using high confidence calls made on matched WGS samples as the semi-ground truth. ÉCOLE is able to mimic the expert labeling for the first time with 68.7% precision and 49.6% recall.

ÉCOLE processes the read-depth signal over each exon. This information is transformed into a read depth embedding using a multi-layered perceptron. The model uses a positional encoding vector which is summed up w/ the transformed read depth encoding and the classification token.





□ MEOMI: An Approach of Gene Regulatory Network Construction Using Mixed Entropy Optimizing Context-Related Likelihood Mutual Information

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac717/6808612

MEOMI combines two entropy estimators to calculate the mutual information between genes. Then, distribution optimization was performed using a context-related likelihood algorithm to eliminate some indirect regulatory relationships and obtain the initial gene regulatory network.

MEOMI uses the conditional mutual inclusive information calculation method to gradually remove redundant edges. The conditional mutual inclusive information of a pair of genes under the influence of multiple related genes is calculated by multi-order traversal algorithm.





□ scmTE: multivariate transfer entropy builds interpretable compact gene regulatory networks by reducing false predictions

>> https://www.biorxiv.org/content/10.1101/2022.11.08.515579v1

scmTE (single-cell multivariate Transfer Entropy) is a new algorithm for building gene regulatory networks. It was the only algorithm that did not produce a hair-ball structure (due to too many predictions), and it recapitulated known ground-truth relationships with high accuracy.

scmTE calculates causal relationships from a gene to a target gene while considering other genes that can influence the target. Similar to TE, mTE relies on the dynamic gene expression changes over time i.e. pseudo-time, the ordered trajectory.
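As a rough illustration of the transfer entropy idea, the sketch below computes plain bivariate TE with a history length of 1 on discretized expression values ordered by pseudo-time. It is not the scmTE implementation: the multivariate conditioning on other genes is left out, and the function names and the toy series are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def transfer_entropy(source, target):
    """TE(source -> target) with history length 1 on discretized series."""
    y_next = target[1:]
    y_prev = target[:-1]
    x_prev = source[:-1]
    # H(Y_next | Y_prev) = H(Y_next, Y_prev) - H(Y_prev)
    h_given_self = entropy(list(zip(y_next, y_prev))) - entropy(y_prev)
    # H(Y_next | Y_prev, X_prev) = H(Y_next, Y_prev, X_prev) - H(Y_prev, X_prev)
    h_given_both = (entropy(list(zip(y_next, y_prev, x_prev)))
                    - entropy(list(zip(y_prev, x_prev))))
    return h_given_self - h_given_both


# Toy pseudo-time series (hypothetical): gene_b repeats gene_a one step later.
gene_a = [0, 1, 0, 0, 1, 1, 0, 1]
gene_b = [0] + gene_a[:-1]

forward = transfer_entropy(gene_a, gene_b)   # information flowing a -> b
backward = transfer_entropy(gene_b, gene_a)  # information flowing b -> a
```

On this toy input the a -> b direction scores higher than b -> a, which is the asymmetry a TE-based network relies on. Real data would need a proper discretization and far more cells than eight.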





□ scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers

>> https://www.biorxiv.org/content/10.1101/2022.11.20.517285v1

scFormer applies self-attention to learn salient gene and cell embeddings through masked gene modelling. scFormer provides a unified framework to readily address a variety of downstream tasks such as data integration, analysis of gene function, and perturbation response prediction.

scFormer employs masked gene modelling to promote the learning of cross-gene relations, inspired by the masked-language modelling in NLP. The self-attention on gene expressions and the introduced MGM and MVC objectives significantly boost the cell-level and gene-level tasks.





□ scAWMV: an Adaptively Weighted Multi-view Learning Framework for the Integrative Analysis of Parallel scRNA-seq and scATAC-seq Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac739/6831091

scAWMV considers both the difference in importance across different modalities in multi-omics data and the biological connection of the features in the scRNA-seq and scATAC-seq data. It generates biologically meaningful low-dimensional representations for the transcriptomic and epigenomic profiles.

The scAWMV objective is minimized by finding the optimal matrix factorization. scAWMV utilizes the linked information b/n the parallel transcriptomic and epigenomic layers. scAWMV uses Louvain clustering and groups the cells in the same clusters in the heatmap of the common latent structure.





□ mtANN: Cell-type annotation with accurate unseen cell-type identification using multiple references

>> https://www.biorxiv.org/content/10.1101/2022.11.17.516980v1

mtANN (multiple-reference-based scRNA-seq data annotation) learns multiple deep classification models from multiple reference datasets, and the multiple prediction results are used to calculate the metric for unseen cell-type identification and to vote for the final annotation.

mtANN integrates multiple references to enrich cell types in the reference atlas to alleviate the unseen cell-type problem. The metric for unseen cell-type identification is defined by three entropy indexes which are calculated from the prediction probability of multiple base classification models and the vote probability.





□ PAST: latent feature extraction with a Prior-based self-Attention framework for Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515447v1

PAST, a variational graph convolutional auto-encoder for ST, which effectively integrates prior information via a Bayesian neural network, captures spatial patterns via a self-attention mechanism, and enables scalable application via a ripple walk sampler strategy.

PAST identifies k nearest neighbors (k-NN) for each spot using spatial coordinates in a Euclidean space, and adopts GCNs to aggregate spatial patterns from each spot’s neighbors.
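The neighbor search can be pictured with a small brute-force sketch. This only shows the k-NN step on 2D spot coordinates, not the GCN aggregation, and `knn_graph` is a made-up helper name rather than anything from the PAST code base.

```python
from math import dist


def knn_graph(coords, k):
    """Map each spot index to the indices of its k nearest spots (Euclidean)."""
    neighbors = {}
    for i, point in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        # Sort the other spots by distance to the current one; ties keep index order.
        others.sort(key=lambda j: dist(point, coords[j]))
        neighbors[i] = others[:k]
    return neighbors


# Four toy spots on a line: two close pairs far apart from each other.
spots = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
graph = knn_graph(spots, k=1)
```

A brute-force search like this costs O(n^2) distance evaluations, so a tree-based or approximate index would be the usual choice for large slides.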

PAST restricts the distance of latent embeddings between neighbors through metric learning, the insight of which is that spatially close spots are more likely to be positive pairs to show similar latent patterns.





□ Bambu: Context-Aware Transcript Quantification from Long Read RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2022.11.14.516358v1

Bambu estimates the likelihood that a novel transcript is valid, allowing the filtering of transcript candidates with a single, interpretable parameter, the novel discovery rate, that is calibrated to guarantee a reproducible maximum false discovery rate across different samples.

Bambu then employs a statistical model to assign reads to transcripts that distinguishes full-length and non full-length (partial) reads, as well as unique and non-unique reads, thereby providing additional evidence from long read RNA-Seq to inform downstream analysis.





□ SCARP: Single-Cell ATAC-seq analysis via Network Refinement with peaks location information

>> https://www.biorxiv.org/content/10.1101/2022.11.18.517159v1

SCARP utilizes the genomic location information of peaks, which contributes to characterizing the co-accessibility of peaks. SCARP uses a network to model the accessible relationships between cells and peaks, and aggregates information with a diffusion method.

The output matrix derived from SCARP can be further processed by a dimension reduction method to obtain low-dimensional embeddings of cells and peaks, which can benefit downstream analyses such as cell clustering and cis-regulatory relationship prediction.





□ iEnhancer-DCLA: using the original sequence to identify enhancers and their strength based on a deep learning framework

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05033-x

iEnhancer-2L uses pseudo k-tuple nucleotide composition (PseKNC) as the encoding method of sequence characteristics. iEnhancer-ECNN uses one-hot encoding and k-mers to process the data, and uses CNN to construct the ensemble model.

iEnhancer-XG combines k-spectrum profile, mismatch k-tuple, subsequence profile and position-specific scoring matrix, and constructs a two-layer predictor using XGBoost. iEnhancer-EBLSTM uses 3-mer to encode the input DNA sequences and predicts enhancers by bidirectional LSTM.

iEnhancer-DCLA first uses word2vec to convert k-mers into number vectors to construct an input matrix. It then uses a convolutional neural network and a BiLSTM network to extract sequence features, and finally uses the attention mechanism to extract relatively important features.





□ INSIDER: Interpretable Sparse Matrix Decomposition for Bulk RNA Expression Data Analysis

>> https://www.biorxiv.org/content/10.1101/2022.11.10.515904v1

INSIDER decomposes variation from different biological variables into a shared low-rank latent space. In particular, it considers interactions between biological variables and introduces the elastic net penalty to induce sparsity, thus facilitating interpretation.

INSIDER computes the adjusted expression that controls for variation in other confounders or covariates. The variation is decomposed into a shared latent space of rank K by matrix factorization. INSIDER incorporates the interaction b/n covariates and the gene representation V.





□ The geometry of Coherent topoi and Ultrastructures

>> https://arxiv.org/abs/2211.03104v1

Studying the geometric properties of coherent topoi with respect to flat embeddings, and letting the notion of ultrastructure emerge naturally from general considerations on the topology of flat embeddings.

Ultrastructures were defined to condense the main properties of the category of models of a first order theory. This technology provides a reconstruction theorem for first order logic that goes under the name of conceptual completeness.





□ scSSA: A clustering method for single cell RNA-seq data based on semi-supervised autoencoder

>> https://www.sciencedirect.com/science/article/abs/pii/S1046202322002298

scSSA is based on semi-supervised autoencoder, Fast Independent Component Analysis and Gaussian mixture clustering. It is an autoencoder based on depth counting, which aims to learn a lower dimensional space so that the original space can be reconstructed accurately.

scSSA also attaches a supervised target. The Gaussian mixture clustering model performs cell clustering on the low dimensional matrix, obtains the clustering results and identifies the cell type, and obtains the clustering visualization through FastICA.





□ DeepCCI: a deep learning framework for identifying cell-cell interactions from single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.11.516061v1

DeepCCI provides two deep learning models: a GCN-based unsupervised model for cell clustering, and a GCN-based supervised model for CCI identification.

DeepCCI learns an embedding function that jointly projects cells into a shared embedding space using Autoencoder and GCN. DeepCCI predicts intercellular crosstalk between any pair of clusters.





□ m6Anet: Detection of m6A from direct RNA sequencing using a multiple instance learning framework

>> https://www.nature.com/articles/s41592-022-01666-1

m6Anet, a MIL-based neural network model that takes in signal intensity and sequence features to identify potential m6A sites from direct RNA-Seq data.

m6Anet takes into account the mixture of modified and unmodified RNAs and outputs the m6A-modification probability at any given site for all DRACH fivemers represented in the training data.

m6Anet learns a high-dimensional representation of individual reads from each candidate site before aggregating them together to produce a more accurate prediction of m6A sites.





□ metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02810-y

metaMIC is a fully automated tool for identifying and correcting misassemblies of (meta)genomic assemblies with the following three steps. Firstly, metaMIC extracts various types of features from the alignment between paired-end sequencing reads and the assembled contigs.

The features extracted in the first step will be used as input of a random forest classifier for identifying misassemblies. metaMIC will localize misassembly breakpoints for each misassembled contig and then correct misassemblies by splitting into parts at the breakpoints.
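The correction step, splitting a contig at its predicted breakpoints, can be sketched in a few lines. This assumes breakpoints are given as 0-based cut positions within the contig; the function name is hypothetical and the random forest step that would produce the breakpoints is not shown.

```python
def split_at_breakpoints(contig, breakpoints):
    """Split a contig sequence into parts at the given 0-based cut positions."""
    # Keep only interior positions, drop duplicates, and process them in order.
    interior = sorted({b for b in breakpoints if 0 < b < len(contig)})
    cuts = [0] + interior + [len(contig)]
    return [contig[start:end] for start, end in zip(cuts, cuts[1:])]


parts = split_at_breakpoints("ACGTACGTAC", [8, 4])
```

A contig with no predicted breakpoints comes back unchanged as a single part, so the same function can be mapped over every contig in an assembly.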





□ End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac724/6820925

SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. SMURF takes as input unaligned sequences and jointly learns an MSA via LAM.

A Smooth Smith-Waterman (SSW) formulation in which the probability that any pair of residues is aligned can be formulated as a derivative.

LAM (Learned Alignment Module), a fully differentiable module for constructing MSAs and hence can be trained in conjunction with another differentiable downstream model. LAM employs a smooth and differentiable version of the Smith-Waterman algorithm.





□ Destin2: integrative and cross-modality analysis of single-cell chromatin accessibility data

>> https://www.biorxiv.org/content/10.1101/2022.11.04.515202v1

Destin2 is a statistical framework for cross-modality dimension reduction, clustering, and trajectory reconstruction of single-cell ATAC-seq data.

Destin2 integrates cellular-level epigenomic profiles from peak accessibility, motif deviation score, and pseudo-gene activity and learns a shared manifold using the multimodal input, followed by clustering and/or trajectory inference.





□ G2Φnet: Relating genotype and biomechanical phenotype of tissues with deep learning

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010660

G2Φnet directly provides a functional expression for a parameterized constitutive relation based on the neural operator architecture. G2Φnet formulates the sample feature w/ a limited dimension, which together with the injected genotype feature composes the material parameters.

G2Φnet formulation is formally similar to the classical approach of constitutive modeling by analytical expressions, hence endowing the method with generalizability and transferability across different specimens in multiple material classes.





□ DELFOS oracle: Managing the evolution of genomics data over time: a conceptual model-based approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04944-z

Updating the DELFOS Oracle so that its architecture can manage the temporal dimension. The Delfos module changes from a static-data perspective to a dynamic-data perspective.

The DELFOS oracle consists of four interconnected modules (HERMES, ULISES, DELFOS, SIBILA) that implement each one of the stages of SILE (Search, Identification, Load, and Exploitation). SIBILA, a genomic information system, automatizes the Exploitation stage of the SILE method.





□ APARENT2: Deciphering the impact of genetic variation on human polyadenylation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02799-4

APARENT2, a residual neural network model that can infer 3′-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals.

APARENT2 was considerably better at variant effect size estimation for cryptic variants outside of the CSE. APARENT2 can score cis-regulatory stability elements near the PAS, but a more general stability model such as Saluki is beneficial for 3′ UTRs with long isoforms.





□ scHumanNet: a single-cell network analysis platform for the study of cell-type specificity of disease genes

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1042/6814446

scHumanNet enables the construction of cell-type specific networks with scRNA-seq data. The SCINET framework takes a single cell gene expression profile and the “reference interactome” HumanNet v3 to construct a list of cell-type specific networks.

HumanNet v3, with 1.1 million weighted edges, is used as scaffold information to infer the likelihood of each gene interaction. scHumanNet could prioritize genes associated with particular cell types using CGN centrality and identified the differential hubness of CGNs.





□ Multiple Sequence Alignment based on deep Q network with negative feedback policy

>> https://www.sciencedirect.com/science/article/abs/pii/S1476927122001608

Leveraging the Negative Feedback Policy (NFP) to enhance the performance and accelerate the convergence of the model. A new profile algorithm is developed to compute the sequence from aligned sequences for the next profile-sequence alignment to facilitate the experiment.

Compared with six state-of-the-art methods (three different genetic algorithms, Q-learning, ClustalW, and MAFFT), this method achieves higher Sum-of-Pairs (SP) and Column Score (CS) values on most datasets, with SP score gains ranging from 2 to 1056.





□ scAN10: A reproducible and standardized pipeline for processing 10X single cell RNAseq data

>> https://www.biorxiv.org/content/10.1101/2022.11.07.515546v1

scAN10, a processing pipeline of 10X single cell RNAseq data, that inherits the ability to be executed on most computational infrastructures, thanks to Nextflow DSL2.

Filtering the GTF by removing unwanted genes based on the 10X reference had a major impact both on the number of genes and on gene counts. When using Kallisto-bustools instead of Cellranger, the impact on the count numbers for specific genes seemed to be small but meaningful.





□ Adversarial Attacks on Genotype Sequences

>> https://www.biorxiv.org/content/10.1101/2022.11.07.515527v1

A gradient-based adversarial attack to change the prediction of commonly used genotype classification and segmentation methods (i.e. global and local ancestry inference), while minimally modifying the input sequences.

A d-dimensional binary ’mutation mask’ indicates which positions of the DNA sequence need to be changed. When the adversarial sequences are used as input, each method outputs the category specified as target label (EUR for PCA, AHG for k-NN, AMR for LAI-Net, and OCE for N. ADM).





□ Structured Joint Decomposition (SJD) identifies conserved molecular dynamics across collections of biologically related multi-omics data matrices

>> https://www.biorxiv.org/content/10.1101/2022.11.07.515489v1

SJD focuses specifically on within experiment variation and protects against warping of a single jointly learned manifold by between experiment variation that is often related to technological and/or batch effects.

SJD can process matrices from any data modality that uses systematic row names that map across matrices. Prior to running the SJD decomposition functions, the sjdWrap() function can be used to automatically find shared rows across all the input matrices.





□ SparkEC: speeding up alignment-based DNA error correction tools

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05013-1

SparkEC, a new parallel tool based on Apache Spark aimed at correcting errors in genomic reads that relies on accurate algorithms based on multiple sequence alignment strategies. SparkEC also uses a novel split-based processing strategy with a two-step k-mers distribution.

SparkEC relies on a hash-based partitioning strategy, which partitions the data based on the hashcode of the Resilient Distributed Datasets (RDD) elements. SparkEC defines the hashcode of the RDD elements in such a way that they get evenly distributed.
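The general idea of hash-based partitioning can be shown without Spark. The sketch below is a plain-Python stand-in, not SparkEC's code or the Spark API: it uses CRC32 as the hash (an assumption made here so the result is deterministic across runs), and the helper names are invented.

```python
import zlib


def partition_of(kmer, num_partitions):
    """Pick a partition index from a deterministic hash of the k-mer."""
    return zlib.crc32(kmer.encode("ascii")) % num_partitions


def hash_partition(kmers, num_partitions):
    """Distribute k-mers into buckets; equal k-mers always share a bucket."""
    buckets = [[] for _ in range(num_partitions)]
    for kmer in kmers:
        buckets[partition_of(kmer, num_partitions)].append(kmer)
    return buckets


buckets = hash_partition(["ACG", "CGT", "ACG", "GTA"], num_partitions=3)
```

Because the partition depends only on the element's hash, identical k-mers from different reads land together, and how balanced the buckets are depends entirely on how the hash spreads the keys.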





□ GTS: Genome Transformation Subprograms

>> https://github.com/go-gts/gts





□ Quasic: Reliable and accurate gene expression quantification with subpopulation structure-aware constraints for single-cell RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2022.11.08.515740v1

Quasic, a novel scRNA-seq quantification pipeline which examines the potential cell subpopulation information during quantification, and uses the information to calculate the gene expression level.

Quasic uses the Louvain algorithm to perform clustering. Quasic could separate the doublet and the purified cell type cluster. Quasic not only correctly reinforced the cell signatures, but also identified the corresponding cell subpopulations and biological pathways accurately.





□ HCLC-FC: A novel statistical method for phenome-wide association studies

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276646

HCLC-FC (Hierarchical Clustering Linear Combination with False discovery rate Control), a novel and powerful multivariate method, to test the association between a genetic variant with multiple phenotypes for each phenotypic category in PheWAS.

HCLC-FC uses the bottom-up Hierarchical Clustering Method (HCM) to partition a large number of phenotypes into disjoint clusters within each category.

The CLC combines test statistics within each phenotypic category and obtains p-values from each phenotypic category. False discovery rate control is then applied, based on a large-scale association testing procedure w/ theoretical guarantees for FDR control under flexible correlation structures.





□ Hybran: Hybrid Reference Transfer and ab initio Prokaryotic Genome Annotation

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515824v1

Hybran, a hybrid reference-based and ab initio prokaryotic genome annotation pipeline that transfers features from a curated reference annotation and supplements unannotated regions with ab initio predictions.

Hybran uses the Rapid Annotation Transfer Tool (RATT) to transfer as many annotations as possible from reference genome annotation based on conserved synteny b/n the nucleotide genome sequences. Hybran then supplements unannotated regions with ab initio predictions from Prokka.





□ FASSO: An AlphaFold based method to assign functional annotations by combining sequence and structure orthology

>> https://www.biorxiv.org/content/10.1101/2022.11.10.516002v1

FASSO combines both sequence- and structure-based reciprocal best hit approaches to obtain a more accurate and complete set of orthologs across diverse species. FASSO provides confidence labels on ortholog predictions and flags potential misannotations in existing proteomes.

FASSO uses Diamond, FoldSeek, and FATCAT to find reciprocal best hits and aggregates those results for a final set of ortholog predictions. FASSO merges the results from each method, assigns confidence labels based on the level of agreement, and removes conflicting predictions.
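A reciprocal best hit and an agreement-based label can be sketched as follows. The inputs are assumed to be precomputed best-hit dictionaries (one per direction, per method); the label format and function names are invented for illustration, and the removal of conflicting predictions is not implemented here.

```python
from collections import Counter


def reciprocal_best_hits(a_to_b, b_to_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}


def merge_with_confidence(rbh_sets):
    """Label each pair by how many methods reported it as a reciprocal best hit."""
    votes = Counter(pair for pairs in rbh_sets for pair in pairs)
    total = len(rbh_sets)
    return {pair: f"{count}/{total} methods" for pair, count in votes.items()}


# Toy best-hit tables for one method (protein IDs are hypothetical).
seq_rbh = reciprocal_best_hits(
    {"a1": "b1", "a2": "b2", "a3": "b2"},
    {"b1": "a1", "b2": "a3"},
)
# A second method that only recovered one of the pairs.
struct_rbh = {("a1", "b1")}
labels = merge_with_confidence([seq_rbh, struct_rbh])
```

Here `a2` is dropped because its best hit `b2` points back to `a3`, which is exactly the asymmetry the reciprocal criterion filters out.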





□ Moonlight: An Automatized Workflow to Study Mechanistic Indicators for Driver Gene Prediction

>> https://www.biorxiv.org/content/10.1101/2022.11.18.517066v1

Moonlight2 provides the user with the mutation-based mechanistic indicator to streamline the analyses of this second layer of evidence. The Moonlight Process Z-scores indicate if the activity of the process is increased or decreased based on literature reports and gene expression levels.

One of the strengths of Moonlight is its classification of driver genes into TSGs and OGs which allows for the prediction of dual role genes - genes that are predicted as TSGs in one biological context but as OGs in another context.






…still the yearning stays,

2022-11-22 23:11:11 | Science News




□ Ibex: Variational autoencoder for single-cell BCR sequencing.

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515787v1

Ibex vectorizes the amino acid sequence of the complementarity-determining region 3 (cdr3) of the immunoglobulin heavy and light chains, allowing for unbiased dimensional reduction of B cells using their BCR repertoire.

Ibex was trained on 600,000 human cdr3 sequences of the respective Ig chain, w/ a 128-64-30-64-128 neuron structure. Ibex enables the reduction of cell-level quantifications to clonotype-level quantifications using minimal Euclidean distance across principal component dimensions.





□ gGN: learning to represent graph nodes as low-rank Gaussian distributions

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516704v1

gGN, a novel representation for graph nodes that uses Gaussian distributions to map nodes not only to point vectors (means) but also to ellipsoidal regions (covariances).

While the Kullback-Leibler divergence is well suited for capturing asymmetric local structures, the reverse KL additionally leads to Gaussian distributions whose entropies properly preserve the information contents of nodes.





□ scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks

>> https://www.nature.com/articles/s41592-022-01562-8

Extending the Basset architecture to predict single cell chromatin accessibility from sequences, using a bottleneck layer to learn low-dimensional representations of the single cells.

scBasset is based on a deep convolutional neural network to predict single cell chromatin accessibility from the DNA sequence underlying peak calls. scBasset takes as input a 1344 bp DNA sequence from each peak’s center and one-hot encodes it as a 4×1344 matrix.
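The input encoding is easy to make concrete. This is a minimal sketch of one-hot encoding a DNA string as a 4×L matrix using plain lists; the row order A, C, G, T and the all-zero column for ambiguous bases such as N are assumptions of this example, not details taken from scBasset.

```python
BASES = "ACGT"  # assumed row order of the one-hot matrix


def one_hot(seq):
    """Encode a DNA sequence as a 4 x len(seq) matrix of 0/1 values."""
    seq = seq.upper()
    # One row per base; a column is all zeros when the base is not A/C/G/T.
    return [[1 if base == row_base else 0 for base in seq] for row_base in BASES]


matrix = one_hot("ACGTN")
peak_matrix = one_hot("A" * 1344)  # same shape as a 1344 bp peak-centered input
```

A convolutional model would typically consume this as a dense array; the nested-list form is used here only to keep the example dependency-free.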





□ Revisiting pangenome openness with k-mers

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516472v1

Defining a genome as a set of abstract items, which is entirely agnostic of the two approaches (gene-based, sequence-based). Genes are a viable option for items, but also other possibilities are feasible, e.g., genome sequence substrings of fixed length k.

Genome assemblies must be computed when using a gene-based approach, while k-mers can be extracted directly from sequencing reads. The pangenome is defined as the union of these sets. The estimation of the pangenome openness requires the computation of the pangenome growth.
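With k-mers as the items, the set view translates directly into code. The sketch below tracks how the pangenome (the union of k-mer sets) grows as genomes are added in one fixed order; averaging over many orderings, canonical k-mers, and the openness fit itself are left out, and the function names are invented.

```python
def kmer_set(seq, k):
    """All distinct substrings of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pangenome_growth(genomes, k):
    """Size of the running union of k-mer sets after each added genome."""
    pangenome = set()
    sizes = []
    for genome in genomes:
        pangenome |= kmer_set(genome, k)
        sizes.append(len(pangenome))
    return sizes


growth = pangenome_growth(["ACGT", "CGTA", "ACGT"], k=3)
```

A curve that keeps rising as genomes are added suggests an open pangenome, while one that flattens, as in this toy case where the third genome adds nothing new, suggests a closed one.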





□ Snapper: a high-sensitive algorithm to detect methylation motifs based on Oxford Nanopore reads

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516621v1

Snapper, a new highly-sensitive approach to extract methylation motif sequences based on a greedy motif selection algorithm. Snapper has shown higher enrichment sensitivity compared with the MEME tool coupled with Tombo or Nanodisco instruments.

Snapper uses a k-mer approach, with k chosen to be 11 in order to cover all 6-mers that cover one particular base under the assumption that, in general, approximately 6 bases are located in the nanopore simultaneously.
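The choice of k = 11 follows from simple arithmetic: if 6 bases sit in the pore at once, the 6-mers that include a given base start anywhere from 5 positions upstream to the base itself, so together they span 2 × 6 − 1 = 11 bases. A small sketch (with a hypothetical helper name and a toy sequence) makes this explicit:

```python
def context_window(seq, pos, pore=6):
    """Smallest window containing every pore-sized k-mer that covers seq[pos]."""
    start = pos - (pore - 1)
    end = pos + pore  # exclusive end, so the window holds 2 * pore - 1 bases
    if start < 0 or end > len(seq):
        return None  # too close to a sequence end for a full window
    return seq[start:end]


read = "ACGTACGTACGTACGTACGT"
window = context_window(read, pos=10)
# Every 6-mer that includes position 10 starts at one of positions 5..10.
covering_6mers = [read[s:s + 6] for s in range(5, 11)]
```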

All the extracted k-mers are merged by a greedy algorithm which generates the minimal set of potential modification motifs that can explain most of the selected 11-mers, under the assumption that all selected 11-mers contain at least one modified base.





□ SCOOTR: Jointly aligning cells and genomic features of single-cell multi-omics data with co-optimal transport

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515883v1

SCOOTR provides quality alignments for unsupervised cell-level and feature-level integration of datasets with sparse feature correspondences. It returns the feature-feature coupling matrix for the user to investigate the correspondence probabilities.

SCOOTR uses the cell-cell coupling matrix to align the samples in the same space via barycentric projection or co-embedding via tSNE. Its unique joint alignment formulation provides the ability to perform the weak supervision at both sample and feature level.





□ memento: Generalized differential expression analysis of single-cell RNA-seq with method of moments estimation and efficient resampling

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515836v1

memento, an end-to-end method that implements a hierarchical model for estimating the mean, residual variance, and gene correlation from scRNA-seq data and a statistical framework for hypothesis testing of differences in these parameters between groups of cells.

memento models scRNA-seq using a novel multivariate hypergeometric sampling process while making no assumptions about the true distributional form of gene expression within cells.

memento implements an innovative bootstrapping strategy for efficient statistical comparisons of the estimated parameters between groups of cells that can also incorporate biological and technical replicates.





□ GALBA: a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS

>> https://github.com/Gaius-Augustus/GALBA

GALBA code was derived from BRAKER, a fully automated pipeline for predicting genes in the genomes of novel species with RNA-Seq data and a large-scale database of protein sequences with GeneMark-ES/ET/EP/ETP and AUGUSTUS.

GALBA is a fully automated gene prediction pipeline that trains AUGUSTUS for a novel species and subsequently predicts genes with AUGUSTUS. GALBA uses the protein sequences of one closely related species to generate a training gene set for AUGUSTUS with either miniprot or GenomeThreader.





□ Genome-wide single-molecule analysis of long-read DNA methylation reveals heterogeneous patterns at heterochromatin

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516549v1

Conducting a genome-wide analysis of single-molecule DNA methylation patterns in long reads derived from Nanopore sequencing in order to understand the nature of large-scale intra-molecular DNA methylation heterogeneity in the human genome.

Like mean methylation levels, the mean single-read and bulk measurements of the coefficient of variation and correlation were significantly correlated. Oscillatory DNA patterns are observed in single reads with a high heterogeneity.





□ singleCellHaystack: A universal differential expression prediction tool for single-cell and spatial genomics data

>> https://www.biorxiv.org/content/10.1101/2022.11.13.516355v1

singleCellHaystack, a method that predicts DEGs based on the distribution of cells in which they are active within an input space. Previously, singleCellHaystack was not able to handle sparse matrices, limiting its applicability to the ever-increasing dataset sizes.

singleCellHaystack now accepts continuous features that can be RNA or protein expression, chromatin accessibility or module scores from single cell, spatial and even bulk genomics data, and it can handle 1D trajectories, 2-3D spatial coordinates, as well as higher-dimensional latent spaces.





□ MoClust: Clustering single-cell multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac736/6831092

MoClust introduces a selective automatic doublet detection module in the pretraining stage, which identifies and filters out doublets to improve data quality. Omics-specific autoencoders are introduced to characterize the multi-omics data.

A contrastive learning way of distribution alignment is adopted to adaptively fuse omics representations into an omics-invariant representation.

This novel way of alignment boosts the compactness and separableness of clusters, while accurately weighting the contribution of each omics to the clustering object.





□ BulkSignalR: Inferring ligand-receptor cellular networks from bulk and spatial transcriptomic datasets

>> https://www.biorxiv.org/content/10.1101/2022.11.17.516911v1

BulkSignalR exploits reference databases of known ligand-receptor interactions (LRIs), gene or protein interactions, and biological pathways to assess the significance of correlation patterns between a ligand, its putative receptor, and the targets of the downstream pathway.

There is an obvious parallel with enrichment analysis of gene sets versus the analysis of individual differentially expressed genes. This infrastructure allows network visualization for relating LRIs to target genes.





□ trans-PCO: Trans-eQTL mapping in gene sets identifies network effects of genetic variants

>> https://www.biorxiv.org/content/10.1101/2022.11.11.516189v1

trans-PCO, a flexible approach that uses the PCA-based omnibus test to combine multiple PCs and improve power to detect trans-eQTLs. trans-PCO filters sequencing reads and genes based on mappability across different regions of the genome to avoid false positives due to mis-mapping.

trans-PCO uses a novel multivariate association test to detect genetic variants with effects on multiple genes in predefined sets and captures genetic effects on multiple PCs. By default, trans-PCO defines sets of genes based on co-expression gene modules as identified by WGCNA.





□ Accurate Detection of Incomplete Lineage Sorting via Supervised Machine Learning

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515828v1

A model to infer important properties of a particular internal branch of the species tree via genome-scale summary statistics extracted from individual alignments and inferred gene trees.

The model predicts the presence or absence of discordance, estimates the probability of discordance, and infers the correct species tree topology. A variety of SML algorithms can distinguish biological discordance from gene tree inference error across a wide range of parameter space.





□ STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring

>> https://www.biorxiv.org/content/10.1101/2022.11.10.516050v1

STREAK estimates receptor abundance levels by leveraging associations between gene expression and protein abundance to enable receptor gene set scoring of scRNA-seq target data.

STREAK generates weighted receptor gene sets using joint scRNA-seq/CITE-seq training data with the gene set for each receptor containing the genes whose normalized and reconstructed scRNA-seq expression values are most strongly correlated with CITE-seq receptor protein abundance.





□ BICOSS: Bayesian iterative conditional stochastic search for GWAS

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05030-0

BICOSS is an iterative procedure where each iteration is comprised of two steps: a screening and a model selection step. BICOSS is initialized with a base model fitted as a linear mixed model with no SNPs in the model.

Then the screening step fits as many models as there are SNPs, each model containing one SNP and regressed against the residuals of the base model. The screening step identifies a set of candidate SNPs using Bayesian FDR control applied to the posterior probabilities of the SNPs.

BICOSS performs Bayesian model selection where the possible models contain any combination of the base model and SNPs from the candidate set. If the model space is too large to perform complete enumeration, a genetic algorithm is used to perform stochastic model search.
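
The Bayesian FDR control used in the screening step can be illustrated with a minimal sketch. It assumes each SNP already has a posterior probability of association (the values in the usage note are hypothetical) and applies a common rule: keep the largest set of SNPs whose average posterior probability of being null stays at or below a target level. The function name and the threshold are assumptions for illustration; the exact rule BICOSS implements may differ.

```python
import numpy as np

def bayesian_fdr_select(post_prob, alpha=0.05):
    """Return sorted indices of candidates kept under Bayesian FDR control.

    post_prob: posterior probability that each SNP is associated.
    alpha: target Bayesian FDR (hypothetical default).
    """
    post_prob = np.asarray(post_prob, dtype=float)
    order = np.argsort(-post_prob)          # most probable SNPs first
    null_prob = 1.0 - post_prob[order]      # posterior probability of a false discovery
    # Expected FDR of the top-k set is the mean null probability of its members.
    running_fdr = np.cumsum(null_prob) / np.arange(1, len(order) + 1)
    passing = np.nonzero(running_fdr <= alpha)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    k = passing[-1] + 1                     # largest set that still meets the target
    return np.sort(order[:k])
```

For example, `bayesian_fdr_select([0.99, 0.2, 0.97, 0.6, 0.95])` keeps the SNPs at indices 0, 2 and 4, which would then form the candidate set passed to model selection.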





□ LVBRS: Latch Verified Bulk-RNA Seq toolkit: a cloud-based suite of workflows for bulk RNA-seq quality control, analysis, and functional enrichment

>> https://www.biorxiv.org/content/10.1101/2022.11.10.516016v1

The LVBRS toolkit supports three databases—Gene Ontology, KEGG Pathway, and Molecular Signatures database—capturing diverse functional information. The LVBRS workflow also conducts differential intron excision analysis.





□ UniverSC: A flexible cross-platform single-cell data processing pipeline

>> https://www.nature.com/articles/s41467-022-34681-z

UniverSC is a shell utility that operates as a wrapper for Cell Ranger. Cell Ranger has been optimised further by adapting open-source techniques, such as the third-party EmptyDrops algorithm for cell calling or filtering, which does not assume thresholds specific to the Chromium platform.

In principle, UniverSC can be run on any droplet-based or well-based technology. UniverSC provides a file with summary statistics, including the mapping rate, assigned/mapped read counts and UMI counts for each barcode, and averages for the filtered cells.





□ VarSCAT: A computational tool for sequence context annotations of genomic variants

>> https://www.biorxiv.org/content/10.1101/2022.11.11.516085v1

Breakpoint ambiguities may cause potential problems for downstream annotations, such as the Human Genome Variation Society (HGVS) nomenclature of variants, which recommends a 3’-aligned position but may lead to redundancies of indels.

VarSCAT, a variant sequence context annotation tool with various functions for studying the sequence contexts around variants and annotating variants with breakpoint ambiguities, flanking sequences, HGVS nomenclature, distances between adjacent variants, and tandem repeat regions.
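
Breakpoint ambiguity is easiest to see on a deletion inside a repeat. The sketch below is not VarSCAT's code; it is a minimal illustration, with hypothetical function names and a made-up reference string, of how a deletion can be shifted to its 5'-most and 3'-most equivalent positions (the latter being what HGVS recommends).

```python
def shift_deletion(ref, start, length, direction="right"):
    """Shift the deletion ref[start:start+length] as far as possible
    without changing the resulting sequence (0-based coordinates)."""
    if direction == "right":
        # Moving one base right is allowed when the base leaving the
        # deletion equals the base entering it.
        while start + length < len(ref) and ref[start] == ref[start + length]:
            start += 1
    else:
        while start > 0 and ref[start - 1] == ref[start + length - 1]:
            start -= 1
    return start

def apply_deletion(ref, start, length):
    """Return the sequence after deleting ref[start:start+length]."""
    return ref[:start] + ref[start + length:]

ref = "TTCAGAGAGT"                           # hypothetical reference with an AG repeat
left = shift_deletion(ref, 5, 2, "left")     # 5'-most equivalent position
right = shift_deletion(ref, 5, 2, "right")   # 3'-most (HGVS-style) position
```

Every start position between `left` and `right` yields the same alternate sequence, which is exactly the redundancy that a breakpoint-ambiguity annotation reports.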





□ AGouTI - flexible Annotation of Genomic and Transcriptomic Intervals

>> https://www.biorxiv.org/content/10.1101/2022.11.13.516331v1

AGouTI is a universal tool for flexible annotation of any genomic or transcriptomic coordinates using known genomic features deposited in different publicly available databases in the form of GTF or GFF files.

AGouTI is designed to provide a flexible selection of genomic features overlapping or adjacent to annotated intervals, can be used on custom column-based text files obtained from different data analysis pipelines, and supports operations on transcriptomic coordinate systems.





□ SEGCOND predicts putative transcriptional condensate-associated genomic regions by integrating multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac742/6832039

SEGCOND, a computational framework aiming to highlight genomic regions involved in the formation of transcriptional condensates. SEGCOND is flexible in combining multiple genomic datasets related to enhancer activity and chromatin accessibility, to perform a genome segmentation.

SEGCOND uses this segmentation to detect highly transcriptionally active regions of the genome. Through the integration of Hi-C data, it then identifies putative transcriptional condensate (PTC) regions as genomic domains where multiple enhancer elements coalesce in three-dimensional space.





□ lmerSeq: an R package for analyzing transformed RNA-Seq data with linear mixed effects models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05019-9

lmerSeq can fit models including multiple random effects, implement correlation structures, construct contrasts and simultaneous tests of multiple regression coefficients, and use multiple methods for calculating denominator degrees of freedom for F- and t-tests.

In models with a misspecified random effects structure (including a random intercept only), FDR is increased relative to the models with correctly specified random effects for both lmerSeq and DREAM.

Since DREAM and lmerSeq are capable of fitting similar LMMs, it appears that the driving force behind the differential behavior between lmerSeq and DREAM is the choice of transformation, with lmerSeq utilizing DESeq2’s VST and DREAM using its own modification of VOOM.





□ rGREAT: an R/Bioconductor package for functional enrichment on genomic regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac745/6832038

GREAT is a widely used tool for functional enrichment on genomic regions. However, as an online tool, it has limitations of outdated annotation data, small numbers of supported organisms and gene set collections, and not being extensible for users.

rGREAT integrates a large number of gene set collections for many organisms. First, it serves as a client to directly interact with the GREAT web service in the R environment. It automatically submits the input regions to GREAT and retrieves results from there.





□ Modeling and cleaning RNA-seq data significantly improve detection of differentially expressed genes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05023-z

RNAdeNoise is a program for cleaning RNA-seq data that improves the detection of differentially expressed genes, specifically genes with a low to moderate absolute level of transcription.

This cleaning method has a single variable parameter – the filtering strength, which is a removed quantile of the exponentially distributed counts. It computes the dependency between this parameter and the number of detected DEGs.





□ CAGEE: computational analysis of gene expression evolution

>> https://www.biorxiv.org/content/10.1101/2022.11.18.517074v1

CAGEE analyzes changes in global or sample- or clade-specific gene expression taking into account phylogenetic history, and provides a statistical foundation for evolutionary inferences. CAGEE uses Brownian motion to model GE changes across a user-specified phylogenetic tree.

The reconstructed distribution of counts and their inferred evolutionary rate σ² generated under this model provide a basis for assessing the significance of the observed differences among taxa.





□ USAT: a bioinformatic toolkit to facilitate interpretation and comparative visualization of tandem repeat sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05021-1

A Universal STR Allele Toolkit (USAT) for TR haplotype analysis, which takes TR haplotype output from existing tools to perform allele size conversion, sequence comparison of haplotypes, figure plotting, comparison for allele distribution, and interactive visualization.

USAT takes the TR sequences in a plain text file and the TR locus configuration in a BED-formatted plain text file as input to calculate the length of each haplotype sequence in nucleotide base pairs (bp) and the number of repeats.





□ H3AGWAS: a portable workflow for genome wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05034-w

H3Agwas is a simple human GWAS analysis workflow for data quality control and basic association testing developed by H3ABioNet. It is an extension of the witsGWAS pipeline for human genome-wide association studies built at the Sydney Brenner Institute for Molecular Bioscience.

H3Agwas uses Nextflow for workflow management and has been dockerised to facilitate portability. It is split into several independent sub-workflows that map to separate phases of a GWAS, so users can execute only the parts relevant to them at each phase.





□ DNA-LC: Multiple errors correction for position-limited DNA sequences with GC balance and no homopolymer for DNA-based data storage

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac484/6835379

DNA-LC, a novel coding schema which converts binary sequences into DNA base sequences that satisfy both the GC balance and run-length constraints.

The DNA-LC coding mode can detect and correct multiple errors, giving it a higher error correction capability than other methods that target single-error correction within a single strand.
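
The two sequence constraints named above are simple to check. The sketch below does not reproduce the DNA-LC encoder; it only tests whether a given strand satisfies a GC-balance window and a maximum run length. The function names and the default thresholds (40–60% GC, no repeated adjacent base) are assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def max_run_length(seq):
    """Length of the longest run of identical consecutive bases."""
    if not seq:
        return 0
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def satisfies_constraints(seq, gc_low=0.4, gc_high=0.6, max_run=1):
    """True if seq is GC-balanced and has no run longer than max_run.

    max_run=1 corresponds to 'no homopolymer'; all thresholds are
    hypothetical defaults, not values taken from the paper.
    """
    if not seq:
        return False
    return gc_low <= gc_content(seq) <= gc_high and max_run_length(seq) <= max_run
```

With these defaults, `ACGTACGT` passes, `AACGT` fails on run length, and `ATATATAT` fails on GC balance.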





□ SyBLaRS: A web service for laying out, rendering and mining biological maps in SBGN, SBML and more

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010635

SyBLaRS (Systems Biology Layout and Rendering Service) accommodates a number of novel methods as well as widely known and used ones on automatic layout of pathways, calculating graph-theoretic properties in pathways and mining pathways for subgraphs of interest.

SyBLaRS exposes Dijkstra's shortest-path algorithm. It finds one of potentially many shortest paths from a single dedicated node to another one, whereas algorithms such as Paths-between and Paths-from-to find all such paths between a group of source and target nodes.
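
For reference, a minimal stand-alone sketch of Dijkstra's algorithm follows. It is a generic textbook version, not the SyBLaRS service API; the adjacency-dictionary graph format and the function name are assumptions. Like the service, it returns a single shortest path even when several exist.

```python
import heapq

def dijkstra_path(graph, source, target):
    """Return (path, distance) for one shortest path, or (None, inf).

    graph: dict mapping node -> {neighbor: non-negative edge weight}.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue                      # stale queue entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

# Hypothetical toy pathway graph.
graph = {"A": {"B": 1, "C": 4}, "B": {"C": 1, "D": 5}, "C": {"D": 1}, "D": {}}
path, distance = dijkstra_path(graph, "A", "D")
```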





□ IMMerge: Merging imputation data at scale

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac750/6839927

IMMerge, a Python-based tool that takes advantage of multiprocessing to reduce running time. For the first time in a publicly available tool, imputation quality scores are correctly combined with Fisher’s z transformation.

IMMerge is designed to: (i) rapidly combine sets of imputed data through multiprocessing to accelerate the decompression of inputs, compression of outputs, and merging of files; (ii) preserve variants not shared by all subsets;

(iii) combine imputation quality statistics and detect significant variation in SNP-level imputation quality; (iv) manage samples duplicated across subsets; (v) output relevant combined summary information including allele frequency (AF) and minor AF as weighted means, maximum, and minimum values.
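
How Fisher's z transformation can combine per-subset quality scores is sketched below. This is an illustration under stated assumptions, not IMMerge's implementation: it treats the square root of each imputation R² as a correlation, averages the z-transformed values with the conventional weights n − 3 (so every subset needs more than three samples), and back-transforms. The function name and the clipping constant are hypothetical.

```python
import numpy as np

def combine_rsq(rsq, n_samples, eps=1e-12):
    """Combine per-subset imputation R^2 values via Fisher's z.

    rsq: imputation R^2 of one variant in each subset.
    n_samples: sample size of each subset (assumed > 3).
    """
    rsq = np.asarray(rsq, dtype=float)
    n = np.asarray(n_samples, dtype=float)
    r = np.sqrt(np.clip(rsq, 0.0, 1.0))
    r = np.clip(r, 0.0, 1.0 - eps)        # arctanh(1) is infinite
    z = np.arctanh(r)                     # Fisher's z transformation
    weights = n - 3.0                     # inverse variance of z
    z_bar = np.sum(weights * z) / np.sum(weights)
    return float(np.tanh(z_bar) ** 2)     # back to the R^2 scale
```

Averaging on the z scale rather than averaging R² directly keeps the combined value between the smallest and largest input while accounting for the skewed sampling distribution of correlations near 1.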





□ Improving dynamic predictions with ensembles of observable models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac755/6842325

The procedure starts by analysing structural identifiability and observability; if the analysis of these properties reveals deficiencies in the model structure that prevent it from inferring key parameters or state variables, the method then searches for a suitable reparameterization.

Once a fully identifiable and observable model structure is obtained, it is calibrated using a global optimization procedure, that yields not only an optimal parameter vector but also an ensemble of other possible solutions.

This method exploits the information in these additional vectors to build an ensemble of models with different parameterizations.

The hybrid global optimization approach used here performs a balanced sampling of the parameter space; as a consequence, the median of the ensemble is a good approximation of the median of the model given parameter uncertainty.
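
The ensemble idea can be sketched in a few lines. The toy model (exponential decay), the parameter vectors, and the function names below are hypothetical stand-ins; in the method described above the ensemble would come from the solutions collected during global optimization of the reparameterized model.

```python
import numpy as np

def simulate(params, t):
    """Toy observable model: x(t) = x0 * exp(-k * t)."""
    x0, k = params
    return x0 * np.exp(-k * t)

def ensemble_prediction(param_ensemble, t):
    """Pointwise median and range of trajectories over an ensemble."""
    trajectories = np.array([simulate(p, t) for p in param_ensemble])
    median = np.median(trajectories, axis=0)
    return median, trajectories.min(axis=0), trajectories.max(axis=0)

t = np.array([0.0, 1.0])
# Hypothetical parameter vectors (x0, k) found during calibration.
param_ensemble = [(1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]
median, lower, upper = ensemble_prediction(param_ensemble, t)
```

The median trajectory serves as the prediction, and the spread between `lower` and `upper` gives a rough picture of how parameter uncertainty propagates to the output.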





□ MerCat2: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from omics data

>> https://www.biorxiv.org/content/10.1101/2022.11.22.517562v1

MerCat2 (“Mer - Catenate2”) allows for direct analysis of data properties in a database-independent manner that initializes all data, which other profilers and assembly-based methods cannot perform.

For massively parallel processing (MPP) and scaling, MerCat2 uses a byte-chunking algorithm to split files for use with Ray, an open-source parallel computing framework.
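
The combination of k-mer counting and chunking can be sketched as follows. This is not MerCat2's code and it leaves out Ray entirely; it only shows, with hypothetical function names, that a sequence can be split into chunks overlapping by k − 1 characters so that per-chunk counts add up to the whole-sequence counts, plus a Shannon index as one example of a diversity estimate.

```python
from collections import Counter
import math

def count_kmers(seq, k):
    """Count every k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def count_kmers_chunked(seq, k, chunk_size):
    """Count k-mers chunk by chunk; chunks overlap by k - 1 characters
    so k-mers spanning a boundary are counted exactly once."""
    total = Counter()
    for start in range(0, len(seq), chunk_size):
        chunk = seq[start:start + chunk_size + k - 1]
        total.update(count_kmers(chunk, k))   # each chunk could run on its own worker
    return total

def shannon_diversity(counts):
    """Shannon index (natural log) of a k-mer count table."""
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

seq = "ACGTACGTAC"                            # hypothetical toy sequence
counts = count_kmers_chunked(seq, k=3, chunk_size=4)
```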




□ k2v: A Containerized Workflow for Creating VCF Files from Kintelligence Targeted Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517402v1

k2v, a containerized workflow for creating standard specification-compliant variant call format (VCF) files from the custom output data produced by the Kintelligence Universal Analysis Software.

k2v enables the rapid conversion of Kintelligence variant data. VCF files produced with k2v enable the use of many pre-existing, widely used, community-developed tools for manipulating and analyzing genetic data in the standard VCF format.







Amsterdam

2022-11-21 20:35:42 | Film

□ 『Amsterdam』

>> https://www.20thcenturystudios.com/movies/amsterdam

Directed by David O. Russell
Music by Daniel Pemberton

Cast
Christian Bale
Margot Robbie
John David Washington




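An offbeat suspense film that, against a backdrop of wartime historical events, comically depicts the friendship and bond of a trio of men and a woman drawn into a conspiracy. Just getting to see the lavish cast and visuals is satisfying enough. Whether you find humor in its twisted framing is the watershed for how you rate it. I cried at how Taylor Swift was treated.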






the MENU.

2022-11-21 19:47:41 | Film


□ 『the MENU』

>> https://www.searchlightpictures.com/the-menu/

Directed by Mark Mylod
Music by Colin Stetson

Cast
Ralph Fiennes
Anya Taylor-Joy
Nicholas Hoult

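It comes very close to a single-location thriller, but the right way to watch it is as a black comedy. The madness and charisma of a chef who dominates the "guests" on both sides of the screen. The complementarity of the act of criticism. The dynamics of value and control.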


Colin Stetson's score, dignified yet avant-garde, is horrifyingly beautiful.


□ Colin Stetson - All Aboard | The Menu (Original Motion Picture Soundtrack)





Where the Crawdads Sing.

2022-11-19 22:01:50 | Film


□ 『Where the Crawdads Sing(ザリガニの鳴くところ)』

>> https://www.sonypictures.com/movies/wherethecrawdadssing

Directed by Olivia Newman
Delia Owens (based upon the novel by)
Lucy Alibar (screenplay by)

Music by Mychael Danna
Song “Carolina” by Taylor Swift

Cast
Daisy Edgar-Jones
Taylor John Smith
Harris Dickinson


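A suspense film in which the shadows cast by natural light deep in the marsh, and the depiction of its ecosystem, are simply and overwhelmingly beautiful.
"The swamp does not make death a tragedy, nor does it make it a sin."

The "Marsh Girl" may be a projection of the original author, who was a zoologist.
It is the story of half a lifetime of a girl who, simply in order to live, had to be beautiful, to camouflage herself, and to be strong,
and that was the very essence of nature.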



□ Taylor Swift - Carolina (From The Motion Picture “Where The Crawdads Sing” / Lyric Video)







ANDOR.

2022-11-19 13:40:25 | Drama

□ “STAR WARS: Andor”

>> https://disneyplus.com/series/star-wars-andor/

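Episode 11, “Daughter of Ferrix”: keeping its meticulous plotting and weighty dramatic direction intact, the story finally reaches its climax. At this point we get one of the series' few starship dogfight scenes. It is a brief sequence of less than two minutes, but it is no exaggeration to call it one of the most thrilling and surprising battle scenes in Star Wars history.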






















すずめの戸締り

2022-11-19 13:39:42 | Film

□ 『すずめの戸締り』

>> https://suzume-tojimari-movie.jp/

Directed by Makoto Shinkai

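『すずめの戸締り』: disaster and everyday life, meeting and parting, past and future, the journey out and the way home, 1 and 0… In the space between, what is it that divides two sides that ought to be even? Was it the novelist Dan Brown who wrote, "Interesting things happen in doorways. They are the boundaries of things"? If the mind is an illusion, then this world too is an illusion, and illusions mirror the mind.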