List of blog posts from June 2022 - lens, align.

Cataract.

2022-06-06 06:06:06 | Science News

(Artwork by Joey Camacho)

□ HyperHMM: Efficient inference of evolutionary and progressive dynamics on hypercubic transition graphs

>> https://www.biorxiv.org/content/10.1101/2022.05.09.491130v1.full.pdf

Hypercubic transition path sampling (HyperTraPS) uses biased random walkers to estimate this likelihood, which is then embedded in a Bayesian framework using Markov chain Monte Carlo for parameter estimation.

HyperHMM is an adapted Baum-Welch (expectation maximisation) algorithm for inferring dynamic pathways on hypercubic transition graphs, and it can be combined with resampling to quantify uncertainty.

□ Ultima Genomics RT

>> https://www.ultimagenomics.com

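Ultima Genomics, a potential third force in the market, will release a new sequencing platform in 2023. It has already raised $600 million in stealth mode. Using a fluorescence flow-based approach, it achieves data generation at $1/Gb. It has also partnered with Sentieon and DeepVariant to implement high-accuracy variant calling.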

Today Ultima Genomics emerged from stealth mode with a new high-throughput, low-cost sequencing platform that delivers the $100 genome. Ultima’s goal is to unleash a new era in genomics-driven research and healthcare, and it has secured approximately $600 million in backing from leading investors who share this vision.

□ Joseph Replogle

$1/Gb? I had a great experience collaborating w/ Ultima genomics to sequence genome-scale Perturb-seq libraries on their new open fluidics sequencing platform: biorxiv.org/content/10.110… (see Figure S13 for comparison)

>> https://www.biorxiv.org/content/10.1101/2021.12.16.473013v3

□ Albert Viella

Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform biorxiv.org/content/10.110… #UltimaGenomics

>> https://www.biorxiv.org/content/10.1101/2022.05.29.493900v1

□ SUBATOMIC: a SUbgraph BAsed mulTi-OMIcs Clustering framework to analyze integrated multi-edge networks

>> https://www.biorxiv.org/content/10.1101/2022.06.01.494279v1.full.pdf

SUBATOMIC, a SUbgraph BAsed mulTi-Omics Clustering framework to construct and analyze multi-edge networks. SUBATOMIC investigates statistically the connections in between modules as well as between modules and regulators such as miRNAs and transcription factors.

SUBATOMIC integrates all networks into one multi-edge network and decomposes it into two- and three-node subgraphs using ISMAGS. The resulting subgraphs are further categorized according to their type and clustered into modules using the hyperedge clustering algorithm SCHype.

□ AEON: Exploring attractor bifurcations in Boolean networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04708-9

AEON is a computational framework employing advanced symbolic graph algorithms that enable the analysis of large networks with hundreds of Boolean variables. It fully implements a comprehensive methodology for automated attractor bifurcation analysis of parametrised BNs.

AEON computes the attractors for all valid parametrisations and assigns each parametrisation its behaviour class. This bifurcation function can be displayed as a simple table, from which users can obtain witness instantiations for each behaviour class and inspect their attractor state space.

□ Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492399v1.full.pdf

The authors give a formalisation of arc-centric bidirected de Bruijn graphs and prove that it accurately models the k-mer spectrum. The algorithm constructs the de Bruijn graph in time linear in the length of the input strings, then uses an Eulerian-cycle-based algorithm to compute the minimum representation.

Computing a Hamiltonian cycle in a de Bruijn graph takes polynomial time. De Bruijn graphs are a subclass of adjoint graphs, in which solving the Hamiltonian cycle problem is equivalent to solving the Eulerian cycle problem in the original graph, which can be computed in linear time.
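
The linear-time claim rests on the classic Eulerian-cycle computation. Below is a minimal Python sketch using Hierholzer's algorithm on a plain directed de Bruijn graph. It ignores reverse complements and the bidirected, arc-centric formulation of the paper, and the function names (`debruijn_edges`, `eulerian_cycle`, `spell`) are illustrative assumptions, not the Eulertigs implementation.

```python
from collections import defaultdict


def debruijn_edges(kmers):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    return [(kmer[:-1], kmer[1:]) for kmer in kmers]


def eulerian_cycle(edges):
    """Hierholzer's algorithm for a directed graph.

    Assumes the graph is connected and every node has equal in- and
    out-degree, so that an Eulerian cycle exists.
    """
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)

    stack = [edges[0][0]]
    cycle = []
    while stack:
        node = stack[-1]
        if adjacency[node]:
            # Follow an unused edge out of the current node.
            stack.append(adjacency[node].pop())
        else:
            # Dead end: this node is finished, emit it.
            cycle.append(stack.pop())
    return cycle[::-1]


def spell(cycle):
    """Spell the string obtained by walking the node sequence."""
    return cycle[0] + "".join(node[-1] for node in cycle[1:])


kmers = ["ACG", "CGT", "GTA", "TAC"]
cycle = eulerian_cycle(debruijn_edges(kmers))
text = spell(cycle)
```

Each edge is pushed and popped exactly once, so the running time is linear in the number of k-mers; the spelled string contains every input k-mer as a substring.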

□ N-ACT: An Interpretable Deep Learning Model for Automatic Cell Type and Salient Gene Identification

>> https://www.biorxiv.org/content/10.1101/2022.05.12.491682v1.full.pdf

N-ACT (Neural-Attention for Cell Type identification) accurately predicts preliminary annotations with no prior knowledge about the system, providing a valuable complementary framework to experimental studies and computational pipelines.

N-ACT learns complex mappings; its outputs are non-linearly “activated” through a Point-Wise Feed Forward Neural Network. N-ACT consists of flexible stages that can be modified for different objectives. N-ACT minimizes a cross entropy loss using the Adam gradient-based optimizer.

□ CReSIL: Accurate Identification of Extrachromosomal Circular DNA from Long-read Sequences.

>> https://www.biorxiv.org/content/10.1101/2022.05.13.491700v1.full.pdf

CReSIL (Construction-based Rolling-circle amplification for eccDNA Sequence Identification and Location) constructs directed graphs with the information of regions, terminals, and strands; an individual region contains 4 nodes and multiple edges derived from linkages.

CReSIL relies on the alignment of reads to the reference genome, which enables construction of linkages among regions. CReSIL generates consensus sequences and variants of eccDNA, and assesses potential phenotypic effects of eccDNA when variations on the chromosomes are generated.

□ scMinerva: a GCN-featured Interpretable Framework for Single-cell Multi-omics Integration with Random Walk on Heterogeneous Graph

>> https://www.biorxiv.org/content/10.1101/2022.05.28.493838v1.full.pdf

scMinerva, an unsupervised Single-Cell Multi-omics INtegration method with GCN on hEterogeneous graph utilizing RandomWAlk, that can adapt to any number of omics with efficient computational consumption.

Considering the structure and biological insight of this multi-omics integration problem, to learn the cell property on top of multi-omics information and the cell neighbors, they accordingly design the model on a new random walk strategy.

scMinerva processes any number of omics and has explicit probabilistic interpretability. It includes a Graph Convolutional Network (GCN), which considers the spatial information of nodes and makes the method strongly robust to noise.

□ scDeepC3: scRNA-seq Deep Clustering by A Skip AutoEncoder Network with Clustering Consistency

>> https://www.biorxiv.org/content/10.1101/2022.06.05.494891v1.full.pdf

scDeepC3, a novel deep clustering model containing an AutoEncoder with adaptive shortcut connection and using deep clustering loss with consistency constraint for clustering analysis of scRNA-seq data.

scDeepC3 can effectively extract embedded representations of the high-dimensional input, optimized for clustering, through a nonlinear mapping. The optimal mapping function can be efficiently computed by the Hungarian algorithm.
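
As an illustration of the Hungarian step, here is a minimal sketch that assumes the "optimal mapping" is the usual one-to-one matching between cluster IDs and reference labels used when scoring clustering accuracy. That reading is an assumption, as is the helper name `clustering_accuracy`; the only library call used is SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(true_labels, cluster_ids):
    """Best one-to-one cluster-to-label matching and the resulting accuracy."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    size = int(max(true_labels.max(), cluster_ids.max())) + 1

    # contingency[c, t] = number of cells in cluster c carrying label t
    contingency = np.zeros((size, size), dtype=int)
    for cluster, label in zip(cluster_ids, true_labels):
        contingency[cluster, label] += 1

    # Hungarian algorithm: pick the matching that maximises the agreements.
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    mapping = dict(zip(rows.tolist(), cols.tolist()))
    accuracy = contingency[rows, cols].sum() / len(true_labels)
    return accuracy, mapping


# Hypothetical toy labels: clusters 0 and 1 are swapped relative to the truth.
acc, mapping = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
```

Because cluster IDs are arbitrary, a perfect clustering with permuted IDs still scores an accuracy of 1 after matching.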

□ MARGARET: Inference of cell state transitions and cell fate plasticity from single-cell

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac412/6593121

MARGARET employs a novel measure of connectivity to assess connectivity between the inferred clusters in the first step and constructs a cluster-level undirected graph to represent a trajectory topology.

MARGARET constructs a kNN graph between all cells and prunes it with reference to the undirected graph computed previously. MARGARET assigns a pseudotime to each cell in the pruned kNN graph denoting the position of this cell in the underlying trajectory.

□ RISER: real-time in silico enrichment of RNA species from nanopore signals

>> http://nanoporetech.com/resource-centre/video/lc22/riser-real-time-in-silico-enrichment-of-rna-species-from-nanopore-signals

RISER, the first method for realtime in silico enrichment of RNA species during direct RNA sequencing (DRS). RISER accurately classifies protein-coding from non-coding species directly from four seconds of raw DRS signal.

RISER has been integrated with the Read Until API to enact real-time sequencing decisions that allow enrichment of mRNAs or non-coding RNAs, as well as real-time tagging of reads with RNA species.

□ Last-train: Finding rearrangements in nanopore DNA reads with LAST and dnarrange

>> https://www.biorxiv.org/content/10.1101/2022.05.30.494079v1.full.pdf

The LAST and dnarrange software packages can resolve complex relationships between DNA sequences, and characterize changes such as gene conversion, processed pseudogene insertion, and chromosome shattering.

Last-train learns the rates (probabilities) of deletions, insertions, and each kind of base match and mismatch. These probabilities are then used to find the most likely sequence relationships/alignments, which is especially useful for DNA with unusual rates.

□ inClust: a general framework for clustering that integrates data from multiple sources

>> https://www.biorxiv.org/content/10.1101/2022.05.27.493706v1.full.pdf

inClust provides a general and flexible framework, which can be applied to various tasks with different modes. inClust performs information integration and clustering jointly, and it can use labeling information from the data as regulation information.

inClust encodes scRNA-seq data and batch information (or other covariates and auxiliary information) into latent space separately, so the influence of the batch and other covariates is explicitly eliminated by vector arithmetic in latent space.

□ PEPSDI: Scalable and flexible inference framework for stochastic dynamic single-cell models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010082

PEPSDI (Particles Engine for Population Stochastic DynamIcs), a flexible modelling framework which infers unknown model parameters from dynamic data for single-cell dynamic models that account for both intrinsic and extrinsic noise.

For the Ornstein-Uhlenbeck stochastic differential equation model, the likelihood approximation has a small variance and exact Bayesian inference is possible because the likelihood can be exactly calculated using the Kalman filter.

PEPSDI's modularity facilitates modelling of intrinsic noise by the SSA, Extrande, tau-leaping or Langevin stochastic simulators. New particle filters for the pseudo-marginal modules can be added; some, like the one used for the Schlögl model, are particularly statistically efficient.

□ NanoSplicer: Accurate identification of splice junctions using Oxford Nanopore sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac359/6594111

NanoSplicer utilises the raw output from nanopore sequencing. Instead of identifying splice junctions by mapping basecalled reads, NanoSplicer compares the squiggle from a read with the predicted squiggles of potential splice junctions to identify the best match and likely junction.

NanoSplicer adapts Dynamic Time Warping to align the two squiggles. NanoSplicer identifies all possible canonical splice junctions within 10 bases. The NanoSplicer model provides assignment probabilities for each candidate by quantifying the squiggle similarity of each alignment.
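
To make the squiggle comparison concrete, the sketch below implements textbook Dynamic Time Warping and picks the candidate with the smallest alignment cost. It is not NanoSplicer's adapted variant or its probability model; the signal values and the candidate names `junction_A` and `junction_B` are made up for illustration.

```python
import numpy as np


def dtw_distance(signal_a, signal_b):
    """Classic DTW cost between two 1D signals (absolute-difference local cost)."""
    n, m = len(signal_a), len(signal_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(signal_a[i - 1] - signal_b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]


# Hypothetical observed squiggle and predicted squiggles for two candidates.
observed = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0]
candidates = {
    "junction_A": [1.0, 2.0, 3.0, 2.0],
    "junction_B": [3.0, 2.0, 1.0, 1.0],
}

scores = {name: dtw_distance(observed, squiggle) for name, squiggle in candidates.items()}
best = min(scores, key=scores.get)
```

DTW tolerates the variable dwell time of each base in the pore: repeated samples in `observed` are absorbed by the warping path, so `junction_A` aligns at zero cost even though the two signals differ in length.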

□ scMoMaT: Mosaic integration of single cell multi-omics matrices using matrix trifactorization

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492336v1.full.pdf

scMoMaT (single cell Multi-omics integration using Matrix Trifactorization) makes it possible to uncover the cell type specific bio-markers at the same time when learning a unified cell representation. Moreover, scMoMaT can integrate cell batches with unequal cell type composition.

scMoMaT uses a matrix tri-factorization framework, which treats each single cell data matrix as a relationship matrix between the cell and feature entity. It factorizes a data matrix into batch-specific cell factor, feature factor, and a factor association matrix.

□ sshash: On Weighted K-Mer Dictionaries

>> https://www.biorxiv.org/content/10.1101/2022.05.23.493024v1.full.pdf

SSHash, a sparse and skew hashing scheme for k-mers – a compressed dictionary that relies on k-mer minimizers and minimal perfect hashing in both random and streaming query modality in succinct space.

The SSHash data structure is enriched with weight information: by exploiting the order of the k-mers represented in SSHash, the compressed exact weights take only a small extra space on top of the space of SSHash.

This extra space is proportional to the number of runs (maximal sub-sequences formed by all equal symbols) in the weights and not proportional to the number of distinct k-mers. The weights are represented in a much smaller space than the empirical entropy lower bound.
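
A small sketch of why the cost scales with the number of runs: run-length encoding stores one (weight, length) pair per run, however many k-mers the run covers. This only illustrates the idea; the actual compressed layout used by SSHash is not reproduced here, and the weight values are invented.

```python
from itertools import groupby


def run_length_encode(weights):
    """Collapse each maximal run of equal weights into a (weight, length) pair."""
    return [(weight, sum(1 for _ in group)) for weight, group in groupby(weights)]


def run_length_decode(runs):
    """Expand (weight, length) pairs back into the full weight sequence."""
    return [weight for weight, length in runs for _ in range(length)]


# Hypothetical weights of consecutive k-mers, in the order SSHash stores them.
weights = [1, 1, 1, 1, 5, 5, 2, 2, 2, 1]
runs = run_length_encode(weights)
```

Ten weights collapse to four runs here. Consecutive k-mers in a sequence tend to share the same abundance, which is why the order of the k-mers matters.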

□ Lossless indexing with counting de Bruijn graphs

>> https://genome.cshlp.org/content/early/2022/05/23/gr.276607.122.abstract

Counting de Bruijn graphs (Counting DBGs), a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes.

Counting DBGs index k-mer abundances from 2,652 human RNA-seq samples in an over 8-fold smaller representation that is nonetheless faster. For the full RefSeq collection, Counting DBGs generate a lossless and fully queryable index that is 4.6-fold smaller than the corresponding MegaBLAST index.

□ Sentieon DNAscope LongRead - A highly Accurate, Fast, and Efficient Pipeline for Germline Variant Calling from PacBio HiFi reads

>> https://www.biorxiv.org/content/10.1101/2022.06.01.494452v1.full.pdf

The core variant calling pipeline calls DNAscope across the phased or unphased regions of the genome and uses DNAModelApply to perform model-informed variant genotyping. Small Python scripts are used for VCF manipulation.

DNAscope LongRead is computationally efficient, calling variants from 30x HiFi samples in under 4 hours on a 16-core machine (120 virtual core-hours) with precision and recall on the most recent GIAB benchmark dataset exceeding 99.83% for HiFi samples sequenced at 30x coverage.

□ DSINMF: Detecting cell type from single cell RNA sequencing based on deep bi-stochastic graph regularized matrix factorization

>> https://www.biorxiv.org/content/10.1101/2022.05.16.492212v1.full.pdf

Sparsity is a significant characteristic of single-cell data; in other words, scRNA-seq data have a large number of zero entries. It also restricts the application of clustering methods in single-cell data analysis.

DSINMF reduces redundant features. The structure of multi-layer matrix factorization is utilized to extract the deep hidden features which can obtain the features in different layers. The deep matrix factorization with bi-stochastic graph regularization is employed for clustering.

□ DeepPHiC: Predicting promoter-centered chromatin interactions using a novel deep learning approach

>> https://www.biorxiv.org/content/10.1101/2022.05.24.493333v1.full.pdf

DeepPHiC, a supervised multi-modal deep learning model, which utilizes a comprehensive set of features including genomic sequence, epigenetic signals and anchor distance to predict tissue/cell type-specific genome-wide promoter-enhancer and promoter-promoter interactions.

DeepPHiC utilizes a comprehensive set of informative features, ranging from genomic sequence, epigenetic signal in the anchors and anchor distance. DeepPHiC adopts a ResNet-style structure with skip connections, wherein previous layers are connected to all subsequent layers.

□ Sequence UNET: High-throughput deep learning variant effect prediction

>> https://www.biorxiv.org/content/10.1101/2022.05.23.493038v1.full.pdf

Sequence UNET is a highly scalable variant effect predictor (VEP) that uses a fully Convolutional Neural Network architecture to achieve computational efficiency and independence from length. Convolutional kernels also naturally integrate information from nearby amino acids.

Sequence UNET optimises performance for position specific scoring matrix (PSSM) prediction using a softmax output layer and Kullback-Leibler divergence loss, and for variant frequency classification using a sigmoid output and binary cross entropy.
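
The PSSM objective can be sketched in a few lines of NumPy: a softmax turns per-position logits into a distribution over the 20 amino acids, and the Kullback-Leibler divergence measures how far it is from the target column. The direction KL(target || predicted), the clipping constant `eps`, and the random toy data are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def kl_divergence(target, predicted, eps=1e-12):
    """Per-position KL(target || predicted); eps guards against log(0)."""
    t = np.clip(target, eps, 1.0)
    p = np.clip(predicted, eps, 1.0)
    return np.sum(t * np.log(t / p), axis=-1)


rng = np.random.default_rng(0)
# Toy data: 5 sequence positions x 20 amino acids.
target_pssm = softmax(rng.normal(size=(5, 20)))
predicted_pssm = softmax(rng.normal(size=(5, 20)))

loss = kl_divergence(target_pssm, predicted_pssm).mean()
```

The loss is zero only when the predicted distribution matches the target at every position.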

□ CSCD: More accurate estimation of cell composition in bulk expression through robust integration of single-cell information

>> https://www.biorxiv.org/content/10.1101/2022.05.13.491858v1.full.pdf

Many computational tools have been developed and reported in the literature. However, they fail to appropriately incorporate the covariance structures in both scRNA-seq and bulk RNA-seq datasets in use.

CSCD is a covariance-based single-cell decomposition method that estimates cell-type proportions in bulk data by building a reference expression profile from single-cell data and learning gene-specific bulk expression transformations using a constrained linear inverse model.

□ isopret: An algorithmic framework for isoform-specific functional analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.13.491897v1.full.pdf

isopret, a new paradigm for isoform function prediction based on the expectation-maximization framework. isopret leverages the relationships between sequence and functional isoform similarity to infer isoform specific functions in a highly accurate fashion.

isopret predicts isoform annotations w/o using isoform-specific labels, learns directly from isoform sequences w/o using gene elements, and assigns GO to isoforms through a global optimization algorithm, thus avoiding inconsistencies due to local isoform-by-isoform predictions.

□ MAGCNSE: predicting lncRNA-disease associations using multi-view attention graph convolutional network and stacking ensemble model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04715-w

For disease views, MAGCNSE uses disease semantic similarity (DSS) and disease Gaussian interaction profile kernel similarity (DGS). For lncRNA views, it uses lncRNA functional similarity, lncRNA sequence similarity and lncRNA Gaussian interaction profile kernel similarity.
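
The Gaussian interaction profile (GIP) kernel used for both views can be sketched as follows, using the commonly cited definition in which the bandwidth is normalised by the mean squared norm of the profiles. The helper name `gip_kernel`, the parameter `gamma_prime`, and the toy association matrix are assumptions; MAGCNSE's exact settings are not given in the text above.

```python
import numpy as np


def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel: K[i, j] = exp(-gamma * ||profile_i - profile_j||^2).

    gamma is gamma_prime divided by the mean squared norm of the profiles.
    """
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = gamma_prime / sq_norms.mean()
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


# Hypothetical binary association matrix: 3 lncRNAs x 4 diseases.
associations = np.array(
    [[1, 0, 1, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1]],
    dtype=float,
)

lncrna_similarity = gip_kernel(associations)     # rows are lncRNA profiles
disease_similarity = gip_kernel(associations.T)  # rows are disease profiles
```

Two lncRNAs associated with exactly the same diseases get similarity 1, and similarity decays as their interaction profiles diverge.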

MAGCNSE then concatenates the representations of lncRNAs and diseases according to the lncRNA-disease association matrix. MAGCNSE employs a stacking ensemble classifier, consisting of multiple traditional machine learning classifiers, to make the final prediction.

□ Bioteque: Integrating and formatting biomedical data in the Bioteque, a comprehensive repository of pre-calculated knowledge graph embeddings

>> https://www.biorxiv.org/content/10.1101/2022.05.11.491490v1.full.pdf

Bioteque, a resource of unprecedented size and scope that contains pre-calculated biomedical embeddings derived from a gigantic knowledge graph, displaying more than 450 thousand biological entities and 30 million relationships.

Bioteque descriptors can be easily recycled as node features, transferring the learning encoded from orthogonal biomedical datasets to more complex, attribute-aware models. The Bioteque provides information on the specific sources used to construct each metapath.

□ OMEN: Network-based Driver Gene Identification using Mutual Exclusivity

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac312/6585332

Propagation-based methods in contrast allow recovering rare driver genes, but the interplay between network topology and high-scoring nodes often results in spurious predictions.

OMEN is a logic programming framework based on random walk semantics. OMEN presents a number of novel concepts. In particular, its design is unique in that it presents an effective approach to combine both gene-specific driver properties and gene-set properties.

□ FastIntegration: a fast and high-capacity version of Seurat Integration for large-scale integration of single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.05.10.491296v1.full.pdf

FastIntegration can integrate large single-cell RNA-seq datasets and output batch-corrected gene expression. It can integrate 4 million cells in 48 hours of runtime thanks to good multicore scaling.

Seurat computes a fixed number of kNN to construct the anchor weight matrix, while FastIntegration fits a Gaussian distribution. FastIntegration removes outlier GE values and keeps the data sparse, avoiding the problem of long vectors being unsupported in large sparse matrices.

□ DeSP: a systematic DNA storage error simulation pipeline

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04723-w

DeSP, a systematic DNA storage error Simulation Pipeline, which simulates the errors generated from all DNA storage stages and systematically guides the optimization of encoding redundancy.

DeSP covers both sequence loss and within-sequence errors in the particular context of the data storage channel. A systematic model covering all key stages of the storage process is needed to reveal how errors are generated and propagated to form the final sequencing results.

□ INSISTC: Incorporating Network Structure Information for Single-Cell Type Classification

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492304v1.full.pdf

INSISTC utilizes the SIOMICS approach to generate a GRN with its TF-target relationships identified through de novo DNA regulatory motif discovery. SIOMICS is capable of considering both TFs and their cofactors for motif prediction.

INSISTC adopts a random-walk-based graph algorithm to represent the GRN structural information. INSISTC incorporates genes and GRN structural information by creating a Latent Dirichlet Allocation (LDA)-based topic model.

□ scGAD: single-cell gene associating domain scores for exploratory analysis of scHi-C data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac372/6598798

scGAD enables summarization at the gene unit while accounting for inherent gene-level genomic biases. Low-dimensional projections with scGAD capture clustering of cells based on their 3D structures.

scGAD facilitates the integration of scHi-C data with other single-cell data modalities by enabling its projection onto reference low-dimensional embeddings. scGAD facilitated an accurate projection of cells onto this larger space.

□ Quantization of algebraic invariants through Topological Quantum Field Theories

>> https://arxiv.org/pdf/2206.00709v1.pdf

Providing necessary conditions for quantizability based on Euler characteristics and, in the case of surfaces, also sufficient conditions in terms of almost-TQFTs and almost-Frobenius algebras.

The E-polynomial of G-representation varieties is not a quantizable invariant by means of a monoidal TQFT, for any algebraic group G of positive dimension.

Raven.

2022-06-06 06:03:06 | Science News

□ sc-PHENIX: Diffusion on PCA-UMAP manifold captures a well-balance of local, global, and continuum structures to denoise single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.06.09.495525v1.full.pdf

sc-PHENIX (single cell-PHEnotype recovery by Non-linear Imputation of gene eXpression) uses PCA-UMAP initialization to reveal new insights into the recovered gene expression that are masked by diffusion on PCA space.

sc-PHENIX captures a continuum structure of the data. It uses the adaptive kernel to generate a non-symmetric affinity matrix, which is symmetrized and then normalized to generate the Markov transition matrix.
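
The affinity-to-Markov step can be sketched in NumPy. Several details are assumptions chosen for illustration: the adaptive bandwidth is taken as the distance to the k-th nearest neighbour, symmetrisation is done as A + Aᵀ, and the same toy matrix `X` is used both for distances and for diffusion (sc-PHENIX computes distances on the PCA-UMAP embedding instead).

```python
import numpy as np


def adaptive_affinity(X, k=2):
    """Gaussian kernel with a per-cell bandwidth, hence non-symmetric."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Column 0 of the sorted distances is the cell itself, so column k
    # holds the distance to the k-th nearest neighbour.
    sigma = np.sort(dists, axis=1)[:, k]
    return np.exp(-(dists / sigma[:, None]) ** 2)


def markov_from_affinity(affinity):
    """Symmetrise the affinity matrix, then row-normalise it."""
    symmetric = affinity + affinity.T
    return symmetric / symmetric.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # toy data: 6 cells x 3 features

P = markov_from_affinity(adaptive_affinity(X))
# Diffuse for t = 3 steps: each cell becomes a weighted average of its neighbours.
imputed = np.linalg.matrix_power(P, 3) @ X
```

Each row of `P` sums to 1, so multiplying by a power of `P` smooths every cell's profile over its neighbourhood, which is the denoising effect that diffusion-based imputation relies on.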

□ lvm-DE: An Empirical Bayes Method for Differential Expression Analysis of Single Cells with Deep Generative Models

>> https://www.biorxiv.org/content/10.1101/2022.05.27.493625v1.full.pdf

lvm-DE, a general Bayesian framework for detecting differential expression derived from first principles. lvm-DE takes as input a fitted deep generative model of scRNA-seq data, a pair of cell groups and a target α.

lvm-DE provides as output estimates of the log fold change for every gene, as well as a list of DE genes. The Bayesian hypothesis formulation of differential expression uses a composite alternative, built from the log fold change to avoid detecting lowly expressed genes.

The lvm-DE framework applies two deep generative models, scVI and scSphere. As lvm-DE outlines a generic procedure to conduct DE for latent variable models, improving the LVM of choice can be a direction to improve the quality of the predictions.

□ scPrisma: inference, filtering and enhancement of periodic signals in single-cell data using spectral template matching

>> https://www.biorxiv.org/content/10.1101/2022.06.07.493867v1.full.pdf

scPrisma is a generalized spectral framework for the reconstruction, enhancement, and filtering of cyclic signals, as well as inference of informative cyclic genes; it is further extended to linear signals.

scPrisma enables reconstruction, gene inference, filtering, and enhancement of the underlying cyclic or linear signals, w/o low-dimensional embedding, which renders the results useful for diverse types of downstream analyses. The algorithm does not overfit to a circular topology.

□ SiaNN: Single-cell Multi-omics Integration for Unpaired Data by a Siamese Network with Graph-based Contrastive Loss

>> https://www.biorxiv.org/content/10.1101/2022.06.07.495170v1.full.pdf

SiaNN, a variation of the Siamese neural network framework which is trained to integrate multi-omics data on the single-cell resolution by utilizing graph-based contrastive loss.

SiaNN ranks among the top methods when compared with existing algorithms on silhouette score, FOSCTTM score, and label transfer accuracy. The model can distinguish batch variation from actual biological variation and generate a better co-embedding space while mixing batches well.

SiaNN receives simultaneously one cell from modality 1 (e.g., scRNA-seq) and another from modality 2 (e.g., scATAC-seq) as the inputs and projects them into a shared embedding space using the encoder.

□ DeepLinc: De novo reconstruction of cell interaction landscapes from single-cell spatial transcriptome data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02692-0

DeepLinc (deep learning framework for Landscapes of Interacting Cells) is based on a deep generative model of variational graph autoencoder (VGAE) to integrate and learn from the two dimensions of information (cell interactions / GE profiles) during the encoding phase.

The main task of DeepLinc is to learn from the subset of cell-cell interactions, extract the underlying features of single-cell transcriptome profiles, and regenerate a complete landscape of cell-cell interactions, which would include both proximal and distal interactions.

□ Linearization Autoencoder: an autoencoder-based regression model with latent space linearization

>> https://www.biorxiv.org/content/10.1101/2022.06.06.494917v1.full.pdf

Latent space disentanglement methods try to connect features in the latent space to observable features in high-dimensional space to improve latent space interpretability.

Linearization Autoencoder can project data to a low-dimensional space while considering the linear relations of the value. It is based on an autoencoder that combines an encoder and a decoder, each consisting of several fully-connected hidden layers.

□ HMMerge: an Ensemble Method for Improving Multiple Sequence Alignment

>> https://www.biorxiv.org/content/10.1101/2022.05.29.493880v1.full.pdf

HMMerge, a new approach for adding sequences into backbone alignments. HMMerge builds on the techniques in UPP, in that it builds an ensemble of Hidden Markov Models (HMMs) for the backbone alignment.

HMMerge combines the information from all the HMMs in the ensemble, constructed from the backbone alignment, to align each query sequence using the Viterbi algorithm.

□ Frame-Shift-Detector: A Statistical Detector for Ribosomal Frameshifts and Dual Encodings based on Ribosome Profiling

>> https://www.biorxiv.org/content/10.1101/2022.06.06.495024v1.full.pdf

The intent of this method is to discover ribosomal frameshifts, but it will actually discover regions which are read by the ribosome in two (or three) reading frames for any reason.

A gene might be read in another reading frame because there is an alternative Start codon either upstream or downstream of the annotated Start codon, and in a different reading frame.

□ scReadSim: a single-cell multi-omics read simulator

>> https://www.biorxiv.org/content/10.1101/2022.05.29.493924v1.full.pdf

scReadSim counts the number of reads overlapped within each feature for every cell to construct the feature by cell barcode matrix. scReadSim creates two count matrices regarding foreground and background features and treats two matrices separately in the simulation procedure.

scReadSim constructs two surjective mappings from the real feature space to the user-defined feature space based on the features’ length similarity.

scReadSim generates sequencing reads instead of a count matrix. scReadSim defines the mappings separately for foreground and background features, which means that a real foreground feature can only map into a user-input foreground feature.

□ Sockeye: nanopore-only demultiplexing of single-cell reads

>> http://nanoporetech.com/resource-centre/video/lc22/sockeye-nanopore-only-demultiplexing-of-single-cell-reads

Several tools have been developed to analyse nanopore-sequenced 10x transcriptome libraries; however, they currently assume access to paired short-read data.

Sockeye is a research Snakemake pipeline designed to identify the cell barcode and UMI sequences present in nanopore sequencing reads generated from single-cell gene expression libraries.

□ scverse: Foundational tools for omics data in the life sciences

>> https://scverse.org/

scverse strives for synergy and interoperability with the ecosystem of packages built around these core tools, to ultimately provide users with a cutting-edge and varied selection of analysis methods.

scverse adopts the key data structures for single-cell data, AnnData for uni-modal data and MuData for multi-modal data, together w/ Scanpy, muon for multimodal analysis, scvi-tools for deep probabilistic analysis, scirpy for T-cell receptor analysis, and squidpy for spatial omics.

□ BSDE: barycenter single-cell differential expression for case-control studies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac171/6554192

BSDE aggregates case/control distributions by finding their respective Wasserstein barycenters. Then, the Wasserstein distance of the two group-level distributions is compared to permutation counterparts for testing significance.

The barycenter minimizes the total cost of 'moving' distributions to the averaged distribution. BSDE is computationally affordable thanks to recent developments of fast algorithms for entropy-regularized optimal transport.
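
The barycenter-then-permute idea can be sketched for a single gene in one dimension, where the 2-Wasserstein barycenter has a closed form: average the sorted samples (the empirical quantile functions) across individuals. This is a simplification and not BSDE's entropy-regularized solver; the function names, the equal number of cells per individual, and the simulated data are all assumptions.

```python
import numpy as np


def barycenter_1d(samples):
    """2-Wasserstein barycenter of 1D samples: the average of quantile functions.

    samples has shape (n_individuals, n_cells); rows are sorted, then averaged.
    """
    return np.sort(samples, axis=1).mean(axis=0)


def wasserstein2_1d(u, v):
    """W2 distance between two equal-size 1D empirical distributions."""
    return np.sqrt(np.mean((np.sort(u) - np.sort(v)) ** 2))


def permutation_test(case, control, n_perm=500, seed=0):
    """Compare the observed barycenter distance with label-shuffled counterparts."""
    rng = np.random.default_rng(seed)
    observed = wasserstein2_1d(barycenter_1d(case), barycenter_1d(control))
    pooled = np.vstack([case, control])
    n_case = len(case)
    hits = 0
    for _ in range(n_perm):
        order = rng.permutation(len(pooled))  # shuffle individuals, not cells
        shuffled = wasserstein2_1d(
            barycenter_1d(pooled[order[:n_case]]),
            barycenter_1d(pooled[order[n_case:]]),
        )
        hits += int(shuffled >= observed)
    return observed, (hits + 1) / (n_perm + 1)


rng = np.random.default_rng(1)
case = rng.normal(loc=2.0, scale=1.0, size=(8, 50))     # 8 individuals x 50 cells
control = rng.normal(loc=0.0, scale=1.0, size=(8, 50))
distance, p_value = permutation_test(case, control)
```

Permuting whole individuals rather than single cells keeps the test at the case/control level, which is the point of aggregating each group into a barycenter first.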

□ scREG: Regulatory analysis of single cell multiome gene expression and chromatin accessibility data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02682-2

scREG, a dimension reduction methodology, based on the concept of cis-regulatory potential, for single cell multiome data. This concept is further used for the construction of subpopulation-specific cis-regulatory networks.

scREG performs cross-modality dimension reduction by data integration. It uses a non-negative matrix factorization (NMF)-based optimization model to reduce the dimension of multiome data with m1 genes and m2 peaks to a common K-dimensional matrix.

□ GLIDER: Function Prediction from GLIDE-based Neighborhoods

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac322/6586285

GLIDE combines a simple local score that captures relationships in the dense core, with a diffusion based embedding that encapsulates the network structure in the periphery, creating a quasi-kernel.

GLIDER uses a variant of GLIDE to create a new similarity network. GLIDER network has more functionally enriched local neighborhoods than the original network such that the application of a simple knn classifier produces a significantly improved function prediction performance.

□ Aryana-LoR: Alignment of Single-Molecule Sequencing Reads by Enhancing the Accuracy and Efficiency of Locality-Sensitive Hashing

>> https://www.biorxiv.org/content/10.1101/2022.05.15.491980v1.full.pdf

Aryana-LoR employs Locality-Sensitive Hashing (LSH) to align SMS reads to a reference genome, using two techniques that enhance both the accuracy and efficiency of the MinHash scheme for long and noisy reads.

A modified Smith-Waterman algorithm computes the alignment penalty for each pair of gaps, one in the reference and another in the read, between each two consecutive seeds in the maximal chain. Finally, it reports the least penalized alignment.
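
For readers unfamiliar with the MinHash scheme mentioned above, the following is a plain textbook MinHash over k-mers. It does not include the two enhancements Aryana-LoR adds; the hash construction (salted BLAKE2b) and parameter values are assumptions chosen for the example.

```python
import hashlib

def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq, k=5, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over the sequence's k-mers."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest(), "big")
            for kmer in kmers(seq, k)
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity of the k-mer sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

read = "ACGTACGTTAGCCGATAGCT"
same = estimate_jaccard(minhash_signature(read), minhash_signature(read))
different = estimate_jaccard(minhash_signature("AAAAAAAAAA"), minhash_signature("CCCCCCCCCC"))
```

Two sequences with many shared k-mers agree in many signature slots, so candidate mapping locations can be found by comparing short signatures rather than full sequences.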

□ Asset: Genome sequence assembly evaluation using long-range sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.05.10.491304v1.full.pdf

Asset evaluates the consistency of a proposed genome assembly with multiple primary long-range data sets, identifying both supported regions and putative structural misassemblies.

Asset uses the four types of long-range sequencing datasets currently used by VGP, namely PacBio long reads, 10X linked reads, Bionano optical maps, and Hi-C. Asset can provide lists of potential problems for subsequent genome curation, and rank genome assemblers.

□ SCSilicon: a tool for synthetic single-cell DNA sequencing data generation

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08566-w

SCSilicon generates single-cell in silico DNA reads with minimal manual intervention. SCSilicon automatically creates a set of genomic aberrations, including SNP, SNV, Indel, and CNV. SCSilicon yields the ground truth of CNV segmentation breakpoints and subclone cell labels.

SCSilicon only needs users to enter the parameter configurations. Then, besides the sequence file for each cell, SNPSimulator, SNVSimulator, InDelSimulator, and CNVSimulator also generate the ground-truth SNPs, SNVs, InDels, CNV matrix, cell clusters, and segment breakpoints.

□ Bi-alignments with affine gaps costs

>> https://almob.biomedcentral.com/articles/10.1186/s13015-022-00219-7

Bi-alignments are motivated by treating shifts between sequence and structure explicitly as evolutionary events. Bi-alignments allow simultaneously predicting sequence and structure homologies and their relation.

Bi-alignments provide a coherent framework to detect shift-like incongruences. Optimal bi-alignments with affine gap costs (or affine shift cost) for two constituent alignments can be computed exactly in quartic space and time.

□ BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data

>> https://academic.oup.com/genetics/advance-article-abstract/doi/10.1093/genetics/iyac079/6583183

BioKIT, a versatile command line toolkit that has, upon publication, 42 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more.

Using BioKIT, the authors show that the novel metric of gene-wise relative synonymous codon usage can accurately estimate gene-wise codon optimization. They also evaluated the characteristics of 901 eukaryotic genome assemblies and calculated alignment summary statistics for 10 phylogenomic data matrices.

□ RE-GOA: Annotating regulatory elements by heterogeneous network embedding

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac185/6553660

RE-GOA, a systematic Gene Ontology Annotation method for Regulatory Elements (RE-GOA) by leveraging the powerful word embedding in natural language processing.

RE-GOA assembles a heterogeneous network by integrating context-specific regulations and Gene Ontology (GO) terms, performs network embedding, and associates regulatory elements with GO terms by assessing their similarity in a low-dimensional vector space.

□ Neglecting normalization impact in semi-synthetic RNA-seq data simulation generates artificial false positives

>> https://www.biorxiv.org/content/10.1101/2022.05.10.490529v1.full.pdf

Dearseq is capable of handling many experimental designs beyond the simple two conditions comparison setting of the Wilcoxon test, and thus constitutes a versatile option for differential expression analysis of large human population samples.

Both limma-voom and NOISeq also controlled FDR adequately using the amended permutation scheme – note that this procedure is difficult for limma-voom, edgeR and DESeq2 because normalization is baked into their analysis methodology.

□ BFF and cellhashR: analysis tools for accurate demultiplexing of cell hashing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac213/6565315

Bimodal Flexible Fitting (BFF) demultiplexing algorithms BFFcluster and BFFraw, a novel class of algorithms that rely on the single inviolable assumption that barcode count distributions are bimodal.

BFFcluster demultiplexing is both tunable and insensitive to issues with poorly behaved data that can confound other algorithms. Demultiplexing with BFF algorithms is accurate and consistent for both well-behaved and poorly behaved input data.

□ CrowdGO: Machine learning and semantic similarity guided consensus Gene Ontology annotation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010075

CrowdGO, a consensus-based GO term meta-predictor that employs machine learning models with GO term semantic similarities and information contents (IC) to produce enhanced functional annotations.

CrowdGO uses an Adaptive Boosting machine learning model, which aims to combine a set of weak classifiers into a weighted sum representing the boosted strong classifier. CrowdGO might benefit from developing additional models using eXtreme Gradient Boosting.

□ JACUSA2: RNA modification mapping

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02676-0

JACUSA2, a versatile software solution and comprehensive analysis framework for RNA modification detection assays that are based on either the Illumina or Nanopore platform.

JACUSA2 can integrate information from multiple experiments, such as replicates and different conditions, and different library types, such as first- or second-strand cDNA libraries.

□ CURC: A CUDA-based reference-free read compressor

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac333/6586792

CURC, a GPU-accelerated reference-free read compressor for FASTQ files. Under a GPU-CPU heterogeneous parallel scheme, CURC implements highly efficient lossless compression of DNA stream based on the pseudogenome approach and CUDA library.

CURC treats each GPU device as an available resource and manages it through a global mutex. When the GPU needs to be utilized in a block compression thread, CURC loops to track the state of the mutex corresponding to some device ID and tries to lock it.
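
The device-locking loop described above can be mimicked with ordinary threads. This is a Python stand-in for what is a CUDA/C++ tool: the `GpuPool` class and its one-lock-per-device layout are assumptions for illustration, and no real GPU work is performed.

```python
import threading
import time

class GpuPool:
    """Hypothetical pool that guards each device ID with its own lock."""

    def __init__(self, num_devices):
        self._locks = [threading.Lock() for _ in range(num_devices)]

    def acquire_any(self, poll_interval=0.001):
        """Loop over device IDs until one lock can be taken without blocking."""
        while True:
            for device_id, lock in enumerate(self._locks):
                if lock.acquire(blocking=False):
                    return device_id
            time.sleep(poll_interval)  # all devices busy: wait briefly, then retry

    def release(self, device_id):
        self._locks[device_id].release()

def compress_block(pool, block_id, log, log_lock):
    device_id = pool.acquire_any()
    try:
        # Placeholder for the GPU stage of block compression.
        with log_lock:
            log.append((block_id, device_id))
    finally:
        pool.release(device_id)

def run(num_devices=2, num_blocks=8):
    pool, log, log_lock = GpuPool(num_devices), [], threading.Lock()
    threads = [threading.Thread(target=compress_block, args=(pool, b, log, log_lock))
               for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

log = run()
```

Each block thread holds at most one device at a time, so the number of concurrent GPU stages never exceeds the number of devices.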

□ MAGScoT - a fast, lightweight, and accurate bin-refinement software

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492251v1.full.pdf

MAGScoT uses GTDBtk rel 207 (v2) marker genes to score completeness and contamination of metagenomic bins, to iteratively select the best metagenome-assembled genomes (MAGs) in a dataset.

MAGScoT can merge overlapping metagenomic bins from multiple binning inputs and add these hybrid bins for scoring and refinement to the set of candidate MAGs.

□ MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac195/6587170

Despite the fact that long-read sequencing technologies have improved the field of SV detection and genotyping, there are still some challenges that prevent satisfactory results from being obtained.

MAMnet is a fast and scalable SV detection and genotyping method based on long reads and a combination of a convolutional neural network and a long short-term memory network. MAMnet uses a deep neural network to implement sensitive SV detection with a novel prediction strategy.

□ gget: Efficient querying of genomic databases for single-cell RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492392v1.full.pdf

Each gget tool requires minimal arguments, provides clear output, and operates from both the command line and Python environments, such as JupyterLab, maximizing ease of use and accommodating novice programmers.

□ CAJAL: A general framework for the combined morphometric, transcriptomic, and physiological analysis of cells using metric geometry

>> https://www.biorxiv.org/content/10.1101/2022.05.19.492525v1.full.pdf

The Gromov-Wasserstein distance that results from this approach can be thought of as a distance in a latent space of cell morphologies. CAJAL enables the analyses for arbitrarily complex and heterogeneous cell populations.

CAJAL has the generality and stability of simple geometric shape descriptors, the discriminative power of cell-type specific descriptors, and the unbiasedness and hierarchical structure of moments-based descriptors.

□ rowbowt: Pangenomic genotyping with the marker array

>> https://www.biorxiv.org/content/10.1101/2022.05.19.492566v1.full.pdf

A new structure called the marker array that replaces the suffix-array-sample component of the r-index with a structure tailored to the problem of collecting genotype evidence.

The rowbowt index consists of three components: the run-length encoded Burrows-Wheeler Transform (BWT), the run-sampled suffix array, and the marker array. This approach preserves all linkage disequilibrium information.
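
To make the first component concrete, here is a naive sketch of a run-length encoded BWT. It sorts all rotations, which is fine for a toy string but not how an r-index is built in practice; the marker array and sampled suffix array are not shown.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (quadratic; for illustration only)."""
    assert sentinel not in text
    text = text + sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def run_length_encode(s):
    """Collapse consecutive identical characters into (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(ch, count) for ch, count in runs]

transformed = bwt("banana")              # "annb$aa"
runs = run_length_encode(transformed)    # [("a", 1), ("n", 2), ("b", 1), ("$", 1), ("a", 2)]
```

Repetitive collections such as pangenomes tend to produce long runs in the BWT, which is why storing runs instead of characters saves space.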

□ SigProfilerClusters: Examining clustered somatic mutations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac335/6589887

SigProfilerClusters detects all types of clustered mutations by calculating a sample-dependent IMD threshold using a simulated background model that takes into account extended sequence context, transcriptional strand asymmetries, and regional mutation densities.

SigProfilerClusters disentangles all types of clustered events from non-clustered mutations and annotates each clustered event into an established subclass, including the widely used classes of doublet-base substitutions, multi-base substitutions, omikli, and kataegis.
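
Once an inter-mutational distance (IMD) threshold is known, grouping mutations into clustered events is straightforward. The sketch below takes the threshold as a given input; deriving it from a simulated background model, as the tool does, is not shown, and the function name is hypothetical.

```python
def cluster_by_imd(positions, imd_threshold):
    """Group mutations on one chromosome whose consecutive distances are at most the threshold."""
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > imd_threshold:
            if len(current) > 1:          # singletons are non-clustered mutations
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) > 1:
        clusters.append(current)
    return clusters

clusters = cluster_by_imd([100, 150, 180, 5000, 9000, 9100], imd_threshold=200)
# [[100, 150, 180], [9000, 9100]]
```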

□ Gene expression data classification using topology and machine learning models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04704-z

This work involves curating relevant features to obtain somewhat better representatives with the help of TDA. These representatives of the entire data facilitate better comprehension of the phenotype labels.

The topology relevant curated data provides an improvement in shallow learning as well as deep learning based supervised classifications. The representative cycles have an unsupervised inclination towards phenotype labels.

□ Modpolish: Correcting Modification-Mediated Errors in Nanopore Sequencing by Nucleotide Demodification and in silico Correction

>> https://www.biorxiv.org/content/10.1101/2022.05.20.492776v1.full.pdf

Modpolish corrects modification-mediated errors without WGA and prior knowledge of the modifications. Modpolish identifies the modification-mediated errors by investigating basecalling quality, basecalling consistency, and evolutionary conservation.

In conjunction with the conservation degree measured by closely-related genomes, only the modified loci with ultra-high conservation will be corrected by Modpolish, avoiding false corrections of strain variations.

□ XR/T-Seq: Reconstruction of Full-length scFv Libraries with the Extended Range Targeted Sequencing Method

>> https://www.biorxiv.org/content/10.1101/2022.05.10.491248v1.full.pdf

Single chain fragment variable (scFv) phage display libraries of randomly paired VH-VL antibody domains are a powerful and widely adopted tool for the discovery of antibodies of a desired specificity.

XR/T-Seq (the Extended Range Targeted Sequencing) enables long molecule reconstruction from standard paired 2X150bp reads. The XR/T-Seq method was applied to analyze a commercial scFv phage display library consisting of randomly paired VH-VL domains.

□ Improved transcriptome assembly using a hybrid of long and short reads with StringTie

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009730

First, StringTie identifies the heaviest path in the splicing graph and makes that the candidate transcript; second, it assigns a coverage level to that transcript by solving a maximum-flow problem.

After all transcripts in the annotation have been exhausted, if there are still paths in the splicing graph that are covered by reads, the algorithm resumes using its default heuristic to identify the heaviest path in the graph.
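
One simple way to picture "heaviest path" is as the path through a directed acyclic splicing graph with the largest total node coverage. That reading is an assumption made for this sketch; StringTie's own heuristic and its maximum-flow coverage step are not reproduced here.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def heaviest_path(weights, successors):
    """Return (path, total weight) maximizing summed node weights in a DAG."""
    predecessors = {node: set() for node in weights}
    for node, nexts in successors.items():
        for nxt in nexts:
            predecessors[nxt].add(node)
    best = {}  # node -> (weight of best path ending at node, previous node on that path)
    for node in TopologicalSorter(predecessors).static_order():
        prev = max(predecessors[node], key=lambda p: best[p][0], default=None)
        total = weights[node] + (best[prev][0] if prev is not None else 0)
        best[node] = (total, prev)
    end = max(best, key=lambda n: best[n][0])
    path, node = [], end
    while node is not None:               # walk back through stored predecessors
        path.append(node)
        node = best[node][1]
    return path[::-1], best[end][0]

# Toy splicing graph: exon e1 can be followed by e2 or e3, both leading to e4.
weights = {"e1": 10, "e2": 3, "e3": 8, "e4": 5}
successors = {"e1": ["e2", "e3"], "e2": ["e4"], "e3": ["e4"]}
path, total = heaviest_path(weights, successors)
```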

Obscura.

2022-06-06 06:01:06 | Science News

□ SBWT: Succinct k-mer Set Representations Using Subset Rank Queries on the Spectral Burrows-Wheeler Transform

>> https://www.biorxiv.org/content/10.1101/2022.05.19.492613v1.full.pdf

The Spectral Burrows-Wheeler Transform (SBWT) is a distillation of the ideas found in the BOSS and Wheeler graph data structures. The SBWT can also be seen as a specialization of the Wheeler graph framework into k-spectra.

It is possible to use entropy coding methods to compress the space of data structures while retaining query support. MatrixSBWT implemented with bit vectors compressed to the zeroth order entropy leads to a data structure taking 3.25 bits per k-mer on the DNA alphabet.

The space on a general alphabet of size σ is (n + k)(log σ + 1/ln 2) + o((n + k)σ), where n is the number of k-mers in the spectrum. The data structure can answer k-mer membership queries in O(k) time, improving on the BOSS data structure, which occupies the same asymptotic space.

□ The Maximum Entropy Principle For Compositional Data

>> https://www.biorxiv.org/content/10.1101/2022.06.07.495074v1.full.pdf

CME, a data-driven framework for modeling compositions in multi-species networks. CME utilizes maximum entropy, a first-principles modeling approach, to learn influential nodes and their network connections using only the available experimental information.

CME can incorporate more general model constraints as well. The compositional simplex constraint is enforced using the method of Lagrange multipliers. Other geometries, even higher-order moments, can be included simply by including new Lagrange multipliers.

□ xTADA / VBASS: Integration of gene expression data in Bayesian association analysis of rare variants

>> https://www.biorxiv.org/content/10.1101/2022.05.13.491893v1.full.pdf

xTADA takes a single GE profile, such as bulk RNA-seq, as a separate observed variable independent of genetic variants conditioned on risk status. The expression level of a gene is a random variable that has different distributions under the null and the alternative models.

VBASS (Variational inference Bayesian ASSociation) takes a vector of expression profiles (EP) and models the priors of risk genes as a function of the EP of multiple cell types. VBASS uses deep neural networks to approximate the function and uses semi-supervised variational inference.

□ SPRISS: Approximating Frequent K-mers by Sampling Reads, and Applications

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac180/6588068

SPRISS employs a powerful read-sampling scheme that extracts a representative subset of the dataset. This subset can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required to analyze the whole dataset.

SPRISS does not need to receive and scan the entire dataset as input; it needs only a small sample of reads drawn from the dataset. The read-sampling strategy of SPRISS requires the more sophisticated concept of pseudodimension.

□ Exodus: sequencing-based pipeline for quantification of pooled variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac319/6584805

Exodus – a reference-based Python algorithm for quantification of genomes, including those that are highly similar, when they are sequenced together in a single mix.

No false negatives were recorded, demonstrating that Exodus’ likelihood of missing an existing genome is very low, even if the genome’s relative abundance is low and similar genomes are sequenced with it in the same mix.

□ ODGI: understanding pangenome graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac308/6585331

ODGI supports pre-built graphs in the Graphical Fragment Assembly format. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs.

ODGI explores context mapping deconvolution of pangenome graph structures via the path Jaccard metric. The ODGI data structure allows algorithms that build and modify the graph to operate in parallel, without any global locks.

□ WSV: Identification of representative trees in random forests based on a new tree-based distance measure

>> https://www.biorxiv.org/content/10.1101/2022.05.15.492004v1.full.pdf

A new distance measure for decision trees to identify the most representative trees in random forests, based on the selected splitting variables but incorporating the level at which they were selected within the tree.

The authors introduce the new weighting splitting variable (WSV) metric and describe how to extract the most representative tree from the forest based on any tree distance. The WSV approach leads to the best MSE when the minimal node size is small and the trees are therefore more complex.

□ ntEdit+Sealer: Efficient Targeted Error Resolution and Automated Finishing of Long-Read Genome Assemblies

>> https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.442

ntEdit+Sealer, an alignment-free genome finishing protocol that employs Bloom filters. Both ntEdit / Sealer employ a k-sweep approach, iterating from long to short k-mer lengths. This method is beneficial because different k-mer lengths can provide resolution at different scales.

ntEdit queries assembly k-mers in the Bloom filter, making base corrections where possible and flagging problematic stretches. ABySS-Bloom creates a 2-level cascading Bloom filter, which is used by Sealer as an implicit de Bruijn graph to fill assembly gaps / problematic regions.
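
The Bloom filter query at the heart of this protocol can be sketched in a few lines. This is a generic single-level Bloom filter, not ntEdit's or ABySS-Bloom's implementation; the sizes, hash construction, and the `unsupported_kmers` helper are assumptions for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def unsupported_kmers(assembly, bloom, k):
    """Start positions of assembly k-mers that are absent from the read k-mer filter."""
    return [i for i in range(len(assembly) - k + 1) if assembly[i:i + k] not in bloom]

k = 4
bloom = BloomFilter()
read = "ACGTACGTAC"
for i in range(len(read) - k + 1):
    bloom.add(read[i:i + k])

flagged = unsupported_kmers("ACGTTCGTAC", bloom, k)  # the A>T change disrupts four k-mers
```

A stretch of consecutive unsupported k-mers points at the base that likely needs correcting; repeating the query with several k values corresponds to the k-sweep idea.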

□ AtlasXbrowser enables spatial multi-omics data analysis through the precise determination of the region of interest

>> https://www.biorxiv.org/content/10.1101/2022.05.11.491526v1.full.pdf

AtlasXbrowser allows for an assay agnostic image processing GUI that can be used for all DBiT assays. AtlasXbrowser guides the user through the process of locating the region of interest (ROI), defined as the pixels of the micrograph corresponding to the location of the TIXEL mosaic.

AtlasXbrowser encapsulates the numerous advances made in the DBiT protocol since its inception. AtlasXbrowser has standardized the output of DBiT image data, creating a “Spatial Folder”, containing the output of the image processing in the 10x Visium image data format.

□ Prider: multiplexed primer design using linearly scaling approximation of set coverage

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04710-1

Prider initially prepares a full primer coverage of the input sequences, the complexity of which is subsequently reduced by removing components of high redundancy or narrow coverage.

Prider permits efficient design of primers to large DNA datasets by scaling linearly to increasing sequence data. Prider solves a recalcitrant problem in molecular diagnostics: how to cover a maximal sequence diversity with a minimal number of oligonucleotide primers or probes.
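
The coverage problem itself can be illustrated with the classic greedy set-cover heuristic. Note that this is a swapped-in textbook technique, not Prider's linearly scaling approximation; the primer names and coverage sets are invented.

```python
def greedy_primer_cover(coverage):
    """Repeatedly pick the primer that covers the most still-uncovered sequences."""
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical by primer name).
        primer = max(sorted(coverage), key=lambda p: len(coverage[p] & uncovered))
        chosen.append(primer)
        uncovered -= coverage[primer]
    return chosen

# Hypothetical primers mapped to the IDs of input sequences they match.
coverage = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {4, 5}, "p4": {1}}
chosen = greedy_primer_cover(coverage)
```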

□ EpiTrace: Tracking single cell evolution via clock-like chromatin accessibility

>> https://www.biorxiv.org/content/10.1101/2022.05.12.491736v1.full.pdf

EpiTrace derived cell age shows concordance to known developmental hierarchies, correlates well with DNA methylation-based clocks, and is complementary with mutation-based lineage tracing, RNA velocity, and stemness predictions.

EpiTrace age prediction is reversed for erythroid lineage, probably due to genome-wide chromatin condensation. EpiTrace age shows negative correlation to peaks associated w/ genes acting in current and future stage, and positive correlation to peaks associated with genes acting.

□ Nezzle: an interactive and programmable visualization of biological networks in Python

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac324/6585333

Nezzle provides a set of essential features for rapid prototyping to visualize biological networks. Nezzle provides interfaces for interactive graphics and dynamic code execution.

Nezzle can be a test bed for rapidly evaluating the feasibility of algorithms related to biological networks in Python. Users can develop a prototype of network visualization algorithm that is optimized based on a GPU-accelerated deep learning framework.

□ MultiCens: Multilayer network centrality measures to uncover molecular mediators of tissue-tissue communication

>> https://www.biorxiv.org/content/10.1101/2022.05.15.492007v1.full.pdf

MultiCens (Multilayer/Multi-tissue network Centrality measures) can distinguish within- vs. across-layer connectivity to quantify the “influence” of any gene in a tissue on a query set of genes of interest in another tissue.

MultiCens enjoys theoretical guarantees on convergence and decomposability, and excels on synthetic benchmarks. MultiCens also accounts for the multilayer multi-hop network connectivity structure of the underlying system.

□ VIGoR: joint estimation of multiple linear learners with variational Bayesian inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac328/6586288

VIGoR (Variational Bayesian Inference for Genome-Wide Regression) conducts linear regression using variational Bayesian inference, particularly optimized for genome-wide association mapping and whole-genome prediction which use a number of SNPs as the explanatory variables.

Solutions are obtained with variational inference which is more time-efficient than MCMC. VIGoR was initially developed to provide variational Bayesian inference for linear regressions and has been updated to incorporate multiple learners.

□ Acidbio: Assessing and assuring interoperability of a genomics file format

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac327/6586286

Acidbio, a new verification system which tests for correct behavior in bioinformatics software packages. They crafted tests to unify correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format.

BED variants distinguish BED files by their number of fields. BEDn denotes a file with only the first n fields. BEDn+m denotes a file with the first n fields followed by m custom-defined fields supplied by the user.
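
The BEDn+m naming can be turned into a small parser. This sketch assumes tab-separated fields and is not part of Acidbio; both helper names are hypothetical.

```python
import re

def parse_bed_variant(name):
    """Turn a variant name such as 'BED6' or 'BED3+2' into (n standard fields, m custom fields)."""
    match = re.fullmatch(r"BED(\d+)(?:\+(\d+))?", name)
    if not match:
        raise ValueError(f"not a BED variant name: {name!r}")
    return int(match.group(1)), int(match.group(2) or 0)

def split_bed_line(line, variant):
    """Split one tab-separated line into its standard and custom fields, checking the field count."""
    n, m = parse_bed_variant(variant)
    fields = line.rstrip("\n").split("\t")
    if len(fields) != n + m:
        raise ValueError(f"expected {n + m} fields for {variant}, found {len(fields)}")
    return fields[:n], fields[n:]

standard, custom = split_bed_line("chr1\t100\t200\tfoo\t0.5", "BED3+2")
```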

□ Building alternative consensus trees and supertrees using k-means and Robinson and Foulds distance

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac326/6586801

A new efficient method for inferring multiple alternative consensus trees and supertrees to best represent the most important evolutionary patterns of a given set of gene phylogenies.

An adaptation of the popular k-means clustering algorithm, based on some remarkable properties of the Robinson and Foulds distance, can be used to partition a given set of trees into one (for homogeneous data) or multiple (for heterogeneous data) cluster(s) of trees.
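
The Robinson and Foulds distance that drives the clustering counts the groupings present in one tree but not the other. The sketch below uses a simplified rooted variant on nested tuples (clades instead of unrooted bipartitions), which is an assumption for brevity; the k-means step itself is not shown.

```python
def clades(tree):
    """Non-trivial clades (as frozensets of leaf names) of a rooted tree given as nested tuples."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)  # the full leaf set is shared by all trees, so it carries no information
    return found

def rf_distance(tree_a, tree_b):
    """Number of clades found in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))

t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
distance = rf_distance(t1, t2)  # {A,B} and {A,C} differ; {A,B,C} is shared
```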

□ Comprehensive and standardized benchmarking of deep learning architectures for basecalling nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492272v1.full.pdf

A wide set of evaluation metrics that can be used to analyze the strengths and weaknesses of basecaller models. This toolbox can be used as benchmark for the standardized training and cross-comparison of existing and future basecallers.

Bonito achieved the best overall performance. Using a CRF decoder, over the more traditional CTC decoder, boosts performance significantly and it is likely the reason why Bonito performs so well in the initial benchmark.

Deep RNNs (LSTM) are superior to Transformer layers, and both simple and complex convolutional architectures can achieve competitive performance.

□ DCAlign v1.0: Aligning biological sequences using co-evolution models and informative priors

>> https://www.biorxiv.org/content/10.1101/2022.05.18.492471v1.full.pdf

DCAlign returns the ordered sub-sequence of a query unaligned sequence which maximizes an objective function related to the DCA model of the seed. Standard DCA models fail to adequately describe the statistics of insertions and gaps.

DCAlign v1.0 is a new implementation of the Direct Coupling Analysis (DCA) - based alignment technique, DCAlign, which conversely to the first implementation, allows for a fast parametrization of the seed alignment.

□ ffq: Metadata retrieval from genomics database

>> https://www.biorxiv.org/content/10.1101/2022.05.18.492548v1.full.pdf

ffq facilitates metadata retrieval from a diverse set of databases, including the National Center for Biotechnology Information Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO), EMBL-EBI ENA, DDBJ GEA, and the ENCODE database.

ffq fetches and returns metadata as a JSON object by traversing the database hierarchy. Subsets of the database hierarchy can be returned by specifying -l [level].

□ SEEM / SEED: Powerful Molecule Generation with Simple ConvNet

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac332/6589886

A ConvNet-based sequential graph generation algorithm. The molecular graph generation problem is reformulated as a sequence of simple classification tasks.

At each step, a convolutional neural network operates on a subgraph that is generated at previous step, and predicts/classifies an atom/bond adding action to populate the input subgraph.

The pretrained model is abbreviated as SEEM (structural encoder for engineering molecules). It is then fine-tuned with reinforcement learning to generate molecules. The fine-tuned model is named SEED (structural encoder for engineering drug-like-molecules).

□ Model verification tools: a computational framework for verification assessment of mechanistic agent-based models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04684-0

Agent-based models (ABMs) usually make use of pseudo-random number generators initialized with different random seeds for reproducing different stochastic behaviors. It is possible to analyze the model behavior from a deterministic or stochastic point of view.

Model Verification Tools (MVT), a suite of tools based on the same theoretical framework with a user-friendly interface for the evaluation of the deterministic verification of discrete-time models, with a particular focus on agent-based approaches.

□ MATTE: anti-noise module alignment for phenotype-gene-related analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.29.493935v1.full.pdf

A Module Alignment of TranscripTomE (MATTE) aligns modules directly by calculating RDE or RDC transformed data, clustering to assign genes from each phenotype a label, and separating genes into preserved and differentiated modules by cross-tabulation.

MATTE shows a strong anti-noise ability to detect both differential expression and differential co-expression. MATTE takes transcriptome data and phenotype data as inputs, aiming to construct a space where each gene from different phenotypes is treated as an individual one.

□ Smart-seq3xpress: Scalable single-cell RNA sequencing from full transcripts

>> https://www.nature.com/articles/s41587-022-01311-4

The overlays would both protect the low reaction volumes from evaporation and provide a ‘landing cushion’ for the FACS-sorted cells. Indeed, many overlays with varying chemical properties could be used with low-volume Smart-seq3.

Smart-seq3xpress miniaturizes and streamlines the Smart-seq3 protocol to substantially reduce reagent use and increase cellular throughput. Smart-seq3xpress analysis of peripheral blood mononuclear cells resulted in a granular atlas complete with common and rare cell types.

□ OmicSelector: automatic feature selection and deep learning modeling for omic experiments.

>> https://www.biorxiv.org/content/10.1101/2022.06.01.494299v1.full.pdf

OmicSelector provides an overfitting-resilient pipeline that integrates 94 feature selection approaches based on distinct variable selection. OmicSelector identifies the best feature sets using modeling techniques with hyperparameter optimization in hold-out or cross-validation.

OmicSelector provides classification performance metrics for proposed feature sets, allowing researchers to choose the overfitting-resistant biomarker set with the highest diagnostic potential.

OmicSelector performs GPU-accelerated development, validation, and implementation of deep learning feedforward neural networks (up to 3 hidden layers, with or without autoencoders) on selected signatures.

□ DST: Integrative Data Semantics through a Model-enabled Data Stewardship

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac375/6598845

The Data Steward Tool (DST) which can be used to automatically standardize clinical datasets, map them to established ontologies, align them with OMOP standards, and export them to a FHIR-based format.

The DST is capable of automatically mapping external variables onto the CDM through fuzzy string matching. DST provides a graph-based view of the model where the user can interactively explore the entirety of the model.
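
Fuzzy string matching of variable names can be illustrated with Python's standard `difflib`. This is not the DST's matching code; the variable names, the `map_variables` helper, and the 0.6 cutoff are assumptions for the example.

```python
import difflib

def map_variables(external_names, cdm_names, cutoff=0.6):
    """Map each external variable to its closest CDM variable, or None if nothing is similar enough."""
    lowered = {name.lower(): name for name in cdm_names}
    mapping = {}
    for external in external_names:
        matches = difflib.get_close_matches(external.lower(), list(lowered), n=1, cutoff=cutoff)
        mapping[external] = lowered[matches[0]] if matches else None
    return mapping

cdm = ["age", "body_mass_index", "systolic_blood_pressure"]
mapping = map_variables(["Age", "body mass index", "xyz"], cdm)
```

Unmatched variables come back as `None`, which leaves room for a manual review step.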

□ NcPath: A novel tool for visualization and enrichment analysis of human non-coding RNA and KEGG signaling pathways

>> https://www.biorxiv.org/content/10.1101/2022.06.03.494777v1.full.pdf

NcPath integrates a total of 178,308 human experimentally-validated miRNA-target interactions (MTIs), 36,537 experimentally-verified lncRNA target interactions (LTIs), and 4,879 experimentally-validated human ceRNA networks across 222 KEGG pathways.

The NcPath database provides information on MTIs/LTIs/ceRNA networks, PubMed IDs, gene annotations and the experimental verification method used.

The NcPath database will serve as an important and continually updated platform that provides annotation and visualization of the pathways in which noncoding RNAs (miRNA and lncRNA) are involved, and supports multimodal noncoding RNA enrichment analysis.

□ Random-effects meta-analysis of effect sizes as a unified framework for gene set analysis

>> https://www.biorxiv.org/content/10.1101/2022.06.06.494956v1.full.pdf

A novel approach to GSA that both provides a unifying framework for the different approaches outlined above and takes into account the uncertainty in the estimate of the effect size from the first stage of the analysis.

The log fold change (LFC) for genes in a given set is modeled as a mixture of Gaussian distributions, with distinct components corresponding to up-regulated, down-regulated and non-DE genes.

□ ACO: lossless quality score compression based on adaptive coding order

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04712-z

ACO is a specialized compressor for quality scores, so it takes more of the distribution characteristics of quality score data into account.

The main objective of ACO is to traverse the quality score along the most relative directions, which can be regarded as a reorganization of the stack of independent 1D quality score vectors into highly related 2D matrices.

□ YaHS: yet another Hi-C scaffolding tool

>> https://www.biorxiv.org/content/10.1101/2022.06.09.495093v1.full.pdf

YaHS takes the alignment file (either in BED format or BAM format) to first optionally break contigs at positions lacking Hi-C coverage which are potential assembly errors.

YaHS takes account of the restriction enzymes used in the Hi-C library. The cell contact frequencies are normalised by the corresponding number of cutting sites. YaHS builds a scaffolding graph w/ contigs as nodes / contig joins as edges which are weighted by the joining scores.

The graph is simplified by a series of operations incl. filtering low score edges, trimming tips, solving repeats, removing transitive edges, trimming weak edges and removing ambiguous edges. Finally the graph is traversed to assemble scaffolds along contiguous paths.

□ SnapHiC2: A computationally efficient loop caller for single cell Hi-C data

>> https://www.sciencedirect.com/science/article/pii/S2001037022002021

SnapHiC2 adopts a sliding window approach when implementing the random walk with restart (RWR) algorithm, achieving more than 3 times speed up and reducing memory usage by around 70%.

SnapHiC2 can identify 5 Kb resolution chromatin loops with high sensitivity and accuracy. SnapHiC2, with its data-driven strategy to select sliding window size that retains more than 80% of contacts, can identify loops with similar quality as the original SnapHiC algorithm.
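
As a rough illustration of the random walk with restart (RWR) step mentioned above, here is a minimal pure-Python sketch on a toy contact graph. It is a plain dense iteration, not SnapHiC2's sliding-window implementation; the function name `rwr`, the restart probability and the toy matrix are all assumptions made for this example.

```python
def rwr(adj, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart on a small dense adjacency matrix.

    Row i of the result is the stationary visiting distribution of a
    walker that jumps back to node i with probability `restart`.
    """
    n = len(adj)
    # Row-normalise the adjacency matrix into transition probabilities.
    trans = []
    for row in adj:
        total = sum(row)
        trans.append([v / total if total else 0.0 for v in row])

    result = []
    for start in range(n):
        e = [1.0 if j == start else 0.0 for j in range(n)]
        p = e[:]
        for _ in range(max_iter):
            # p_next = (1 - restart) * p @ trans + restart * e
            nxt = [
                (1 - restart) * sum(p[k] * trans[k][j] for k in range(n))
                + restart * e[j]
                for j in range(n)
            ]
            diff = max(abs(x - y) for x, y in zip(nxt, p))
            p = nxt
            if diff < tol:
                break
        result.append(p)
    return result


# Toy 4-bin contact map (hypothetical): a chain 0-1-2-3.
contacts = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = rwr(contacts)
```

Each row of `scores` sums to one, and bins closer to the starting bin receive more probability mass. Restricting the walk to a window around the diagonal, as the entry describes, would reduce both the matrix size and the memory footprint.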

□ geometric hashing: Global, highly specific and fast filtering of alignment seeds

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04745-4

Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed.

Geometric hashing is a fast filter for candidate seeds, such as exact k-mer matches induced by spaced seed patterns. Matches from homologous regions are accumulated over possibly long distances. The geometric hashing idea generalizes well to higher-dimensional seeds.

□ Monod: mechanistic analysis of single-cell RNA sequencing count data

>> https://www.biorxiv.org/content/10.1101/2022.06.11.495771v1.full.pdf

By parameterizing multidimensional distributions with biophysical variables, Monod provides a route to identifying and studying differential expression patterns that do not cause changes in average gene expression.

To account for inter-gene coupling through sequencing, the inference procedure iterates over a grid of technical noise parameters and computes a conditional maximum likelihood estimate (MLE) for each gene’s biological noise parameters.

□ SVAFotate: Annotation of structural variants with reported allele frequencies and related metrics from multiple datasets

>> https://www.biorxiv.org/content/10.1101/2022.06.09.495527v1.full.pdf

SVAFotate provides the means to aggregate SV calls from multiple SV population datasets and create summaries of AF-relevant data into simple annotations that are added to SV calls based on default or user-determined SV matching criteria.

SVAFotate has been tested on VCFs created from various SV callers and is compatible w/ any VCF incl. SVTYPE (END / SVLEN) in the INFO field. All SV calls in the VCF are internally converted into a BED for the purposes of identifying overlapping genomic coordinates w/ the SVs.

Disassemby.

2022-06-06 06:00:06 | Science News

□ Shepherd: Accurate Clustering for Correcting DNA Barcode Errors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac395/6609174

Shepherd, a novel clustering method that is based on an indexing system of barcode sequences using k-mers, and a Bayesian statistical test incorporating a substitution error rate to distinguish true from error sequences.

Shepherd provides barcode count estimates that are significantly more accurate, producing 10-150 times fewer spurious lineages. Shepherd introduces the novel capability of tracking lineages that are undetectable in the first time point but emerge at later time points.

Shepherd exploits the pigeonhole principle to efficiently find neighborhoods for each sequence using the k-mer indexing system. It enables identification of sequence neighborhoods, and can be applied to any neighborhood identification task involving the Hamming distance.
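
The pigeonhole idea can be sketched as follows: if two equal-length barcodes differ in at most d positions and each is cut into d + 1 disjoint chunks, at least one chunk must match exactly, so an exact-match index on chunks yields every candidate neighbor. This is a simplified illustration rather than Shepherd's actual k-mer indexing scheme; the function names and the toy barcodes are assumptions.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_chunks(seq, n_chunks):
    """Split seq into n_chunks contiguous, non-overlapping pieces."""
    size, extra = divmod(len(seq), n_chunks)
    chunks, pos = [], 0
    for i in range(n_chunks):
        end = pos + size + (1 if i < extra else 0)
        chunks.append(seq[pos:end])
        pos = end
    return chunks


def neighborhoods(barcodes, max_dist):
    """Map each barcode to the other barcodes within max_dist mismatches."""
    # Index every barcode by (chunk position, chunk content).
    index = defaultdict(set)
    for bc in barcodes:
        for i, chunk in enumerate(split_chunks(bc, max_dist + 1)):
            index[(i, chunk)].add(bc)

    result = {}
    for bc in barcodes:
        candidates = set()
        for i, chunk in enumerate(split_chunks(bc, max_dist + 1)):
            candidates |= index[(i, chunk)]
        candidates.discard(bc)
        # Confirm each candidate with an exact Hamming distance check.
        result[bc] = {c for c in candidates if hamming(bc, c) <= max_dist}
    return result


# Hypothetical barcodes, all of the same length.
barcodes = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TTTTTTTT"]
nbrs = neighborhoods(barcodes, max_dist=1)
```

Only barcodes sharing at least one chunk are ever compared, which avoids the all-against-all distance computation.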

□ Graph-based algorithms for Laplace transformed coalescence time distributions.

>> https://www.biorxiv.org/content/10.1101/2022.05.20.492768v1.full.pdf

Using the Laplace transform, this distribution can be generated with a simple recursive procedure, regardless of model complexity.

Assuming an infinite-sites mutation model, the probability of observing specific configurations of linked variants within small haplotype blocks can be recovered from the Laplace transform of the joint distribution of branch lengths.

The state space diagram can be turned into a computational graph, allowing efficient evaluation of the Laplace transform by means of a graph traversal algorithm. This algorithm can be applied to tabulate the likelihoods of mutational configurations in non-recombining blocks.

□ scTite: Entropy-based inference of transition states and cellular trajectory for single-cell transcriptomics

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac225/6607748

scTite uses a new metric called transition entropy to measure the uncertainty of a cell belonging to different cell clusters, and then identifies cell states and transition cells.

scTite utilizes the Wasserstein distance on the probability distribution and constructs the minimum spanning tree. It adopts the signaling entropy / partial correlation coefficient to determine transition paths, which contain a group of transition cells w/ the largest similarity.
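
To make the idea of an entropy over cluster membership concrete, the sketch below uses plain Shannon entropy of a cell's cluster membership probabilities. The exact definition of transition entropy is given in the paper and may differ; Shannon entropy is swapped in here as an assumption, and the function name and example probabilities are hypothetical.

```python
import math


def membership_entropy(probs):
    """Shannon entropy (in nats) of a cell's cluster membership probabilities."""
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


stable_cell = [0.98, 0.01, 0.01]      # clearly belongs to one cluster
transition_cell = [0.50, 0.45, 0.05]  # split between two clusters

stable_score = membership_entropy(stable_cell)
transition_score = membership_entropy(transition_cell)
```

A cell that sits firmly in one cluster scores near zero, while a cell shared between clusters scores higher, which is the property a transition-cell detector needs.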

□ A survey of mapping algorithms in the long-reads era

>> https://www.biorxiv.org/content/10.1101/2022.05.21.492932v1.full.pdf

The unprecedented characteristics of this new type of sequencing data created a shift, and methods moved on from the seed-and-extend framework previously used for short reads to a seed-and-chain framework due to the abundance of seeds in each read.

The long-read mapping algorithms are based on alternative seed constructs or chaining formulations. The use of diagonal-transition algorithms, which were initially defined for edit distance, has been revived for the gap-affine model with the wavefront alignment algorithm.

□ DNAscope: High accuracy small variant calling using machine learning

>> https://www.biorxiv.org/content/10.1101/2022.05.20.492556v1.full.pdf

As a successor to GATK HaplotypeCaller, DNAscope uses a similar logical architecture, but introduces improvements to active region detection and local assembly for improved sensitivity and robustness, especially across high-complexity regions.

DNAscope can be used with a Bayesian genotyping model, allowing users to benefit from DNAscope’s improved active region detection and local assembly when resequencing diverse organisms.

Sequence reads aligned across active regions undergo local assembly using de Bruijn graphs and read-haplotype likelihoods are calculated through PairHMM.

Gradient Boosting Machines (GBMs) build trees in succession to train sequential ensembles of weak, base learners, reducing residuals in a stepwise fashion.
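
The stepwise residual reduction described above can be sketched with one-dimensional regression stumps and squared loss. This is a generic textbook-style illustration, not DNAscope's model; the function names, learning rate and toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split minimising squared error on residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit stumps in succession, each to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    history = []  # training MSE before each new stump is added
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        history.append(sum(r * r for r in residuals) / len(ys))
        stumps.append(fit_stump(xs, residuals))
    return predict, history


# Hypothetical toy data: a noisy, roughly linear trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
model, mse_history = gradient_boost(xs, ys)
```

Each stump is a weak learner on its own; the ensemble improves because every round fits what the previous rounds left unexplained.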

□ A scalable approach for continuous time Markov models with covariates

>> https://www.biorxiv.org/content/10.1101/2022.06.06.494953v1.full.pdf

A mini-batch stochastic gradient descent algorithm uses a smaller random subset of the dataset at each iteration, making it practical to fit large-scale data.

An optimization technique for continuous time Markov models (CTMM) which uses a stochastic gradient descent algorithm combined with differentiation of the matrix exponential using a Padé approximation.

□ DeepLUCIA: predicting tissue-specific chromatin loops using Deep Learning-based Universal Chromatin Interaction Annotator

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac373/6596048

Although DeepLUCIA (Deep Learning-based Universal Chromatin Interaction Annotator) does not use the TF binding profile data on which previous TF binding-dependent methods critically rely, its prediction accuracies are comparable to those of the previous methods.

DeepLUCIA enables tissue-specific chromatin loop predictions from tissue-specific epigenomes, which cannot be handled by genomic variation-based approaches.

□ scCNC: A method based on Capsule Network for Clustering scRNA-seq Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac393/6608086

When confronted by the high dimensionality and general dropout events of scRNA-seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters, which complicates cell type assignment.

scCNC, a semi-supervised clustering method based on a capsule network, that integrates domain knowledge into the clustering step. A Semi-supervised Greedy Iterative Training (SGIT) method is used to train the whole network.

□ MGMM: Fitting Gaussian mixture models on incomplete data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04740-9

MGMM, missingness-aware Gaussian mixture models for fitting GMMs in the presence of missing data. Unlike existing GMM implementations that can accommodate missing data, MGMM places no restrictions on the form of the covariance matrix.

MGMM employs an Expectation Conditional Maximization algorithm, which accelerates estimation by breaking direct maximization of the EM objective function into a sequence of simpler conditional maximizations. It handles both missingness of the cluster assignments and of elements.

□ Biomarker identification by reversing the learning mechanism of an autoencoder and recursive feature elimination

>> https://pubs.rsc.org/en/content/articlelanding/2022/mo/d1mo00467k

An autoencoder-based biomarker identification method by reversing the learning mechanism.

By reversing the learning mechanism of the trained autoencoders, they devised an explainable post hoc methodology for identifying the influential genes with a high likelihood of becoming biomarkers.

□ kngMap: Sensitive and Fast Mapping Algorithm for Noisy Long Reads Based on the K-Mer Neighborhood Graph

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.890651/full

kngMap, a k-mer neighborhood graph mapper which is specifically designed to improve mapping sensitivity and deal with SV events. kngMap constructs a searching index for the reference genome to quickly find matched k-mers for query reads.

Such matches are then used to construct a k-mer d-neighborhood graph where matched k-mers are viewed as vertices and each pair of matched k-mers is connected by a direct edge.

kngMap has superior ability in terms of base-level sensitivity and end-to-end alignment, which can produce consecutive alignments for the whole read.

□ PCAone: fast and accurate out-of-core PCA framework for large scale biobank data

>> https://www.biorxiv.org/content/10.1101/2022.05.25.493261v1.full.pdf

PCAone uses a window based optimization scheme based on blocks of data which allows the algorithm to converge within a few passes through the whole data.

PCAone implements 3 fast PCA algorithms for finding the top eigenvectors of large datasets: the Implicitly Restarted Arnoldi Method (IRAM), single-pass Randomized SVD (RSVD), and the authors' own RSVD method with window-based power iterations.

□ scEFSC: Accurate single-cell RNA-seq data analysis via ensemble consensus clustering based on multiple feature selections

>> https://www.sciencedirect.com/science/article/pii/S2001037022001416

scEFSC, a single-cell consensus clustering algorithm based on ensemble feature selection for scRNA-seq data analysis. The algorithm employs several unsupervised feature selections to remove genes that do not contribute significantly to the scRNA-seq data.

scEFSC algorithm exhibited superior clustering performance on the 14 scRNA-seq datasets, indicating that using multiple unsupervised feature selection algorithms can strengthen the clustering ability of consensus clustering over a single unsupervised feature selection algorithm.

□ Cello scope: a probabilistic model for marker-gene-driven cell type deconvolution in spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2022.05.24.493193v1.full.pdf

Cello scope, a novel Bayesian probabilistic graphical model of gene expression in ST data, which deconvolutes cell type composition in ST spots, and a method to infer model parameters based on a MCMC algorithm.

Cello scope was developed to assign cell types, and as such it assumes that each observation refers to only one cell. Cello scope is fully independent of scRNA-seq data, which intrinsically mitigates the risks encountered while integrating data from the two disparate platforms.

□ DeepHisCoM: deep learning pathway analysis using hierarchical structural component models

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac171/6590446

Deep-learning pathway analysis using Hierarchical structured CoMponent models (DeepHisCoM) utilizes DL methods to consider a nonlinear complex contribution of biological factors to pathways by constructing a multilayered model which accounts for hierarchical biological structure.

DeepHisCoM was shown to have a higher power in the nonlinear pathway effect and comparable power for the linear pathway effect when compared to the conventional pathway methods.

□ Simultant: simultaneous curve fitting of functions and differential equations using analytical gradient calculations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04728-5

Simultant, a software package that allows complex fitting setups to be easily defined using a simple graphical user interface. Fitting functions can be defined directly as mathematical expressions or indirectly as the solution to specified ordinary differential equations.

Simultant accelerates fitting using analytical gradient calculations, thus enabling large-scale fits to be performed. Simultant furthermore utilizes automatic gradient calculations which permits fast fitting even with many parameters.

□ Exploiting Large Datasets Improves Accuracy Estimation for Multiple Sequence Alignment

>> https://www.biorxiv.org/content/10.1101/2022.05.22.493004v1.full.pdf

Facet-NN and Facet-LR, two new scoring-function-based accuracy estimators which reimagine the original Facet estimator by using modern machine learning techniques for optimization, rather than combinatorial optimization, to exploit the much larger datasets.

An advisor contains two key components: a set of candidate parameter vectors, called an advisor set; and an accuracy estimation tool used to choose from among those vectors, called an advisor estimator.

□ scPrivacy: Privacy-preserving integration of multiple institutional data for single-cell type identification

>> https://www.biorxiv.org/content/10.1101/2022.05.23.493074v1.full.pdf

scPrivacy is an efficient, automatic single-cell type identification prototype that facilitates single-cell annotation by integrating multiple reference datasets distributed across different institutions, using a federated learning-based deep metric learning framework.

scPrivacy extends Deep Metric Learning to a federated learning framework by aggregating the model parameters of the institutions. This fully utilizes the information contained in multiple institutional datasets to train the aggregated model while avoiding physical integration of the datasets.
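
The parameter-aggregation step can be sketched as a weighted average of per-institution parameter vectors. Weighting by the number of cells per institution (FedAvg-style) is an assumption here, since the summary above only says that parameters are aggregated; the function name and numbers are hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Size-weighted average of per-institution parameter vectors."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes))
        / total
        for i in range(n_params)
    ]


# Each institution trains locally and shares only its parameters, never its cells.
institution_params = [[0.2, 1.0], [0.4, 0.0], [0.6, 2.0]]
institution_cells = [1000, 3000, 1000]
global_params = federated_average(institution_params, institution_cells)
```

With these toy numbers the aggregated vector is approximately `[0.4, 0.6]`. In a real round the server would send `global_params` back to every institution for the next round of local training.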

□ scPreGAN: a deep generative model for predicting the response of single cell expression to perturbation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac357/6593485

In many cases it is hard to collect perturbed cells, for example when one wants to know the response of a cell type to a drug before actually medicating a patient. Prediction in silico could alleviate the problem and save cost.

scPreGAN integrates an autoencoder and a generative adversarial network: the former extracts the information common to the unperturbed and perturbed data, and the latter predicts the perturbed data.

□ ChromGene: Gene-Based Modeling of Epigenomic Data

>> https://www.biorxiv.org/content/10.1101/2022.05.24.493345v1.full.pdf

ChromGene uses a mixture of hidden Markov models to model the combinatorial and spatial information of epigenomics maps. ChromGene can learn a common model across multiple cell types and use it to generate per-gene annotations for each.

ChromGene annotations are less likely to directly reflect information about gene length compared to baseline methods that incorporate information from the whole gene.

□ iSpatial: Accurate inference of genome-wide spatial expression

>> https://www.biorxiv.org/content/10.1101/2022.05.23.493144v1.full.pdf

iSpatial uses two rounds of integration to reduce potential technology bias and batch effects in PCA space, allowing accurate integration of ST and scRNA-seq datasets. iSpatial outperforms existing approaches in accuracy and can reduce false-positive and false-negative signals.

iSpatial uses weighted KNN when performing expression inference: the neighbors close to the inquired cell will be assigned higher weights than neighbors far from the cell in expression imputation.

This should reduce the over-smoothing effect for rare cell types when relatively large K is used, as the neighbors relatively far away from the rare cell types will have less impact on the inferred expression.
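
The weighting idea can be sketched as a distance-weighted average over the k nearest reference cells. Inverse-distance weights are an assumption made for illustration (the summary above does not state the exact weighting function), and the function name, coordinates and expression values are hypothetical.

```python
import math


def weighted_knn_impute(query, ref_coords, ref_expr, k=3, eps=1e-9):
    """Infer expression for `query` as a distance-weighted mean of its k nearest references."""
    nearest = sorted(
        (math.dist(query, coord), expr) for coord, expr in zip(ref_coords, ref_expr)
    )[:k]
    # Inverse-distance weights: closer neighbors contribute more.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * e for w, (_, e) in zip(weights, nearest)) / sum(weights)


# Toy embedding: one close neighbor expressing the gene, two farther ones that do not.
ref_coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
ref_expr = [10.0, 0.0, 0.0]
query = (0.1, 0.0)

imputed = weighted_knn_impute(query, ref_coords, ref_expr, k=3)
unweighted = sum(ref_expr) / len(ref_expr)
```

Here `imputed` stays close to the expression of the nearest cell, whereas the unweighted mean is dragged down by the distant neighbors. That is the over-smoothing effect the weighting is meant to limit for rare cell types.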

□ SPIRAL: Significant Process InfeRence ALgorithm for single cell RNA-sequencing and spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.05.24.493189v1.full.pdf

SPIRAL is an algorithm that relies on a Gaussian statistical model to produce a comprehensive overview of significant processes in single cell RNA-seq, spatial transcriptomics or bulk RNA-seq.

SPIRAL detects structures combining selection on both gene and sample axes. SPIRAL provides a partitioning of the cells into layers based on the expression values. SPIRAL allows for the determination of statistically significant structures, distinguished from noise.

□ NeRFax: An efficient and scalable conversion from the internal representation to Cartesian space

>> https://www.biorxiv.org/content/10.1101/2022.05.25.493427v1.full.pdf

NeRFax, an efficient method for the conversion from internal to Cartesian coordinates that utilizes the platform-agnostic JAX Python library.

A single-CPU implementation of the NeRFax algorithm consistently outperformed the state-of-the-art NeRF code for every tested protein chain length in a range of 10 to 1,000 residues, yielding a 35- to 175-fold speedup.
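
For readers unfamiliar with the internal-to-Cartesian conversion, the sketch below places a single atom from a bond length, bond angle and torsion using the sequential natural extension reference frame (NeRF) construction. It is plain Python for one atom and does not reproduce NeRFax's JAX-based vectorised implementation; all function names and coordinates are assumptions.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    norm = math.sqrt(dot(u, u))
    return tuple(a / norm for a in u)


def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d from atoms a, b, c and d's internal coordinates.

    bond_length is |c-d|, bond_angle is the angle b-c-d and torsion is
    the dihedral a-b-c-d; both angles are in radians.
    """
    bc = normalize(sub(c, b))
    n = normalize(cross(sub(b, a), bc))
    m = cross(n, bc)
    # Coordinates of d in the local frame (bc, m, n) centred on c.
    local = (
        -bond_length * math.cos(bond_angle),
        bond_length * math.sin(bond_angle) * math.cos(torsion),
        bond_length * math.sin(bond_angle) * math.sin(torsion),
    )
    return tuple(
        c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
        for i in range(3)
    )


def dihedral(a, b, c, d):
    """Signed dihedral angle a-b-c-d in radians, used to check the result."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(math.sqrt(dot(b2, b2)) * dot(b1, n2), dot(n1, n2))


a, b, c = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
d = place_atom(a, b, c, bond_length=1.5,
               bond_angle=math.radians(110), torsion=math.radians(60))
```

Building a chain means repeating `place_atom` with the three most recently placed atoms. That sequential dependency is what makes a naive implementation slow for long chains.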

□ ICAT: A Novel Algorithm to Robustly Identify Cell States Following Perturbations in Single Cell Transcriptomes

>> https://www.biorxiv.org/content/10.1101/2022.05.26.493603v1.full.pdf

Identify Cell states Across Treatments (ICAT) employs self-supervised feature weighting followed by semi-supervised clustering to identify cell states across scRNA-seq perturbation experiments with high accuracy.

ICAT does not require prior knowledge of marker genes or extant cell states, is robust to perturbation severity, and identifies cell states with higher accuracy than leading integration workflows within both simulated and real scRNA-seq perturbation experiments.

□ Accelerating single-cell genomic analysis with GPUs

>> https://www.biorxiv.org/content/10.1101/2022.05.26.493607v1.full.pdf

RAPIDS K-Nearest Neighbors (KNN) graph construction, UMAP visualization, and Louvain clustering, had previously been integrated into the Scanpy framework.

RAPIDS can be used to load an scATAC-seq fragment file using the cuDF library and create sequencing coverage tracks for selected regions in each cluster, thus enabling interactive cluster-specific visualization alongside interactive clustering.

□ SCADIE: simultaneous estimation of cell type proportions and cell type-specific gene expressions using SCAD-based iterative estimating procedure

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02688-w

Unsupervised methods are useful in situations of cell type discovery or lack of supervising information, but as there is no guarantee that their inferred cell types have one-to-one mapping to actual cell types, annotating cell types remains a challenge.

SCADIE requires either bulk gene expression matrices and cell type proportions or bulk gene expression matrices and shared signature matrix as input; the cell type proportions can be obtained by any deconvolution method.

□ LoRTIS Software Suite: Transposon mutant analysis using long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2022.05.26.493556v1.full.pdf

LoRTIS-SS uses the Snakemake framework to manage the workflow. The workflow uses long-read nucleotide sequence data such as those generated by the MinION sequencer.

The software workflow outputs data compatible with the established Bio-TraDIS analysis toolkit allowing for existing workflows to be easily upgraded to support long-read sequencing.

□ CNpare: matching DNA copy number profiles

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac371/6596047

CNpare identifies similar cell line models based on genome-wide DNA copy number. CNpare compares copy number profiles using four different similarity metrics, quantifies the extent of genome differences between pairs, and facilitates comparison based on copy number signatures.

CNpare can also be applied to other settings including: quality control - ensuring the sequenced copy number profile of a cell line matches the reference profile; assessing differences between cell line cultures - by estimating the percentage genome difference.

□ PRRR: A Poisson reduced-rank regression model for association mapping in sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.05.31.494236v1.full.pdf

Poisson RRR (PRRR) and nonnegative Poisson RRR (nn-PRRR) jointly model associations within two high-dimensional paired sets of features where the response variables are counts.

PRRR is able to detect associations between a high-dimensional response matrix and a high-dimensional set of predictors by leveraging low-dimensional representations of the data.

PRRR is able to properly account for the count-based nature of single-cell RNA sequencing data using a Poisson likelihood. PRRR uses a Poisson likelihood to model the transcript counts for each cell as the response variables, conditional on observed cell-specific covariates.
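
A minimal sketch of a Poisson likelihood with a reduced-rank coefficient matrix follows. It assumes a log link, i.e. counts Y ~ Poisson(exp(X U V)) with U of shape p x r and V of shape r x q; that parameterisation, the function names and the toy matrices are assumptions for illustration and may not match the paper's exact formulation.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def poisson_rrr_loglik(X, Y, U, V):
    """Poisson log-likelihood of counts Y with log-rate X @ U @ V."""
    eta = matmul(matmul(X, U), V)  # n x q linear predictor through a rank-r bottleneck
    ll = 0.0
    for eta_row, y_row in zip(eta, Y):
        for e, y in zip(eta_row, y_row):
            # log Poisson pmf: y * log(rate) - rate - log(y!)
            ll += y * e - math.exp(e) - math.lgamma(y + 1)
    return ll


# Toy example: 2 cells, 2 covariates, 2 genes, rank 1.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1, 2], [1, 1]]
U = [[1.0], [0.0]]
V = [[0.0, math.log(2)]]

ll_fit = poisson_rrr_loglik(X, Y, U, V)
ll_null = poisson_rrr_loglik(X, Y, [[0.0], [0.0]], V)
```

The rank r controls how many latent factors link covariates to genes; fitting would maximise this log-likelihood over U and V, for example by gradient ascent.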

□ Matilda: Multi-task learning from single-cell multimodal omics

>> https://www.biorxiv.org/content/10.1101/2022.06.01.494441v1.full.pdf

Matilda, a neural network-based multi-task learning method for integrative analysis of single-cell multimodal omics data. Matilda simultaneously performs data simulation, dimension reduction, cell type classification, and feature selection using a gradient descent procedure.

Matilda learns to combine and reduce the feature dimensions of single-cell multimodal omics data to a latent space using its VAE component in the framework.

The potential mismatch of cell types in the query datasets may have a significant impact on the performance of Matilda. A solution may be to utilise the prediction probability of the neural network for deciding whether a cell in a query dataset should be classified or not.

□ Markonv: a novel convolutional layer with inter-positional correlations modeled

>> https://www.biorxiv.org/content/10.1101/2022.06.09.495500v1.full.pdf

Markonv layer (Markov convolutional neural layer), a novel convolutional neural layer with Markov transition matrices as its filters, to model the intrinsic dependence in inputs as Markov processes.

Markonv-based networks could not only identify functional motifs with inter-positional correlations in large-scale omics sequence data effectively, but also decode complex electrical signals generated by Oxford Nanopore sequencing efficiently.

□ SNIKT: sequence-independent adapter identification and removal in long-read shotgun sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac389/6607583

Snikt (Slice Nucleotides Into Klassifiable Tequences) is a program that reports a visual confirmation of adapter or systemic contamination in whole-genome shotgun (WGS) or metagenomic sequencing DNA or RNA reads and based on user input, trims sequence ends to remove them.

Snikt works w/o prior information about the adapter sequence, making it applicable even when this information is unavailable. It is most suitable for long reads, because read end trimming for long reads does not have a significant impact on the overall read throughput post-cleaning.

□ DIVE: a reference-free statistical approach to diversity-generating and mobile genetic element discovery

>> https://www.biorxiv.org/content/10.1101/2022.06.13.495703v1.full.pdf

DIVE, a novel statistical, reference-free paradigm for de novo discovery of MGEs and DGMs by identifying k-mer sequences associated with high rates of sequence diversification.

DIVE generates a target dictionary with an online clustering method that collapses targets within "sequencing error" distance. It then models the number of clusters formed at each step using a Poisson-Binomial model.
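
The collapsing step can be pictured as a greedy online clustering in which each incoming target either joins an existing cluster or founds a new one. The sketch below treats "sequencing error" distance as a Hamming-distance threshold on equal-length targets, which is an assumption; it also omits the Poisson-Binomial modelling, and the function names and sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def online_cluster(targets, max_dist=1):
    """Greedy online clustering of targets.

    A target joins a cluster if it is within max_dist mismatches of that
    cluster's representative; otherwise it founds a new cluster. Returns
    the representatives and the cluster count after each step.
    """
    representatives, counts = [], []
    for t in targets:
        if not any(hamming(t, r) <= max_dist for r in representatives):
            representatives.append(t)
        counts.append(len(representatives))
    return representatives, counts


targets = ["ACGT", "ACGA", "TTTT", "ACGT", "GGGG", "TTTA"]
reps, counts = online_cluster(targets)
```

For this toy input `reps` is `["ACGT", "TTTT", "GGGG"]` and `counts` is `[1, 1, 2, 2, 3, 3]`; the per-step cluster count is the quantity a statistical model of diversification would then be fitted to.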

Top Gun: Maverick

2022-06-05 23:12:13 | Movies

"Top Gun: Maverick"

>> https://www.topgunmovie.com/home/

□ Harold Faltermeyer, Lady Gaga, Hans Zimmer, & Lorne Balfe – “The Man, The Legend / Touchdown”

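The finest aerial sequences and lyrical drama that can be made today.

It is easy to imagine how happy a storyteller must be to make a film like this.

The surging climax, overloaded with action and "romance", kept my heart trembling the whole way through.

[Addendum]
I think the scene where the F-14's wings sweep open is one of the great scenes in film history.

[Addendum]

For my milestone tenth viewing of "TOP GUN: Maverick", I watched it in 4DX ScreenX. An aerial world that fills a 180° field of view. Every time a fighter rolls, the horizon filling my vision flips over, and with the seat moving in sync it felt as if I had been turned upside down myself.

Watched "TOP GUN: Maverick" in 4DX (third viewing). The vibration and wind effects overwhelm the MX4D version, and for dogfight immersion 4DX wins hands down. However, I felt the MX4D version rendered acceleration and scent more delicately. That said, because the drama holds up firmly even during the action, the 4D effects often felt intrusive.

Bought an F-14 keychain (made by SPARTA Pewter) and an F-14 T-shirt ✈︎. The keychain is an air and space museum item that was actually used in "Top Gun: Maverick" (the one Tom was holding), and the film's popularity has made it hard to obtain. I bought it as a gift, so it is sad that it will leave my hands soon 🥺 "BUT NOT TODAY."

[Addendum]

Watching "TOP GUN: Maverick" (iTunes 4K HDR) on my 4K OLED TV at home, I marvel at the fine texture and the deep, tight blacks. The gloss comes across even in a photo taken straight off the screen. On top of that, my setup supports Dolby Vision / Dolby Atmos, so I can focus on detail I could not catch in the theater. Now if only the seat moved like in 4DX…😇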

Simple Obsession.

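2022-06-04 23:58:01 | Diary / Essays / Columns

In society there are quite a few ways of framing problems, such as summarization and quantitative analysis, for which a "correct answer" can be found. Yet in an environment where nine out of ten people get it wrong, it is hard for the lone correct answer to be accepted. The homogenization of ability within a company is sometimes merely an "outcome" and must not be mistaken for optimization.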

Offshore (Evolution Extended Mix)

2022-06-03 18:06:06 | music19

□ Chicane / Offshore (Evolution Extended Mix)

Taken from the Album “Far From The Maddening Crowds (Evolution Mixes)”

Release Date; tba.

I Can Almost See You.

2022-06-03 18:03:06 | Music20

□ Hammock - I Can Almost See You

Hammock: https://lnk.to/hammock-music
Vinyl / Merch: http://shop.hammockmusic.com/merch
Director: David Altobelli
Producer: Sarah Park, Ryan Kohler, Ross Girard
DP: Larkin Seiple
Prod. Design: Ethan Feldbau
Exec. Producer: Sue Yeon Ahn
Prod Co.: The Directors Bureau
Additional VFX: Paul Santagada
Starring: Jessica Calleiro

Lang ist Die Zeit, es ereignet sich aber Das Wahre.