lens, align.

Long is the time, but what is true comes to pass.

Hyperquant.

2022-12-31 22:13:31 | Science News

If áll time is etérnally présent
all time is únredéemable.




□ HyperHMM: Efficient inference of evolutionary and progressive dynamics on hypercubic transition graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac803/6895098

HyperHMM is an adapted Baum-Welch (expectation maximisation) algorithm for hypercubic inference, with resampling to quantify uncertainty. It allows orders-of-magnitude faster inference while making few practical sacrifices compared to previous hypercubic inference approaches.

The HyperHMM algorithm proceeds by iteratively estimating forward and backward probabilities of the different transitions observed in the dataset, given a current estimate of the hypercubic transition matrix.

Hypercubic inference learns the transition probabilities, finding the parameterisation most compatible with a set of emitted observations. The result can be interpreted as a probability map of which feature is likely to be acquired at which stage, or as explicit pathways through the hypercube space.
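To make the "hypercubic transition graph" concrete, here is a minimal sketch of the state space only, not the HyperHMM algorithm itself. It assumes states are binary feature vectors and that each transition acquires exactly one feature; the function names and the uniform starting guess are illustrative assumptions.

```python
from itertools import product


def hypercube_transitions(n_features):
    """Map each state (tuple of 0/1) to the states reachable by acquiring one feature."""
    edges = {}
    for state in product((0, 1), repeat=n_features):
        successors = []
        for i, bit in enumerate(state):
            if bit == 0:
                # Flip feature i from absent to present.
                successors.append(state[:i] + (1,) + state[i + 1:])
        edges[state] = successors
    return edges


def uniform_transition_probs(edges):
    """Hypothetical starting guess: every allowed acquisition is equally likely."""
    return {
        (src, dst): 1.0 / len(dsts)
        for src, dsts in edges.items()
        for dst in dsts
    }


edges = hypercube_transitions(3)
probs = uniform_transition_probs(edges)
```

An EM-style procedure such as Baum-Welch would then re-estimate these edge probabilities from the observations instead of keeping them uniform.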






□ Hypergraphs and centrality measures identifying key features in gene expression data

>> https://www.biorxiv.org/content/10.1101/2022.12.18.518108v1

The hypergraph modelling approach presented is designed to interrogate a data set, consisting of a structured collection of labelled multi-dimensional data records. Each data record is tested against a list of conditions of interest, giving a sequence of Boolean results.

The vertices of the hypergraph correspond to the conditions and the hyperedges correspond to the data records, with a hyperedge incident with a vertex if the data record satisfies the given condition.
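The construction can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the record fields, condition names and thresholds are hypothetical.

```python
def build_hypergraph(records, conditions):
    """Return {record label: set of condition names the record satisfies}.

    Each record becomes a hyperedge; each condition is a vertex.
    """
    return {
        label: {name for name, test in conditions.items() if test(data)}
        for label, data in records.items()
    }


# Hypothetical data records and conditions of interest.
records = {
    "gene_a": {"expr": 5.0, "fold": 2.5},
    "gene_b": {"expr": 1.0, "fold": 3.0},
    "gene_c": {"expr": 0.5, "fold": 0.2},
}
conditions = {
    "high_expr": lambda r: r["expr"] > 3,
    "upregulated": lambda r: r["fold"] > 2,
}

hyperedges = build_hypergraph(records, conditions)
```

Each record's sequence of Boolean results is exactly the membership pattern of its hyperedge; a record that satisfies no condition yields an empty hyperedge.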

The 2-multiplicity hyperedge, with its distinct intersection pattern, forms a pendant vertex and centers strictly around comparisons between the agravitropic and gravitropic phenotypes.

Robust distance measures were obtained by representing hypergraphs in terms of s-line graphs. This definition of distance enabled the calculation of multiple centrality measures, with particular emphasis on betweenness and eigencentrality.
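A small sketch of the s-line graph idea, using the standard definition: the nodes are the hyperedges, and two hyperedges are adjacent when they share at least s vertices. The example hyperedges are hypothetical, and the breadth-first distance is only one simple way to read a distance off that graph.

```python
from collections import deque
from itertools import combinations


def s_line_graph(hyperedges, s=1):
    """Adjacency between hyperedges that share at least s vertices."""
    adjacency = {name: set() for name in hyperedges}
    for a, b in combinations(hyperedges, 2):
        if len(hyperedges[a] & hyperedges[b]) >= s:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def s_distance(adjacency, source, target):
    """Shortest path length in the s-line graph, or None if unreachable."""
    seen = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return seen[node]
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return None


# Hypothetical hyperedges over vertices a..e.
example_edges = {"e1": {"a", "b", "c"}, "e2": {"b", "c", "d"}, "e3": {"d", "e"}}
line1 = s_line_graph(example_edges, s=1)
line2 = s_line_graph(example_edges, s=2)
```

Raising s keeps only the stronger overlaps, so distances can grow or become undefined; centrality measures such as betweenness are then computed on the resulting ordinary graph.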





□ MIDAS: a deep generative model for mosaic integration and knowledge transfer of single-cell multimodal data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520262v1

MIDAS (mosaic integration and knowledge transfer) simultaneously achieves dimensionality reduction, imputation, and batch correction of single-cell trimodal mosaic data by employing self-supervised modality alignment and information-theoretic latent disentanglement.

MIDAS uses self-supervised learning to align different modalities in latent space, improving cross-modal inference. The scalable inference of MIDAS is achieved through Stochastic Gradient Variational Bayes (SGVB), which enables “rectangular integration” and atlas construction.





□ HydRA: Deep-learning models for predicting RNA-binding capacity from protein interaction association context and protein sequence

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521837v1

HydRA enables Occlusion Mapping to robustly detect known RNA-binding domains and to predict hundreds of uncharacterized RNA-binding domains. HydRA scores are highly correlated with the number of experimental studies that identify a given RBP as cross-linkable to RNA.

The HydRA algorithm applies an ensemble learning method that integrates a convolutional neural network, a Transformer and an SVM for RBP prediction, utilizing both intermolecular protein context and sequence-level information.






□ TrAGEDy: Trajectory Alignment of Gene Expression Dynamics

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521424v1

TrAGEDy makes post-hoc changes to the alignment, allowing us to overcome the limitations of Dynamic Time Warping. TrAGEDy aligns the pseudotime of the interpolated points, then of the cells, and performs a sliding window comparison between cells at similar points in aligned pseudotime.

TrAGEDy finds the optimal path through the dissimilarity matrix of the interpolated points, which constitutes the shared process between the two trajectories. DTW, with alterations, is used to find the optimal path.

Another constraint of DTW is that all points must be matched to at least one other point. After DTW, TrAGEDy prunes any matches that have high transcriptional dissimilarity, which accommodates processes that may have diverged in the middle of their respective trajectories.
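For orientation, here is a sketch of classic DTW over a dissimilarity matrix followed by a simple pruning step. It is not TrAGEDy's modified DTW: the plain recurrence, the `prune_matches` helper and its threshold are assumptions made for illustration.

```python
def dtw_path(dissim):
    """Classic DTW: minimal cumulative cost and the matching path through dissim."""
    n, m = len(dissim), len(dissim[0])
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = dissim[i - 1][j - 1] + best_prev
    # Backtrack from the bottom-right corner to recover the matches.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves)
    path.reverse()
    return cost[n][m], path


def prune_matches(path, dissim, threshold):
    """Hypothetical post-DTW step: drop matches whose dissimilarity is too high."""
    return [(i, j) for i, j in path if dissim[i][j] <= threshold]


total, path = dtw_path([[0, 1, 2], [1, 0, 1], [2, 1, 0]])
diverged = [[0, 4], [4, 3]]
total2, path2 = dtw_path(diverged)
kept = prune_matches(path2, diverged, threshold=1)
```

Plain DTW forces every point into at least one match; the pruning step is what allows poorly matching points to stay unaligned.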







□ XCVATR: detection and characterization of variant impact on the Embeddings of single-cell and bulk RNA-sequencing samples

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-09004-7

XCVATR performs a multiscale analysis of the distance matrices to identify variant clumps. XCVATR performs a scale selection to tune the analysis to the cell–cell distance metric. For each cell, XCVATR identifies the Nν cells that are closest to it, which defines the close neighborhood of that cell.

XCVATR builds a matrix and computes the estimated alternative allele frequency (AF). XCVATR performs a cell-centered analysis, wherein it does not aim to model the whole embedding space, but rather focuses on the cells. XCVATR identifies the medians of the minimum and maximum radii over all cells.





□ LuxHMM: DNA methylation analysis with genome segmentation via Hidden Markov Model

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521327v1

LuxHMM is a probabilistic method that uses a hidden Markov model (HMM) to segment the genome into regions and a Bayesian regression model, which can handle multiple covariates, to infer differential methylation of regions.

LuxHMM determines hypo- and hypermethylated regions. LuxHMM can describe the underlying biochemistry in bisulfite sequencing. Model inference is done using either automatic differentiation variational inference, for genome-scale analysis, or Hamiltonian Monte Carlo.





□ Asteroid: a new algorithm to infer species trees from gene trees under high proportions of missing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac832/6964379

Asteroid is a novel algorithm that infers an unrooted species tree from a set of unrooted gene trees. Asteroid is substantially more accurate than ASTRAL and ASTRID for very high proportions of missing data.

Asteroid is parallelized, and can take multi-furcating gene trees as input. For each input gene tree, Asteroid computes a distance matrix based on the gene internode distance. It then computes a species tree from this set of distance matrices under the minimum balanced evolution principle.





□ Liam tackles complex multimodal single-cell data integration challenges

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521399v1

Liam (leveraging information across modalities) is a model for the simultaneous horizontal / vertical integration of paired multimodal single-cell data. Liam learns a joint low-dimensional representation of two concurrently measured modalities.

Liam accounts for complex batch effects using CVAE / AVAE and can be optimized using replicate information. Liam employs a logistic-normal distribution for the latent cell variable, making the latent factor loadings interpretable as probabilities.





□ scTensor detects many-to-many cell-cell interactions from single cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519225v1

scTensor is a novel method for extracting representative triadic relationships, incl. ligand expression, receptor expression, and related L-R pairs. scTensor detects hypergraphs that cannot be detected using conventional CCI detection, especially when they incl. many-to-many relationships.

scTensor constructs the CCI-tensor and decomposes it with the NTD-2 algorithm. scTensor estimates the NTD-2 ranks for each matricized CCI-tensor. Because NMF is performed on each matricized CCI-tensor, the rank of each NMF is estimated based on the residual sum of squares.





□ NPGREAT: assembly of human subtelomere regions with the use of ultralong nanopore reads and linked-reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05081-3

NanoPore Guided REgional Assembly Tool (NPGREAT) combines Linked-Read data with mapped ultralong nanopore reads spanning subtelomeric segmental duplications to potentially overcome these difficulties.

Linked-Read sets of DNA sequences identified by matches with 1-copy subtelomere sequence adjacent to segmental duplications are assembled and extended into the segmental duplication regions using Regional Extension of Assemblies using Linked-Reads (REXTAL).

REXTAL contig alignment with the cognate nanopore read sequence is monitored for alignment discrepancies above a given threshold. Mapped telomere-containing ultralong nanopore reads are used to provide contiguity and correct orientation for matching REXTAL sequence.





□ SC3s: efficient scaling of single cell consensus clustering to millions of cells

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05085-z

SC3s takes as input the gene-by-cell expression matrix, after preprocessing and dimensionality reduction via PCA using Scanpy commands. SC3s attempts to combine the results of multiple clustering runs, where the number of principal components is changed.

All this information is then encoded into a binary matrix, which can be efficiently used to produce the final k cell clusters. The key difference from the original SC3 is that for each number of principal components d, the cells are first grouped into microclusters which can be reused for multiple values of k.
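One way to picture the binary matrix is a one-hot encoding of every clustering run, concatenated column-wise. The sketch below assumes exactly that encoding; it is not taken from the SC3s code base, and the function name and example labels are hypothetical.

```python
def encode_runs(runs):
    """One-hot encode several clustering runs into a cells x clusters 0/1 matrix.

    runs: list of label lists, one list per run, each with one label per cell.
    """
    n_cells = len(runs[0])
    columns = []
    for labels in runs:
        for cluster in sorted(set(labels)):
            # One indicator column per cluster of this run.
            columns.append([1 if lab == cluster else 0 for lab in labels])
    return [[col[c] for col in columns] for c in range(n_cells)]


# Two hypothetical runs over four cells.
binary = encode_runs([[0, 0, 1, 1], [0, 1, 1, 1]])
```

Cells that are repeatedly clustered together end up with similar rows, so a final clustering of the rows yields the consensus clusters.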





□ Spectra: Supervised discovery of interpretable gene programs from single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521311v1

Spectra overcomes the dominance of cell-type signals by modeling cell-type-specific programs, and can characterize interpretable cell states along a continuum.

Spectra retrieves gene programs from scRNA-seq data using biological priors. As input, Spectra receives a gene expression count matrix with cell type labels for each cell, as well as pre-defined gene sets, which it converts to a gene-gene graph.

The algorithm fits a factor analysis using a loss function that optimizes reconstruction of the count matrix and guides factors to support the input gene-gene graph. As output, Spectra provides factor loadings and gene programs corresponding to cell types and cellular processes.





□ DEAPLOG: A method for differential expression analysis and pseudo-temporal locating and ordering of genes in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521359v1

DEAPLOG is a method for differential expression analysis and pseudo-temporal locating and ordering of genes in sc-transcriptomic data. DEAPLOG infers pseudo-time / embedding coordinates of genes, and is therefore useful for identifying regulators along the trajectory of cell fate decisions.

DEAPLOG identifies a large number of statistically significant DEGs. DEAPLOG defines the point with the maximum curvature on the fitting curve of a gene expression as threshold. DEAPLOG combines polynomial fitting and hypergeometric distribution.





□ SCellBOW: Latent representation of single-cell transcriptomes enables algebraic operations on cellular phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.28.522060v1

SCellBOW uses Doc2vec, which is a bag-of-words model, and is therefore independent of any strict ordering of genes. The SCellBOW algorithm provides a latent representation of single cells in a manner that captures the 'semantics' associated with cellular phenotypes.

The neuronal weights learned by SCellBOW are transferable. These representations, aka embeddings, allow algebraic operations such as +/-. SCellBOW-based vector representation of cellular transcriptomes preserves their phenotypic relationships in a vector space.





□ SEISM: Neural Networks beyond explainability: Selective inference for sequence motifs

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521748v1

SEISM is a selective inference procedure to test the association between the extracted features and the predicted phenotype. SEISM builds on the observation that training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score.

SEISM partitions the space of motifs to quantize the selection. The selection event is the set of phenotype vectors. SEISM uses 50,000 replicates under the conditional null hypothesis using the hypersphere direction sampler, after 10,000 burn-in iterations.





□ mapquik: Efficient low-divergence mapping of long reads in minimizer space

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521809v1

Instead of using a single minimizer as a seed to a genome (as in minimap2), mapquik builds accurate longer seeds by anchoring alignments through matches of k consecutively-sampled minimizers (k-min-mers).

mapquik borrows from natural language processing, where the tokens of the k-mers are the minimizers instead of base-pair letters. mapquik's application of minimizer-space computation is entirely distinct from genome assembly, as no de Bruijn graph is constructed.

Indexing the long minimizer-space seeds (k-min-mers) that occur uniquely in the genome is sufficient for mapping. mapquik devises a provably O(n) time pseudo-chaining algorithm, which improves upon the next-best O(n log n) runtime of all other known colinear chaining algorithms.





□ ASTER: accurately estimating the number of cell types in single-cell chromatin accessibility data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac842/6961187

ASTER is an ensemble learning-based tool for accurately estimating the number of cell types in scCAS data. ASTER performs estimation based on the Davies-Bouldin index.

ASTER calculates the mean silhouette coefficient of all cells based on Louvain and Leiden clustering. The number of clusters that gives the maximum coefficient is adopted as the optimal number of clusters.





□ NanoSNP: A progressive and haplotype-aware SNP caller on low coverage Nanopore sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac824/6957086

NanoSNP is a novel deep learning-based SNP calling method to identify the SNP sites (excluding short indels) based on low-coverage Nanopore sequencing reads. NanoSNP utilizes the naive pileup feature to predict a subset of SNP sites with a Bi-LSTM network.

NanoSNP has the highest precision score and second highest recall and F1 score on each dataset compared to Clair, Clair3, Pepper-DeepVariant, and NanoCaller. NanoSNP extracts the features from both the alignment before WhatsHap phasing and the phased alignment.





□ SpaGFT is a graph Fourier transform for tissue module identification from spatially resolved transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.12.10.519929v1

SpaGFT transforms complex gene expression patterns into simple, but informative signals, leading to the accurate identification of spatially variable genes (SVGs) at a fast computational speed.

SpaGFT generates a novel representation of gene expression (GE) and the corresponding spot graph topology in a Fourier space, which enables tissue module (TM) identification and enhances SVG prediction. The low-frequency Fourier mode (FM) signals of SVGs are selected as features to identify SVG clusters using Louvain clustering.





□ EnDecon: cell type deconvolution of spatially resolved transcriptomics data via ensemble learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac825/6957087

EnDecon obtains the ensemble result by alternately updating two quantities: the ensemble result, as a weighted median of the base deconvolution results, and the weights of the base results, based on their distance from the ensemble result.
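The alternating scheme can be sketched as follows. This is a toy version under assumptions: the L1 distance, the inverse-distance weight rule, the fixed iteration count and the example proportions are all illustrative choices, not EnDecon's actual update equations.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def ensemble(base_results, n_iter=20, eps=1e-8):
    """Alternate between the weighted-median consensus and the method weights."""
    n_methods = len(base_results)
    n_entries = len(base_results[0])
    weights = [1.0 / n_methods] * n_methods
    consensus = []
    for _ in range(n_iter):
        # Step 1: entry-wise weighted median of the base results.
        consensus = [
            weighted_median([r[k] for r in base_results], weights)
            for k in range(n_entries)
        ]
        # Step 2 (assumed rule): weight each method by inverse L1 distance.
        dists = [
            sum(abs(r[k] - consensus[k]) for k in range(n_entries))
            for r in base_results
        ]
        inverse = [1.0 / (d + eps) for d in dists]
        total = sum(inverse)
        weights = [v / total for v in inverse]
    return consensus, weights


# Hypothetical cell type proportions for one spot from three base methods.
base = [[0.5, 0.3, 0.2], [0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
consensus, weights = ensemble(base)
```

The outlying third method ends up with a weight close to zero, which is the point of using a median with distance-based weights.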

EnDecon correctly locates cell types to specific spatial regions, which are consistent with the gene expression patterns of the corresponding cell type marker genes. Furthermore, the regions enriched for each cell type are in line with the regions where that cell type is located.





□ STREAM: Enhancer-driven gene regulatory networks inference from single-cell RNA-seq and ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2022.12.15.520582v1

STREAM (Single-cell enhancer regulaTory netwoRk inference from gene Expression And ChroMatin accessibility) is a computational framework to infer enhancer-driven gene regulatory networks (eGRNs) from jointly profiled scRNA-seq and scATAC-seq data.

STREAM combines the Steiner forest problem (SFP) model and submodular optimization, respectively, to discover the enhancer-gene relations and TF-enhancer-gene relations in a global optimization manner. STREAM formulates the eGRN inference by detecting a set of hybrid biclusters.





□ CAbiNet: Joint visualization of cells and genes based on a gene-cell graph

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521232v1

“Correspondence Analysis based Biclustering on Networks” (CAbiNet) produces a joint visualization and co-clustering of cells and genes in a planar embedding. CAbiNet employs CA to build a graph in which the nodes comprise both cells and genes.

Then a clustering algorithm determines the cell-gene clusters from the graph. Finally, the cells, genes and the clustering results are visualized in a 2D-embedding (biMAP). Cells and genes from the same cluster are colored identically in the biMAP.





□ scPROTEIN: A Versatile Deep Graph Contrastive Learning Framework for Single-cell Proteomics Embedding

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520366v1

scPROTEIN is a novel versatile framework designed for single-cell proteomic data analysis. It is composed of peptide uncertainty estimation based on a multi-task heteroscedastic regression model and cell embedding learning based on graph contrastive learning.

scPROTEIN can construct a cell graph based on spatial proximity. scPROTEIN contains four major components: data augmentation, a GCN-based graph encoder, node-level graph contrastive learning and an alternated topology-attribute denoising module.





□ Quantum-Si

>> https://ir.quantum-si.com/news-releases/news-release-details/quantum-si-announces-commercial-availability-platinumtm-worlds/

Introducing the world’s 1st next-generation single-molecule protein sequencing platform — #Platinum™. Learn more about this simple-to-use system and its low price point, unique design, and advanced capabilities here: ir.quantum-si.com/news-releases/… $QSI #ProteinSequencing #Biotech #NGS

"by monitoring for amino-acid specific patterns in fluorescent probe behavior. This means that a single probe can be used for the robust identification of multiple distinct amino acids, including those containing post translational modifications."





□ Dissecting Complexity: The Hidden Impact of Application Parameters on Bioinformatics Research

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521257v1

SOMATA is a methodology to facilitate systematic exploration of the vast choice of configuration options. The authors apply it to three different tools on a range of scientific inquiries.

SOMATA involves Selecting tools and data, identifying Objective metrics, Modeling the parameter space, choosing a sample design Approach, Testing, and Analyzing. A single parameter — MaxO — was varied since that is intuitively related to growth, the output objective of interest.





□ DRfold: Integrating end-to-end learning with deep geometrical potentials for ab initio RNA structure prediction

>> https://www.biorxiv.org/content/10.1101/2022.12.30.522296v1

DRfold predicts RNA tertiary structures by simultaneous learning of local frame rotations and geometric restraints from experimentally solved RNA structures, where the learned knowledge is converted into a hybrid energy potential to guide subsequent RNA structure constructions.

The core of the DRfold pipeline is the introduction of two types of complementary potentials, i.e., FAPE potential and geometry potentials, from two separate transformer networks.

The former model directly predicts the rotation matrix and the translation vector for the frames representing each nucleotide, forming an end-to-end learning strategy for RNA structure.





□ A Boolean Algebra for Genetic Variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad001/6967432

The paper defines a comprehensive set of Boolean relations (equivalence, containment, overlap and disjoint) that partitions the domain of binary variant relations. Using these relations, additional variants of interest, i.e., variants with a specific relation to the queried variant, can be identified.

The relations can be computed efficiently using a novel algorithm that computes all minimal alignments. Filtering on the maximal influence interval allows for calculating the relations for all pairs of variants for an entire gene.





□ RGT: a toolbox for the integrative analysis of high throughput regulatory genomics data

>> https://www.biorxiv.org/content/10.1101/2022.12.31.522372v1

RGT provides three core classes to handle the genomic regions and signals. Each genomic region is represented by the GenomicRegion class and multiple regions are represented by the GenomicRegionSet class. The genomic signals are represented by the CoverageSet class.

Several tools have been developed, namely: HINT for analysis of ATAC/DNase-seq; RGT-viz for finding associations between chromatin experiments; TDF for DNA/RNA triplex domain finding; THOR for differential peak calling; and Motif analysis for transcription factor binding site matching.





□ MuLan-Methyl: Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction

>> https://www.biorxiv.org/content/10.1101/2023.01.04.522704v1

The output of MuLan-Methyl is based on the average of the prediction probabilities obtained by transformer-based language models, namely BERT, DistilBERT, ALBERT, XLNet and ELECTRA. Each of the five language models is trained according to the “pre-train / fine-tune” paradigm.
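The averaging step is simple enough to sketch directly. The probabilities below are made-up values for a single site, and the 0.5 decision threshold is an assumption rather than something stated in the summary.

```python
def ensemble_probability(model_probs):
    """Average the per-model prediction probabilities for one sample."""
    return sum(model_probs.values()) / len(model_probs)


# Hypothetical outputs of the five fine-tuned language models for one site.
site_probs = {
    "BERT": 0.91,
    "DistilBERT": 0.85,
    "ALBERT": 0.78,
    "XLNet": 0.88,
    "ELECTRA": 0.83,
}
averaged = ensemble_probability(site_probs)
is_methylated = averaged >= 0.5  # assumed decision threshold
```

Averaging probabilities rather than hard votes lets a confident model outweigh several uncertain ones.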





□ ACIDES: In-silico monitoring of directed evolution convergence to unveil best performing variants with credibility score

>> https://www.biorxiv.org/content/10.1101/2023.01.03.522172v1

ACIDES (Accurate Confidence Intervals to rank Directed Evolution Scores) is a combination of statistical inference and in-silico simulations to reliably estimate the selectivity of individual variants and their statistical error using the data from all available rounds.

ACIDES realizes a 50- to 70-fold improvement over the Poisson model in the predictive ability of the NGS sampling noise. ACIDES uses simulations to quantify a Rank Robustness (RR), a measure of the quality of the selection convergence.





□ ElasticBLAST: Accelerating Sequence Search via Cloud Computing

>> https://www.biorxiv.org/content/10.1101/2023.01.04.522777v1

One of the ElasticBLAST parameters that is critical to its performance is the batch length, which specifies the number of bases or residues per query batch. ElasticBLAST automatically selects an appropriate instance type for a search, based on database metadata and the BLAST program.
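To illustrate what a batch length means in practice, the sketch below greedily groups query sequences so that each batch stays within a given number of bases or residues. The greedy rule and the function name are assumptions for illustration; ElasticBLAST's own splitting logic may differ.

```python
def split_into_batches(queries, batch_len):
    """Greedily group (name, sequence) pairs into batches of at most batch_len letters.

    A single sequence longer than batch_len still gets a batch of its own.
    """
    batches, current, current_len = [], [], 0
    for name, seq in queries:
        if current and current_len + len(seq) > batch_len:
            batches.append(current)
            current, current_len = [], 0
        current.append((name, seq))
        current_len += len(seq)
    if current:
        batches.append(current)
    return batches


# Hypothetical queries of 40, 50, 30 and 120 bases.
queries = [("q1", "A" * 40), ("q2", "C" * 50), ("q3", "G" * 30), ("q4", "T" * 120)]
batches = split_into_batches(queries, batch_len=100)
batch_names = [[name for name, _ in batch] for batch in batches]
```

Smaller batches spread work across more workers but add per-batch overhead, which is why the parameter matters for performance.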





Enigma.

2022-12-31 22:13:17 | Science News

(Generated by Midjourney)



□ DRAGON: Determining Regulatory Associations using Graphical models on multi-Omic Networks

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1157/6931867

DRAGON calibrates its parameters to achieve an optimal trade-off between the network’s complexity and estimation accuracy, while explicitly accounting for the characteristics of each of the assessed omics ‘layers.’

DRAGON is a partial correlation framework. It could be extended to Mixed Graphical Models, which incorporate both continuous and discrete variables. DRAGON adapts to edge density and feature size differences between omics layers, improving model inference and edge recovery.





□ Sparse RNNs can support high-capacity classification

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010759

A sparsely connected recurrent neural network (RNN) can perform classification in a distributed manner without ever bringing all of the relevant information to a single convergence site.

To investigate capacity and accuracy, networks were trained by back-propagation through time (BPTT). Hebbian-based sparse RNN readout accumulates evidence while the stimulus is on and amplifies the response when a +1-labeled input is shown.





□ Detecting bifurcations in dynamical systems with CROCKER plots

>> https://aip.scitation.org/doi/abs/10.1063/5.0102421

The CROCKER plot was developed in the context of dynamic metric spaces. The additional restrictions mean that the time-varying point clouds under study have labels on vertices from one parameter value to the next, allowing for more available theoretical results on continuity.

The CROCKER plot can be used for understanding bifurcations in dynamical systems. This construction is closely related to the 1-Wasserstein distance used for persistence diagrams, and the authors make connections between this and the maximum Lyapunov exponent, a commonly used measure for chaos.





□ novoRNABreak: local assembly for novel splice junction and fusion transcript detection from RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.12.16.520791v1

novoRNABreak is based on a local assembly model, which offers a tradeoff between the alignment-based and de novo whole transcriptome assembly (WTA) approaches, namely, being more sensitive in assembling novel junctions that cannot be directly aligned.

novoRNABreak modifies novoBreak, the well-attested genomic structural variation breakpoint assembly tool, to assemble novel junctions. The assembled contigs, which are considerably longer than raw reads, are aligned against the human genomic reference from Ensembl using the Burrows-Wheeler Aligner.





□ Syntenet: an R/Bioconductor package for the inference and analysis of synteny networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac806/6947985

syntenet infers synteny networks from whole-genome protein sequence data. syntenet offers a simple and complete framework, incl. data preprocessing, synteny detection and network inference, network clustering and phylogenomic profiling, and microsynteny-based phylogeny inference.

Network clustering is performed with the Infomap algorithm by default, which has been demonstrated to be the best clustering method for synteny networks, but users can also specify other algorithms implemented in igraph, such as Leiden, label propagation, Louvain, and edge betweenness.





□ HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521552v1

HAPNEST simulates genotypes by resampling a set of existing reference genomes, according to a stochastic model that approximates the underlying processes of coalescent, recombination and mutation.

HAPNEST enables simulation of diverse biobank-scale datasets, as well as simultaneous generation of multiple genetically correlated traits with population-specific effects under different pleiotropy models. HAPNEST uses a model inspired by the sequential Markovian coalescent model.





□ SnapFISH: a computational pipeline to identify chromatin loops from multiplexed DNA FISH data

>> https://www.biorxiv.org/content/10.1101/2022.12.16.520793v1

SnapFISH collects the 3D localization coordinates of each genomic segment targeted by FISH and computes the pairwise Euclidean distances between all imaged targeted loci. SnapFISH compares the pairwise Euclidean distances between the pair of interest and its local neighborhood region.

SnapFISH converts the resulting P-values into FDRs, and defines pairs of targeted segments as loop candidates. Lastly, SnapFISH groups nearby loop candidates into clusters, identifies the pair with the lowest FDR within each cluster, and uses these summits as the final list of chromatin loops.
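The P-value to FDR conversion can be illustrated with the Benjamini-Hochberg procedure. Note that the summary does not say which procedure SnapFISH uses, so Benjamini-Hochberg is an assumption here, and the P-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values (FDRs) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    fdr = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        fdr[i] = running_min
    return fdr


# Hypothetical P-values for four pairs of targeted segments.
pair_fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Pairs whose FDR falls below a chosen cutoff would then be kept as loop candidates before clustering and summit selection.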





□ SURGE: Uncovering context-specific genetic-regulation of gene expression from single-cell RNA-sequencing using latent-factor models

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521678v1

SURGE (Single-cell Unsupervised Regulation of Gene Expression) is a novel probabilistic model that uses matrix factorization to learn a continuous representation of the cellular contexts that modulate genetic effects.

SURGE achieves this goal by leveraging information across genome-wide variant-gene pairs to jointly learn both a continuous representation of the latent cellular contexts defining each measurement and the interaction eQTL effect sizes corresponding to each SURGE latent context.





□ ReSort: Accurate cell type deconvolution in spatial transcriptomics using a batch effect-free strategy

>> https://www.biorxiv.org/content/10.1101/2022.12.15.520612v1

ReSort is a Region-based cell type Sorting strategy that creates a pseudo-internal reference by extracting primary molecular regions from the ST data, leaving out spots that are likely to be mixtures.

By detecting these regions with diverse molecular profiles, ReSort can approximate the pseudo-internal reference to accurately estimate the composition at each spot, bypassing an external reference that could introduce technical noise.





□ Fast two-stage phasing of large-scale sequence data

>> https://www.cell.com/ajhg/fulltext/S0002-9297(21)00304-9

The method uses marker windowing and composite reference haplotypes. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations.

The method employs an HMM with a parsimonious state space of composite reference haplotypes. It uses a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage.





□ Mabs, a suite of tools for gene-informed genome assembly

>> https://www.biorxiv.org/content/10.1101/2022.12.19.521016v1

Mabs tries to find values of parameters of a genome assembler that maximize the number of accurately assembled BUSCO genes. BUSCO is a program that is supplied with a number of taxon-specific datasets that contain orthogroups whose genes are present and single-copy.

Mabs-hifiasm is intended for assembly using PacBio HiFi reads, while Mabs-flye is intended for assembly using reads of more error-prone technologies, namely Oxford Nanopore Technologies and PacBio CLR. Mabs reduces the number of haplotypic duplications.





□ BioNumPy: Fast and easy analysis of biological data with Python

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521373v1

BioNumPy is able to efficiently load biological datasets (e.g. FASTQ-files, BED-files and BAM-files) into NumPy-like data structures, so that NumPy operations like indexing, vectorized functions and reductions can be applied to the data.

A RaggedArray is similar to a NumPy array/matrix but can represent a matrix consisting of rows with varying lengths. An EncodedRaggedArray supports storing and operating on non-numeric data (e.g. DNA-sequences) by encoding the data and keeping track of the encoding.
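The idea behind a ragged array can be sketched with a flat data list plus row offsets. The `RaggedSketch` class and the `encode` / `decode_row` helpers below are hypothetical illustrations of the concept and are not the BioNumPy API.

```python
class RaggedSketch:
    """Rows of varying length stored as one flat list plus row offsets."""

    def __init__(self, rows):
        self.data = [x for row in rows for x in row]
        self.offsets = [0]
        for row in rows:
            self.offsets.append(self.offsets[-1] + len(row))

    def __len__(self):
        return len(self.offsets) - 1

    def row(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]]

    def row_lengths(self):
        return [self.offsets[i + 1] - self.offsets[i] for i in range(len(self))]


ALPHABET = "ACGT"  # the encoding is kept alongside the numeric data


def encode(sequences):
    """Store DNA strings as small integers in a ragged structure."""
    return RaggedSketch([[ALPHABET.index(base) for base in seq] for seq in sequences])


def decode_row(ragged, i):
    return "".join(ALPHABET[code] for code in ragged.row(i))


encoded = encode(["ACGT", "GG", "TAC"])
```

Because all values live in one flat buffer, element-wise operations can run over the whole dataset at once, while the offsets recover the per-sequence view.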





□ BUSZ: Compressed BUS files

>> https://www.biorxiv.org/content/10.1101/2022.12.19.521034v1

BUSZ is a binary file consisting of a header, followed by zero or more compressed blocks of BUS records, ending with an empty block. The BUSZ header includes all information from the BUS header, along with compression parameters. BUSZ files have a different magic number than BUS files.

The algorithm assumes a sorted input. The input is sorted lexicographically by barcodes first, then by UMIs, and finally by the equivalence classes. Within each block, the columns are compressed independently, each with a customized compression-decompression codec.
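A small sketch of the sort order and the column split. The records are made-up (barcode, UMI, equivalence class, count) tuples with integer-encoded barcodes, and the delta step at the end is only an illustration of why sorted columns compress well, not the actual BUSZ codec.

```python
# Hypothetical BUS-like records: (barcode, umi, equivalence_class, count).
records = [
    (2, 5, 1, 1),
    (1, 9, 3, 2),
    (1, 9, 0, 1),
    (1, 4, 7, 1),
]

# Lexicographic order: barcode first, then UMI, then equivalence class.
sorted_records = sorted(records, key=lambda r: (r[0], r[1], r[2]))

# Split a block into columns so each can be handled by its own codec.
barcodes, umis, ecs, counts = (list(col) for col in zip(*sorted_records))

# Illustration only: after sorting, barcode deltas are small and non-negative.
barcode_deltas = [barcodes[0]] + [b - a for a, b in zip(barcodes, barcodes[1:])]
```

Long runs of zeros in such a delta column are the kind of redundancy a column-specific codec can exploit.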





□ CETYGO: Uncertainty quantification of reference-based cellular deconvolution algorithms

>> https://www.tandfonline.com/doi/full/10.1080/15592294.2022.2137659

The CEll TYpe deconvolution GOodness (CETYGO) score is an accuracy metric for a set of cellular heterogeneity variables derived from a genome-wide DNAm profile for an individual sample.

CETYGO is defined as the root mean square error (RMSE) between the observed bulk DNAm profile and the expected profile across the M cell-type-specific DNAm sites used to perform the deconvolution, where the expected profile is calculated from the estimated proportions for the N cell types.
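
The RMSE described above can be written out in a few lines. This sketch assumes the expected profile is a proportion-weighted sum of per-cell-type reference methylation levels (a linear mixing assumption); the function names and the toy numbers are hypothetical.

```python
import math

def expected_profile(reference, proportions):
    """Expected bulk DNAm per site: proportion-weighted sum of cell-type references.

    reference[i][j] is the methylation level of site i in cell type j (M x N).
    proportions[j] is the estimated proportion of cell type j (length N).
    """
    return [sum(r * p for r, p in zip(row, proportions)) for row in reference]

def cetygo_like_score(observed, reference, proportions):
    """RMSE between observed and expected profiles over the M deconvolution sites."""
    expected = expected_profile(reference, proportions)
    squared = [(o - e) ** 2 for o, e in zip(observed, expected)]
    return math.sqrt(sum(squared) / len(squared))

# Toy example: M = 3 sites, N = 2 cell types.
reference = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
proportions = [0.6, 0.4]
perfect = expected_profile(reference, proportions)
shifted = [value + 0.1 for value in perfect]
score_perfect = cetygo_like_score(perfect, reference, proportions)
score_shifted = cetygo_like_score(shifted, reference, proportions)
```

A score near zero means the estimated proportions reproduce the bulk profile well; larger values flag samples where the deconvolution is less trustworthy.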





□ CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010788

CONGA (Copy Number Variation Genotyping in Ancient Genomes and Low-coverage Sequencing Data) is a CNV genotyping algorithm tailored for ancient and other low-coverage genomes, which estimates copy number beyond presence/absence of events.

CONGA first calculates the number of reads mapped to each given interval in the reference genome, which we call “observed read-depth”. It then calculates the “expected diploid read-depth”, i.e., the GC-content normalized read-depth given the genome average.

CONGA calculates the likelihood for each genotype by modeling the read-depth distribution as Poisson. CONGA uses a split-read step in order to utilize paired-end information. It splits reads and remaps the split within the genome, treating the two segments as paired-end reads.
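
The Poisson read-depth likelihood can be sketched as follows. The linear scaling of the Poisson mean with copy number, the set of candidate copy numbers, and the small `floor` rate for the zero-copy case are all assumptions of this sketch, not details taken from CONGA itself.

```python
import math

def poisson_log_pmf(k, lam):
    """log P(K = k) for a Poisson distribution with mean lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def genotype_log_likelihoods(observed_depth, expected_diploid_depth, floor=0.01):
    """Log-likelihood of the observed read depth under 0, 1, 2, and 3 copies.

    Assumption: the Poisson mean is (copies / 2) times the diploid expectation.
    `floor` is a hypothetical small rate that keeps the zero-copy mean positive.
    """
    result = {}
    for copies in (0, 1, 2, 3):
        lam = max(expected_diploid_depth * copies / 2, floor)
        result[copies] = poisson_log_pmf(observed_depth, lam)
    return result

log_likelihoods = genotype_log_likelihoods(observed_depth=11, expected_diploid_depth=20)
best_copy_number = max(log_likelihoods, key=log_likelihoods.get)
```

With 11 observed reads against a diploid expectation of 20, the single-copy state (mean 10) explains the data best, which corresponds to a heterozygous deletion.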





□ motifNet: Functional motif interactions discovered in mRNA sequences with implicit neural representation learning

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521305v1

Many existing neural network models for mRNA event prediction only take the sequence as input, and do not consider the positional information of the sequence.

motifNet is a lightweight neural network that uses both the sequence and its positional information as input. This allows for the implicit neural representation of the various motif interaction patterns in human mRNA sequences.





□ SCIBER: a simple method for removing batch effects from single-cell RNA-sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac819/6957084

SCIBER (Single-Cell Integrator and Batch Effect Remover) matches cell clusters across batches according to the overlap of their differentially expressed genes. SCIBER is a simple method that outputs the batch-effect-corrected expression data in the original space/dimension.

SCIBER is computationally more efficient than Harmony, LIGER, and Seurat, and it scales to datasets with a large number of cells. SCIBER can be further accelerated by replacing K-means with a more efficient clustering algorithm or using a more efficient implementation of K-means.
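
The cluster-matching step can be illustrated with plain set overlap. This is a simplified sketch: it ranks candidate matches by the raw number of shared differentially expressed genes, which is an assumption here, and the function name and toy gene sets are hypothetical.

```python
def match_clusters(reference_degs, query_degs):
    """For each query cluster, pick the reference cluster sharing the most DEGs.

    Both arguments map a cluster label to a set of differentially expressed genes.
    Returns {query_cluster: (reference_cluster, overlap_size)}; a query cluster
    with no shared gene maps to (None, 0).
    """
    matches = {}
    for q_label, q_genes in query_degs.items():
        best_label, best_overlap = None, 0
        for r_label, r_genes in reference_degs.items():
            overlap = len(q_genes & r_genes)
            if overlap > best_overlap:
                best_label, best_overlap = r_label, overlap
        matches[q_label] = (best_label, best_overlap)
    return matches

reference_degs = {"T": {"CD3E", "CD2", "IL7R"}, "B": {"CD79A", "MS4A1", "CD19"}}
query_degs = {"q1": {"MS4A1", "CD79A", "XIST"}, "q2": {"CD3E", "IL7R"}, "q3": {"ALB"}}
matches = match_clusters(reference_degs, query_degs)
```

Once clusters are paired across batches, the correction itself can be applied per matched pair while leaving the expression matrix in its original gene space.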





□ CODA: a combo-Seq data analysis workflow

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac582/6955042

CODA (Combo-seq Data Analysis) is a custom-tailored workflow for processing Combo-Seq data that uses existing tools commonly used in RNA-Seq data analysis. The authors compared it to exceRpt.

Because of the chosen trimmer, the maximum read length of trimmed reads when using CODA is higher than the one with exceRpt, and it results in more reads successfully passing. This is more dramatic the shorter the sequenced reads are.

This tends to affect gene-mapping reads, rather than miRNA mapping ones: The absolute number of reads mapping to genes increases, especially for shorter sequencing reads, where the proportion of reads with an incomplete/missing adapter increases.





□ NetSHy: Network Summarization via a Hybrid Approach Leveraging Topological Properties

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac818/6957083

NetSHy applies principal component analysis (PCA) on a combination of the node profiles and the well-known Laplacian matrix derived directly from the network similarity matrix to extract a summarization at a subject level.
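
One way to read "PCA on a combination of the node profiles and the Laplacian" is to multiply the subject-by-node profile matrix by the graph Laplacian and take the leading principal component. That reading, the unnormalized Laplacian, and the function name below are assumptions of this sketch rather than a confirmed description of NetSHy.

```python
import numpy as np

def network_summary_scores(profiles, similarity, n_components=1):
    """Subject-level summary scores for one network.

    profiles: subjects x nodes matrix; similarity: nodes x nodes symmetric weights.
    """
    weights = np.array(similarity, dtype=float)  # copy, so the input is untouched
    np.fill_diagonal(weights, 0.0)
    laplacian = np.diag(weights.sum(axis=1)) - weights
    combined = np.asarray(profiles, dtype=float) @ laplacian
    centered = combined - combined.mean(axis=0)
    # PCA via SVD: principal component scores are U * S.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
profiles = rng.normal(size=(6, 4))
similarity = np.abs(np.corrcoef(profiles, rowvar=False))
scores = network_summary_scores(profiles, similarity)
```

Each subject ends up with one number per component, so a whole network module can be carried into downstream association tests as a single variable.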





□ Redeconve: Spatial transcriptomics deconvolution at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521551v1

Redeconve is a new algorithm to estimate the cellular composition of ST spots. Redeconve introduces a regularizing term to solve the collinearity problem of high-resolution deconvolution, with the assumption that similar single cells have similar abundance in ST spots.

Redeconve is a quadratic programming model for single-cell deconvolution: a non-negative least squares regression with an added regularization term. Benchmarked against a ground truth obtained by nucleus counting, Redeconve further improves the accuracy of estimated cell abundance.
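
How a similarity-based regularizer resolves collinearity can be shown on a toy problem. The objective below, a non-negative least squares fit plus a penalty on abundance differences between similar cells, is an assumed form consistent with the description above, and projected gradient descent is used as a simple stand-in for a quadratic programming solver. The function name and parameters are hypothetical.

```python
import numpy as np

def regularized_nnls(reference, spot, similarity, lam=1.0, n_iter=5000):
    """Minimize ||spot - reference @ x||^2 + lam * sum_{i<j} s_ij (x_i - x_j)^2, x >= 0.

    reference: genes x cells expression; spot: genes; similarity: cells x cells.
    """
    R = np.asarray(reference, dtype=float)
    y = np.asarray(spot, dtype=float)
    S = np.array(similarity, dtype=float)
    np.fill_diagonal(S, 0.0)
    graph_laplacian = np.diag(S.sum(axis=1)) - S
    # The objective equals x^T Q x - 2 b^T x + const, a convex quadratic.
    Q = R.T @ R + lam * graph_laplacian
    b = R.T @ y
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Q)[-1])  # 1 / Lipschitz constant
    x = np.zeros(R.shape[1])
    for _ in range(n_iter):
        gradient = 2.0 * (Q @ x - b)
        x = np.maximum(x - step * gradient, 0.0)  # project onto x >= 0
    return x

# Cells 0 and 1 have identical profiles, so plain least squares cannot split them.
reference = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
spot = [2, 2, 1]
similarity = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
abundance = regularized_nnls(reference, spot, similarity, lam=1.0)
```

Without the penalty, any split of the total between the two identical cells fits equally well; the similarity term selects the even split.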





□ CRAM compression: practical across-technologies considerations for large-scale sequencing projects

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521516v1

The study considers the use of CRAM for the Emirati Genome Program, which aims to sequence the genomes of ~1 million nationals in the United Arab Emirates using short- and long-read sequencing technologies (Illumina, MGI and Oxford Nanopore Sequencing).





□ SIMBSIG: Similarity search and clustering for biobank-scale data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac829/6958553

SIMBSIG (“SIMBSIG = SIMmilarity Batched Search Integrated GPU”) can efficiently perform nearest neighbour searches, principal component analysis (PCA), and K-Means clustering on central processing units (CPUs) and GPUs, both in-core and out-of-core.




□ igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV)

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac830/6958554

igv.js is an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV). It can be easily dropped into any web page with a single line of code and has no external dependencies.

igv.js supports a wide range of genomic track types and file formats, including aligned reads, variants, coverage, signal peaks, annotations, eQTLs, GWAS, and copy number variation. A particular strength of IGV is manual review of genome variants, both single-nucleotide and structural variants.





□ A Pairwise Strategy for Imputing Predictive Features When Combining Multiple Datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac839/6964381

This method maximizes common genes for imputation based on the intersection between two studies at a time. This method has significantly better performance than the omitting and merged methods in terms of the Root Mean Square Error of prediction on an external validation set.





□ Sc2Mol: A Scaffold-based Two-step Molecule Generator with Variational Autoencoder and Transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac814/6964383

Sc2Mol is a generative model-based molecule generator that works without any prior scaffold patterns. Sc2Mol uses SMILES strings for molecules. It consists of two steps, scaffold generation and scaffold decoration, which are carried out by a variational autoencoder and a transformer, respectively.





□ scAVENGERS: a genotype-based deconvolution of individuals in multiplexed single-cell ATAC-seq data without reference genotypes

>> https://academic.oup.com/nargab/article/4/4/lqac095/6965979

scAVENGERS (scATAC-seq Variant-based EstimatioN for GEnotype ReSolving) introduces an appropriate read alignment tool, variant caller, and mixture model to appropriately process the demultiplexing of scATAC-seq data.

scAVENGERS uses SciPy's sparse matrix structure to enable large data processing. As scAVENGERS selects alternative allele counts to maximize the expected value of the total log-likelihood, a probability value of zero inevitably appears during the calculation.





□ gget: Efficient querying of genomic reference databases

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac836/6971843

gget is a free and open-source software package that queries information stored in several large, public databases directly from a command line or Python environment.

gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying required for genomic data analysis in a single line of code.





□ Metadata retrieval from sequence databases with ffq

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac667/6971839

ffq efficiently fetches metadata and links to raw data in JSON format. ffq’s modularity and simplicity make it extensible to any genomic database exposing its data for programmatic access.





□ MinNet: Single-cell multi-omics integration for unpaired data by a siamese network with graph-based contrastive loss

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05126-7

MinNet is a novel Siamese neural network design for single-cell multi-omics sequencing data integration. It ranked top among other methods in benchmarking and is especially suitable for integrating datasets with batch and biological variances.

MinNet reduces the distance between similar cells and separates different cells in the n-dimensional space. The distances between corresponding cells get smaller while the distances between negative pairs get larger. In this way, the main biological variance is kept in the co-embedding space.
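
The pull-together, push-apart behaviour can be illustrated with a generic margin-based contrastive loss for a single pair of embeddings. This is a textbook form used for illustration, not MinNet's graph-based loss; the `margin` parameter and function names are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(anchor, other, is_positive, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Positive pairs are penalized by their squared distance (pulled together);
    negative pairs are penalized only while closer than `margin` (pushed apart).
    """
    d = euclidean(anchor, other)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2

loss_matched = contrastive_loss([0.0, 0.0], [0.3, 0.4], is_positive=True)
loss_near_negative = contrastive_loss([0.0, 0.0], [0.3, 0.4], is_positive=False)
loss_far_negative = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_positive=False)
```

Minimizing this over many pairs shrinks distances between corresponding cells and leaves negative pairs alone once they are farther apart than the margin.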





□ NetAct: a computational platform to construct core transcription factor regulatory networks using gene activity

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02835-3

NetAct infers regulators’ activity using target expression, constructs networks based on transcriptional activity, and integrates mathematical modeling for validation. NetAct infers TF activity for an individual sample directly from the expression of genes targeted by the TF.

NetAct calculates a TF's activity using the mRNA expression of the direct targets of the TF. NetAct is robust against some inaccuracy in the TF-target database and noise in gene expression data, because of its capability of filtering out irrelevant targets while retaining key targets.





□ RabbitVar: ultra-fast and accurate somatic small-variant calling on multi-core architectures

>> https://www.biorxiv.org/content/10.1101/2023.01.06.522980v1

RabbitVar features a heuristic-based calling method and a subsequent machine-learning-based filtering strategy. RabbitVar has also been highly optimized by featuring multi-threading, a high-performance memory allocator, vectorization, and efficient data structures.




□ The probability of edge existence due to node degree: a baseline for network-based predictions

>> https://www.biorxiv.org/content/10.1101/2023.01.05.522939v1

The framework decomposes performance into the proportions attributable to degree. The edge prior can be estimated using the fraction of permuted networks in which a given edge exists—the maximum likelihood estimate for the binomial distribution success probability.

The authors modified the XSwap algorithm by adding two parameters, allow_loops and allow_antiparallel, that allow a greater variety of network types to be permuted. The edge swap mechanism uses a bitset to avoid producing edges which violate the conditions for a valid swap.
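
The edge prior can be sketched as the fraction of degree-preserving permutations that contain a given edge. The code below is a simplified, undirected illustration of swap-based permutation, so `allow_antiparallel` does not apply, and a Python set stands in for the bitset. Apart from `allow_loops`, the function names, the number of swaps per permutation, and the example network are assumptions.

```python
import random

def xswap_permute(edges, n_swaps, rng, allow_loops=False):
    """Degree-preserving permutation of an undirected edge list via edge swaps."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # undirected edges: try either orientation
        new_1, new_2 = (a, d), (c, b)
        if not allow_loops and (a == d or c == b):
            continue  # swap would create a self-loop
        if frozenset(new_1) == frozenset(new_2):
            continue  # swap would create the same edge twice
        if frozenset(new_1) in present or frozenset(new_2) in present:
            continue  # swap would duplicate an existing edge
        present -= {frozenset(edges[i]), frozenset(edges[j])}
        present |= {frozenset(new_1), frozenset(new_2)}
        edges[i], edges[j] = new_1, new_2
    return edges

def edge_prior(edges, query_edge, n_permutations=200, seed=0):
    """Fraction of permuted networks that contain query_edge."""
    rng = random.Random(seed)
    target = frozenset(query_edge)
    hits = 0
    for _ in range(n_permutations):
        permuted = xswap_permute(edges, n_swaps=10 * len(edges), rng=rng)
        hits += any(frozenset(e) == target for e in permuted)
    return hits / n_permutations

example_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 5), (5, 6)]
prior = edge_prior(example_edges, (1, 2), n_permutations=50)
```

Because every swap keeps each node's degree fixed, an edge between two high-degree nodes tends to reappear in many permutations, which is the degree-driven baseline the framework measures predictions against.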





□ HiDDEN: A machine learning label refinement method for detection of disease-relevant populations in case-control single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.01.06.523013v1

HiDDEN refines the case-control labels to accurately reflect the perturbation status of each cell. HiDDEN shows a superior ability to recover biological signals missed by the standard analysis workflow in simulated ground truth datasets of cell type mixtures.



□ Hetnet connectivity search provides rapid insights into how two biomedical entities are related

>> https://www.biorxiv.org/content/10.1101/2023.01.05.522941v1

The method transforms the degree-weighted path count (DWPC) across all source-target node pairs for a metapath to yield a distribution that is more compact and amenable to modeling. It also calculates a path score heuristic, which can be used to compare the importance of paths between metapaths.





□ scEMAIL: Universal and Source-free Annotation Method for scRNA-seq Data with Novel Cell-type Perception

>> https://www.sciencedirect.com/science/article/pii/S1672022922001747

scEMAIL is a universal transfer learning-based annotation framework for scRNA-seq data that incorporates expert ensemble novel cell-type perception and multi-order local affinity constraints, with no need for source data.

scEMAIL can deal with atlas-level datasets with mixed batches. scEMAIL achieved intra-cluster compactness and inter-cluster separation, which indicated that the affinity constraints guide the network to learn the correct intercellular relationships.





□ RCL: Unsupervised Contrastive Peak Caller for ATAC-seq

>> https://www.biorxiv.org/content/10.1101/2023.01.07.523108v1

RCL uses ResNet as the backbone module with only five layers, making the network architecture shallow but efficient. RCL showed no problems with class imbalance, probably because the region selection step effectively discards nonpeak regions and balances the data.

RCL could be extended to take coverage vectors for multiple fragment lengths, the fragments themselves, or even annotation information, as used by the supervised method CNN-Peaks.







The Wonder.

2022-12-31 22:10:10 | Film


□ 『The Wonder (聖なる証)』

>> https://www.netflix.com/jp/title/81426931

Directed by Sebastián Lelio
Based on the book by Emma Donoghue
Written by Emma Donoghue / Sebastián Lelio / Alice Birch
Music by Matthew Herbert
Cinematography by Ari Wegner

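Nineteenth-century Ireland: a girl who embodies a divine miracle, and the nurse charged with watching her. From the opening, the film reveals itself to be metafiction. A deconstruction of the symbols that prop up deceit and structures of control. Are we observers on the outside, or still trapped within? A nested structure that merely repeats has neither past nor future.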







Les Traducteurs.

2022-12-31 17:48:11 | Film


□ 『Les Traducteurs』(9人の翻訳家)

Directed by Régis Roinsard

Writing by
Romain Compingt
Daniel Presley
Régis Roinsard

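A French-Belgian crime mystery. The switch from a single-location thriller to a narrative trick is deft, though the film is conventional enough that I could guess the ending fairly early on. Literary references are sprinkled in at just the right dose, and it is a treat for Joyce readers that the passage quoting James Joyce in particular is set up to make things click.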

and then I asked him with my eyes to ask again yes and then he asked me would I yes to say yes my mountain flower and first I put my arms around him yes and drew him down to me so he could feel my breasts all perfume yes and his heart was going like mad and yes I said yes I will Yes.

─Episode 18: Penelope. James Joyce, Ulysses.