lens, align.

Long is the time, but what is true comes to pass.

Hyperquant.

2022-12-31 22:13:31 | Science News

If all time is eternally present
all time is unredeemable.




□ HyperHMM: Efficient inference of evolutionary and progressive dynamics on hypercubic transition graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac803/6895098

HyperHMM is an adapted Baum-Welch (expectation maximisation) algorithm for hypercubic inference, with resampling to quantify uncertainty. It allows orders-of-magnitude faster inference while making few practical sacrifices compared to previous hypercubic inference approaches.

The HyperHMM algorithm proceeds by iteratively estimating forward and backward probabilities of the different transitions observed in the dataset, given a current estimate of the hypercubic transition matrix.

Hypercubic inference learns the transition probabilities, finding the parameterisation most compatible with a set of emitted observations. The result can be interpreted as a probability map of which feature is likely to be acquired at which stage, giving explicit pathways through the hypercube space.
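To make the hypercubic transition graph concrete, here is a minimal sketch. It is not the HyperHMM implementation: states are binary tuples over a handful of features, each transition acquires exactly one feature, and the transition probabilities are uniform placeholders (an assumption) rather than values learned by Baum-Welch. All function names are hypothetical.

```python
from itertools import product

def hypercube_transitions(n_features):
    """Map each state (tuple of 0/1) to the states reachable by acquiring one feature."""
    edges = {}
    for state in product((0, 1), repeat=n_features):
        reachable = []
        for i, bit in enumerate(state):
            if bit == 0:
                reachable.append(state[:i] + (1,) + state[i + 1:])
        edges[state] = reachable
    return edges

def uniform_matrix(edges):
    """Placeholder probabilities: uniform over outgoing edges (assumption, not learned)."""
    return {s: {t: 1.0 / len(ts) for t in ts} for s, ts in edges.items() if ts}

def state_probabilities(probs, n_features):
    """Propagate probability mass from the all-zero state, one acquisition per step."""
    current = {(0,) * n_features: 1.0}
    layers = [current]
    for _ in range(n_features):
        following = {}
        for state, p in current.items():
            for target, q in probs.get(state, {}).items():
                following[target] = following.get(target, 0.0) + p * q
        current = following
        layers.append(current)
    return layers

edges = hypercube_transitions(3)
layers = state_probabilities(uniform_matrix(edges), 3)
```

Each entry of `layers` is the kind of "probability map" described above: `layers[t][state]` is the probability of occupying `state` after `t` acquisitions under the placeholder matrix.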






□ Hypergraphs and centrality measures identifying key features in gene expression data

>> https://www.biorxiv.org/content/10.1101/2022.12.18.518108v1

The hypergraph modelling approach presented is designed to interrogate a data set, consisting of a structured collection of labelled multi-dimensional data records. Each data record is tested against a list of conditions of interest, giving a sequence of Boolean results.

The vertices of the hypergraph correspond to the conditions and the hyperedges correspond to the data records, with a hyperedge incident with a vertex if the corresponding data record satisfies the given condition.

The 2-multiplicity hyperedge, with its distinct intersection pattern, forms a pendant vertex and centers strictly around comparisons between the agravitropic and gravitropic phenotypes.

Robust distance measures were obtained by representing hypergraphs in terms of s-line graphs. This definition of distance enabled the calculation of multiple centrality measures, with particular emphasis on betweenness and eigencentrality.
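The construction can be sketched in a few lines. This is an illustration only: the records, condition names and thresholds below are made up, and the s-line graph is built under the usual definition (two hyperedges are adjacent when they share at least s vertices), which is assumed here rather than taken from the paper's code.

```python
from itertools import combinations

def build_hyperedges(records, conditions):
    """One hyperedge per record: the set of condition names the record satisfies."""
    return {
        name: {cond for cond, test in conditions.items() if test(rec)}
        for name, rec in records.items()
    }

def s_line_graph(hyperedges, s=1):
    """Connect two hyperedges when they share at least s vertices (conditions)."""
    return {
        (a, b)
        for a, b in combinations(sorted(hyperedges), 2)
        if len(hyperedges[a] & hyperedges[b]) >= s
    }

# Hypothetical conditions and records, for illustration only.
conditions = {
    "high_expr": lambda r: r["expr"] > 5.0,
    "mutant": lambda r: r["genotype"] == "mut",
    "late": lambda r: r["hours"] >= 12,
}
records = {
    "r1": {"expr": 7.2, "genotype": "mut", "hours": 24},
    "r2": {"expr": 6.1, "genotype": "wt", "hours": 12},
    "r3": {"expr": 1.0, "genotype": "mut", "hours": 2},
}

hyperedges = build_hyperedges(records, conditions)
line_1 = s_line_graph(hyperedges, s=1)
line_2 = s_line_graph(hyperedges, s=2)
```

Shortest-path distances on `line_1` or `line_2` would then feed centrality measures such as betweenness.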





□ MIDAS: a deep generative model for mosaic integration and knowledge transfer of single-cell multimodal data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520262v1

MIDAS (the mosaic integration and knowledge transfer) simultaneously achieves dimensionality reduction, imputation, and batch correction of single-cell trimodal mosaic data by employing self-supervised modality alignment and information-theoretic latent disentanglement.

MIDAS uses self-supervised learning to align different modalities in latent space, improving cross-modal inference. Scalable inference is achieved with Stochastic Gradient Variational Bayes (SGVB), which enables “rectangular integration” and atlas construction.





□ HydRA: Deep-learning models for predicting RNA-binding capacity from protein interaction association context and protein sequence

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521837v1

HydRA enables Occlusion Mapping to robustly detect known RNA-binding domains and to predict hundreds of uncharacterized RNA-binding domains. HydRA scores are highly correlated with the number of experimental studies that identify a given RBP as cross-linkable to RNA.

The HydRA algorithm applies an ensemble learning method that integrates a convolutional neural network, a Transformer and an SVM for RBP prediction, using both intermolecular protein context and sequence-level information.






□ TrAGEDy: Trajectory Alignment of Gene Expression Dynamics

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521424v1

TrAGEDy makes post-hoc changes to the alignment, which overcomes limitations of Dynamic Time Warping (DTW). TrAGEDy aligns the pseudotime of the interpolated points, then of the cells, and performs a sliding-window comparison between cells at similar points in aligned pseudotime.

TrAGEDy finds the optimal path through the dissimilarity matrix of the interpolated points, which constitutes the shared process between the two trajectories. DTW, with alterations, is used to find the optimal path.

Another constraint of DTW is that all points must be matched to at least one other point. After DTW, TrAGEDy prunes any matches that have high transcriptional dissimilarity, which allows it to handle processes that may have diverged in the middle of their respective trajectories.
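As a rough illustration of the two steps described above, the sketch below runs textbook DTW over a dissimilarity matrix and then drops matched pairs whose dissimilarity exceeds a cutoff. It does not reproduce TrAGEDy's alterations to DTW; the function names, the toy trajectories and the `threshold` value are assumptions.

```python
def dtw_path(dissim):
    """Textbook DTW over a dissimilarity matrix; returns the optimal warping path."""
    n, m = len(dissim), len(dissim[0])
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dissim[i - 1][j - 1] + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Trace back from the last cell to the first.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    return path[::-1]

def prune_matches(path, dissim, threshold):
    """Drop matched pairs whose dissimilarity is above the (hypothetical) threshold."""
    return [(i, j) for i, j in path if dissim[i][j] <= threshold]

# Toy one-dimensional "expression" values at interpolated points.
traj_a = [0, 1, 2, 9]
traj_b = [0, 1, 2, 3]
dissim = [[abs(a - b) for b in traj_b] for a in traj_a]

path = dtw_path(dissim)
kept = prune_matches(path, dissim, threshold=2)
```

In this toy case DTW is forced to match the last points of both trajectories even though they differ strongly; the pruning step removes that pair.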







□ XCVATR: detection and characterization of variant impact on the Embeddings of single-cell and bulk RNA-sequencing samples

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-09004-7

XCVATR performs a multiscale analysis of the distance matrices to identify variant clumps. It performs a scale selection to tune the analysis to the cell–cell distance metric. For each cell, XCVATR identifies the Nν cells that are closest to it, defining the close neighborhood of that cell.

XCVATR builds a matrix and computes the estimated alternative allele frequency (AF). XCVATR performs a cell-centered analysis, wherein it does not aim to model the whole embedding space, but rather focuses on the cells. XCVATR identifies the medians of the minimum and maximum radii over all cells.





□ LuxHMM: DNA methylation analysis with genome segmentation via Hidden Markov Model

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521327v1

LuxHMM is a probabilistic method that uses a hidden Markov model (HMM) to segment the genome into regions and a Bayesian regression model, which can handle multiple covariates, to infer differential methylation of regions.

LuxHMM determines hypo- and hypermethylated regions. It describes the underlying biochemistry in bisulfite sequencing, and model inference is done using either automatic differentiation variational inference (for genome-scale analysis) or Hamiltonian Monte Carlo.





□ Asteroid: a new algorithm to infer species trees from gene trees under high proportions of missing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac832/6964379

Asteroid is a novel algorithm that infers an unrooted species tree from a set of unrooted gene trees. Asteroid is substantially more accurate than ASTRAL and ASTRID for very high proportions of missing data.

Asteroid is parallelized and can take multi-furcating gene trees as input. For each input gene tree, Asteroid computes a distance matrix based on the gene internode distance. It then computes a species tree from this set of distance matrices under the minimum balanced evolution principle.





□ Liam tackles complex multimodal single-cell data integration challenges

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521399v1

Liam (leveraging information across modalities) is a model for the simultaneous horizontal / vertical integration of paired multimodal single-cell data. Liam learns a joint low-dimensional representation of two concurrently measured modalities.

Liam accounts for complex batch effects using a CVAE / AVAE and can be optimized using replicate information. Liam employs a logistic-normal distribution for the latent cell variable, making the latent factor loadings interpretable as probabilities.





□ scTensor detects many-to-many cell-cell interactions from single cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519225v1

scTensor is a novel method for extracting representative triadic relationships, including ligand expression, receptor expression, and related L-R pairs. scTensor detects hypergraphs that cannot be detected using conventional cell-cell interaction (CCI) detection, especially when they include many-to-many relationships.

scTensor constructs the CCI-tensor and decomposes it with the NTD-2 algorithm. scTensor estimates the NTD-2 ranks for each matricized CCI-tensor. Because NMF is performed on each matricized CCI-tensor, each NMF rank is estimated based on the residual sum of squares.





□ NPGREAT: assembly of human subtelomere regions with the use of ultralong nanopore reads and linked-reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05081-3

NanoPore Guided REgional Assembly Tool (NPGREAT) combines Linked-Read data with mapped ultralong nanopore reads spanning subtelomeric segmental duplications to potentially overcome these difficulties.

Linked-Read sets of DNA sequences identified by matches with 1-copy subtelomere sequence adjacent to segmental duplications are assembled and extended into the segmental duplication regions using Regional Extension of Assemblies using Linked-Reads (REXTAL).

REXTAL contig alignment with the cognate nanopore read sequence is monitored for alignment discrepancies above a given threshold. Mapped telomere-containing ultralong nanopore reads are used to provide contiguity and correct orientation for matching REXTAL sequence.





□ SC3s: efficient scaling of single cell consensus clustering to millions of cells

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05085-z

SC3s takes as input the gene-by-cell expression matrix, after preprocessing and dimensionality reduction via PCA using Scanpy commands. SC3s attempts to combine the results of multiple clustering runs, where the number of principal components is changed.

All this information is then encoded into a binary matrix, which can be efficiently used to produce the final k cell clusters. The key difference from the original SC3 is that for each d, the cells are first grouped into microclusters, which can be reused for multiple values of k.





□ Spectra: Supervised discovery of interpretable gene programs from single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521311v1

Spectra overcomes the dominance of cell-type signals by modeling cell-type-specific programs, and can characterize interpretable cell states along a continuum.

Spectra retrieves gene programs from scRNA-seq data using biological priors. As input, Spectra receives a gene expression count matrix with cell type labels for each cell, as well as pre-defined gene sets, which it converts to a gene-gene graph.

The algorithm fits a factor analysis using a loss function that optimizes reconstruction of the count matrix and guides factors to support the input gene-gene graph. As output, Spectra provides factor loadings and gene programs corresponding to cell types and cellular processes.





□ DEAPLOG: A method for differential expression analysis and pseudo-temporal locating and ordering of genes in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521359v1

DEAPLOG is a method for differential expression analysis and pseudo-temporal locating and ordering of genes in single-cell transcriptomic data. DEAPLOG infers pseudo-time / embedding coordinates of genes, and is therefore useful in identifying regulators along the trajectory of cell fate decisions.

DEAPLOG identifies a large number of statistically significant DEGs. DEAPLOG defines the point with the maximum curvature on the fitting curve of a gene expression as threshold. DEAPLOG combines polynomial fitting and hypergeometric distribution.





□ SCellBOW: Latent representation of single-cell transcriptomes enables algebraic operations on cellular phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.28.522060v1

SCellBOW uses Doc2vec, which is a bag-of-words model, and is therefore independent of any strict ordering of genes. The SCellBOW algorithm provides a latent representation of single cells in a manner that captures the 'semantics' associated with cellular phenotypes.

The neuronal weights learned by SCellBOW are transferable. These representations, also known as embeddings, allow algebraic operations such as +/-. SCellBOW-based vector representation of cellular transcriptomes preserves their phenotypic relationships in a vector space.





□ SEISM: Neural Networks beyond explainability: Selective inference for sequence motifs

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521748v1

SEISM is a selective inference procedure to test the association between the extracted features and the predicted phenotype. SEISM relies on the fact that training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score.

SEISM partitions the space of motifs to quantize the selection. The selection event is the set of phenotype vectors. SEISM uses 50,000 replicates sampled under the conditional null hypothesis with the hypersphere direction sampler, after 10,000 burn-in iterations.





□ mapquik: Efficient low-divergence mapping of long reads in minimizer space

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521809v1

mapquik, instead of using a single minimizer as a seed to a genome (as minimap2 does), builds accurate longer seeds by anchoring alignments through matches of k consecutively-sampled minimizers (k-min-mers).

mapquik borrows from natural language processing: the tokens of the k-mers are minimizers instead of base-pair letters. mapquik's application of minimizer-space computation is entirely distinct from genome assembly, as no de Bruijn graph is constructed.

Indexing the long minimizer-space seeds (k-min-mers) that occur uniquely in the genome is sufficient for mapping. mapquik devises a provably O(n) time pseudo-chaining algorithm, which improves upon the O(n log n) runtime of the best other known colinear chaining algorithms.
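The idea of a k-min-mer can be sketched as follows. This is a toy illustration, not mapquik's code: the minimizer scheme (keep an l-mer when its normalised CRC32 hash falls below `density`), the parameter values and all function names are assumptions. `density=1.0` keeps every l-mer and is used only so the example is easy to follow.

```python
import zlib
from collections import Counter

def select_minimizers(seq, l, density):
    """Keep l-mers whose normalised hash falls below `density` (assumed scheme)."""
    picked = []
    for pos in range(len(seq) - l + 1):
        lmer = seq[pos:pos + l]
        if zlib.crc32(lmer.encode()) / 2**32 < density:
            picked.append((pos, lmer))
    return picked

def k_min_mers(minimizers, k):
    """Every run of k consecutive minimizers forms one minimizer-space seed."""
    return [
        tuple(lmer for _, lmer in minimizers[i:i + k])
        for i in range(len(minimizers) - k + 1)
    ]

def unique_seeds(kminmers):
    """Seeds that occur exactly once; only these would be indexed."""
    counts = Counter(kminmers)
    return {seed for seed, n in counts.items() if n == 1}

minimizers = select_minimizers("ACGTACGTTGCA", l=3, density=1.0)
seeds = k_min_mers(minimizers, k=3)
index = unique_seeds(seeds)
```

With a realistic, much lower density the seeds span long stretches of the read while remaining short tuples of tokens, which is what makes exact matching of whole seeds practical.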





□ ASTER: accurately estimating the number of cell types in single-cell chromatin accessibility data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac842/6961187

ASTER is an ensemble learning-based tool for accurately estimating the number of cell types in scCAS data. ASTER performs estimation based on the Davies-Bouldin index.

ASTER calculates the mean silhouette coefficient of all cells based on Louvain and Leiden clustering. The number of clusters that gives the maximum coefficient is adopted as the optimal number of clusters.





□ NanoSNP: A progressive and haplotype-aware SNP caller on low coverage Nanopore sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac824/6957086

NanoSNP is a novel deep learning-based SNP calling method to identify SNP sites (excluding short indels) based on low-coverage Nanopore sequencing reads. NanoSNP utilizes the naive pileup feature to predict a subset of SNP sites with a Bi-LSTM network.

NanoSNP has the highest precision score and the second-highest recall and F1 score on each dataset compared to Clair, Clair3, Pepper-DeepVariant, and NanoCaller. NanoSNP extracts features from both the alignment before WhatsHap phasing and the phased alignment.





□ SpaGFT is a graph Fourier transform for tissue module identification from spatially resolved transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.12.10.519929v1

SpaGFT transforms complex gene expression patterns into simple, but informative signals, leading to the accurate identification of spatially variable genes (SVGs) at a fast computational speed.

SpaGFT generates a novel representation of GE and the corresponding spot graph topology in a Fourier space, which enables TM identification and enhances SVG prediction. The low-frequency SVG FM signals are selected as features to identify SVG clusters using Louvain clustering.





□ EnDecon: cell type deconvolution of spatially resolved transcriptomics data via ensemble learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac825/6957087

EnDecon obtains the ensemble result by alternately updating the ensemble result, as a weighted median of the base deconvolution results, and the weights of the base results, based on their distance from the ensemble result.
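A minimal sketch of such an alternating scheme is shown below. It is not EnDecon's implementation: the inverse-distance weight update, the L1 distance, the fixed iteration count and the toy proportions are all assumptions chosen to keep the example short.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]

def ensemble(base_results, n_iter=10, eps=1e-6):
    """base_results: one list of cell-type proportions per base method."""
    n_methods = len(base_results)
    weights = [1.0 / n_methods] * n_methods
    consensus = []
    for _ in range(n_iter):
        # Step 1: ensemble result as the entry-wise weighted median.
        consensus = [
            weighted_median([res[i] for res in base_results], weights)
            for i in range(len(base_results[0]))
        ]
        # Step 2: weights from each method's distance to the ensemble
        # (inverse L1 distance, an assumed update rule).
        dists = [
            sum(abs(a - b) for a, b in zip(res, consensus))
            for res in base_results
        ]
        inverse = [1.0 / (d + eps) for d in dists]
        total = sum(inverse)
        weights = [x / total for x in inverse]
    return consensus, weights

# Hypothetical proportions for two cell types from three base methods.
base = [[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]]
consensus, weights = ensemble(base)
```

The method that disagrees most with the consensus (the third one here) ends up with the smallest weight, so outlying base results have little influence on the final estimate.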

EnDecon correctly locates cell types to specific spatial regions, which are consistent with the gene expression patterns of the corresponding cell type marker genes. Furthermore, the regions enriched for each cell type are in line with the regions where that cell type is located.





□ STREAM: Enhancer-driven gene regulatory networks inference from single-cell RNA-seq and ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2022.12.15.520582v1

STREAM (Single-cell enhancer regulaTory netwoRk inference from gene Expression And ChroMatin accessibility) is a computational framework to infer enhancer-driven gene regulatory networks (eGRNs) from jointly profiled scRNA-seq and scATAC-seq data.

STREAM combines the Steiner forest problem (SFP) model and submodular optimization to discover enhancer-gene relations and TF-enhancer-gene relations, respectively, in a global optimization manner. STREAM formulates eGRN inference as detecting a set of hybrid biclusters.





□ CAbiNet: Joint visualization of cells and genes based on a gene-cell graph

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521232v1

“Correspondence Analysis based Biclustering on Networks” (CAbiNet) produces a joint visualization and co-clustering of cells and genes in a planar embedding. CAbiNet employs correspondence analysis (CA) to build a graph in which the nodes comprise both cells and genes.

Then a clustering algorithm determines the cell-gene clusters from the graph. Finally, the cells, genes and the clustering results are visualized in a 2D-embedding (biMAP). Cells and genes from the same cluster are colored identically in the biMAP.





□ scPROTEIN: A Versatile Deep Graph Contrastive Learning Framework for Single-cell Proteomics Embedding

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520366v1

scPROTEIN, a novel versatile framework composed of peptide uncertainty estimation based on a multi-task heteroscedastic regression model and cell embedding learning based on graph contrastive learning designed for single-cell proteomic data analysis.

scPROTEIN can construct a cell graph based on spatial proximity. scPROTEIN contains four major components: data augmentation, a GCN-based graph encoder, node-level graph contrastive learning, and an alternated topology-attribute denoising module.





□ Quantum-Si

>> https://ir.quantum-si.com/news-releases/news-release-details/quantum-si-announces-commercial-availability-platinumtm-worlds/

Introducing the world’s first next-generation single-molecule protein sequencing platform, Platinum™. The announcement linked above describes this simple-to-use system and its low price point, unique design, and advanced capabilities.

"by monitoring for amino-acid specific patterns in fluorescent probe behavior. This means that a single probe can be used for the robust identification of multiple distinct amino acids, including those containing post translational modifications."





□ Dissecting Complexity: The Hidden Impact of Application Parameters on Bioinformatics Research

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521257v1

SOMATA is a methodology to facilitate systematic exploration of the vast choice of configuration options; the authors apply it to three different tools on a range of scientific inquiries.

SOMATA involves Selecting tools and data, identifying Objective metrics, Modeling the parameter space, choosing a sample design Approach, Testing, and Analyzing. A single parameter — MaxO — was varied since that is intuitively related to growth, the output objective of interest.





□ DRfold: Integrating end-to-end learning with deep geometrical potentials for ab initio RNA structure prediction

>> https://www.biorxiv.org/content/10.1101/2022.12.30.522296v1

DRfold predicts RNA tertiary structures by simultaneous learning of local frame rotations and geometric restraints from experimentally solved RNA structures, where the learned knowledge is converted into a hybrid energy potential to guide subsequent RNA structure constructions.

The core of the DRfold pipeline is the introduction of two types of complementary potentials, i.e., FAPE potential and geometry potentials, from two separate transformer networks.

The former models directly predict the rotation matrix and the translation vector for the frames representing each nucleotide, forming an end-to-end learning strategy for RNA structure.





□ A Boolean Algebra for Genetic Variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad001/6967432

The authors define a comprehensive set of Boolean relations (equivalence, containment, overlap and disjoint) that partitions the domain of binary variant relations. Using these relations, additional variants of interest, i.e., variants with a specific relation to the queried variant, can be identified.

The relations can be computed efficiently using a novel algorithm that computes all minimal alignments. Filtering on the maximal influence interval allows for calculating the relations for all pairs of variants for an entire gene.





□ RGT: a toolbox for the integrative analysis of high throughput regulatory genomics data

>> https://www.biorxiv.org/content/10.1101/2022.12.31.522372v1

RGT provides three core classes to handle genomic regions and signals. Each genomic region is represented by the GenomicRegion class, and multiple regions are represented by the GenomicRegionSet class. Genomic signals are represented by the CoverageSet class.

Several tools have been developed on top of these classes, namely: HINT for analysis of ATAC/DNase-seq; RGT-viz for finding associations between chromatin experiments; TDF, a DNA/RNA triplex domain finder; THOR for differential peak calling; and Motif analysis for transcription factor binding site matching.





□ MuLan-Methyl: Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction

>> https://www.biorxiv.org/content/10.1101/2023.01.04.522704v1

The output of MuLan-Methyl is based on the average of the prediction probabilities obtained by transformer-based language models, namely BERT, DistilBERT, ALBERT, XLNet and ELECTRA. Each of the five language models is trained according to the “pre-train / fine-tune” paradigm.
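The averaging step itself is simple enough to sketch. The probabilities below are invented for illustration, the 0.5 decision threshold is an assumption, and `average_ensemble` is a hypothetical helper rather than part of MuLan-Methyl.

```python
def average_ensemble(prob_by_model, threshold=0.5):
    """Average per-model methylation probabilities and apply a decision threshold."""
    probs = list(prob_by_model.values())
    mean = sum(probs) / len(probs)
    return mean, mean >= threshold

# Made-up prediction probabilities for one candidate site.
site_probs = {
    "BERT": 0.9,
    "DistilBERT": 0.8,
    "ALBERT": 0.7,
    "XLNet": 0.6,
    "ELECTRA": 0.4,
}
mean_prob, is_methylated = average_ensemble(site_probs)
```

Averaging lets a single dissenting model (ELECTRA in this toy case) be outvoted without being ignored.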





□ ACIDES: In-silico monitoring of directed evolution convergence to unveil best performing variants with credibility score

>> https://www.biorxiv.org/content/10.1101/2023.01.03.522172v1

ACIDES (Accurate Confidence Intervals to rank Directed Evolution Scores) is a combination of statistical inference and in-silico simulations to reliably estimate the selectivity of individual variants and their statistical errors using the data from all available rounds.

ACIDES realizes a 50- to 70-fold improvement over the Poisson model in the predictive ability of the NGS sampling noise. ACIDES uses simulations to quantify a Rank Robustness (RR), a measure of the quality of the selection convergence.





□ ElasticBLAST: Accelerating Sequence Search via Cloud Computing

>> https://www.biorxiv.org/content/10.1101/2023.01.04.522777v1

One of the ElasticBLAST parameters that is critical to its performance is the batch length, which specifies the number of bases or residues per query batch. ElasticBLAST automatically selects an appropriate instance type for a search, based on database metadata and the BLAST program.
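To illustrate what a batch length means in practice, the sketch below groups query sequences into batches whose total length stays within a limit where possible. The greedy rule, the function name and the toy queries are assumptions; the source does not describe ElasticBLAST's actual splitting logic.

```python
def split_into_batches(queries, batch_len):
    """Greedily group (name, sequence) pairs so each batch holds about batch_len residues.

    A single query longer than batch_len still gets a batch of its own.
    """
    batches, current, size = [], [], 0
    for name, seq in queries:
        if current and size + len(seq) > batch_len:
            batches.append(current)
            current, size = [], 0
        current.append(name)
        size += len(seq)
    if current:
        batches.append(current)
    return batches

# Toy queries; real batch lengths would be far larger.
queries = [("q1", "ACGT"), ("q2", "ACGT"), ("q3", "ACGT"), ("q4", "ACGTACGTAC")]
batches = split_into_batches(queries, batch_len=8)
```

Smaller batches give more units of work to distribute across instances, at the cost of more per-batch overhead, which is why the parameter matters for performance.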





Enigma.

2022-12-31 22:13:17 | Science News

(Generated by Midjourney)



□ DRAGON: Determining Regulatory Associations using Graphical models on multi-Omic Networks

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1157/6931867

DRAGON calibrates its parameters to achieve an optimal trade-off between the network’s complexity and estimation accuracy, while explicitly accounting for the characteristics of each of the assessed omics ‘layers.’

DRAGON is a partial correlation framework. One possible extension of DRAGON is to Mixed Graphical Models, which incorporate both continuous and discrete variables. DRAGON adapts to edge density and feature size differences between omics layers, improving model inference and edge recovery.





□ Sparse RNNs can support high-capacity classification

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010759

A sparsely connected recurrent neural network (RNN) can perform classification in a distributed manner without ever bringing all of the relevant information to a single convergence site.

To investigate capacity and accuracy, networks were trained by back-propagation through time (BPTT). Hebbian-based sparse RNN readout accumulates evidence while the stimulus is on and amplifies the response when a +1-labeled input is shown.





□ Detecting bifurcations in dynamical systems with CROCKER plots

>> https://aip.scitation.org/doi/abs/10.1063/5.0102421

The CROCKER plot was developed in the context of dynamic metric spaces. The additional restrictions mean that the time-varying point clouds under study have labels on vertices from one parameter value to the next, allowing for more available theoretical results on continuity.

The CROCKER plot can be used for understanding bifurcations in dynamical systems. This construction is closely related to the 1-Wasserstein distance used for persistence diagrams, and the authors make connections between this and the maximum Lyapunov exponent, a commonly used measure for chaos.





□ novoRNABreak: local assembly for novel splice junction and fusion transcript detection from RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.12.16.520791v1

novoRNABreak is based on a local assembly model, which offers a tradeoff between the alignment-based and de novo whole transcriptome assembly (WTA) approaches, namely, being more sensitive in assembling novel junctions that cannot be directly aligned.

novoRNABreak modifies novoBreak, a well-attested assembler of genomic structural variation breakpoints, to assemble novel junctions. The assembled contigs, which are considerably longer than raw reads, are aligned against the human genomic reference from Ensembl using the Burrows-Wheeler Aligner.





□ Syntenet: an R/Bioconductor package for the inference and analysis of synteny networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac806/6947985

syntenet infers synteny networks from whole-genome protein sequence data. syntenet offers a simple and complete framework, incl. data preprocessing, synteny detection and network inference, network clustering and phylogenomic profiling, and microsynteny-based phylogeny inference.

Network clustering is performed with the Infomap algorithm by default, which has been demonstrated as the best clustering for synteny networks, but users can also specify other algorithms implemented in the igraph, such as Leiden, label propagation, Louvain, and edge betweenness.





□ HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521552v1

HAPNEST simulates genotypes by resampling a set of existing reference genomes, according to a stochastic model that approximates the underlying processes of coalescent, recombination and mutation.

HAPNEST enables simulation of diverse biobank-scale datasets, as well as simultaneous generation of multiple genetically correlated traits with population-specific effects under different pleiotropy models. HAPNEST uses a model inspired by the sequential Markovian coalescent model.





□ SnapFISH: a computational pipeline to identify chromatin loops from multiplexed DNA FISH data

>> https://www.biorxiv.org/content/10.1101/2022.12.16.520793v1

SnapFISH collects the 3D localization coordinates of each genomic segment targeted by FISH and computes the pairwise Euclidean distances between all imaged targeted loci. SnapFISH compares the pairwise Euclidean distances between the pair of interest and its local neighborhood region.

SnapFISH converts the resulting P-values into FDRs, and uses them to define pairs of targeted segments as loop candidates. Lastly, SnapFISH groups nearby loop candidates into clusters, identifies the pair with the lowest FDR within each cluster, and uses these summits as the final list of chromatin loops.
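Two of these steps are easy to sketch: the pairwise Euclidean distances within one cell, and the conversion of P-values to FDRs. The Benjamini-Hochberg procedure is used below as an assumption, since the source only says that P-values are converted into FDRs; the function names and numbers are illustrative and this is not SnapFISH's code.

```python
import math

def pairwise_distances(coords):
    """coords: {locus: (x, y, z)} for one cell; returns {(a, b): Euclidean distance}."""
    loci = sorted(coords)
    return {
        (a, b): math.dist(coords[a], coords[b])
        for i, a in enumerate(loci)
        for b in loci[i + 1:]
    }

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

distances = pairwise_distances({"A": (0.0, 0.0, 0.0), "B": (3.0, 4.0, 0.0)})
fdrs = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Pairs whose FDR falls below a chosen cutoff would then be carried forward as loop candidates for the clustering step.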





□ SURGE: Uncovering context-specific genetic-regulation of gene expression from single-cell RNA-sequencing using latent-factor models

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521678v1

SURGE (Single-cell Unsupervised Regulation of Gene Expression) is a novel probabilistic model that uses matrix factorization to learn a continuous representation of the cellular contexts that modulate genetic effects.

SURGE achieves this goal by leveraging information across genome-wide variant-gene pairs to jointly learn both a continuous representation of the latent cellular contexts defining each measurement and the interaction eQTL effect sizes corresponding to each SURGE latent context.





□ ReSort: Accurate cell type deconvolution in spatial transcriptomics using a batch effect-free strategy

>> https://www.biorxiv.org/content/10.1101/2022.12.15.520612v1

ReSort is a Region-based cell type Sorting strategy that creates a pseudo-internal reference by extracting primary molecular regions from the ST data and leaving out spots that are likely to be mixtures.

By detecting these regions with diverse molecular profiles, ReSort can approximate the pseudo-internal reference to accurately estimate the composition at each spot, bypassing an external reference that could introduce technical noise.





□ Fast two-stage phasing of large-scale sequence data

>> https://www.cell.com/ajhg/fulltext/S0002-9297(21)00304-9

The method uses marker windowing and composite reference haplotypes. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations.

The method employs an HMM with a parsimonious state space of composite reference haplotypes. It uses a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage.





□ Mabs, a suite of tools for gene-informed genome assembly

>> https://www.biorxiv.org/content/10.1101/2022.12.19.521016v1

Mabs tries to find values of parameters of a genome assembler that maximize the number of accurately assembled BUSCO genes. BUSCO is a program that is supplied with a number of taxon-specific datasets that contain orthogroups whose genes are present and single-copy.

Mabs-hifiasm is intended for assembly using PacBio HiFi reads, while Mabs-flye is intended for assembly using reads of more error-prone technologies, namely Oxford Nanopore Technologies and PacBio CLR. Mabs reduces the number of haplotypic duplications.





□ BioNumPy: Fast and easy analysis of biological data with Python

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521373v1

BioNumPy is able to efficiently load biological datasets (e.g. FASTQ-files, BED-files and BAM-files) into NumPy-like data structures, so that NumPy operations like indexing, vectorized functions and reductions can be applied to the data.

A RaggedArray is similar to a NumPy array/matrix but can represent a matrix consisting of rows with varying lengths. An EncodedRaggedArray supports storing and operating on non-numeric data (e.g. DNA-sequences) by encoding the data and keeping track of the encoding.
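The data-structure idea can be shown without the library. The class below is a hypothetical, pure-Python stand-in (not BioNumPy's `RaggedArray` or its API): rows of different lengths are stored as one flat list plus row offsets, and sequences are encoded as small integers with the encoding kept alongside.

```python
class TinyRagged:
    """Rows of varying length stored as one flat list plus row offsets."""

    def __init__(self, rows):
        self.data = [x for row in rows for x in row]
        self.offsets = [0]
        for row in rows:
            self.offsets.append(self.offsets[-1] + len(row))

    def __len__(self):
        return len(self.offsets) - 1

    def row(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]]

    def row_lengths(self):
        return [b - a for a, b in zip(self.offsets, self.offsets[1:])]

# A fixed encoding for DNA letters, tracked next to the encoded data.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODING = {code: base for base, code in ENCODING.items()}

def encode(seqs):
    return TinyRagged([[ENCODING[base] for base in seq] for seq in seqs])

def decode_row(ragged, i):
    return "".join(DECODING[code] for code in ragged.row(i))

reads = encode(["ACGT", "GG", "TTA"])
# A whole-dataset reduction works on the flat data without looping over rows.
gc_total = sum(1 for code in reads.data if code in (ENCODING["G"], ENCODING["C"]))
```

Keeping everything in one flat buffer is what allows vectorized operations in a NumPy-backed implementation; the offsets recover the row structure when needed.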





□ BUSZ: Compressed BUS files

>> https://www.biorxiv.org/content/10.1101/2022.12.19.521034v1

BUSZ is a binary file consisting of a header, followed by zero or more compressed blocks of BUS records, ending with an empty block. The BUSZ header includes all information from the BUS header, along with compression parameters. BUSZ files have a different magic number than BUS files.

The algorithm assumes a sorted input. The input is sorted lexicographically by barcodes first, then by UMIs, and finally by the equivalence classes. Within each block, the columns are compressed independently, each with a customized compression-decompression codec.
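The sketch below illustrates why sorted input and per-column treatment help. Records are sorted by barcode, UMI and equivalence class, split into columns, and one column is delta-encoded. Delta encoding is used here only as an example of a column-specific transform and is an assumption; it is not a description of the actual BUSZ codecs. Barcodes and UMIs are shown as plain integers, and all names are hypothetical.

```python
def sort_records(records):
    """records: (barcode, umi, equivalence_class, count) tuples."""
    return sorted(records, key=lambda r: (r[0], r[1], r[2]))

def to_columns(records):
    """Transpose rows into one list per column."""
    return [list(col) for col in zip(*records)]

def delta_encode(column):
    """Store differences between neighbours; a sorted column yields small values."""
    return [b - a for a, b in zip([0] + column[:-1], column)]

def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

records = [(5, 2, 1, 1), (3, 9, 0, 2), (3, 1, 4, 1)]
ordered = sort_records(records)
barcodes, umis, ecs, counts = to_columns(ordered)
barcode_deltas = delta_encode(barcodes)
```

After sorting, the barcode column is non-decreasing, so its deltas are small and often zero, which suits a different codec than the more irregular UMI or count columns.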





□ CETYGO: Uncertainty quantification of reference-based cellular deconvolution algorithms

>> https://www.tandfonline.com/doi/full/10.1080/15592294.2022.2137659

An accuracy metric that quantifies the CEll TYpe deconvolution GOodness (CETYGO) score of a set of cellular heterogeneity variables derived from a genome-wide DNAm profile for an individual sample.

CETYGO is defined as the root mean square error (RMSE) between the observed bulk DNAm profile and the expected profile across the M cell-type-specific DNAm sites used to perform the deconvolution, where the expected profile is calculated from the estimated proportions for the N cell types.
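
As a small worked example of this definition, the expected profile is the reference matrix (M sites by N cell types) multiplied by the estimated proportions, and the score is the RMSE against the observed profile. The function names and toy numbers below are made up for illustration.

```python
import math


def expected_profile(reference, proportions):
    """reference[m][n] is the DNAm level at site m in cell type n."""
    return [sum(r * p for r, p in zip(row, proportions)) for row in reference]


def cetygo_score(observed, reference, proportions):
    """RMSE between the observed and the expected bulk profile."""
    expected = expected_profile(reference, proportions)
    squared = [(o - e) ** 2 for o, e in zip(observed, expected)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: M = 3 sites, N = 2 cell types.
reference = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
proportions = [0.75, 0.25]
observed = [0.7, 0.35, 0.5]
score = cetygo_score(observed, reference, proportions)
```

A score near zero means the estimated proportions reproduce the bulk profile well; larger values flag samples where the deconvolution is less trustworthy.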





□ CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010788

CONGA (Copy Number Variation Genotyping in Ancient Genomes and Low-coverage Sequencing Data), a CNV genotyping algorithm tailored for ancient and other low coverage genomes, which estimates copy number beyond presence/absence of events.

CONGA first calculates the number of reads mapped to each given interval in the reference genome, which we call “observed read-depth”. It then calculates the “expected diploid read-depth”, i.e., the GC-content normalized read-depth given the genome average.

CONGA calculates the likelihood for each genotype by modeling the read-depth distribution as Poisson. CONGA uses a split-read step in order to utilize paired-end information. It splits reads and remaps the split within the genome, treating the two segments as paired-end reads.
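
The Poisson read-depth likelihood can be sketched as follows. The copy-number range 0 to 4, the linear scaling of the expected depth with copy number, and the `error_rate` floor are assumptions made for this illustration, not CONGA's exact model.

```python
import math


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing k events at rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def genotype_log_likelihoods(observed_depth, expected_diploid_depth, error_rate=0.01):
    """Log-likelihood of copy numbers 0..4 for one interval."""
    result = {}
    for copy_number in range(5):
        # Expected depth scales with copy number; a small floor
        # (hypothetical error_rate) keeps the rate positive for copy number 0.
        lam = expected_diploid_depth * max(copy_number / 2, error_rate)
        result[copy_number] = poisson_log_pmf(observed_depth, lam)
    return result


def call_copy_number(observed_depth, expected_diploid_depth):
    """Pick the copy number with the highest log-likelihood."""
    scores = genotype_log_likelihoods(observed_depth, expected_diploid_depth)
    return max(scores, key=scores.get)
```

For instance, an interval with about half the expected diploid depth is best explained by a heterozygous deletion (copy number 1).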





□ motifNet: Functional motif interactions discovered in mRNA sequences with implicit neural representation learning

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521305v1

Many existing neural network models for mRNA event prediction only take the sequence as input, and do not consider the positional information of the sequence.

motifNet is a lightweight neural network that uses both the sequence and its positional information as input. This allows for the implicit neural representation of the various motif interaction patterns in human mRNA sequences.





□ SCIBER: a simple method for removing batch effects from single-cell RNA-sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac819/6957084

SCIBER (Single-Cell Integrator and Batch Effect Remover) matches cell clusters across batches according to the overlap of their differentially expressed genes. SCIBER is a simple method that outputs the batch-effect corrected expression data in the original space/dimension.

SCIBER is computationally more efficient than Harmony, LIGER, and Seurat, and it scales to datasets with a large number of cells. SCIBER can be further accelerated by replacing K-means with a more efficient clustering algorithm or using a more efficient implementation of K-means.





□ CODA: a combo-Seq data analysis workflow

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac582/6955042

CODA (Combo-seq Data Analysis) is a custom-tailored workflow for the processing of Combo-Seq data that uses existing tools commonly used in RNA-Seq data analysis; the authors compared it to exceRpt.

Because of the chosen trimmer, the maximum read length of trimmed reads when using CODA is higher than the one with exceRpt, and it results in more reads successfully passing. This is more dramatic the shorter the sequenced reads are.

This tends to affect gene-mapping reads, rather than miRNA mapping ones: The absolute number of reads mapping to genes increases, especially for shorter sequencing reads, where the proportion of reads with an incomplete/missing adapter increases.





□ NetSHy: Network Summarization via a Hybrid Approach Leveraging Topological Properties

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac818/6957083

NetSHy applies principal component analysis (PCA) on a combination of the node profiles and the well-known Laplacian matrix derived directly from the network similarity matrix to extract a summarization at a subject level.





□ Redeconve: Spatial transcriptomics deconvolution at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521551v1

Redeconve, a new algorithm to estimate the cellular composition of ST spots. Redeconve introduces a regularizing term to solve the collinearity problem of high-resolution deconvolution, with the assumption that similar single cells have similar abundance in ST spots.

Redeconve is a quadratic programming model for single-cell deconvolution. A regularization term in the deconvolution model is based on non-negative least regression. Redeconve further improves the accuracy of estimated cell abundance based on a ground truth by nucleus counting.





□ CRAM compression: practical across-technologies considerations for large-scale sequencing projects

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521516v1

Using CRAM for the Emirati Genome Program, which aims to sequence the genomes of ~1 million nationals in the United Arab Emirates using short- and long-read sequencing technologies (Illumina, MGI and Oxford Nanopore Sequencing).





□ SIMBSIG: Similarity search and clustering for biobank-scale data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac829/6958553

“SIMBSIG = SIMmilarity Batched Search Integrated GPU”, which can efficiently perform nearest neighbour searches, principal component analysis (PCA), and K-Means clustering on central processing units (CPUs) and GPUs, both in-core and out-of-core.




□ Igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV)

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac830/6958554

igv.js is an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV). It can be easily dropped into any web page with a single line of code and has no external dependencies.

igv.js supports a wide range of genomic track types and file formats, including aligned reads, variants, coverage, signal peaks, annotations, eQTLs, GWAS, and copy number variation. A particular strength of IGV is manual review of genome variants, both single-nucleotide and structural variants.





□ A Pairwise Strategy for Imputing Predictive Features When Combining Multiple Datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac839/6964381

This method maximizes common genes for imputation based on the intersection between two studies at a time. This method has significantly better performance than the omitting and merged methods in terms of the Root Mean Square Error of prediction on an external validation set.





□ Sc2Mol: A Scaffold-based Two-step Molecule Generator with Variational Autoencoder and Transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac814/6964383

Sc2Mol, a generative model-based molecule generator without any prior scaffold patterns. Sc2Mol uses SMILES strings for molecules. It consists of two steps: scaffold generation and scaffold decoration, which are carried out by a variational autoencoder and a transformer.





□ scAVENGERS: a genotype-based deconvolution of individuals in multiplexed single-cell ATAC-seq data without reference genotypes

>> https://academic.oup.com/nargab/article/4/4/lqac095/6965979

scAVENGERS (scATAC-seq Variant-based EstimatioN for GEnotype ReSolving) introduces an appropriate read alignment tool, variant caller, and mixture model to appropriately process the demultiplexing of scATAC-seq data.

scAVENGERS uses SciPy's sparse matrix structure to enable large data processing. When scAVENGERS selects alternative allele counts to maximize the expected value of the total log-likelihood, a probability value of zero inevitably appears during the calculation.





□ gget: Efficient querying of genomic reference databases

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac836/6971843

gget, a free and open-source software package that queries information stored in several large, public databases directly from a command line or Python environment.

gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying required for genomic data analysis in a single line of code.





□ Metadata retrieval from sequence databases with ffq

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac667/6971839

ffq efficiently fetches metadata and links to raw data in JSON format. ffq’s modularity and simplicity make it extensible to any genomic database exposing its data for programmatic access.





□ MinNet: Single-cell multi-omics integration for unpaired data by a siamese network with graph-based contrastive loss

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05126-7

MinNet is a novel Siamese neural network design for single-cell multi-omics sequencing data integration. It ranked top among other methods in benchmarking and is especially suitable for integrating datasets with batch and biological variances.

MinNet reduces the distance between similar cells and separates different cells in the n-dimensional space. The distances between corresponding cells get smaller while the distances between negative pairs get larger. In this way, the main biological variance is kept in the co-embedding space.





□ NetAct: a computational platform to construct core transcription factor regulatory networks using gene activity

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02835-3

NetAct infers regulators’ activity using target expression, constructs networks based on transcriptional activity, and integrates mathematical modeling for validation. NetAct infers TF activity for an individual sample directly from the expression of genes targeted by the TF.

NetAct calculates a TF's activity using the mRNA expression of its direct targets. NetAct is robust against some inaccuracy in the TF-target database and noise in gene expression data, because it can filter out irrelevant targets while retaining key targets.





□ RabbitVar: ultra-fast and accurate somatic small-variant calling on multi-core architectures

>> https://www.biorxiv.org/content/10.1101/2023.01.06.522980v1

RabbitVar features a heuristic-based calling method and a subsequent machine-learning-based filtering strategy. RabbitVar has also been highly optimized by featuring multi-threading, a high-performance memory allocator, vectorization, and efficient data structures.




□ The probability of edge existence due to node degree: a baseline for network-based predictions

>> https://www.biorxiv.org/content/10.1101/2023.01.05.522939v1

The framework decomposes performance into the proportions attributable to degree. The edge prior can be estimated using the fraction of permuted networks in which a given edge exists—the maximum likelihood estimate for the binomial distribution success probability.

The XSwap algorithm is modified by adding two parameters, allow_loops and allow_antiparallel, that allow a greater variety of network types to be permuted. The edge swap mechanism uses a bitset to avoid producing edges which violate the conditions for a valid swap.
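
The edge prior can be sketched as the fraction of degree-preserving permutations in which a given edge appears. This is a simplified illustration for undirected networks only: it uses a Python set instead of a bitset, omits allow_antiparallel (which concerns directed networks), and the function names `xswap_permute`, `edge_prior` and `degrees` are hypothetical.

```python
import random


def degrees(edges):
    """Count how many edge endpoints touch each node."""
    counts = {}
    for u, v in edges:
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    return counts


def xswap_permute(edges, n_swaps, rng, allow_loops=False):
    """Degree-preserving permutation of an undirected edge list by edge swaps."""
    edges = [tuple(sorted(e)) for e in edges]
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Randomly orient the second edge so both rewirings are reachable.
        if rng.random() < 0.5:
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if not allow_loops and (new1[0] == new1[1] or new2[0] == new2[1]):
            continue  # the swap would create a self-loop
        if new1 in present or new2 in present or new1 == new2:
            continue  # the swap would create a duplicate edge
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges


def edge_prior(edges, pair, n_permutations=200, seed=0):
    """Fraction of permuted networks that contain the given node pair as an edge."""
    rng = random.Random(seed)
    pair = tuple(sorted(pair))
    hits = sum(
        pair in set(xswap_permute(edges, 10 * len(edges), rng))
        for _ in range(n_permutations)
    )
    return hits / n_permutations


edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]
permuted = xswap_permute(edges, 100, random.Random(1))
```

Because every swap keeps each node's degree fixed, pairs of high-degree nodes end up connected in a larger share of permutations, which is exactly the degree effect the prior is meant to capture.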





□ HiDDEN: A machine learning label refinement method for detection of disease-relevant populations in case-control single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.01.06.523013v1

HiDDEN refines the case-control labels to accurately reflect the perturbation status of each cell. HiDDEN shows a superior ability to recover biological signals missed by the standard analysis workflow in simulated ground-truth datasets of cell type mixtures.



□ Hetnet connectivity search provides rapid insights into how two biomedical entities are related

>> https://www.biorxiv.org/content/10.1101/2023.01.05.522941v1

The DWPC is transformed across all source-target node pairs for a metapath to yield a distribution that is more compact and amenable to modeling. A path score heuristic is then calculated, which can be used to compare the importance of paths between metapaths.





□ scEMAIL: Universal and Source-free Annotation Method for scRNA-seq Data with Novel Cell-type Perception

>> https://www.sciencedirect.com/science/article/pii/S1672022922001747

scEMAIL, a universal transfer learning-based annotation framework for scRNA-seq data, which incorporates expert ensemble novel cell-type perception and local affinity constraints of multi-order, with no need for source data.

scEMAIL can deal with atlas-level datasets with mixed batches. scEMAIL achieved intra-cluster compactness and inter-cluster separation, which indicated that the affinity constraints guide the network to learn the correct intercellular relationships.





□ RCL: Unsupervised Contrastive Peak Caller for ATAC-seq

>> https://www.biorxiv.org/content/10.1101/2023.01.07.523108v1

RCL uses ResNet as the backbone module with only five layers, making the network architecture shallow but efficient. RCL showed no problems with class imbalance, probably because the region selection step effectively discards nonpeak regions and balances the data.

RCL could be extended to take coverage vectors for multiple fragment lengths, the fragments themselves, or even annotation information, as used by the supervised method CNN-Peaks.







The Wonder.

2022-12-31 22:10:10 | Movies


□ 『The Wonder (聖なる証)』

>> https://www.netflix.com/jp/title/81426931

Directed by Sebastián Lelio
Based on the book by Emma Donoghue
Written by Emma Donoghue / Sebastián Lelio / Alice Birch
Music by Matthew Herbert
Cinematography by Ari Wegner

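Nineteenth-century Ireland: a girl who embodies a divine miracle, and the nurse charged with watching her. From the opening, the film reveals itself to be metafiction. A deconstruction of the symbols that prop up deception and structures of control. Are we, the observers, on the outside, or still trapped within? A nested structure that merely repeats has neither past nor future.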







Les Traducteurs.

2022-12-31 17:48:11 | Movies


□ 『Les Traducteurs』(9人の翻訳家)

Directed by  Régis Roinsard

Writing by
Romain Compingt
Daniel Presley
Régis Roinsard

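A Franco-Belgian crime mystery. The switch from confined-situation thriller to narrative trick is deft, but the film is conventional enough that I could guess the ending fairly early on. Literary references are sprinkled in at just the right dose, and the James Joyce quotation in particular is set up to make things click, which is a pleasure for readers of Joyce.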

and then I asked him with my eyes to ask again yes and then he asked me would I yes to say yes my mountain flower and first I put my arms around him yes and drew him down to me so he could feel my breasts all perfume yes and his heart was going like mad and yes I said yes I will Yes.

─Episode 18: Penelope. James Joyce, Ulysses.

Avatar: The Way of Water

2022-12-30 00:12:12 | Movies


□ 『Avatar: The Way of Water』(4K3D+HFR)

>> https://www.avatar.com/movies/avatar-the-way-of-water

Directed by James Cameron
Produced by Richard Baneham / Jon Landau
Music by Simon Franglen
Cinematography by Russell Carpenter


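A realm of imagery that humanity has never reached before. A world of "water" drifting in vivid color. The shimmer of the water's surface and the vitality of aquatic creatures rendered in HFR are truly a visualized (shareable) treasure of a fantasy experience. Dream and reality, past and future, myth and science fiction eventually become one. Conflict and peace, destruction and atonement, sacrifice and the far shore all dissolve together there. I want to see where this epic finally arrives.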



□ The Weeknd - Nothing Is Lost (You Give Me Strength) (Official Visualizer)


□ Simon Franglen - From Darkness to Light (From "Avatar: The Way of Water")

In film and music, there is catharsis in the moment when a theme or monologue presented early on is reprised in a different scene near the end.







METANOIA.

2022-12-13 23:13:31 | Science News





□ BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05051-9

BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax) models the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation introduces topological features of language and is no longer concerned only with the distance between words in the sequence.

First, BioByGANS uses periods to segment sentences and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy.

A graph attention network is then used to generate a fusing representation considering both the contextual features and syntactic features. Last, a softmax function is used to calculate the probabilities.





□ CARNAGE: Investigating graph neural network for RNA structural embedding

>> https://www.biorxiv.org/content/10.1101/2022.12.02.515916v1

CARNAGE (Clustering/Alignment of RNA with Graph-network Embedding), which leverages a graph neural network encoder to imprint structural information into a sequence-like embedding; therefore, downstream sequence analyses now account implicitly for structural constraints.

CARNAGE creates a graph G = (V,E,U), where nodes V are unit-vectors encoding the nucleotide identity. For each node/nucleotide, two rounds of message passing network aggregate information. All the node vectors are concatenated to form the Si-seq.





□ bmVAE: a variational autoencoder method for clustering single-cell mutation data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac790/6881080

bmVAE infers the low-dimensional representation of each cell by minimizing the Kullback-Leibler divergence loss and reconstruction loss (measured using cross-entropy). bmVAE takes single-cell binary mutation data as inputs, and outputs inferred cell subpopulations as well as their genotypes.

bmVAE employs a VAE model to learn latent representation of each cell in a low-dimensional space, then uses a Gaussian mixture model (GMM) to find clusters of cells, finally uses a Gibbs sampling based approach to estimate genotypes of each subpopulation in the latent space.
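
The two loss terms mentioned above can be written out for a single cell. This is a generic VAE objective under the usual assumptions (diagonal Gaussian posterior, standard normal prior, Bernoulli outputs), offered as a sketch and not taken from the bmVAE code; the function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Reconstruction loss for a binary mutation vector x and predicted probabilities x_hat."""
    return -sum(
        xi * math.log(max(p, eps)) + (1 - xi) * math.log(max(1 - p, eps))
        for xi, p in zip(x, x_hat)
    )


def vae_loss(x, x_hat, mu, log_var):
    """Total loss: reconstruction term plus KL regulariser."""
    return binary_cross_entropy(x, x_hat) + kl_to_standard_normal(mu, log_var)
```

The KL term vanishes when the encoder outputs exactly the prior (zero mean, unit variance), so any remaining loss then comes from reconstruction alone.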





□ rcCAE: a convolutional autoencoder based method for detecting tumor clones and copy number alterations from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.04.519013v1

rcCAE uses a convolutional encoder network to project the log2-transformed read counts (LRC) into a low-dimensional latent space where the cells are clustered into distinct subpopulations through a Gaussian mixture model.

rcCAE leverages a convolutional decoder network to recover the read counts from learned latent representations. rcCAE employs a novel hidden Markov model to jointly segment the genome and infer absolute copy number for each segment.

rcCAE directly deciphers ITH from original read counts, which avoids potential error propagation from copy number analysis to ITH inference. After the algorithm converges, the copy number of each bin is deduced from the state that has the maximum posterior probability.





□ gtexture: Haralick texture analysis for graphs and its application to biological networks

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517417v1

The authors present a method for calculating GLCM equivalents and Haralick texture features and apply it to several network types. They translate co-occurrence matrix analysis to generic networks for the first time.

If the number of distinct node weights is w, the dimension of the co-occurrence matrix C is w × w. Co-occurrence matrices summarize a network when the number of distinct node weights is less than the number of nodes.
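
A small sketch of a graph co-occurrence matrix and one Haralick feature (contrast) computed from it. Counting each undirected edge in both directions and using the rank of the weight level as the index are assumptions of this illustration, not necessarily what gtexture does; the function names are hypothetical.

```python
def cooccurrence_matrix(edges, node_weight):
    """Count how often each pair of node-weight levels is joined by an edge."""
    levels = sorted(set(node_weight.values()))
    index = {w: i for i, w in enumerate(levels)}
    size = len(levels)
    matrix = [[0] * size for _ in range(size)]
    for u, v in edges:
        i, j = index[node_weight[u]], index[node_weight[v]]
        # Undirected edges are counted in both directions, giving a symmetric matrix.
        matrix[i][j] += 1
        matrix[j][i] += 1
    return levels, matrix


def contrast(matrix):
    """Haralick contrast: sum of (i - j)^2 * p(i, j) over the normalised matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(
        (i - j) ** 2 * count / total
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )


# Toy 4-node cycle with three distinct node weights, so C is 3 x 3.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
node_weight = {"a": 1, "b": 1, "c": 2, "d": 3}
levels, C = cooccurrence_matrix(edges, node_weight)
```

Contrast is zero when every edge joins nodes of the same weight level and grows as edges connect increasingly different levels.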

gtexture reduces the number of unique node weights, including node weight binning options for continuous node weights. Continuous data can be transformed via several discretisation methods.

The Haralick features calculated on different landscapes and networks of the same size but with different topologies vary. Although highly specific methods designed for detecting landscape ruggedness exist, this discretization and co-occurrence matrix method is more generalizable.





□ CRMnet: a deep learning model for predicting gene expression from large regulatory sequence datasets

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518786v1

CRMnet adapts a Transformer-encoded U-Net from the image semantic segmentation task and applies it to genomic sequences as a feature extractor. CRMnet utilizes transformer encoders, which leverage self-attention mechanisms to extract additional useful information from genomic sequences.

CRMnet consists of Squeeze and Excitation (SE) Encoder Blocks, Transformer Encoder Blocks, SE Decoder Blocks, SE Block and Multi-Layer Perceptron (MLP). CRMnet has an initial encoding stage that extracts feature maps at progressively lower dimensions.

A decoder stage then upscales these feature maps back to the original sequence dimension, concatenating them with the higher-resolution feature maps of the encoder at each level to retain prior information despite the sparse upscaling.





□ SRGS: sparse partial least squares-based recursive gene selection for gene regulatory network inference

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-09020-7

SRGS, SPLS (sparse partial least squares)-based recursive gene selection, to infer GRNs from bulk or single-cell expression data. SRGS recursively selects and scores the genes which may have regulations on the considered target gene based on SPLS.

The authors randomly scramble samples, set some values in the expression matrix to zero, and generate multiple copies of the data through multiple iterations.





□ WINC: M-Band Wavelet-Based Imputation of scRNA-seq Matrix and Multi-view Clustering of Cell

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519090v1

WINC integrates M-band wavelet analysis and UMAP to a panel of single cell sequencing datasets via breaking up the data matrix into a trend (low frequency or low resolution) component and (M − 1) fluctuation (high frequency or high resolution) components.

This strategy resolves the notorious chaotic sparsity of the droplet RNA-Seq matrix and uncovers missed or rare cell types, identities and states. WINC is a non-parametric wavelet-based imputation algorithm for sparse data that integrates M-band orthogonal wavelets to recover dropout events.





□ DeepPHiC: Predicting promoter-centered chromatin interactions using a novel deep learning approach

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac801/6887158

DeepPHiC adopts a “shared knowledge transfer” strategy for training the multi-task learning model. When tissue A/B is of interest, it aggregates all chromatin interactions from other tissues except tissue A/B to pretrain the shared feature extractor.

DeepPHiC consists of three types of input features, which include genomic sequence and epigenetic signal in the anchors as well as anchor distance. DeepPHiC uses one-hot encoding for the genomic sequence. As a result, the genomic sequence is converted into a 2000 × 4 matrix.
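
One-hot encoding of a 2000-bp anchor sequence can be sketched as below. The column order A, C, G, T and the all-zero row for unknown bases such as N are assumptions of this example, not details confirmed for DeepPHiC.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Return a len(sequence) x 4 matrix; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


anchor = "ACGTN" * 400  # a 2000-bp toy sequence
matrix = one_hot(anchor)
```

Each row has at most a single 1, so the matrix carries exactly the sequence identity and nothing else.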

The network architecture of DeepPHiC is based on DenseNet. DeepPHiC uses a ResNet-style structure with skip connections. During backpropagation, each layer has direct access to the output gradients, resulting in faster network convergence.





□ DPMUnc: Bayesian clustering with uncertain data

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519476v1

Dirichlet Process Mixtures with Uncertainty (DPMUnc), an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points.

DPMUnc outperformed its comparators kmeans and mclust by a small margin when observation noise and cluster variance were small; the margin increased with increasing cluster variance or observation noise.

DPMZeroUnc is the adjusted version of the datasets where the uncertainty estimates were shrunk to 0. The latent variables are essentially fixed to be equal to the observed data points throughout.





□ LAST: Latent Space-Assisted Adaptive Sampling for Protein Trajectories

>> https://pubs.acs.org/doi/10.1021/acs.jcim.2c01213

LAST accelerates the exploration of protein conformational space. This method comprises cycles of (i) variational autoencoder training, (ii) seed structure selection on the latent space, and (iii) conformational sampling through additional Molecular dynamics simulations.

In metastable ADK simulations, LAST explored two transition paths toward two stable states, while SDS explored only one and cMD neither. In VVD light state simulations, LAST was three times faster than cMD simulation with a similar conformational space.





□ FiniMOM: Genetic fine-mapping from summary data using a non-local prior improves detection of multiple causal variants

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518898v1

FiniMOM (fine-mapping using a product inverse-moment priors), a novel Bayesian fine-mapping method for summarized genetic associations. The method uses a non-local inverse-moment prior, which is a natural prior distribution to model non-null effects in finite samples.

FiniMOM allows a non-zero probability for all variables, instead of considering only the variables that correlate highly with the residuals of the current model.

FiniMOM’s sampling scheme is related to the reversible jump MCMC algorithm; however, this formulation and the use of Laplace’s method avoid complicated sampling from a varying-dimensional model space.





□ DeepCellEss: Cell line-specific essential protein prediction with attention-based interpretable deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac779/6865030

DeepCellEss utilizes convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability.

DeepCellEss converts a protein sequence into a numerical matrix using one-hot encoding. The multi-head self-attention is used to produce residue-level attention scores. After this, a bi-LSTM module is applied to model sequential data by learning long-range dependencies.





□ DiffDomain enables identification of structurally reorganized topologically associating domains

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519135v1

DiffDomain, an algorithm leveraging high-dimensional random matrix theory to identify structurally reorganized TADs using chromatin contact maps. DiffDomain outperforms alternative methods for FPRs, TPRs, and identifying a new subtype of reorganized TADs.

DiffDomain directly computes a difference matrix and then normalizes it properly, skipping the challenging normalization steps for individual Hi-C contact matrices. DiffDomain then borrows well-established theoretical results in random matrix theory to compute a theoretical P value.

DiffDomain identifies reorganized TADs between cell types with reasonable reproducibility using pseudo-bulk Hi-C data from as few as 100 cells per condition. DiffDomain reveals that TADs have clear differential cell-to-population variability and heterogeneous cell-to-cell variability.





□ Efficient inference and identifiability analysis for differential equation models with random parameters

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010734

A new likelihood-based framework, based on moment matching, for inference and identifiability analysis of differential equation models that capture biological heterogeneity through parameters that vary according to probability distributions.

The availability of a surrogate likelihood allows us to perform inference and identifiability analysis of random parameter models using the standard suite of tools, including profile likelihood, Fisher information, and Markov-chain Monte-Carlo.





□ EDIR: Exome Database of Interspersed Repeats

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac771/6858440

The Exome Database of Interspersed Repeats (EDIR) was developed to provide an overview of the positions of repetitive structures within the human genome composed of interspersed repeats encompassing a coding sequence.

EDIR can be queried for interspersed repeat sequences (IRS) in a gene of interest. Additional parameters that can be entered are the length of the repeat (7-20 bp), the minimum (0 bp) and maximum distance (1000 bp) of the spacer sequence, and whether to allow a 1-bp mismatch.

The output is a table giving, for each repeat length, the number of interspersed repeat structures, the average distance separating two repeats, the number of interspersed repeat structures per megabase, and whether a 1-bp mismatch has occurred.





□ T3E: a tool for characterising the epigenetic profile of transposable elements using ChIP-seq data

>> https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00285-z

The Transposable Element Enrichment Estimator (T3E) weights the number of read mappings assigned to the individual TE copies of a family/subfamily by the overall number of genomic loci to which the corresponding reads map, and this is done at the single nucleotide level.

T3E maps ChIP-seq reads to the entire genome of interest without subsequently remapping the reads to particular consensus or pseudogenome sequences. In its calculations, T3E considers the number of both repetitive and non-repetitive genomic loci to which each multimapper mapped.
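
The weighting idea can be sketched per read: a read that maps to k genomic loci contributes 1/k to each TE copy it hits, and loci outside TEs still count toward k. This simplification works at the read level rather than the single-nucleotide level described above, and the data layout and function name are hypothetical.

```python
from collections import defaultdict


def weighted_family_counts(read_mappings):
    """Sum per-family read weights.

    read_mappings maps each read id to a list with one entry per mapped locus:
    the TE family at that locus, or None for a locus outside any annotated TE.
    """
    counts = defaultdict(float)
    for loci in read_mappings.values():
        weight = 1.0 / len(loci)  # a read mapping to k loci contributes 1/k per locus
        for family in loci:
            if family is not None:
                counts[family] += weight
    return dict(counts)


mappings = {
    "read1": ["L1"],                      # unique read inside an L1 copy
    "read2": ["L1", "Alu", None, None],   # multimapper, two loci outside TEs
    "read3": ["Alu", "Alu"],              # multimapper within one family
}
family_counts = weighted_family_counts(mappings)
```

Counting the non-TE loci in the denominator is what keeps a multimapper from being over-credited to repeat families.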





□ Hi-LASSO: High-performance python and apache spark packages for feature selection with high-dimensional data

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0278570

Random LASSO does not take advantage of the global oracle property. Although Random LASSO uses bootstrapping with weights being proportional to importance scores of predictors in the second procedure, the final coefficients are estimated without the weights.

Hi-LASSO computes importance scores of variables by averaging absolute coefficients. Hi-LASSO alleviates bias from bootstrapping, improves performance by taking advantage of the global oracle property, and provides a statistical strategy to determine the number of bootstrap samples.
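
The importance score can be illustrated on a made-up table of bootstrap coefficients. Treating an unselected predictor as a zero coefficient and averaging over all bootstraps are assumptions of this sketch; the function name is hypothetical and no LASSO fit is performed here.

```python
def importance_scores(bootstrap_coefficients):
    """Average absolute coefficient per predictor.

    bootstrap_coefficients[b][j] is the coefficient of predictor j in
    bootstrap b (0.0 when the predictor was not selected).
    """
    n_boot = len(bootstrap_coefficients)
    n_pred = len(bootstrap_coefficients[0])
    return [
        sum(abs(bootstrap_coefficients[b][j]) for b in range(n_boot)) / n_boot
        for j in range(n_pred)
    ]


# Toy coefficients from four bootstraps over three predictors.
coefs = [
    [0.5, -1.0, 0.0],
    [-0.5, 1.0, 0.0],
    [0.5, -1.0, 0.0],
    [-0.5, 1.0, 0.5],
]
scores = importance_scores(coefs)
```

In this toy table the signed coefficients of the first two predictors cancel to zero on average, while their absolute averages stay large; that is the motivation for averaging absolute values.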





□ Scaling Neighbor-Joining to One Million Taxa with Dynamic and Heuristic Neighbor-Joining

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac774/6858462

Dynamic and Heuristic Neighbor-Joining are presented, which optimize the canonical Neighbor-Joining method to scale to millions of taxa without increasing the memory requirements.

Both Dynamic and Heuristic Neighbor-Joining outperform the current gold standard methods to construct Neighbor-Joining trees, while Dynamic Neighbor-Joining is guaranteed to produce exact Neighbor-Joining trees.

Asymptotically, DNJ reaches a runtime of O(n^3) when updates to D cause frequent updates. This worst-case time complexity can be reduced to O(n^2) with an approximating search heuristic. The time complexity of HNJ is O(n^2), while the space complexity remains at O(n^2), as for DNJ.
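
For context, the canonical Neighbor-Joining step scans all pairs for the minimum of Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). The sketch below shows only that plain O(n^2) scan, which is repeated for each of the n joins; it is the baseline cost that DNJ and HNJ aim to reduce, not their implementation.

```python
def q_matrix_min(dist):
    """Return (Q value, i, j) for the pair minimising the Neighbor-Joining criterion."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best


# Toy symmetric distance matrix over five taxa.
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
best_pair = q_matrix_min(dist)
```

Here the first two taxa are joined first even though the last two are closer in raw distance, because Q also accounts for how far each taxon is from everything else.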





□ GLCM-WSRC: Robust and accurate prediction of self-interacting proteins from protein sequence information by exploiting weighted sparse representation based classifier

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04880-y

GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification), for predicting SIPs automatically based on protein evolutionary information from protein primary sequences.

The GLCM algorithm is employed to capture the valuable information from the PSSMs and form feature vectors, after which ADASYN is applied to balance the training data set, forming new feature vectors from the GLCM feature vectors that are used as input to the classifier.





□ Treenome Browser: co-visualization of enormous phylogenies and millions of genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac772/6858450

Treenome Browser displays mutations as vertical lines spanning the mutation’s presence among samples in the phylogeny, drawn at their horizontal position in an associated reference genome.

The core algorithm used by Treenome Browser decodes a mutation-annotated tree to compute the on-screen position of each mutation in the tree. To compute vertical positions, the vertical span of each subclade of the tree is first stored using dynamic programming.
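
The vertical-span idea can be sketched as a post-order pass over the tree. This is a simplified illustration, not Treenome Browser's code: the tree is assumed to be a plain dict from node to ordered children, and each leaf is assumed to occupy one row.

```python
def subclade_spans(children, root):
    """Map every node to the (first_row, last_row) of the leaves below it.

    children: dict mapping a node to its ordered list of child nodes;
    leaves are absent from the dict or map to an empty list.
    """
    spans = {}
    next_row = 0
    stack = [(root, False)]  # (node, children_already_processed)
    while stack:
        node, done = stack.pop()
        kids = children.get(node, [])
        if not kids:
            spans[node] = (next_row, next_row)  # a leaf takes one row
            next_row += 1
        elif done:
            # Reuse the stored child spans instead of revisiting the subtree.
            spans[node] = (spans[kids[0]][0], spans[kids[-1]][1])
        else:
            stack.append((node, True))
            for kid in reversed(kids):
                stack.append((kid, False))
    return spans


tree = {"root": ["A", "B"], "A": ["a1", "a2"]}
spans = subclade_spans(tree, "root")
```

With the spans stored once, a mutation placed on an internal node can be drawn as a line covering that node's span without walking its descendants again.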





□ Accurate quantification of single-nucleus and single-cell RNA-seq transcripts

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518832v1

The presence of both nascent and mature mRNA molecules in single-cell RNA-seq data leads to ambiguity in the notion of a “count matrix”. Underlying this ambiguity is the challenging problem of separately quantifying nascent and mature mRNAs.

By utilizing k-mers, this approach is efficient, as it is compatible with pseudoalignment. It quantifies single-nucleus RNA-seq by focusing on the nascent transcripts, thereby mirroring the approach that focuses on mature transcripts.





□ Variational inference accelerates accurate DNA mixture deconvolution

>> https://www.biorxiv.org/content/10.1101/2022.12.01.518640v1

This work considers Stein Variational Gradient Descent (SVGD) and Variational Inference (VI) with an evidence lower-bound objective. Both provide alternatives to the commonly used Markov-Chain Monte-Carlo methods for estimating the model posterior in Bayesian probabilistic genotyping.

The model defines the unnormalised posterior, and the estimator defines how an approximation of this distribution is obtained. These two parts are largely independent of each other, meaning that, for example, an estimator can be replaced with another one.

The singularities are not a problem for HMC estimators, which will avoid them because of the high curvature of the posterior in the vicinity of the singularities. The trajectory of the simulated Hamiltonian differs too much from the expected Hamiltonian.





□ HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518395v1

HTRX defines a template for each haplotype using the combination of ‘0’, ‘1’ and ‘X’ which represent the reference allele, alternative allele and either of the alleles, at each SNP. A four-SNP haplotype ‘1XX0’ only refers to the interaction between the first and the fourth SNP.
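
A small sketch of how such a template can be read, assuming haplotypes are given as strings of '0' and '1' of the same length as the template. The helper name `matches_template` is hypothetical and not part of the HTRX R package.

```python
def matches_template(haplotype, template):
    """True if the haplotype carries the required allele at every non-'X' SNP."""
    if len(haplotype) != len(template):
        raise ValueError("haplotype and template must cover the same SNPs")
    # 'X' accepts either allele; '0' and '1' must match exactly.
    return all(t == "X" or t == h for h, t in zip(haplotype, template))


# '1XX0' only constrains the first and the fourth SNP.
print(matches_template("1010", "1XX0"))  # True
print(matches_template("1011", "1XX0"))  # False
```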

HTRX considers lasso penalisation. AIC and BIC penalise the number of features through forward regression, and the features whose parameters do not shrink to 0 are retained. The objective function of HTRX is the out-of-sample variance explained by haplotypes within a region.





□ GSSNNG: Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq)

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518384v1

GSSNNG produces a gene set score for each individual cell, addressing the problems of low read counts and many zeros, and retains gradations that remain visible in UMAP plots.

The method works by using a nearest neighbor graph in gene expression space to smooth the count matrix. The smoothed expression profiles are then used in single sample gene set scoring calculations.

Using gssnng, large collections of cells can be scored quickly even on a modest desktop. The method uses the nearest neighbor graph (kNN) of cells to smooth the gene expression count matrix which decreases sparsity and improves geneset scoring.
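
The smoothing step can be illustrated with a toy version in plain Python. This is a sketch under simplifying assumptions (Euclidean distance on the raw counts, unweighted averaging over a cell and its k nearest neighbours); the function `knn_smooth` is hypothetical and is not the gssnng API.

```python
import math


def knn_smooth(counts, k):
    """Average each cell's profile with those of its k nearest neighbours.

    counts: list of per-cell expression vectors of equal length.
    """
    smoothed = []
    for i, cell in enumerate(counts):
        others = [j for j in range(len(counts)) if j != i]
        others.sort(key=lambda j: math.dist(cell, counts[j]))
        neighbourhood = [i] + others[:k]  # the cell itself plus k neighbours
        smoothed.append([
            sum(counts[j][g] for j in neighbourhood) / len(neighbourhood)
            for g in range(len(cell))
        ])
    return smoothed


counts = [[0, 2], [2, 0], [10, 12]]
smoothed = knn_smooth(counts, k=1)
# The zeros in the first two cells are replaced by neighbourhood averages.
```

Gene set scores are then computed on the smoothed profiles rather than on the sparse raw counts.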





□ Annotation-agnostic discovery of associations between novel gene isoforms and phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518787v1

A bi-directed de Bruijn Graph (dBG) is constructed, using Bifrost, from these reads with k-mer size 𝑘 = 31 and then compacted such that consecutive k-mers with out-degree 1 and in-degree 1 respectively are folded into a single, maximal unitig, which is a high-confidence contig.





□ MCProj: Metacell projection for interpretable and quantitative use of transcriptional atlases

>> https://www.biorxiv.org/content/10.1101/2022.12.01.518678v1

MCProj is an algorithm for quantitative analysis of query scRNA-seq given a reference atlas. The algorithm transforms single cells to quantitative states using a metacell representation of the atlas and the query.

MCProj infers each query state as a mixture of atlas states, and tags cases in which such inference is imprecise, suggestive of novel or noisy states in the query. MCProj tags novel query states and compares them to atlas states.





□ Finemap-MiXeR: A variational Bayesian approach for genetic finemapping

>> https://www.biorxiv.org/content/10.1101/2022.11.30.518509v1

The Finemap-MiXeR is based on a variational Bayesian approach for finemapping genomic data, i.e., determining the causal SNPs associated with a trait at a given locus after controlling for correlation among genetic variants due to linkage disequilibrium.

Finemap-MiXeR is based on optimizing the Evidence Lower Bound of the likelihood function obtained from the MiXeR model. The optimization is done using the Adaptive Moment Estimation (Adam) algorithm, which yields the posterior probability of each SNP being a causal variant.





□ Visual Omics: A web-based platform for omics data analysis and visualization with rich graph-tuning capabilities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac777/6865031

Visual Omics integrates multiple omics analyses which include differential expression analysis, enrichment analysis, protein domain prediction and protein-protein interaction analysis with extensive graph presentations.

The extensive use of the powerful downstream ggplot2 and its family packages enables almost all analysis results to be visualized by Visual Omics and can be adapted to the online tuning system almost without modification.





□ associationSubgraphs: Interactive network-based clustering and investigation of multimorbidity association matrices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac780/6874541

associationSubgraphs is a new interactive visualization method to quickly and intuitively explore high-dimensional association datasets using network percolation and clustering.

The algorithm for computing associationSubgraphs at all given cutoffs is closely related to single-linkage clustering but differs philosophically by viewing nodes that are yet to be merged with other nodes as unclustered rather than residing within their own cluster of size one.
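
The relationship to single-linkage clustering can be sketched with a union-find structure. This is an illustrative reconstruction, not the package's implementation: edges are assumed to be `(strength, u, v)` tuples, and nodes that no edge has reached yet are simply left out of the result, i.e. treated as unclustered.

```python
def subgraphs_at_cutoffs(edges):
    """Return [(cutoff, clusters)] as edges are added from strongest to weakest."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    steps = []
    for strength, u, v in sorted(edges, key=lambda e: -e[0]):
        for node in (u, v):
            parent.setdefault(node, node)  # a node appears with its first edge
        parent[find(u)] = find(v)
        clusters = {}
        for node in parent:
            clusters.setdefault(find(node), set()).add(node)
        steps.append((strength, sorted(sorted(c) for c in clusters.values())))
    return steps


edges = [(0.9, "a", "b"), (0.5, "c", "d"), (0.2, "b", "c")]
for cutoff, clusters in subgraphs_at_cutoffs(edges):
    print(cutoff, clusters)
```

At cutoff 0.9 only `a` and `b` form a subgraph; `c` and `d` are not singleton clusters, they are just not yet part of any subgraph.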

It investigates association subgraphs efficiently, each containing a subset of variables with more frequent associations than the remaining variables outside the subset, by showing the entire clustering dynamics and providing subgraphs under all possible cutoff values at once.




Starbright.

2022-12-13 23:12:13 | Science News




□ MoDLE: high-performance stochastic modeling of DNA loop extrusion interactions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02815-7

MoDLE uses fast stochastic simulation to sample DNA-DNA contacts generated by loop extrusion. Binding and release of LEFs and barriers and the extrusion process are modeled as an iterative process.

MoDLE goes through a burn-in phase where LEFs are progressively bound to DNA, without sampling molecular contacts. The burn-in phase runs until the average loop size has stabilized. LEFs are extruded through randomly sampled strides along the DNA in reverse / forward directions.

Extrusion barriers (e.g., CTCF binding sites) are modeled using a two-state (bound and unbound) Markov process. Each extrusion barrier consists of a position, a blocking direction and the Markov process transition probabilities.
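
A two-state barrier of this kind can be sketched as follows. The class and field names are assumptions for illustration and do not mirror MoDLE's C++ code; the only standard fact used is that a two-state chain spends a fraction p_ub / (p_ub + p_bu) of its time in the bound state.

```python
import random
from dataclasses import dataclass


@dataclass
class Barrier:
    position: int              # genomic coordinate of the barrier
    direction: str             # blocking direction, e.g. "+" or "-"
    p_unbound_to_bound: float  # per-step transition probability
    p_bound_to_unbound: float  # per-step transition probability

    def stationary_occupancy(self):
        """Long-run fraction of steps in which the barrier is bound."""
        return self.p_unbound_to_bound / (
            self.p_unbound_to_bound + self.p_bound_to_unbound
        )


def simulate_occupancy(barrier, steps, rng):
    """Simulate the chain and return the observed bound fraction."""
    bound = False
    bound_steps = 0
    for _ in range(steps):
        if bound:
            bound = rng.random() >= barrier.p_bound_to_unbound
        else:
            bound = rng.random() < barrier.p_unbound_to_bound
        bound_steps += bound
    return bound_steps / steps


barrier = Barrier(position=1000, direction="+",
                  p_unbound_to_bound=0.3, p_bound_to_unbound=0.1)
print(barrier.stationary_occupancy())
print(simulate_occupancy(barrier, 100_000, random.Random(0)))
```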





□ Reconstructing gene regulatory networks of biological function using differential equations of multilayer perceptrons

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05055-5

A multi-layer perceptron-based differential equation method transforms the gene regulation network (GRN) system into an input-output regression problem, where the input is gene expression data and the output is the derivative estimated from the expression data.

The method utilizes time-series gene expression data to train a regulatory function that simulates the transcription rate of a gene, which is a fully connected neural network (NN) with a four-layer structure.





□ BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517691v1

BLEND utilizes a technique called SimHash, that can generate the same hash value for similar sets, and provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently.
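
To make the SimHash idea concrete, here is a generic sketch over the k-mers of a seed. It illustrates the technique only and is not BLEND's implementation; the use of `blake2b` as the per-item hash and the helper names are assumptions.

```python
import hashlib


def kmers(seq, k):
    """The set of k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def simhash(items, bits=64):
    """SimHash of a collection of strings: one majority vote per bit position."""
    votes = [0] * bits
    for item in items:
        digest = hashlib.blake2b(item.encode(), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)


def hamming(a, b):
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


seed_a = simhash(kmers("ACGTACGTTAGC", 4))
seed_b = simhash(kmers("ACGTACGTTAGG", 4))  # last base differs
print(hamming(seed_a, seed_b))
```

Because every item only contributes one vote per bit, sets that share most of their items tend to agree on most bits, and can end up with exactly the same value, which is what makes fuzzy seed matches possible with an ordinary hash-table lookup.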

BLEND is faster by 2.4×-83.9× (average 19.3×), has a lower memory footprint by 0.9×-14.1× (average 3.8×), and finds higher quality overlaps, leading to more accurate de novo assemblies than minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (average 1.7×) than minimap2.





□ SIEVE: joint inference of single-nucleotide variants and cell phylogeny from single-cell DNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02813-9

SIEVE is a statistical method for the joint inference of somatic variants and cell phylogeny under the finite-sites assumption from single-cell DNA sequencing. SIEVE leverages raw read counts for all nucleotides and corrects the acquisition bias of branch lengths.

SIEVE takes as input raw read count data, accounting for the read counts for nucleotides and the total depth at each site and combines a phylogenetic model with a probabilistic graphical model, incorporating a Dirichlet Multinomial distribution of the nucleotide counts.





□ scEvoNet: a gradient boosting-based method for prediction of cell state evolution

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519467v1

scEvoNet builds the confusion matrix of cell states and a bipartite network connecting genes and cell states. It allows a user to obtain a set of genes shared by the characteristic signature of two cell states even between distantly-related datasets.

scEvoNet implements a shortest path search in order to generate a subnetwork of interest. scEvoNet builds a cell type-to-gene network using the Light Gradient Boosting Machine (LGBM) algorithm overcoming different domain effects and dropouts that are inherent.





□ seqwish: Unbiased pangenome graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac743/6854971

The seqwish algorithm builds a variation graph from a set of sequences and alignments between them. seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph encoding the sequences and their alignments.

seqwish transforms the alignment set into an implicit interval tree. seqwish queries this representation to reduce transitive matches into single DNA segments in a sequence graph. seqwish traces the original paths through this graph, yielding a pangenome variation graph.





□ RawMap: Rapid Real-time Squiggle Classification for Read Until

>> https://www.biorxiv.org/content/10.1101/2022.11.22.517599v1

RawMap is a direct squiggle-space metagenomic classifier which complements Minimap2 for filtering non-targeted reads. RawMap uses a SVM with an RBF kernel, which is trained to capture the non-linear and non-stationary characteristics of the nanopore squiggles.

Each normalized squiggle segment y corresponding to 450 basepairs of a read is mapped to a 3-D feature space. Features are derived from a modified version of Hjorth parameters, where the mean and standard deviation are replaced with median and median absolute deviation respectively.
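
One possible reading of such robust Hjorth-style features is sketched below. The classic Hjorth parameters are activity (signal variance), mobility (ratio of the spread of the first difference to that of the signal) and complexity (mobility of the first difference divided by mobility of the signal). How exactly RawMap substitutes median and MAD is an assumption here: every standard deviation is simply replaced by the MAD.

```python
from statistics import median


def mad(values):
    """Median absolute deviation around the median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def diff(values):
    """First difference of a sequence."""
    return [b - a for a, b in zip(values, values[1:])]


def robust_hjorth(segment):
    """Return (activity, mobility, complexity) with MAD in place of std."""
    d1 = diff(segment)
    d2 = diff(d1)
    spread, spread_d1, spread_d2 = mad(segment), mad(d1), mad(d2)
    if spread == 0 or spread_d1 == 0:
        raise ValueError("segment is too flat for the ratios to be defined")
    activity = spread ** 2
    mobility = spread_d1 / spread
    complexity = (spread_d2 / spread_d1) / mobility
    return activity, mobility, complexity


print(robust_hjorth([0, 1, 4, 9, 16, 25, 36]))
```

The three numbers form the 3-D feature vector that a classifier such as an RBF-kernel SVM can consume.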





□ scSHARP: Consensus Label Propagation with Graph Convolutional Networks for Single-Cell RNA Sequencing Cell Type Annotation

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517739v1

scSHARP provides evidence for the accuracy of the GCN approach through comparison to state-of-the-art methods ScType, ScSorter, SCINA, SingleR, and ScPred on a variety of data sets.

They implemented a non-parametric neighbor majority approach as an additional baseline to test their GCN model. This method operates on the 500D vectors produced as the principal components of the gene expression matrices for each data set.





□ Matrix prior for data transfer between single cell data types in latent Dirichlet allocation

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517534v1

When applied to scATAC-seq data, the outputs of latent Dirichlet allocation (LDA) are a cell-topic matrix, describing the topics assigned to each cell, and a topic-peak matrix, describing how strongly a peak contributes to the definition of each topic.

LDA is also well-suited to model single cell genomics data because it expects a matrix of integers as input, and thus can naturally operate on the raw count matrices generated by scATAC-seq or scRNA-seq.

The hyper parameters for the LDA model are the concentration parameters for the document/topic Dirichlet distribution. These distributions are assumed to be symmetric Dirichlet distributions. In that case the Dirichlet distribution can be parameterized with a single scalar value.
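
To illustrate what a single scalar concentration means, here is a sketch that draws from a symmetric Dirichlet using the standard construction from independent gamma variates. The function name and the chosen values of alpha are illustrative, not taken from the paper.

```python
import random


def sample_symmetric_dirichlet(alpha, dim, rng):
    """Draw one point from Dirichlet(alpha, ..., alpha) with `dim` components."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]  # normalise onto the simplex


rng = random.Random(1)
# Small alpha favours sparse mixtures (few dominant topics);
# large alpha favours near-uniform mixtures.
print(sample_symmetric_dirichlet(0.5, 5, rng))
print(sample_symmetric_dirichlet(50.0, 5, rng))
```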





□ Interactive explainable AI platform for graph neural networks

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517358v1

The interactive XAI platform allows the domain expert to ask counterfactual ("what-if") questions. This platform allows a domain expert to observe how changes based on their questions affect the AI decision and the XAI explanation.

This human-in-the-loop approach to GNN classification will pave the way for implementation of GNNs in the clinical setting. This interactive XAI platform will pave the way for informed medical decision-making and the application of AI models as CDSS.

1000 Barabasi networks comprising 30 nodes and 29 edges were generated. The networks had the same topology, but with varying node feature values. The features of the nodes were randomly sampled from a normal distribution N(0, 0.1). It should uncover these patterns in an algorithmic way.





□ ANNA16: Deep Learning for Predicting 16S rRNA Copy Number

>> https://www.biorxiv.org/content/10.1101/2022.11.26.518038v1

The proposed approach, i.e., Artificial Neural Network Approximator for 16S rRNA Gene Copy Number (ANNA16), essentially links 16S sequence string directly to GCN, without the construction of taxonomy or phylogeny.

ANNA16 is capable of detecting informative positions and weighing K-mers unequally according to their informativeness to more effectively utilize the information contained in 16S sequence.





□ IBDphase: Accurate genome-wide phasing from IBD data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05066-2

Identity by descent (IBD) occurs when one of a person’s two haplotypes is identical to one of another person’s in a segment of the genome because the two share a common ancestor. IBD data can be used to phase and determine the parent from which haplotypes are inherited.

IBDphase is able to separate the DNA inherited from each parent in our test set with an average accuracy over 95%. IBDphase also labels each IBD segment as being on one side of the family or the other.

IBDphase performs better when the DB is large, when many IBD segments are discovered, when a large proportion of sites overlap at least a few IBD segments, and when there are close genetic relationships to provide long IBD segments and help phase across multiple chromosomes.





□ Transposable element finder (TEF): finding active transposable elements from next generation sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05011-3

The new algorithm Transposable Element Finder (TEF) enables the detection of TE transpositions, even for TEs with an unknown sequence. TEF is a finding tool of transposed TEs, in contrast to TIF as a detection tool of transposed sites for TEs with a known sequence.

TEF detects transposed TEs together with the TSDs that result from TE transposition, the sequences of both ends, and the inserted positions of the transposed TEs. Genotypes of transpositions are verified by counting head and tail junctions and non-insertion sequences in NGS reads.





□ scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517598v1

scCDC (single-cell Contamination Detection and Correction) first detects the “contamination-causing genes,” which encode the most abundant ambient RNAs, and then corrects only these genes’ measured expression levels.

scCDC locates the cell cluster in which the GCG has the lowest mean expression. scCDC groups the cell cluster with similar clusters in terms of the Wasserstein distance. Genes with significant entropy divergence were selected in each cluster and the common genes were defined as GCGs.





□ MAGE: Strain Level Profiling of Metagenome Samples

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517382v1

MAGE builds a k-mer lookup index for the sequence collection. It comprises strain level genome sequences from across a set of species. MAGE performs a novel local search based optimization which computes maximum likelihood estimates subject to constraints on read coverage.

The MAGE index is made of two levels of indices. At level 2, the T sub-collections are indexed separately using FM index based full text indexing that supports k-mer lookup. MAGE performs read mapping purely based on k-mer hits and without any gapped alignment.





□ SCALA: A web application for multimodal analysis of single cell next generation sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517826v1

SCALA, a holistic pipeline which integrates all the aforementioned procedures and enables biomedical researchers to get actively involved in the downstream analysis and exploration of both scRNA-seq and scATAC-seq datasets.

SCALA supports additional analysis modes such as automatic cluster annotation, functional enrichment analysis, ligand-receptor analysis, trajectory inference and reconstruction of GRNs.





□ RNAlysis: analyze your RNA sequencing data without writing a single line of code

>> https://www.biorxiv.org/content/10.1101/2022.11.25.517851v1

RNAlysis allows users to build customized analysis pipelines suiting their specific research questions, going all the way from raw FASTQ files, through exploratory data analysis and data visualization, clustering analysis, and gene-set enrichment analysis.

RNAlysis uses a modular approach, and provides an intuitive and flexible GUI, allowing users to answer a wide variety of biological questions, whether they are general or highly specific, and explore their data interactively without writing a single line of code.





□ PRESGENE: A web server for PRediction of ESsential GENE using integrative machine learning strategies

>> https://www.biorxiv.org/content/10.1101/2022.11.25.517801v1

PRESGENE is an ML-based web server for prediction of essential genes in unexplored eukaryotic and prokaryotic organisms.

PRESGENE algorithms mitigate the problems of training dataset imbalance and limited availability of experimentally labeled data for essential genes.





□ WGDTree: a phylogenetic software tool to examine conditional probabilities of retention following whole genome duplication events

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05042-w

Using gene tree-species tree reconciliation to label gene duplicate nodes and differentiate between WGD and SSD duplicates, WGDTree calculates a statistic based upon the conditional probability of a gene duplicate being retained after a second WGD dependent upon the retention status.

The inference tool performed well for a range of tree topologies and SSD rates particularly when loss and small-scale duplication rates were small and when event pairs were placed further apart. Therefore, WGDTree can be used to reliably calculate Pratio values in other lineages.





□ Monopogen: single nucleotide variant calling from single cell sequencing

>> https://www.biorxiv.org/content/10.1101/2022.12.04.519058v1

Monopogen is a computational framework that enables researchers to detect single nucleotide variants (SNVs) from a variety of single cell transcriptomic and epigenomic sequencing data. Monopogen starts from individual bam files produced by single cell sequencing technologies.

Monopogen leverages linkage disequilibrium (LD) data from an external reference panel to increase SNV detection sensitivity and genotyping accuracy. Monopogen uses Monovar, a probabilistic SNV caller that effectively accounts for allelic dropout and false-positive errors.





□ SysBiolPGWAS: Simplifying Post GWAS analysis through the use of computational technologies and integration of diverse Omics datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac791/6883906

SysBiolPGWAS, a post-GWAS web application that provides a comprehensive functionality for biologists and non-bioinformaticians to conduct several post-GWAS analyses. It targets researchers in the area of the human genome and performs its analysis mainly in the autosomal chromosomes.

SysbiolPGWAS can select causal variants based on the linkage disequilibrium information in 1000 genomes using the clumping method of PLINK software. The process of variant clumping reports iteratively the most significant variant in the defined LD regions across the genome.
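
The clumping idea can be sketched as a greedy loop. This is a simplified illustration, not PLINK's algorithm in full: it ignores the physical distance window and any secondary significance threshold, and the parameter names `index_p` and `r2_min`, as well as the toy data, are assumptions.

```python
def clump(pvalues, r2, index_p=5e-8, r2_min=0.5):
    """Greedy LD clumping.

    pvalues: dict variant -> association P value.
    r2: dict frozenset({a, b}) -> pairwise LD r^2 (missing pairs count as 0).
    Returns dict index_variant -> list of variants clumped with it.
    """
    by_significance = sorted(pvalues, key=pvalues.get)
    assigned = set()
    clumps = {}
    for variant in by_significance:
        if variant in assigned or pvalues[variant] > index_p:
            continue
        # The most significant unassigned variant becomes the index variant
        # and absorbs every unassigned variant in LD with it.
        members = [
            other for other in by_significance
            if other != variant
            and other not in assigned
            and r2.get(frozenset((variant, other)), 0.0) >= r2_min
        ]
        clumps[variant] = members
        assigned.add(variant)
        assigned.update(members)
    return clumps


pvalues = {"rs1": 1e-12, "rs2": 1e-9, "rs3": 1e-10, "rs4": 0.2}
r2 = {
    frozenset(("rs1", "rs2")): 0.9,
    frozenset(("rs1", "rs3")): 0.1,
    frozenset(("rs3", "rs4")): 0.8,
}
print(clump(pvalues, r2))  # {'rs1': ['rs2'], 'rs3': ['rs4']}
```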





□ Atlas-scale single-cell multi-sample multi-condition data integration using scMerge2

>> https://www.biorxiv.org/content/10.1101/2022.12.08.519588v1

The scMerge2 algorithm is able to integrate many millions of cells from single-cell studies generated from various single-cell technologies, including scRNA-seq and CyTOF. scMerge2 is generalizable to other single cell modalities including spatially resolved modality and multi-modalities.

The robustness of scMerge2 is achieved by varying the key tuning parameters of the algorithm, including the number of unwanted variation factors, the number of pseudo-bulk, the ways of pseudo-bulk construction and the number of nearest neighbours.





□ Dysfunctional analysis of the pre-training model on nucleotide sequences and the evaluation of different k-mer embeddings

>> https://www.biorxiv.org/content/10.1101/2022.12.05.518770v1

The study decomposes a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into embedding and encoding modules to illustrate what a pre-trained model learns from pre-training data.

The context-consistent k-mer representation is the primary product that a typical BERT model learns in the embedding layer. Surprisingly, single usage of the k-mer embedding on the random data can achieve comparable performance to that of the k-mer embedding on actual sequences.





□ Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1112/6882131

Freddie is an annotation-free isoform detection and discovery tool that uses as input transcriptomic long-reads aligned to the reference genome using a splice aligner. Freddie partitions the input reads into sets that can be processed independently and in parallel.

Freddie segments the genomic alignment of the reads into canonical exon segments. Freddie reconstructs the isoforms by jointly clustering and error-correcting the reads using the canonical segmentation as a succinct representation.





□ Optimising a coordinate ascent algorithm for the meta-analysis of test accuracy studies

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519131v1

The study considers six closed form methods for estimating the initial values of the parameters for a co-ordinate ascent algorithm used to fit the bivariate model and compares them with numerically derived robust initial values.

All the closed form methods lead to a reduction in computation time of around 80% and rank higher overall across the metrics when compared with the robust initial values method.

Although no initial values estimator dominated the others across all parameters and metrics, the two-step Hedges-Olkin estimator ranked highest overall across the different scenarios.





□ Megan Server: facilitating interactive access to metagenomic data on a server

>> https://www.biorxiv.org/content/10.1101/2022.12.05.518498v1

Megan Server is a stand-alone program that serves MEGAN files to the web, using a RESTful API, facilitating interactive analysis without downloading the complete data.

A root directory is specified and then all appropriate files found in or below the root directory are served. The API provides endpoints for obtaining file-related information, classification-related information, for accessing reads and matches and for administrating the server.





□ VASCA: Variable-selection ANOVA Simultaneous Component Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac795/6887137

Variable-selection ASCA (VASCA), a method that generalizes ASCA through variable selection, augmenting its statistical power without inflating the Type-I error risk. The method is evaluated with simulations and with a real data set from a multi-omic clinical experiment.

VASCA is assessed with simulations and with a real data set from a multi-omics clinical experiment, and compared to ASCA and the BH (FDR) method in terms of statistical power, and to Partial Least Squares Discriminant Analysis (PLS-DA) and its sparse counterpart (sPLS-DA) in terms of exploratory power.





□ GeneticsMakie.jl: A versatile and scalable toolkit for visualizing locus-level genetic and genomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac786/6887175

GeneticsMakie.jl allows scalable and flexible visual display of high-dimensional genetic and genomic data within the Julia ecosystem. It produces high-quality, publication-ready figures by default.

GeneticsMakie.jl harmonizes column names of GWAS or QTL summary statistics and their SNP IDs, and calculates Z-scores if they are missing. When munging summary statistics, GeneticsMakie.jl clamps P values that would otherwise be too small to represent to the smallest floating-point number.





□ AutoGater: A Weakly Supervised Neural Network Model to Gate Cells in Flow Cytometric Analyses

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519491v1

Autogater, using a neural network model, can utilize information across multiple channels to distinguish between live and dead cell populations. While the precise definition of dead cells utilized by Autogater is unknown, the model was trained on information only from Forward Scatter and Side Scatter channels.

Autogater has a couple of significant advantages over nucleic acid stains or CFU analyses. When trained on both SYTOX and CFU analyses, Autogater appears to account for features of dead cells identified by both approaches while allowing real-time determination of which cells are dead or alive.





□ TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

>> https://www.biorxiv.org/content/10.1101/2022.12.09.519749v1

TargetCall performs light-weight basecalling to compute noisy reads using LightCall, and labels these noisy reads as on-target/off-target using Similarity Check. TargetCall eliminates the wasted computation in basecalling by performing basecalling only on the on-target reads.

TargetCall improves the performance of the entire genome sequence analysis pipeline by 2.03×-3.00×. Because the pipeline uses a highly-accurate neural network based variant caller, the execution time of variant calling dominated that of read mapping.





□ DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05093-z

DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to the optional solutions combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions).

DiviK is an original stepwise deglomerative algorithm. It uses a locally optimised K-means algorithm iteratively. They implemented local feature engineering as filtering based on GMM decomposition of the feature variance across the subregion.





□ Codetta: predicting the genetic code from nucleotide sequence

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac802/6895099

Codetta can analyze an arbitrary nucleotide sequence and needs no sequence annotation or taxonomic placement. The most likely amino acid decoding for each of the 64 codons is inferred from alignments of profile hidden Markov models of conserved proteins to the input sequence.

Codetta takes nucleotide sequences from a single organism as input and predicts the genetic code from coding regions with recognizable homology. For each codon, the best amino acid meaning is selected; Codetta can detect canonical stop and sense codons with new amino acid meanings.





□ PYPE: A Python pipeline for phenome-wide association (PheWAS) and mendelian randomization in investigator-driven phenotypes and genotypes of biobank data

>> https://www.biorxiv.org/content/10.1101/2022.12.10.519906v1

PYPE provides the user with the ability to run Mendelian Randomization under a variety of causal effect modeling scenarios (e.g., Inverse Variance Weighted Regression, Egger Regression, and Weighted Median Estimation) to identify possible causal relationships between phenotypes.
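
As an illustration of the first of these scenarios, the fixed-effect inverse-variance weighted (IVW) estimator can be written in a few lines. This is the textbook formula, not PYPE's code; the function name and the toy summary statistics are assumptions.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-SNP summary statistics.

    estimate = sum(bx * by / se^2) / sum(bx^2 / se^2)
    se       = sqrt(1 / sum(bx^2 / se^2))
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Two instruments whose per-SNP ratios by / bx both equal 0.5.
estimate, se = ivw_estimate([0.1, 0.2], [0.05, 0.1], [0.01, 0.02])
print(estimate, se)
```

Egger regression and the weighted median use the same per-SNP inputs but relax the assumption that every instrument is valid.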












Maroon.

2022-12-13 23:11:11 | Science News




□ HELIOS: High-speed sequence alignment in optics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010665

HELIOS is an all-optical high-throughput method for aligning DNA, RNA, and protein sequences. HELIOS locates matches, mutations, and single/multiple indels, while the coding procedure presents distinct coding patterns for input sequences and reduces the noise at the output vector.

The HELIOS optical architecture exploits high-speed processing and operational parallelism by adopting the wavelength and polarization of optical beams. HELIOS and the HELIOS optical architecture are each tuned to enhance the other, and together they form a single coherent system.





□ SimMCMC: Inferring delays in partially observed gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.11.27.518074v1

SimMCMC, a simulation-based Bayesian method for the inference of kinetic / delay parameters of a GRN when only the products of the genes in the network are observed. SimMCMC is applicable even if only the most downstream genes, i.e. the final outputs, of the network are observed.

SimMCMC uses a continuous-time Markov chain, which efficiently explains a biochemical reaction network. One can also use a stochastic differential equation, which is accurate when the copy numbers are higher, an agent-based model, or a delay differential equation.





□ Syllable-PBWT for space-efficient haplotype long-match query

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac734/6849513

Syllable-PBWT is a space-efficient variation of the positional Burrows-Wheeler transform (PBWT) which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function.
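
The two ingredients, syllables and a polynomial rolling hash, can be sketched as follows. This is a generic illustration rather than the Syllable-PBWT code: the syllable width, the zero padding of the last syllable, and the hash base and modulus are all assumptions.

```python
MOD = (1 << 61) - 1  # a large prime modulus (assumed)
BASE = 1_000_003     # hash base (assumed)


def syllabify(haplotype, bits):
    """Split a binary haplotype string into integers of `bits` sites each."""
    padded = haplotype + "0" * (-len(haplotype) % bits)  # pad the tail
    return [int(padded[i:i + bits], 2) for i in range(0, len(padded), bits)]


def prefix_hashes(syllables):
    """hashes[i] is the polynomial hash of the first i syllables."""
    hashes = [0]
    for s in syllables:
        hashes.append((hashes[-1] * BASE + s + 1) % MOD)
    return hashes


def range_hash(hashes, lo, hi):
    """Hash of syllables lo..hi-1, computed in O(1) from the prefix hashes."""
    return (hashes[hi] - hashes[lo] * pow(BASE, hi - lo, MOD)) % MOD


a = prefix_hashes(syllabify("101100101111", 4))
b = prefix_hashes(syllabify("000000101111", 4))
# The haplotypes differ in the first syllable but agree on the last two.
print(range_hash(a, 1, 3) == range_hash(b, 1, 3))  # True
print(range_hash(a, 0, 1) == range_hash(b, 0, 1))  # False
```

Comparing two haplotypes over a run of syllables then costs one integer comparison instead of a site-by-site scan.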

Syllable-Query is an algorithm that solves the L-long match query problem. Syllable-Query searches for ongoing long matches, as opposed to past solutions’ focus on terminated matches, due to the chaotic behavior upon match termination of general sequences in reverse prefix order.





□ IRM / ns-HAL: The Inherited Rate Matrix algorithm for phylogenetic model selection for non-stationary Markov processes

>> https://www.biorxiv.org/content/10.1101/2022.12.06.519392v1

The Inherited Rate Matrix algorithm (IRM) reduces the complexity of identifying a sufficient solution to the problem of time-heterogeneous substitution processes across lineages. fast-IRM makes the parameters from the parent model constant to reduce numerical optimisation time.

The non-stationary heterogeneous across lineages model (ns-HAL) extends the HAL algorithm to the general nucleotide Markov process. This is a discrete-time model; the complexity-reducing approach employs a top-down algorithm to identify optimal time-heterogeneous models.





□ Progres: Fast protein structure searching using structure graph embedding

>> https://www.biorxiv.org/content/10.1101/2022.11.28.518224v1

Progres (PROtein GRaph Embedding Search), a simple GNN to embed a protein structure independent of its sequence. Because Progres uses distance features based on coordinates, the embedding is E(3)-invariant: it doesn’t change w/ translation, rotation or reflection of the input structure.

A decoder generates structures from the embedding space. Properties of proteins such as evolution, topological classification, the completeness of fold space, the continuity of fold space, function and dynamics could be explored in the context of the low-dimensional fold space.
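
The E(3)-invariance of distance features can be checked numerically. The sketch below is not Progres code: the toy coordinates and the helper names `pairwise_distances`, `transform` and `max_abs_diff` are made up for illustration. It only shows that pairwise distances survive a rotation, a reflection and a translation.

```python
import math

def pairwise_distances(coords):
    """All-against-all Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def transform(coords, angle, shift, reflect=False):
    """Optionally mirror x, rotate about the z axis by `angle`, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y, z in coords:
        if reflect:
            x = -x
        out.append((c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2]))
    return out

def max_abs_diff(d1, d2):
    """Largest element-wise difference between two distance matrices."""
    return max(abs(p - q) for r1, r2 in zip(d1, d2) for p, q in zip(r1, r2))

# Toy coordinates standing in for residue positions (hypothetical values).
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.1, 3.5, 0.4), (8.7, 4.0, 1.9)]
moved = transform(ca, angle=1.1, shift=(10.0, -4.0, 2.5), reflect=True)

drift = max_abs_diff(pairwise_distances(ca), pairwise_distances(moved))
```

Here `drift` stays at floating-point noise, so any model that sees only these distances cannot tell the two copies apart.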





□ dnadna: a deep learning framework for population genetics inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac765/6851140

dnadna, a flexible Python-based software for deep learning inference in population genetics. It is task-agnostic and aims at facilitating the development, reproducibility, dissemination, and reusability of neural networks designed for population genetic data.

dnadna defines multiple workflows. First, users can implement new architectures and tasks, while benefiting from dnadna utility functions, training procedure and test environment. Second, the implemented networks can be re-optimized based on user-specified training sets / tasks.





□ Active Learning for Efficient Analysis of High-throughput Nanopore Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac764/6851141

This work applies several advanced active learning technologies to the nanopore data, including the RNA classification dataset (RNA-CD) and the Oxford Nanopore Technologies barcode dataset (ONT-BD).

Due to the complexity of the nanopore data (with noisy sequences), a bias constraint is introduced to improve the sample selection strategy in active learning. Active learning technology can assist experts in labeling samples and significantly reduce the labeling cost.





□ NanoTrans: an integrated computational framework for comprehensive transcriptome analyses with Nanopore direct-RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518309v1

Nanopore direct-RNA sequencing (DRS) provides direct access to native RNA strands with full-length information, shedding light on rich qualitative and quantitative properties of gene expression profiles.

NanoTrans, an integrated computational framework that comprehensively covers all major DRS-based application scopes, including isoform clustering and quantification, poly(A) tail length estimation, RNA modification profiling, and fusion gene detection.





□ NanoPack2: Population scale evaluation of long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.28.518232v1

NanoPack now offers tools ready for the evaluation of large populations with implementations in a more performant programming language, with a focus on features relevant to long-read sequencing.

In this manuscript, NanoPack presents newly developed tools that fulfill this need and efficiently assess characteristics specifically relevant to long-read genome sequencing, including alignments spanning structural variants and phasing read alignments.

Phasing, i.e. assigning each sequenced fragment to a parental haplotype by identifying co-occurring variants, is important in identifying potential functional variants in association studies and for assessing the pathogenicity of putative compound heterozygous variation.





□ NOMAD+: Unsupervised reference-free inference reveals unrecognized regulated transcriptomic complexity in human single cells

>> https://www.biorxiv.org/content/10.1101/2022.12.06.519414v1

NOMAD+, a new analytic method that performs unified, reference-free statistical inference directly on raw sequencing reads, extending the core NOMAD algorithm to include a micro-assembly and interpretation framework.

NOMAD+ discovers broad and new examples of transcript diversification in single cells, bypassing genome alignment and requiring no cell type metadata, in ways that are impossible with current algorithms. NOMAD+ simultaneously discovers diversification in centromeric RNA expression.





□ SCExecute: custom cell barcode-stratified analyses of scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac768/6854977

SCExecute can be restricted to specific genomic regions and can limit the number of generated scBAMs. SCExecute can be configured to use cleaned up cell barcodes, raw cell barcodes, to use a list of acceptable cell barcodes, or all cell-barcodes found in the BAM file.

SCExecute is demonstrated w/ variant callers designed for bulk (DNA-)sequencing data to identify sceSNVs. sceSNVs from 10x Genomics data are vastly understudied, as traditional variant callers estimate quality metrics, incl. allele frequency / genotype confidence, based on all reads.





□ Mathematical model of the cell signaling pathway based on the extended Boolean network model with a stochastic process

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05077-z

A new mathematical model of cell signaling pathways based on the extended Boolean method with the Waller–Kraft operator and a stochastic process. The model was employed to simulate the mitogen-activated protein kinase (MAPK) signaling pathway.

In the model, the activity of proteins in the pathway is regulated by a Boolean function, which is determined by the weights of protein–protein interactions. The model also considers the effect of stochastic factors of protein self-activity on signal transduction.





□ Transfer learning for genotype–phenotype prediction using deep learning models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05036-8

Any algorithm, such as TCA, CORAL, 1DCNN, or SVC, can also be used for transfer learning, and there is a possibility that these algorithms yield more accuracy when transferring knowledge. So, in the model section, any number of algorithms can be employed without affecting the methodology.

Transfer learning with deep transfer learning. The time to train the model on a large population's genotype is O(E * (T1 + T2 + ... + TN)). When transferring knowledge from a large population, one must decide the number of trainable and non-trainable layers.

If the number of trainable layers is 0, the final computation time would be O(E * (T1 + T2 + ... + TN)). If t layers are trainable, the actual computation time would be O(E * (T1 + T2 + ... + TN)) + O(E * (TN + TN-1 + ... + Tt)), where t is the number of trainable layers from bottom to top.





□ Scalable transcriptomics analysis with Dask: applications in data science and machine learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05065-3

The simplicity of Dask greatly reduces the barrier to entry for analysts that are new to distributed and parallel computing. The Dask framework combines blocked algorithms with task scheduling to achieve parallel and out-of-core computation.

Dask minimizes the changes required to port pre-existing code. Dask can scale several tasks commonly performed in the preprocessing of scRNA-seq data. Dask can improve the performance of transcriptomics data analysis and scale computation beyond the usual limits.
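
The combination of blocked algorithms and task scheduling can be illustrated with the standard library alone. The sketch below is deliberately not Dask's API: `blocks`, `block_stats` and `blocked_mean` are hypothetical names, and a thread pool stands in for a scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def blocks(values, size):
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(values), size):
        yield values[i:i + size]

def block_stats(block):
    """Per-block partial result: (sum, count)."""
    return sum(block), len(block)

def blocked_mean(values, size, workers=4):
    """Mean computed block by block, with the blocks scheduled on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(block_stats, blocks(values, size)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

counts = list(range(1, 101))   # stand-in for one gene's counts across cells
mean_count = blocked_mean(counts, size=7)
```

Only one block's partial result has to be held per task, which is the property that lets a blocked computation run in parallel or out of core.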





□ Persistent memory as an effective alternative to random access memory in metagenome assembly

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05052-8

Exploring the possibility of using Persistent Memory (PMem) as a less expensive substitute for dynamic random access memory (DRAM) to reduce out-of-memory (OOM) failures and increase the scalability of metagenome assemblers.

PMem can enable metagenome assemblers on terabyte-sized datasets by partially or fully substituting DRAM. Depending on the configured DRAM/PMem ratio, running assemblies with PMem can achieve a similar speed as DRAM, while in the worst case it showed a roughly two-fold slowdown.





□ Secuer: Ultrafast, scalable and accurate clustering of single-cell RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010753

Secuer, a Scalable and Efficient speCtral clUstERing algorithm for scRNA-seq data. By employing an anchor-based bipartite graph representation algorithm, Secuer reduces runtime and memory usage by more than one order of magnitude for datasets with more than 1 million cells.

Secuer pivots p anchors and constructs a weighted bipartite graph by a modified approximate k-nearest neighbor algorithm. Secuer determines the weights of the bipartite graph by a scaled Gaussian kernel function to capture the geometry of the cell-to-anchor similarity network.
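
A cell-to-anchor graph with Gaussian weights might look like the sketch below. It is an illustration under stated assumptions, not Secuer's code: neighbors are found by exact search rather than an approximate k-NN, and the per-cell scale `sigma` (the mean distance to the k nearest anchors) is one plausible reading of "scaled" chosen for this example.

```python
import math

def bipartite_weights(cells, anchors, k=2):
    """Return {cell_index: {anchor_index: weight}} over each cell's k nearest anchors."""
    graph = {}
    for i, cell in enumerate(cells):
        # Exact nearest anchors (a real tool would use an approximate search).
        nearest = sorted((math.dist(cell, a), j) for j, a in enumerate(anchors))[:k]
        sigma = sum(d for d, _ in nearest) / k or 1.0  # local scale (assumption)
        graph[i] = {j: math.exp(-d * d / (2 * sigma * sigma)) for d, j in nearest}
    return graph

# Toy 2D embedding: two cells and three anchors (hypothetical values).
cells = [(0.0, 0.0), (5.0, 5.0)]
anchors = [(0.0, 1.0), (4.0, 5.0), (10.0, 10.0)]
graph = bipartite_weights(cells, anchors, k=2)
```

Each cell keeps edges to only k anchors, so the graph has k times the number of cells in edges instead of the cell-by-cell quadratic count.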





□ Mean Dimension of Generative Models for Protein Sequences

>> https://www.biorxiv.org/content/10.1101/2022.12.12.520028v1

The log probability log p(s) of a sequence s in a model can be expanded into terms of different orders. Under some assumptions on the expansion, the corresponding variance under the uniform distribution can be decomposed into contributions of different orders as well.

The mean dimension is then defined as the average of orders under weights that correspond to contributions of orders to the total variance. The contribution of an order to the variance is proportional to the sum of squared interaction coefficients of that order.
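
Written out, the definition above reads roughly as follows. The notation is an assumption made here for illustration: L is the sequence length, u ranges over subsets of positions, f_u is the interaction term on u, V_k is the variance contributed by order k under the uniform distribution, and D is the mean dimension.

```latex
\log p(s) = \sum_{u \subseteq \{1,\dots,L\}} f_u(s_u),
\qquad
V_k = \sum_{|u| = k} \operatorname{Var}\left[ f_u(s_u) \right],
\qquad
\operatorname{Var}\left[ \log p(s) \right] = \sum_{k=1}^{L} V_k,
\qquad
D = \frac{\sum_{k=1}^{L} k \, V_k}{\sum_{k=1}^{L} V_k}.
```

A model with only independent-site terms has V_k = 0 for k > 1 and therefore D = 1, while stronger high-order interactions push D upward.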





□ NanoPhase: Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes

>> https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-022-01415-8

Although Nanopore sequencing has difficulty fully characterizing long homopolymer regions, introducing insertion/deletion errors, the continuous improvement of sequencing accuracy, throughput and theoretically unlimited read length empower efficient genome reconstruction.

NanoPhase uses metaFlye to assemble filtered Nanopore long reads to generate assemblies. Then MetaBAT2 and MaxBin2 integrated w/ the coverage information were adopted to reconstruct two candidate genome sets, followed by the bin refinement step of MetaWRAP to generate draft bins.





□ STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02826-4

STRling, software capable of detecting both novel and reference STR expansions, including pathogenic STR expansions. It calls alleles both within the read length and greater than the read length. It is capable of accurately detecting the genomic position of expansions.

STRling can detect STR expansions that are annotated in the reference genome. STRling uses k-mer counting to recover mis-mapped STR reads. It then uses soft-clipped reads to precisely discover the position of the STR expansion in the reference genome.





□ Pseudoalignment tools as an efficient alternative to detect repeated transposable elements in scRNAseq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac737/6909008

Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases and, therefore, bypassing the multiple-mapping issues related to TE detection by conventional alignment tools.

It does so by creating an index through a transcriptome de Bruijn graph (t-DBG) where nodes are k-mers. Reads are hashed and pseudoaligned to transcripts based on the intersection of the k-compatibility classes of their k-mers.
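
The intersection step can be sketched in a few lines. This is a simplified illustration rather than Kallisto's implementation: a plain dictionary replaces the t-DBG, k-mers missing from the index are skipped, and the sequences and names (`build_index`, `pseudoalign`, `TE1`, `TE2`) are invented for the example.

```python
def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts containing it (its k-compatibility class)."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def pseudoalign(read, index, k):
    """Intersect the compatibility classes of all indexed k-mers in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        cls = index.get(read[i:i + k])
        if cls is None:
            continue  # k-mer absent from the index: skipped (simplification)
        compatible = set(cls) if compatible is None else compatible & cls
    return compatible or set()

# Two near-identical toy elements differing at one base (hypothetical sequences).
transcripts = {"TE1": "ACGTACGGTCA", "TE2": "ACGTACGCTCA"}
index = build_index(transcripts, k=5)

ambiguous = pseudoalign("ACGTAC", index, k=5)
specific = pseudoalign("TACGGTC", index, k=5)
```

A read from the shared prefix stays compatible with both elements instead of being forced onto one of them, which is how the multiple-mapping problem is sidestepped.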





□ Strobealign: flexible seed size enables ultra-fast and accurate read alignment

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02831-7

strobealign is a fast short-read aligner. It achieves the speedup by using a dynamic seed size obtained from syncmer-thinned strobemers. strobealign is multithreaded, aligns single-end and paired-end reads, and outputs mapped reads either in SAM format or PAF format.

The main idea of the seeding approach is to create fuzzy seeds by first computing open syncmers from the reference sequences, then linking the syncmers together using the randstrobe method with two syncmers.
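
The seeding idea can be sketched as below. It is a toy version, not strobealign's code: s-mers are ordered lexicographically instead of by hash, the offset `t` defaults to 0, and the pairing rule (a CRC32 of the two concatenated k-mers) is a stand-in for the actual randstrobe linking function.

```python
import zlib

def open_syncmers(seq, k, s, t=0):
    """Start positions of k-mers whose smallest s-mer (leftmost on ties) sits at offset t."""
    hits = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        if smers.index(min(smers)) == t:
            hits.append(i)
    return hits

def randstrobes(seq, k, s, w, t=0):
    """Link each syncmer to one of the next w syncmers, picked by a hash of the pair."""
    pos = open_syncmers(seq, k, s, t)
    seeds = []
    for a, p in enumerate(pos):
        window = pos[a + 1:a + 1 + w]
        if not window:
            break  # no downstream syncmer left to link to
        first = seq[p:p + k]
        q = min(window, key=lambda c: zlib.crc32((first + seq[c:c + k]).encode()))
        seeds.append((p, q))
    return seeds

ref = "ACGTTGCATGCAAGTC"
positions = open_syncmers(ref, k=5, s=2)
seeds = randstrobes(ref, k=5, s=2, w=2)
```

Because the second strobe is chosen within a window rather than at a fixed distance, the span of each seed varies along the reference, which gives the dynamic seed size described above.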





□ CS-CORE: Cell-type-specific co-expression inference from single cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520181v1

CS-CORE estimates cell-type-specific co-expressions, built on a general expression-measurement model that explicitly accounts for sequencing depth variations and measurement errors in the observed single cell data.

CS-CORE models the unobserved true gene expression levels as latent variables, linked to the observed UMI counts through a measurement model that accounts for both sequencing depth variations and measurement errors.





□ multiGroupVI: Disentangling shared and group-specific variations in single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520349v1

multi-Group Variational Inference (multiGroupVI), a DGM that explicitly decomposes the gene expression patterns in scRNA-seq data into shared and group-specific factors of variation.

multiGroupVI models the variations underlying the data using gamma + 1 sets of latent variables: Group-specific encoders embed cells into group-specific latent spaces. For a cell from a given group γ, the latent variables for other groups γ′ ≠ γ are fixed to be zero vectors.





□ TASSEL: Merging short and stranded long reads improves transcript assembly

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520317v1

TASSEL (Transcript Assembly using Short and Strand Emended Long reads) merges qualitative features of stranded long reads w/ the quantitative depth of short-read sequencing. TASSEL outperforms other assembly methods in terms of sensitivity / complete assembly on the correct strand.

TASSEL resulted in substantially improved capture of key transcriptomic features such as transcription start and termination sites as well as better enrichment of active histone marks and RNA Pol II. TASSEL TSS are a better indicator of active TSS than StringTie Mix TSS.





□ NanopoReaTA: a user-friendly tool for nanopore-seq real-time transcriptional analysis

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520220v1

NanopoReaTA provides biologically relevant snapshots of the sequencing run, which in turn can enable interactive fine-tuning of the sequencing run itself, facilitate decisions to abort the run when sufficient accuracy is achieved, or accelerate the resolution of clinical cases.

NanopoReaTA focuses on the analysis of cDNA and direct RNA-sequencing reads and covers the different steps up to final visualizations of results from e.g. differential expression or gene body coverage. NanopoReaTA can be run in real-time right after starting a run via MinKNOW.





□ Insane in the vembrane: filtering and transforming VCF/BCF files

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac810/6909012

vembrane, a new filtering tool for all versions of the VCF and BCF formats. vembrane consolidates and extends the functionality of previously available tools and uses standard Python syntax, while achieving very good processing speed.

vembrane is the first tool to comprehensively handle breakend variants (BNDs): BNDs are a way of encoding structural variants by grouping two or more genomic breakpoints into a joint structural variant event. vembrane thus needs to ensure that each event is removed or kept as a whole.





□ EquiPPIS: E(3) equivariant graph neural networks for robust and accurate protein-protein interaction site prediction

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520476v1

EquiPPIS converts the input protein monomer into an undirected graph 𝒢 = (𝒱,E), with 𝒱 denoting the residues (nodes) and E denoting the interaction between nonsequential residue pairs according to their pairwise spatial proximity.

EquiPPIS uses a deep E(3) equivariant graph neural network that conducts a series of transformations of its input through a stack of equivariant graph convolution layers (EGCLs).

A sigmoidal function is applied to the last EGCL node embedding to predict the probability of every residue in the input monomer to be a PPI site, thereby converting the PPI site prediction into a graph node classification task.





□ Mirage2's high-quality spliced protein-to-genome mappings produce accurate multiple-sequence alignments of isoforms

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520492v1

Mirage2 retains the fundamental algorithms of the original Mirage implementation while benefiting from a substantial overhaul of several core components, resulting in software that improves the results of translated mapping and records informative intermediate outputs.

Isoforms are first mapped back to their coding exons. Once all isoforms within a gene family have been mapped, those genome mapping coordinates serve as the basis for intra-species alignment, resulting in an MSA with explicit splice site awareness and exon delineation.





□ Unsupervized identification of prognostic copy-number alterations using segmentation and lasso regularization

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520497v1

Using Fisher’s noncentral hypergeometric distribution to model survival w/ a segmentation model avoids the high dependency issue of univariate testing and almost systematically identifies all regions, but suffers from the difficulty of selecting the correct number of segments.

Combining this approach with a Lasso-penalization selection significantly improves the ability to recover true regions of interest. Surprisingly, downscaling the data to wider bins seemed to affect only the performance of methods using lasso regularization.

Combining a segmentation approach to create initial meta-regions of similar prognosis impact and a lasso-regularization scheme to select the significant ones provided the best results, especially in the smallest scale situation.





□ PyDESeq2: a python package for bulk RNA-seq differential expression analysis

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520412v1

PyDESeq2 implements differential expression analysis (DEA), which consists of modeling raw counts using a negative binomial distribution. Dispersion parameters are estimated independently for each gene by fitting a negative binomial generalized linear model (GLM), and shrunk towards a global trend curve.

PyDESeq2 returns sets of significant genes and pathways very similar to those of DESeq2, while achieving better likelihood for dispersion and log-fold change (LFC) parameters on a vast majority of genes, at comparable speeds.

PyDESeq2 is structured around two classes of objects: a DeseqDataSet class, handling data-modeling steps from normalization to LFC fitting, and a DeseqStats class for statistical tests and optional LFC shrinkage.





□ LegNet: resetting the bar in deep learning for accurate prediction of promoter activity and variant effects from massive parallel reporter assays

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521582v1

LegNet is an EfficientNetV2-based fully convolutional neural network employing several domain-specific ideas and improvements to reach accurate expression modeling and prediction from a DNA sequence.

LegNet was trained to predict not a single expression value but a vector of expression bin probabilities. At the model evaluation stage, the predicted probabilities are multiplied by bin numbers to convert the vector into a single predicted expression value.
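
The conversion from bin probabilities to one value is a probability-weighted sum. The sketch below is not LegNet code: the function names are hypothetical and numbering the bins from 0 is an assumption of this example.

```python
import math

def softmax(logits):
    """Turn raw scores into bin probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_expression(bin_probs):
    """Sum of bin number times probability (bins numbered from 0: an assumption)."""
    return sum(i * p for i, p in enumerate(bin_probs))

probs = [0.25, 0.5, 0.25]                    # toy prediction over three bins
point_estimate = expected_expression(probs)
flat_estimate = expected_expression(softmax([0.0, 0.0, 0.0]))
```

A prediction split evenly around the middle bin collapses to that bin's number, so uncertainty across neighboring bins still yields a smooth scalar output.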





□ Best: A Tool for Characterizing Sequencing Errors

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521488v1

best (Bam Error Stats Tool), a tool for characterizing sequencing errors using a reference assembly. best builds upon bamConcordance, a Python script published in Wenger et al.

best is written in Rust and quantifies sequencing errors based on alignments to a reference assembly. At its core, best iterates through reads aligned to a high-quality reference assembly, counts the number and types of errors, and aggregates these values into multiple outputs.







Alice.

2022-12-12 00:00:00 | Film

□ Jan Švankmajer / “Něco z Alenky” (Alice) excerpt from the elevator scene.

Directed by Jan Švankmajer
Screenplay by Jan Švankmajer
Based on the Novel by Lewis Carroll
Produced by Peter-Christian Fueter

Starring:
Kristýna Kohoutová

Cinematography by Svatopluk Malý
Edited by Marie Zemanová

Production companies: Film Four International / Condor Films

Distributed by First Run Features
Release dates: 3 August 1988 (United States) / 1 November 1990 (Czechoslovakia)

Running time: 86 minutes
Countries: Czechoslovakia / Switzerland / United Kingdom / West Germany
Language: Czech


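Švankmajer's debut feature. A young girl chases the White Rabbit and strays into an elevator. As it descends deep underground, my heart pounds with sensations long buried in childhood memory and with an emotion akin to the guilty thrill of tracing the roots of one's own ego formation. The creak of gears, the tearing of membranes, the rubbing of mucus. It is also highly accomplished as musique concrète that stirs distant memories.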

#ElevatorsAndEscalatorsInFilm



Rameau / “Entrée de Polymnie” / Ensemble Connect

2022-12-10 00:12:12 | art music


□ Rameau / “Entrée de Polymnie” from “Les Boréades” Ensemble Connect

>> https://youtu.be/I25l_RrxUQY

Monday, January 27, 2020
Ensemble Connect Up Close: Through Movement

Ensemble Connect
Leo Sussman, Flute
Tamara Winston, Oboe
Noémi Sallai, Clarinet
Yen-Chen Wu, Bassoon
Wilden Dannenberg, Horn
Sae Hashimoto, Percussion
Gergana Haralampieva, Violin
Brian Hong, Violin
Jennifer Liu, Violin
Suliman Tekalli, Violin
Emily Liu, Viola (Guest)
Dana Kelley, Viola (Alum)
Ari Evan, Cello
Arlen Hlusko, Cello
Ha Young Jung, Bass

Directed by Lisenka Heijboer Castañon and Julia Eichten
Lighting design by Christopher Gilmore

Ensemble Connect performs Rameau’s Entrée de Polymnie from “Les Boréades” as part of its evening-length concert “Through Movement.” Responding to the challenge of pairing music and movement, Ensemble Connect collaborated with a theater director and a choreographer to create a seamless concert experience that reimagined musicians as movers. Prior to “Through Movement,” a group of Ensemble Connect fellows worked on this music in collaboration with early-music luminary Jordi Savall as part of a residency in Paris in 2018, bringing that experience with them to inform this performance in Carnegie Hall’s Weill Music Room.






PIG

2022-12-08 22:12:24 | Film



□ “PIG”

>> https://neonrated.com/films/pig

Directed by Michael Sarnoski

Story by Vanessa Block / Michael Sarnoski
Written by Michael Sarnoski
Cast: Nicolas Cage / Alex Wolff / Cassandra Violet
Music by Alexis Grapsas / Philip Klein
Cinematography by Patrick Scola

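Less the tale of recovering a beloved pig than a journey toward “loss.” An antithesis to industries that feed on vanity and to their structures of domination, not only in “food” or the “culinary world.” Only what is “real” can move people's hearts. Loss is what makes love eloquent, and along the way one is able to gain new bonds.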

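“PIG,” “DÉLICIEUX,” “The MENU”: three films about “cooking” that I happened, curiously, to watch in quick succession, all sharing as their common theme a caricatural critique of “value” and “domination.” Coincidence though it may be, perhaps the arrival of such films together right now should be taken as a warning to the present age, in which, even as the power structures of the upper class are put in the line of fire, everyone irresponsibly consumes and critiques.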





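Days after seeing “PIG,” the afterglow still has not faded, and I keep tearing up as I recall Nicolas Cage's performance and scene after scene, so I want to talk about it a little. As I said the other day, this film is by no means a “revenge drama”; it depicts the path by which a man who was once a chef traces his roots and comes to accept the “past” and “loss.”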


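The protagonist is on the “giving side” and never “takes” from anyone. There are depictions of violence, but the protagonist is the one who endures it for the sake of his goal; he does not hurt anyone. Still, the unusual yet extremely simple motive, “why go that far for a pig?”, lends the drama a tension that stays taut from beginning to end.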

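※ More important here than discussing the “pig” as a metaphor is that it symbolizes the depth of the protagonist's compassion and his candor, or rather the very fact that he simply poured his affection into it.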

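This film presents a clear “answer.” It guides the audience's interpretation through powerful “acting” and “Context,” yet it never explains everything in a “Sentence.” Rather, that is deliberately left out. Precisely because the dialogue is missing, the context is strengthened and the message hits home.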



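This film is a story that speaks of “love” through loss. Every character in it is “doing the wrong thing for someone's sake.” There is a scene in which the protagonist admonishes a former apprentice, who has drowned in ostentation and lost himself: “Is that really what you wanted to do?” “Neither the customers nor the food are real.”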


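Only what is real can move people's hearts. “It's not just a job, it's a way of life” is a line from “Top Gun: Maverick,” but a message without lies does not always need to be put into words. And love and loss are a common language that everyone experiences, regardless of good and evil. Loss makes love eloquent. Beyond accepting the past and stepping forward, there is surely a hand held out.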



□ Alexis Grapsas & Philip Klein - Hunting - PIG (Original Motion Picture Soundtrack)





DÉLICIEUX.

2022-12-08 20:08:08 | Film


□ “DÉLICIEUX”

>> https://en.unifrance.org/movie/48924/delicieux

Directed by Éric Besnar
Music by Christophe Julien
Cinematography by Jean-Marie Dreujou
Art Direction by Sandrine Jarron

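On the eve of the French Revolution, the “chef” and “cuisine” were used for the vanity and power struggles of the nobility, while the masses gasped in hunger and poverty. The qualifications and worth of those who serve and those who eat, and their complementarity. The true story of the man who opened “the place where the strength to live wells up” to the general public and founded the first “restaurant.” The compositions and lighting, as though each frame were cut from a still life or a landscape painting, were beautiful, and I was captivated throughout.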