lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

TITANS.

2021-10-13 22:17:36 | Science News

“Nemlich es reichen Die Sterblichen eh' an den Abgrund. Also wendet es sich, das Echo Mit diesen.”




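To read "rationality", as measured on a human social scale, into environment-ecosystem interactions or into symbiotic relationships between species is an analogy that runs backwards: it is a simulacrum phenomenon with respect to a "state" within a physical computational process.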



□ Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations

>> https://arxiv.org/pdf/2102.06559.pdf

Gradient-based stochastic variational inference in this infinite-parameter setting, producing arbitrarily-flexible approximate posteriors. A novel gradient estimator that approaches zero variance as the approximate posterior over weights approaches the true posterior.

SDE-BNNs, an alternative construction of Bayesian continuous-depth neural networks. Considering the limit of infinite-depth Bayesian neural networks w/ separate unknown weights at each layer. It allows non-factorized approximate posteriors implicitly defined through neural SDEs.




□ ON ∞-COSMOI OF BICATEGORIES:

>> https://arxiv.org/pdf/2108.11786v1.pdf

There are various ∞-cosmoi whose “∞-categories” are 2-categories or bicategories and whose “∞-functors” and “∞-natural transformations” define some variety of functor and natural transformation.

∞-cosmological definitions of adjunctions between ∞-categories or limits inside ∞-categories compile out to in the 2-quasi-categories model.

There is an ∞-cosmos in which the “∞-categories” are the (∞,n)-categories in that particular model. This suggests the tantalizing possibility that it might be possible to develop (∞,2)-category theory or (∞,n)-category theory “model-independently” by adapting ∞-cosmological methods.





□ Ergodicity and Convergence of Markov chain Monte Carlo Estimators

>> https://arxiv.org/pdf/2110.07032.pdf

A short review of the basic theory for quantifying both the asymptotic and preasymptotic convergence of Markov chain Monte Carlo estimators.

Geometric ergodicity in the total variation metric guarantees the existence of a Markov chain Monte Carlo central limit theorem that allows us to empirically quantify preasymptotic convergence of Markov chain Monte Carlo estimators for any sufficiently integrable function.

A Markov transition is periodic whenever there is a sequence of disjoint, π-non-null sets that trap Markov chains into cyclic transitions.

In an example with three such sets, once a Markov chain wanders into any of them it is forever doomed to cycle between the three sets and is unable to explore the rest of the ambient space.

Letting N grow to infinity the normal approximation given by the central limit theorem continues to narrow until it finally converges to a Dirac distribution in the asymptotic limit.
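
As a rough illustration of how the central limit theorem lets one quantify estimator error empirically, the sketch below runs a toy random-walk Metropolis chain and estimates the Monte Carlo standard error of its mean with batch means. The standard normal target, the step size, the batch count and all function names are assumptions made for this example; none of them come from the review.

```python
import math
import random

def metropolis_chain(n_steps, step=1.0, seed=0):
    """Random-walk Metropolis chain targeting a standard normal (toy target)."""
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step, step)
        # log of pi(proposal) / pi(x) for a standard normal target
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        chain.append(x)
    return chain

def batch_means_se(values, n_batches=20):
    """Estimate the Monte Carlo standard error of the mean using batch means."""
    size = len(values) // n_batches
    means = [sum(values[i * size:(i + 1) * size]) / size for i in range(n_batches)]
    grand = sum(means) / n_batches
    var_of_means = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return math.sqrt(var_of_means / n_batches)

chain = metropolis_chain(20000)
estimate = sum(chain) / len(chain)   # estimator of E[x], whose true value is 0
se = batch_means_se(chain)           # width of the normal approximation
print(f"mean estimate {estimate:.3f} +/- {se:.3f}")
```

Rerunning with a longer chain should shrink the reported standard error roughly like 1/sqrt(N), which is the narrowing of the normal approximation described above.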





□ FoldHSphere: deep hyperspherical embeddings for protein fold recognition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04419-7

To ensure maximum angular separation between prototypes, we draw inspiration from the well-known Thomson problem. Its goal is to determine the minimum energy configuration of K charged particles on the surface of a unit sphere.

By minimizing a Thomson-based loss function, extended to a hypersphere of arbitrary number of dimensions, FoldHSphere optimizes the angular distribution of our prototype vectors for each fold class that are maximally separated in hyperspherical space.
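
A minimal sketch of the underlying idea, assuming a plain Coulomb energy (sum of inverse pairwise distances) and projected gradient descent on the unit hypersphere. The function names, learning rate and step count are hypothetical, and the loss actually used by FoldHSphere may differ from this textbook form.

```python
import math
import random

def normalize(v):
    """Project a vector back onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def random_sphere_points(k, dim, seed=0):
    rng = random.Random(seed)
    return [normalize([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(k)]

def thomson_energy(points):
    """Sum of inverse pairwise Euclidean distances (Coulomb energy)."""
    energy = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            energy += 1.0 / math.dist(points[i], points[j])
    return energy

def minimize_thomson(points, steps=300, lr=0.05):
    """Projected gradient descent on the Thomson energy."""
    pts = [list(p) for p in points]
    dim = len(pts[0])
    for _ in range(steps):
        updated = []
        for i, p in enumerate(pts):
            grad = [0.0] * dim
            for j, q in enumerate(pts):
                if i != j:
                    d = math.dist(p, q)
                    # gradient of 1/|p - q| with respect to p
                    for a in range(dim):
                        grad[a] -= (p[a] - q[a]) / d ** 3
            # step downhill, then project back onto the sphere
            updated.append(normalize([p[a] - lr * grad[a] for a in range(dim)]))
        pts = updated
    return pts

start = random_sphere_points(4, 3, seed=1)
prototypes = minimize_thomson(start)
print(thomson_energy(start), thomson_energy(prototypes))
```

The same code works for any dimension `dim`, which is the sense in which the Thomson problem extends to a hypersphere.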





□ scTITANS: Identify differential genes and cell subclusters from time-series scRNA-seq data

>> https://www.sciencedirect.com/science/article/pii/S2001037021003068

scTITANS, a method that takes full advantage of individual cells from all time points at the same time by correcting cell asynchrony using pseudotime from trajectory inference analysis.

scTITANS reconstructs the true gene expression trends in time-series data. After correcting the asynchrony of single cells based on TI analysis, a time-dependent covariate is introduced to identify the DEGs and cell subclusters in dynamic processes.





□ scTriangulate: Decision-level integration of multimodal single-cell data

>> https://www.biorxiv.org/content/10.1101/2021.10.16.464640v1.full.pdf

Unlike other multimodal methods that integrate at the data level, through either a low-dimensional latent space or a geometric graph, scTriangulate integrates results at the decision level to reconcile conflicting cluster label assignments.


scTriangulate leverages cooperative game theory in conjunction w/ stability metrics (reassign / TFIDF / SCCAF) to intelligently integrate clustering from unlimited sources. Applied to multimodal datasets, scTriangulate highlights new cell mechanisms underlying lineage diversity.





□ DeepSE: Detecting super-enhancers among typical enhancers using only sequence feature embeddings

>> https://www.sciencedirect.com/science/article/pii/S0888754321003700

DeepSE is based on a deep convolutional neural network model that distinguishes super-enhancers (SEs) from typical enhancers (TEs). DeepSE generalizes well across different cell lines, which implies that cell-type-specific SEs may share hidden sequence patterns across cell lines.

DeepSE uses whole-genome sequences as the learning corpus to train dna2vec, which generates k-mer embeddings with a fixed number of dimensions. The parameter dk gives the embedding dimension; here every k-mer was represented as a 100-dimensional vector.





□ scINSIGHT for interpreting single-cell gene expression from biologically heterogeneous data

>> https://www.biorxiv.org/content/10.1101/2021.10.13.464306v1.full.pdf

Based on a novel matrix factorization model, scINSIGHT learns coordinated gene expression patterns that are common among or specific to different biological conditions, offering a unique chance to jointly identify heterogeneous biological processes and diverse cell types.

scINSIGHT achieves sparse, interpretable, and biologically meaningful decomposition. scINSIGHT simultaneously identifies common and condition-specific gene modules and quantifies their expression levels in each sample in a lower-dimensional space.





□ Airpart: Interpretable statistical models for analyzing allelic imbalance in single-cell datasets

>> https://www.biorxiv.org/content/10.1101/2021.10.15.464546v1.full.pdf

Airpart, a statistical method for identifying differential cell-type-specific (CTS) allelic imbalance (AI) from scRNA-seq data, or other spatially- or time-resolved datasets. Airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms.

Airpart uses a Generalized Fused Lasso w/ Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model. Airpart identifies differential AI patterns across cell states and could be used to define trends of AI signal over spatial / time axes.





□ La Jolla Assembler (LJA): Assembling Long Accurate Reads Using Multiplex de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2020.12.10.420448v2.full.pdf

La Jolla Assembler (LJA) includes three modules addressing all three challenges in assembling long and accurate reads: jumboDBG (constructing large de Bruijn graphs), mowerDBG (error-correcting reads), and multiplexDBG (utilizing the entire read-length for resolving repeats).

A fast LJA algorithm reduces the error rate by 3 orders of magnitude and constructs the de Bruijn graph for large k-mer sizes. Since the de Bruijn graph constructed for a fixed k-mer size is typically either too tangled or too fragmented, LJA uses a multiplex de Bruijn graph.
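
To make the "too tangled or too fragmented" trade-off concrete, here is a small sketch of an ordinary fixed-k de Bruijn graph built from reads. It is not jumboDBG or the multiplex graph; the toy reads and the helper names are assumptions for illustration only.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor, i.e. places where the graph tangles."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

reads = ["ACGTACGA", "GTACGATT"]
print(branching_nodes(de_bruijn_graph(reads, 3)))  # small k: a repeat creates a branch
print(branching_nodes(de_bruijn_graph(reads, 6)))  # larger k: the branch is resolved
```

With these toy reads the k=3 graph branches at the repeated context while the k=6 graph does not, but a larger k also needs longer exact overlaps between reads, which is where fragmentation comes from.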





□ HiLoop: Identification, visualization, statistical analysis and mathematical modeling of high-feedback loops in gene regulatory networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04405-z

HiLoop quantifies the enrichment of high-feedback loops in the given networks and automatically generates parameterized mathematical models that describe characteristic dynamical systems based on the network topologies.

HiLoop visualizes multiple attractors in the state space of specific genes or axes of reduced dimensions. HiLoop can be extended to facilitate the analysis of diverse transient dynamics and spatial (e.g. Turing) patterns generated from individual spatiotemporal models.





□ VLMCs: Fast parallel construction of variable-length Markov chains

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04387-y

The methods range from probability distributions of sequence composition to first and higher-order Markov chains, where a k-th order Markov chain over DNA has 4^k formal parameters.

VLMCs (variable-length Markov chains) adapt the depth depending on sequence context and curtail excesses in the number of parameters. The scarcity of available fast software prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative.
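
For contrast with the variable-length case, the sketch below fits an ordinary fixed-order Markov chain by counting, which shows where the 4^k growth in contexts comes from. It is a plain k-th order model, not a VLMC and not the paper's suffix-tree or hash-based implementation; the sequence and function name are made up for the example.

```python
from collections import Counter, defaultdict

def fit_markov_chain(sequence, k):
    """Estimate P(next base | previous k bases) by counting occurrences."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence)):
        counts[sequence[i - k:i]][sequence[i]] += 1
    return {
        context: {base: n / sum(c.values()) for base, n in c.items()}
        for context, c in counts.items()
    }

model = fit_markov_chain("ACGTACGTACGA", 2)
possible_contexts = 4 ** 2          # every DNA 2-mer is a potential context
observed_contexts = len(model)      # only some of them occur in this sequence
print(model["CG"], observed_contexts, possible_contexts)
```

A VLMC would keep long contexts only where they change the next-base distribution, so most of the 4^k potential contexts never need to be stored.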





□ A Converse Sum of Squares Lyapunov Function for Outer Approximation of Minimal Attractor Sets of Nonlinear Systems

>> https://arxiv.org/pdf/2110.03093v1.pdf

A new Lyapunov characterization of attractor sets that is well suited to the problem of finding the minimal attractor set. This Lyapunov characterization is non-conservative even when restricted to Sum-of-Squares (SOS) Lyapunov functions.

An SOS programming problem based on determinant maximization yields an SOS Lyapunov function whose 1-sublevel set has minimal volume, is an attractor set itself, and provides an optimal outer approximation of the minimal attractor set of the ODE.





□ A Bayesian neural network predicts the dissolution of compact planetary systems

>> https://www.pnas.org/content/118/40/e2026053118

A Bayesian neural network (BNN) naturally incorporates confidence intervals into its instability time predictions, accounting for model uncertainty as well as the intrinsic uncertainty due to the chaotic dynamics.

The gradient information can significantly speed up parameter estimation using Hamiltonian Monte Carlo. The model numerically integrates 10,000 orbits for a compact three-planet system and records orbital elements.





□ Axioms for the category of Hilbert spaces

>> https://arxiv.org/pdf/2109.07418v1.pdf

The latter uses the framework of category theory, and emphasises operators more than their underlying Hilbert spaces. It postulates a category with structure that models physical features of quantum theory.

Which axioms guarantee that a category is equivalent to that of continuous linear functions between Hilbert spaces? The approach is similar to Lawvere’s categorical characterisation of the theory of sets. The finite-dimensional Hilbert spaces can be categorically axiomatised.





□ Robustness of non-computability

>> https://arxiv.org/pdf/2109.15080v1.pdf

A framework for analyzing whether a non-computability result is robust over continuous spaces. The notion of computability is extended to continuous spaces, i.e., non-discrete topological spaces.

There exists a computable C∞ function h : R² → R², h ∈ V(K), such that h has a unique computable equilibrium point s (a sink) and the basin of attraction W_s of s is non-computable, where K is the disk centered at the origin with radius 3.





□ SVAT: Secure Outsourcing of Variant Annotation and Genotype Aggregation

>> https://www.biorxiv.org/content/10.1101/2021.09.28.462259v1.full.pdf

SVAT can decrease the time and memory usage for the annotation of deletions by making use of an annotation vector that contains the 1-bp deletions and making use of this to translate the impact of deletions that span multiple nucleotides.

SVAT utilizes proxy re-encryption to securely re-code the genotype matrices. SVAT can perform counting at the allele count or variant existence level. SVAT makes use of a novel vectorized representation of the variant loci to protect the variant loci information.





□ PEAK2VEC ENABLES INFERENCE OF TRANSCRIPTIONAL REGULATION FROM ATAC-SEQ

>> https://www.biorxiv.org/content/10.1101/2021.09.29.462455v1.full.pdf

Peak2vec, a novel algorithm that can identify ATAC-seq peaks regulated by the same TF, while providing the corresponding signature motif. Peak2vec is also easier to interpret since a multinomial convolution kernel directly represents a position weight matrix.

Peak2vec performs Gaussian mixture modeling on the embedding vectors. Peak2vec may also be applied to TF ChIP-seq experiments in cases where multiple motifs exist for cofactors.





□ TRIPOD: Nonparametric Interrogation of Transcriptional Regulation in Single-Cell RNA and Chromatin Accessibility Multiomic Data

>> https://www.biorxiv.org/content/10.1101/2021.09.22.461437v1.full.pdf

TRIPOD, a nonparametric approach to detect and characterize three-way relationships between a TF, its target gene, and the accessibility of the TF’s binding site, using single-cell RNA and ATAC multiomic data.

TRIPOD matches metacells by either their TF expressions or peak accessibilities. For each matched metacell pair, the variable being matched is controlled for, and differences between the pair in the other two variables are computed.





□ Wavelet Screening: a novel approach to analyzing GWAS data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04356-5

Wavelets are oscillatory functions that are useful for analyzing the local frequency and time behavior of signals. The signals can then be divided into different scale components and analyzed separately.

Haar Wavelet transforms the raw genotype data similarly to the widely used ‘Gene- or Region-Based Aggregation Tests of Multiple Variants’ method.
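
A small sketch of an (unnormalized) Haar decomposition applied to a toy genotype vector, to show how the signal splits into scale components. The genotype values and function names are invented for illustration, and the paper's own normalization and downstream test statistics are not reproduced here.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse part) and half-differences (detail)."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details

def haar_transform(signal):
    """Full decomposition; the input length must be a power of two."""
    levels = []
    current = list(signal)
    while len(current) > 1:
        current, details = haar_step(current)
        levels.insert(0, details)  # coarsest scale first
    return current[0], levels

# Minor-allele counts (0, 1 or 2) at eight neighbouring variants, purely illustrative.
genotypes = [0, 1, 2, 2, 0, 0, 1, 1]
overall_mean, detail_levels = haar_transform(genotypes)
print(overall_mean, detail_levels)
```

The coarse averages aggregate neighbouring variants over a region, which is the sense in which the transform resembles region-based aggregation, while the detail coefficients keep the finer-scale differences.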





□ BlockPolish: accurate polishing of long-read assembly via block divide-and-conquer

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab405/6383560

BlockPolish couples four Bidirectional LSTM layers with a compressed projection layer and a flip-flop projection layer to predict the consensus sequence according to the reads-to-assembly alignment.

The Bi-LSTM layers take both left and right alignment features when making decisions. The compressed projection layer converts the alignment features to the DNA sequence without continuously repeated nucleotides.

The flip-flop projection layer converts the alignment features into the DNA sequence in which the continuous repeated nucleotides are flip-flopped.

BlockPolish divides contigs into blocks with low complexity and high complexity according to statistics of reads aligned to the assembly. Dividing contigs and generating feature matrix is done in the BPFGM.





□ scAAnet: Non-linear Archetypal Analysis of Single-cell RNA-seq Data by Deep Autoencoders

>> https://www.biorxiv.org/content/10.1101/2021.09.17.460824v1.full.pdf

Non-linear archetypal analysis methods have been proposed based on kernelization, such as kernel principal convex hull analysis. However, there is no guarantee that kernel-based transformation makes data well-approximated by a simplex.

scAAnet decomposes an expression profile into a usage matrix and a GEP/archetype matrix. The role of the encoder part is to perform a non-linear decomposition of the data by mapping data from a high-dimensional space to a much lower-dimensional latent space.





□ Modelling the bioinformatics tertiary analysis research process

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04310-5

A conceptual model that captures the salient characteristics of the research methods and human tasks involved in Bioinformatics Tertiary Analysis.

A Conversational Agent guides the user step by step in the data extraction. The final hierarchical task tree was then converted into an ontological representation using an ontology standard formalism.





□ CVODE: Reverse engineering gene regulatory network based on complex-valued ordinary differential equation model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04367-2

Grammar-guided genetic programming (GGGP) is utilized to evolve the structure of CVODE, and a complex-valued firefly algorithm (CFA) is proposed to search for the optimal complex-valued parameters of the model.

CVODE has the complex-valued structures, constants and coefficients, which could improve the modeling ability. GGGP overcomes the shortcomings of GP and CFA has more population diversity and faster convergence.





□ MM-Deacon: Multimodal molecular domain embedding analysis via contrastive learning

>> https://www.biorxiv.org/content/10.1101/2021.09.17.460864v1.full.pdf

MM-Deacon is trained using SMILES and IUPAC molecule representations as two different modalities. First, SMILES and IUPAC strings are encoded by using two different transformer-based language models independently.

Then the contrastive loss is utilized to bring these encoded representations from different modalities closer to each other if they belong to the same molecule, and to push embeddings farther from each other if they belong to different molecules.

PubChem cross-modal molecule search serves as a way to test the learned agreement across SMILES and IUPAC representations in the joint embedding space. Specifically, molecules in the PubChem test set are all embedded into 512-dimensional vectors in the joint embedding space.
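
The pull-together, push-apart behaviour can be sketched with an InfoNCE-style loss over cosine similarities. This particular loss form, the temperature value and the toy two-dimensional embeddings are assumptions for illustration; the exact loss used by MM-Deacon is not stated in the excerpt above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(smiles_emb, iupac_emb, temperature=0.1):
    """InfoNCE-style loss; row i of each list is assumed to be the same molecule."""
    total = 0.0
    for i, s in enumerate(smiles_emb):
        logits = [cosine(s, t) / temperature for t in iupac_emb]
        log_norm = math.log(sum(math.exp(x) for x in logits))
        total += log_norm - logits[i]  # negative log-probability of the true pair
    return total / len(smiles_emb)

smiles = [[1.0, 0.0], [0.0, 1.0]]
aligned_iupac = [[1.0, 0.0], [0.0, 1.0]]     # matching molecules point the same way
mismatched_iupac = [[0.0, 1.0], [1.0, 0.0]]  # matching molecules are orthogonal
print(contrastive_loss(smiles, aligned_iupac), contrastive_loss(smiles, mismatched_iupac))
```

The loss is close to zero when each SMILES embedding is nearest to its own IUPAC embedding and large otherwise, which is also what a cross-modal search in the joint space relies on.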





□ STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02490-0

Sequence Taxonomic Analysis Tool (STAT), a scalable k-mer-based tool for fast assessment of taxonomic diversity intrinsic to submissions, independent of metadata.

Based on MinHash, and inspired by Mash, STAT employs a reference k-mer database built from available sequenced organisms to allow mapping of query reads to the NCBI taxonomic hierarchy.

STAT uses the MinHash principle to compress the representative taxonomic sequences by orders of magnitude into a k-mer database, a process that yields a set of diagnostic k-mers for each organism. This allows for significant coverage of taxa w/ a minimal set of diagnostic k-mers.
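
A minimal Mash-style bottom-s MinHash sketch, to show how a large k-mer set is compressed to a handful of hashes that still support similarity estimates. The hash function, k, sketch size and function names are assumptions for this example; STAT's actual database construction and its diagnostic k-mer selection are more involved.

```python
import hashlib

def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_kmer(kmer):
    """Stable 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def minhash_sketch(seq, k=4, size=8):
    """Bottom-s sketch: keep only the `size` smallest k-mer hashes."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:size]

def sketch_similarity(a, b):
    """Estimate Jaccard similarity of the underlying k-mer sets from two sketches."""
    size = min(len(a), len(b))
    merged = sorted(set(a) | set(b))[:size]
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)

reference = minhash_sketch("ACGTACGTTGCAACGTTAGC")
query = minhash_sketch("ACGTACGTTGCAACGTTAGC")
unrelated = minhash_sketch("CCCCCCCCCCCC")
print(sketch_similarity(reference, query), sketch_similarity(reference, unrelated))
```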




□ SEDIM: High-throughput single-cell RNA-seq data imputation and characterization with surrogate-assisted automated deep learning

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab368/6374131

Deep imputation architectures are difficult to design and tune for those without rich knowledge of deep neural networks and scRNA-seq.

Surrogate-assisted Evolutionary Deep Imputation Model (SEDIM) automatically designs the architectures of deep neural networks for imputing GE levels. SEDIM constructs an offline surrogate model, which can accelerate the computational efficiency of the architectural search.




□ scHiCStackL: a stacking ensemble learning-based method for single-cell Hi-C classification using cell embedding

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab396/6374065

scHiCStackL contains a two-layer stacking learning-based ensemble model. The cell embedding generated by its data preprocessing method increases by 0.23, 1.22, 1.46 and 1.61% compared with the cell embedding generated by scHiCluster.

The stacking ensemble learning-based model is comprised of Ridge Regression (RR) classifier and Logistic Regression (LR) classifier as the base-classifiers (i.e., first-level) and Gaussian Naive Bayes (GaussianNB) classifier as the meta-classifier.





□ Deep GONet: self-explainable deep neural network based on Gene Ontology for phenotype prediction from gene expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04370-7

Deep GONet architecture represents different levels of the ontology preserving the hierarchical relationships between the GO terms by using sparse regularization.

Deep GONet is based on a MLP constrained by the GO structure. GO gathers three ontologies that respectively describe the following categories: biological process (GO-BP), molecular function, and cellular component.





□ XENet: Using a new graph convolution to accelerate the timeline for protein design on quantum computers

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009037

XENet, a GNN model that addresses both concerns while also avoiding the computational issues introduced by FGNs.

XENet is a message-passing GNN that simultaneously accounts for both the incoming and outgoing neighbors of each node, such that a node’s representation is based on the messages it receives as well as those it sends.

XENet can model residue-level environments better than existing methods ECC and CrystalConv. Not only does the usage of XENet result in lower validation losses, XENet can withstand deeper architectures.




□ RLM: Fast and simplified extraction of Read-Level Methylation metrics from bisulfite sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab663/6380544

RLM, a fast and scalable tool that implements established and frequently used inter- and intramolecular metrics of DNA methylation at the read level from bisulfite sequencing experiments.

RLM is applicable for any reference genome, a wide range of library protocols w/ input alignment files from multiple commonly used alignment tools. RLM automatically accounts for potential errors / biases caused by sequencing artifacts, mapping quality and overlapping read pairs.





□ HyINDEL – A Hybrid approach for Detection of Insertions and Deletions

>> https://www.biorxiv.org/content/10.1101/2021.10.08.463662v1.full.pdf

HyINDEL integrates clustering, split-mapping and assembly-based approaches, for the detection of INDELs of all sizes (from small to large) and also identifies the insertion sequences.

HyINDEL starts with identifying clusters of discordant and soft-clip reads which are validated by depth-of-coverage and alignment of soft-clip reads to identify candidate INDELs, while the assembly-based approach is used in identifying the insertion sequence.




□ SFt: Improved Unsupervised Representation Learning of Spatial Transcriptomic Data with Sparse Filtering

>> https://www.biorxiv.org/content/10.1101/2021.10.11.464002v1.full.pdf

Sparse filtering (SFt) uses principles of sparsity and mutual information to build representations from both global and local features from a minimal list of samples. Critically, the samples that comprise each representation are listed and ranked by informativeness.

SFt, implemented with the PyTorch machine learning libraries for Python, returned the most accurate reconstruction of anatomical ground truth of any method tested.

Sparse learning is a powerful, but underexplored means to derive biologically meaningful representations from complex datasets and a quantitative basis for compressed sensing of classifiable phenomena.

SFt should be considered as an alternative to PCA or manifold learning for any high dimensional dataset and the basis for future spatial learning algorithms.





□ Modular assembly of dynamic models in systems biology

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009513

A model of the Mos/MAPK cascade is built in a modular fashion using bond graphs. This enabled a principled approach for benchmarking and comparing models of glycolysis with different levels of complexity.

In conjunction with the programmatic approach, bond graphs provide a useful framework for updating models and recording their provenance. Incremental changes were made to the MAPK cascade model to incorporate feedback.








Ἐγκέλαδος.

2021-10-13 22:13:37 | Science News

"What should happen in the future" is nothing but "what is happening at this moment"



What matters is not "what we learn from statistics" but "knowing the structure that the statistics are taken over".



□ SELMA: Accurate estimation of intrinsic biases for improved analysis of chromatin accessibility sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.10.22.465530v1.full.pdf

SELMA (Simplex Encoded Linear Model for Accessible Chromatin), a computational framework for the accurate estimation of intrinsic cleavage biases and improved analysis of DNase/ATAC-seq data for both bulk and single-cell experiments.

SELMA generates more robust bias estimation from bulk data than the naïve k-mer model. SELMA encodes each k-mer as a vector in the Hadamard Matrix, derived from a simplex encoding model, in which the k-mer sequences are encoded as the vertices of a regular 0-centered simplex.
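
One standard way to obtain the vertices of a regular 0-centered simplex from a Hadamard matrix is to drop its all-ones first column. The sketch below does this with the Sylvester construction, which applies because 4^k is a power of two. How SELMA actually assigns k-mers to rows, and how it uses the encoding in its linear model, is not described in the excerpt; the lexicographic ordering and function names here are assumptions.

```python
from itertools import product

def sylvester_hadamard(n):
    """Hadamard matrix of order n via the Sylvester construction (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def simplex_encode_kmers(k):
    """Map every DNA k-mer to a vertex of a regular simplex centered at the origin."""
    kmer_list = ["".join(p) for p in product("ACGT", repeat=k)]
    h = sylvester_hadamard(len(kmer_list))  # order 4**k
    # Dropping the all-ones first column leaves 4**k points in 4**k - 1 dimensions.
    return {kmer: row[1:] for kmer, row in zip(kmer_list, h)}

encoding = simplex_encode_kmers(1)
for base, vector in encoding.items():
    print(base, vector)
```

All vertices have the same norm and the same pairwise inner product, and their coordinates sum to zero, which is what "regular" and "0-centered" mean here.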





□ NanoSplicer: Accurate identification of splice junctions using Oxford Nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2021.10.23.465402v1.full.pdf

Instead of identifying splice junctions by mapping basecalled reads, NanoSplicer utilises the raw output from nanopore sequencing (measures of electrical current commonly known as squiggles) to improve the identification of splice junctions.

NanoSplicer compares the squiggle from a read with the predicted squiggles of potential splice junctions to identify the best match and likely junction. NanoSplicer uses the support in the junction squiggle for the model as a measure of similarity in Dynamic Time Warping.
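
The matching step can be sketched with textbook dynamic time warping over numeric signals. This uses a plain absolute-difference cost rather than NanoSplicer's own similarity measure, and the toy squiggles, junction names and function names are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: stretch a, stretch b, or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def best_junction(read_squiggle, candidate_squiggles):
    """Name of the candidate whose predicted squiggle is closest under DTW."""
    return min(
        candidate_squiggles,
        key=lambda name: dtw_distance(read_squiggle, candidate_squiggles[name]),
    )

read = [1.0, 2.0, 3.0]
candidates = {
    "junction_A": [1.0, 2.0, 2.0, 3.0],  # same shape, one sample stretched
    "junction_B": [3.0, 1.0, 0.0, 1.0],
}
print(best_junction(read, candidates))
```

Warping lets the same current levels match even when one signal dwells longer on a level, which matters for squiggles because the speed of the molecule through the pore varies.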





□ VSS-Hi-C: Variance-stabilized signals for chromatin 3D contacts

>> https://www.biorxiv.org/content/10.1101/2021.10.19.465027v1.full.pdf

VSS-Hi-C stabilizes the variance of Hi-C contact strength. This method learns the empirical mean-variance relationship of the Hi-C matrices and transforms the Hi-C contact strength using a transformation based on this learned mean-variance relationship.

VSS-Hi-C transformed matrices have a fully stabilized mean-variance relationship, in contrast to other transformation methods. Variance-stabilized signals are beneficial for downstream analyses like identifying topological domains and subcompartments.





□ PeakBot: Machine learning based chromatographic peak picking

>> https://www.biorxiv.org/content/10.1101/2021.10.11.463887v1.full.pdf

PeakBot first searches for chromatographic peaks using a smoothing and gradient-descent algorithm. It detects all local signal maxima in a chromatogram, which are then extracted as super-sampled standardized areas.

These areas are subsequently inspected by a custom-trained convolutional neural network that forms the basis of PeakBot’s architecture. The model reports whether the respective local maximum is the apex of a chromatographic peak, as well as its peak center and bounding box.





□ ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009376

ReFeaFi, a dynamic negative set updating scheme with a two-model approach, using one model for scanning the genome and the other one for testing candidate positions.

Empty vector and random sequences were used as negative controls, while the GAPDH promoter was used as a positive control. ReFeaFi achieves outstanding performance on discriminating VISTA enhancers from 100 times as many random genomic regions.





□ ConGRI: Accurate inference of gene regulatory interactions from spatial gene expression with deep contrastive learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab718/6401998

The high-throughput spatial gene expression data, like in situ hybridization images that exhibit temporal and spatial expression patterns, has provided abundant and reliable information for the inference of GRNs.

ConGRI features a contrastive learning scheme and a deep Siamese CNN architecture, which automatically learns high-level feature embeddings for the expression images and feeds the embeddings to an artificial neural network to determine whether or not the interaction exists.





□ A novel algorithm to flag columns associated in any way with others or a dependent variable is computationally tractable in large data matrices and has much higher power when columns are linked like mutations in chromosomes.

>> https://www.biorxiv.org/content/10.1101/2021.09.15.460360v1.full.pdf

When a data matrix DM has many independent variables IVs, it is not computationally tractable to assess the association of every distinct IV subset with the dependent variable DV of the DM, because the number of subsets explodes combinatorially as IVs increase.

A computationally tractable, fully parallelizable Participation in Association Score (PAS) that, in a DM with markers, detects one by one every column that is strongly associated in any way with others.





□ Identifying common and novel cell types in single-cell RNA-sequencing data using FR-Match

>> https://www.biorxiv.org/content/10.1101/2021.10.17.464718v1.full.pdf

FR-Match matches query datasets to reference atlases with robust and accurate performance for identifying novel cell types and non-optimally clustered cell types in the query data.

FR-Match is an iterative procedure that allows each cell in the query cluster to be assigned a summary p-value, quantifying the confidence of matching, to a reference cluster. FR-Match forms a clean diagonal alignment of cell types and assigns unmatched cells as “unassigned”.




□ AlphaDesign: A de novo protein design framework based on AlphaFold

>> https://www.biorxiv.org/content/10.1101/2021.10.11.463937v1.full.pdf

AlphaDesign, a computational framework for de novo protein design that embeds AF as an oracle within an optimisable design process. This framework enables rapid prediction of completely novel protein monomers starting from random sequences.

Structural integrity of predicted structures is validated by ab initio folding / structural analysis as well as extensively by rigorous all-atom molecular dynamics simulations and analysing the corresponding structural flexibility, intramonomer / interfacial amino-acid contacts.





□ TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies

>> https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.full.pdf

TT-Mars takes advantage of the recent production of high-quality haplotype-resolved genome assemblies by evaluating variant calls based on how well each call reflects the content of the assembly, rather than comparing calls themselves.

Compared with validation using dipcall variants, TT-Mars analyzes 1,497-2,229 more calls on long read callsets and has favorable results when candidate calls are fragmented into multiple calls in alignments.





□ motif_prob: Fast and exact quantification of motif occurrences in biological sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04355-6

Exact formulae for motif occurrence, under Bernoullian or Markovian models, have exponential complexity, thus can be cumbersome to be implemented efficiently, but approximations can be calculated with constant cost.

‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. motif_prob is 50–1000× faster than MoSDi exact and 60–120× faster than MoSDi compound Poisson.

Given the motif length m and the genome length n, one can set a tolerance level ε such that P(0, m, n) > (1 − ε), and in general consider each case where (1 − P(S))^(n−m+1) > (1 − ε). This is equivalent to (n − m + 1)∙log(1 − P(S)) > log(1 − ε). Because log(1 − P(S)) is negative, dividing by it reverses the inequality, which implies n < m − 1 + log(1 − ε)/log(1 − P(S)).
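
The bound can be checked numerically. The sketch below assumes the simple approximation P(0, m, n) ≈ (1 − P(S))^(n−m+1), solves it for n while keeping in mind that log(1 − P(S)) is negative (so the sequence length is bounded from above), and uses a uniform 8-mer as a made-up example. The function names are hypothetical and this is not motif_prob's own code.

```python
import math

def prob_no_occurrence(p_s, m, n):
    """Approximate P(0, m, n) as (1 - P(S))^(n - m + 1)."""
    return (1.0 - p_s) ** (n - m + 1)

def length_bound(p_s, m, eps):
    """Value of m - 1 + log(1 - eps) / log(1 - P(S)).

    Sequences shorter than this keep the no-occurrence probability above 1 - eps.
    """
    return m - 1 + math.log(1.0 - eps) / math.log(1.0 - p_s)

p_s = 0.25 ** 8   # probability of one specific 8-mer under a uniform base model
m, eps = 8, 0.01
bound = length_bound(p_s, m, eps)
print(round(bound, 1))
print(prob_no_occurrence(p_s, m, 600) > 1 - eps)  # below the bound
print(prob_no_occurrence(p_s, m, 700) > 1 - eps)  # above the bound
```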




□ vcf2gwas: python API for comprehensive GWAS analysis using GEMMA

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab710/6390796

GEMMA can fit a univariate linear mixed model, a multivariate mixed model, and a Bayesian sparse linear mixed model for testing marker associations with a trait of interest in different organisms.

vcf2gwas is especially helpful when analyzing large numbers of phenotypes or different sets of individuals because it can perform the analyses in parallel with a single .csv file with all the phenotypes. It also offers features like analyzing a reduced phenotypic space.





□ GBSmode: a pipeline for haplotype-aware analysis of genotyping-by-sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.09.20.461130v1.full.pdf

Genotyping-by-sequencing (GBS) enables simultaneous genotyping of thousands of DNA markers in the genome of any species. GBS exploits a restriction enzyme to reduce genome complexity and directs the sequencing to begin at fixed digestion sites.

GBSmode, a dedicated pipeline to call DNA sequence variants using whole-read information from GBS data. It removes false positives by incorporating biological features such as the ploidy level and the number of possible alleles in the population under investigation.





□ BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin

>> https://www.biorxiv.org/content/10.1101/2021.09.23.461564v1.full.pdf

BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. BindVAE automatically learns distinct groups of k-mer patterns that correspond to cell type-specific in vivo binding signals.

BindVAE uses 8-mers with wildcards, which makes the learned latent factors interpretable. Of the 102 distinct patterns learned over the latent dimensions, some are specific to individual TFs, and the corresponding latent factors can be mapped to unique TFs.





□ BionetBF: A Novel Bloom Filter for Faster Membership Identification of Paired Biological Network Data

>> https://www.biorxiv.org/content/10.1101/2021.09.23.461527v1.full.pdf

BionetBF is capable of executing millions of operations within a second on datasets containing millions of paired biological data items, while occupying a tiny amount of main memory.

BionetBF is also compared with two other filters, Cuckoo Filter and Libbloom. It achieves higher performance with a smaller memory footprint than the larger Cuckoo Filter and Libbloom filters.
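To make the idea of membership testing on paired data concrete, here is a minimal textbook Bloom filter that stores (a, b) pairs. It is a sketch only and does not reproduce BionetBF's actual design; the class name, the bit-array size, the number of hashes, and the gene names are all assumptions for illustration.

```python
import hashlib


class PairBloomFilter:
    """Plain Bloom filter keyed on an ordered pair of strings (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, a, b):
        # Join the pair with a separator so ("AB", "C") and ("A", "BC") differ.
        key = f"{a}\t{b}".encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                key, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, a, b):
        for pos in self._positions(a, b):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, pair):
        a, b = pair
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(a, b))


# Hypothetical interaction pairs.
bf = PairBloomFilter()
bf.add("TP53", "MDM2")
bf.add("BRCA1", "BARD1")
print(("TP53", "MDM2") in bf)
```

A Bloom filter never reports a false negative for an inserted pair, but it can report false positives; the rate depends on the bit-array size and the number of hash functions.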





□ MONTI: A Multi-Omics Non-negative Tensor Decomposition Framework for Gene-Level Integrative Analysis

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.682841/full

SNF (Similarity Network Fusion) integrates multi-omics data by constructing networks for each omics data in terms of the sample similarity using the omics data and then fusing the networks iteratively using the message-passing method.

MONTI (Multi-Omics Non-negative Tensor Decomposition Integration) learns hidden features through tensor decomposition for the integration of multi-omics data. The omics matrices, which all share the same genes, are stacked to form a 3-dimensional tensor.




□ Improving structural variant clustering to reduce the negative effect of the breakpoint uncertainty problem

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04374-3

A statistically significant enrichment of the pattern of decomposed SVs is observed during the evaluation of conventional clustering strategies.

It can be argued that MEI-based quantities, especially Nic, have limited informative value in this case, because maximization of Nic is implicitly included in the constrained clustering algorithm.





□ LoHaMMer: Evaluation of Vicinity-based Hidden Markov Models for Genotype Imputation

>> https://www.biorxiv.org/content/10.1101/2021.09.28.462261v1.full.pdf

The HMM evaluates the paths over only a short stretch of variants around the untyped variants. LoHaMMer can either perform the computations in the logarithmic domain or scale the ML and forward-backward variables by a scaling factor.

LoHaMMer keeps track of any overflow and underflow at each computation step. If an array value becomes too high or too low, the values are re-scaled to ensure numerical stability.
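The two numerical strategies mentioned above (log-domain arithmetic versus rescaling) can be compared on a generic HMM forward pass. This is a textbook sketch, not LoHaMMer's code; the two-state model and its probabilities are made up for illustration, and all probabilities are assumed strictly positive.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log(init, trans, emit, obs):
    """Forward algorithm carried out entirely in the log domain."""
    n = len(init)
    alpha = [math.log(init[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        alpha = [
            logsumexp([alpha[r] + math.log(trans[r][s]) for r in range(n)])
            + math.log(emit[s][o])
            for s in range(n)
        ]
    return logsumexp(alpha)


def forward_scaled(init, trans, emit, obs):
    """Forward algorithm that renormalizes at each step and accumulates log scales."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        alpha = [a / scale for a in alpha]  # keeps values away from underflow
        log_lik += math.log(scale)
    return log_lik


# Hypothetical two-state model; 2000 observations would underflow a naive product.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1] * 1000
print(forward_log(init, trans, emit, obs), forward_scaled(init, trans, emit, obs))
```

Both routes return the same log-likelihood up to rounding; the scaled version avoids repeated log and exp calls, at the cost of tracking the scale factors.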





□ Evolutionary strategies applied to artificial gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2021.09.28.462218v1.full.pdf

A population of computational robotic models controlled by artificial gene regulatory networks (AGRNs) is used to evaluate the impact of different genetic modification strategies in the course of evolution.

A gradual increase in the complexity of the performed tasks is beneficial for the evolution of the model.





□ STRATISFIMAL LAYOUT: A modular optimization model for laying out layered node-link network visualizations

>> https://ieeexplore.ieee.org/document/9556579/

The layout optimization model prioritizes optimality over scalability, because an optimal solution not only represents the best attainable result but can also serve as a baseline to evaluate the effectiveness of layout heuristics.

STRATISFIMAL LAYOUT, a modular integer-linear-programming formulation that can consider several important readability criteria simultaneously – crossing reduction, edge bendiness, and nested and multi-layer groups.




□ Incomplete Multiple Kernel Alignment Maximization for Clustering

>> https://ieeexplore.ieee.org/document/9556554/

The multiple kernel alignment (MKA) maximization criterion has been widely applied to multiple kernel clustering (MKC), and many variants have recently been developed.

The clustering of MKA maximization guides the imputation of incomplete kernel elements, and the completed kernel matrices are in turn combined to conduct the subsequent multiple kernel clustering.





□ Open Imputation Server provides secure Imputation services with provable genomic privacy

>> https://www.biorxiv.org/content/10.1101/2021.09.30.462262v1.full.pdf

A client-server-based outsourcing framework for genotype imputation, an important step in genomic data analyses.

Genotype data is encrypted once at the client and submitted to the server, which securely imputes the untyped variants without decrypting the genotypes.





□ ssNet: Integration of probabilistic functional networks without an external Gold Standard

>> https://www.biorxiv.org/content/10.1101/2021.10.01.462727v1.full.pdf

ssNet is easier and faster, overcoming the challenges of data redundancy, Gold Standard bias and ID mapping, while producing comparable performance. In addition, ssNet results in less loss of data and produces a more complete network.

The ssNet method provides a computationally amenable one-step PFIN integration method for functional interaction data. ssNet takes a BioGRID file of functional interaction data for a species and produces a probabilistic functional integrated network.





□ CellDepot: A unified repository for scRNA-seq data and visual exploration

>> https://www.biorxiv.org/content/10.1101/2021.09.30.462602v1.full.pdf

CellDepot integrates with an advanced single-cell transcriptomic data explorer to conduct all analytical tasks on the webserver, while presenting interactive results on the webpage by leveraging modern web development techniques.

CellDepot requires scRNA-seq data in an h5ad file where the expression matrix is stored in CSC (compressed sparse column) instead of CSR (compressed sparse row) format to improve the speed of data retrieval.
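Why column-compressed storage helps can be seen in a few lines. The sketch below builds a CSC representation by hand, assuming a cells × genes matrix in which each column is a gene; the helper names are hypothetical and this is not CellDepot's code. With CSC, the expression of one gene across all cells is a single contiguous slice.

```python
def to_csc(dense):
    """Convert a dense row-major matrix (list of rows) to (data, indices, indptr)."""
    n_rows, n_cols = len(dense), len(dense[0])
    data, indices, indptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                data.append(dense[i][j])
                indices.append(i)  # row (cell) index of this non-zero
        indptr.append(len(data))  # column j occupies data[indptr[j]:indptr[j + 1]]
    return data, indices, indptr


def gene_column(csc, j):
    """Return {cell_index: value} for gene j using one contiguous slice."""
    data, indices, indptr = csc
    start, end = indptr[j], indptr[j + 1]
    return dict(zip(indices[start:end], data[start:end]))


# Toy matrix: 3 cells (rows) x 3 genes (columns).
counts = [
    [0, 3, 0],
    [1, 0, 0],
    [0, 2, 5],
]
csc = to_csc(counts)
print(gene_column(csc, 1))
```

In CSR the same query would have to visit every row, which is why a gene-centric viewer plausibly prefers CSC.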





□ Productive visualization of high-throughput sequencing data using the SeqCode open portable platform

>> https://www.nature.com/articles/s41598-021-98889-7

SeqCode is entirely focused on the graphical analysis of 1D genomic data. It has been implemented in ANSI C following a modular architecture of blocks.




□ DisCovER: distance- and orientation-based covariational threading for weakly homologous proteins

>> https://pubmed.ncbi.nlm.nih.gov/34599831/

DisCovER, a new distance- and orientation-based covariational threading method that integrates information from inter-residue distance and orientation along with the topological network neighborhood of a query-template alignment.

DisCovER selects a subset of templates using standard profile-based threading coupled with topological network similarity terms, and subsequently performs distance- and orientation-based query-template alignment using an iterative double dynamic programming framework.





□ SamQL: a structured query language and filtering tool for the SAM/BAM file format

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04390-3

SamQL has intuitive syntax allowing complex queries and takes advantage of parallelizable handling of BAM files.

SamQL builds an abstract syntax tree (AST) corresponding to the query. The AST is then parsed, depth-first, to progressively build a function closure that encapsulates the whole query.
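The AST-to-closure step can be sketched generically. The example below is not SamQL's implementation or query syntax: the tuple-based AST, the field names, and the dict records are assumptions chosen to keep the example self-contained. Each node is compiled depth-first into a function, and the root closure encapsulates the whole query.

```python
import operator

# Comparison operators allowed in leaf nodes.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def compile_ast(node):
    """Recursively turn an AST node into a predicate record -> bool."""
    kind = node[0]
    if kind == "cmp":
        _, field, op, value = node
        compare = OPS[op]
        return lambda rec: compare(rec[field], value)
    if kind == "not":
        inner = compile_ast(node[1])
        return lambda rec: not inner(rec)
    if kind in ("and", "or"):
        left, right = compile_ast(node[1]), compile_ast(node[2])  # children first
        if kind == "and":
            return lambda rec: left(rec) and right(rec)
        return lambda rec: left(rec) or right(rec)
    raise ValueError(f"unknown node type: {kind}")


# Hypothetical query: chromosome is chr1 AND mapping quality >= 30 AND NOT duplicate.
query = ("and",
         ("cmp", "rname", "==", "chr1"),
         ("and", ("cmp", "mapq", ">=", 30), ("not", ("cmp", "dup", "==", True))))
keep = compile_ast(query)

records = [
    {"rname": "chr1", "mapq": 60, "dup": False},
    {"rname": "chr1", "mapq": 10, "dup": False},
    {"rname": "chr2", "mapq": 60, "dup": False},
]
print([keep(r) for r in records])
```

Compiling once and then applying the closure per record avoids re-walking the tree for every alignment.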




□ Spatial rank-based multifactor dimensionality reduction to detect gene–gene interactions for multivariate phenotypes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04395-y

The new multivariate rank-based MDR (MR-MDR) is mainly suitable for analyzing multiple continuous phenotypes and is less sensitive to skewed distributions and outliers.

MR-MDR utilizes fuzzy k-means clustering and classifies multi-locus genotypes into two groups. Then, MR-MDR calculates a spatial rank-sum statistic as an evaluation measure and selects the best interaction model with the largest statistic.





□ BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data

>> https://www.biorxiv.org/content/10.1101/2021.10.02.462868v1.full.pdf

BioKIT, a versatile toolkit with 40 functions, several of which were community sourced, that conduct routine and novel processing and analysis of diverse sequence files including genome assemblies, multiple sequence alignments, protein coding sequences, and sequencing data.

Functions implemented in BioKIT facilitate a wide variety of standard bioinformatic analyses, including genome assembly quality assessment and the calculation of multiple sequence alignment properties (number of taxa, alignment length, and number of parsimony-informative sites).
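As an illustration of the alignment properties listed above, the sketch below computes them for a small in-memory alignment. It does not use BioKIT's API; the function name is hypothetical. It relies on the standard definition of a parsimony-informative site: a column with at least two different characters that each occur in at least two sequences. Treating "-" and "?" as missing data is an assumption.

```python
from collections import Counter


def alignment_summary(seqs):
    """Return (number of taxa, alignment length, parsimony-informative sites)."""
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("all sequences in an alignment must have the same length")
    length = lengths.pop()
    informative = 0
    for col in range(length):
        # Count characters in this column, skipping gaps and missing data.
        counts = Counter(s[col].upper() for s in seqs if s[col] not in "-?")
        shared = [c for c, k in counts.items() if k >= 2]
        if len(shared) >= 2:
            informative += 1
    return len(seqs), length, informative


alignment = ["ACGT", "ACGA", "AGGA", "AGTT"]
print(alignment_summary(alignment))
```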




□ iDNA-ABT: advanced deep learning model for detecting DNA methylation with adaptive features and transductive information maximization

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab677/6380543

iDNA-ABT, an advanced deep learning model that utilizes adaptive embedding based on bidirectional transformers for language understanding together with a novel transductive information maximization (TIM) loss.

iDNA-ABT can automatically and adaptively learn the distinguishing features of biological sequences from multiple species. iDNA-ABT has strong adaptability and robustness to different species through comparison of adaptive embedding and six handcrafted feature encodings.




□ Efficient Change-Points Detection For Genomic Sequences Via Cumulative Segmented Regression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab685/6380564

The cumulative segmented algorithm (cumSeg) has been recently proposed as a computationally efficient approach for multiple change-points detection, which is based on a simple transformation of data and provides results quite robust to model mis-specifications.

Two new change-points detection procedures are proposed in the framework of cumulative segmented regression. The proposed methods not only improve the efficiency of each change point estimator substantially but also provide the estimators with similar variations for all the change points.




□ K2Mem: Discovering Discriminative K-mers from Sequencing Data for Metagenomic Reads Classification

>> https://ieeexplore.ieee.org/document/9557831/

The paper studies the problem of metagenomic read classification by improving the reference k-mer library with novel discriminative k-mers from the input sequencing reads, and proposes a metagenomic classification tool named K2Mem.

K2Mem is based not only on a set of reference genomes but also on discriminative k-mers from the input metagenomic reads, which are used to improve the classification.





□ Mining hidden knowledge: Embedding models of cause-effect relationships curated from the biomedical literature

>> https://www.biorxiv.org/content/10.1101/2021.10.07.463598v1.full.pdf

Gene embeddings are based on literature-derived downstream expression signatures, in contrast to embeddings obtained with existing approaches that leverage either co-expression or protein binding networks.

The embeddings are built from the QIAGEN Knowledge Base (QKB), a structured collection of biomedical content. Function embeddings are constructed using gene embedding vectors with a linear model trained on signed gene-function relationships.





□ NS-Forest 2.0: A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing

>> https://genome.cshlp.org/content/31/10/1767.full

Necessary and Sufficient Forest (NS-Forest) version 2.0 leverages the nonlinear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that optimally capture the cell type identity.

In NS-Forest v2.0, all permutations of the selected top-ranked genes are tested and their performance assessed using the weighted F-beta score. The F-beta score contains a weighting term, beta, that allows for emphasizing either precision or recall.

Weighting for precision (the contributions of false positives) versus recall (the contributions of false negatives) limits the impact of zero inflation (or drop-out), a known technical artifact in scRNA-seq data, on marker gene assessment.
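The effect of the weighting term can be checked with the standard F-beta formula, F_β = (1 + β²)·P·R / (β²·P + R). The sketch below is generic and not taken from NS-Forest; the counts and the choice β = 0.5 are illustrative assumptions. Values of β below 1 weight precision more heavily, so false negatives caused by drop-out are penalized less.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical marker: few false positives, but many cells missed (e.g. drop-out).
tp, fp, fn = 8, 2, 8
print(f_beta(tp, fp, fn, beta=1.0))   # balanced F1
print(f_beta(tp, fp, fn, beta=0.5))   # precision-weighted
```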





□ BioVAE: a pre-trained latent variable language model for biomedical text mining

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab702/6390793

OPTIMUS has successfully combined BERT-based PLMs and GPT-2 with variational autoencoders (VAEs), achieving SOTA in both representation learning and language generation tasks. However, they are trained only on general domain text, and biomedical models are still missing.

BioVAE, the first large scale pre-trained latent variable language model for the biomedical domain, which uses the OPTIMUS framework to train on large volumes of biomedical text. BioVAE can generate more accurate biomedical sentences than the original OPTIMUS output.




□ pLMMGMM: A penalized linear mixed model with generalized method of moments for complex phenotype prediction

>> https://www.biorxiv.org/content/10.1101/2021.10.11.463997v1.full.pdf

pLMMGMM is built within the linear mixed model framework, where random effects are used to model the joint predictive effects from all genetic variants within a region.

pLMMGMM can jointly consider a large number of genetic regions and efficiently select those harboring variants with both linear and non-linear predictive effects.




□ NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses

>> https://www.biorxiv.org/content/10.1101/2021.10.21.465343v1.full.pdf

NAToRA is an algorithm that minimizes the number of individuals to be removed from a dataset. In the context of complex network theory, NAToRA finds the maximum clique in the complement networks.

NAToRA is also compatible with relatedness metrics calculated by the REAP method, which is more appropriate for admixed populations than PLINK and KING.
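The graph formulation can be illustrated on a toy dataset. Keeping the largest set of mutually unrelated individuals is a maximum independent set in the relatedness network, which is the same as a maximum clique in its complement. The exhaustive search below is only feasible for a handful of individuals and does not reproduce NAToRA's algorithms; the function name and the sample IDs are hypothetical.

```python
from itertools import combinations


def largest_unrelated_subset(individuals, related_pairs):
    """Largest subset with no related pair (a maximum clique in the complement graph)."""
    related = {frozenset(p) for p in related_pairs}
    # Try subset sizes from largest to smallest; the first valid subset is maximal.
    for size in range(len(individuals), 0, -1):
        for subset in combinations(individuals, size):
            if all(frozenset(pair) not in related for pair in combinations(subset, 2)):
                return set(subset)
    return set()


# Individual A is related to B, C and D; removing A alone resolves all relatedness.
samples = ["A", "B", "C", "D", "E"]
pairs = [("A", "B"), ("A", "C"), ("A", "D")]
print(sorted(largest_unrelated_subset(samples, pairs)))
```

A naive strategy that drops one member of each related pair could remove up to three individuals here, whereas the optimal solution removes only one.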




δακτύλιος.

2021-10-13 22:13:33 | Science News


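All the evil karma I have created in the past
arises from beginningless greed, anger, and ignorance,
born of body, speech, and mind;
I now repent of it all.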

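A sound is drawn into silence from the very instant it arises, as if sending a signal, together with a fleeting wish, into the interstices of the flickering phenomenal world.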



□ Hyperspherical Dirac Mixture Reapproximation

>> https://arxiv.org/pdf/2110.10411.pdf

Hyperspherical localized cumulative distribution (HLCD) is introduced as a local and smooth characterization of the underlying continuous density in hyperspherical domains.

A manifold-adapted modification of the Cramér–von Mises distance measures the statistical divergence between two Dirac mixtures. The hyperspherical Dirac mixture reapproximation (HDMR) is proposed for efficient discrete probabilistic modeling on unit hyperspheres of arbitrary dimensions.





□ Tangent Space and Dimension Estimation with the Wasserstein Distance

>> https://arxiv.org/pdf/2110.06357.pdf

The estimators arise from a local version of principal component analysis (PCA). This approach directly estimates covariance matrices locally, which simultaneously allows estimating both the tangent spaces and the intrinsic dimension of a manifold.

The key arguments involve a matrix concentration inequality, a Wasserstein bound for flattening a manifold, and a Lipschitz relation for the covariance matrix with respect to the Wasserstein distance.




□ hifiasm-meta: Metagenome assembly of high-fidelity long reads

>> https://arxiv.org/pdf/2110.08457.pdf

hifiasm-meta has an optional read selection step that reduces the coverage of highly abundant strains without losing reads on low abundant strains. hifiasm-meta tries to protect reads in genomes of low coverage, which may be treated as chimeric reads.

hifiasm-meta only drops a contained read if other reads exactly overlapping with the read are inferred to come from the same haplotype. This reduces contig breakpoints caused by contained reads.

hifiasm-meta uses the coverage information to prune unitig overlaps, assuming unitigs from the same strain tend to have similar coverage. It also tries to join unitigs from different haplotypes to patch the remaining assembly gaps.





□ qc3C: Reference-free quality control for Hi-C sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008839

qc3C can perform quality control without access to a reference sequence, the lack of which has until now been a significant stopping point for projects not involving model organisms.

qc3C can also perform reference-based analysis. Statistics obtained from “bam mode” include such details as the number of read-through events and HiCPro style pair categorisation e.g. dangling-end, self-circle.





□ Circall: fast and accurate methodology for discovery of circular RNAs from paired-end RNA-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04418-8

Circall builds the back-splicing junction (BSJ) database based on the annotated reference, and thus depends on the completeness of the annotation.

Circall controls the FPs using a robust multidimensional local false discovery rate method based on the length and expression of circRNAs. It is computationally highly efficient by using a quasi-mapping algorithm for fast and accurate RNA read alignments.





□ scGAD: single-cell gene associating domain scores for exploratory analysis of scHi-C data

>> https://www.biorxiv.org/content/10.1101/2021.10.22.465520v1.full.pdf

scGAD enables summarization at the gene level while accounting for inherent gene-level genomic biases. Low-dimensional projections with scGAD capture clustering of cells based on their 3D structures.

Projection onto the scRNA-seq embedding from the same system revealed that the cells originating from the same cell type but quantified by different data modalities were tightly clustered. scGAD facilitated an accurate projection of cells onto this larger space.





□ SMURF: End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman

>> https://www.biorxiv.org/content/10.1101/2021.10.23.465204v1.full.pdf

SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction.

SMURF begins with a learned alignment module (LAM). For each sequence, a convolutional architecture produces a matrix of match scores between the sequence and a reference. A similarity tensor is constructed for each sequence with the vectors for the query sequence.
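The core idea of a differentiable Smith-Waterman is to replace the hard max in the dynamic-programming recursion with a temperature-controlled log-sum-exp. The sketch below shows that substitution in plain Python with a linear gap penalty; it is not SMURF's formulation or code, and the scoring values and function names are assumptions. As the temperature approaches zero, the smooth score approaches the classic local alignment score.

```python
import math


def smooth_max(values, temperature):
    """Temperature-scaled log-sum-exp; tends to max(values) as temperature -> 0."""
    m = max(values)
    return m + temperature * math.log(sum(math.exp((v - m) / temperature) for v in values))


def match_scores(a, b, match=2.0, mismatch=-1.0):
    """Toy substitution scores; in a learned model these would come from a network."""
    return [[match if x == y else mismatch for y in b] for x in a]


def smooth_smith_waterman(score, gap=-1.0, temperature=1.0):
    """Local alignment score with every max replaced by smooth_max."""
    n, m = len(score), len(score[0])
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    cells = [0.0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = smooth_max(
                [0.0,                                  # start a new local alignment
                 h[i - 1][j - 1] + score[i - 1][j - 1],  # align the two residues
                 h[i - 1][j] + gap,                    # gap in the second sequence
                 h[i][j - 1] + gap],                   # gap in the first sequence
                temperature,
            )
            cells.append(h[i][j])
    return smooth_max(cells, temperature)  # smooth maximum over all end points


scores = match_scores("ACGT", "ACGT")
print(smooth_smith_waterman(scores, temperature=0.01))
print(smooth_smith_waterman(scores, temperature=1.0))
```

Because log-sum-exp is smooth, gradients with respect to the score matrix exist everywhere, which is what allows the alignment to be learned end to end.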





□ MERINGUE: Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomic data with nonuniform cellular densities

>> https://genome.cshlp.org/content/31/10/1843

MERINGUE, a density-agnostic method for identifying spatial gene expression heterogeneity using spatial autocorrelation and cross-correlation analyses.

MERINGUE first represents these cells as neighborhoods using Voronoi tessellation. In Voronoi tessellation, planes are partitioned into neighborhoods where a neighborhood for a cell consists of all points closer to that cell than any other.





□ scMRA: A robust deep learning method to annotate scRNA-seq data with multiple reference datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab700/6384568

In scMRA, a knowledge graph is constructed to represent the characteristics of cell types in different datasets, and a graph convolutional network (GCN) serves as a discriminator. scMRA keeps intra-cell-type closeness and the relative position of cell types across datasets.

Single-cell Multiple Reference Annotator (scMRA) is tailored to transfer knowledge from multiple well-annotated datasets to the target unlabeled data. scMRA integrates information from those extra cell types into the adjacency matrix to better learn the embeddings of sequencing data.





□ FastqCLS: a FASTQ Compressor for Long-read Sequencing via read reordering using a novel scoring model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab696/6384565

Various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only.

FastqCLS, a new FASTQ compression tool specialized for long-read sequencing data of large genomes using read reordering and zpaq, which employs arithmetic coding, a form of entropy encoding.





□ Efficient inference for agent-based models of real-world phenomena

>> https://www.biorxiv.org/content/10.1101/2021.10.04.462980v1.full.pdf

While some methods generally produce more robust results than others, no algorithm offers a one-size-fits-all solution when attempting to infer model parameters from observations.

The predictions of the emulators are directly compared to the mock observations, i.e. the synthetic data. The underlying model parameters (Θ) are then inferred using rejection Approximate Bayesian Computation and Markov chain Monte Carlo.
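Rejection ABC is simple enough to show in full. In the sketch below a toy Gaussian simulator stands in for the emulator, and the prior range, the summary statistic (the sample mean), and the tolerance are all assumptions for illustration; none of it comes from the paper's code.

```python
import random


def simulate(theta, n, rng):
    """Toy simulator: summary statistic (mean) of n Gaussian draws around theta."""
    return sum(rng.gauss(theta, 1.0) for _ in range(n)) / n


def rejection_abc(observed, n, num_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lies within tolerance of the data."""
    accepted = []
    for _ in range(num_draws):
        theta = rng.uniform(-5.0, 5.0)  # assumed uniform prior on theta
        if abs(simulate(theta, n, rng) - observed) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(0)
observed_mean = 2.0  # mock observation (synthetic data)
posterior = rejection_abc(observed_mean, n=20, num_draws=5000, tolerance=0.2, rng=rng)
print(len(posterior), sum(posterior) / len(posterior))
```

The accepted draws approximate the posterior over Θ; shrinking the tolerance sharpens the approximation but lowers the acceptance rate, which is where emulators and MCMC variants become attractive.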





□ DiviSSR: Simple arithmetic for efficient identification of tandem repeats

>> https://www.biorxiv.org/content/10.1101/2021.10.05.462997v1.full.pdf

DiviSSR identifies tandem repeats by applying a division rule on the binary numbers resultant after 2-bit transformations of DNA sequences. DiviSSR is on average 5-10 fold faster than the next best tools and takes just ~30 secs to identify all perfect microsatellites in the human genome.

DiviSSR merges repeats as it scans through the input sequence by storing the location of the previous repeat. The time complexity of DiviSSR is O(nm), where n is the input data size and m is the number of desired motif sizes.
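One way to read the division rule: encode each base in 2 bits, so a window of length L becomes an integer below 4^L. A window made of k copies of a motif of size m equals the motif's value times (4^L − 1)/(4^m − 1), so divisibility by that number identifies a perfect repeat. The sketch below follows this reading; it is not DiviSSR's implementation, and the function names and encoding order are assumptions.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def is_perfect_repeat(window, motif_size):
    """True if window consists of at least two exact copies of a motif of motif_size."""
    length = len(window)
    if length % motif_size != 0 or length < 2 * motif_size:
        return False
    value = 0
    for base in window:  # raises KeyError on non-ACGT characters
        value = (value << 2) | ENCODE[base]
    # Base-4 number with a 1 at every motif boundary, e.g. 010101 for m = 2, L = 6.
    divisor = (4 ** length - 1) // (4 ** motif_size - 1)
    return value % divisor == 0


def find_perfect_repeats(seq, motif_size, copies):
    """Start positions of windows holding `copies` tandem copies of some motif."""
    window = motif_size * copies
    return [i for i in range(len(seq) - window + 1)
            if is_perfect_repeat(seq[i:i + window], motif_size)]


print(find_perfect_repeats("TTACACACGG", motif_size=2, copies=3))
```

The quotient of the division, when it is exact, is the 2-bit encoding of the motif itself, so the repeat unit comes for free.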





□ NN-RNALoc: neural network-based model for prediction of mRNA sub-cellular localization using distance-based sub-sequence profiles

>> https://www.biorxiv.org/content/10.1101/2021.10.06.463397v1.full.pdf

NN-RNALoc is a machine-learning-based model for predicting the sub-cellular location of mRNAs, evaluated on the following two datasets: Cefra-seq and RNALocate.

NN-RNALoc has a simple and transparent neural network architecture and employs distance-based sub-sequence profiles along with k-mer frequencies and PPI matrix data.





□ mmbam: Memory mapped parallel BAM file access API for high throughput sequence analysis informatics

>> https://www.biorxiv.org/content/10.1101/2021.10.05.463280v1.full.pdf

mmbam, a library to allow sequence analysis informatics software to access raw sequencing data stored in BAM files extremely fast.

mmbam enables parallel processing of alignment data via memory-mapped file access, and utilizes the scatter/gather paradigm to parallelize computation tasks across many genomic regions before combining the regional results to produce global results.
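The scatter/gather pattern over a memory-mapped file can be sketched with the Python standard library. This is not the mmbam API and it does not parse BAM; it splits an arbitrary file into byte regions, counts a single byte value in each region in parallel, and sums the partial counts. The function name and the region count are assumptions.

```python
import mmap
from concurrent.futures import ThreadPoolExecutor


def scatter_gather_count(path, byte, num_regions=4):
    """Count occurrences of a single byte in a file using regional workers."""
    with open(path, "rb") as handle, \
            mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        size = len(mapped)
        # Scatter: contiguous, non-overlapping regions covering the whole file.
        bounds = [size * i // num_regions for i in range(num_regions + 1)]
        regions = list(zip(bounds, bounds[1:]))
        with ThreadPoolExecutor(max_workers=num_regions) as pool:
            partial = pool.map(lambda r: mapped[r[0]:r[1]].count(byte), regions)
            # Gather: combine the regional results into one global result.
            return sum(partial)
```

Counting a single byte is safe across region boundaries; a real per-region task on alignments would need region boundaries that respect record boundaries.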




□ CoMM-S4: A Collaborative Mixed Model Using Summary-Level eQTL and GWAS Datasets in Transcriptome-Wide Association Studies

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.704538/full

CoMM-S4 (Collaborative Mixed Models using Summary Statistics from eQTL and GWAS) is a likelihood-based probabilistic model for assessing expression-trait association from summary-level eQTL and GWAS data.

CoMM-S4, like S-PrediXcan, is not able to distinguish between causal relationship and horizontal pleiotropy. CoMM-S4 uses an efficient algorithm based on variational Bayes expectation-maximization and parameter expansion (PX-VBEM).




□ Fast and compact matching statistics analytics

>> https://www.biorxiv.org/content/10.1101/2021.10.05.463202v1.full.pdf

A lossy compression scheme that can reduce the size of the compact encoding to much less than 2|S| bits when S and T are dissimilar, by replacing small matching statistics values (that typically arise from random matches) with other, suitably chosen small values.

A practical variant of the algorithm that computes MS in parallel on a shared-memory machine, and that achieves approximately a 41-fold speedup of the core procedures and a 30-fold speedup of the entire program with 48 cores on the instances that are most difficult to parallelize.
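For readers unfamiliar with the object being compressed: under the standard definition, MS[i] is the length of the longest prefix of S[i:] that occurs somewhere in T. The quadratic-time sketch below only illustrates the definition; it is unrelated to the paper's fast algorithms and compact encodings, and the function name is hypothetical.

```python
def matching_statistics(s, t):
    """Naive matching statistics of s with respect to t."""
    ms = []
    for i in range(len(s)):
        length = 0
        # Extend the match while the next longer prefix of s[i:] still occurs in t.
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms


values = matching_statistics("abcab", "xabcy")
print(values)
# Consecutive values can drop by at most one, which compact encodings can exploit.
print(all(b >= a - 1 for a, b in zip(values, values[1:])))
```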





□ Lpnet: Reconstructing Phylogenetic Networks from Distances Using Integer Linear Programming

>> https://www.biorxiv.org/content/10.1101/2021.10.08.463657v1.full.pdf

The Lpnet algorithm uses a distance matrix as its input. First it constructs a phylogenetic tree from the distances, then it uses Linear Programming to find a circular ordering which maximizes the sum of all quartet weights consistent with the circular ordering.

Lpnet, a variant of Neighbor-net that does not apply the second heuristic step of the agglomeration. The integer linear programming problem in Lpnet uses a quadratic number of variables and a cubic number of constraints.





□ RDBKE: Enhancing breakpoint resolution with deep segmentation model: A general refinement method for read-depth based structural variant callers

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009186

deepIntraSV, a UNet model for segmenting intra-bin structural variants with base-pair read-depth data from WGS. RDBKE uses the deep segmentation model UNet to learn base-wise read depth (RD) patterns surrounding breakpoints of known SVs.

The UNet model can also be applied to one-dimensional genomic data. RDBKE formalizes breakpoint prediction as a segmentation task and infers breakpoints at single-nucleotide resolution from predicted label marks.





□ scREMOTE: Using multimodal single cell data to predict regulatory gene relationships and to build a computational cell reprogramming model

>> https://www.biorxiv.org/content/10.1101/2021.10.11.463798v1.full.pdf

scREMOTE, a novel computational model for cell reprogramming that leverages single cell multiomics data, enabling a more holistic view of the regulatory mechanisms at cellular resolution.

This is achieved by first identifying the regulatory potential of each transcription factor and gene to uncover regulatory relationships, then a regression model is built to estimate the effect of transcription factor perturbations.




□ Translation procedures in descriptive inner model theory

>> https://arxiv.org/pdf/2110.06091v1.pdf

If there is a stationary class of λ such that λ is a limit of Woodin cardinals and the derived model at λ satisfies AD⁺ + θ₀ < Θ, then there is a transitive model M such that Ord ⊆ M and M ⊨ “there is a proper class of Woodin cardinals and a strong cardinal”.

Using a theorem of Woodin on derived models it is not hard to see that the reverse of the aforementioned theorem is also true, thus proving that the two theories are in fact equiconsistent.





□ ONTdeCIPHER: An amplicon-based nanopore sequencing pipeline for tracking pathogen variants

>> https://www.biorxiv.org/content/10.1101/2021.10.13.464242v1.full.pdf

ONTdeCIPHER is an Oxford Nanopore Technology (ONT) amplicon-based sequencing pipeline to perform key downstream analyses on raw sequencing data from quality testing to SNPs effect to phylogenetic analysis.

ONTdeCIPHER integrates 13 bioinformatics tools, including Seqkit, ARTIC bioinformatics tool, PycoQC, MultiQC, Minimap2, Medaka, Nanopolish, Pangolin (with the model database pangoLEARN), Deeptools (PlotCoverage, BamCoverage), Sniffles, MAFFT, RAxML and snpEff.



□ Incomplete Multiple Kernel Alignment Maximization for Clustering

>> https://ieeexplore.ieee.org/document/9556554

Integrating the imputation of incomplete kernel matrices and Multiple Kernel Alignment maximization for clustering into a unified learning framework.

The clustering of Multiple Kernel Alignment maximization guides the imputation of incomplete kernel elements, and the completed kernel matrices are in turn combined to conduct the subsequent Multiple Kernel Clustering.

These two procedures are alternately performed until convergence. In this way, the imputation and Multiple Kernel Clustering processes are seamlessly connected.




□ LFMKC-PGR: Late Fusion Multiple Kernel Clustering With Proxy Graph Refinement

>> https://ieeexplore.ieee.org/document/9573366/

The kernel partition learning and late fusion processes are separated from each other in the existing mechanism, which may lead to suboptimal solutions and adversely affect the clustering performance.

LFMKC-PGR, a novel late fusion multiple kernel clustering with proxy graph refinement framework to address these issues. LFMKC-PGR constructs a proxy self-expressive graph from kernel base partitions.

The proxy graph in turn refines the individual kernel partitions and also captures partition relations in a graph structure rather than through a simple linear transformation.

LFMKC-PGR provides theoretical connections and considerations between the proposed framework and the multiple kernel subspace clustering. An alternate algorithm with proved convergence is then developed to solve the resultant optimization problem.





□ BASE: A novel workflow to integrate nonubiquitous genes in comparative genomics analyses for selection

>> https://onlinelibrary.wiley.com/doi/10.1002/ece3.7959

BASE is a workflow for analyses on selection regimes that integrates several popular pieces of software, with CodeML at its core. BASE allows users to seamlessly carry out a user-specified number of replicate analyses, incorporating random omega starting values.

Discordance between a gene tree and the species tree can stem from a wide range of technical and biological phenomena, such as sequence misalignment, nonorthology, and incomplete lineage sorting, which can ultimately bias evolutionary rate inference.

To account for this possibility, when a fixed species tree is specified, BASE will report its normalized Robinson–Foulds distance to each gene tree, calculated using ete3.




□ Eoulsan 2: an efficient workflow manager for reproducible bulk, long-read and single-cell transcriptomics analyses

>> https://www.biorxiv.org/content/10.1101/2021.10.13.464219v1.full.pdf

Eoulsan is a versatile framework based on the Hadoop implementation of the MapReduce algorithm, dedicated to high throughput sequencing data analysis on distributed computers.

Eoulsan 2, a major update that (i) enhances the workflow manager itself, (ii) facilitates the development of new modules, and (iii) expands its applications to long-read RNA-seq (Oxford Nanopore Technologies) and scRNA-seq (Smart-seq2 and 10x Genomics).




□ Polish topologies on groups of non-singular transformations

>> https://arxiv.org/pdf/2110.07289v1.pdf

The group of measure-preserving transformations of the real line whose support has finite measure carries no Polish group topology.

The paper characterizes the Borel σ-finite measures λ on a standard Borel space for which the group of λ-preserving transformations has the automatic continuity property. The natural Polish topology on the group of all non-singular transformations is actually its only Polish group topology.





□ Tailored graphical lasso for data integration in gene network reconstruction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04413-z

Assuming a Gaussian graphical model, a gene association network may be estimated from multiomic data based on the non-zero entries of the inverse covariance matrix.
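As a rough illustration of why the inverse covariance matrix encodes the network, here is a minimal pure-Python sketch. It is not the tailored graphical lasso and involves no penalty or prior: the three-gene chain, the matrix values, and the function names are assumptions chosen for the example. It shows that two genes can be correlated while their precision entry is zero, so no edge is drawn between them.

```python
from typing import List, Set, Tuple

Matrix = List[List[float]]


def invert(matrix: Matrix) -> Matrix:
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity on the right.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def edges_from_precision(precision: Matrix, tol: float = 1e-8) -> Set[Tuple[int, int]]:
    """Read network edges off the non-zero off-diagonal precision entries."""
    n = len(precision)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(precision[i][j]) > tol}


# Hypothetical chain 0 - 1 - 2: genes 0 and 2 interact only through gene 1.
true_precision = [[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]]
covariance = invert(true_precision)          # genes 0 and 2 are still correlated
recovered_edges = edges_from_precision(invert(covariance))
```

In practice the covariance must be estimated from data, which is where a sparsity-inducing penalty such as the graphical lasso comes in.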

The method is also readily interpretable through the estimated value of k, which gives a “usefulness score” for the prior information: k close to zero indicates that the prior information is not useful, while larger k indicates that it is.

The tailored graphical lasso is the most suitable for network inference from high-dimensional data with prior information of unknown accuracy.





□ Fractional Calderón problem on a closed Riemannian manifold

>> https://arxiv.org/pdf/2110.07500v1.pdf

The paper studies the inverse problem of recovering the isometry class of a smooth, closed, and connected Riemannian manifold (M, g),

given knowledge of a source-to-solution map for the fractional Laplace equation (−Δ)^α u = f on the manifold, restricted to an arbitrarily small observation region O where sources can be placed and solutions can be measured.

The result assumes only a local property on the a priori known observation region O and makes no geometric assumptions on the inaccessible region of the manifold, namely M \ O.

The proof is based on discovering a hidden connection to a variant of Carlson’s theorem in complex analysis, which reduces the non-local inverse problem to the Gel’fand inverse spectral problem.




□ Minimax extrapolation problem for periodically correlated stochastic sequences with missing observations

>> https://arxiv.org/pdf/2110.06675.pdf

Formulas that determine the least favorable spectral densities and the minimax-robust spectral characteristics of the optimal estimates of functionals are proposed in the case of spectral uncertainty,

where the spectral densities are not exactly known while some sets of admissible spectral densities are specified.





□ SIMBA: SIngle-cell eMBedding Along with features

>> https://www.biorxiv.org/content/10.1101/2021.10.17.464750v1.full.pdf

SIMBA is a single-cell embedding method with support for single- or multi-modality analyses that embeds cells and their associated genomic features into a shared latent space, generating interpretable and comparable embeddings of cells and features.

SIMBA readily corrects batch effects and produces joint embeddings of cells and features across multiple datasets with different sequencing platforms and cell type compositions.

SIMBA works as a stand-alone package, obviating the need for prior input data correction when applied to multi-batch scRNA-seq datasets. In SIMBA, batch correction is accomplished by encoding multiple scRNA-seq datasets into a single graph.




□ ORTHOSCOPE*: a phylogenetic pipeline to infer gene histories from genome-wide data

>> https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab301/6400256

ORTHOSCOPE* estimates a tree for a specified gene, detects speciation/gene duplication events that occurred at nodes belonging to only one lineage leading to a species of interest, and integrates results derived from gene trees estimated for all query genes in genome-wide data.

ORTHOSCOPE* can offer a set of orthology-confirmed gene markers for environmental DNA analyses. Using an amino acid file defined in the control.txt file, ORTHOSCOPE* automatically creates an amino acid database for each species with MAKEBLASTDB and the -dbtype prot option.





□ REViewer: Haplotype-resolved visualization of read alignments in and around tandem repeats

>> https://www.biorxiv.org/content/10.1101/2021.10.20.465046v1.full.pdf

Repeat Expansion Viewer (REViewer) has been designed to work with the read alignments produced by ExpansionHunter, though it will work with any repeat genotyping software that produces output in the appropriate format.

REViewer constructs all possible pairs of haplotype sequences from the STR genotypes. REViewer reconstructs local haplotype sequences and distributes reads to these haplotypes in a way that is most consistent with the fragment lengths and evenness of read coverage.
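The idea of building every possible haplotype pair from STR genotypes can be sketched in a few lines. This is not REViewer's code: it works on repeat counts rather than real sequences, and the function name and the example genotypes are assumptions. With unphased genotypes at several repeats, each way of assigning one allele per locus to the first haplotype (and the other allele to the second) yields a candidate pair.

```python
from itertools import product
from typing import List, Set, Tuple

Genotype = Tuple[int, int]      # repeat counts of the two alleles at one STR
Haplotype = Tuple[int, ...]     # one repeat count per STR, in locus order
HaplotypePair = Tuple[Haplotype, Haplotype]


def haplotype_pairs(genotypes: List[Genotype]) -> Set[HaplotypePair]:
    """Enumerate every distinct unordered pair of haplotypes."""
    pairs: Set[HaplotypePair] = set()
    for choice in product((0, 1), repeat=len(genotypes)):
        first = tuple(g[c] for g, c in zip(genotypes, choice))
        second = tuple(g[1 - c] for g, c in zip(genotypes, choice))
        # Sort so that (first, second) and (second, first) count once.
        low, high = sorted((first, second))
        pairs.add((low, high))
    return pairs


# Hypothetical locus with two neighbouring repeats, both heterozygous.
candidate_pairs = haplotype_pairs([(3, 5), (2, 4)])
```

Each candidate pair would then be scored by how well the reads fit it, which is the step where fragment lengths and evenness of coverage enter.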





□ Creating Generative Art NFTs from Genomic Data

>> https://towardsdatascience.com/creating-generative-art-nfts-from-genomic-data-16a48ae4df99

The article creates a dynamic NFT on the Ethereum blockchain with IPFS and discusses possible use cases for scientific data.

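// Mint a new ERC-721 token and assign it to `to`.
function _mint(address to, uint256 tokenId) internal virtual {
    // Reject the zero address and token IDs that already exist.
    require(to != address(0), "ERC721: mint to the zero address");
    require(!_exists(tokenId), "ERC721: token already minted");

    // Hook that derived contracts can override.
    _beforeTokenTransfer(address(0), to, tokenId);

    // Update the balance and ownership mappings.
    _balances[to] += 1;
    _owners[tokenId] = to;

    // A transfer from the zero address signals a mint.
    emit Transfer(address(0), to, tokenId);
}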






□ SINBAD: a flexible tool for single cell DNA methylation data

>> https://www.biorxiv.org/content/10.1101/2021.10.23.465577v1.full.pdf

SINBAD demultiplexes the raw reads using cell barcode sequence information, which is technology dependent. The indexed reads, which are defined as those that match the given indices, are generated for each individual cell as the output.
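A minimal sketch of barcode-based demultiplexing follows. It is not SINBAD's implementation: it assumes the cell barcode sits at the start of each read, has a fixed length, and must match exactly, whereas the real layout is technology dependent. The function name, barcodes, and reads are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read name, sequence)


def demultiplex(reads: Iterable[Read], barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Assign each read to a cell by exact match of its leading barcode."""
    barcode_len = len(next(iter(barcodes)))  # assumes equal-length barcodes
    per_cell: Dict[str, List[Read]] = defaultdict(list)
    for name, sequence in reads:
        cell = barcodes.get(sequence[:barcode_len])
        if cell is not None:
            # Keep the read with its barcode trimmed off; unmatched reads are dropped.
            per_cell[cell].append((name, sequence[barcode_len:]))
    return dict(per_cell)


# Hypothetical barcode-to-cell table and raw reads.
barcodes = {"ACGT": "cell_1", "TTAG": "cell_2"}
reads = [
    ("r1", "ACGTGGGATC"),
    ("r2", "TTAGCCCA"),
    ("r3", "GGGGAAAA"),   # no matching barcode
    ("r4", "ACGTTTTT"),
]
cells = demultiplex(reads, barcodes)
```

The per-cell read lists correspond to the "indexed reads" that feed the downstream steps.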

The dimensionality of the methylation matrix is reduced by the multivariate analysis module, and cell populations are detected by clustering analysis.





□ ProPIP: a tool for progressive multiple sequence alignment with Poisson Indel Process

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04442-8

In ProPIP, the process of insertions and deletions is described using an explicit evolutionary model, the Poisson Indel Process (PIP). The method is based on dynamic programming and is implemented in a frequentist framework.

Instead of the arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have biological interpretation and are contextualized in a probabilistic environment.
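To show what "biological interpretation" means for the two rates, here is a small sketch based on standard properties of the Poisson Indel Process as I understand them, not on ProPIP's code. With insertion rate λ and per-character deletion rate μ, the stationary sequence length is Poisson distributed with mean λ/μ, and a character survives a branch of length t with probability exp(−μt). The function names and the example rates are assumptions.

```python
import math


def expected_length(insertion_rate: float, deletion_rate: float) -> float:
    """Mean stationary sequence length under PIP: lambda / mu."""
    return insertion_rate / deletion_rate


def survival_probability(deletion_rate: float, branch_length: float) -> float:
    """Probability that one character is not deleted along a branch."""
    return math.exp(-deletion_rate * branch_length)


def length_probability(n: int, insertion_rate: float, deletion_rate: float) -> float:
    """Poisson probability of observing a stationary sequence of length n."""
    mean = expected_length(insertion_rate, deletion_rate)
    return math.exp(-mean) * mean ** n / math.factorial(n)


# Hypothetical rates: 5 insertions per unit time, deletion rate 0.5 per character.
mean_length = expected_length(5.0, 0.5)
survive_short_branch = survival_probability(0.5, 0.1)
survive_long_branch = survival_probability(0.5, 2.0)
```

Unlike a gap-opening penalty, both rates can be compared across datasets and related directly to branch lengths.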

ProPIP implements the originally published progressive MSA inference method based on PIP, and also introduces new features, such as stochastic backtracking and parallelisation.





□ TPSC: a module detection method based on topology potential and spectral clustering in weighted networks and its application in gene co-expression module discovery

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03964-5

The authors propose the Topology Potential-based Spectral Clustering (TPSC) algorithm, an improved module detection algorithm based on topology potential and spectral clustering, and use it to detect co-expression modules.

The TPSC algorithm found a module related to extracellular matrix and structure organization that was not identified by either the lmQCM or the WGCNA algorithm. The method improves upon a previous method for fully connected networks and asymmetric Laplacian matrices.







iPhone 13 Pro.

2021-10-13 22:13:17 | Digital / Internet


□ iPhone 13 Pro, Silver, 256GB (MLUP3J/A)

>> https://www.apple.com/iphone-13-pro/

Awesome! A solid build and excellent processing power.📱🔊🥰 Lately I have been keeping all my Apple products in silver, and the premium feel of white really is calming after all😌✨



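Test footage shot in Cinematic mode on the iPhone 13 Pro. Beyond the focus, it reproduces twilight colors just as I pictured them, without clumsily brightening them. The 3x optical zoom, available only on the Pro, is also so capable it is moving🪐🔭😌🔊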



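Cinematic mode on the iPhone 13 Pro is so much fun that I could play with it endlessly🌌🐬✨ It excels at acrobatic camera work that looks as if it were shot with a crane💫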









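Shot the starry sky with the iPhone 13 Pro in 10-second exposure mode. Capturing stars this crisply and smoothly on a smartphone, with so little noise, is revolutionary. Because the ultra-wide lens gathers less light, I am looking forward to a dedicated starscape mode😌🌌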





STAR WARS: VISIONS

2021-10-13 22:12:13 | Movies


『STAR WARS: VISIONS』

>> https://disneyplusoriginals.disney.com/show/star-wars-visions

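Star Wars as made by Japanese animation studios. “The Duel” is drawn in a gekiga style, an homage to the Kurosawa films that are its “origin.” “The Twins” is Japanese anime at its truest: a red-hot battle, a Tengen Toppa-style Star Wars🥰 That X-Wing lightsaber kamikaze charge seems like something I saw among the rejected ideas for EP7…🤔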















Pragma.

2021-10-12 22:27:52 | Diary / Essays / Columns


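The cosmologies of Zen and of Buddhism more broadly, such as the Abhidharmakośa and the Pure Land, are constructed with an awe-inspiring degree of system. With the social standing they have won as a de facto standard as their shell, and by means of a conceptual dynamical structure that carries practical force, they seek, as a refutation of the solipsism that lies on the far shore from realism and epistemology, to braid together the karma of the unawakened.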



───My grandfather passed away suddenly.

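Tonight I will keep vigil over the body alone, and, improper as it may be, I cannot stop being fascinated by the Buddhist altar fittings and rites before me. In his later years my grandfather, too, devoted himself to Buddhist studies as a hobby. He once proudly showed me a hanging scroll of sutras he had copied in his own hand. He served for many years in a senior post at the Japan Coast Guard, and only today we heard from headquarters about a decoration to be conferred on him.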


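My grandfather was feared as a tremendously strict man, yet I only remember being scolded when I played pranks as a small child. He had connections in political and business circles and was a warm-hearted man who cared for his colleagues. He had many interests, from overseas travel and cooking to the latest home appliances, and my own love of machines comes from him. I am told he also once served as chief engineer of a patrol vessel.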



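Funeral halls in recent years fuse modern architecture, built on the kindred Buddhist concepts of 'Zen + minimalism', with the high technology that supports their functions, and they truly have the look of near-future structures.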



Under the notarized will my grandfather left, I am to serve as executor and as representative of the heirs.

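Ever since the days of my great-grandfather, who founded a shipping business before the war, our family tree has been intricately tangled. It is easy to imagine that this duty will be extremely laborious, so I intend to leave it entirely to the Legal Affairs Bureau or a judicial scrivener.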




Où est mon étoile ?

2021-10-12 22:00:20 | Diary / Essays / Columns


『おもいで星がかがやくとき ("Où est mon étoile ?")』刀根里衣


It was sitting in the funeral hall, and once I started reading it, I was completely undone…😭

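The dead do not go 'somewhere'.
Accepting the parting and setting time in motion again is the driving force of life.
What a loved one leaves behind is not something to carry,

but something that pushes us forward.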