lens, align.

Long is the time, but the true comes to pass. (Hölderlin)

Metanode.

2021-11-11 23:13:17 | Science News






□ MetaGraph: Lossless Indexing with Counting de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467907v1.full.pdf

Together with the underlying graph, these annotations make up a data structure which we call a Counting de Bruijn graph. It can be used to represent quantitative information and, in particular, encode traces of the input sequences in de Bruijn graphs.

The concept of Counting de Bruijn graphs generalizes the notion of annotated (or colored) de Bruijn graphs: Counting de Bruijn graphs supplement each node-label relation with one or many attributes. An extended sequence-to-graph alignment algorithm is also introduced in MetaGraph.





□ Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467453v1.full.pdf

An implementation of the seed heuristic as part of the AStarix aligner, which exploits information from the whole read to quickly align it to a general graph reference, and guides the search by placing crumbs on nodes that lead towards optimal alignments, even for long reads.

AStarix rephrases the task of alignment as a shortest-path problem in an alignment graph extended by a trie index, and solves it using the A⋆ algorithm instantiated with a problem-specific prefix heuristic.
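As a toy illustration of the search strategy (not AStarix's implementation, whose graph is the alignment graph and whose heuristic comes from the seeds), here is a minimal A⋆ shortest-path search over an explicit edge-list graph:

```python
import heapq

def a_star(graph, start, goal, h):
    """Generic A* search. `graph` maps node -> [(neighbor, edge_cost)];
    `h` is an admissible heuristic (never overestimates remaining cost)."""
    best = {start: 0}
    frontier = [(h(start), 0, start, [start])]
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best.get(node, float("inf")):
            continue  # stale queue entry, a better path was already found
        for nbr, cost in graph.get(node, []):
            ng = g + cost
            if ng < best.get(nbr, float("inf")):
                best[nbr] = ng
                heapq.heappush(frontier, (ng + h(nbr), ng, nbr, path + [nbr]))
    return float("inf"), []
```

With `h = lambda n: 0` this degenerates to Dijkstra; a tighter admissible heuristic (like AStarix's seed-based one) prunes more of the search space.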





□ scDeepHash: An automatic cell type annotation and cell retrieval method for large-scale scRNA-seq datasets using neural network-based hashing

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467820v1.full.pdf

scDeepHash, a scalable scRNA-seq analytic tool that employs content-based deep hashing to index single-cell gene expressions. scDeepHash allows for fast and accurate automated cell-type annotation and similar-cell retrieval.

scDeepHash leverages the properties of the Hadamard matrix for cell-anchor generation, and enforces minimum information loss when quantizing continuous codes into discrete binary hash codes. scDeepHash formulates these two objectives as a Weighted Cell-Anchor Loss and a Quantization Loss.





□ scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467740v1.full.pdf

scGate purifies a cell population of interest using a set of markers organized in a hierarchical structure, akin to gating strategies employed in flow cytometry.

scGate automatically synchronizes its internal database of gating models. scGate takes as input a gene expression matrix or Seurat object and a “gating model” (GM), consisting of a set of marker genes that define the cell population of interest.





□ ENGRAM: Multiplex genomic recording of enhancer and signal transduction activity in mammalian cells

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467434v1.full.pdf

ENGRAM (ENhancer-driven Genomic Recording of transcriptional Activity in Multiplex), an alternative paradigm in which the activity and dynamics of multiple transcriptional reporters are stably recorded to DNA.

ENGRAM is based on the prime editing-mediated insertion of signal- or enhancer-specific barcodes into a genomically encoded recording unit. This strategy can be used to concurrently record, in the genome, the relative activity of at least hundreds of enhancers.





□ Block aligner: fast and flexible pairwise sequence alignment with SIMD-accelerated adaptive blocks

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467651v1.full.pdf

Block aligner greedily shifts and grows a block of computed scores to span large gaps within the aligned sequences. This greedy approach computes only a fraction of the DP matrix. Since differences between cells are small, this allows for maximum parallelism with SIMD vectors.

Block aligner uses the Smith-Waterman-Gotoh algorithm, along with its global variant, the Needleman-Wunsch algorithm. Both are dynamic-programming algorithms that compute the optimal alignment of two sequences in an O(|q||r|) matrix, along with the transition directions (traceback).
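For reference, a minimal full-matrix Needleman-Wunsch score computation (Block aligner's contribution is precisely to avoid filling this whole matrix; the scoring parameters here are illustrative):

```python
def needleman_wunsch(q, r, match=1, mismatch=-1, gap=-1):
    """Global alignment score over the full O(|q||r|) DP matrix."""
    n, m = len(q), len(r)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # aligning q[:i] against an empty prefix
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if q[i - 1] == r[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # (mis)match
                           dp[i - 1][j] + gap,      # gap in r
                           dp[i][j - 1] + gap)      # gap in q
    return dp[n][m]
```

Keeping the argmax direction at each cell would give the traceback mentioned above.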





□ Sincast: a computational framework to predict cell identities in single cell transcriptomes using bulk atlases as references

>> https://www.biorxiv.org/content/10.1101/2021.11.07.467660v1.full.pdf

Sincast is a computational framework to query scRNA-seq data based on bulk reference atlases. Single cell data are transformed to be directly comparable to bulk data, either with pseudo-bulk aggregation or graph-based imputation to address sparse single cell expression profiles.

Sincast avoids batch effect correction, and cell identity is predicted along a continuum to highlight new cell states not found in the reference atlas. Sincast projects single cells into the correct biological niches in the expression space of the bulk reference atlas.





□ Spacemake: processing and analysis of large-scale spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2021.11.07.467598v1.full.pdf

Spacemake is designed to handle all major spatial transcriptomics datasets and can be readily configured to run on other technologies. It can process and analyze several samples in parallel, even if they stem from different experimental methods.

Spacemake enables reproducible data processing from raw data to automatically generated downstream analysis. Spacemake is built with a modular design and offers additional functionality such as sample merging, saturation analysis and analysis of long-reads as separate modules.





□ Weak SINDy for partial differential equations

>> https://www.sciencedirect.com/science/article/pii/S0021999121004204

a learning algorithm for the threshold in sequential-thresholding least-squares (STLS) that enables model identification from large libraries, and utilization of scale invariance at the continuum level to identify PDEs from poorly-scaled datasets.

The WSINDy algorithm for identification of PDE systems using the weak form of the dynamics has a worst-case computational complexity of O(N^(D+1) log N) for datasets with N points in each of D+1 dimensions.
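A minimal numpy sketch of the basic sequentially-thresholded least-squares loop; the paper's contribution, learning the threshold automatically, is not reproduced here (`lam` is a fixed assumption):

```python
import numpy as np

def stls(theta, y, lam=0.1, iters=10):
    """Sequentially thresholded least squares: repeatedly solve the
    least-squares problem and zero out coefficients smaller than `lam`."""
    xi = np.linalg.lstsq(theta, y, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < lam
        xi[small] = 0.0
        big = ~small
        if big.any():
            # re-fit only on the surviving library columns
            xi[big] = np.linalg.lstsq(theta[:, big], y, rcond=None)[0]
    return xi
```

Here `theta` plays the role of the (weak-form) library matrix and `xi` the sparse coefficient vector selecting PDE terms.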





□ e-DRW: An Entropy-based Directed Random Walk for Pathway Activity Inference Using Topological Importance and Gene Interactions

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467449v1.full.pdf

the entropy-based Directed Random Walk (e-DRW) method quantifies pathway activity using both gene interactions and information indicators based on probability theory.

Moreover, the expression values of the member genes are inferred based on t-test statistic scores and correlation coefficient values, whereas the entropy weight method (EWM) calculates the activity score of each pathway.

The merged directed pathway network utilises e-DRW to evaluate the topological importance of each gene. An equation was proposed to assess the connectivity of nodes in the directed graph via probability values calculated from the Shannon entropy formula.
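The Shannon entropy formula referred to above, as a one-liner (the generic formula, not e-DRW's specific node-weighting scheme):

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i), over outcomes with nonzero probability."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

For a node whose outgoing transition probabilities are uniform, entropy is maximal; highly skewed probabilities give entropy near zero.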





□ EntropyHub: An open-source toolkit for entropic time series analysis

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0259448

Numerous variants have since been derived from conditional entropy, and to a lesser extent Shannon’s entropy, to estimate the information content of time series data across various scientific domains, resulting in what has recently been termed “the entropy universe”.

EntropyHub (Ver. 0.1) provides an extensive range of more than forty functions for estimating cross-, multiscale, multiscale cross-, and bidimensional entropy, each including a number of keyword arguments that allow the user to specify multiple parameters in the entropy calculation.





□ QT-GILD: Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data

>> https://www.biorxiv.org/content/10.1101/2021.11.03.467204v1.full.pdf

QT-GILD is an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing (NLP), which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly.

QT-GILD obviates the need for a reference tree as well as accounts for gene tree estimation error. QT-GILD is a general-purpose approach which requires no explicit modeling of the reasons for gene tree heterogeneity or missing data, making it less vulnerable to model misspecification.

QT-GILD measures the divergence between true quartet distributions and different sets of quartet distributions in estimated gene trees (e.g., complete, incomplete and imputed) in terms of the number of “dominant” quartets that differ between two quartet distributions.

QT-GILD tries to learn the overall quartet distribution, guided by a self-supervised feedback loop, and to correct for gene tree estimation error. Investigating its application beyond incomplete gene trees, in order to improve estimated gene tree distributions, would be an interesting direction to take.





□ EFMlrs: a Python package for elementary flux mode enumeration via lexicographic reverse search

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04417-9

Recently, Avis et al. developed mplrs—a parallel version of the lexicographic reverse search (lrs) algorithm, which, in principle, enables an EFM analysis on high-performance computing environments.

EFMlrs uses COBRApy to process metabolic models from SBML files and performs loss-free compressions of the stoichiometric matrix. The enumeration of EFMs/EFVs in metabolic networks is a vertex enumeration problem in convex polyhedra.





□ Hofmann-Mislove theorem for approach spaces

>> https://arxiv.org/pdf/2111.02665v1.pdf

The Hofmann-Mislove theorem says that compact saturated sets of a sober topological space correspond bijectively to open filters of its open set lattice. This work concerns an analogue of this result for approach spaces.

It is shown that for a sober approach space, the inhabited and saturated compact functions correspond bijectively to the proper open [0,∞]-filters of the metric space of its upper regular functions, which is an analogue of the Hofmann-Mislove theorem for approach spaces.





□ Classification of pre-Jordan Algebras and Rota-Baxter Operators on Jordan Algebras in Low Dimensions

>> https://arxiv.org/pdf/2111.02035v1.pdf

The equations involving structural constants of Jordan algebras are "cubic". It is difficult to give all solutions of these equations, as well as the corresponding classification up to isomorphism, even more so for pre-Jordan algebras, since they involve two identities.

The paper classifies complex pre-Jordan algebras and gives Rota-Baxter operators (of weight zero) on complex Jordan algebras in dimensions ≤ 3.





□ Oriented and unitary equivariant bordism of surfaces

>> https://arxiv.org/pdf/2111.02693v1.pdf

an alternative proof of the fact that surfaces with free actions (of groups of odd order in the oriented case) which induce non-trivial elements in the Bogomolov multiplier of the group cannot equivariantly bound.

Surfaces without 0-dimensional fixed points: a subgroup of Ω^G_2 is considered, generated by manifolds without isolated fixed points, and whose underlying Euler characteristic is zero in the unitary case.





□ Hausdorff dimension of sets with restricted, slowly growing partial quotients

>> https://arxiv.org/pdf/2111.02694v1.pdf

the set of irrational numbers in (0, 1) whose partial quotients aₙ tend to infinity is of Hausdorff dimension 1/2. A precise asymptotics of the Hausdorff dimension of this set as q → ∞ is obtained using the thermodynamic formalism.

for an arbitrary B and an arbitrary f with values in [min B, ∞) and tending to infinity, the set of irrational numbers in (0, 1) such that aₙ ∈ B, aₙ ≤ f(n) for all n ∈ ℕ, and aₙ → ∞ as n → ∞ is of Hausdorff dimension τ(B)/2, where τ(B) is the exponent of convergence of B.

Constructing a sequence of Bernoulli measures with non-uniform weights, supported on finitely many 1-cylinders indexed by elements of B and having dimensions (the Kolmogorov-Sinai entropy divided by the Lyapunov exponent) not much smaller than τ(B)/2.





□ HATTUSHA: Multiplex Embedding of Biological Networks Using Topological Similarity of Different Layers

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467392v1.full.pdf

HATTUSHA formulates an optimization problem that accounts for intra-network smoothness, inter-network smoothness, and topological similarity of networks to compute diffusion states for each network using the Gromov-Wasserstein discrepancy.

HATTUSHA integrates the resulting diffusion states and applies dimensionality reduction (singular value decomposition after log-transformation) to compute node embeddings.
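A sketch of the generic "log-transform then SVD" embedding step; the diffusion-state matrix and its construction are HATTUSHA-specific and simply assumed given here:

```python
import numpy as np

def embed(diffusion_states, dim=2, eps=1e-12):
    """Log-transform a nonnegative diffusion-state matrix (nodes x states),
    center it, and take a truncated SVD to obtain node embeddings."""
    logx = np.log(diffusion_states + eps)          # eps avoids log(0)
    centered = logx - logx.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :dim] * s[:dim]                    # (nodes x dim) embedding
```

Each row of the result is a low-dimensional representation of one node, ready for downstream clustering or prediction.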





□ IEPWRMkmer: An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.766496/full

IEPWRMkmer uses the Shannon entropy of the feature matrix to determine the optimal value of K, obtaining an N×4^K feature matrix for a dataset with N genomes. The optimal K is the value at which score(K) reaches its maximum.

IEPWRMkmer, an Information-Entropy Position-Weighted K-Mer Relative Measure: a new alignment-free method which combines the position-weighted measure of k-mers and the information entropy of k-mer frequencies to obtain phylogenetic information for sequence comparison.
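A minimal sketch of building one row of such an N×4^K feature matrix, i.e. the k-mer relative-frequency vector of a single sequence (position weighting and the entropy scoring are omitted):

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Return the 4**k-long relative-frequency vector of k-mers in `seq`
    (alphabet ACGT, in lexicographic order)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:       # skip windows with non-ACGT characters
            counts[kmer] += 1
            total += 1
    return [counts[km] / total for km in kmers] if total else [0.0] * len(kmers)
```

Stacking these vectors for N genomes gives the N×4^K matrix whose Shannon entropy drives the choice of K.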





□ Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8570649/

Nyströmformer – a model that exhibits favorable scalability as a function of sequence length. Nyströmformer is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity.

The Nyströmformer algorithm makes use of landmark (or Nyström) points to reconstruct the softmax matrix in self-attention, thereby avoiding computing the n × n softmax matrix. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens.
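A numpy sketch of the landmark idea under simplifying assumptions: single head, landmarks taken as segment means, and an exact pseudo-inverse rather than the paper's iterative approximation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nystrom_attention(Q, K, V, m=8):
    """Approximate softmax(Q K^T / sqrt(d)) V with m landmark rows,
    never materializing the full n x n softmax matrix at once."""
    n, d = Q.shape
    scale = np.sqrt(d)
    # Landmarks: means of contiguous segments of the query/key rows.
    idx = np.array_split(np.arange(n), m)
    Ql = np.stack([Q[i].mean(axis=0) for i in idx])
    Kl = np.stack([K[i].mean(axis=0) for i in idx])
    F = softmax(Q @ Kl.T / scale)            # n x m
    A = softmax(Ql @ Kl.T / scale)           # m x m
    B = softmax(Ql @ K.T / scale)            # m x n
    return F @ np.linalg.pinv(A) @ (B @ V)   # n x d_v, O(n m) memory
```

With m fixed and much smaller than n, cost grows linearly in sequence length; with m = n (one landmark per row) the approximation collapses to exact attention.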





□ AWGAN: A Powerful Batch Correction Model for scRNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467781v1.full.pdf

AWGAN, a new deep learning framework based on Wasserstein Generative Adversarial Network (WGAN) combined with an attention mechanism to reduce the differences among batches.

AWGAN can remove the batch effect in different datasets and preserve the biological variation. AWGAN adopts an adversarial training strategy to improve the ability of the two models and finally reach a Nash equilibrium.




□ Optimizing weighted gene co-expression network analysis with a multi-threaded calculation of the topological overlap matrix


>> https://www.degruyter.com/document/doi/10.1515/sagmb-2021-0025/html

The WGCNA R software package uses an Adjacency Matrix to store a network, then calculates the Topological Overlap Matrix (TOM), and then identifies the modules (sub-networks), where each module is assumed to be associated with a certain biological function.

the single-threaded algorithm for the TOM has been changed into a multi-threaded algorithm (keeping the default values of WGCNA). In the multi-threaded algorithm, Rcpp is used to make R call a C++ function, and the C++ code then uses OpenMP to calculate the TOM from the Adjacency Matrix.
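For reference, the standard (unsigned) TOM computation that the multi-threaded C++ code parallelizes, sketched here in numpy rather than C++:

```python
import numpy as np

def tom(adj):
    """Topological Overlap Matrix from a symmetric weighted adjacency
    matrix with zero diagonal:
        TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    where l_ij = sum_u a_iu * a_uj and k_i is node i's connectivity."""
    a = np.asarray(adj, dtype=float)
    l = a @ a                          # shared-neighbor term l_ij
    k = a.sum(axis=1)                  # connectivities
    denom = np.minimum.outer(k, k) + 1.0 - a
    t = (l + a) / denom
    np.fill_diagonal(t, 1.0)           # convention: full overlap with self
    return t
```

The matrix product `a @ a` is the hot spot that benefits from multi-threading in the WGCNA optimization described above.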





□ Nature Reviews Genetics RT

Functional genomics data: privacy risk assessment and technological mitigation

>> https://www.nature.com/articles/s41576-021-00428-7

>> https://twitter.com/naturerevgenet/status/1458732446560800769?s=21

This Perspective highlights privacy issues related to the sharing of functional genomics data, including genotype and phenotype information leakage from different functional genomics data types and their summarization steps.





□ DeepKG: An End-to-End Deep Learning-Based Workflow for Biomedical Knowledge Graph Extraction, Optimization and Applications

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab767/6425667

To improve the performance of DeepKG, a cascaded hybrid information extraction framework (CHIEF) is developed for training the 3-tuple extraction model, and a novel AutoML-based knowledge representation algorithm (AutoTransX) is proposed for knowledge representation and inference.

For link prediction in knowledge graph learning (KGL), the core problem is to learn the relations between 3-tuples, where a triplet includes the embedding vectors of two entities (head and tail) and one relation.

AutoTransX is a data-driven method to address this issue, which automatically combines several candidate operations of 3-tuples in traditional methods to represent the relations in biomedical KGL accurately.

CHIEF is a cascaded hybrid information extraction framework, which extracts relational 3-tuples as a whole and learns both entities and relations through a joint encoder. A fine-tuned deep bidirectional Transformer (BERT) is utilized to capture the contextual information.





□ scHiCSRS: A Self-Representation Smoothing Method with Gaussian Mixture Model for Imputing single-cell Hi-C Data

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467824v1.full.pdf

scHiCSRS, a self-representation smoothing method that improves the data quality, and a Gaussian mixture model that identifies structural zeros among observed zeros.

scHiCSRS takes spatial dependencies of the scHi-C 2D data structure into consideration while also borrowing information from similar single cells. scHiCSRS was motivated by scTSSR, which recovers scRNA data using a two-sided sparse self-representation method.




□ From shallow to deep: exploiting feature-based classifiers for domain adaptation in semantic segmentation

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467925v1.full.pdf

a Convolutional Neural Network can be trained to correct the errors of the Random Forest in the source domain and then be applied to correct such errors in the target domain without retraining, as the domain shift between the RF predictions is much smaller than between the raw data.

This method can be classified as source-free domain adaptation, but the additional feature-based learning step allows us to avoid training-set estimation or reconstruction.

A new Random Forest is trained from a few brushstroke labels, and the pre-trained Prediction Enhancer (PE) network is simply applied to improve the probability maps.





□ MEP: Improving Neural Networks for Genotype-Phenotype Prediction Using Published Summary Statistics

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467937v1.full.pdf

main effect prior (MEP), a new regularization method for making use of GWAS summary statistics from external datasets. The main effect prior is generally applicable to machine learning algorithms such as neural networks and linear regression.

a tractable solution by accessing the summary statistics from another large study. Since the main effects of SNPs have already been captured by GWAS summary statistics on the large external dataset in MEP(external), using MEP is especially beneficial for high-dimensional data.





□ Combining dictionary- and rule-based approximate entity linking with tuned BioBERT

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467905v1.full.pdf

a two-stage approach: fine-tuned BioBERT is used for identification of chemical entities, followed by semantic approximate search in the MeSH and PubChem databases for entity linking.

This mainly affects new entities that are not part of the base vocabulary of BERT's WordPiece tokenizer, resulting in their being split into multiple sub-tokens.




□ REDigest: a Python GUI for In-Silico Restriction Digestion Analysis of Genes or Complete Genome Sequences

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467873v1.full.pdf

REDigest is a fast, user-interactive and customizable software program which can perform in-silico restriction digestion analysis on a multifasta gene or a complete genome sequence file.

REDigest can process Fasta and Genbank format files as input and can write output files with sequence information in Fasta or Genbank format. It also validates the restriction fragment or terminal restriction fragment size and taxonomy against a database.




□ A2Sign: Agnostic algorithms for signatures — a universal method for identifying molecular signatures from transcriptomic datasets prior to cell-type deconvolution

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab773/6426077

A2Sign is a global framework that can be applied to uncover molecular signatures for cell type deconvolution in arbitrary tissues using bulk transcriptome data.

A2Sign: Agnostic Algorithms for Signatures, based on a non-negative tensor factorization strategy that allows us to identify cell type-specific molecular signatures, greatly reduce collinearities, and also account for inter-individual variability.





□ Scalable inference of transcriptional kinetic parameters from MS2 time series data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab765/6426074

scalable implementation of the cpHMM for fast inference of promoter activity and transcriptional kinetic parameters. This new method can model genes of arbitrary length through the use of a time-adaptive truncated compound state space.

The truncated state space provides a good approximation to the full state space by retaining the most likely set of states at each time during the forward pass of the algorithm.




□ bollito: a flexible pipeline for comprehensive single-cell RNA-seq analyses

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab758/6426066

bollito is an automated, flexible and parallelizable computational pipeline for the comprehensive analysis of single-cell RNA-seq data. bollito performs both basic and advanced tasks in single-cell analysis integrating over 30 state-of-the-art tools.

bollito is built using the Snakemake workflow management system and includes quality control, read alignment, dimensionality reduction, clustering, cell-marker detection, differential expression, functional analysis, trajectory inference and RNA velocity.




□ MatrixQCvis: shiny-based interactive data quality exploration for omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab748/6426067

In high-throughput quantitative omics experiments, after initial processing, the data are typically presented as a matrix of numbers (feature IDs × samples).

Efficient and standardized data-quality metrics calculation and visualization are key to track the within-experiment quality of these rectangular data types and to guarantee for high-quality data sets and subsequent biological question-driven inference.

MatrixQCvis, which provides interactive visualization of data quality metrics at the per-sample and per-feature level using R’s shiny framework. It provides efficient and standardized ways to analyze data quality of quantitative omics data types that come in a matrix-like format.








‘til we meet again.

2021-11-11 23:12:13 | Science News




□ EVE: Disease variant prediction with deep generative models of evolutionary data

>> https://www.nature.com/articles/s41586-021-04043-8

EVE (evolutionary model of variant effect) provides, for any single amino acid mutation of interest, a score reflecting the propensity of the resulting protein to be pathogenic.

a Bayesian VAE learns a distribution over amino acid sequences from evolutionary data. It enables the computation of an evolutionary index for each mutant, which approximates the log-likelihood ratio of the mutant vs the wild type.

A global-local mixture of Gaussian Mixture Models separates variants into benign and pathogenic clusters based on that index. The EVE scores reflect probabilistic assignments to the pathogenic cluster.





□ scPSD: Disentangling single-cell omics representation with a power spectral density-based feature extraction

>> https://www.biorxiv.org/content/10.1101/2021.10.25.465657v1.full.pdf

scPSD, an innovative unified strategy for single-cell omics data transformation that is inspired by power spectral density analysis to intensify discriminatory information from single-cell genomic features.

Entropy estimation improves the extraction of important information from Fourier-transformed data. A vector of genomic features is treated as a 'signal'. The scPSD transformation is expected to be applicable to other omics modalities as well as bulk sequencing data.
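A minimal periodogram sketch of the underlying PSD idea, treating a feature vector as a signal (scPSD's actual pipeline adds cross-PSD and entropy steps not shown here):

```python
import numpy as np

def power_spectral_density(x):
    """Periodogram estimate of the PSD: squared magnitudes of the
    real FFT of the mean-removed signal, normalized by its length."""
    x = np.asarray(x, dtype=float)
    spec = np.fft.rfft(x - x.mean())
    return (np.abs(spec) ** 2) / len(x)

# A pure sine's PSD peaks at the sine's frequency bin.
t = np.arange(256)
psd = power_spectral_density(np.sin(2 * np.pi * 10 * t / 256))
```

In the single-cell setting, `x` would be a cell's vector of genomic features rather than a time series.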





□ DeepMAPS: Biological network inference from single-cell multi-omics data using heterogeneous graph transformer

>> https://www.biorxiv.org/content/10.1101/2021.10.31.466658v1.full.pdf

DeepMAPS formulates high-level representations of relations among cells and genes in a heterogeneous graph, with cells and genes as the two disjoint node sets in this graph.

DeepMAPS is an end-to-end framework. Projecting the features of genes and cells into the same latent space is an effective way to harmonize the imbalance of different batches and lays a solid foundation for cell clustering and the prediction of cell-gene and gene-gene relations.





□ adabmDCA: adaptive Boltzmann machine learning for biological sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04441-9

adaptive Boltzmann machine learning to infer several maximum-entropy statistical models of Potts or Ising variables given a set of observables. It infers the couplings and the fields of a set of generalized Direct Coupling Analysis (DCA) models given a Multiple Sequence Alignment.

adabmDCA encompasses the possibility of adapting the Monte Carlo Markov Chain sampling ensuring an equilibrium training. When the decorrelation time of the Monte Carlo chains appears to be large, the learning at equilibrium is intractable.





□ GFAE: A Graph Feature Auto-Encoder for the prediction of unobserved node features on biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04447-3

Graph Feature Auto-Encoder (GFAE) for the prediction of expression values utilizing gene network structures. FeatGraphConv uses message passing neural networks (MPNNs), tailored to reconstructing the representation of the node features rather than the graph structure.

The FeatGraphConv convolution layer is able to predict missing features more accurately than all other methods. Graph convolution layers, with the exception of GCN, outperform MAGIC on the single-cell RNA-seq imputation task, as does the MLP, which does not use graph information.





□ Democratizing long-read genome assembly

>> https://www.cell.com/cell-systems/pdf/S2405-4712(21)00378-1.pdf

Minimizer-space de Bruijn graph (mdBG) assembler can assemble genomes 100-fold faster than previous methods, including a human genome in under 10 min, which unlocks pan-genomics for many species.

The minimizer-space Partial Order Alignment (POA) algorithm corrects sequencing errors in minimizers by computing a consensus from a multiple sequence alignment of the minimizers found in overlapping reads.
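For intuition, a minimal sketch of window minimizers, the objects mdBG assembles in place of individual bases (using lexicographic order for simplicity, where real implementations typically order k-mers by a random hash):

```python
def minimizers(seq, k, w):
    """Window minimizers: for each window of w consecutive k-mers,
    keep the smallest one (consecutive duplicates collapsed)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    out = []
    for i in range(len(kmers) - w + 1):
        m = min(kmers[i:i + w])
        if not out or m != out[-1]:
            out.append(m)
    return out
```

Because only a sparse sample of k-mers survives, the de Bruijn graph built over minimizer sequences is far smaller than one built over all k-mers, which is where the 100-fold speedup comes from.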





□ phasebook: haplotype-aware de novo assembly of diploid genomes from long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02512-x

phasebook reconstructs the haplotypes of diploid genomes from long reads. phasebook outperforms other approaches in terms of haplotype coverage by large margins, in addition to achieving competitive performance in terms of assembly errors and assembly contiguity.

phasebook constructs a haplotype-aware super read overlap graph to extend super reads into haplotype-aware contigs. phasebook-hi generally incurs higher switch error rates; its modified protocol is favorable on diploid genomes that are relatively variant-sparse.





□ Artificial intelligence reveals nuclear pore complexity

>> https://www.biorxiv.org/content/10.1101/2021.10.26.465776v1.full.pdf

a near-complete structural model of the human NPC scaffold with explicit membrane and in multiple conformational states.

Combining AI-based structure prediction with in situ and in cellulo cryo-electron tomography and integrative modeling. Linker Nups spatially organize the scaffold within and across subcomplexes to establish the higher-order structure.

Microsecond-long molecular dynamics simulations suggest that the scaffold is not required to stabilize the inner and outer nuclear membrane fusion, but rather widens the central pore.





□ DeepMP: a deep learning tool to detect DNA base modifications on Nanopore sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab745/6413628

DeepMP introduces a threshold-free position modification calling model sensitive to sites methylated at low frequency across cells. DeepMP includes a further innovation, a supervised Bayesian model to call position-based methylation, which is, to our knowledge, unique.

DeepMP takes as input two types of information from Nanopore sequencing data, basecalling errors and raw current signals. Features from these two types of information are fed into a CNN-based module. DeepMP significantly outperforms DeepSignal, Megalodon, and Nanopolish.





□ Codetta: A computational screen for alternative genetic codes in over 250,000 genomes

>> https://www.biorxiv.org/content/10.1101/2021.06.18.448887v1.full.pdf

Codetta, a computational method that takes DNA or RNA sequences from a single organism and predicts an amino acid translation for each of the 64 codons. Codetta aggregates over the set of aligned profile positions to infer the single most likely amino acid decoding of the codon.

Codetta can correctly infer canonical and non-canonical codon translations and can flag unusual situations such as ambiguous translation even though it assumes unambiguous translation.

Codetta extends the idea to systematic high-throughput analysis by using a probabilistic modeling approach to infer codon decodings, and by taking advantage of the large collection of probabilistic profiles of conserved profile HMMs in the Pfam database.





□ IRFinder-S: a comprehensive suite to discover and explore intron retention

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02515-8

IRFinder-S identifies the true intron retention events using a convolutional neural network, allows the sharing of intron retention results, integrates a dynamic database to explore samples, and provides a tested method to detect differential levels of intron retention.

In order to adapt the IRratio computation to long reads, the estimation of intron and exon abundance was adapted, keeping the formula unchanged:

IRratio = intronic abundance / (intronic abundance + exonic abundance)





□ RUV-III-NB: Normalization of single cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.11.06.467575v1.full.pdf

RUV-III-NB uses the concept of pseudo-replicates to ensure that relevant features of the unwanted variation are only inferred from cells with the same biology, and returns adjusted sequencing counts as output.

RUV-III-NB manages to remove library size and batch effects, strengthen biological signals, improve differential expression analyses, and lead to results exhibiting greater concordance with independent datasets of the same kind.





□ SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467510v1.full.pdf

SECEDO is able to cluster cells and perform variant calling based on information obtained from single-cell DNA sequencing.

SECEDO takes as input BAM files containing the aligned data for each cell and provides as output a clustering of the cells and, optionally, VCF files pinpointing the changes relative to a reference genome.

SECEDO builds a cell-to-cell similarity matrix based only on read-pairs containing the filtered loci, using a probabilistic model that takes into account SNV frequencies and the structure of the reads, i.e. that a whole read is sampled from the same haplotype.
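A toy illustration of building a cell-to-cell similarity matrix from shared filtered loci — a crude agreement score, not SECEDO's actual likelihood model, which additionally accounts for sequencing error and read structure:

```python
import numpy as np

# Toy per-cell genotype observations at 6 filtered loci:
# rows = cells, columns = loci, values = observed base (0..3) or -1 for no coverage.
obs = np.array([
    [0, 1, -1, 2,  2, 0],
    [0, 1,  3, 2,  2, 0],
    [1, 0,  3, 1, -1, 2],
])

def similarity(a, b):
    """Fraction of co-covered loci where the two cells agree.

    A crude stand-in for SECEDO's probabilistic score, for illustration only.
    """
    covered = (a >= 0) & (b >= 0)
    if not covered.any():
        return 0.0
    return float((a[covered] == b[covered]).mean())

n = len(obs)
sim = np.array([[similarity(obs[i], obs[j]) for j in range(n)] for i in range(n)])
print(sim[0, 1], sim[0, 2])  # 1.0 0.0: cells 0 and 1 agree at all co-covered loci
```

Clustering (e.g. spectral clustering) would then be run on this similarity matrix to group cells into putative subclones.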





□ BAMboozle removes genetic variation from human sequence data for open data sharing

>> https://www.nature.com/articles/s41467-021-26152-8

Re-analyses of published scRNA-seq data also benefit from having access to raw sequence data, although they do not necessarily need genetic variant information.

BAMboozle, a versatile and efficient program that reverts aligned read sequences (in Binary Sequencing Alignment Map (BAM) format) to the reference genome to efficiently eliminate the genetic variant information in raw sequence data.





□ isoCNV: in silico optimization of copy number variant detection from targeted or exome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04452-6

To maximize the performance, the parameters of the CNV calling algorithms should be optimized for each specific dataset. This requires obtaining validated CNV information using either multiplex ligation-dependent probe amplification or array comparative genomic hybridization.

isoCNV optimizes the parameters of the DECoN algorithm using only NGS data. The parameter optimization process is performed using an in silico CNV-validated dataset obtained from the overlapping calls of three algorithms: CNVkit, panelcn.MOPS and DECoN.





□ DeepVariant-AF: Improving variant calling using population data and deep learning

>> https://www.biorxiv.org/content/10.1101/2021.01.06.425550v2.full.pdf

The population-aware DeepVariant (DeepVariant-AF) model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide.

DeepVariant-AF has a slightly lower recall, but the difference is marginal. The recall of zero-frequency variants using all variant callers is substantially lower than the recall of all variants, but it can be strongly improved using PacBio HiFi reads.




□ scSPLAT: a scalable plate-based protocol for single cell WGBS library preparation

>> https://www.biorxiv.org/content/10.1101/2021.10.14.464375v1.full.pdf

Splinted Ligation Adapter Tagging (scSPLAT) employs a pooling strategy to facilitate sample preparation at a higher scale and throughput than previously possible.

scSPLAT adapter tagging is performed using splint ligation, and carryover of free nucleotides poses no risk of introducing artificial sequences.





□ RgCop-A regularized copula based method for gene selection in single cell rna-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009464

RgCop utilizes copula correlation (Ccor), a robust equitable dependence measure that captures multivariate dependency among a set of genes in single-cell expression data. RgCop introduces a stable feature/gene selection procedure, which is evaluated by applying it to noisy data.

By virtue of the important scale invariant property of copula, the selected features are invariant under any transformation of data due to the most common technical noise present in the scRNA-seq experiment.





□ MACA: Marker-based automatic cell-type annotation for single cell expression data

>> https://www.biorxiv.org/content/10.1101/2021.10.25.465734v1.full.pdf

MACA calculates two cell-type labels for each cell based on an individual cell expression profile and a collective clustering profile. From these, a final cell-type label is generated according to a normalized confusion matrix.

MACA generates Label 1 for each cell by identifying the cell-type associated with the highest score. Using the matrix of cell-type scores as input, the Louvain community detection algorithm is applied to generate Label 2, which is a clustering label to which a cell belongs.
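A sketch of the final labeling step — mapping each cluster (Label 2) to its dominant per-cell type (Label 1) via a normalized confusion matrix; the exact normalization MACA uses may differ:

```python
import numpy as np

# Label 1: per-cell type call (argmax of cell-type scores).
# Label 2: Louvain cluster membership. Illustrative toy data.
label1 = np.array([0, 0, 1, 1, 1, 2])
label2 = np.array([0, 0, 0, 1, 1, 1])

n_types = label1.max() + 1
n_clusters = label2.max() + 1
confusion = np.zeros((n_types, n_clusters))
for t, c in zip(label1, label2):
    confusion[t, c] += 1
confusion /= confusion.sum(axis=0, keepdims=True)  # normalize within each cluster

cluster_to_type = confusion.argmax(axis=0)  # dominant cell type per cluster
final = cluster_to_type[label2]             # final label for each cell
print(final.tolist())  # [0, 0, 0, 1, 1, 1]
```

Cluster 0 is dominated by type 0 and cluster 1 by type 1, so every cell inherits its cluster's dominant type.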





□ Robust enhancer-gene regulation identified by single-cell transcriptomes and epigenomes

>> https://www.biorxiv.org/content/10.1101/2021.10.25.465795v1.full.pdf

Identifying high-confidence, robust enhancer-gene links using a non-parametric permutation-based procedure to control for gene co-expression, and validate the predicted links with multimodal 3D chromatin conformation (snm3C-seq) data.

True causal interactions cannot be inferred from correlational analysis alone. By bringing together multiple data modalities to define robust enhancer-gene links, these analyses can reveal the regulatory principles of cell-type-specific gene expression.





□ TAPE: Deep autoencoder enables interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis

>> https://www.biorxiv.org/content/10.1101/2021.10.26.465846v1.full.pdf

A key advantage of TAPE is its constant running time when deconvolving a large number of samples. Running on a popular graphics card, TAPE is much faster than traditional statistical methods and 3 times faster than the previous deep-learning method.

TAPE benefits from the architecture of autoencoder and the unique training method in the adaptive stage. TAPE takes all the RNA-seq data at one time as input and outputs one signature matrix adapted to all samples.





□ Illumina But With Nanopore: Sequencing Illumina libraries at high accuracy on the ONT MinION using R2C2

>> https://www.biorxiv.org/content/10.1101/2021.10.30.466545v1.full.pdf

a simple workflow that converts almost any Illumina sequencing library into DNA of lengths optimal for the ONT MinION and generates data at similar cost and accuracy to the Illumina MiSeq using R2C2.

R2C2 circularizes dsDNA libraries, amplifies those circles using rolling circle amplification to create long molecules with multiple tandem repeats of the original molecule’s sequence.

PLNK (Processing Live Nanopore Experiments) takes advantage of the real-time data generation of the ONT MinION. PLNK processes raw data and generates immediate feedback on library composition and what percentage of reads fall within defined regions in the genome.

>> https://github.com/kschimke/PLNK

PLNK runs alongside an Oxford Nanopore MinION sequencer, processing individual fast5 files using guppy for basecalling, C3POa for R2C2 consensus calling, and mappy for alignment before analyzing the library content.





□ DensityMorph: Comparing single cell datasets

>> https://www.biorxiv.org/content/10.1101/2021.10.28.466371v1.full.pdf

In summary, a cell-population centric analysis has the potential to hide nuanced shifts in expression.

DensityMorph, a novel approximation that compares point clouds via nearest-neighbor (NN) and cross-NN distances. The DensityMorph algorithm can be used for characterising a set of N single-cell samples by calculating an N × N distance matrix and taking the square root of the matrix entries.
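The NN / cross-NN idea can be illustrated with a simplified score (not the published DensityMorph estimator): two samples from the same distribution have cross-NN distances close to their within-sample NN distances, while a shifted sample does not:

```python
import numpy as np

def mean_nn_dist(src, dst, exclude_self=False):
    """Mean distance from each point in `src` to its nearest neighbor in `dst`."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=2)
    if exclude_self:
        np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return d.min(axis=1).mean()

def cloud_divergence(a, b):
    """Cross-cloud NN distances minus within-cloud NN distances -- a simplified
    stand-in for DensityMorph's NN / cross-NN comparison, for illustration only.
    """
    within = mean_nn_dist(a, a, exclude_self=True) + mean_nn_dist(b, b, exclude_self=True)
    cross = mean_nn_dist(a, b) + mean_nn_dist(b, a)
    return cross - within

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(0.0, 1.0, size=(200, 2))  # same distribution as x
z = rng.normal(5.0, 1.0, size=(200, 2))  # shifted distribution
print(cloud_divergence(x, y) < cloud_divergence(x, z))  # True
```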





□ SVDNVLDA: predicting lncRNA-disease associations by Singular Value Decomposition and node2vec

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04457-1

In SVDNVLDA, the linear feature representations of lncRNAs and diseases containing their linear interaction information were obtained by Singular Value Decomposition (SVD); And the nonlinear features containing network topology information were obtained by node2vec.

SVDNVLDA can be adapted to a range of data sets and possesses strong robustness. The integrated feature vectors of the aforementioned features were input into a ML classifier, which transformed the lncRNA-disease association prediction into a binary classification problem.





□ Genotyping of structural variation using PacBio high-fidelity sequencing

>> https://www.biorxiv.org/content/10.1101/2021.10.28.466362v1.full.pdf

A schematic workflow with wide availability for evaluating SV detection algorithms in terms of precision and recall. The performance of SV detection varied depending on the long-read aligner rather than the SV caller.

The caller cuteSV or SVIM after pbmm2 (for deletion) or NGMLR (for insertion) alignment is recommended as benchmark SV software, regardless of ploidy level.





□ Quality-controlled R-loop meta-analysis reveals the characteristics of R-Loop consensus regions

>> https://www.biorxiv.org/content/10.1101/2021.11.01.466823v1.full.pdf

R-loop forming sequences were computationally predicted using the QmRLFS-finder.py Python program, implemented as part of makeRLFSBeds.

All available R-loop mapping datasets were reprocessed using a long-running computational pipeline that is available in its entirety, with detailed instructions, in the accompanying data generation repository.

The analysis proceeds to define consensus R-loop sites called "R-loop regions" (RL regions), revealing the stark divergence between S9.6- and dRNH-based R-loop mapping methods and identifying biologically meaningful subtypes of both constitutive and variable R-loops.





□ RLBase: Exploration and analysis of R-loop mapping data

>> https://www.biorxiv.org/content/10.1101/2021.11.01.466854v1.full.pdf

R-loop regions (RL regions) are consensus sites of R-loop formation discovered from the meta-analysis of high-confidence R-loop mapping.

RLBase, an innovative web server which builds upon those data and software, providing users with the capability to explore hundreds of public R-loop mapping datasets, explore consensus R-loop regions, and download all the reprocessed data for the 693 samples.

RLBase is a core component of RLSuite, an R-loop analysis software toolchain. RLSuite also includes RLPipes (a CLI pipeline for upstream R-loop data processing), RLSeq (for downstream R-loop data analysis), and RLHub (an interface to the RLBase datastore).





□ SPCS: A Spatial and Pattern Combined Smoothing Method of Spatial Transcriptomic Expression

>> https://www.biorxiv.org/content/10.1101/2021.11.02.467030v1.full.pdf

Spatial and Pattern Combined Smoothing (SPCS) is a novel two-factor smoothing technique that employs the k-nearest neighbor technique to utilize associations from both the transcriptome and Euclidean space in Spatial Transcriptomic (ST) data.

SPCS recovers drop-out events and enhances the expression of marker genes in the corresponding regions. SPCS smoothing is a state-of-the-art ST smoothing algorithm with implications for numerous diseases where ST data are being generated.





□ NIQKI: Toward optimal fingerprint indexing for large scale genomics

>> https://www.biorxiv.org/content/10.1101/2021.11.04.467355v1.full.pdf

NIQKI can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a matter of days on a small cluster.

NIQKI generalizes the concept of Hyperminhash to take into account different sizes of Hyperloglog and Minhash fingerprints dubbed (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied.

Querying the NIQKI structure is O(#hits), compared to O(S·N) for the state of the art. This structure comes with a memory cost, as the index uses O(S(N log N + 2W)) bits instead of O(S·N·W).





□ SetSketch: Filling the Gap between MinHash and HyperLogLog

>> https://arxiv.org/pdf/2101.00314.pdf

While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities.

SetSketch, a new data structure that is able to continuously fill the gap between both use cases. The presented estimators for cardinality and joint quantities do not require empirical calibration, and can be applied to other structures such as MinHash, HyperLogLog, or HyperMinHash.
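For context, the MinHash side of this spectrum can be sketched in a few lines — the fraction of matching per-hash minima estimates the Jaccard similarity (a textbook MinHash, not SetSketch itself):

```python
import random

def minhash_signature(items, num_hashes=128, seed=7):
    """num_hashes independent min-hash values for a set -- textbook MinHash,
    shown to illustrate the set-comparison use case SetSketch generalizes."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large prime for the hash family h(x) = (a*hash(x) + b) mod p
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching minima estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 100))
b = set(range(50, 150))  # true Jaccard = 50 / 150 = 1/3
print(jaccard_estimate(minhash_signature(a), minhash_signature(a)))  # 1.0
est = jaccard_estimate(minhash_signature(a), minhash_signature(b))   # close to 1/3
```

HyperLogLog, by contrast, keeps only the maximum number of leading zero bits per register, which is why it supports cardinality but not direct Jaccard estimation — the gap SetSketch fills.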





□ Sketching and sampling approaches for fast and accurate long read classification

>> https://www.biorxiv.org/content/10.1101/2021.11.04.467374v1.full.pdf

a chosen sampling or sketching algorithm is used to generate a reduced representation (a “screen”) of potential source genomes for a query readset before reads are streamed in and compared against this screen.

Using a query read’s similarity to the elements of the screen, the methods predict the source of the read.

The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach.

Alignment-based approaches are slightly better suited to handling these, as they do direct comparisons of the reads against the source genomes, with k-mer indexes and sketching-based methods struggling to narrow down the exact source between several similar sequences.





□ Syotti: Scalable Bait Design for DNA Enrichment

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467426v1.full.pdf

The Minimum Bait Cover Problem is NP-hard even for extremely restrictive versions of the problem. the problem remains intractable even for an alphabet of size four (A, T, C, G), a bait length that is logarithmic in the length of the reference genome, and Hamming distance of zero.

Unless P = NP, no polynomial-time exact algorithm exists for the problem, and the problem is intractable even for small and deceptively simple inputs. Syotti is an efficient heuristic that takes advantage of succinct data structures.

Syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered.
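The flavor of a greedy heuristic for bait cover can be sketched for the exact-match (Hamming distance 0) case; Syotti itself handles mismatches and relies on succinct data structures:

```python
def greedy_bait_cover(sequence, bait_length):
    """Greedy left-to-right bait placement for exact (Hamming distance 0)
    coverage: repeatedly place a bait at the leftmost uncovered position,
    clamping the final bait so it fits inside the sequence.
    An illustrative toy, not Syotti's actual algorithm.
    """
    baits = []
    pos, n = 0, len(sequence)
    while pos < n:
        start = max(0, min(pos, n - bait_length))
        baits.append(sequence[start:start + bait_length])
        pos = start + bait_length
    return baits

print(greedy_bait_cover("ACGTACGTAC", 4))  # ['ACGT', 'ACGT', 'GTAC']
```

Note the last bait is shifted left rather than truncated, so every position is covered with fixed-length baits.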




□ PacRAT: A program to improve barcode-variant mapping from PacBio long reads using multiple sequence alignment

>> https://www.biorxiv.org/content/10.1101/2021.11.06.467314v1.full.pdf

PacRAT, a PacBio Read Alignment Tool, maximizes the number of usable reads while reducing the sequencing errors of CCS reads.

PacRAT improves the accuracy in pairing barcodes and variants across these libraries. Analysis of real (non-simulated) libraries also showed an increase in the number of reads that can be used for downstream analyses when using PacRAT.







How long it takes…

2021-11-11 23:11:11 | Science News




□ Chronumental: time tree estimation from very large phylogenies

>> https://www.biorxiv.org/content/10.1101/2021.10.27.465994v1.full.pdf

Chronumental uses stochastic gradient descent to identify lengths of time for tree branches which maximise the evidence lower bound under a probabilistic model, implemented in a framework which can be compiled into XLA for rapid computation.

Representing the summation of branch lengths to estimate node dates as a notional matrix multiplication, by constructing a vast matrix in which one dimension represents the leaf nodes and the other the internal branches, with a 1 at each element where the branch lies on the path to that leaf.

When this matrix is multiplied by a vector of time-lengths for each branch, it yields the date corresponding to each leaf node. Such a matrix would contain over 10^12 elements, dwarfing available resources, but since almost all elements are 0s, it can be represented as a "sparse matrix", encoded in coordinate list (COO) format, with the matrix multiplication performed through 'take' and 'segment_sum' XLA operations.

Representing the operations in this way allows them to be efficiently compiled in XLA, which creates a differentiable graph of arithmetic operations.
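The sparse 'take' + 'segment_sum' trick can be mirrored with NumPy analogues (fancy indexing as 'take', np.add.at as a segment sum); the toy tree and indices below are illustrative:

```python
import numpy as np

# Sparse (COO) leaf-date computation: for each leaf, sum the time-lengths of
# the branches on its root-to-leaf path. `rows` are leaf indices, `cols` are
# branch indices -- together they encode the 0/1 design matrix without ever
# materializing it. Toy tree: leaf 0 uses branches {0, 1}, leaf 1 uses {0, 2}.
rows = np.array([0, 0, 1, 1])          # leaf index of each nonzero entry
cols = np.array([0, 1, 0, 2])          # branch index of each nonzero entry
branch_times = np.array([2.0, 3.0, 1.5])

gathered = branch_times[cols]          # the 'take' step
dates = np.zeros(2)
np.add.at(dates, rows, gathered)       # the 'segment_sum' step
print(dates.tolist())  # [5.0, 3.5]
```

The dense matrix is never built: memory scales with the number of nonzeros (one per leaf-branch incidence) rather than with leaves × branches.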

Chronumental scales to phylogenies featuring millions of nodes, with chronological predictions made in minutes, and is able to accurately predict the dates of nodes for which it is not provided with metadata.





□ Stabilization of continuous-time Markov/semi-Markov jump linear systems via finite data-rate feedback

>> https://arxiv.org/pdf/2110.14931v1.pdf

the stabilization problem of the Markov jump linear systems (MJLSs) under the communication data-rate constraints, where the switching signal is a continuous-time Markov process. Sampling and quantization are used as fundamental tools to deal with the problem.

Sufficient conditions are given to ensure the almost sure exponential stabilization of the Markov jump linear systems. The conditions depend on the generator of the Markov process. The sampling times are also independent of the jump times.





□ Linear Approximate Pattern Matching Algorithm

>> https://www.biorxiv.org/content/10.1101/2021.10.25.465764v1.full.pdf

a structure that can be built in linear time and space and solves the approximate matching problem in O(m + (log_Σ n)^k / k! + occ) search cost, where m is the length of the pattern, n is the length of the reference, and k is the number of tolerated mismatches.

Building a novel index that indexes all suffixes under all internal nodes in the suffix tree in linear time, while maintaining the inter-connectivity among the suffixes under different internal nodes.

The non-linear time cost is due to the trivial process of checking whether each suffix under each internal node is already indexed in OT index. OSHR tree is constructed by reversing the suffix links in ST. Clearly, the space and time cost for building OSHR tree is linear (O(n)).





□ Counterfactuals in Branching Time: The Weakest Solution

>> https://arxiv.org/pdf/2110.11689v1.pdf

a formal analysis of temporally sensitive counterfactual conditionals, using the fusion of Ockhamist branching time temporal logic and minimal counterfactual logic P.

The main advantage of Ockhamist branching time theory in the context of counterfactuals is that it allows both expressions about time and historical possibility/necessity.

Atomic propositions and Boolean connectives have standard meaning. Gφ reads as "at every moment in the future, φ"; Hφ as "at every moment in the past, φ"; and □φ as "it is historically necessary that φ", meaning that φ holds at the present moment in all possible alternative histories.





□ Model-free inference of unseen attractors: Reconstructing phase space features from a single noisy trajectory using reservoir computing

>> https://arxiv.org/pdf/2108.04074.pdf

A reservoir computer is able to learn the various attractors of a multistable system. In separate autonomous operation, the trained reservoir is able to reproduce and therefore infer the existence and shape of these unseen attractors.

the ability to learn the dynamics of a complex system can be extended to systems with multiple co-existing attractors, here a 4-dimensional extension of the well-known Lorenz chaotic system.

The reservoir computers are learning the phase space flow without formulating any intermediate model. They use a continuous time version of an echo state network based on ordinary differential equations.





□ Beyond sequencing: machine learning algorithms extract biology hidden in Nanopore signal data

>> https://www.cell.com/trends/genetics/fulltext/S0168-9525(21)00257-2

Nanopore sequencing accuracy has increased to 98.3% as new-generation base callers replace early generation hidden Markov model basecalling algorithms with neural network algorithms.

Nanopore direct RNA sequencing profiles RNAs with their modification retained, which influences the ion current signals emitted from the nanopore.

Machine learning and statistical testing tools can detect DNA modifications by analyzing ion current signals from nanopore direct DNA sequencing.

Machine learning methods can classify sequences in real-time, allowing targeted sequencing with nanopore’s ReadUntil feature.





□ SpatialDE2: Fast and localized variance component analysis of spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.10.27.466045v1.full.pdf

SpatialDE2 implements two major modules, which together provide for an end-to-end workflow for analyzing spatial transcriptomics data: a tissue region segmentation module and a module for detecting spatially variable genes.

SpatialDE2 provides a coherent model for tissue segmentation. Briefly, the spatial tissue region segmentation module is based on a Bayesian hidden Markov random field, which segments tissues into distinct histological regions while explicitly accounting for spatial smoothness.





□ A Markov random field model for network-based differential expression analysis of single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04412-0

MRFscRNAseq is based on a Markov random field (MRF) model to appropriately accommodate gene network information as well as dependencies among cell types to identify cell-type specific DEGs.

With observed DE evidence, it utilizes a Markov random field model to appropriately take gene network information as well as dependencies among cell types into account.





□ SCA: Discovering Rare Cell Types through Information-based Dimensionality Reduction

>> https://www.biorxiv.org/content/10.1101/2021.01.19.427303v3.full.pdf

Shannon component analysis (SCA), a technique that leverages the information-theoretic notion of surprisal for dimensionality reduction. SCA's information-theoretic paradigm opens the door to more meaningful signal extraction.

In cytotoxic T-cell data, SCA cleanly separates the gamma-delta and MAIT cell subpopulations, which are not detectable via PCA, ICA, scVI, or a wide array of specialized rare cell recovery tools.

SCA leverages the notion of surprisal, whereby less probable events are more informative when they occur, to assign an information score to each transcript in each cell.
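A toy analogue of surprisal scoring — the information content −log2 p of an observation under a simple empirical null; SCA's actual per-transcript model differs:

```python
import numpy as np

# Toy counts: rows = cells, columns = genes.
counts = np.array([
    [0, 5, 1],
    [0, 4, 0],
    [9, 5, 1],   # cell 2 expresses gene 0 unusually highly
])

def surprisal_scores(counts):
    """-log2 of the empirical tail probability P(count >= observed) per gene:
    rarer observations carry more information. Illustrative only."""
    p = np.empty(counts.shape, dtype=float)
    for g in range(counts.shape[1]):
        col = counts[:, g]
        for i, c in enumerate(col):
            p[i, g] = (col >= c).mean()
    return -np.log2(p)

scores = surprisal_scores(counts)
print(scores[2, 0] > scores[0, 0])  # True: the rare high count is most surprising
```

Under this scheme, ubiquitous baseline expression scores near zero information, while a rare high count stands out — the property that lets rare subpopulations dominate a few components.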




□ MoNET: an R package for multi-omic network analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab722/6409845

MoNET enables users to not only track down the interactions of SNPs/genes at the metabolome level, but also trace back the potential risk variants/regulators given altered genes/metabolites.

MoNET is expected to advance our understanding of the multi-omic findings by unveiling their trans-omic interactions and is likely to generate new hypotheses for further validation.




□ SigTools: Exploratory Visualization For Genomic Signals

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab742/6413626

Sigtools is an R-based visualization package, designed to enable users with limited programming experience to produce statistical plots of continuous genomic data.

Sigtools consists of several statistical visualizations that provide insights regarding the behavior of a group of signals in large regions – such as a chromosome or the whole genome – as well as visualizing them around a specific point or short region.





□ Techniques to Produce and Evaluate Realistic Multivariate Synthetic Data

>> https://www.biorxiv.org/content/10.1101/2021.10.26.465952v1.full.pdf

The work demonstrates how to generate multivariate synthetic data that matches the real input data by converting the input into multiple one-dimensional (1D) problems.

The work also shows that it is possible to convert a multivariate input probability density function to a form that approximates a multivariate normal, although the technique is not dependent upon this finding.





□ RCX – an R package adapting the Cytoscape Exchange format for biological networks

>> https://www.biorxiv.org/content/10.1101/2021.10.26.466001v1.full.pdf

CX is a JSON-based data structure designed as a flexible model for transmitting networks with a focus on flexibility, modularity, and extensibility. Although those features are widely used in common REST protocols they don’t quite fit the R way of thinking about data.

RCX provides a collection of functions to integrate biological networks in CX format into analysis workflows. RCX adapts the aspect-oriented design in its data model, which consists of several aspects and sub-aspects, and corresponding properties, that are linked by internal IDs.





□ SingleCellMultiModal: Curated Single Cell Multimodal Landmark Datasets for R/Bioconductor

>> https://www.biorxiv.org/content/10.1101/2021.10.27.466079v1.full.pdf

SingleCellMultiModal, a suite of single-cell multimodal landmark datasets for benchmarking and testing multimodal analysis methods via the Bioconductor ExperimentHub package including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T.

For the integration of the 10x Multiome dataset, they used MOFA+ to obtain a latent embedding with contributions from both data modalities.




□ ddqc: Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.10.27.466176v1.full.pdf

data-driven QC (ddqc), an unsupervised adaptive quality control framework that performs flexible and data-driven quality control at the level of cell states while retaining critical biological insights and improved power for downstream analysis.

iterative QC, a revised paradigm for quality-filtering best practices. It provides a data-driven quality control framework compatible with observed biological diversity.





□ IPJGL: Importance-Penalized Joint Graphical Lasso (IPJGL): differential network inference via GGMs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab751/6414614

IPJGL, a novel importance-penalized joint graphical Lasso method for differential network inference based on the Gaussian graphical model with adaptive gene importance regularization.

DiNA focuses on gene interactions, which are more complex but can also reveal more information. A novel metric, APC2, evaluates the interaction between a pair of genes for individual samples, and can be used in the downstream analyses of DiNA such as gene-pair survival analysis.




□ CellexalVR: A virtual reality platform to visualize and analyze single-cell omics data

>> https://www.cell.com/iscience/fulltext/S2589-0042(21)01220-7

CellexalVR, an open-source virtual reality platform for the visualization and analysis of single-cell data. Placing all DR plots and associated metadata in VR provides an immersive, feature-rich, and collaborative environment to explore and analyze scRNAseq experiments.

CellexalVR will also import cell surface marker intensities captured during index sorting/CITEseq and categorical metadata for cells and genes.





□ Filling gaps of genome scaffolds via probabilistic searching optical maps against assembly graph

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04448-2

This approach applies sequential Bayesian updating to measure the similarity between optical maps and candidate contig paths. Using this similarity to guide path searching, it achieves higher accuracy than the existing "searching by evaluation" strategy that relies on heuristics.

nanoGapFiller aligns genome assembly contigs onto optical maps. The aligned contigs are further connected into scaffolds according to their order in the alignment.

nanoGapFiller uses a stochastic model to measure the similarity between a site sequence and any possible contig path, and then uses the probabilistic search technique to efficiently identify the contig path with the highest similarity.





□ Mix: A mixture model for signature discovery from sparse mutation data

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00988-7

Mix algorithm for elucidating the mutational signature landscape of input samples from their (sparse) targeted sequencing data. Mix is a probabilistic model that simultaneously learns signatures and soft clusters patients, learning exposures per cluster instead of per sample.

Mix soft-clusters the patient’s mutations and takes a linear combination of all exposures according to their probability. With this, Mix also solves another problem of existing methods, where adding a new patient requires learning a new exposure vector for it.






□ NanoMethViz: An R/Bioconductor package for visualizing long-read methylation data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009524

NanoMethViz produces publication-quality plots to inspect the broad differences in methylation profiles of different samples, the aggregated methylation profiles of classes of genomic features, and the methylation profiles of individual long reads.

NanoMethViz converts results from methylation callers into a tabular format containing the sample name, 1-based single-nucleotide chromosome position, log-likelihood ratio of methylation, and read name.
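The conversion step might look like the following sketch; the input column names are illustrative, not the exact schema of any particular methylation caller:

```python
import csv
import io

# Hypothetical tab-separated methylation-caller output (column names assumed).
caller_output = """chromosome\tstart\tlog_lik_ratio\tread_name
chr1\t99\t2.5\tread_a
chr1\t149\t-1.0\tread_b
"""

def to_tabular(text, sample):
    """Reshape caller output into (sample, chr, 1-based pos, statistic, read).

    A sketch of the kind of conversion NanoMethViz performs, not its code.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "sample": sample,
            "chr": rec["chromosome"],
            "pos": int(rec["start"]) + 1,  # 0-based start -> 1-based position
            "statistic": float(rec["log_lik_ratio"]),
            "read_name": rec["read_name"],
        })
    return rows

rows = to_tabular(caller_output, "sample1")
print(rows[0]["pos"], rows[1]["statistic"])  # 100 -1.0
```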





□ FASTAFS: file system virtualisation of random access compressed FASTA files

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04455-3

FASTAFS; a virtual layer between (random access) FASTA archives and read-only access to FASTA files and their guaranteed in-sync FAI, DICT and 2BIT files, through the File System in Userspace (FUSE).

FASTAFS guarantees in-sync virtualised metadata files and offers fast random-access decompression using bit encodings plus Zstandard (zstd).

FASTAFS can track all its system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy.





□ RiboCalc: Quantitative model suggests both intrinsic and contextual features contribute to the transcript coding ability determination in cells

>> https://www.biorxiv.org/content/10.1101/2021.10.30.466534v1.full.pdf

Ribosome Calculator (RiboCalc), an experiment-backed, data-oriented computational model for quantitatively predicting the coding ability (Ribo-seq expression level) of a particular human transcript. Features collected for the RiboCalc model are biologically related to translation control.

RiboCalc not only makes quantitatively accurate predictions but also offers insight into the sequence and transcription features contributing to transcript coding ability determination, shedding light on bridging the gap between the transcriptome and the proteome.

Large-scale analysis further revealed a number of transcripts w/ a variety of coding ability for distinct types of cells (i.e., context-dependent coding transcripts, CDCTs). A transcript’s coding ability should be modeled as a continuous spectrum with a context-dependent nature.




□ PopIns2: Population-scale detection of non-reference sequence variants using colored de Bruijn Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab749/6415820

PopIns2, a tool to discover and characterize non-reference sequence (NRS) variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns.

PopIns2 implements a scalable approach for generating an NRS variant call set. The paths through the graph have weights that may be used to compute a confidence score for each NRS. The traversal of the graph is trivially parallelizable on connected components of the graph.




□ Sub-Cluster Identification through Semi-Supervised Optimization of Rare-cell Silhouettes (SCISSORS) in Single-Cell Sequencing

>> https://www.biorxiv.org/content/10.1101/2021.10.29.466448v1.full.pdf

SCISSORS employs silhouette scoring for the estimation of heterogeneity of clusters and reveals rare cells in heterogenous clusters by implementing a multi-step, semi-supervised reclustering process.

SCISSORS calculates the silhouette score of each cell, which measures how well cells fit within their assigned clusters. The silhouette score estimates the relative cosine distance of each cell to cells in the same cluster versus cells in the closest neighboring cluster.

SCISSORS also enumerates several combinations of clustering parameters to achieve optimal performance by computing and comparing their silhouette coefficients.
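A from-scratch sketch of silhouette scoring under cosine distance, the quantity SCISSORS uses to flag heterogeneous clusters (in practice an optimized implementation would be used):

```python
import numpy as np

def cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def silhouette(X, labels):
    """Per-cell silhouette scores under cosine distance: (b - a) / max(a, b),
    where a is the mean distance to the cell's own cluster and b the mean
    distance to the closest other cluster. Illustrative toy implementation.
    """
    n = len(X)
    d = np.array([[cosine_dist(X[i], X[j]) for j in range(n)] for i in range(n)])
    scores = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the cell itself from its own cluster
        a = d[i, same].mean() if same.any() else 0.0
        b = min(d[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return scores

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels).min() > 0.5)  # True: both clusters are well separated
```

Low mean silhouette within a cluster would signal heterogeneity and trigger SCISSORS' reclustering step.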




□ Ideafix: a decision tree-based method for the refinement of variants in FFPE DNA sequencing data

>> https://academic.oup.com/nargab/article/3/4/lqab092/6412600

The Ideafix (deamination fixing) algorithm uses multivariate machine learning methods, which have the advantage over univariate methods that multiple descriptors can be tested simultaneously, so that relationships between them can be exploited.

The authors assembled a collection of variant descriptors and evaluated the performance of five supervised learning algorithms for classifying >1,600,000 variants, including both formalin-induced cytosine deamination artefacts and non-deamination variants, in order to arrive at Ideafix.

Unlike other methodologies that require multiple filtering steps and format conversion, the Ideafix algorithm is fully automatic.
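A decision-tree classifier over variant descriptors can be sketched as follows. The descriptors, the labeling rule, and all thresholds are synthetic stand-ins, not Ideafix's actual features or model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for variant descriptors (e.g. allele frequency,
# strand bias, read-position fraction); real Ideafix descriptors differ.
rng = np.random.default_rng(1)
X = rng.random((2000, 3))
# Call a variant a deamination artefact when allele frequency is low and
# strand bias is high -- a purely illustrative labeling rule.
y = ((X[:, 0] < 0.3) & (X[:, 1] > 0.5)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

The multivariate point from the summary shows up here directly: the tree's splits combine two descriptors, which no univariate filter on either descriptor alone could reproduce.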





□ Peakhood: individual site context extraction for CLIP-seq peak regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab755/6420697

Peakhood, the first tool that utilizes CLIP-seq peak regions identified by peak callers, in tandem with CLIP-seq read information and genomic annotations, to determine which context applies, individually for each peak region.

Peakhood can merge single datasets into comprehensive transcript context site collections. The collections also include tabular data, for example to identify which sites on transcripts lie in close proximity, or whether site distances decreased compared to the original genomic context.
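The distance comparison can be sketched in a few lines; the coordinates below are hypothetical, simply illustrating how removing intronic sequence brings two sites closer in transcript space:

```python
def site_distance(pos_a, pos_b):
    """Absolute distance between two site positions in one coordinate system."""
    return abs(pos_a - pos_b)

# Hypothetical coordinates for the same two CLIP sites.
genomic_dist = site_distance(10_500, 68_200)   # genome: intron included
transcript_dist = site_distance(320, 710)      # spliced transcript
closer_on_transcript = transcript_dist < genomic_dist
```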




□ decoupleR: Ensemble of computational methods to infer biological activities from omics data

>> https://www.biorxiv.org/content/10.1101/2021.11.04.467271v1.full.pdf

decoupleR, a Bioconductor package containing different statistical methods to extract these signatures within a unified framework. decoupleR allows the user to flexibly test any method with any resource.

decoupleR incorporates methods that take into account the sign and weight of network interactions. With a common syntax across types of omics datasets and knowledge sources, it facilitates the exploration of different approaches and can be integrated into many workflows.
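A signed, weighted-mean activity score is the simplest of these methods and can be sketched directly. This mirrors decoupleR's weighted-mean approach in spirit only; the function and network names are illustrative:

```python
import numpy as np

def weighted_mean_activity(expr, net):
    """Signed, weighted mean of target-gene expression per source.

    expr: dict gene -> expression value (one sample)
    net:  dict source -> list of (target_gene, signed_weight)
    """
    scores = {}
    for source, targets in net.items():
        w = np.array([wt for g, wt in targets if g in expr])
        x = np.array([expr[g] for g, wt in targets if g in expr])
        scores[source] = float((w * x).sum() / np.abs(w).sum())
    return scores

# Toy TF -> target network with activating (+) and repressing (-) edges.
net = {"TF1": [("g1", 1.0), ("g2", 1.0), ("g3", -0.5)]}
expr = {"g1": 2.0, "g2": 1.0, "g3": -1.0}
scores = weighted_mean_activity(expr, net)
```

Note how the repressed target (`g3`, negative weight, down-regulated) contributes positively to the inferred TF activity, which is exactly why sign-aware methods matter.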




□ Re-expressing coefficients from regression models for inclusion in a meta-analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.02.466931v1.full.pdf

When the distribution of exposure is skewed, the re-expression methods examined are likely to give biased results. The bias varied by method, the direction of the re-expression, skewness, influential observations, and in some cases, the median exposure.

Meta-analysts using any of these re-expression methods may want to consider the uncertainty, the likely direction and degree of bias, and conduct sensitivity analyses on the re-expressed results.




□ Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models under a Unified Framework

>> https://ieeexplore.ieee.org/document/9601281

A large-scale benchmark on four datasets comprising several thousand experiments and 24,804 GPU hours of computation time.

The combination of model architecture, training approach, loss function, and the explicit modeling of inverse relations is crucial for a model's performance, which is therefore not determined by its architecture alone.





EMBL

>> https://www.embl.org/topics/cop26/

EMBL is proud to have been formally admitted as an official Observer organisation by the 26th session of the UN Conference of the Parties @COP26.

We look forward to contributing further to the process of the UN's Framework Convention on Climate Change.


□ Rob Finn

>> https://twitter.com/robdfinn/status/1456936786547052546?s=21

@Google currently talking about the importance of data, ML and high throughput computing solutions to understand deforestation #COP26  image data, geo spatial data, monitoring, all sounds familiar to what we do at @emblebi





□ Making Common Fund data more Findable: Catalyzing a Data Ecosystem

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467504v1.full.pdf

The CFDE's federation system is centered on a metadata catalog that ingests metadata from individual Common Fund Program Data Coordination Centers into a uniform metadata model that can then be indexed and searched from a centralized portal.

This uniform Crosscut Metadata Model (C2M2) supports the wide variety of dataset types and metadata terms used by the individual programs and is designed to enable easy expansion to accommodate new datatypes.





□ hybpiper-rbgv and yang-and-smith-rbgv: Containerization and additional options for assembly and paralog detection in target enrichment data

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467817v1.full.pdf

HybPiper-RBGV: containerised and pipelined using Singularity and Nextflow. hybpiper-rbgv creates two output folders, one with all supercontigs and one with suspected chimeras removed (identified via read-mapping to supercontigs and detection of discordant read pairs).

The Maximum Inclusion algorithm iteratively extracts the largest subtrees from an unrooted gene tree. The Monophyletic Outgroups algorithm removes all genes in which the outgroup taxa are non-monophyletic. These alignments are ready for phylogenetic analysis either separately or after concatenation.
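The core of the Maximum Inclusion idea can be sketched on a toy rooted tree: find the largest subtree in which no taxon appears more than once (i.e., paralog-free). This is a simplification; the actual algorithm operates on unrooted gene trees and iterates after extraction:

```python
# Toy sketch of Maximum Inclusion: extract the largest subtree whose
# leaves contain each taxon at most once. Trees are nested tuples with
# string leaves named "taxon_copy" (e.g. "A_1" = copy 1 of taxon A).
def leaves(node):
    return [node] if isinstance(node, str) else [l for c in node for l in leaves(c)]

def taxon(leaf):
    return leaf.split("_")[0]

def subtrees(node):
    yield node
    if not isinstance(node, str):
        for c in node:
            yield from subtrees(c)

def max_inclusion(tree):
    """Largest subtree with all-distinct taxa among its leaves."""
    best = max(
        (s for s in subtrees(tree)
         if len({taxon(l) for l in leaves(s)}) == len(leaves(s))),
        key=lambda s: len(leaves(s)),
    )
    return leaves(best)

# Taxa A and B each have two paralogous copies; the (A_1, (B_1, C_1))
# clade is the largest paralog-free subtree.
tree = (("A_1", ("B_1", "C_1")), ("A_2", "B_2"))
result = max_inclusion(tree)
```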




□ Emulating complex simulations by machine learning methods

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04354-7

A multi-output dynamic simulation model which, given a set of inputs, generates the dynamics of a multivariate vector over a given time horizon.

A pitfall of this method is that it does not exploit the dynamics of the simulated process, relying only on the initial condition. This approach does not fit well those cases in which the modelled process has large variability, as for instance in stochastic simulations.
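The emulation setup can be sketched as multi-output regression from simulator inputs to the whole trajectory. The simulator (a damped oscillation), its parameterization, and the choice of random forests are all illustrative assumptions, not the paper's models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative "expensive" simulator: a damped oscillation whose decay
# rate and frequency play the role of the simulation inputs.
def simulate(theta, t):
    rate, freq = theta
    return np.exp(-rate * t) * np.cos(freq * t)

t = np.linspace(0, 5, 50)
rng = np.random.default_rng(0)
thetas = rng.uniform([0.1, 0.5], [1.0, 3.0], size=(200, 2))
Y = np.array([simulate(th, t) for th in thetas])   # 200 full trajectories

# Multi-output regression: one fit maps inputs to all 50 time points.
emulator = RandomForestRegressor(n_estimators=100, random_state=0)
emulator.fit(thetas, Y)

pred = emulator.predict([[0.5, 1.5]])[0]           # emulate a new input
```

Once trained, the emulator replaces each simulator call with a cheap prediction, which is the entire point of emulation; the pitfall noted above is that only the inputs, not intermediate dynamics, inform the fit.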





□ Low-input ATAC&mRNA-seq protocol for simultaneous profiling of chromatin accessibility and gene expression

>> https://star-protocols.cell.com/protocols/968

A simple, fast, and robust protocol (low-input ATAC&mRNA-seq) to simultaneously generate ATAC-seq and mRNA-seq libraries from the same cells in limited numbers, coupling a simplified ATAC procedure on whole cells with a novel mRNA-seq approach that features a seamless on-bead process, including direct mRNA isolation from the cell lysate, solid-phase cDNA synthesis, and direct tagmentation of mRNA/cDNA hybrids for library preparation.



DUNE

2021-11-11 22:31:37 | Film


□ 『DUNE』

>> https://www.dunemovie.com

Denis Villeneuve ... (directed by)

Jon Spaihts ... (screenplay by) and
Denis Villeneuve ... (screenplay by) and
Eric Roth ... (screenplay by)

Frank Herbert ... (based on the novel Dune written by)


Music by Hans Zimmer


A great legacy of SF literature, built on a drama of court intrigue yet addressing contemporary issues such as resource struggles and ethnic conflict. A monumental cinematic epic that unfolds throughout with solemn art direction and ritual before the viewer's eyes. "What is to come" is nothing other than "what is happening now." Will Villeneuve fulfill the prophecy engraved on this great poetic monument? We shall see...