lens, align.

Long is the time, but the true comes to pass.

Earth falls.

2021-07-17 19:17:37 | Science News


And in the nights the heavy earth falls
from all the stars into solitude.

- Rainer Maria Rilke.



□ ENIGMA: Improved estimation of cell type-specific gene expression through deconvolution of bulk tissues with matrix completion

>> https://www.biorxiv.org/content/10.1101/2021.06.30.450493v1.full.pdf

ENIGMA (a dEcoNvolutIon method based on reGularized Matrix completion) requires a cell type reference expression matrix (signature matrix), which can be derived from either FACS RNA-seq or scRNA-seq by calculating the average expression value of each gene in each cell type.

ENIGMA then applies a robust linear regression model to estimate each cell type's fraction in each sample, based on the reference matrix derived in the first step.

Third, based on the reference matrix and the cell type fraction matrix, ENIGMA applies a constrained matrix completion algorithm to deconvolute the bulk RNA-seq matrix into cell type-specific expression (CSE). With the ENIGMA-inferred CSE, almost all cell types showed improved cell type fraction estimation, as reflected by increased Pearson correlation with the ground-truth cell type fractions.

ENIGMA can also reconstruct pseudo-trajectories from the CSE, and the returned CSE can be used to identify cell type-specific DEGs and to visualize each gene's expression pattern in the cell type-specific manifold space.
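
A minimal sketch of the fraction-estimation step, with ordinary least squares standing in for ENIGMA's robust regression; the two-cell-type signature matrix and bulk profile below are toy data:

```python
# Sketch of cell-type fraction estimation from a bulk profile and a
# signature matrix. Ordinary least squares stands in for ENIGMA's robust
# linear regression; matrices are hypothetical toy data.

def estimate_fractions(signature, bulk):
    """Solve min ||S f - b|| for two cell types via 2x2 normal equations,
    then clip to [0, 1] and renormalise so the fractions sum to one."""
    # Normal equations: (S^T S) f = S^T b
    a11 = sum(s[0] * s[0] for s in signature)
    a12 = sum(s[0] * s[1] for s in signature)
    a22 = sum(s[1] * s[1] for s in signature)
    b1 = sum(s[0] * y for s, y in zip(signature, bulk))
    b2 = sum(s[1] * y for s, y in zip(signature, bulk))
    det = a11 * a22 - a12 * a12
    f1 = (a22 * b1 - a12 * b2) / det
    f2 = (a11 * b2 - a12 * b1) / det
    f1, f2 = max(f1, 0.0), max(f2, 0.0)   # clip negative estimates
    total = f1 + f2
    return (f1 / total, f2 / total)

# Toy signature: mean expression of 4 genes in two cell types.
S = [(10.0, 1.0), (8.0, 2.0), (1.0, 9.0), (2.0, 7.0)]
# Bulk profile from a 70/30 mixture of the two cell types.
b = [0.7 * g1 + 0.3 * g2 for g1, g2 in S]
print(estimate_fractions(S, b))  # ≈ (0.7, 0.3)
```

On noise-free toy data the mixture is recovered exactly; the robust regression matters when some genes violate the linear mixing assumption.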





□ INFIMA leverages multi-omics model organism data to identify effector genes of human GWAS variants

>> https://www.biorxiv.org/content/10.1101/2021.07.15.452422v1.full.pdf

INFIMA, a statistically grounded framework to capitalize on multi-omics functional data and fine-map model organism molecular quantitative trait loci. INFIMA leverages multiple multi-omics data modalities to elucidate causal variants underpinning the DO islet eQTLs.

INFIMA links ATAC-seq peaks and local-ATAC-MVs to candidate effector genes by fine-mapping DO-eQTLs. As the ability to measure inter-chromosomal interactions matures, incorporating trans-eQTLs into the INFIMA framework would be a natural extension.





□ CCPE: Cell Cycle Pseudotime Estimation for Single Cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.06.13.448263v1.full.pdf

CCPE maps high-dimensional scRNA-seq data onto a helix in three-dimensional space, where two dimensions capture the cyclic information in the scRNA-seq data and the third predicts the chronological order of cells along the cycle, which is called cell cycle pseudotime.

CCPE learns a discriminative helix to characterize the circular process and estimates pseudotime in the cell cycle. CCPE iteratively optimizes the discriminative dimensionality reduction via learning a helix until convergence.
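
The helix idea can be sketched as follows; the helix parameters and the grid-search pseudotime assignment are illustrative stand-ins for CCPE's actual discriminative optimization:

```python
# A minimal sketch of the helix behind CCPE: two dimensions carry the
# cyclic signal, the third carries pseudotime. Parameters are invented.
import math

def helix(t, radius=1.0, pitch=0.2):
    """Point on a helix: circle in (x, y), linear progression in z."""
    return (radius * math.cos(t), radius * math.sin(t), pitch * t)

def pseudotime(point, t_grid):
    """Assign pseudotime as the parameter of the nearest helix point."""
    def dist2(t):
        hx, hy, hz = helix(t)
        return sum((a - b) ** 2 for a, b in zip(point, (hx, hy, hz)))
    return min(t_grid, key=dist2)

grid = [i * 0.01 for i in range(629)]  # one turn: t in [0, 2*pi)
p = helix(2.5)                         # a cell lying exactly on the helix
print(pseudotime(p, grid))             # ≈ 2.5
```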





□ GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02423-x

GRIDSS2 utilises the same high-level approach as the first version of GRIDSS, assembling all reads that potentially support a structural variant using a positional de Bruijn graph breakend assembly algorithm.

GRIDSS2’s ability to phase breakpoints involving short DNA fragments is of great utility to downstream rearrangement event classification and karyotype reconstruction as it exponentially reduces the number of possible paths through the breakpoint graph.


GRIDSS2’s ability to collapse imprecise transitive calls into their corresponding precise breakpoints is similarly essential to complex event reconstruction as these transitive calls result in spurious false positives that are inconsistent with the actual rearrangement structure.





□ VSS: Variance-stabilized signals for sequencing-based genomic signals

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab457/6308936

Most Gaussian-based methods employ a variance-stabilizing transformation to handle the nonuniform mean-variance relationship. They most commonly use the log or inverse hyperbolic sine transformations.

VSS, a method that produces variance-stabilized signals for sequencing-based genomic signals. Having learned the mean-variance relationship, VSS generates variance-stabilized signals by applying the corresponding variance-stabilizing transformation.

VSS uses the zero bin for raw and fold enrichment (FE) signals, but not for log Poisson p-values (LPPV), which are not zero-inflated. Using variance-stabilized signals from VSS improves annotations produced by SAGA algorithms.
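
Why variance stabilization matters can be demonstrated with the classic 2√x transform for Poisson counts (a stand-in for the transformation VSS learns from data): raw variance grows with the mean, while the transformed variance stays near 1.

```python
# Raw Poisson-like counts have variance equal to their mean, so
# high-signal regions are noisier on the raw scale. The 2*sqrt(x) VST
# makes the variance roughly constant (~1) regardless of the mean.
import math
import random
import statistics

def poisson(lam, rng):
    """Knuth's algorithm for sampling a Poisson variate."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def variances(lam, n=20000, seed=0):
    """Population variance of raw counts vs. 2*sqrt(x)-transformed counts."""
    rng = random.Random(seed)
    raw = [poisson(lam, rng) for _ in range(n)]
    vst = [2.0 * math.sqrt(x) for x in raw]
    return statistics.pvariance(raw), statistics.pvariance(vst)

for lam in (4.0, 16.0, 64.0):
    var_raw, var_vst = variances(lam)
    print(f"mean={lam:5.1f}  raw var={var_raw:6.2f}  stabilized var={var_vst:4.2f}")
```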





□ SIVS: Stable Iterative Variable Selection

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab501/6322982

Stable Iterative Variable Selection (SIVS) starts from aggregating the results of multiple multivariable modeling runs using different cross-validation random seeds.

SIVS takes an iterative approach and internally utilizes various machine learning methods with embedded feature reduction in order to shrink the feature space down to a small yet robust set. The "true signal" is more effectively captured by SIVS than by standard glmnet.





□ Metric Multidimensional Scaling for Large Single-Cell Data Sets using Neural Networks

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449725v1.full.pdf

a neural network based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells.

the metric MDS clustering approach provides a non-linear mapping from high-dimensional points to the low-dimensional space that can place previously unseen cells in the same embedding.
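
The objective behind metric MDS, in miniature: the "stress" loss measures how well low-dimensional distances match high-dimensional ones. This sketch only evaluates the loss; the paper's contribution is minimizing it with a neural network so unseen cells can be mapped through the learned function.

```python
# Metric MDS stress: squared mismatch between high-dimensional pairwise
# distances and distances in the low-dimensional embedding.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stress(high, low):
    """Sum over pairs of (d_high(i,j) - d_low(i,j))^2."""
    n = len(high)
    return sum(
        (dist(high[i], high[j]) - dist(low[i], low[j])) ** 2
        for i in range(n) for j in range(i + 1, n)
    )

# Toy check: an embedding that preserves all pairwise distances has zero
# stress; a collapsed embedding does not.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
perfect = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
collapsed = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(stress(high, perfect))    # 0.0
print(stress(high, collapsed))  # > 0
```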





□ lra: A long read aligner for sequences and contigs

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078

The lra alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1kb and when discovering variation from ONT reads, while having runtimes comparable (1.05-3.76×) to current methods.

lra is a sequence alignment program that aligns long reads from single-molecule sequencing (SMS) instruments, or megabase-scale contigs from SMS assemblies.

lra implements seed chaining with sparse dynamic programming and a concave gap function for read and assembly alignment, which is also extended to handle inversions.

Each fragment belongs to O(log(n)) subproblems, and in each subproblem EV[j] can be computed from the block structure EB in O(log(n)) time, i.e., O((log(n))²) time per fragment. Since there are n fragments in total, the time complexity of processing all the points is bounded by O(n·(log(n))²).




□ SVNN: an efficient PacBio-specific pipeline for structural variations calling using neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04184-7

The logic behind this hypothesis was that only a small fraction of all reads (less than 1%) are used for SV detection, and these reads are usually harder to map to the reference than normal reads, so they might share common characteristics that can be leveraged in a learning model.

SVNN is a pipeline for SV detection that intelligently combines Minimap2 and NGMLR as long read aligners for the mapping phase, and SVIM and Sniffles for the SV calling phase.




□ IHPF: Dimensionality reduction and data integration for scRNA-seq data based on integrative hierarchical Poisson factorisation

>> https://www.biorxiv.org/content/10.1101/2021.07.08.451664v1.full.pdf

Integrative Hierarchical Poisson Factorisation (IHPF), an extension of HPF that makes use of a noise ratio hyper-parameter to tune the variability attributed to technical (batches) vs. biological (cell phenotypes) sources.

IHPF gene scores exhibit a well-defined block structure across all scenarios. IHPF learns latent factors that have a dual block-structure in both cell and gene spaces, with the potential for enhanced explainability and biological interpretability by linking cell types to gene clusters.





□ SEDR: Unsupervised Spatial Embedded Deep Representation of Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.06.15.448542v1.full.pdf

Iterative deep clustering generates a soft clustering by assigning cluster-specific probabilities to each cell, leveraging the inferences between cluster-specific and cell-specific representation learning.

SEDR uses a deep autoencoder to construct a gene latent representation in a low-dimensional latent space, which is then simultaneously embedded with the corresponding spatial information through a variational graph autoencoder.





□ DeeReCT-TSS: A novel meta-learning-based method annotates TSS in multiple cell types based on DNA sequences and RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.07.14.452328v1.full.pdf

DeeReCT-TSS uses a meta-learning-based extension for simultaneous transcription start site (TSS) annotation on 10 cell types, which enables the identification of cell-type-specific TSS.

the DNA sequence and the RNA-seq coverage in the 1000bp flanking window were converted into a 1000x4 (one-hot encoding) and 1000x1 vector. Both the DNA sequence and the RNA-seq coverage were fed into the network, resulting in the predicted value for each site in each TSS peak.
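
The sequence-encoding step above can be sketched as follows (the A/C/G/T column order is an assumption here, as is the all-zero row for ambiguous bases):

```python
# A DNA window becomes an L x 4 one-hot matrix, ready to be stacked with
# the L x 1 coverage vector as network input.

def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq:
        row = [0] * len(alphabet)
        if base in index:          # unknown bases (e.g. N) stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows

print(one_hot("ACGTN"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
```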





□ LongStitch: High-quality genome assembly correction and scaffolding using long reads

>> https://www.biorxiv.org/content/10.1101/2021.06.17.448848v1.full.pdf

LongStitch runs efficiently and generates high-quality final assemblies. Long reads are used to improve upon an input draft assembly from any data type. If a project solely uses long reads, LongStitch is able to further improve upon de novo long-read assemblies.

LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction using Tigmint-long, followed by two incremental scaffolding stages using ntLink and ARKS-long.





□ ECHO: Characterizing collaborative transcription regulation with a graph-based deep learning approach

>> https://www.biorxiv.org/content/10.1101/2021.07.01.450813v1.full.pdf

ECHO, a graph-based neural network, to predict chromatin features and characterize the collaboration among them by incorporating 3D chromatin organization from 200-bp high-resolution Micro-C contact maps.

ECHO, which mainly consists of convolutional layers, is more interpretable compared to ChromeGCN. ECHO leveraged chromatin structures and extracted information from the neighborhood to assist prediction.





□ Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04267-5

Pheniqs computes the full posterior decoding error probability of observed barcodes by consulting basecalling quality scores and prior distributions, and reports sequences and confidence scores in Sequence Alignment/Map (SAM) fields.

Pheniqs achieves greater accuracy than minimum edit distance or simple maximum likelihood estimation, and it scales linearly with core count to enable the classification of over 11 billion reads.
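
The Bayesian decoding can be sketched as follows; the barcodes, quality scores, and priors are invented, and a uniform error model over the three alternative bases is an assumption of this sketch, not a claim about Pheniqs internals:

```python
# Per-base error probabilities come from Phred scores (p = 10^(-Q/10));
# the likelihood of an observed barcode under each true barcode combines
# match/mismatch probabilities, and the posterior weighs likelihoods by
# prior barcode abundances.

def phred_to_p(q):
    """Phred quality score -> probability the base call is wrong."""
    return 10.0 ** (-q / 10.0)

def likelihood(observed, true, quals):
    """P(observed | true barcode) with independent per-base errors.
    A wrong call is assumed equally likely to be any of the 3 other bases."""
    p = 1.0
    for o, t, q in zip(observed, true, quals):
        e = phred_to_p(q)
        p *= (1.0 - e) if o == t else e / 3.0
    return p

def posterior(observed, quals, barcodes, priors):
    """P(true barcode | observed) via Bayes' rule over the barcode set."""
    joint = {b: likelihood(observed, b, quals) * priors[b] for b in barcodes}
    z = sum(joint.values())
    return {b: j / z for b, j in joint.items()}

post = posterior("ACGT", [30, 30, 12, 30],
                 barcodes=["ACGT", "ACTT"],
                 priors={"ACGT": 0.5, "ACTT": 0.5})
print(post)  # "ACGT" dominates, but the low-quality base keeps some doubt
```

One minus the winning posterior is exactly the decoding-error probability that a confidence threshold can act on.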





□ CStreet: a computed Cell State trajectory inference method for time-series single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab488/6312549

CStreet estimates the connection probabilities of the cell states and visualizes the trajectory, which may include multiple starting points and paths, using a force-directed graph.

CStreet uses a distribution-based parameter interval estimation to measure the transition probabilities of the cell states, while prior approaches used scoring, such as the percentages of votes or the mutual information of the cluster pathway enrichment used by Tempora.

The Hamming–Ipsen–Mikhailov (HIM) score is a combination of the Hamming distance and the Ipsen–Mikhailov distance to quantify the difference in the trajectory topologies.





□ scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics

>> https://www.nature.com/articles/s41467-021-24172-y

scGCN nonlinearly propagates feature information from neighboring cells in the hybrid graph, which learns the topological cell relations and improves the performance of transferring labels by considering higher-order relations between cells.

scGCN learns a sparse and hybrid graph of both inter- and intra-dataset cell mappings using mutual nearest neighbors of canonical correlation vectors. scGCN projects different datasets onto a correlated low-dimensional space.




□ scSGL: Signed Graph Learning for Single-Cell Gene Regulatory Network Inference

>> https://www.biorxiv.org/content/10.1101/2021.07.08.451697v1.full.pdf

scSGL incorporates the similarity and dissimilarity between observed gene expression data to construct gene networks. scSGL is formulated as a non-convex optimization problem and solved using an efficient ADMM framework.

scSGL reconstructs the GRN under the assumption that graph signals admit low-frequency representation over positive edges, while admitting high-frequency representation over negative edges.




□ StrobeAlign: Faster short-read mapping with strobemer seeds in syncmer space

>> https://www.biorxiv.org/content/10.1101/2021.06.18.449070v1.full.pdf

Canonical syncmers can be created for specific parameter combinations and reduce the computational burden of computing the non-canonical randstrobes in reverse complement. Strobealign aligns short reads 3-4x faster than minimap2 and 15-23x faster than BWA and Bowtie2.

Since Strobealign and Accel-Align achieve their speedups at different stages of the alignment pipeline, Strobealign in the seeding stage and Accel-Align in the filtering stage, they have the potential to be combined.




□ SDDScontrol: A Near-Optimal Control Method for Stochastic Boolean Networks

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8208226/

The method requires a set of control actions such as the silencing of a gene or the disruption of the interaction between two genes. An optimal control policy defined as the best intervention at each state of the system can be obtained using existing methods.

The method uses approximation techniques from the theory of Markov decision processes and reinforcement learning, so the complexity of the proposed algorithm does not depend on the number of possible states of the system and it can be applied to large systems.

The method generates control actions that approximate the optimal control policy with high probability, with a computational efficiency that does not depend on the size of the state space.




□ causalDeepVASE: Causal inference using deep-learning variable selection identifies and incorporates direct and indirect causalities in complex biological systems

>> https://www.biorxiv.org/content/10.1101/2021.07.17.452800v1.full.pdf

causalDeepVASE identifies associated variables in a pairwise Markov Random Field or undirected graphical model.

causalDeepVASE develops a penalized regression function with the interaction terms connecting the response variable and each of the other variables and maximizes the likelihood with sparsity penalties.





□ PVS: Pleiotropic Variability Score: A Genome Interpretation Metric to Quantify Phenomic Associations of Genomic Variants

>> https://www.biorxiv.org/content/10.1101/2021.07.18.452819v1.full.pdf

PVS uses ontologies of human diseases and medical phenotypes, namely the Human Phenotype Ontology (HPO) and the Disease Ontology (DO), to compute the similarities of disease and clinical phenotypes associated with a genetic variant based on semantic reasoning algorithms.

The Stojanovic method does not need to traverse the entire ontology to derive the similarity; instead, the computation terminates upon finding a common parent term via shortest path.

PVS provides a single metric by wrapping the entire compendium of scoring methods to capture phenomic similarity to quantify pleiotropy.
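
A toy version of the shortest-path termination described above, on an invented mini-ontology (the terms and edges are illustrative, not HPO/DO content):

```python
# Walk up a small ontology (child -> parents) breadth-first from both
# terms and stop at the first common ancestor, without traversing the
# whole ontology.
from collections import deque

def ancestors_by_level(term, parents):
    """Yield (term, depth) pairs in BFS order going upward."""
    seen, queue = {term}, deque([(term, 0)])
    while queue:
        node, d = queue.popleft()
        yield node, d
        for p in parents.get(node, []):
            if p not in seen:
                seen.add(p)
                queue.append((p, d + 1))

def nearest_common_ancestor(a, b, parents):
    """First ancestor of `a` (in BFS order) that is also an ancestor of `b`."""
    b_side = {node: d for node, d in ancestors_by_level(b, parents)}
    for node, d in ancestors_by_level(a, parents):
        if node in b_side:
            return node, d + b_side[node]   # combined path length
    return None, float("inf")

parents = {
    "retinal dystrophy": ["eye disease"],
    "cataract": ["eye disease"],
    "eye disease": ["disease"],
    "diabetes": ["disease"],
}
print(nearest_common_ancestor("retinal dystrophy", "cataract", parents))
# ('eye disease', 2)
```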





□ GraphCS: A Robust and Scalable Graph Neural Network for Accurate Single Cell Classification

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449752v1.full.pdf

GraphCS, a robust and scalable GNN-based method for accurate single cell classification, where the graph is constructed to connect similar cells within and between labelled and unlabelled scRNA-seq datasets for propagation of shared information.

To overcome the slow information propagation of GNNs at each training epoch, the diffused information is pre-calculated via the approximate Generalized PageRank algorithm, enabling sublinear complexity for high speed and scalability on millions of cells.
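
The diffusion being pre-calculated is in the PageRank family; a naive power-iteration version of personalized PageRank on a toy graph (the paper's point is precisely that an approximate algorithm replaces this exact iteration at scale) looks like:

```python
# Personalized PageRank by power iteration:
#   p <- (1 - alpha) * W p + alpha * e
# where W is the degree-normalized adjacency and e the source indicator.
# The toy adjacency list is illustrative.

def pagerank(adj, source, alpha=0.15, iters=100):
    """Personalized PageRank from `source` on an undirected adjacency list."""
    n = len(adj)
    p = [0.0] * n
    p[source] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for u, nbrs in enumerate(adj):
            share = (1.0 - alpha) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share     # spread mass to neighbours
        nxt[source] += alpha        # restart mass at the source
        p = nxt
    return p

# A path graph 0 - 1 - 2 - 3: mass concentrates near the source cell.
adj = [[1], [0, 2], [1, 3], [2]]
p = pagerank(adj, source=0)
print([round(x, 3) for x in p])
```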




□ Klarigi: Explanations for Semantic Groupings

>> https://www.biorxiv.org/content/10.1101/2021.06.14.448423v1.full.pdf

Hypergeometric gene enrichment is a univariate method, while Klarigi produces sets of terms which, considered individually or together, exclusively characterise multiple groups.

Klarigi is based upon the ε-constraints solution, retaining overall inclusivity as the objective function.

Klarigi creates semantic explanations for groups of entities described by ontology terms implemented in a manner that balances multiple scoring heuristics. As such, it presents a contribution to the reduction of unexplainability in semantic analysis.





□ seqgra: Principled Selection of Neural Network Architectures for Genomics Prediction Tasks

>> https://www.biorxiv.org/content/10.1101/2021.06.14.448415v1.full.pdf

seqgra, a deep learning pipeline that incorporates the rule-based simulation of biological sequence data and the training and evaluation of models, whose decision boundaries mirror the rules from the simulation process.

seqgra can serve as a testbed for hypotheses about biological phenomena or as a means to investigate the strengths and weaknesses of various feature attribution methods across different NN architectures that are trained on data sets with varying degrees of complexity.




□ Nanopore callers for epigenetics from limited supervised data

>> https://www.biorxiv.org/content/10.1101/2021.06.17.448800v1.full.pdf

DeepSignal outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Amortized-HMM is a novel hybrid HMM-DNN approach that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete.

To reduce the substantial computational burden, all reported experiments used architecture searches only from the k-mer-complete setting using DeepSignal. Amortized-HMM uses the Nanopolish HMM, with any missing modified k-mer emission distributions imputed by the FDNN.





□ splatPop: simulating population scale single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.06.17.448806v1.full.pdf

The splatPop model utilizes the flexible framework of Splatter to simulate data with complex experimental designs, including designs with batch effects, multiple cell groups (e.g., cell-types), and individuals with conditional effects.

splatPop can simulate populations where there are no batches, where all individuals are present in multiple batches, or where a subset of individuals are present in multiple batches as technical replicates.





□ DeepMP: a deep learning tool to detect DNA base modifications on Nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.06.28.450135v1.full.pdf

DeepMP, a convolutional neural network (CNN)-based model that takes information from Nanopore signals and basecalling errors to detect whether a given motif in a read is methylated or not.

DeepMP introduces a threshold-free position modification calling model sensitive to sites methylated at low frequency across cells. DeepMP achieved a significant separation compared to Megalodon, DeepSignal, and Nanopolish.

DeepMP's architecture: The sequence module involves 6 1D convolutional layers w/ 256 1x4 filters. The error module comprises 3 1D layers & 3 locally connected layers both w/ 128 1x3 filters. Outputs are finally concatenated and inputted into a fully connected layer w/ 512 units.




□ Hamiltonian Monte Carlo method for estimating variance components:

>> https://onlinelibrary.wiley.com/doi/10.1111/asj.13575

Hamiltonian Monte Carlo is based on Hamiltonian dynamics, and it follows Hamilton's equations, which are expressed as two differential equations.

In the sampling process of Hamiltonian Monte Carlo, a numerical integration method called leapfrog integration is used to approximately solve Hamilton's equations; the integration requires setting the number of discrete time steps and the integration step size.
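
Leapfrog integration can be sketched for a 1-D harmonic oscillator (potential U(q) = q²/2, so dU/dq = q); the step size and number of steps are exactly the two tuning choices the text mentions, and leapfrog's near-conservation of the Hamiltonian is what keeps HMC acceptance rates high:

```python
# Leapfrog integrator for Hamilton's equations with H = U(q) + p^2/2:
# half-step momentum, alternating full steps, half-step momentum.

def leapfrog(q, p, grad_u, eps=0.1, steps=20):
    """Approximately integrate Hamilton's equations for `steps` steps."""
    p -= 0.5 * eps * grad_u(q)          # initial half-step for momentum
    for _ in range(steps - 1):
        q += eps * p                    # full position step
        p -= eps * grad_u(q)            # full momentum step
    q += eps * p
    p -= 0.5 * eps * grad_u(q)          # final half-step for momentum
    return q, p

def hamiltonian(q, p):
    return 0.5 * q * q + 0.5 * p * p

q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, grad_u=lambda q: q)
print(hamiltonian(q0, p0), hamiltonian(q1, p1))  # nearly equal
```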




□ CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/article/37/Supplement_1/i51/6319673

CALLR (Cell type Annotation using Laplacian and Logistic Regression) combines unsupervised learning, represented by the graph Laplacian matrix constructed from all the cells, and supervised learning using sparse logistic regression.

The implementation of CALLR is based on general and rigorous theories behind logistic regression, spectral clustering and the graph-based Merriman–Bence–Osher scheme.





□ SvAnna: efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing

>> https://www.biorxiv.org/content/10.1101/2021.07.14.452267v1.full.pdf

Structural Variant Annotation and Analysis (SvAnna) assesses all classes of SV and their intersection with transcripts and regulatory sequences in the context of topologically associating domains, relating predicted effects on gene function with clinical phenotype data.

SvAnna filters out common SVs and calculates a numeric priority score for the remaining rare SVs by integrating information about genes, promoters, and enhancers with phenotype matching to prioritize potential disease-causing variants.




□ scQcut: A completely parameter-free method for graph-based single cell RNA-seq clustering

>> https://www.biorxiv.org/content/10.1101/2021.07.15.452521v1.full.pdf

scQcut employs a topology-based criterion to guide the construction of the KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters.

scQcut computes a distance matrix (or similarity matrix) using a given distance metric, and then computes a series of KNN graphs with different values of k. scQcut unambiguously determines the optimal co-expression network, and subsequently the most appropriate number of clusters.




□ AGTAR: A novel approach for transcriptome assembly and abundance estimation using an adapted genetic algorithm from RNA-seq data

>> https://www.sciencedirect.com/science/article/abs/pii/S0010482521004406

The adapted genetic algorithm (AGTAR) program can reliably assemble transcriptomes and estimate abundance based on RNA-seq data with or without genome annotation files.

Isoform abundance and isoform junction abundance are estimated by an adapted genetic algorithm. The crossover and mutation probabilities of the algorithm can be adaptively adjusted to effectively prevent premature convergence.




□ OMclust: Finding Overlapping Rmaps via Gaussian Mixture Model Clustering

>> https://www.biorxiv.org/content/10.1101/2021.07.16.452722v1.full.pdf

OMclust, an efficient clustering-based method for finding related Rmaps with high precision, which does not require any quantization or noise reduction.

OMclust performs a grid search to find the best parameters of the clustering model and replaces quantization by identifying a set of cluster centers and uses the variance of the cluster centers to account for the noise.





TEMPUS EDAX RERUM. (Time, the devourer of all things.)

2021-07-17 19:16:37 | Science News




□ PseudoGA: cell pseudotime reconstruction based on genetic algorithm

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab457/6318502

PseudoGA uses a genetic algorithm to find the best possible trajectory of cells that explains the expression patterns of individual genes. Another advantage of this method is that it can identify any lineage structure or branching while constructing the pseudotime trajectory.

PseudoGA can capture expression that (i) increases or decreases, (ii) increases then decreases, or (iii) increases, decreases, then increases again, assuming that the ranks of gene expression values along the pseudotime trajectory can be a linear, quadratic or cubic function of the pseudotime.





□ SpiderLearner: An ensemble approach to Gaussian graphical model estimation

>> https://www.biorxiv.org/content/10.1101/2021.07.13.452248v1.full.pdf

The Spider-Learner considers a library of candidate Gaussian graphical model (GGM) estimation methods and constructs the optimal convex combination of their results, eliminating the need for the researcher to make arbitrary decisions in the estimation process.

Under mild conditions on the loss function and the set of candidate learners, the expected difference between the risk of the Super Learner ensemble model and the risk of the oracle model converges to zero as the sample size goes to infinity.




□ Infinite re-reading of single proteins at single-amino-acid resolution using nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2021.07.13.452225v1

a system in which a DNA-peptide conjugate is pulled through a biological nanopore by a helicase that is walking on the DNA section.

This approach increases identification fidelity dramatically to 100% by obtaining indefinitely many independent re-readings of the same individual molecule with a succession of controlling helicases, eliminating the random errors that lead to inaccuracies in nanopore sequencing.




□ SiGraC: Node Similarity Based Graph Convolution for Link Prediction in Biological Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab464/6307262

Laplacian-based convolution is not well suited to single layered GCNs, as it limits the propagation of information to immediate neighbors of a node.

Coupling of Deep Graph Infomax (DGI’s) neural network architecture and loss function with convolution matrices that are based on node similarities can deliver superior link prediction performance as compared to convolution matrices that directly incorporate the adjacency matrix.




□ FlowGrid enables fast clustering of very large single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab521/6325016

With a new automated parameter tuning procedure, FlowGrid can achieve clustering accuracy comparable to state-of-the-art clustering algorithms at a substantially reduced run time. FlowGrid can complete a 1-hour clustering task for one million cells in about 5 minutes.

FlowGrid combines the benefits of DBSCAN and a grid-based approach to achieve scalability. The key idea of the FlowGrid algorithm is to replace the calculation of density from individual points with discrete bins as defined by a uniform grid.
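
The binning idea can be sketched as follows; the bin size and density threshold are hypothetical parameters, and real FlowGrid operates on PCA coordinates in higher dimensions:

```python
# Replace per-point density queries with per-bin counts on a uniform
# grid, then keep the "dense" bins, as DBSCAN-style core regions.
from collections import Counter

def dense_bins(points, bin_size=1.0, min_count=3):
    """Map each point to its grid bin and return bins with enough points."""
    counts = Counter(
        tuple(int(coord // bin_size) for coord in p) for p in points
    )
    return {b for b, c in counts.items() if c >= min_count}

points = [(0.1, 0.2), (0.3, 0.4), (0.7, 0.9),   # three points in bin (0, 0)
          (5.1, 5.2),                            # a lone outlier
          (9.0, 9.1), (9.2, 9.3), (9.4, 9.2)]   # three points in bin (9, 9)
print(dense_bins(points))  # {(0, 0), (9, 9)}
```

Counting per bin is one pass over the data, which is what makes the approach scale to millions of cells.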




□ SDImpute: A statistical block imputation method based on cell-level and gene-level information for dropouts in single-cell RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009118

SDImpute automatically identifies the dropout events based on the gene expression levels and the variations of gene expression across similar cells and similar genes, and it implements block imputation for dropouts by utilizing gene expression unaffected by dropouts from similar cells.

SDImpute combines gene expression levels and the variations of gene expression across similar cells and similar genes to construct a dropout index matrix to identify dropout events and true zeros. It can be considered the expression of single cells in a one-dimensional manifold.





□ Optimizing expression quantitative trait locus mapping workflows for single-cell studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02407-x

The following methodological choices are currently optimal: scran normalization; mean aggregation of expression across cells from one donor (and sequencing run/batch if relevant);

including principal components as covariates in the Linear mixed models (LMM); including a random effect capturing sampling variation in the LMM; and accounting for multiple testing by using the conditional false discovery rate.




□ Accelerated regression-based summary statistics for discrete stochastic systems via approximate simulators

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04255-9

an approximate ratio estimator to inform when the approximation is significantly different, and thus when one needs to simulate using the stochastic simulation algorithm (SSA) to prevent bias.

For the approximate simulators, ODE trajectories were generated using the adaptive LSODA integrator and τ-Leaping trajectories were generated using the adaptive τ-Leaping algorithm.





□ Chromap: Fast alignment and preprocessing of chromatin profiles

>> https://www.biorxiv.org/content/10.1101/2021.06.18.448995v1.full.pdf

Chromap, an ultrafast method for aligning chromatin profiles. Chromap is comparable to BWA-MEM and Bowtie2 in alignment accuracy and is over 10 times faster than other workflows on bulk ChIP-seq / Hi-C profiles and than 10x Genomics’ CellRanger v2.0.0 on scATAC-seq profiles.

Chromap considers every minimizer hit and uses the mate-pair information to rescue remaining missing alignments. Chromap caches the candidate read alignment locations in those regions to accelerate alignment of future reads containing the same minimizers.




□ Ultraplex: A rapid, flexible, all-in-one fastq demultiplexer

>> https://wellcomeopenresearch.org/articles/6-141

Ultraplex, a fast and uniquely flexible demultiplexer which splits a raw FASTQ file containing barcodes either at a single end or at both 5’ and 3’ ends of reads, trims the sequencing adaptors and low-quality bases, and moves UMIs into the read header.

Ultraplex is able to perform such single or combinatorial demultiplexing on both single- and paired-end sequencing data, and can process an entire Illumina HiSeq lane, consisting of nearly 500 million reads, in less than 20 minutes.
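
What a 5' demultiplexing step does, in miniature: the barcode list, UMI length, and read records below are invented, and real Ultraplex additionally handles 3' barcodes, adapter trimming, and quality trimming:

```python
# Match a 5' barcode, trim it (plus a fixed-length UMI) from the read,
# and carry the UMI in the read header.

def demultiplex(reads, barcodes, umi_len=4):
    """Split (header, seq) reads by 5' barcode; the UMI moves into the header."""
    out = {b: [] for b in barcodes}
    unassigned = []
    for header, seq in reads:
        for b in barcodes:
            if seq.startswith(b):
                umi = seq[len(b):len(b) + umi_len]
                out[b].append((f"{header}_UMI:{umi}", seq[len(b) + umi_len:]))
                break
        else:
            unassigned.append((header, seq))
    return out, unassigned

reads = [("read1", "ACGTTTTTGGGAAACCC"), ("read2", "TTTTACGTGATTACA")]
out, rest = demultiplex(reads, barcodes=["ACGT"])
print(out)   # read1 assigned: UMI TTTT moved to header, read starts at GGG
print(rest)  # read2 left unassigned
```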





□ scDA: Single cell discriminant analysis for single-cell RNA sequencing data

>> https://www.sciencedirect.com/science/article/pii/S2001037021002270

Single cell discriminant analysis (scDA) simultaneously identifies cell groups and discriminant metagenes based on the construction of a cell-by-cell representation graph, and then uses them to annotate unlabeled cells in the data.

With the optimal representation matrix, scDA is capable of estimating the involved cell types through a graph-based clustering method, e.g., spectral clustering, and of classifying the unlabeled cells to the acquired assignments based on discriminant vectors.





□ D4 - Dense Depth Data Dump: Balancing efficient analysis and storage of quantitative genomics data with the D4 format and d4tools

>> https://www.nature.com/articles/s43588-021-00085-0

The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input sequence file to determine an optimal encoding that enables fast data access.

The D4 algorithm uses a binary heap that fills with incoming alignments as it reports depth, and exploits the low entropy of depth data to efficiently encode quantitative genomics data in the D4 format. The average time complexity of this algorithm is linear with respect to the number of alignments.
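
The heap-based depth sweep can be sketched as follows; the intervals are toy data, and real D4 then encodes the resulting low-entropy depth track rather than printing it:

```python
# Alignments arrive sorted by start; a min-heap of end positions tells us
# how many alignments cover each position as we report depth left to right.
import heapq

def depth_profile(alignments, length):
    """Per-position coverage depth from sorted (start, end) half-open intervals."""
    ends = []            # min-heap of end positions of alignments still open
    depth = []
    i = 0
    for pos in range(length):
        while i < len(alignments) and alignments[i][0] <= pos:
            heapq.heappush(ends, alignments[i][1])
            i += 1
        while ends and ends[0] <= pos:   # drop alignments that have ended
            heapq.heappop(ends)
        depth.append(len(ends))
    return depth

alignments = [(0, 4), (1, 5), (3, 8)]    # sorted by start
print(depth_profile(alignments, 9))      # [1, 2, 2, 3, 2, 1, 1, 1, 0]
```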





□ SLOW5: a new file format enables massive acceleration of nanopore sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2021.06.29.450255v1.full.pdf

SLOW5 is a simple tab-separated values (TSV) file encoding metadata and time-series signal data for one nanopore read per line, with global metadata stored in a file header.

SLOW5 can be encoded in binary format (BLOW5) - this is analogous to the seminal SAM/BAM format for storing DNA sequence alignments. BLOW5 can be compressed using standard zlib, thereby minimising the data storage footprint while still permitting efficient parallel access.

Using a GPU-accelerated version of Nanopolish (described elsewhere), with compressed-BLOW5 input data, the authors were able to complete whole-genome methylation profiling on a single 30X human dataset in just 10.5 hours with 48 threads.
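A toy parser shows how little machinery a one-read-per-line TSV format needs. The layout here is illustrative only (header lines prefixed with '#', then read_id and a comma-separated signal per line) and does not follow the exact SLOW5 column specification:

```python
# Toy parser for a SLOW5-like TSV (illustrative layout, not the exact spec).
def parse_slow5(text):
    header, reads = {}, {}
    for line in text.strip().splitlines():
        if line.startswith("#"):
            key, _, val = line[1:].partition("\t")   # global metadata
            header[key] = val
        else:
            read_id, _, signal = line.partition("\t")  # one read per line
            reads[read_id] = [int(s) for s in signal.split(",")]
    return header, reads

text = "#sample_rate\t4000\nread_001\t482,491,507\n"
header, reads = parse_slow5(text)
print(header["sample_rate"], reads["read_001"])  # 4000 [482, 491, 507]
```

The appeal over FAST5 (HDF5) is exactly this: line-oriented records can be scanned, indexed, and parallelized with ordinary text tooling.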





□ LongRepMarker: A sensitive repeat identification framework based on short and long reads

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab563/6313241

LongRepMarker uses the multiple sequence alignment to find the unique k-mers which can be aligned to different locations on overlap sequences and the regions on overlap sequences that can be covered by these multi-alignment unique k-mers.

The parallel alignment model based on the multi-alignment unique k-mers can greatly optimize the efficiency of data processing in LongRepMarker. By taking the corresponding identification strategies, structural variations that occur between repeats can be identified.




□ Dynamic Bayesian Network Learning to Infer Sparse Models from Time Series Gene Expression Data

>> https://ieeexplore.ieee.org/document/9466470/

Two new BN scoring functions, extensions of the Bayesian Information Criterion (BIC) score with additional penalty terms, are used in conjunction with DBN structure search methods to find a graph structure that maximises the proposed scores.

GRNs are typically sparse, but traditional approaches of BN structure learning to elucidate GRNs produce many spurious edges. The new scoring functions offer better solutions with fewer spurious edges, and the algorithms are able to learn sparse graphs from high-dimensional time series data.
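The shape of such a score is easy to state: log-likelihood minus a complexity term, optionally minus an extra sparsity penalty. The exact penalty form below is illustrative, not the paper's; it only shows why a denser graph must buy its extra parameters with enough likelihood gain:

```python
import math

def bic_score(loglik, n_params, n_samples, sparsity_penalty=0.0):
    """BIC-style network score: log-likelihood minus a complexity term,
    plus an (illustrative) additional penalty on model size."""
    return loglik - 0.5 * n_params * math.log(n_samples) - sparsity_penalty

# A denser graph gains 2.0 log-likelihood but pays for 5 extra parameters:
sparse = bic_score(loglik=-120.0, n_params=4, n_samples=100)
dense = bic_score(loglik=-118.0, n_params=9, n_samples=100)
print(sparse > dense)  # True: the likelihood gain does not cover the penalty
```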




□ Linear functional organization of the omic embedding space

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab487/6313162

The method computes the Graphlet Degree Vector (GDV) Positive Pointwise Mutual Information (PPMI) matrix of the PPI network to capture different topological (structural) similarities of the nodes in the molecular network.

The embeddings obtained by Non-Negative Matrix Tri-Factorization-based decompositions of the PPMI matrix, as well as of the GDV PPMI matrix, uncover more enriched clusters and more enriched genes in the obtained clusters than the SVD-based decompositions.
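For concreteness, PPMI of a co-occurrence-style count matrix is the pointwise mutual information with negative values clipped to zero. A stdlib sketch (not the paper's implementation, which builds the counts from random walks on the network):

```python
import math

def ppmi(counts):
    """Positive pointwise mutual information of a non-negative count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    out = []
    for i, row in enumerate(counts):
        out.append([])
        for j, c in enumerate(row):
            if c == 0:
                out[i].append(0.0)
                continue
            pmi = math.log((c * total) / (row_sums[i] * col_sums[j]))
            out[i].append(max(pmi, 0.0))  # clip negatives: "positive" PMI
    return out

m = ppmi([[4, 0], [0, 4]])
print(round(m[0][0], 3))  # 0.693 = ln 2: the pair co-occurs twice as often as chance
```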




□ S4PRED: Increasing the Accuracy of Single Sequence Prediction Methods Using a Deep Semi-Supervised Learning Framework

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab491/6313164

PASS - Profile Augmentation of Single Sequences, a general framework for mapping multiple sequence information to cases where rapid and accurate predictions are required for orphan sequences.

S4PRED uses a variant of the powerful AWD-LSTM. S4PRED uses the PASS framework to develop a pseudo-labelling approach that is used to generate a large set of single sequences with highly accurate artificial labels.





□ TRACS: Inferring transcriptomic cell states and transitions only from time series transcriptome data

>> https://www.nature.com/articles/s41598-021-91752-9

TRACS, a novel time series clustering framework to infer TRAnscriptomic Cellular States only from time series transcriptome data by integrating Gaussian process regression, shape-based distance, and ranked pairs algorithm in a single computational framework.

TRACS determines patterns that correspond to hidden cellular states by clustering gene expression data. The final output of TRACS is a cluster network describing dynamic cell states and transitions by ordered clusters, where cluster genes imply representative genes of each cell state.





□ LIQA: long-read isoform quantification and analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02399-8

LIQA is the first long-read transcriptomic tool that takes these limitations of long-read RNA-seq data into account. LIQA models observed splicing information, high error rate of data, and read length bias.

LIQA is computationally intensive because the approximation of the nonparametric Kaplan-Meier estimator of the function f(Lr) relies on the empirical read length distribution, and the parameters are estimated using an EM algorithm.




□ libOmexMeta: Enabling semantic annotation of models to support FAIR principles

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab445/6300512

The goal of semantic annotations is to make explicit the biology that underlies the semantics of biosimulation models. LibOmexMeta is a library aimed at providing developer-level support for reading, writing, editing and managing semantic annotations for biosimulation models.





□ GPcounts: Non-parametric modelling of temporal and spatial counts data from RNA-seq experiments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab486/6313161

GPcounts is Gaussian process regression package for counts data with negative binomial and zero-inflated negative binomial likelihoods. GPcounts can be used to model temporal and spatial counts data in cases where simpler Gaussian and Poisson likelihoods are unrealistic.

GPcounts uses a Gaussian process with a logarithmic link function to model variation in the mean of the counts data distribution across time or space.
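The likelihood side of this model is easy to make concrete: with a log link, the latent GP value f maps to a mean mu = exp(f), and counts are scored under a negative binomial. The sketch below shows only that likelihood (using the var = mu + alpha*mu^2 parameterisation, an assumption on my part); the GP prior over f is omitted:

```python
import math

def nb_loglik(k, f, alpha):
    """NB log-likelihood of count k given latent GP value f, with log link
    mu = exp(f) and dispersion alpha (var = mu + alpha * mu**2)."""
    mu = math.exp(f)        # logarithmic link function
    r = 1.0 / alpha         # NB "size" parameter
    p = r / (r + mu)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))

# Counts near exp(f) ~ 7.4 are more likely than counts far from it:
print(nb_loglik(7, f=2.0, alpha=0.1) > nb_loglik(70, f=2.0, alpha=0.1))  # True
```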





□ Sigmap: Real-time mapping of nanopore raw signals

>> https://academic.oup.com/bioinformatics/article/37/Supplement_1/i477/6319675

Sigmap is a streaming method for mapping raw nanopore signal to reference genomes. The method features a new way to index reference genomes using k-d trees, a novel seed selection strategy and a seed chaining algorithm tailored toward the current signal characteristics.

The method avoids any conversion of signals to sequences and fully works in signal space, which holds promise for completely base-calling-free nanopore sequencing data analysis.





□ CVAE–NCEM: Learning cell communication from spatial graphs of cells

>> https://www.biorxiv.org/content/10.1101/2021.07.11.451750v1.full.pdf

Node-centric expression modeling (NCEM), a computational method based on graph neural networks which reconciles variance attribution and communication modeling in a single model of tissue niches.

NCEMs can be extended to mixed models of explicit cell communication events and latent intrinsic sources of variation in conditional variational autoencoders to yield holistic models of cellular variation in spatial molecular profiling data.





□ Parallel Framework for Inferring Genome-Scale Gene Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2021.07.11.451988v1.full.pdf

A generic parallel inference framework in which any inference algorithm, without alteration, can run in parallel across multiple CPU cores on very large datasets to produce the inferred networks efficiently.

Strictly plugging the measured application execution data into the formula for Amdahl's Law gives a much more pessimistic estimate than the scaled speedup formula.
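The contrast between the two estimates is just two one-line formulas: Amdahl's law fixes the problem size, while the scaled speedup (Gustafson) formula lets the problem grow with the core count:

```python
def amdahl(p, n):
    """Fixed-size speedup: p = parallelizable fraction, n = cores."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson(p, n):
    """Scaled speedup: the problem size grows with the core count."""
    return (1.0 - p) + p * n

# With 5% serial work on 64 cores the two estimates diverge sharply,
# which is why Amdahl's law reads as the pessimistic bound:
print(round(amdahl(0.95, 64)), round(gustafson(0.95, 64)))  # 15 61
```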





□ Designing Interpretable Convolution-Based Hybrid Networks for Genomics

>> https://www.biorxiv.org/content/10.1101/2021.07.13.452181v1.full.pdf

The authors systematically investigate the extent to which architectural choices in convolution-based hybrid networks influence learned motif representations in first-layer filters, as well as the reliability of the attribution maps generated by saliency analysis.

As attention-based models are gaining interest in regulatory genomics, hybrid networks would benefit from incorporating these design principles to bolster their intrinsic interpretability.




□ HieRFIT: A hierarchical cell type classification tool for projections from complex single-cell atlas datasets

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab499/6320801

HieRFIT (Hierarchical Random Forest for Information Transfer) uses a priori information about cell type relationships to improve classification accuracy, taking as input a hierarchical tree structure representing the class relationships, along with the reference data.

HieRFIT uses an ensemble approach combining multiple random forest models, organized in a hierarchical decision tree structure. HieRFIT improves accuracy and reduces incorrect predictions especially for inter-dataset tasks which reflect real life applications.





□ Efficient gradient-based parameter estimation for dynamic models using qualitative data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab512/6321450

a semi-analytical algorithm for gradient calculation of the optimal scaling method developed for qualitative data. This enables the use of efficient gradient-based optimization algorithms.

The accuracy of the obtained gradients is validated by comparing them to finite differences, and the advantage of using gradient information is assessed on five application examples by performing optimization with a gradient-free and a gradient-based algorithm.
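The finite-difference validation step is a standard gradient check and fits in a few lines; the Rosenbrock-style objective below is my stand-in for the paper's dynamic models:

```python
def grad_check(f, grad, x, h=1e-6, tol=1e-4):
    """Compare an analytical gradient against central finite differences."""
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        fd = (f(xp) - f(xm)) / (2 * h)  # central difference, O(h^2) error
        if abs(fd - grad(x)[i]) > tol:
            return False
    return True

# Rosenbrock-style objective with its hand-derived gradient:
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: [-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                  200 * (x[1] - x[0]**2)]
print(grad_check(f, grad, [0.5, 0.5]))  # True
```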





□ MUNIn: A statistical framework for identifying long-range chromatin interactions from multiple samples

>> https://www.cell.com/hgg-advances/fulltext/S2666-2477(21)00017-8

MUNIn (multiple-sample unifying long-range chromatin-interaction detector) adopts a hierarchical hidden Markov random field (H-HMRF) model.

MUNIn jointly models multiple samples, simultaneously accounting for the spatial dependency within each sample and the dependency across samples.




□ xPore: Identification of differential RNA modifications from nanopore direct RNA sequencing

>> https://www.nature.com/articles/s41587-021-00949-w

RNA modifications can be identified from direct RNA-seq data with high accuracy, enabling analysis of differential modifications and expression from a single high-throughput experiment.

xPore identifies positions of m6A sites at single-base resolution, estimates the fraction of modified RNA species in the cell and quantifies the differential modification rate across conditions.





□ ELIMINATOR: Essentiality anaLysIs using MultIsystem Networks And inTeger prOgRamming

>> https://www.biorxiv.org/content/10.1101/2021.07.21.453265v1.full.pdf

ELIMINATOR, an in-silico method for the identification of patient-specific essential genes using constraint-based modelling (CBM). It expands the ideas behind traditional CBM to accommodate multisystem networks, that is, biological networks that capture complex interactions across systems.

ELIMINATOR calculates the minimum number of non-expressed genes required to be active by the cell to sustain life as defined by a set of requirements; and performs an exhaustive in-silico gene knockout to find those that lead to the need of activating extra non-expressed genes.





□ TRaCE: Ranked Choice Voting for Representative Transcripts

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab542/6326792

TRaCE (Transcript Ranking and Canonical Election) holds an ‘election’ in which a set of RNA-seq samples rank transcripts by annotation edit distance.

TRaCE identifies the most common isoforms from a broad expression atlas or prioritizes alternative transcripts expressed in specific contexts. TRaCE tallies votes for top-ranked candidates; if there is a tie for first place, votes for the subsequent rankings are added to the tally.
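The tie-breaking tally can be sketched as a toy election. This is an illustration of the idea (add the next rank's votes while the lead is tied), not TRaCE's published algorithm:

```python
from collections import Counter

def elect(ballots):
    """Each ballot ranks transcript IDs best-first. Tally first choices;
    while the lead is tied, add votes from the next rank to break it."""
    tally = Counter(b[0] for b in ballots)
    depth = 1
    while depth < max(len(b) for b in ballots):
        leaders = [t for t, v in tally.items() if v == max(tally.values())]
        if len(leaders) == 1:
            break
        tally.update(b[depth] for b in ballots if len(b) > depth)
        depth += 1
    return max(tally, key=tally.get)

ballots = [["t1", "t2"], ["t2", "t1"], ["t3", "t1"]]
print(elect(ballots))  # 't1': the three-way first-rank tie is broken by second choices
```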




□ NMFLRR: Clustering scRNA-seq data by integrating non-negative matrix factorization with low rank representation

>> https://ieeexplore.ieee.org/document/9495191/

NMFLRR, a new computational framework to identify cell types by integrating low-rank representation (LRR) and nonnegative matrix factorization (NMF).

The LRR captures the global properties of original data by using nuclear norms, and a locality constrained graph regularization term is introduced to characterize the data's local geometric information.

The similarity matrix and low-dimensional features of the data can be obtained simultaneously by applying the ADMM algorithm to update each variable alternately in an iterative way. NMFLRR then applies a spectral algorithm to the optimized similarity matrix.




□ SDPR: A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009697

SDPR connects the marginal coefficients in summary statistics with true effect sizes through Bayesian multiple Dirichlet process regression.

SDPR utilizes the concept of approximately independent LD blocks and overparametrization to develop a parallel and fast-mixing MCMC algorithm. SDPR can provide estimation of heritability, genetic architecture, and posterior inclusion probability.




□ Using topic modeling to detect cellular crosstalk in scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2021.07.26.453767v1.full.pdf

a new method for detecting genes that change as a result of interaction based on Latent Dirichlet Allocation (LDA). This method does not require prior information in the form of clustering or generation of synthetic reference profiles.






□ Parallel Implementation of Smith-Waterman Algorithm on FPGA

>> https://www.biorxiv.org/content/10.1101/2021.07.27.454006v1.full.pdf

The development of the algorithm was carried out using the development platform provided by the Field-Programmable Gate Array (FPGA) manufacturer, in this case, Xilinx.

By storing alignment path distances and the maximum score position during Forward Stage processing, it was possible to reduce the complexity of Backtracking Stage processing, which allowed the path to be followed directly.

This platform allows the user to develop circuits using the block diagram strategy instead of VHDL or Verilog. The architecture was deployed on the FPGA Virtex-6 XC6VLX240T.
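For reference, the Forward Stage being parallelized is the standard Smith-Waterman recurrence: fill the DP matrix, clamping at zero, while tracking the maximum score and its position. A plain software version (the FPGA design computes anti-diagonals of this matrix in parallel):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman forward stage: returns (best score, (i, j) of best cell)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            if H[i][j] > best:           # track max score and its position
                best, best_pos = H[i][j], (i, j)
    return best, best_pos

print(smith_waterman("ACGT", "TACGTA"))  # (8, (4, 5)): exact local match ACGT
```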





□ SUITOR: selecting the number of mutational signatures through cross-validation

>> https://www.biorxiv.org/content/10.1101/2021.07.28.454269v1.full.pdf

SUITOR (Selecting the nUmber of mutatIonal signaTures thrOugh cRoss-validation), an unsupervised cross-validation method that requires few assumptions and no numerical approximations to select the optimal number of signatures without overfitting the data.

SUITOR extends the probabilistic model to allow missing data in the training set, which makes cross-validation feasible. An expectation/conditional maximization algorithm extracts signature profiles, estimates mutation contributions and imputes the missing data simultaneously.





□ WGA-LP: a pipeline for Whole Genome Assembly of contaminated reads

>> https://www.biorxiv.org/content/10.1101/2021.07.31.454518v1.full.pdf

WGA-LP connects state-of-the-art programs and novel scripts to check and improve the quality of both samples and the resulting assemblies. With its conservative decontamination approach, WGA-LP has been shown to be capable of creating high-quality assemblies even in the case of contaminated reads.

WGA-LP includes custom scripts to help in the visualization of node coverage by post processing the output of Samtools depth. For node reordering, WGA-LP uses the ContigOrderer option from Mauve aligner.





Cumulonimbus.

2021-07-17 19:12:36 | Science News

(“La Tempête” / Pierre Auguste Cot)




□ HexaChord: Topological Structures in Computer-Aided Music Analysis

>> http://repmus.ircam.fr/_media/moreno/BigoAndreatta_Computational_Musicology.pdf

A chord complex is a labelled simplicial complex which represents a set of chords. The dimension of the elements of the complex and their neighbourhood relationships highlight the size of the chords and their intersections.

Following a well-established tradition in set-theoretical and neo-Riemannian music analysis, T/I complexes represent classes of chords which are transpositionally and inversionally equivalent and which relate to the notion of Generalized Tonnetze.

To improve intelligibility, chromatic and diatonic T/I complexes of dimension 2 (i.e., constituted of 3-note chords) can be unfolded in HexaChord as infinite two-dimensional triangular tessellations, in the same style as the planar representation of the Tonnetz.





□ Deciphering cell–cell interactions and communication from gene expression

>> https://www.nature.com/articles/s41576-020-00292-x

Each approach for inferring CCIs and CCC has its own assumptions and limitations to consider; when one is using such strategies, it is important to be aware of these strengths and weaknesses and to choose appropriate parameters for analyses.

A potential obstacle for this method is the sparsity of single-cell data sets, which can increase or decrease correlation coefficients in undesirable ways, leading to correlation values that measure sparsity, rather than biology.




□ RosettaSurf - a surface-centric computational design approach

>> https://www.biorxiv.org/content/10.1101/2021.06.16.448645v1.full.pdf

To efficiently explore the sequence space during the design process, Monte Carlo simulated annealing guides the optimization of rotamers; residue substitutions are scored based on the resulting surface and accepted if they pass the Monte Carlo criterion, which is implemented as the SurfS score.

The RosettaSurf protocol combines the explicit optimization of molecular surface features with a global scoring function during the sequence design process, diverging from the typical design approaches that rely solely on an energy scoring function.
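The Monte Carlo criterion referred to here is, in generic form, the Metropolis acceptance rule used in simulated annealing: always accept improving moves, and accept worsening ones with probability exp(-delta/T). A generic sketch, with the score standing in for RosettaSurf's SurfS score (the exact acceptance schedule in RosettaSurf is not reproduced here):

```python
import math, random

def metropolis_accept(delta_score, temperature, rng=random.Random(0)):
    """Simulated-annealing acceptance: improving moves (delta <= 0, assuming
    lower score = better) always pass; worsening moves pass with
    probability exp(-delta / T)."""
    if delta_score <= 0:
        return True
    return rng.random() < math.exp(-delta_score / temperature)

print(metropolis_accept(-1.0, 1.0))      # True: improvements are always accepted
print(metropolis_accept(5000.0, 0.01))   # False: exp(-500000) underflows to 0
```

As the temperature is lowered over the run, the chain shifts from broad exploration to greedy refinement of the design.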





□ ANANSE: an enhancer network-based computational approach for predicting key transcription factors in cell fate determination

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab598/6318498

ANANSE (ANalysis Algorithm for Networks Specified by Enhancers), a network-based method that exploits enhancer-encoded regulatory information to identify the key transcription factors in cell fate determination.

ANANSE recovers the largest fraction of TFs that were validated by experimental trans-differentiation approaches. ANANSE can prioritize TFs that drive cellular fate changes.

ANANSE takes a 2-step approach. I. TF binding is imputed for all enhancers using a simple supervised logistic classifier. II. The imputed TF signals are summarized using a distance-weighted decay function and combined with TF activity and target gene expression to infer cell type-specific GRNs.




□ Embeddings of genomic region sets capture rich biological associations in lower dimensions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab439/6307720

a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. It reduces dimensionality from more than a hundred thousand to 100 without significant loss in classification performance.

The methods are assessed on whether similarity among embeddings can reflect simulated random perturbations of genomic regions; the vectors retain useful biological information in relatively low-dimensional spaces.




□ GraphOmics: an Interactive Platform to Explore and Integrate Multi-Omics Data

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449741v1.full.pdf

GraphOmics provides an interactive platform that integrates data with Reactome pathways, emphasising interactivity and biological context. This avoids presenting the integrated omics data as one large network graph or as numerous static tables.

GraphOmics offers a way to perform pathway analysis separately on each omics, and integrate the results at the end. The separate pathway analysis results run on different omics datasets can be combined with an AND operator in the Query Builder.





□ BOOST-GP: Bayesian Modeling of Spatial Molecular Profiling Data via Gaussian Process

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab455/6306406

Recent technology breakthroughs in spatial molecular profiling, including imaging-based technologies and sequencing-based technologies, have enabled the comprehensive molecular characterization of single cells while preserving their spatial and morphological contexts.

BOOST-GP models the gene expression count value with a zero-inflated negative binomial distribution, and estimates the spatial covariance with a Gaussian process model. It can be applied to detect spatially variable (SV) genes whose expression displays a spatial pattern.




□ GxEsum: a novel approach to estimate the phenotypic variance explained by genome-wide GxE interaction based on GWAS summary statistics for biobank-scale data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02403-1

GxEsum can handle a large-scale biobank dataset with controlled type I error rates and unbiased GxE estimates, and its computational efficiency can be hundreds of times higher than existing GxE methods.

The computational efficiency of the proposed approach is substantially higher than that of the reaction norm model (RNM), an existing genomic restricted maximum likelihood (GREML)-based method, while the estimates are reasonably accurate and precise.





□ metaMIC: reference-free Misassembly Identification and Correction of de novo metagenomic assemblies

>> https://www.biorxiv.org/content/10.1101/2021.06.22.449514v1.full.pdf

metaMIC can identify misassembled contigs, localize misassembly breakpoints within misassembled contigs and then correct misassemblies by splitting misassembled contigs at breakpoints.

As metaMIC can identify breakpoints in misassembled contigs, it can split misassembled contigs at breakpoints and reduce the number of misassemblies; although the contiguity could be slightly decreased due to more fragmented contigs.





□ SPRUCE: A Bayesian Multivariate Mixture Model for Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2021.06.23.449615v1.full.pdf

SPRUCE (SPatial Random effects-based clUstering of single CEll data), a Bayesian spatial multivariate finite mixture model based on multivariate skew-normal distributions, which is capable of identifying distinct cellular sub-populations in HST data.

SPRUCE implements a novel combination of Pólya–Gamma data augmentation and spatial random effects to infer spatially correlated mixture component membership probabilities without relying on approximate inference techniques.





□ Transformation and Preprocessing of Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449781v1.full.pdf

Delta method: Variance-stabilizing transformations based on the delta method promise an easy fix for heteroskedasticity where the variance only depends on the mean.

For the residual-based variance-stabilizing transformation, the linear nature of the Pearson-residuals-based transformation reduces its suitability for comparisons of a gene's data across cells: there is no variance stabilization across cells, only across genes.
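The delta-method claim is easy to check numerically: for Poisson-like counts (sd = sqrt(mu)), a one-standard-deviation step after the sqrt transform has size roughly 0.5 regardless of the mean, i.e., the variance is stabilized across expression levels. A small stdlib check of that prediction:

```python
import math

def delta_sqrt_step(mu):
    """Size of a one-standard-deviation step after the sqrt transform,
    for Poisson-like counts with sd = sqrt(mu). The delta method predicts
    this is ~0.5 for any sufficiently large mu."""
    return math.sqrt(mu + math.sqrt(mu)) - math.sqrt(mu)

# Nearly identical at a 100x difference in expression level:
print(round(delta_sqrt_step(100), 2), round(delta_sqrt_step(10000), 2))  # 0.49 0.5
```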




□ CAFEH: Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity

>> https://www.medrxiv.org/content/10.1101/2021.06.28.21259545v1.full.pdf

CAFEH is a Bayesian algorithm that incorporates information regarding the strength of the association between a phenotype and the genotype in a locus along with LD structure of that locus across different studies and tissues to infer causal variants within each locus.

CAFEH is a probabilistic model that performs colocalization and fine mapping jointly across multiple phenotypes. CAFEH users need to specify the number of components and the prior probability that each component is active in each phenotype.





□ scCOLOR-seq: Nanopore sequencing of single-cell transcriptomes

>> https://www.nature.com/articles/s41587-021-00965-w

Single-cell corrected long-read sequencing (scCOLOR-seq), which enables error correction of barcode and unique molecular identifier oligonucleotide sequences and permits standalone cDNA nanopore sequencing of single cells.

scCOLOR-seq has multiple advantages over current methodologies to correct error-prone sequencing. It provides superior error correction of barcodes, w/ over 80% recovery of reads when using an edit distance of 7, or over 60% recovery when using a conservative edit distance of 6.
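Edit-distance-based barcode assignment, the operation these recovery rates refer to, can be sketched with a plain Levenshtein DP. This is only the generic idea; scCOLOR-seq's actual scheme uses specially designed color-code barcodes rather than a brute-force whitelist search:

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct_barcode(observed, whitelist, max_dist=6):
    """Assign a noisy barcode to its nearest whitelist entry, if within
    the edit-distance threshold; otherwise discard the read."""
    best = min(whitelist, key=lambda bc: edit_distance(observed, bc))
    return best if edit_distance(observed, best) <= max_dist else None

wl = ["AAAAAAAAAA", "CCCCCCCCCC"]
print(correct_barcode("AAATAAAAA", wl))  # 'AAAAAAAAAA': 1 substitution + 1 deletion away
```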




□ PZLAST: an ultra-fast amino acid sequence similarity search server against public metagenomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab492/6317664

PZLAST provides extremely-fast and highly accurate amino acid sequence similarity searches against several Terabytes of public metagenomic amino acid sequence data.

PZLAST uses multiple PEZY-SC2s, which are Multiple Instruction Multiple Data (MIMD) many-core processors. The basis of the sequence similarity search algorithm of PZLAST is similar to the CLAST algorithm.




□ Ryūtō: Improved multi-sample transcript assembly for differential transcript expression analysis and more

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab494/6320779

Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō’s unique ability to utilize an (incomplete) reference for multi-sample assemblies greatly increases precision.

Ryūtō consistently improves assembly on replicates of the same tissue independent of filter settings, even when mixing conditions or time series. Consensus voting in Ryūtō is especially effective at high precision assembly, while Ryūtō’s conventional mode can reach higher recall.





□ Merfin: improved variant filtering and polishing via k-mer validation

>> https://www.biorxiv.org/content/10.1101/2021.07.16.452324v1.full.pdf

Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping/polishing. Merfin evaluates the accuracy of a call based on expected k-mer multiplicity, independently of the quality of the read alignment and variant caller’s internal score.

K* enables the detection of collapses / expansions, and improves the QV when used to filter variants for polishing. Merfin provides a script generating a lookup table for each k-mer frequency in the raw data w/ the most plausible k-mer multiplicity and its associated probability.





□ CoLoRd: Compressing long reads

>> https://www.biorxiv.org/content/10.1101/2021.07.17.452767v1.full.pdf

CoLoRd, a compression algorithm for ONT and PacBio sequencing data. Its main contributions are (i) a novel method for compressing the DNA component of FASTQ files and (ii) lossy processing of the quality stream.

Equipped with an overlap-based algorithm for compressing the DNA stream and a lossy processing of the quality information, CoLoRd allows even tenfold space reduction compared to gzip, without affecting downstream analyses like variant calling or consensus generation.





□ Modelling, characterization of data-dependent and process-dependent errors in DNA data storage

>> https://www.biorxiv.org/content/10.1101/2021.07.17.452779v1.full.pdf

The work theoretically formulates the sequence corruption that is cooperatively dictated by the base error statistics, copy counts of the reference sequence, and downstream processing methods.

The average sequence loss rate E(P(x = 0)) against the average copy count, i.e., the channel coverage (η), can be well described by an exponentially decreasing curve e^(−λ), in which λ is a random variable (RV) following an uneven sequence count distribution Λ.




□ Rascal: Absolute copy number fitting from shallow whole genome sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.07.19.452658v1.full.pdf

Rascal (relative to absolute copy number scaling) provides improved fitting algorithms and enables interactive visualisation of copy number profiles.

While ACN fitting for high-purity samples is easily achievable using Rascal, additional information is required for impure clinical tissue samples. In addition, manual inspection of copy number profiles using Rascal’s interactive web interface allows ACN fitting of otherwise problematic samples.





□ danbing-tk: Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs

>> https://www.nature.com/articles/s41467-021-24378-0

VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies.

Tandem Repeat Genotyping based on Haplotype-derived Pangenome Graphs (danbing-tk) identifies VNTR boundaries in assemblies, constructs RPGGs, aligns SRS reads to the RPGG, and infers VNTR motif composition and length in SRS reads.




□ Nanopanel2 calls phased low-frequency variants in Nanopore panel sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab526/6322985

Nanopanel2, a variant caller for Nanopore panel sequencing data. Nanopanel2 works directly on base-called FAST5 files and uses allele probability distributions and several other filters to robustly separate true from false positive (FP) calls.

Np2 also produces haplotype map TSV and PDF files that inform about haplotype distributions of called (PASS) variants. Haplotype compositions are then determined by direct phasing.





□ mm2-fast:Accelerating long-read analysis on modern CPUs

>> https://www.biorxiv.org/content/10.1101/2021.07.21.453294v1.full.pdf

The speedups achieved by mm2-fast AVX512 version ranged from 2.5-2.8x, 1.4-1.8x, 1.6-1.9x, and 2.4-3.5x for ONT, PacBio CLR, PacBio HiFi and genome-assembly inputs respectively.

mm2-fast applies multiple optimizations, including SIMD parallelization, efficient cache utilization and a learned index data structure, to accelerate its three main computational modules, i.e., seeding, chaining and pairwise sequence alignment.





□ STRONG: metagenomics strain resolution on assembly graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02419-7

STrain Resolution ON assembly Graphs (STRONG) performs coassembly and binning into MAGs, and stores the coassembly graph prior to variant simplification. This enables extraction of the subgraphs, and their per-sample unitig coverages, for individual single-copy core genes (SCGs) in each MAG.

STRONG is validated using synthetic communities and for a real anaerobic digestor time series generates haplotypes that match those observed from long Nanopore reads.





□ CLEAR: Self-supervised contrastive learning for integrative single cell RNA-seq data analysis

>> https://www.biorxiv.org/content/10.1101/2021.07.26.453730v1.full.pdf

a self-supervised Contrastive LEArning framework for scRNA-seq (CLEAR) profile representation and the downstream analysis. CLEAR overcomes the heterogeneity of the experimental data with a specifically designed representation learning task.

CLEAR does not have any assumptions on the data distribution or the encoder architecture. It can eliminate technical noise & generate representation, which is suitable for a range of downstream analysis, such as clustering, batch effect correction, and time-trajectory inference.





□ MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04288-0

MUlti-REference Normalizer (MUREN) performs RNA-seq normalization using a two-step statistical regression induced from a general principle. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation.

MUREN emphasizes robustness by adopting least trimmed squares (LTS) and least absolute deviations (LAD). A shrinkage of the fold change to zero is reasonable: when the offset is 1, log2(4 + 1) − log2(0 + 1) = 2.3; when the offset is 0.0001, log2(4 + 0.0001) − log2(0 + 0.0001) = 15.3.
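The offset arithmetic above is worth replaying: a tiny pseudo-count lets a zero-count gene produce an enormous apparent fold change, while a pseudo-count of 1 shrinks it toward zero.

```python
import math

def log2_fc(a, b, offset):
    """log2 fold change between counts a and b with a pseudo-count offset."""
    return math.log2(a + offset) - math.log2(b + offset)

print(round(log2_fc(4, 0, 1), 1))       # 2.3
print(round(log2_fc(4, 0, 0.0001), 1))  # 15.3
```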





□ DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00930-x

DeepProg explicitly models patient survival as the objective and is predictive of new patient survival risks. DeepProg constructs a flexible ensemble of hybrid-models (deep-learning / machine learning models) and integrates their outputs following the ensemble learning paradigm.

DeepProg identifies the optimal number of classes of survival subpopulations and uses these classes to construct SVM-ML models, in order to predict a new patient’s survival group. DeepProg adopts a boosting approach and builds an ensemble of models.
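The ensemble step can be caricatured as aggregating per-model survival-group calls by majority vote (a generic sketch of ensemble aggregation; DeepProg's actual members are autoencoder-plus-SVM models, and the labels below are invented):

```python
from collections import Counter

def ensemble_vote(per_model_labels):
    """Majority vote across models for each patient's survival group."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_labels)]

# three models' survival-group calls for four patients
model_calls = [
    ["high", "low", "low",  "high"],
    ["high", "low", "high", "high"],
    ["low",  "low", "low",  "high"],
]
consensus = ensemble_vote(model_calls)   # → ['high', 'low', 'low', 'high']
```

Boosting-style ensembles additionally weight members by held-out performance rather than voting uniformly.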




□ Prediction of DNA from context using neural networks

>> https://www.biorxiv.org/content/10.1101/2021.07.28.454211v1.full.pdf

A model to predict the missing base at any given position from its left and right flanking contexts. The best-performing model is a neural network that obtains an accuracy close to 54% on the human genome, 2 percentage points better than modelling the data with a Markov model.
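The Markov-model baseline can be sketched as predicting the middle base from counts of its flanking k-mers. A toy sketch on an invented repeat sequence, not the paper's model:

```python
from collections import Counter, defaultdict

def train_context_model(seq, k=2):
    """Count which middle base occurs for each (left, right) flank pair."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq) - k):
        context = (seq[i - k:i], seq[i + 1:i + 1 + k])
        counts[context][seq[i]] += 1
    return counts

def predict(counts, left, right):
    """Most frequent middle base seen for this context, or None if unseen."""
    c = counts.get((left, right))
    return c.most_common(1)[0][0] if c else None

model = train_context_model("ACGTACGTACGTACGT", k=2)
base = predict(model, "AC", "TA")   # context AC _ TA → 'G' in this toy sequence
```

On real genomes such context tables plateau quickly, which is the gap the neural network narrows by 2 percentage points.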

And certainly, as the models fall far short of predicting their host DNA perfectly, their "representation" of that DNA may have large imperfections, possibly specific to the DNA in question.





□ ILRA: From contigs to chromosomes: automatic Improvement of Long Read Assemblies

>> https://www.biorxiv.org/content/10.1101/2021.07.30.454413v1.full.pdf

ILRA combines existing and new tools that perform these post-sequencing steps in a completely integrated way, providing fully corrected and ready-to-use genome sequences.

ILRA can alternatively perform BLAST of the final assembly against multiple databases, such as common contaminants, vector sequences, bacterial insertion sequences or ribosomal RNA genes.





□ A unified framework for the integration of multiple hierarchical clusterings or networks from multi-source data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04303-4

A procedure to compare multiple objects built on the same entities, with a focus on trees and networks, in order to define coherent groups of these kinds of structures to be further integrated.

The approach relies on multidimensional scaling and Multiple Factor Analysis, which offer a unified framework to analyze both tree and network structures. It uses binary adjacency matrices with the shortest-path distance for networks, cophenetic distances for trees, and kernels derived from these metrics.
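For trees, the pipeline can be sketched with SciPy: compute cophenetic distances from a hierarchical clustering, then embed them with classical MDS. A minimal sketch on invented data; the MFA and kernel steps are omitted:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((6, 4))                    # 6 entities, 4 features (toy data)

Z = linkage(pdist(X), method="average")   # one hierarchical clustering (tree)
coph = squareform(cophenet(Z))            # 6x6 cophenetic distance matrix

# Classical MDS: double-centre the squared distances, then eigendecompose
D2 = coph ** 2
J = np.eye(6) - np.ones((6, 6)) / 6
B = -0.5 * J @ D2 @ J
w, V = np.linalg.eigh(B)
coords = V[:, ::-1][:, :2] * np.sqrt(np.maximum(w[::-1][:2], 0))  # 2-D layout
```

Running the same embedding on several trees or networks puts them in a common coordinate space where coherent groups of structures can be identified.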




□ Maximum parsimony reconciliation in the DTLOR model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04290-6

The DTLOR model addresses this issue by extending the DTL model to allow some or all of the evolution of a gene family to occur outside of the given species tree, and transfer events to occur from the outside.

An exact polynomial-time algorithm for maximum parsimony reconciliation in the DTLOR model. Maximum parsimony reconciliations can be found in fixed-parameter polynomial time for non-binary gene trees, where the parameter is the maximum branching factor of a node.




□ Using high-throughput multi-omics data to investigate structural balance in elementary gene regulatory network motifs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab577/6349221

Calculating correlation coefficients in longitudinal studies requires appropriate tools to take into account the dependency between (often irregularly spaced) time points as well as latent factors.

In the context of biological networks, multiple studies have already highlighted that GRNs are enriched for balanced patterns and altogether tend to be close to monotone systems.

This framework uses a priori knowledge of the data to infer elementary causal regulatory motifs (namely chains and forks) in the network. It is based on the notions of conditional independence and partial correlation, and can be applied to both longitudinal and non-longitudinal data.

The regulation of gene transcription is mediated by the remodeling of chromatin in near proximity of the TSS. Chains and forks are characterized by conditional independence, and dynamical correlation reduces to standard correlation in the steady-state data & multiple replicates.
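The conditional-independence signature of chains and forks is easy to simulate: in both X→Y→Z and X←Y→Z, X and Z decorrelate once Y is partialled out. A minimal sketch on invented Gaussian data, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# chain motif X -> Y -> Z
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c from both."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

marginal = np.corrcoef(x, z)[0, 1]     # clearly non-zero along the chain
conditional = partial_corr(x, z, y)    # ≈ 0: X ⟂ Z | Y
```

The same vanishing partial correlation holds for the fork, so distinguishing the two motifs needs the a priori knowledge (e.g. which node is the regulator) mentioned above.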




□ MetaLogo: a generator and aligner for multiple sequence logos

>> https://www.biorxiv.org/content/10.1101/2021.08.12.456038v1.full.pdf

MetaLogo draws sequence logos for sequences of different lengths or from different groups in one single plot and aligns multiple logos to highlight sequence-pattern dynamics across groups, thus allowing functional motifs to be investigated from a more fine-grained and dynamic perspective.

MetaLogo allows users to choose the Jensen–Shannon divergence (JSD) as the similarity measurement. The JSD is a method of measuring the similarity between two probability distributions, and is a symmetrized version of the Kullback–Leibler (KL) divergence.
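The JSD between two position-wise residue distributions can be computed directly from its definition as a symmetrized KL divergence against the mixture. A plain implementation in bits, not MetaLogo's code:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits, skipping zero-probability terms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, 1] bits."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# e.g. base frequencies (A, C, G, T) at one logo position in two groups
p = [0.7, 0.1, 0.1, 0.1]
q = [0.1, 0.1, 0.1, 0.7]
```

Unlike raw KL, jsd(p, q) == jsd(q, p) and it stays finite even when one distribution assigns zero probability to a residue, which is why it is a convenient similarity measure between logo positions.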