lens, align.

Long is the time, but the true comes to pass.

Skylight.

2021-03-03 03:03:03 | Science News




□ STMF: Sparse data embedding and prediction by tropical matrix factorization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04023-9

Sparse Tropical Matrix Factorization (STMF) introduces non-linearity into matrix factorization models, which enables discovering the most dominant patterns, leading to a more straightforward visual interpretation compared to other methods for missing value prediction.
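
As a rough illustration of where the non-linearity comes from, the sketch below computes a matrix product in the tropical (max-plus) semiring, where multiplication becomes addition and addition becomes a maximum. The function name and the toy matrices are assumptions made for this example; this is not the STMF implementation.

```python
def tropical_matmul(U, V):
    """Max-plus product: entry (i, j) is the max over k of U[i][k] + V[k][j]."""
    inner = len(V)
    if any(len(row) != inner for row in U):
        raise ValueError("inner dimensions do not match")
    cols = len(V[0])
    return [
        [max(U[i][k] + V[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(U))
    ]


# Toy factor matrices (illustrative values only).
U = [[1.0, 4.0],
     [2.0, 0.0]]
V = [[3.0, 1.0],
     [0.0, 2.0]]

R = tropical_matmul(U, V)
print(R)  # [[4.0, 6.0], [5.0, 3.0]]
```

Each reconstructed entry is decided by a single dominant term rather than a sum of all terms, which is one way to read the "most dominant patterns" remark above.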

Integrative data fusion methods are based on co-factorization of multiple data matrices. DFMF, which uses standard linear algebra, is a variant of penalized matrix tri-factorization that simultaneously factorizes data matrices to reveal hidden associations.





□ GNIPLR: Inference of gene regulatory networks using pseudo-time series data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab099/6134129

GNIPLR (gene networks inference based on projection and lagged regression) infers GRNs from time-series or non-time-series gene expression data.

GNIPLR projects gene data twice, using the LASSO projection (LSP) algorithm and the linear projection (LP) approximation, to produce a linear and monotonic pseudo-time series, and then determines the direction of regulation in combination with lagged regression analyses.





□ FASTRAL: Improving scalability of phylogenomic analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab093/6130791

ASTRAL uses dynamic programming to find an optimal solution to the MQSST (maximum quartet support supertree) problem within a constraint space that it computes from the input.

FASTRAL is based on ASTRAL, but uses a different technique for constructing the constraint space. FASTRAL is a polynomial time algorithm that is statistically consistent under the multi-locus coalescent model.





□ AQC: mRNA codon optimization on quantum computers

>> https://www.biorxiv.org/content/10.1101/2021.02.19.431999v1.full.pdf

An adiabatic quantum computer (AQC) is compared to a standard genetic algorithm (GA) programmed with the same objective function. The AQC is found to be competitive in identifying optimal solutions and future generations of AQCs may be able to outperform classical GAs.

The Leap Hybrid solver is capable of solving codon optimization problems expressed as a BQM with up to ~1,000 amino acids. The goal of the optimization is to find the combination of codons that minimizes the Hamiltonian. An AQC finds the ground state of the input Hamiltonian.
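
To make the BQM formulation concrete, here is a minimal sketch that encodes a two-residue toy problem as binary variables (one per candidate codon) and finds the lowest-energy assignment by brute force on a classical machine. The codon costs, the penalty weight, and all names are hypothetical; the objective in the paper is richer, and no quantum solver or vendor API is used here.

```python
from itertools import product

# Hypothetical candidate codons per position with made-up costs (lower is better).
candidates = [
    [("GCU", 0.3), ("GCC", 0.1)],  # alanine
    [("AAA", 0.2), ("AAG", 0.4)],  # lysine
]
PENALTY = 2.0  # assumed weight of the "exactly one codon per position" constraint

# One binary variable per (position, codon) pair.
variables = [
    (pos, codon, cost)
    for pos, options in enumerate(candidates)
    for codon, cost in options
]


def energy(bits):
    """Toy BQM energy: codon costs plus a quadratic one-hot penalty per position."""
    e = sum(cost for bit, (_, _, cost) in zip(bits, variables) if bit)
    for pos in range(len(candidates)):
        chosen = sum(bit for bit, (p, _, _) in zip(bits, variables) if p == pos)
        e += PENALTY * (chosen - 1) ** 2
    return e


# Exhaustive search stands in for the annealer: it returns the ground state.
best = min(product((0, 1), repeat=len(variables)), key=energy)
chosen_codons = [codon for bit, (_, codon, _) in zip(best, variables) if bit]
print(chosen_codons)  # ['GCC', 'AAA']
```

Exhaustive search doubles in cost with every added variable, which is why a dedicated solver is of interest for sequences of realistic length.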




□ SVFS: Dimensionality reduction using singular vectors

>> https://www.nature.com/articles/s41598-021-83150-y

Let D=[A∣b] be a labeled dataset, where b is the class label and features are columns of matrix A. SVFS uses the signature matrix S_D of D to find the cluster that contains b. It then reduces the size of A by discarding features in the other clusters as irrelevant.

Singular-Vectors Feature Selection (SVFS) uses the signature matrix S_A of the reduced A to partition the remaining features into clusters and choose the most important features from each cluster.

Pseudo-inverses are used in neural learning to solve large least square systems. The complexity of Geninv on a single-threaded processor is O(min(m^3, n^3)), whereas on a multi-threaded processor the time complexity is O(min(m, n)). The complexity of the SVFS algorithm is at most O(max(m^3, n^2)).





□ MultiMAP: Dimensionality Reduction and Integration of Multimodal Data

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431421v1.full.pdf

MultiMAP recovers a single manifold on which all of the data resides and projects the data into a low-dimensional space so as to preserve the manifold structure. MultiMAP is based on Riemannian geometry and algebraic topology, and generalizes the UMAP algorithm to the multimodal setting.

MultiMAP takes as input any number of datasets of potentially differing dimensions. MultiMAP recovers geodesic distances on a single latent manifold on which all of the data is uniformly distributed.

These distances are then used to construct a neighborhood graph (MultiGraph) on the manifold. The data and manifold space are projected into a low-dimensional space by minimizing the cross entropy of the graph in the embedding space with respect to the graph in the manifold space.





□ scGAE: topology-preserving dimensionality reduction for single-cell RNA-seq data using graph autoencoder

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431357v1.full.pdf

scGAE builds a cell graph and uses a multitask-oriented graph autoencoder to preserve topological structure information. scGAE accurately reconstructs developmental trajectories and separates discrete cell clusters under different scenarios, outperforming other deep learning methods.

scGAE combines a deep autoencoder and a graphical model to embed the topological structure of high-dimensional scRNA-seq data in a low-dimensional space. After obtaining the normalized count matrix, scGAE builds the adjacency matrix among cells using the K-nearest-neighbor algorithm.

scGAE maps the count matrix to a low-dimensional latent space with graph attentional layers. scGAE decodes the embedded data to a space with the same dimension as the original data by minimizing the distance between the input data and the reconstructed data.






□ CANTARE: finding and visualizing network-based multi-omic predictive models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04016-8

CANTARE (Consolidated Analysis of Network Topology And Regression Elements) is a workflow for building predictive regression models from network neighborhoods in multi-omic networks. CANTARE models are competitive with random forests and elastic net.

The AUC values of CANTARE models were comparable to those of random forests and penalized regressions, whether the forests or regressions were generated with the universe of multi-omic data or the data underlying the Vnet.

CANTARE models are subject to the general constraints of linear regressions, such as linearity with log odds or continuous outcomes, normal distribution of the errors, and little to no multicollinearity between predictors.





□ scMM: Mixture-of-experts multimodal deep generative model for single-cell multiomics data analysis

>> https://www.biorxiv.org/content/10.1101/2021.02.18.431907v1.full.pdf

scMM is based on a mixture-of-experts multimodal deep generative model and achieves end-to-end learning by modeling raw count data in each modality based on different probability distributions.

Using the learned standard deviation σd for the dth dimension, the other dimensions are fixed to zero and the dth dimension is changed linearly from −5σd to 5σd in steps of 0.5σd.
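
The traversal grid described above can be sketched in a few lines. The function name, its arguments, and the example values are assumptions for illustration; passing the resulting latent vectors through the decoder is not shown.

```python
def latent_traversal(sigma_d, d, n_dims, span=5.0, step=0.5):
    """Latent vectors that vary only dimension d, from -span*sigma_d to +span*sigma_d."""
    n_steps = int(round(2 * span / step)) + 1
    points = []
    for i in range(n_steps):
        z = [0.0] * n_dims                    # all other dimensions fixed to zero
        z[d] = (-span + i * step) * sigma_d   # linear sweep along dimension d
        points.append(z)
    return points


# Example: a 3-dimensional latent space, sweeping dimension 1 with sigma_d = 2.0.
points = latent_traversal(sigma_d=2.0, d=1, n_dims=3)
print(len(points))            # 21
print(points[0], points[-1])  # [0.0, -10.0, 0.0] [0.0, 10.0, 0.0]
```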

scMM uses a Laplace prior with different scale values in each dimension, which encourages disentanglement of information by learning axis-aligned representations.




□ SSRE: Cell Type Detection Based on Sparse Subspace Representation and Similarity Enhancement

>> https://www.sciencedirect.com/science/article/pii/S1672022921000383

SSRE computes the sparse representation similarity of cells based on subspace theory, and designs a gene selection process and an enhancement strategy based on the characteristics of different similarities to learn more reliable similarities.

SSRE applies the eigengap method to the learned similarity matrix to estimate the number of clusters. Eigengap is a typical cluster number estimation method: it determines the number of clusters by calculating the maximum gap between eigenvalues of a Laplacian matrix.
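
A minimal sketch of the eigengap heuristic follows, assuming the Laplacian eigenvalues have already been computed elsewhere. The function name and the example eigenvalues are made up for illustration, and SSRE's own implementation may differ in details such as which Laplacian it uses.

```python
def estimate_n_clusters(eigenvalues):
    """Return k such that the gap between the k-th and (k+1)-th smallest eigenvalue is largest."""
    vals = sorted(eigenvalues)
    gaps = [b - a for a, b in zip(vals, vals[1:])]
    # Index of the largest gap, converted to a 1-based cluster count.
    return max(range(len(gaps)), key=gaps.__getitem__) + 1


# Three near-zero eigenvalues followed by a jump suggest three clusters.
example_eigenvalues = [0.0, 0.01, 0.02, 0.85, 0.9, 1.1]
n_clusters = estimate_n_clusters(example_eigenvalues)
print(n_clusters)  # 3
```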




□ AMBIENT: Accelerated Convolutional Neural Network Architecture Search for Regulatory Genomics

>> https://www.biorxiv.org/content/10.1101/2021.02.25.432960v1.full.pdf

AMBIENT maps a summary of a given dataset to the initial state of the controller model and generates an optimal task-specific architecture. AMBIENT is more efficient than existing methods, allowing it to identify architectures of comparable accuracy at an accelerated pace.

AMBIENT uses a 10-layer model search space to evaluate the optimal architecture differences. It generates highly accurate CNN architectures for sequences of diverse functions, while substantially reducing the computing cost of conventional Neural Architecture Search.





□ Genozip - A Universal Extensible Genomic Data Compressor

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab102/6135077

Genozip is designed to be a general-purpose software and a development framework for genomic compression by providing five core capabilities – universality (support for all common genomic file formats), high compression ratios, speed, feature-richness, and extensibility.

Genozip supports all common genomic file formats - FASTQ, SAM/BAM/CRAM, VCF, GVF, FASTA, PHYLIP, and 23andMe. Genozip is architected with a separation of the Genozip Framework from file-format-specific Segmenters and data-type-specific Codecs.





□ AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431517v1.full.pdf

AirLift is a fast and comprehensive method for moving alignments from one genome to another. AirLift reduces the number of reads that need to be fully mapped from the entire read set, and the overall execution time to remap read sets between two reference genome versions.

AirLift is the first tool that provides BAM-to-BAM remapping results of a read data set on which downstream analysis can be immediately performed. AirLift identifies similar rates of SNPs and Indels as the full mapping baseline.





□ iMAP: integration of multiple single-cell datasets by adversarial paired transfer networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02280-8

iMAP combines the two kinds of unsupervised deep network structures—autoencoders and generative adversarial networks. A novel autoencoder structure is used to build low-dimensional representations of the biological contents of cells disentangled from the technical variations.

The iMAP framework consists of two stages: building the batch-ignorant representations for all cells, and then guiding the batch effect removal of the original high-dimensional expression profiles. The input expression vectors for iMAP were log-transformed TPM-like values.

iMAP regards the cells in the mutual nearest neighbors (MNN) pairs as initial seeds, and adopts a random walk-based method to enroll new pairs, through successively selecting a cell from the kNNs (k nearest neighbors) of the seeds within each batch.
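
The seed-selection step can be illustrated with a small mutual-nearest-neighbors search between two batches. This is a sketch under simplifying assumptions (Euclidean distance, brute-force search, hypothetical function names and toy coordinates); the random-walk extension that enrolls further pairs is not shown.

```python
def knn_indices(queries, refs, k):
    """For each query point, the indices of its k nearest reference points."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        sorted(range(len(refs)), key=lambda j: sqdist(q, refs[j]))[:k]
        for q in queries
    ]


def mnn_pairs(batch_a, batch_b, k):
    """Pairs (i, j) where cell j is among i's neighbors in B and i is among j's in A."""
    a_to_b = knn_indices(batch_a, batch_b, k)
    b_to_a = knn_indices(batch_b, batch_a, k)
    return [
        (i, j)
        for i, neighbors in enumerate(a_to_b)
        for j in neighbors
        if i in b_to_a[j]
    ]


# Toy 2-D "expression profiles" for two batches (illustrative values only).
batch_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
batch_b = [(0.1, 0.2), (5.2, 4.9)]

pairs = mnn_pairs(batch_a, batch_b, k=1)
print(pairs)  # [(0, 0), (1, 1)]
```

The third cell of batch A has a nearest neighbor in batch B, but that neighbor prefers a different cell, so the pair is not mutual and is left out.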




□ TransPi - a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly

>> https://www.biorxiv.org/content/10.1101/2021.02.18.431773v1.full.pdf

TransPi utilizes various assemblers and k-mers (i.e. k-length sequences used for the assembly) to generate an over-assembled transcriptome that is then reduced to a non-redundant consensus transcriptome with EvidentialGene.

TransPi performs multiple assemblies with different parameters to then get a non-redundant consensus assembly. It also performs other valuable analyses such as quality assessment of the assembly, BUSCO scores, Transdecoder (ORFs), and gene ontologies (Trinotate).





□ Deep propensity network using a sparse autoencoder for estimation of treatment effects

>> https://academic.oup.com/jamia/advance-article-abstract/doi/10.1093/jamia/ocaa346/6139936

Drawing causal estimates from observational data is problematic, because datasets often contain underlying bias. To examine causal effects, it is important to evaluate what-if scenarios—the so-called counterfactuals.

DPN-SA (Deep Propensity Network using a Sparse Autoencoder) is an architecture for propensity score matching and counterfactual prediction that tackles the problems of high dimensionality, nonlinear/nonparallel treatment assignment, and residual confounding when estimating treatment effects.





□ IRIS-FGM: an integrative single-cell RNA-Seq interpretation system for functional gene module analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab108/6140779

IRIS-FGM (integrative scRNA-Seq interpretation system for functional gene module analysis) supports the investigation of FGMs and cell clustering using scRNA-Seq data.

Empowered by QUBIC2, IRIS-FGM can identify co-expressed and co-regulated FGMs, predict types/clusters, identify differentially expressed genes, and perform functional enrichment analysis. IRIS-FGM also applies Seurat objects that can be easily used in the Seurat vignettes.





□ ALN: Decoupling alignment strategy from feature quantification using a standard alignment incidence data structure

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431379v1.full.pdf

ALNtools processes next-generation sequencing read alignments into a sparse compressed incidence matrix and stores it in a pre-defined binary format for efficient downstream analyses. It enables us to compare, contrast, or combine the results of different alignment strategies.

ALN uses the EMASE-Zero algorithm. In combination with alntools (which generates a compressed three-dimensional incidence matrix), Zero estimates the expected read counts over 10 times faster than RSEM. Zero generalizes the fast hierarchical EM to any decent alignment strategy.





□ CellWalker integrates single-cell and bulk data to resolve regulatory elements across cell types in complex tissues

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02279-1

Using a graph diffusion implemented via a random walk with restarts, CellWalker computes a global influence matrix that relates every cell and label to every other cell and label based on information flow between them in the network.
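
A generic random walk with restarts can be sketched as a fixed-point iteration, shown below on a three-node toy graph. The function name, the restart probability, and the graph are assumptions for illustration; CellWalker's actual network combines cell and label nodes with its own edge weights.

```python
def rwr_influence(adj, restart=0.5, n_iter=200):
    """Influence matrix F solving F = restart * I + (1 - restart) * F @ W.

    W is the row-normalized adjacency matrix; every node is assumed to have
    at least one edge. Row i of F is the stationary distribution of a walk
    that restarts at node i with probability `restart`.
    """
    n = len(adj)
    W = [[a / sum(row) for a in row] for row in adj]
    F = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n_iter):
        F = [
            [
                (restart if i == j else 0.0)
                + (1 - restart) * sum(F[i][k] * W[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return F


# Path graph 0 - 1 - 2 (illustrative).
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
influence = rwr_influence(adjacency)
for row in influence:
    print([round(x, 4) for x in row])
```

Each row sums to one, and influence decays with distance from the starting node, which is the "information flow" reading used above.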

CellWalker takes as input scATAC-seq data and labeling information, either directly in the form of marker genes, or by processing scRNA-seq data to generate labels (for example using Seurat). scATAC-seq data can optionally be converted into a cell-by-gene matrix using software such as SnapATAC, Cicero, or ArchR.





□ META-CS: Accurate SNV detection in single cells by transposon-based whole-genome amplification of complementary strands

>> https://www.pnas.org/content/118/8/e2013106118

META-CS achieved the highest accuracy in terms of detecting single-nucleotide variations, and provided potential solutions for the identification of other genomic variants, such as insertions, deletions, and structural variations in single cells.

With META-CS, a mutation can be identified with as few as four reads, which significantly reduces sequencing cost. In contrast to the 30 to 60× sequencing depth commonly used for single-cell SNV identification, most cells were sequenced between 3 and 8× in this work.





□ RaptGen: A variational autoencoder with profile hidden Markov model for generative aptamer discovery

>> https://www.biorxiv.org/content/10.1101/2021.02.17.431338v1.full.pdf

RaptGen is a variational autoencoder for aptamer generation. RaptGen uses a profile hidden Markov model decoder to efficiently create a latent space in which sequences form clusters based on motif structure.

RaptGen learns the relationship between sequencing data and latent space embeddings. RaptGen constructs a latent space based on sequence similarity. It can also propose candidates according to the activity distribution by transforming a latent representation into a probabilistic model.





□ PhylEx: Accurate reconstruction of clonal structure via integrated analysis of bulk DNA-seq and single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431009v1.full.pdf

PhylEx is a clonal-tree reconstruction method that integrates bulk genomics and single-cell transcriptomics data. In addition to the clonal tree, PhylEx also assigns single cells to clones, which effectively produces clonal expression profiles, and generates clonal genotypes.

PhylEx improves over bulk-based clone reconstruction methods and should be the preferred choice for inferring the guide tree needed for Cardelino. PhylEx is a strong alternative to DLP scDNA-seq for mapping expression profiles to clones using methods such as clonealign.




□ coupleCoC+: an information-theoretic co-clustering-based transfer learning framework for the integrative analysis of single-cell genomic data

>> https://www.biorxiv.org/content/10.1101/2021.02.17.431728v1.full.pdf

coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data.

coupleCoC+ automatically adjusts for sequencing depth, so no separate normalization for sequencing depth is needed. coupleCoC+ is guaranteed to converge because its objective functions are non-increasing in each iteration.




□ multistrain SIRS: Localization, epidemic transitions, and unpredictability of multistrain epidemics with an underlying genotype network

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008606

A multistrain Susceptible-Infectious-Recovered-Susceptible (multistrain SIRS) epidemic model with an underlying genotype network allows the disease to evolve along plausible mutation pathways as it spreads in a well-mixed population.

The genotype network does not affect the classic epidemic threshold but localizes outbreaks around key strains and yields a second immune invasion threshold below which the epidemics follow almost cyclical and chaos-like dynamics.




□ Squidpy: a scalable framework for spatial single cell analysis

>> https://www.biorxiv.org/content/10.1101/2021.02.19.431994v1.full.pdf

Spatial graphs encode spatial proximity, and are, depending on data resolution, flexible in order to support the variety of neighborhood metrics that spatial data types and users may require.

Squidpy implements a pipeline based on Scikit-image for preprocessing and segmenting images, extracting morphological, texture, and deep learning-powered features. Squidpy’s Image Container stores the image with an on-disk/in-memory switch based on xArray and Dask.



□ VSAT: Variant-set association test for generalized linear mixed model

>> https://onlinelibrary.wiley.com/doi/10.1002/gepi.22378

An adjustment in the generalized linear mixed model (GLMM) framework, which accounts for both sample relatedness and non-Gaussian outcomes, has not yet been attempted.

VSAT, a new Variant-Set Association Test, is a powerful and efficient analysis tool in GLMM for examining the association between a set of omics variants and correlated phenotypes.





□ Estimating DNA methylation potential energy landscapes from nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.02.22.431480v1.full.pdf

A novel approach characterizes the probability distribution of methylation within a genomic region of interest using a parametric correlated potential energy landscape (CPEL) model that is consistent with methylation means and pairwise correlations at each CpG site.

The estimation approach is based on the expectation-maximization (EM) algorithm. This method determines values for the parameters of the CPEL model by maximizing the likelihood that the observed nanopore sequencing data have been generated by the estimated model.

Within each DNA fragment, the C’s of all CG dinucleotides marked by 1 are replaced with M’s, a step that modifies the DNA sequence within each fragment by incorporating the methylation, as determined by the methylation states drawn from the ground truth CPEL model.



□ NanoMethPhase: Megabase-scale methylation phasing using nanopore long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02283-5

SNVs are called from nanopore sequencing data using Clair. Clair is designed to call germline small variants from nanopore reads based on pileup format, and the authors demonstrated its superiority over other pileup-based tools.

NanoMethPhase and SNVoter detect allele-specific methylation (ASM) from a single sample using only nanopore sequence data with redundant sequence coverage as low as about 10×.




□ GuideStar: bioinformatics tool for gene characterization-case study:

>> https://www.biorxiv.org/content/10.1101/2021.02.25.432957v1.full.pdf

GUIdeStaR is a ready-to-plug-in-to-AI database that integrates five important nucleotide elements and structures: G-quadruplex, uORF, IRES, small RNA, and repeats.




□ Normalization of single-cell RNA-seq counts by log(x + 1) or log(1 + x)

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab085/6155989

While it doesn’t matter whether one uses log(x + 1) or log(1 + x), the filtering and normalization applied to counts can affect comparative estimates in non-intuitive ways.

The SCnorm normalization is based on a preliminary filter for all cells with at least one count. Indeed, there have been reports of problems with SCnorm when applying the method to sparse datasets with many zeroes.




□ Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0247647

The algorithm relies on Radon-Nikodym derivatives, and the authors establish criteria for choosing a finite set of “waypoints” that makes it possible to reduce the problem to the discrete-time case, while ensuring that particle degeneracy remains under control.

The authors take the Auxiliary Particle Filter for discrete-time models and generalise it to continuous-time and -space Markov jump processes. They use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization.





□ ASpli: Integrative analysis of splicing landscapes through RNA-Seq assays

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab141/6156815

ASpli is a computational suite implemented in the R statistical language that allows the identification of changes in both annotated and novel alternative splicing events and can deal with simple, multi-factor, or paired experimental designs.

ASpli considers the same GLM model, applied to different sets of reads and junctions, in order to compute complementary splicing signals. The consolidation of these signals resulted in a robust proxy of the occurrence of splicing alterations.





□ StationaryOT: Optimal transport analysis reveals trajectories in steady-state systems

>> https://www.biorxiv.org/content/10.1101/2021.03.02.433630v1.full.pdf

The problem of inferring cell trajectories from single-cell measurements has been a major topic in the single-cell analysis community, with different methods developed for equilibrium and non-equilibrium systems.

StationaryOT is mathematically motivated in a natural way by the hypothesis of Waddington’s metaphor of an epigenetic landscape. StationaryOT with either entropic or quadratic regularisation consistently produces more accurate fate estimates compared to the scVelo method.




□ Mako: a graph-based pattern growth approach to detect complex structural variants

>> https://www.biorxiv.org/content/10.1101/2021.03.01.433465v1.full.pdf

Though long-read sequencing technologies bring promising opportunities to characterize complex structural variants (CSVs), their application is currently limited to small-scale projects and the methods for CSV discovery are also underdeveloped.

Mako utilizes a bottom-up guided model-free strategy to detect CSVs from paired-end short-read sequencing. Mako uses a graph to build connections of mutational signals derived from abnormal alignment, providing the potential breakpoint connections of CSVs.




INFINITE.

2021-03-03 03:01:06 | Science News



□ d-PBWT: dynamic positional Burrows-Wheeler transform

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab117/6149123

Durbin’s positional Burrows-Wheeler transform (PBWT) is a scalable data structure for haplotype matching. It has been successfully applied to identical by descent (IBD) segment identification and genotype imputation.

d-PBWT is a dynamic data structure where the reverse prefix sorting at each position is stored with linked lists. The authors systematically investigated variations of set maximal match and long match query algorithms: while they all have average case time complexity.
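
For readers unfamiliar with reverse prefix sorting, the sketch below builds the positional prefix arrays of the static PBWT for binary haplotypes by repeated stable partitioning. It covers only the ordering step: the function name and the toy panel are assumptions, arrays are used instead of the linked lists of d-PBWT, and divergence arrays and match queries are omitted.

```python
def pbwt_prefix_arrays(haplotypes):
    """Orderings a_0..a_N, where a_k sorts haplotypes by their reversed prefix before site k."""
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    arrays = [order[:]]
    for k in range(n_sites):
        # Stable partition by the allele at site k keeps the earlier ordering
        # as the tie-breaker, which yields the reverse-prefix sort.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        arrays.append(order[:])
    return arrays


# Toy panel: 4 haplotypes, 3 biallelic sites (illustrative values only).
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
arrays = pbwt_prefix_arrays(panel)
print(arrays[-1])  # [3, 0, 1, 2]
```

Haplotypes that are adjacent in the final ordering share the longest matching suffixes, which is what makes haplotype matching efficient.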




□ Graviton2: A generalized approach to benchmarking genomics workloads in the cloud: Running the BWA read aligner

>> https://aws.amazon.com/blogs/publicsector/generalized-approach-benchmarking-genomics-workloads-cloud-bwa-read-aligner-graviton2/

The most cost-effective instance type turns out to be the m6g.8xlarge, with a mean runtime of 258 sec and a run cost of $0.88. The most cost-effective x86_64 instance type was the r5dn.8xlarge, with a mean runtime of 237 sec. The arm64 architecture provides optimal performance.

Graviton2 utilizes 64-bit Arm Neoverse cores and delivers up to 40 percent better price performance over comparable current generation x86-based instances. The authors recompiled the Burrows-Wheeler Aligner (BWA) application for Arm-based chips and evaluated their cost effectiveness.




□ Chronos: a CRISPR cell population dynamics model

>> https://www.biorxiv.org/content/10.1101/2021.02.25.432728v1.full.pdf

Chronos is an algorithm for inferring gene knockout fitness effects based on an explicit model of the dynamics of cell proliferation after CRISPR gene knockout.

Chronos addresses sgRNA efficacy, variable screen quality and cell growth rate, and heterogeneous DNA cutting outcomes through a mechanistic model of the experiment.

Chronos also directly models the readcount level data using a more rigorous negative binomial noise model, rather than modeling log-fold change values with a Gaussian distribution as is typically done.





□ FICT: Cell Type Assignments for Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2021.02.25.432887v1.full.pdf

FICT (FISH Iterative Cell Type assignment) maximizes a joint probabilistic likelihood function that takes into account both the expression of the genes in each cell and the joint multi-variate spatial distribution of cell types.

FICT can correctly determine both expression and neighborhood parameters for different cell types improving on methods that rely only on expression levels or do not take into account the complete neighborhood of each cell.

FICT can also identify cell sub-types that are similar in terms of their expression but differ in their spatial organization.





□ MIGNON: A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008748

MIGNON is a complete and versatile workflow that exploits all the information contained in RNA-Seq data and produces not only the conventional normalized gene expression matrix, but also an annotated VCF file per sample with the corresponding mutational profile.

Gene expression and LoF variants are integrated by doing an in-silico knockdown of genes that present a LoF variant. MIGNON can combine both files to model signaling pathway activities through an integrative functional analysis using the mechanistic Hipathia algorithm.





□ Triku: a feature selection method based on nearest neighbors for single-cell data

>> https://www.biorxiv.org/content/10.1101/2021.02.12.430764v1.full.pdf

Triku is a feature selection (FS) method that selects genes that show an unexpected distribution of zero counts and whose expression is localized in cells that are transcriptomically similar.

Triku identifies genes that are locally overexpressed in groups of neighboring cells by inferring the distribution of counts in the vicinity of a cell and computing the expected distribution of counts.

The Wasserstein distance between the observed and the expected distributions is computed. Higher distances imply that the gene is locally expressed in a subset of transcriptionally similar cells. A subset of relevant features is selected using a cutoff value for the distance.
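
The distance-and-cutoff step can be sketched for discrete count distributions, where the first Wasserstein distance reduces to the summed absolute difference of the two cumulative distributions. The gene names, probabilities, and cutoff below are invented for illustration, and Triku's own procedure for deriving the expected distribution and the cutoff is not reproduced.

```python
def wasserstein_1d(p, q):
    """W1 distance between two distributions over counts 0, 1, 2, ... (probability lists)."""
    length = max(len(p), len(q))
    p = list(p) + [0.0] * (length - len(p))
    q = list(q) + [0.0] * (length - len(q))
    cdf_p = cdf_q = 0.0
    distance = 0.0
    for pk, qk in zip(p, q):
        cdf_p += pk
        cdf_q += qk
        distance += abs(cdf_p - cdf_q)
    return distance


# Hypothetical (observed, expected) count distributions per gene.
genes = {
    "geneA": ([0.5, 0.0, 0.5], [0.0, 1.0, 0.0]),
    "geneB": ([0.6, 0.4], [0.5, 0.5]),
}
CUTOFF = 0.5  # assumed threshold for this example

distances = {name: wasserstein_1d(obs, exp) for name, (obs, exp) in genes.items()}
selected = [name for name, dist in distances.items() if dist > CUTOFF]
print(selected)  # ['geneA']
```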





□ kmtricks: Efficient construction of Bloom filters for large sequencing data collections

>> https://www.biorxiv.org/content/10.1101/2021.02.16.429304v1.full.pdf

kmtricks is a novel approach for generating Bloom filters from terabase-sized sequencing data. Kmtricks is an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting hashes instead of k-mers.

Kmtricks takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. HowDe-SBT/kmtricks is 1-1.5x faster to construct than HowDe-SBT/KMC, 3-4x faster than HowDe-SBT/Jellyfish, 2x faster than Mantis.
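
As background, a plain Bloom filter over the k-mers of a sequence can be sketched as follows. This is a textbook construction with hypothetical class and function names and arbitrary sizes; it does not reproduce kmtricks' partitioned, hash-counting pipeline.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through several salted hash functions."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


bloom = BloomFilter(n_bits=1024, n_hashes=3)
for kmer in kmers("ACGTACGTGA", 4):
    bloom.add(kmer)

print("ACGT" in bloom)  # True
```

Queries for k-mers that were never inserted usually return False, but a Bloom filter can report false positives, so absence checks are probabilistic.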





□ Supervised biomedical semantic similarity

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431402v1.full.pdf

This approach is independent of the semantic aspects, the specific implementation of knowledge graph-based similarity and the ML algorithm employed in regression.

This approach is able to learn a supervised semantic similarity that outperforms static semantic similarity both using KG embeddings and standard taxonomic SSMs, obtaining more accurate similarity values.





□ MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03996-x

The mixed-counters quotient filter (MQF) is a new variant of the counting quotient filter (CQF) with novel counting and labeling systems. MQF adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios.

MQF comes with a novel labeling system that supports associating each k-mer with multiple values to avoid redundant duplication of k-mers' keys in separate data structures. MQF needs just an extra O(N) operation to update the block labels, where N is the number of its unique k-mers.





□ MUFFIN: Metagenomics workflow for hybrid assembly, differential coverage binning, metatranscriptomics and pathway analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008716

MUFFIN utilizes the advantages of both sequencing technologies. Short-reads provide a better representation of low abundant species due to their higher coverage based on read count. Long-reads are utilized to resolve repeats for better genome continuity.

MUFFIN is capable of enhancing the pathway results present by incorporating the data as well as the general expression level of the genes. MUFFIN executes a de novo assembly of the RNA-seq reads instead of a mapping of the reads against the MAGs to avoid bias during the mapping.





□ kLDM: Inferring Multiple Metagenomic Association Networks based on the Variation of Environmental Factors

>> https://www.sciencedirect.com/science/article/pii/S1672022921000206

The k-Lognormal-Dirichlet-Multinomial (kLDM) model estimates multiple association networks that correspond to specific environmental conditions, and simultaneously infers microbe-microbe and environmental factor-microbe associations for each network.

kLDM adopts a split-merge algorithm to estimate the number of environmental conditions and sparse OTU-OTU and EF-OTU associations under each environmental condition.




□ Variance Penalized On-Policy and Off-Policy Actor-Critic

>> https://arxiv.org/pdf/2102.01985.pdf

An on- and off-policy actor-critic algorithm for a variance-penalized objective leverages multi-timescale stochastic approximations, where both value and variance critics are estimated in TD style.

The algorithm converges to locally optimal policies for finite state-action Markov decision processes and results in trajectories with much lower variance compared to the risk-neutral and existing indirect variance-penalized counterparts.




□ scSorter: assigning cells to known cell types according to marker genes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02281-7

scSorter is based on the observation that marker genes, which are expected to be expressed at higher levels in the corresponding cell types, may in practice be expressed at a very low level in many of those cells.

scSorter makes full use of this feature and allows cells to express marker genes at either an elevated level or a base level, without a direct penalty.




□ BiSEK: a platform for a reliable differential expression analysis

>> https://www.biorxiv.org/content/10.1101/2021.02.22.432271v1.full.pdf

Biological Sequence Expression Kit (BiSEK) is a graphical user interface-based platform for differential expression analysis (DEA), dedicated to reliable inquiry. BiSEK is based on a novel algorithm that tracks discrepancies between the data and the statistical model design.

PaDETO (Partition Distance Explanation Tree Optimizer) tracks discrepancies in the data, alerts about problems and offers the best solutions considering the user setup, to increase reliability of the DEA output.

BiSEK enables differential-expression analysis of groups of genes, to identify affected pathways, without relying on the significance of genes comprising them.




□ WLasso: A variable selection approach for highly correlated predictors in high-dimensional genomic data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab114/6146520

Regularized approaches are classically used to perform variable selection in high-dimensional linear models. However, these methods can fail in highly correlated settings.

WLasso rewrites the initial high-dimensional linear model to remove the correlation between the biomarkers (predictors) and then applies the generalized Lasso criterion.





□ Flanker: a tool for comparative genomics of gene flanking regions

>> https://www.biorxiv.org/content/10.1101/2021.02.22.432255v1.full.pdf

Flanker performs alignment-free clustering of gene flanking sequences in a consistent format, allowing investigation of MGEs without prior knowledge of their structure.

Flanker clusters flanking sequences based on Mash distances, allowing easy comparison of similarity and of the extent of this similarity across sequences.

Flanker can be flexibly parameterised to finetune outputs by characterising upstream and downstream regions separately and investigating variable lengths of flanking sequence.
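
As a rough illustration of the distance that this clustering relies on, the sketch below computes a Mash-style distance between two sequences. It is a simplification under stated assumptions: it uses the exact Jaccard index of the full k-mer sets rather than a MinHash sketch, the function names and the small k are chosen for the toy example, and it is not Flanker's own code.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(seq_a, seq_b, k=4):
    """Mash-style distance D = -ln(2j / (1 + j)) / k, with j the Jaccard index.

    Assumption of this sketch: j is computed exactly from the full k-mer
    sets, whereas Mash estimates it from MinHash sketches.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    jaccard = len(a & b) / len(a | b)
    if jaccard == 0:
        return 1.0  # cap used in this sketch when no k-mers are shared
    return -math.log(2 * jaccard / (1 + jaccard)) / k


# Toy flanking sequences (made up for illustration).
flank_1 = "ACGTACGTAA"
flank_2 = "ACGTACGTCC"
print(mash_distance(flank_1, flank_2))
```

Pairwise distances of this kind can then be fed to any clustering routine; the choice of k and of the clustering threshold is left open here.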




□ ESCO: single cell expression simulation incorporating gene co-expression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab116/6149079

ESCO adopts the idea of the copula to impose gene co-expression, while preserving the highlights of available simulators, which perform well for simulation of gene expression marginally.

Using ESCO, the authors assess the performance of imputation methods on GCN recovery and find that imputation generally helps GCN recovery when the data are not too sparse, and that the ensemble imputation method works best among leading methods.





□ CellWalkR: An R Package for integrating single-cell and bulk data to resolve regulatory elements

>> https://www.biorxiv.org/content/10.1101/2021.02.23.432593v1.full.pdf

CellWalkR implements and extends a previously introduced network-based model that relies on a random walk with restarts model of diffusion. CellWalkR can optionally run this step on a GPU using TensorFlow for a greater than 15-fold speedup.

The output is a large influence matrix, portions of which are used for cell labeling, determining label similarity, embedding cells into low dimensional space, and mapping regulatory regions to cell types.




□ DeTOKI identifies and characterizes the dynamics of chromatin topologically associating domains in a single cell

>> https://www.biorxiv.org/content/10.1101/2021.02.23.432401v1.full.pdf

deTOKI (decode TAD boundaries that keep chromatin interaction insulated) identifies topologically associating domains (TADs) from ultra-sparse Hi-C data. Using non-negative matrix factorization, this algorithm seeks out regions that insulate the genome into blocks with minimal chance of clustering.

deTOKI applies non-negative matrix factorization (NMF) to decompose the Hi-C contact matrix into genome domains that may be spatially segregated in 3D space. Alternative locally optimal solutions in the structure ensemble are obtained through multiple random initializations.




□ REVA as a Well-curated Database for Human Expression-modulating Variants

>> https://www.biorxiv.org/content/10.1101/2021.02.24.432622v1.full.pdf

REVA is a manually curated database of over 11.8 million experimentally tested noncoding variants with expression-modulating potential.

REVA provides high-quality, experimentally tested expression-modulating variants with extensive functional annotations, which will be useful for users in the noncoding variants community.





□ scMoC: Single-Cell Multi-omics clustering

>> https://www.biorxiv.org/content/10.1101/2021.02.24.432644v1.full.pdf

scMoC is designed to cluster paired multimodal datasets that measure both single-cell transcriptomics sequencing (scRNA-seq) and single-cell transposase accessibility chromatin sequencing (scATAC-seq).

scMoC encompasses an RNA-guided imputation strategy to cope with the higher sparsity of the ATAC data. scMoC builds on the idea that cell-cell similarities can be better estimated from the RNA profiles and then used to define a neighborhood from which to impute the ATAC data.




□ sweetD: An R package using Hoeffding's D statistic to visualise the dependence between M and A for large numbers of gene expression samples

>> https://www.biorxiv.org/content/10.1101/2021.02.24.432640v1.full.pdf

sweetD uses Hoeffding’s D statistic as a non-parametric measure of dependence between M and A, so that large numbers of MA plots need not be inspected. If a sample’s D statistic is high, there is a relationship between M and A; this relationship can be non-monotonic.

sweetD calculates Hoeffding's D statistic for all samples relative to the median or to each other. It can take any log-transformed gene expression matrix as input and can simultaneously visualise changes in the distribution of Hoeffding's D statistic.
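
To make the statistic concrete, here is a minimal sketch of M and A values and of Hoeffding's D in plain Python. It is not the sweetD package (which is written in R). The function names are hypothetical, the implementation assumes there are no tied values, and it uses the common scaling by 30 so that a perfect monotone relationship gives D = 1.

```python
def ma_values(sample, reference):
    """M (difference) and A (average) for two log-transformed expression vectors."""
    m = [s - r for s, r in zip(sample, reference)]
    a = [(s + r) / 2 for s, r in zip(sample, reference)]
    return m, a


def hoeffding_d(x, y):
    """Hoeffding's D for paired samples, assuming no ties and n >= 5."""
    n = len(x)
    if n != len(y) or n < 5:
        raise ValueError("need two equal-length samples with n >= 5")
    d1 = d2 = d3 = 0.0
    for i in range(n):
        r = 1 + sum(xj < x[i] for xj in x)  # rank of x[i]
        s = 1 + sum(yj < y[i] for yj in y)  # rank of y[i]
        # bivariate rank: 1 + number of points below x[i] in both coordinates
        q = 1 + sum(xj < x[i] and yj < y[i] for xj, yj in zip(x, y))
        d1 += (q - 1) * (q - 2)
        d2 += (r - 1) * (r - 2) * (s - 1) * (s - 2)
        d3 += (r - 2) * (s - 2) * (q - 1)
    numerator = (n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3
    denominator = n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
    return 30 * numerator / denominator


# Toy log-expression values (made up): sample versus a reference profile.
sample = [1.0, 2.5, 3.1, 4.8, 6.0, 7.7]
reference = [1.2, 2.0, 3.5, 4.1, 6.6, 7.0]
m, a = ma_values(sample, reference)
print(hoeffding_d(a, m))
```

Because D is built from ranks of both variables jointly, it can flag dependence that a correlation coefficient would miss, which is the property the summary above refers to.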





□ Strainberry: Automated strain separation in low-complexity metagenomes using long reads

>> https://www.biorxiv.org/content/10.1101/2021.02.24.429166v1.full.pdf

Strainberry combines a strain-oblivious assembler with the careful use of a long-read variant calling and haplotyping tool, followed by a novel component that performs long-read metagenome scaffolding.

Strainberry is able to accurately separate strains using long reads. An average depth of coverage of 60-80X suffices to assemble individual strains of low-complexity metagenomes with almost complete coverage and sequence identity exceeding 99.9%.




□ Lasso.TopX: Machine Learning Approaches Identify Genes Containing Spatial Information From Single-Cell Transcriptomics Data

>> https://www.frontiersin.org/articles/10.3389/fgene.2020.612840/full

The NN approach utilizes weak supervision for linear regression to accommodate uncertain or probabilistic training labels. This is especially useful for taking advantage of training data generated from DistMap’s probabilistic mapping output.

Lasso.TopX leverages linear models using the least absolute shrinkage and selection operator (Lasso), which is applied to high-dimensional single-cell sequencing data in order to accurately identify genes that contain spatial information.
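
The selection effect of the Lasso can be illustrated with a small coordinate-descent solver. This is a generic sketch, not the Lasso.TopX pipeline: the objective (1/2n)||y - Xb||^2 + lam * ||b||_1, the function names and the toy data are all assumptions made for the example, and every column of X is assumed to be non-zero.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Toy data: the response depends only on the first "gene" (column 0).
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]
beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(beta, selected)
```

The coefficient of the uninformative column is set exactly to zero, which is what makes the Lasso usable as a gene-selection step; how the selected genes are ranked afterwards is not covered here.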





□ CINS: Cell Interaction Network inference from Single cell expression data

>> https://www.biorxiv.org/content/10.1101/2021.02.22.432206v1.full.pdf

CINS combines Bayesian network analysis with regression-based modeling to identify differential cell type interactions and the proteins that underlie them.

CINS learns a regression model with ligand-target interaction matrix that identifies the key ligands and targets that participate in the interactions between these cell types. CINS correctly identifies known interacting cell type pairs and ligands associated with these interactions.




□ MONTAGE: a new tool for high-throughput detection of mosaic copy number variation

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-07395-7

Mosaicism describes a phenomenon where a mixture of genotypic states in certain genomic segments exists within the same individual. Mosaicism is a prevalent and impactful class of non-integer state copy number variation (CNV).

MONTAGE directly interfaces with the ParseCNV2 algorithm to establish disease phenotype genome-wide association and to determine which genomic ranges had a higher or lower than expected frequency of mosaic events.





□ geneRFinder: gene finding in distinct metagenomic data complexities

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03997-w

geneRFinder is an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across the accompanying benchmark by using only one pre-trained Random Forest model.

geneRFinder is an ORF extraction-based tool capable of identifying coding sequences and intergenic regions in metagenomic sequences, making its predictions from signals captured in these regions.
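
The ORF extraction step that such a tool starts from can be sketched as follows. This is a deliberately simplified illustration and not geneRFinder's implementation: it scans only the forward strand, uses ATG as the sole start codon, and the function name and minimum length are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=9):
    """Return (start, end, sequence) for forward-strand ORFs in all three frames.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon (inclusive); end is exclusive, as in Python slicing.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next start codon in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # → [(2, 14, 'ATGAAATTTTAG')]
```

The extracted candidates, together with the stretches between them, are the kind of regions from which a classifier could then derive features.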




□ Privacy-Preserving and Robust Watermarking on Sequential Genome Data using Belief Propagation and Local Differential Privacy

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab128/6149476

The paper proposes a novel watermarking method for sequential genome data using the belief propagation algorithm. It embeds robust watermarks so that malicious adversaries cannot tamper with the watermark by modification and are identified with high probability.

The method achieves ε-local differential privacy in all data sharing with SPs and preserves system robustness against single-SP and collusion attacks, while taking into account publicly available genomic information such as minor allele frequency, linkage disequilibrium, and phenotype information.




□ PICS2: Next-generation fine mapping via probabilistic identification of causal SNPs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab122/6149122

The Probabilistic Identification of Causal SNPs (PICS) algorithm and web application were developed as a fine-mapping tool to determine the likelihood that each single nucleotide polymorphism (SNP) in LD with a reported index SNP is a true causal polymorphism.

PICS2 enables PICS analyses of large batches of index SNPs. It also adds the use of LD reference data generated from 1000 Genomes phase 3, annotation of variant consequences, annotation of GTEx eQTL genes, and downloadable PICS SNPs from GTEx eQTLs.




□ DeepAccess: Discovering differential genome sequence activity with interpretable and efficient deep learning

>> https://www.biorxiv.org/content/10.1101/2021.02.26.433073v1.full.pdf

Differential Expected Pattern Effect (DEPE) is a method to compare Expected Pattern Effects between two conditions or cell states.

While DeepAccess was developed specifically for identifying cell type-specific sequence features from chromatin accessibility, Differential Expected Pattern Effect can be used to discover condition-specific sequence features from many types of experimental genome-wide sequencing data.




□ DENTIST – using long reads to close assembly gaps at high accuracy

>> https://www.biorxiv.org/content/10.1101/2021.02.26.432990v1.full.pdf

DENTIST uses uncorrected, long sequencing reads to close gaps in fragmented assemblies. DENTIST employs a reference-based consensus caller to generate high-quality consensus sequence for each closed assembly gap, maintaining a high base accuracy in the final assembly.

DENTIST is able to scaffold contigs using the given long reads. DENTIST provides a “free scaffolding mode”, where it scaffolds the given contigs just using long read alignments.




□ VarCA: Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq

>> https://www.biorxiv.org/content/10.1101/2021.02.26.433126v1.full.pdf

VarCA uses a random forest to predict indels and SNVs and achieves substantially better performance than any individual caller.

VarCA derives quality scores from the RF classification probabilities by fitting a linear model between the phred-scaled RF classification probabilities and the empirical precision of each bin. It then uses this model to calculate the final quality score for every variant.
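
A minimal sketch of this kind of calibration is given below. It is not VarCA's code: the binning, the choice to phred-scale the empirical precision as well, the cap of 60 and all numbers are assumptions made for illustration. Only the phred transform Q = -10 log10(1 - p) and ordinary least squares are standard.

```python
import math


def phred(prob_correct, max_q=60.0):
    """Phred-scale a probability of being correct: Q = -10 * log10(1 - p)."""
    err = 1.0 - prob_correct
    if err <= 0:
        return max_q  # cap (assumed) to avoid log10(0)
    return min(max_q, -10 * math.log10(err))


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrated_quality(rf_prob, slope, intercept):
    """Final quality score for one variant from its RF probability."""
    return slope * phred(rf_prob) + intercept


# Made-up bins: (mean RF probability in the bin, empirical precision of the bin).
bins = [(0.90, 0.93), (0.99, 0.985), (0.999, 0.998)]
xs = [phred(p) for p, _ in bins]
ys = [phred(prec) for _, prec in bins]
slope, intercept = fit_line(xs, ys)
print(calibrated_quality(0.995, slope, intercept))
```

The fitted line maps raw classifier confidence onto a scale that reflects how often calls in a comparable bin were actually correct.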





□ RWRF: Multi-dimensional data integration algorithm based on random walk with restart

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04029-3

RWRF (Random Walk with Restart for multi-dimensional data Fusion) uses similarity networks of samples as the basis for integration. It constructs the similarity network for each data type and then connects corresponding samples of multiple similarity networks to construct a multiplex network.

RWRF uses stationary probability distribution to fuse similarity networks. RWRF can automatically capture various structure information and make full use of topology information of the whole similarity network of each type of data.
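
The stationary distribution of a random walk with restart can be obtained by simple iteration, as sketched below. This is a single-network illustration only, not the RWRF method: the multiplex construction is omitted, the restart probability and function name are assumptions, and every node is assumed to have at least one edge.

```python
def random_walk_with_restart(adj, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary distribution of p = (1 - r) * W p + r * e for one seed node.

    adj is a symmetric weighted adjacency (similarity) matrix given as a
    list of lists; W is its column-normalised transition matrix.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    W = [[adj[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    e = [0.0] * n
    e[seed] = 1.0
    p = e[:]
    for _ in range(max_iter):
        new_p = [
            (1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * e[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Toy similarity network: a path 0 - 1 - 2 with unit weights.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
print(random_walk_with_restart(adj, seed=0))
```

Running the walk once per sample yields one row of stationary probabilities per seed; stacking these rows gives a diffusion-based similarity that reflects the topology of the whole network rather than only direct neighbours.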




□ Gene-Set Integrative Analysis of Multi-Omics Data Using Tensor-based Association Test

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab125/6154849

A common strategy is to regress the outcomes on all omics variables in a gene set. However, this approach suffers from problems associated with high-dimensional inference.

TRinstruction is a tensor-based framework for variable-wise inference. By accounting for the matrix structure of an individual’s multi-omics data, tensor methods incorporate the relationship among omics effects, reduce the number of parameters, and boost the modeling efficiency.




□ ksrates: positioning whole-genome duplications relative to speciation events using rate-adjusted mixed paralog–ortholog KS distributions

>> https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1.full.pdf

If the lineages involved exhibit different substitution rates, a direct naive comparison of paralog and ortholog KS estimates can be misleading and result in phylogenetic misinterpretation of WGD signatures.

ksrates estimates differences in synonymous substitution rates among the lineages involved and generates an adjusted mixed plot of paralog and ortholog KS distributions that allows the relative phylogenetic positioning of presumed WGD and speciation events to be assessed.
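
The idea of a rate adjustment can be illustrated with a standard relative-rate decomposition using an outgroup. The sketch below is a generic calculation under the assumption that KS distances are additive along branches; it is not claimed to be the exact procedure ksrates implements, and the function names and numbers are made up.

```python
def branch_contributions(ks_ab, ks_a_out, ks_b_out):
    """Split the A-B ortholog KS into the parts accumulated on each lineage.

    Uses an outgroup: k_A + k_B = KS(A,B) and k_A - k_B = KS(A,out) - KS(B,out).
    """
    k_a = (ks_ab + ks_a_out - ks_b_out) / 2
    k_b = (ks_ab + ks_b_out - ks_a_out) / 2
    return k_a, k_b


def adjusted_divergence_ks(ks_ab, ks_a_out, ks_b_out):
    """Speciation KS re-expressed on the KS timescale of focal species A."""
    k_a, _ = branch_contributions(ks_ab, ks_a_out, ks_b_out)
    return 2 * k_a


# Made-up ortholog KS modes: A evolves faster than B since their divergence.
print(branch_contributions(1.0, 1.5, 1.3))
print(adjusted_divergence_ks(1.0, 1.5, 1.3))
```

In this toy case the raw ortholog KS of 1.0 understates the divergence as seen from the faster-evolving focal species, so a WGD peak in its paralog KS distribution would be compared against the adjusted value instead.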





□ 2passtools: two-pass alignment using machine-learning-filtered splice junctions increases the accuracy of intron detection in long-read RNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02296-0

Alignment metrics and machine-learning-derived sequence information are used to filter spurious splice junctions from long-read alignments, and the remaining junctions guide realignment in a two-pass approach.

2passtools is a method for filtered two-pass alignment of the relatively high-error long reads generated by techniques such as nanopore DRS. 2passtools uses a rule-based approach to identify probable genuine and spurious splice junctions from first-pass read alignments.




□ GMSECT: Genome-Wide Massive Sequence Exhaustive Comparison Tool for structural and copy number variations

>> https://www.biorxiv.org/content/10.1101/2021.03.01.433223v1.full.pdf

Most existing pairwise alignment tools are extensions of the dynamic programming algorithm, and although they are much faster than the standard dynamic programming approach, they are not fast or efficient enough to handle massive sequences.

The GMSECT algorithm can also be implemented using other parallel application programming interfaces, such as POSIX Threads, or even in a serial submission fashion.