lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre. (Long is the time, but what is true comes to pass.)

Max Richter – Beethoven - Opus 2020

2020-12-31 23:22:33 | art music

□ Max Richter – Beethoven - Opus 2020

>> https://www.maxrichtermusic.com

Max Richter - Beethoven - Opus 2020:
Beethoven Orchestra Bonn
Elisabeth Brauß – Piano
Dirk Kaftan – Conductor / General Music Director Beethoven Orchestra Bonn


Max Richter pays homage to Beethoven by releasing the world premiere recording of Beethoven – Opus 2020 on the 250th anniversary of the iconic composer's birthday.

His new orchestral work was commissioned by the Beethoven-Haus Bonn, which will host the world premiere performance on the eve of the anniversary.

Richter’s creative contemporary dialogue with Beethoven also embraces Opus 1970 – an earlier tribute from visionary 20th-century composer Stockhausen.





Silentium.

2020-12-24 22:13:36 | Science News




□ pythrahyper_net: Biological network growth in complex environments: A computational framework

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008003

The properties of complex networks are often calculated with respect to canonical random graphs such as the Erdős–Rényi or Watts–Strogatz model. These models are defined by the topological structure in an adjacency matrix, but usually neglect spatial constraints.

pythrahyper_net is a computational framework, based on directional statistics, for modelling network formation in space and time under arbitrary spatial constraints. It is a probabilistic agent-based model that describes individual growth processes as biased, correlated random motion.

Probability distributions are modeled as multivariate Gaussian distributions (MGD), with mean and covariance determined from the discrete simulation grid. Individual MGDs are combined by convolution, transformed to spherical coordinates, and projected onto the unit sphere.
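
As a rough illustration of the convolution step only (not the authors' implementation): the convolution of two multivariate Gaussians is again Gaussian, with summed means and covariances, and a sampled growth step can then be projected onto the unit sphere. A minimal numpy sketch with made-up parameters:

import numpy as np
# Convolution of N(mu1, S1) with N(mu2, S2) yields N(mu1 + mu2, S1 + S2).
mu1, S1 = np.array([1.0, 0.0, 0.0]), np.eye(3) * 0.2
mu2, S2 = np.array([0.0, 0.5, 0.0]), np.eye(3) * 0.1
mu, S = mu1 + mu2, S1 + S2
rng = np.random.default_rng(0)
step = rng.multivariate_normal(mu, S)      # sample one growth step
direction = step / np.linalg.norm(step)    # project onto the unit sphere
print(direction)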





□ NanosigSim: Simulation of Nanopore Sequencing Signals Based on BiGRU

>> https://www.mdpi.com/1424-8220/20/24/7244

NanosigSim is a signal simulation method based on Bi-directional Gated Recurrent Units (BiGRU). Its signal processing model has a novel architecture that couples a three-layer BiGRU with a fully connected layer.

NanosigSim can model the relation between ground-truth signal and real-world sequencing signal through experimental data to accurately filter out the useless high-frequency components. This process can be achieved by using Continuous Wavelet Dynamic Time Warping.
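
A minimal PyTorch sketch of a three-layer BiGRU coupled with a fully connected layer; the layer sizes and input shape are assumptions for illustration, not the published NanosigSim model:

import torch
import torch.nn as nn

class SignalModel(nn.Module):
    def __init__(self, in_dim=1, hidden=64):
        super().__init__()
        # three stacked bidirectional GRU layers
        self.gru = nn.GRU(in_dim, hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        # fully connected layer maps both directions to one value per time step
        self.fc = nn.Linear(2 * hidden, 1)
    def forward(self, x):          # x: (batch, time, in_dim)
        h, _ = self.gru(x)         # h: (batch, time, 2 * hidden)
        return self.fc(h)

model = SignalModel()
out = model(torch.randn(4, 200, 1))    # four simulated signal windows
print(out.shape)                       # torch.Size([4, 200, 1])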





□ Induced and higher-dimensional stable independence

>> https://arxiv.org/pdf/2011.13962v1.pdf

The paper studies stable independence in the context of accessible categories. This notion has its origins in the model-theoretic concept of stable nonforking, which can be thought of, on one hand, as a freeness property of type extensions and, on the other, as a notion of freeness or independence for amalgams of models.

Given an (n+1)-dimensional stable independence notion Γ_(n+1) = (Γ_n, Γ), K_Γ_(n+1) is defined to be the category (K_Γ_n)_Γ, whose objects are the morphisms of K_Γ_n and whose morphisms are the Γ-independent squares. ⌣ is λ-accessible, for λ an infinite regular cardinal, if the category K_↓ is λ-accessible.

A stable independence notion immediately yields higher-dimensional independence; taken to its logical conclusion, this leads to a formulation of stable independence as a property of commutative squares in a general category, described by a family of purely category-theoretic axioms.





□ Characterisations of Variant Transfinite Computational Models: Infinite Time Turing, Ordinal Time Turing, and Blum-Shub-Smale machines

>> https://arxiv.org/pdf/2012.08001.pdf

The paper uses admissibility theory, Σ2-codes and Π3-reflection properties in the constructible hierarchy to classify the halting times of ITTMs with multiple independent heads; the same is done for Ordinal Turing Machines, which have On-length tapes.

Infinite Time Blum-Shub-Smale machines (IBSSMs) have a universality property; this is because ITTMs do and the two classes of machine are ‘bi-simulable’. This is in contradistinction to the machine using a ‘continuity’ rule, which fails to be universal.




□ Linear space string correction algorithm using the Damerau-Levenshtein distance

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3184-8

Linear space algorithms to compute the Damerau-Levenshtein (DL) distance between two strings and determine the optimal trace. Lowrance and Wagner have developed an O(mn) time O(mn) space algorithm to find the minimum cost edit sequence between strings of length m and n.

The linear-space algorithms use a refined dynamic programming recurrence. The more general string-correction algorithm using the Damerau-Levenshtein distance runs in O(mn) time and uses O(s·min{m,n} + m + n) space.
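
For reference, a minimal quadratic-space sketch of the restricted Damerau-Levenshtein (optimal string alignment) recurrence; the paper's linear-space and optimal-trace refinements are not shown:

def osa_distance(a, b):
    # d[i][j] = edit distance between a[:i] and b[:j], allowing insertions,
    # deletions, substitutions and transpositions of adjacent characters
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("abcdef", "abdcef"))   # 1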




□ The Diophantine problem in finitely generated commutative rings

>> https://arxiv.org/pdf/2012.09787.pdf

The paper studies the Diophantine problem, denoted D(R), in infinite finitely generated commutative associative unitary rings R.

A polynomial-time algorithm is given that, for a finite system S of polynomial equations with coefficients in O, constructs a finite system S̊ of polynomial equations with coefficients in R such that S has a solution in O if and only if S̊ has a solution in R.




□ MathFeature: Feature Extraction Package for Biological Sequences Based on Mathematical Descriptors

>> https://www.biorxiv.org/content/10.1101/2020.12.19.423610v1.full.pdf

MathFeature provides 20 approaches based on several studies found in the literature, e.g., multiple numeric mappings, genomic signal processing, chaos game theory, entropy, and complex networks.

Various studies have applied concepts from information theory for sequence feature extraction, mainly Shannon’s entropy. Another entropy-based measure has been successfully explored, e.g., Tsallis entropy, proposed to generalize the Boltzmann/Gibbs’s traditional entropy.
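
A small sketch of Shannon and Tsallis entropy computed from a sequence's symbol frequencies (illustrative only; MathFeature exposes these among many other descriptors):

import math
from collections import Counter

def shannon_entropy(seq):
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def tsallis_entropy(seq, q=2.0):
    # generalizes the Boltzmann/Gibbs (Shannon) entropy; recovered as q -> 1
    n = len(seq)
    return (1.0 - sum((c / n) ** q for c in Counter(seq).values())) / (q - 1.0)

seq = "ACGTACGGGTTTACG"
print(shannon_entropy(seq), tsallis_entropy(seq, q=2.0))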




□ Megadepth: efficient coverage quantification for BigWigs and BAMs

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423317v1.full.pdf

Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor.

Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19,000 GTExV8 BigWig files in approximately one hour using 32 threads.

Megadepth can be configured to use multiple HTSlib threads for reading BAMs, speeding up block-gzip decompression.

Megadepth allocates a per-base counts array across the entirety of the current chromosome before processing the alignments from that chromosome.





□ VEGA: Biological network-inspired interpretable variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423310v1.full.pdf

VEGA (VAE Enhanced by Gene Annotations) is a novel sparse variational autoencoder architecture whose decoder wiring is inspired by a priori characterized biological abstractions, providing direct interpretability to the latent variables.

Composed of a deep non-linear encoder and a masked linear decoder, VEGA encodes single-cell transcriptomics data in an interpretable latent space specified a priori.
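
A minimal PyTorch sketch of a masked linear decoder, in which a binary gene-set membership matrix zeroes out decoder weights so that each latent variable only reconstructs its annotated genes; this is an illustrative reading of the idea, not VEGA's published code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinearDecoder(nn.Module):
    def __init__(self, mask):                  # mask: (n_genes, n_latent), binary
        super().__init__()
        self.register_buffer("mask", mask.float())
        self.weight = nn.Parameter(torch.randn(mask.shape) * 0.01)
        self.bias = nn.Parameter(torch.zeros(mask.shape[0]))
    def forward(self, z):                      # z: (batch, n_latent)
        return F.linear(z, self.weight * self.mask, self.bias)

mask = torch.rand(2000, 50) < 0.05             # hypothetical gene-set annotations
decoder = MaskedLinearDecoder(mask)
x_hat = decoder(torch.randn(8, 50))            # reconstructed expression (8, 2000)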




□ Sfaira accelerates data and model reuse in single cell genomics

>> https://www.biorxiv.org/content/10.1101/2020.12.16.419036v1.full.pdf

Sfaira uses a size-factor-normalised, but otherwise non-processed, feature space for its models, so that all genes can contribute to embeddings and classification, and the contribution of all genes can be dissected without the issue of removing low-variance features.

Sfaira automates exploratory analysis of single-cell data. It allows fitting cell type classifiers for data sets with different levels of annotation granularity by using cell type ontologies, and it streamlines the training of embedding models across whole atlases.




□ Methrix: Systematic aggregation and efficient summarization of generic bedGraph files from Bisulfite sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa1048/6042753

Core functionality of Methrix includes a comprehensive bedGraph reader, which summarizes methylation calls based on annotated reference indices, infers and collapses strands, and handles uncovered reference CpG sites, while facilitating a flexible input file format specification.

Methrix enriches established WGBS workflows by bringing together computational efficiency and versatile functionality.





□ Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

>> https://arxiv.org/pdf/2010.10055.pdf

The paper presents a sparse-linear-algebra-centric approach to distributed-memory parallelization of the overlap and layout phases, formulating overlap detection as a distributed Sparse General Matrix Multiply (SpGEMM).

Sparse matrix-matrix multiplication allows diBELLA to efficiently parallelize the computation without losing expressiveness, thanks to the semiring abstraction. A novel distributed-memory algorithm performs the transitive reduction of the overlap graph.
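
A toy single-node sketch of casting overlap detection as a sparse matrix product (a reads × k-mers indicator matrix times its transpose counts shared k-mers between read pairs); diBELLA's distributed semiring machinery is of course not shown:

import numpy as np
from scipy.sparse import csr_matrix

reads = ["ACGTACGT", "CGTACGTT", "TTTTGGGA"]
k = 4
kmers = sorted({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})
col = {km: j for j, km in enumerate(kmers)}
entries = sorted({(i, col[r[p:p + k]]) for i, r in enumerate(reads)
                  for p in range(len(r) - k + 1)})
rows, cols = zip(*entries)
A = csr_matrix((np.ones(len(rows)), (rows, cols)),
               shape=(len(reads), len(kmers)))   # reads x k-mers, 0/1
overlap = A @ A.T    # entry (i, j) = number of distinct k-mers shared by reads i and j
print(overlap.toarray())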





□ SOMDE: A scalable method for identifying spatially variable genes with self-organizing map

>> https://www.biorxiv.org/content/10.1101/2020.12.10.419549v1.full.pdf

SOMDE, an efficient method for identifying SVgenes in large-scale spatial expression data. SOMDE uses self-organizing map (SOM) to cluster neighboring cells into nodes, and then uses a Gaussian Process to fit the node-level spatial gene expression to identify SVgenes.

SOMDE converts the original spatial gene expression to node-level gene meta-expression profiles. SOMDE models the condensed representation of the original spatial transcriptome data with a modified Gaussian process to quantify the relative spatial variability.




□ LISA: Learned Indexes for DNA Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.22.423964v1.full.pdf

LISA (Learned Indexes for Sequence Analysis) accelerates two of the most essential flavors of DNA sequence search—exact search and super-maximal exact match (SMEM) search.

LISA achieves 13.3× higher throughput than the Trans-Omics Acceleration Library (TAL). SMEM search finds, for every position in the read, the longest substring of the read that passes through that position and still has an exact match in the reference sequence.





□ EVE: Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning

>> https://www.biorxiv.org/content/10.1101/2020.12.21.423785v1.full.pdf

EVE (Evolutionary model of Variant Effect) learns a distribution over amino acid sequences from evolutionary data. It enables the computation of the evolutionary index. A global-local mixture of Gaussian Mixture Models separates variants into benign and pathogenic clusters based on that index.

EVE reflects the probabilistic assignment to either the pathogenic or the benign cluster. The probabilistic nature of the model makes it possible to quantify the uncertainty of this cluster assignment and to label variants with ambiguous assignments as Uncertain rather than Benign or Pathogenic.
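
A rough sklearn sketch of the final clustering step: a plain two-component Gaussian mixture on a one-dimensional evolutionary index, with an uncertainty band. EVE's global-local mixture and its Bayesian treatment are more involved, and the indices below are simulated:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# hypothetical evolutionary indices: low = benign-like, high = pathogenic-like
idx = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(idx)
pathogenic = np.argmax(gmm.means_.ravel())        # component with the higher mean
p_path = gmm.predict_proba(idx)[:, pathogenic]
labels = np.where(p_path > 0.9, "Pathogenic",
                  np.where(p_path < 0.1, "Benign", "Uncertain"))
print(dict(zip(*np.unique(labels, return_counts=True))))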






□ CellVGAE: An unsupervised scRNA-seq analysis workflow with graph attention networks

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423645v1.full.pdf

CellVGAE uses the connectivity between cells (e.g. k-nearest neighbour graphs or KNN) with gene expression values as node features to learn high-quality cell representations in a lower-dimensional space, with applications in downstream analyses like (density-based) clustering.

CellVGAE leverages the connectivity between cells, represented as a graph, to perform convolutions on a non-Euclidean structure, thus subscribing to the geometric deep learning paradigm.





□ Cytopath: Simulation based inference of differentiation trajectories from RNA velocity fields

>> https://www.biorxiv.org/content/10.1101/2020.12.21.423801v1.full.pdf

Because Cytopath is based upon transitions that use the full expression and velocity profiles of cells, it is less prone to projection artifacts distorting expression-profile similarity.

The objective of trajectory inference is to estimate trajectories from root to terminal states. Simulations ending in a common terminal state are aligned using Dynamic Time Warping. Root and terminal states can be derived from a Markov random-walk model utilizing the transition probability matrix.





□ GCNG: graph convolutional networks for inferring gene interaction from spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02214-w

The GCNG model uses spatial single-cell expression data. A binary cell adjacency matrix and an expression matrix are extracted from the spatial data. After normalization, both matrices are fed into the graph convolutional network.

GCNG consists of two graph convolutional layers, one flatten layer, one 512-dimension dense layer, and one sigmoid function output layer for classification.
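
A small sketch of building a binary cell adjacency matrix from spatial coordinates, linking cells that lie within a distance threshold; the coordinates and threshold here are made up:

import numpy as np
from scipy.sparse import lil_matrix
from scipy.spatial import cKDTree

coords = np.random.default_rng(0).uniform(0, 100, size=(500, 2))  # cell positions
tree = cKDTree(coords)
pairs = tree.query_pairs(r=5.0)         # cells closer than 5 units are adjacent
A = lil_matrix((len(coords), len(coords)), dtype=np.int8)
for i, j in pairs:
    A[i, j] = 1
    A[j, i] = 1
print(A.nnz, "adjacency entries")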





□ GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

>> https://arxiv.org/pdf/1908.01407.pdf

GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock.

Currently, direction-optimization is only active for matrix-vector multiplication. However, in the future, the optimization can be extended to matrix-matrix multiplication.





□ DipAsm: Chromosome-scale, haplotype-resolved assembly of human genomes

>> https://www.nature.com/articles/s41587-020-0711-0

DipAsm uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day.

A potential solution is to retain heterozygous events in the initial assembly graph and to scaffold and dissect these events later to generate a phased assembly.

DipAsm accurately reconstructs the two haplotypes in a diploid individual using only PacBio’s long high-fidelity (HiFi) reads and Hi-C data, both at ~30-fold coverage, without any pedigree information.





□ Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

>> https://www.nature.com/articles/s41587-020-0719-5

A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.

Examining the whole major histocompatibility complex (MHC) region, the authors found that it was traversed by a single contig in both haplotype assemblies.





□ Characterizing finitely generated fields by a single field axiom

>> https://arxiv.org/pdf/2012.01307v1.pdf

The Elementary Equivalence versus Isomorphism Problem, for short EEIP, asks whether the elementary theory Th(K) of a finitely generated field K (always in the language of rings) encodes the isomorphism type of K in the class of all finitely generated fields.

Every field K is elementarily equivalent to its “constant field” κ – the relative algebraic closure of the prime field in K – and its first-order theory is decidable.

The paper is concerned with fields at the centre of (birational) arithmetic geometry, namely the finitely generated fields K, which are the function fields of integral Z-schemes of finite type.





□ PySCNet: A tool for reconstructing and analyzing gene regulatory network from single-cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2020.12.18.423482v1.full.pdf

PySCNet integrates competitive gene regulatory construction methodologies for cell specific or trajectory specific gene regulatory networks (GRNs) and allows for gene co-expression module detection and gene importance evaluation.

PySCNet uses Louvain clustering to detect gene co-expression modules. Node centrality is applied to estimate the importance of a gene / TF in the network. To discover hidden regulating links of a target gene node, graph traversal is utilized to predict indirect regulations.





□ SCMER: Single-Cell Manifold Preserving Feature Selection

>> https://www.biorxiv.org/content/10.1101/2020.12.01.407262v1.full.pdf

SCMER is a novel unsupervised approach that performs UMAP-style dimensionality reduction by selecting a compact set of molecular features with definitive meanings.

A manifold defined by pairwise cell similarity scores sufficiently represents the complexity of the data, encoding both the global relationships between cell groups and the local relationships within cell groups.

While clusters usually reflect distinct cell types, continuums reflect similar cell types and trajectory of transitioning/differentiating cell states. SCMER selects optimal features that preserve the manifold and retain inter- and intra-cluster diversity.

SCMER does not require clusters or trajectories, and thereby circumvents the associated biases. It is sensitive to detect diverse features that delineate common and rare cell types, continuously changing cell states, and multicellular programs shared by multiple cell types.

If a dataset with n cells is separated into b batches, the space complexity is reduced from O(n^2) to O(b * (n/b)^2) = O(n^2 / b).

The Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm solves the l1-regularized regression problem by introducing pseudo-gradients and restricting the optimization to an orthant without discontinuities in the gradient.





□ A Scalable Optimization Mechanism for Pairwise based Discrete Hashing

>> https://ieeexplore.ieee.org/document/9280410

A novel alternative optimization mechanism reformulates a typical quartic problem, in terms of the hash functions in the original Kernel-based Supervised Hashing objective, into a linear problem by introducing a linear regression model.

The paper also proposes a scalable symmetric discrete hashing algorithm that gradually and smoothly updates each batch of binary codes, and a greedy symmetric discrete hashing algorithm that updates each bit of the batch binary codes.





□ SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405118v1.full.pdf

SpaGCN draws a circle around each spot with a pre-specified radius, and all spots that reside in the circle are considered neighbors of this spot. SpaGCN allows combining multiple domains into one target domain or specifying which neighboring domains to include in DE analysis.

SpaGCN can identify spatial domains with coherent gene expression and histology and detect SVGs and meta genes that have much clearer spatial expression patterns and biological interpretations than genes detected by SPARK and SpatialDE.





□ GRGNN: Inductive inference of gene regulatory network using supervised and semi-supervised graph neural networks

>> https://www.sciencedirect.com/science/article/pii/S200103702030444X

GRGNN - an end-to-end gene regulatory graph neural network approach to reconstruct GRNs from scratch utilizing the gene expression data, in both a supervised and a semi-supervised framework.

One of the time-consuming parts of GRGNN practice is extracting the enclosed subgraphs. The time complexity is O(n|V|h) and the memory complexity is O(n|E|) for extracting n subgraphs in h-hop, where |V| and |E| are numbers of nodes and edges in the whole graph.




□ spVCF: Sparse project VCF: efficient encoding of population genotype matrices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1004/6029516

Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10X size reduction for modern studies with practically minimal information loss.

spVCF interoperates with VCF efficiently, including tabix-based random access. spVCF provides the genotype matrix sparsely, by selectively reducing QC measure entropy and run-length encoding repetitive information about reference coverage.





□ SDPR: A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405241v1.full.pdf

SDPR (Summary statistics based Dirichlet Process Regression) is a method to compute polygenic risk scores (PRS) from summary statistics. It is the extension of Dirichlet Process Regression (DPR) to the use of summary statistics.

SDPR connects the marginal coefficients in summary statistics with true effect sizes through Bayesian multiple Dirichlet Process Regression, and utilizes the concept of approximately independent LD blocks and reparameterization to develop a parallel, fast-mixing Markov chain Monte Carlo algorithm.





□ Maximum Caliber: Inferring a network from dynamical signals at its nodes

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008435

The paper presents an approximate solution to the difficult inverse problem of inferring the topology of an unknown network from given time-dependent signals at its nodes.

The method of choice for inferring dynamical processes from limited information is the Principle of Maximum Caliber. Max Cal can infer both the dynamics and interactions within arbitrarily complex, non-equilibrium systems, albeit in an approximate way.




□ scGMAI: a Gaussian mixture model for clustering single-cell RNA-Seq data based on deep autoencoder

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa316/6029147

scGMAI is a new single-cell Gaussian mixture clustering method based on autoencoder networks and the fast independent component analysis (FastICA).

scGMAI utilizes autoencoder networks to reconstruct gene expression values from scRNA-Seq data, and FastICA is used to reduce the dimensions of the reconstructed data.
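
A condensed sklearn sketch of the last two stages (dimension reduction with FastICA, then Gaussian mixture clustering); the autoencoder reconstruction step is replaced by a simulated expression matrix for brevity:

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).poisson(2.0, size=(300, 1000)).astype(float)  # stand-in expression matrix
Z = FastICA(n_components=10, random_state=0).fit_transform(np.log1p(X))
clusters = GaussianMixture(n_components=5, random_state=0).fit_predict(Z)
print(np.bincount(clusters))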




□ Assembling Long Accurate Reads Using de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2020.12.10.420448v1.full.pdf

The paper presents an efficient jumboDB algorithm for constructing the de Bruijn graph for large genomes and large k-mer sizes, and the LJA genome assembler, which error-corrects HiFi reads and uses jumboDB to construct the de Bruijn graph on the error-corrected reads.

Since the de Bruijn graph constructed for a fixed k-mer size is typically either too tangled or too fragmented, LJA uses a new concept of a multiplex de Bruijn graph.
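
A toy sketch of de Bruijn graph construction for a fixed k (nodes are (k-1)-mers, edges are k-mers); jumboDB's engineering for large genomes and large k is far beyond this:

from collections import defaultdict

def de_bruijn_graph(reads, k):
    graph = defaultdict(set)    # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

reads = ["ACGTACGTGG", "CGTACGTGGA"]
for node, successors in de_bruijn_graph(reads, k=4).items():
    print(node, "->", sorted(successors))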




□ SCCNV: A Software Tool for Identifying Copy Number Variation From Single-Cell Whole-Genome Sequencing

>> https://www.frontiersin.org/articles/10.3389/fgene.2020.505441/full

Several statistical models have been developed for analyzing sequencing data of bulk DNA, for example, Circular Binary Segmentation (CBS), Mean Shift-Based (MSB) model, Shifting Level Model (SLM), Expectation Maximization (EM) model, and Hidden Markov Model (HMM).

SCCNV is a read-depth based approach with adjustment for the WGA bias. It controls not only bias introduced during sequencing and alignment, e.g., bias associated with mappability and GC content, but also the locus-specific amplification bias.





□ A generative spiking neural-network model of goal-directed behaviour and one-step planning

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007579

The first hypothesis allows the architecture to learn the world model in parallel with its use for planning: a new arbitration mechanism decides when to explore, for learning the world model, or when to exploit it, for planning, based on the entropy of the world model itself.

The entropy threshold decreases linearly with each planning cycle, so that the exploration component is eventually called to select the action if the planning process fails to reach the goal multiple times.





□ Probabilistic Contrastive Principal Component Analysis

>> https://arxiv.org/pdf/2012.07977.pdf

PCPCA is a model-based alternative to contrastive principal component analysis (CPCA). The model is both generative and discriminative; PCPCA provides a model-based approach that allows for uncertainty quantification and principled inference.

PCPCA can be applied to a variety of statistical and machine learning problem domains including dimension reduction, synthetic data generation, missing data imputation, and clustering.
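
For context, a minimal numpy sketch of plain (non-probabilistic) CPCA, which takes the top eigenvectors of the foreground covariance minus gamma times the background covariance; PCPCA wraps this contrastive idea in a probabilistic model. The data below are simulated:

import numpy as np

def cpca(X_fg, X_bg, gamma=1.0, n_components=2):
    # directions with high variance in the foreground but low variance in the background
    C_fg = np.cov(X_fg, rowvar=False)
    C_bg = np.cov(X_bg, rowvar=False)
    evals, evecs = np.linalg.eigh(C_fg - gamma * C_bg)
    return evecs[:, np.argsort(evals)[::-1][:n_components]]

rng = np.random.default_rng(0)
X_bg = rng.normal(size=(200, 10))
X_fg = rng.normal(size=(150, 10))
X_fg[:, 0] *= 3.0                      # extra variance unique to the foreground
W = cpca(X_fg, X_bg, gamma=2.0)
print(W.shape)                         # (10, 2) projection matrix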





□ scCODA: A Bayesian model for compositional single-cell data analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422688v1.full.pdf

scCODA is a Bayesian approach for cell-type composition differential abundance analysis that further addresses the low-replicate issue.

scCODA framework models cell type counts with a hierarchical Dirichlet-Multinomial distribution that accounts for the uncertainty in cell type proportions and the negative correlative bias via joint modeling of all measured cell type proportions instead of individual ones.





Every sight I've ever seen.

2020-12-24 22:12:24 | Science News



□ Beyond low-Earth orbit: Characterizing the immune profile following simulated spaceflight conditions for deep space missions

>> https://www.cell.com/iscience/fulltext/S2589-0042(20)30944-5

Circulating immune biomarkers are defined by distinct deep space irradiation types coupled to simulated microgravity and could be targets for future space health initiatives.

Unique immune signatures and microRNA (miRNA) profiles would be produced by distinct experimental conditions of simulated GCR, SPE, and gamma irradiation, singly or in combination with HU.

Linear energy transfer (LET) is defined as the amount of energy that is deposited or transferred in a material from an ion. High-LET irradiation can cause more damaging ionizing tracks and pose a higher relative biological effectiveness (RBE) risk compared to low-LET irradiation.





□ Advancing the Integration of Biosciences Data Sharing to Further Enable Space Exploration

>> https://www.cell.com/cell-reports/fulltext/S2211-1247(20)31430-3

This open access science perspective invites investigators to participate in a transformative collaborative effort for interpreting spaceflight effects by integrating omics and physiological data to the systems level.

Integration of data from GeneLab and ALSDA will enable spaceflight health risk modeling. All data would then benefit from applied FAIR principles.





□ Super-robust data storage in DNA by de Bruijn graph-based decoding

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423642v1.full.pdf

The De Bruijn Graph-based Greedy Path Search (DBG-GPS) algorithm can efficiently reconstruct DNA strands directly from multiple error-rich sequences.

DBG-GPS is designed as an inner decoding mechanism for correcting errors within DNA strands, and it is about 50 times faster than clustering- and multiple-alignment-based methods. Its linear decoding complexity makes DBG-GPS a suitable solution for large-scale data storage.





□ STARRPeaker: uniform processing and accurate identification of STARR-seq active regions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02194-x

STARRPeaker is an algorithm optimized for processing STARR-seq data and identifying functionally active enhancers. The approach statistically models the basal level of transcription, accounting for potential confounding factors, and accurately identifies reproducible enhancers.

To model fragment coverage from STARR-seq with a discrete probability distribution, each genomic bin is assumed to be independent, as in Bernoulli trials. STARRPeaker calculates fragment coverage and fits the basal transcription rate using negative binomial regression.
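
A minimal statsmodels sketch of negative binomial regression of per-bin fragment counts on covariates; the covariates, counts and dispersion below are placeholders, not the ones used by STARRPeaker:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_bins = 1000
gc = rng.uniform(0.3, 0.7, n_bins)          # hypothetical covariate: GC content
input_cov = rng.poisson(20, n_bins)         # hypothetical covariate: input coverage
X = sm.add_constant(np.column_stack([gc, input_cov]))
y = rng.negative_binomial(5, 0.3, n_bins)   # stand-in fragment counts per bin
fit = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(fit.params)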





□ RedOak: a reference-free and alignment-free structure for indexing a collection of similar genomes

>> https://www.biorxiv.org/content/10.1101/2020.12.19.423583v1.full.pdf

The parallelization of the data structure construction allows those genomes to be efficiently indexed and queried through the use of networking resources. RedOak is inspired by the Bloom Filter Trie and uses a probabilistic approach.

RedOak can also be applied to reads from unassembled genomes, and it provides a nucleotide sequence query function. This software is based on a k-mer approach and has been developed to be heavily parallelized and distributed on several nodes of a cluster.
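
A tiny sketch of the probabilistic membership idea behind Bloom-filter-style k-mer indexes (multiple hashes over a bit array); nothing here reflects RedOak's actual data structure or its Bloom Filter Trie inspiration:

import hashlib

class BloomFilter:
    def __init__(self, m=10_000, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)
    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m
    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1
    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for kmer in ("ACGTACGT", "TTTTGGGA"):
    bf.add(kmer)
print("ACGTACGT" in bf, "AAAAAAAA" in bf)   # True, (almost certainly) False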




□ TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405589v1.full.pdf

The authors describe a strategy that combines several k values, each with a different p, q setting, running the 2D outlier algorithm on multiple k values and reporting the union of the results.

TAPER (Two-dimensional Algorithm for Pinpointing ERrors) takes a multiple sequence alignment as input and outputs outlier sequence positions. TAPER is able to pinpoint errors in multiple sequence alignments without removing large parts of the alignment.




□ WENGAN: Efficient hybrid de novo assembly of human genomes

>> https://www.nature.com/articles/s41587-020-00747-w

WENGAN, a hybrid genome assembler that, unlike most long-read assemblers, entirely avoids the all-versus-all read comparison, does not follow the OLC paradigm and integrates short reads in the early phases of the assembly process (short-read-first).

WENGAN starts by building short-read contigs using a de Bruijn graph assembler. Then, the pair-end reads are pseudo-aligned back to detect and error-correct chimeric contigs as well as to classify them as repeats or unique sequences.

Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges.




□ Learning interpretable latent autoencoder representations with annotations of feature sets

>> https://www.biorxiv.org/content/10.1101/2020.12.02.401182v1.full.pdf

In f-scLVM, deterministic approximate Bayesian inference based on variational methods is used to approximate the posterior over all random variables of the model.

The paper proposes a scalable alternative to f-scLVM for learning latent representations of single-cell RNA-seq data that exploit prior knowledge such as Gene Ontology, resulting in interpretable factors.




□ FastK: A K-mer counter for HQ assembly data sets

>> https://github.com/thegenemyers/FASTK

FastK is a k-mer counter that is optimized for processing high quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode.

FastK is about 2 times faster than KMC3 when counting 40-mers in a 50X HiFi data set. Its relative speedup decreases with increasing error rate or increasing values of k, but regardless is a general program that works for any DNA sequence data set and choice of k.





Andrew Carroll

>> https://github.com/google/deepvariant/releases/tag/v1.1.0

Release of DeepVariant v1.1: introducing DeepTrio, with greater accuracy for trios or duos, and pre-trained models for Illumina WGS, WES, and PacBio HiFi. Also in DV1.1 (non-trio): better speed for long reads and a 21% reduction in PacBio indel errors.




□ Coupled co-clustering-based unsupervised transfer learning for the integrative analysis of single-cell genomic data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa347/6024740

Clustering similar genomic features reduces the noise in single-cell data and facilitates transfer of knowledge across single-cell datasets.

coupleCoC builds upon the information theoretic co-clustering framework. In co-clustering, both the cells and the genomic features are simultaneously clustered.




□ GeneTerpret: a customizable multilayer approach to genomic variant prioritization and interpretation

>> https://www.biorxiv.org/content/10.1101/2020.12.04.408336v1.full.pdf

GeneTerpret platform collates data from current interpretation tools and databases, and applies a phenotype-driven query to categorize the variants identified in a given genome.

GeneTerpret improves the GVI process. GeneTerpret is encouragingly accurate when compared with expert-curated datasets in such well-established public records of clinically relevant variants as DECIPHER and ClinGen.




□ Selective Inference for Hierarchical Clustering

>> https://arxiv.org/pdf/2012.02936.pdf

a selective inference framework to test for a difference in means after any type of clustering. This framework exploits ideas from the recent literature on selective inference for regression and changepoint detection.

This framework avoids the need for bootstrap resampling and provides exact finite-sample inference for the difference in means between a single pair of estimated clusters.





□ multiGSEA: a GSEA-based pathway enrichment analysis for multi-omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03910-x

multiGSEA is a highly versatile tool for multi-omics pathway integration that minimizes previous restrictions in terms of omics layer selection and the mapping of feature IDs. Pathway definitions can be downloaded from up to 8 different pathway databases by means of the graphite package.

multiGSEA utilizes three different p value combination methods. By default, combinePvalues() will apply the Z-method or Stouffer’s method which has no bias towards small or large p values.
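
A quick scipy illustration of Stouffer's Z-method for combining p-values across omics layers; this is not the multiGSEA API (which wraps the combination in combinePvalues()), and the p-values are hypothetical:

import numpy as np
from scipy.stats import combine_pvalues

p_layers = np.array([0.04, 0.20, 0.01])   # hypothetical p-values from three omics layers
stat, p_combined = combine_pvalues(p_layers, method="stouffer")
print(stat, p_combined)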





□ Giraffe: Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit

>> https://www.biorxiv.org/content/10.1101/2020.12.04.412486v1.full.pdf

Giraffe, a new pangenome mapper that focuses on mapping to collections of aligned haplotypes. Giraffe is a short read to graph mapper designed to map to haplotypes, producing alignments embedded within a sequence graph.

The Giraffe algorithm can only find a correct mapping if the read contains instances of minimizers that exactly match minimizers in the true placement in the graph, which then form a cluster, which is then extended to produce an alignment.




□ FEATS: feature selection-based clustering of single-cell RNA-seq data

>> https://pubmed.ncbi.nlm.nih.gov/33285568/

FEATS, a univariate feature selection-based approach for clustering, which involves the selection of top informative features to improve clustering performance.

FEATS gives superior performance compared with the current tools, in terms of adjusted Rand index and estimating the number of clusters.





□ constclust: Consistent Clusters for scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2020.12.08.417105v1.full.pdf

constclust is a novel meta-clustering method based on the idea that if the data contains distinct populations which a clustering method can identify, meaningful clusters should be robust to small changes in the parameters used to derive them.

constclust finds labels which match ground truth, as does running the underlying clustering method with default parameters. constclust formalizes these operations by automatically detecting the clusters which are consistently found within contiguous regions of parameter space.




□ Prioritizing genes for systematic variant effect mapping

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1008/6029515

Missense VUS (variant of uncertain significance) collected through clinical testing were extracted from the ClinVar and Invitae databases. The first strategy ranked genes based on their unique VUS count.

The second strategy ranked genes based on their movability- and reappearance-weighted impact score(s) (MARWIS) to give extra weight to reappearing, movable VUS.

The third strategy ranked the genes by their difficulty-adjusted impact score(s) (DAIS), calculated to account for the costs associated with studying longer genes.





□ TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes

>> https://www.biorxiv.org/content/10.1101/2020.12.10.418897v1.full.pdf

ONT Direct RNA-seq has four key limitations. First, up to 30-40% of bases can be called with errors. To tolerate the sequencing errors, the dedicated aligners allow for more mismatches and thus inevitably sacrifice the accuracy of alignments.

TranscriptomeReconstructoR takes three datasets as input: i) full-length RNA-seq (e.g. ONT Direct RNA-seq) to resolve splicing patterns; ii) 5' tag sequencing (e.g. CAGE-seq) to detect TSS; iii) 3' tag sequencing (e.g. PAT-seq) to detect polyadenylation sites (PAS).





□ HiddenVis: a Hidden State Visualization Toolkit to Visualize and Interpret Deep Learning Models for Time Series Data

>> https://www.biorxiv.org/content/10.1101/2020.12.11.422030v1.full.pdf

The Hidden State Visualization Toolkit (HiddenVis) visualizes and facilitates the interpretation of sequential models for accelerometer data. HiddenVis can visualize the hidden states, match input samples with similar patterns and explore the potential relations among covariates.

The HiddenVis model is suitable for a wide range of deep-learning-based accelerometer data analyses. It can be easily extended to the visualization and analysis of other temporal data.





□ Unbiased integration of single cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2020.12.11.422014v1.full.pdf

bindSC is a single-cell data integration tool that realizes simultaneous alignment of the rows and the columns between data matrices without making approximations.

The alignment matrix derived from bi-CCA (bi-order canonical correlation analysis) can be utilized to derive in silico multiomics profiles from aligned cells. Bi-CCA outputs canonical correlation vectors (CCVs), which project cells from two datasets onto a shared latent space.
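
A compact sklearn sketch of projecting two modalities measured on the same cells onto a shared latent space with classical CCA; bindSC's bi-CCA additionally aligns features across matrices, which is not reproduced here, and the data are simulated:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                           # e.g. RNA features per cell
Y = X[:, :20] + rng.normal(scale=0.5, size=(300, 20))    # a second, related modality
cca = CCA(n_components=5)
X_c, Y_c = cca.fit_transform(X, Y)    # canonical correlation vectors (shared space)
print(X_c.shape, Y_c.shape)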




□ FFD: Fast Feature Detector

>> https://ieeexplore.ieee.org/document/9292438

Robust and accurate keypoints exist in a specific scale-space domain. The superimposition problem is formulated as a mathematical model, and a closed-form solution for multiscale analysis is derived from it.

The model is formulated via difference-of-Gaussian (DoG) kernels in the continuous scale-space domain, and it is proved that setting the scale-space pyramid’s blurring ratio and smoothness to 2 and 0.627, respectively, facilitates the detection of reliable keypoints.
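
An illustrative scipy snippet of the difference-of-Gaussians (DoG) response at the heart of such detectors, using the blurring ratio of 2 mentioned above; this is only the DoG filter on a random image, not the full FFD detector:

import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.default_rng(0).random((128, 128))   # stand-in grayscale image
sigma = 1.6
dog = gaussian_filter(image, 2 * sigma) - gaussian_filter(image, sigma)
# candidate keypoints would be local extrema of this response across scales
print(dog.shape, float(dog.min()), float(dog.max()))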




□ Cytosplore-Transcriptomics: a scalable interactive framework for single-cell RNA sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.11.421883v1.full.pdf

The two-dimensional embeddings of the HSNE hierarchy can be used to cluster and define cell populations at different levels of the hierarchy, or to visualize the expression of selected genes and metadata across cells.

Cytosplore-Transcriptomics, a framework to analyze scRNA-seq data. At its core, it uses a hierarchical, manifold preserving representation of the data that allows the inspection and annotation of scRNA-seq data at different levels of detail.






□ Macarons: Uncovering complementary sets of variants for the prediction of quantitative phenotypes

>> https://www.biorxiv.org/content/10.1101/2020.12.11.419952v1.full.pdf

Macarons takes into account the correlations between SNPs to avoid the selection of redundant pairs of SNPs in linkage disequilibrium.

Macarons features two simple, interpretable parameters to control the time/performance trade-off: The number of SNPs to be selected (k), and maximum intra-chromosomal distance (D, in base pairs) to reduce the search space for redundant SNPs.





□ TraNCE: Scalable Analysis of Multi-Modal Biomedical Data

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422781v1.full.pdf

TraNCE is a framework that automates away the difficulties of designing distributed analyses over complex biomedical data types. TraNCE is capable of outperforming the common alternative, based on “flattening” complex data structures.

TraNCE is a compilation framework that transforms declarative programs over nested collections into distributed execution plans.




□ Hapo-G, Haplotype-Aware Polishing Of Genome Assemblies

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422624v1.full.pdf

Hapo-G maintains two stacks of alignments, the first (all-ali) contains all the alignments that overlap the currently inspected base, and the second (hap-ali) contains only the read alignments that agree with the last selected haplotype.

Hapo-G selects a reference alignment and tries to use it as long as possible to polish the region where it aligns, which will minimize mixing between haplotypes.




□ AdRoit: an accurate and robust method to infer complex transcriptome composition

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422697v1.full.pdf

AdRoit, an accurate and robust method to infer transcriptome composition. The method estimates the proportions of each cell type in the compound RNA-seq data using known single cell data of relevant cell types.


AdRoit uniquely uses an adaptive learning approach to correct gene-wise bias due to differences in sequencing techniques. AdRoit also utilizes cell-type-specific genes while controlling their cross-sample variability.




□ DeMaSk: a deep mutational scanning substitution matrix and its use for variant impact prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1030/6039113

DeMaSk, an intuitive and interpretable method based only upon DMS datasets and sequence homologs that predicts the impact of missense mutations within any protein.

DeMaSk first infers a directional amino acid substitution matrix from DMS datasets and then fits a linear model that combines these substitution scores with measures of per-position evolutionary conservation and variant frequency.




□ HTSlib - C library for reading/writing high-throughput sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.16.423064v1.full.pdf

The HTSlib library is structured as follows: the media access layer is a collection of low-level system and library (libcurl, knet) functions, which facilitate access to files on different storage environments and over multiple protocols to various online storage providers.

Over the lifetime of HTSlib the cost of sequencing has decreased by approximately 100-fold with a corresponding increase in data volume.




□ TIPS: Trajectory Inference of Pathway Significance through Pseudotime Comparison for Functional Assessment of single-cell RNAseq Data

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423360v1.full.pdf

TIPS leverages the common trajectory mapping principle of pseudotime assignment to build pathway-specific trajectories from a pool of single cells.

The pseudotime values for each cell along these pathway-specific trajectories are compared to identify the processes with highest similarity to an overall trajectory. This latter source of variation may have significant ramifications on the accuracy of pseudotime alignment.




□ Minimally-overlapping words for sequence similarity search

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1054/6042707

The paper describes a simple sparse-seeding method: using seeds at the positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps; in a random sequence, minimally-overlapping words are anti-clumped.

Using increasingly long minimum-variance words, with fixed sparsity n, the sensitivity might approach that of every-nth seeding. The seed count of every-nth seeding has zero variance.




□ VCFShark: how to squeeze a VCF file

>> https://www.biorxiv.org/content/10.1101/2020.12.18.423437v1.full.pdf

VCFShark is a dedicated, fully-fledged compressor of VCF files. It significantly outperforms universal tools in terms of compression ratio; sometimes its advantage is severalfold.

VCFShark dominates over BCF, pigz, and 7z by a large margin, achieving 3- to 32-fold better compression, mainly as a result of its algorithm for compressing genotypes. The advantage over genozip, which uses a similar compression scheme for genotypes, is up to 5.5-fold for HRC.




□ A monotonicity-based gene clustering algorithm for enhancing clarity in single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423308v1.full.pdf

When clustering genes based on a monotonicity-based metric, it is important to note that uniformly expressed genes (with either very scarce dropout values or very abundant dropout values) are dangerous because they are likely to have high monotonicity values with many genes, even when a meaningful relationship may not exist.

Due to the high dimensionality of scRNA-seq data, genes with high variances, which will tend to serve as the cluster “centroids”, will tend to be well-separated.




□ scTypeR: Framework to accurately classify cell types in single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.22.424025v1.full.pdf

The advantage of scTypeR and other related tools is that the cell type’s properties are learned from a reference dataset, but the reference dataset is no longer necessary to apply the model.

scTypeR uses SVM learning models organised in a tree-like structure to improve the classification of closely related cell types. scTypeR reports classification probabilities for every cell type and reports ambiguous classification results.





□ VarSAn: Associating pathways with a set of genomic variants using network analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.22.424077v1.full.pdf

VarSAn analyzes a configurable network whose nodes represent variants, genes and pathways, using a Random Walk with Restarts algorithm to rank pathways for relevance to the given variants, and reports p-values for pathway relevance.

VarSAn ranks pathways first by their empirical p-values, which represent their connectivity to the query set, and then (to break ties) by their equilibrium probabilities, which are determined by both the connectivity and the network topology.
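
A small numpy sketch of the Random Walk with Restarts iteration on a column-normalized network, restarting from the query (variant) nodes; the network, restart probability and seed set are all made up:

import numpy as np

def random_walk_with_restarts(A, seeds, restart=0.3, tol=1e-8):
    # A: adjacency matrix; seeds: indices of query nodes
    W = A / A.sum(axis=0, keepdims=True)        # column-normalized transition matrix
    p0 = np.zeros(A.shape[0])
    p0[seeds] = 1.0 / len(seeds)
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
print(random_walk_with_restarts(A, seeds=[0]))   # equilibrium probabilities per node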





□ KATK: fast genotyping of rare variants directly from unmapped sequencing reads

>> https://www.biorxiv.org/content/10.1101/2020.12.23.424124v1.full.pdf

KATK is a fast and accurate software tool for calling variants directly from raw NGS reads. It uses predefined k-mers to retrieve only the reads of interest from the FASTQ file and calls genotypes by aligning retrieved reads locally.

KATK identifies unreliable variant calls and clearly distinguishes them in the output. KATK does not use data about known polymorphisms and has NC (No Call) as default genotype.




□ ARPIR: automatic RNA-Seq pipelines with interactive report

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03846-2

ARPIR allows the analysis of RNA-Seq data from groups undergoing different treatments, allowing multiple comparisons in a single launch, and can be used for either paired-end or single-end analysis.

Automatic RNA-Seq Pipelines with Interactive Report (ARPIR) makes a final tertiary-analysis that includes a Gene Ontology and Pathway analysis.




□ glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data

>> https://doi.org/10.1093/bioinformatics/btaa1009

glmGamPoi provides inference of Gamma-Poisson generalized linear models with the following improvements over edgeR / DESeq2. glmGamPoi is more than 5 times faster than edgeR and more than 18 times faster than DESeq2.

glmGamPoi provides a quasi-likelihood ratio test with empirical Bayesian shrinkage to identify differentially expressed genes. glmGamPoi scales sub-linearly with the number of cells, which explains the observed performance benefit.




Untitled.

2020-12-03 23:36:37 | Science News

(Photo by Shelbie Dimond)



□ UNCALLED: Targeted nanopore sequencing by real-time mapping of raw electrical signal

>> https://www.nature.com/articles/s41587-020-0731-9

UNCALLED, the Utility for Nanopore Current ALignment to Large Expanses of DNA, with the goal of mapping streaming raw signal to DNA references for targeted sequencing using ReadUntil.

A Dynamic Time Warping step was added to UNCALLED, making it a full-scale signal-to-basepair aligner. UNCALLED probabilistically considers k-mers that could be represented by the signal and then prunes the candidates based on the reference encoded within a Ferragina–Manzini index.
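
A generic numpy sketch of the Dynamic Time Warping recurrence for aligning two numeric sequences (e.g. a raw signal against an expected signal model); this is not UNCALLED's optimized implementation, and the sequences are made up:

import numpy as np

def dtw_distance(x, y):
    # classic O(len(x) * len(y)) dynamic programming for DTW
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

signal = np.array([0.1, 0.9, 1.0, 0.2, 0.1])
model = np.array([0.0, 1.0, 0.0])
print(dtw_distance(signal, model))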





□ MAGUS: Multiple Sequence Alignment using Graph Clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa992/6012350

In divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g., MAFFT), and then merged together into an alignment on the full dataset.

MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS merges the subset alignments using the Graph Clustering Merger (GCM), a new method for combining disjoint alignments.





□ Cell Layers: Uncovering clustering structure and knowledge in unsupervised single-cell transcriptomic analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.29.400614v1.full.pdf

Cell Layers is a Sankey network for the quantitative investigation of coexpression, biological processes, and cluster integrity across clustering resolutions. It enhances the interpretability of single-cell clustering by linking molecular data and cluster evaluation metrics.

The output of a multi-resolution Louvain analysis is a cell-by-resolution-parameter matrix, where values are the cluster assignments. The primary inputs to Cell Layers are this multi-resolution matrix and a cell-by-gene expression matrix.





□ DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008453

DeepPheno is a neural-network-based hierarchical multi-class, multi-label classification method. DeepPheno relies on ontologies to relate altered molecular functions and processes to their physiological consequences.

DeepPheno takes a sparse binary vector of functional annotation features and gene expression features as input and outputs phenotype annotation scores which are consistent with the hierarchical dependencies of the phenotypes.




□ Readfish enables targeted nanopore sequencing of gigabase-sized genomes

>> https://www.nature.com/articles/s41587-020-00746-x

Readfish enables targeted sequencing of gigabase genomes including depletion of host sequences as well as example methods to ensure minimum coverage depth for genomes present within a mixed population.

Readfish removes the need for complex signal mapping algorithms but does require a sufficiently performant base caller. Readfish does not rely on comparison of raw current, and so does not have to convert references into signal space as Dynamic Time Warping-based approaches do.




□ Lancet: Somatic variant analysis of linked-reads sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa888/5926970

Lancet uses a localized micro-assembly strategy to detect somatic mutation. Lancet is based on the colored de Bruijn graph assembly paradigm where tumor and normal reads are jointly analyzed within the same graph.

Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. On-the-fly repeat composition analysis and self-tuning k-mer strategy are used together to increase specificity in regions characterized by low complexity sequences.




□ MarcoPolo: a clustering-free approach to the exploration of differentially expressed genes along with group information in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.23.393900v1.full.pdf

To find informative genes without clustering, MarcoPolo exploits the bimodality of gene expression to learn the group information of the cells with respect to the expression level directly from given data.

MarcoPolo disentangles the bimodality inherent in gene expression and divides cells into two groups by maximum likelihood estimation under a mixture model. It utilizes the fact that the difference in expression patterns of a gene between two subsets of cells can be bimodal.





□ Milo: differential abundance testing on single-cell data using k-NN graphs

>> https://www.biorxiv.org/content/10.1101/2020.11.23.393769v1.full.pdf

Milo defines a set of representative neighbourhoods on the k-NN graph, where a neighbourhood is defined as the group of cells that are connected to an index cell by an edge in the graph.

Milo leverages the flexibility of generalized linear models. The detection of DA subpopulations by Milo requires a k-NN graph that reflects the true cell-cell similarities in the phenotypic manifold; a limitation shared with all DA methods that work on reduced dimensional spaces.



□ DANGO: Predicting higher-order genetic interactions

>> https://www.biorxiv.org/content/10.1101/2020.11.26.400739v1.full.pdf

DANGO is based on a self-attention hypergraph neural network and effectively predicts higher-order genetic interactions for a group of genes.

DANGO takes multiple pairwise molecular interaction networks as input and pre-trains multiple graph neural networks to generate node embeddings. Embeddings for the same node across different networks are integrated through a meta embedding learning scheme.

The Hyper-SAGNN architecture is trained with a distinct loss function to predict the attributes of hyperedges in a regression manner, different from other applications of Hyper-SAGNN. The meta embedding learning module and the Hyper-SAGNN are jointly optimized in an end-to-end fashion.





□ SPICEMIX: Integrative single-cell spatial modeling for inferring cell identity

>> https://www.biorxiv.org/content/10.1101/2020.11.29.383067v1.full.pdf

SPICEMIX (Spatial Identification of Cells using Matrix Factorization) uses latent variable modeling to express the interplay of various spatial and intrinsic factors that comprise cell identity.

SPICEMIX markedly enhances the standard NMF formulation with a graphical representation of the spatial relationship of cells to explicitly capture spatial factors.

SPICEMIX also uses a Hidden Markov Random Field as the graphical model; however, the model is significantly enhanced by integrating the NMF formulation of gene expression into each cell in the graph.




□ AMLE: Mixed logistic regression in genome-wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03862-2

The offset method consists of first estimating individual effects in a mixed logistic regression model, and then incorporating these effects as an offset in a (non-mixed) logistic regression model.

The Approximate Maximum Likelihood Estimate (AMLE) is based on a first-order approximation of the MLR, which leads to an approximation of the SNP effects. Its implementation in milorGWAS allows flexible use, with, for example, the possibility to specify a user-defined GRM matrix.





□ CCAT: Ultra-fast scalable estimation of single-cell differentiation potency from scRNA-Seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa987/6007262

CCAT (Correlation of Connectome and Transcriptome), a single-cell potency measure which can return accurate single-cell potency estimates of a million cells in minutes, a 100 fold improvement over CytoTRACE or GCS.

CCAT can be used to unambiguously identify stem- or multipotent root-states, which are necessary for inferring lineage trajectories. Having identified the root cell, CCAT next infers lineage trajectories and pseudotime using Diffusion Maps.




□ Robustifying Genomic Classifiers To Batch Effects Via Ensemble Learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa986/6007261

The philosophy behind the standard approach of merging and batch adjustment is to remove the undesired batch-associated variation from as many of the genomic features as feasible, and then use the "cleaned" data in classification as though the batch effects never existed.

The framework is based on the integration of predictions rather than that of data. This is a simpler task for prediction, as it operates in one dimension rather than many.




□ Structure learning for zero-inflated counts, with an application to single-cell RNA sequencing data

>> https://arxiv.org/pdf/2011.12044.pdf

They use the Leiden algorithm and, in order to validate the associations discovered by PC-zinb, interpret each of the communities by computing overlap with known functional gene sets in the MSigDB database.

the existence of a theoretical proof of convergence of the algorithm under suitable assumptions; an easy implementation of sparsity by a control on the number of variables in the conditional sets; invariance to feature scaling.





□ GraphUnzip: Phases an assembly graph using Hi-C data and/or long reads

>> https://github.com/nadegeguiglielmoni/GraphUnzip

GraphUnzip phases an uncollapsed assembly graph in Graphical Fragment Assembly (GFA) format. Its naive approach makes no assumption on the ploidy or the heterozygosity rate of the organism and thus can be used on highly heterozygous genomes.

GraphUnzip needs one of two inputs to work: Hi-C data (a sparse contact matrix and a fragment list in the formats output by hicstuff) or long reads (mapped to the GFA in the GAF format of GraphAligner).




□ STREME: Accurate and versatile sequence motif discovery

>> https://www.biorxiv.org/content/10.1101/2020.11.23.394619v1.full.pdf

The STREME algorithm presented here advances the state-of-the-art in ab initio motif discovery in terms of both accuracy and versatility.

STREME uses the Markov model in conjunction with the PWM when counting matches to the motif to further bias the search away from motifs that are mere artifacts of the lower-order statistics of the input sequences.
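
To make the role of the background model concrete, the sketch below scores a candidate site as a log-odds ratio between the PWM and a background distribution; for simplicity it uses a 0-order background, whereas STREME uses higher-order Markov models, and every name here is illustrative.

  # Log-odds score of a window: PWM likelihood versus background likelihood,
  # penalizing "motifs" that merely reflect low-order sequence composition.
  import numpy as np

  BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

  def log_odds(window, pwm, background):
      """pwm: (w, 4) match probabilities; background: length-4 base frequencies."""
      idx = np.array([BASE[b] for b in window])
      bg = np.asarray(background)
      return float(np.sum(np.log(pwm[np.arange(len(idx)), idx]) - np.log(bg[idx])))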




□ SC1CC: Computational cell cycle analysis of single cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.21.392613v1.full.pdf

SC1CC method enables a comprehensive analysis of the cell cycle effects that can be performed independently of cell type/functional annotation, hence avoiding hazardous manipulation of the single cell transcription data that could lead to misleading analysis results.

SC1CC reorders the leaves of the hierarchical clustering dendrogram by using the Optimal Leaf Ordering (OLO) algorithm. Performing additional leaf-node reordering is equivalent to minimizing the length of a Hamiltonian path.
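
A minimal SciPy sketch of optimal leaf ordering on a cell-by-feature matrix; the distance metric and linkage method below are illustrative choices, not necessarily those used by SC1CC.

  # Reorder dendrogram leaves so adjacent cells are maximally similar,
  # i.e. shorten the Hamiltonian path through the leaf sequence.
  from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
  from scipy.spatial.distance import pdist

  def olo_order(cells):
      d = pdist(cells, metric="correlation")
      Z = optimal_leaf_ordering(linkage(d, method="average"), d)
      return leaves_list(Z)                      # cell indices in OLO order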




□ BOSO: a novel feature selection algorithm for linear regression with high-dimensional data

>> https://www.biorxiv.org/content/10.1101/2020.11.18.388579v1.full.pdf

BOSO (Bilevel Optimization Selector Operator), a novel feature selection algorithm for linear regression, which is more accurate than Relaxed Lasso in many cases, particularly in high-dimensional datasets.

BOSO searches for the best combination of features of length K by solving a bilevel optimization problem, where the outer layer minimizes the validation error and the inner layer uses training data to minimize the loss function of the linear regression approach considered.

BOSO relies on the observation that the optimal solution of the inner problem can be written as a set of linear equations. This observation makes it possible to solve a complex bilevel optimization problem via Mixed-Integer Quadratic Programming (MIQP).
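
A minimal sketch of the inner best-subset problem as a MIQP, written with cvxpy and a big-M constraint; it requires a mixed-integer-capable solver (e.g. GUROBI or SCIP) and leaves out BOSO's outer validation-error loop, so it only illustrates the style of formulation.

  # Best-subset linear regression as a MIQP: binary z_j switches feature j on/off.
  import cvxpy as cp

  def best_subset(X, y, K, M=10.0):
      _, p = X.shape
      w = cp.Variable(p)
      z = cp.Variable(p, boolean=True)
      constraints = [w <= M * z, w >= -M * z, cp.sum(z) <= K]
      prob = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ w)), constraints)
      prob.solve()                               # needs a MIQP-capable solver installed
      return w.value, z.value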




□ eMPRess: A Systematic Cophylogeny Reconciliation Tool

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa978/5995312

eMPRess, a software program for phylogenetic tree reconciliation under the duplication-transfer-loss model that systematically addresses the problems of choosing event costs and selecting representative solutions.

Maximum parsimony reconciliation seeks to minimize the number of duplication, host transfer, and loss events weighted by their respective event costs. eMPRess also uses a variant of the Costscape Algorithm to compute and visualize the solution space.





□ gCAnno: a graph-based single cell type annotation method

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07223-4

gCAnno constructs cell type-gene bipartite graph and adopts graph embedding to obtain cell type specific genes. Then, naïve Bayes (gCAnno-Bayes) and SVM (gCAnno-SVM) classifiers are built for annotation.

gCAnno assigns cells to the closest cell types, i.e. those with the most similar expression profiles. For each cell type, gCAnno selects a set of genes with similar profiles in the embedding space.





□ A Statistical Approach to Dimensionality Reduction Reveals Scale and Structure in scRNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2020.11.18.389031v1.full.pdf

a statistical framework for characterizing the stability and variability of embedding quality by posing a point-wise metric as an Empirical Embedding Statistic.

Non-computationally, this approach may be of widespread utility in the analysis of high-dimensional biological data sets in order to detect and to assess the stability of biologically relevant structures.





□ scDesign2: an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured

>> https://www.biorxiv.org/content/10.1101/2020.11.17.387795v1.full.pdf

scDesign2 has the potential to improve the alignment of cells from multiple single-cell datasets.

scDesign2 generates more realistic synthetic data for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq) than existing simulators do.




□ Hierarchical clustering of bipartite data sets based on the statistical significance of coincidences

>> https://link.aps.org/doi/10.1103/PhysRevE.102.042304

a hierarchical clustering algorithm based on a dissimilarity between entities that quantifies the probability that the features shared by two entities are due to mere chance.

The algorithm performance is O(n²) when applied to a set of n entities, and its outcome is a dendrogram exhibiting the connections of those entities. The algorithm performs at least as well as the standard, modularity-based algorithms, with a higher numerical performance.
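
A minimal sketch of such a chance-based dissimilarity using a hypergeometric tail probability; the exact null model in the paper may differ, so this is only illustrative.

  # Probability that two entities share at least k features purely by chance,
  # given n1 and n2 features each, drawn from a pool of N possible features.
  from scipy.stats import hypergeom

  def coincidence_dissimilarity(k_shared, n1, n2, N):
      # Small values indicate sharing unlikely to be accidental (similar entities).
      return hypergeom.sf(k_shared - 1, N, n1, n2)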




□ STATegra: Multi-omics data integration - A conceptual scheme with a bioinformatics pipeline

>> https://www.biorxiv.org/content/10.1101/2020.11.20.391045v1.full.pdf

STATegra, a conceptual framework aiming to be as generic as possible for multi-omics analysis, combining machine learning component analysis, non-parametric data combination and a multi-omics exploratory analysis in a step-wise manner.

The STATegra framework provided novel genes, miRNAs, and CpG sites for the two selected cases in comparison to unimodal analyses.





□ Maximizing statistical power to detect clinically associated cell states with scPOST

>> https://www.biorxiv.org/content/10.1101/2020.11.23.390682v1.full.pdf

To approximate the specific experimental and clinical scenarios being investigated, scPOST (single-cell Power Simulation Tool) takes prototype (public or pilot) single-cell data as input and generates large numbers of single-cell datasets in silico.

a wide range of factors that potentially affect power: variation in cell state frequencies across samples, covariation and inter-sample variation in gene expression, batch variability and structure, number of cells and samples, and sequencing depth.





□ SCReadCounts: Estimation of cell-level SNVs from scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.23.394569v1.full.pdf

SCReadCounts is a method for a cell-level estimation of the sequencing read counts bearing a particular nucleotide at genomic positions of interest from barcoded scRNA-seq alignments.

SCReadCounts generates an array of outputs, including cell-SNV matrices with the absolute variant-harboring read counts, as well as cell-SNV matrices with expressed Variant Allele Fraction.





□ Integrating long-range regulatory interactions to predict gene expression using graph convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2020.11.23.394478v1.full.pdf

a graph convolutional neural network (GCNN) framework to integrate measurements probing spatial genomic organization and measurements of local regulatory factors, specifically histone modifications, to predict gene expression.

This formulation enables the model to incorporate crucial information about long-range interactions via a natural encoding of spatial interaction relationships into a graph representation. This model presents a novel setup for predicting gene expression by integrating multimodal datasets.
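
A minimal numpy sketch of one graph-convolution layer over a Hi-C-derived gene-gene adjacency matrix with histone-mark node features; it follows the standard normalized propagation rule rather than the exact architecture of the paper.

  # One GCN layer: aggregate local regulatory features over Hi-C neighbours.
  import numpy as np

  def gcn_layer(A, H, W):
      """A: (n, n) gene-gene contact adjacency; H: (n, f) node features; W: (f, f_out)."""
      A_hat = A + np.eye(A.shape[0])                       # add self-loops
      D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
      return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)   # ReLU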




□ SEPIA: Simulation-based Evaluation of Prioritization Algorithms

>> https://www.biorxiv.org/content/10.1101/2020.11.23.394890v1.full.pdf

SEPIA (Simulation-based Evaluation of PrIoritization Algorithms), a novel simulation-based framework for determining the effectiveness of prioritization algorithms.

Given a prioritization with a computed metric value for each individual, SEPIA then constructs an “optimal” prioritization by simply sorting the individuals in descending order of metric value. SEPIA computes the Kendall Tau-b rank correlation coefficient.
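
A small sketch of that evaluation step, scoring a proposed prioritization against the metric-optimal ordering with SciPy's Kendall tau (the tau-b variant in the presence of ties); variable names are illustrative.

  # Compare a proposed prioritization with the "optimal" one obtained by
  # sorting individuals in descending order of the computed metric value.
  import numpy as np
  from scipy.stats import kendalltau

  def prioritization_score(proposed_order, metric_values):
      """proposed_order: individual indices listed from highest to lowest priority."""
      order = np.asarray(proposed_order)
      pos = np.empty(len(order), dtype=int)
      pos[order] = np.arange(len(order))          # position of each individual
      tau, _ = kendalltau(pos, metric_values)     # optimal ordering: strong negative tau
      return -tau                                 # 1.0 means perfectly optimal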





□ GCViT: a method for interactive, genome-wide visualization of resequencing and SNP array data

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07217-2

GCViT can be used to identify introgressions, conserved or divergent genomic regions, pedigrees, and other features for more detailed exploration. The program can be used online or as a local instance for whole genome visualization of resequencing or SNP array data.

GCViT operates on variant call (VCF) files which have been mapped to a single reference genome assembly. GCViT performs pairwise comparisons between the comparison and reference genotypes and displays the results on a whole genome view of the reference assembly.




□ distinct: a novel approach to differential distribution analyses

>> https://www.biorxiv.org/content/10.1101/2020.11.24.394213v1.full.pdf

distinct computes the empirical cumulative distribution function (ECDF) of the individual (e.g., single-cell) measurements of each sample, and compares the ECDFs to identify changes between conditions, even when the mean is unchanged or marginally involved.

distinct is general and flexible: it targets complex changes between groups, explicitly models biological replicates within a hierarchical framework, does not rely on asymptotic theory, avoids parametric assumptions, and can be applied to arbitrary types of data.
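
The core discrepancy measure can be illustrated with a short numpy sketch that evaluates both ECDFs on a shared grid and averages their absolute difference; distinct itself works on sample-level ECDFs within a hierarchical, non-asymptotic framework, which is not reproduced here.

  # Discrepancy between the ECDFs of two groups of single-cell measurements.
  import numpy as np

  def ecdf_discrepancy(x1, x2, n_grid=25):
      grid = np.quantile(np.concatenate([x1, x2]), np.linspace(0.05, 0.95, n_grid))
      e1 = np.searchsorted(np.sort(x1), grid, side="right") / len(x1)
      e2 = np.searchsorted(np.sort(x2), grid, side="right") / len(x2)
      return np.mean(np.abs(e1 - e2))             # large values = distributions differ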





□ Bias invariant RNA-seq metadata annotation

>> https://www.biorxiv.org/content/10.1101/2020.11.26.399568v1.full.pdf

a deep-learning based domain adaptation algorithm for the automatic annotation of RNA-seq metadata.

This Domain Adaptation architecture is based on the siamese network architecture. It consists of three modules: A source mapper (SM) and bias mapper (BM) which correspond to the siamese part of the model, as well as a classification layer (CL).





□ scover: Predicting the impact of sequence motifs on gene regulation using single-cell data

>> https://www.biorxiv.org/content/10.1101/2020.11.26.400218v1.full.pdf

scover, a shallow convolutional neural network for de novo discovery of regulatory motifs and their cell type specific impact on gene expression from single cell data.

Scover is a convolutional neural network composed of a convolutional layer, a rectified linear unit (ReLU) activation layer, a global maximum pooling layer, and a fully connected layer with multiple output channels.
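
A minimal PyTorch sketch of that stated architecture; the number of motif filters, the kernel width and the number of output channels below are illustrative values, not those of the published model.

  # Shallow CNN over one-hot sequences: each convolutional filter acts as a
  # candidate motif; global max pooling keeps its best match per sequence.
  import torch
  import torch.nn as nn

  class ScoverLike(nn.Module):
      def __init__(self, n_motifs=300, motif_len=12, n_outputs=10):
          super().__init__()
          self.conv = nn.Conv1d(4, n_motifs, kernel_size=motif_len)   # 4 = A,C,G,T
          self.fc = nn.Linear(n_motifs, n_outputs)

      def forward(self, x):                        # x: (batch, 4, seq_len), one-hot
          h = torch.relu(self.conv(x))
          h = torch.amax(h, dim=2)                 # global max pooling over positions
          return self.fc(h)                        # impact scores per output channel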




□ MMseqs2: Fast and sensitive taxonomic assignment to metagenomic contigs

>> https://www.biorxiv.org/content/10.1101/2020.11.27.401018v1.full.pdf

MMseqs2 extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig’s taxonomic identity by weighted voting.

Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18x faster than state-of-the-art tools and also contains new modules for creating taxonomic reference databases as well as reporting and visualizing taxonomic assignments.




□ miqoGraph: Fitting admixture graphs using mixed-integer quadratic optimization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa988/6008687

a novel formulation of the problem using mixed-integer quadratic optimization (MIQO), where they model the problem of determining a best-fit graph topology as assignment of populations to leaf nodes of a binary tree.

miqoGraph is implemented in the Julia language with the Gurobi optimization solver. miqoGraph uses mixed-integer quadratic optimization to fit topology, drift lengths, and admixture proportions simultaneously.




□ ASCETS: Quantification of aneuploidy in targeted sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa980/6008689

ASCETS produces arm-level copy-number variant calls and arm-level weighted average log2 segment means from segmented copy number data.

ASCETS may exhibit decreased performance when using data from methods (e.g., amplicon sequencing) which interrogate an especially small amount of genomic territory.
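
A minimal pandas sketch of one of the stated outputs, the arm-level length-weighted average of log2 segment means; the column names are hypothetical and the thresholds ASCETS applies to produce discrete arm-level calls are omitted.

  # Arm-level weighted-average log2 ratio from segmented copy-number data.
  import pandas as pd

  def arm_level_means(segments: pd.DataFrame) -> pd.Series:
      """segments: one row per segment with columns 'arm', 'start', 'end', 'log2_mean'."""
      seg = segments.assign(length=lambda d: d["end"] - d["start"])
      weighted = (seg["log2_mean"] * seg["length"]).groupby(seg["arm"]).sum()
      return weighted / seg.groupby("arm")["length"].sum()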




□ A novel computational strategy for DNA methylation imputation using mixture regression model (MRM)

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03865-z

By applying MRM to an RRBS dataset from subjects w/ low versus high bone mineral density, it recovered methylation values of ~ 300 K CpGs in the promoter regions of chromosome 17 and identified some novel differentially methylated CpGs that are significantly associated with BMD.

MRM is a finite mixture regression model, so the number of clusters has to be specified. It is computationally burdensome to fit multiple MRMs and perform model selection based on the model likelihood.




□ bFMD: Balanced Functional Module Detection in Genomic Data

>> https://www.biorxiv.org/content/10.1101/2020.11.30.404038v1.full.pdf

bFMD detects sparse sets of variables within high-dimensional datasets such that interpretability may be favorable as compared to other similar methods by leveraging balance properties used in other graphical applications.

The methods bFMD and W both operate on a matrix which highlights balanced sets of variables affecting an outcome variable as a positive submatrix. bFMD most accurately identifies the set of module variables, as measured by the Hamming distance.




□ IMIX: a multivariate mixture model approach to association analysis through multi-omics data integration

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa1001/6015105

Despite the expected differences between actual individual samples, which may lead to discrepancies as illustrated with the Benjamini-Hochberg FDR method, IMIX performed well and returned robust results.




□ Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

>> https://www.biorxiv.org/content/10.1101/2020.12.01.405886v1.full.pdf

Pearson residuals produce better-quality 2D embeddings than both GLM-PCA and the square-root transform. Applying gene selection prior to dimensionality reduction reduces the computational cost of using Pearson residuals to a negligible level.
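
A short numpy sketch of a common form of analytic Pearson residuals under a negative-binomial null with fixed overdispersion; the particular theta and clipping values below are illustrative defaults rather than prescriptions.

  # Analytic Pearson residuals for a UMI count matrix (cells x genes).
  import numpy as np

  def pearson_residuals(counts, theta=100.0, clip=None):
      counts = np.asarray(counts, dtype=float)
      mu = counts.sum(axis=1, keepdims=True) @ counts.sum(axis=0, keepdims=True) / counts.sum()
      r = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)
      clip = np.sqrt(counts.shape[0]) if clip is None else clip     # default: sqrt(n_cells)
      return np.clip(r, -clip, clip)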




You ain't never been blue.

2020-12-01 22:13:39 | Science News

(Photo by Nan Goldin)




□ Signac: Multimodal single-cell chromatin analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.09.373613v1.full.pdf

Signac is designed for the analysis of single-cell chromatin data, including scATAC-seq, single-cell targeted tagmentation methods such as scCUT&Tag and scACT-seq, and multimodal datasets that jointly measure chromatin state alongside other modalities.

Signac uses Latent Semantic Indexing (LSI). LSI is scalable to large numbers of cells as it retains the data sparsity (zero counts remain zero) and uses the Singular Value Decomposition, for which there are highly optimized, fast algorithms that can run on sparse matrices.
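
As a rough Python illustration of the TF-IDF plus SVD combination that LSI refers to (Signac itself is an R package, and its exact TF-IDF variant may differ), the sketch below keeps the matrix sparse throughout; the number of components is arbitrary.

  # LSI sketch for a cells x peaks count matrix: TF-IDF weighting, then SVD.
  import numpy as np
  import scipy.sparse as sp
  from sklearn.decomposition import TruncatedSVD

  def run_lsi(counts, n_components=50):
      counts = sp.csr_matrix(counts)
      tf = counts.multiply(1.0 / counts.sum(axis=1))             # per-cell term frequency
      idf = counts.shape[0] / (1.0 + counts.getnnz(axis=0))      # inverse document frequency
      tfidf = sp.csr_matrix(tf.multiply(np.log(1.0 + idf).reshape(1, -1)))
      return TruncatedSVD(n_components=n_components).fit_transform(tfidf)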





□ lra: the Long Read Aligner for Sequences and Contigs

>> https://www.biorxiv.org/content/10.1101/2020.11.15.383273v1.full.pdf

The lra alignment approach may be used to provide additional evidence of SV calls in PacBio datasets, and provides an increase in sensitivity and specificity on ONT data with current SV detection algorithms.

lra performs an iterative refinement in which a large number of anchors from the initial minimizer search are grouped into super-fragments that are chained using SDP; once a rough alignment has been found, a new set of matches with smaller anchors is computed using the local minimizer indexes.





□ BABEL enables cross-modality translation between multi-omic profiles at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2020.11.09.375550v1.full.pdf

BABEL learns a set of neural networks that project single-cell multi-omic modalities into a shared latent representation capturing cellular state, and subsequently uses that latent representation to infer observable genome-wide phenotypes.

BABEL’s encoder and decoder networks for ATAC data are designed to focus on more biologically relevant intra-chromosomal patterns.

BABEL’s interoperable encoder/decoder modules effectively leverage paired measurements to learn a meaningful shared latent representation without the use of additional manifold alignment methods.




□ PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.11.17.387779v1.full.pdf

PseudotimeDE uses subsampling to estimate pseudotime inference uncertainty and propagates the uncertainty to its statistical test for DE gene identification.

PseudotimeDE fits NB-GAM or zero-inflated negative binomial GAM to every gene in the dataset to obtain a test statistic that indicates the effect size of the inferred pseudotime on the GE. PseudotimeDE fits a Gamma distribution or a mixture of two Gamma distributions to the null test statistics.





□ LongTron: Automated Analysis of Long Read Spliced Alignment Accuracy

>> https://www.biorxiv.org/content/10.1101/2020.11.10.376871v1.full.pdf

LongTron, a simulation of error modes for both Oxford Nanopore DirectRNA and PacBio CCS spliced-alignments.

If there are more exons in an isoform, that translates into a larger number of potential splice-site determination errors the aligner can make when aligning long reads, which often are still fragments of the full length isoform.

LongTron extends the Qtip algorithm, which also attempted to profile alignment quality/errors using a Random Forest classifier, to assign new long-read alignments to one of two error categories, a novel category, or label them as non-error.





□ ARBitR: An overlap-aware genome assembly scaffolder for linked reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa975/5995311

ARBitR: Assembly Refinement with Barcode-identity-tagged Reads. ARBitR has the advantages of performing the linkage-finding and scaffolding steps in succession in a single application.

While initially developed for 10X Chromium linked reads, ARBitR is also able to use stLFR reads, and can be adapted for any type of linked-read data.

A key feature of the ARBitR pipeline is the consideration of overlaps between the ends of linked contigs, which can decrease the number of erroneous structural variants, indels and mismatches in the resulting scaffolds and improve the assembly of transposable elements.





□ Symphony: Efficient and precise single-cell reference atlas mapping

>> https://www.biorxiv.org/content/10.1101/2020.11.18.389189v1.full.pdf

Symphony, a novel algorithm for building compressed, integrated reference atlases of cells and enabling efficient query mapping within seconds.

Symphony builds upon the same linear mixture model framework as Harmony, that localizes query cells w/ a low-dimensional reference embedding without the need to reintegrate the reference cells, facilitating the downstream transfer of many types of reference-defined annotations.





□ Extremal quantum states

>> https://avs.scitation.org/doi/full/10.1116/5.0025819

In the continuous-variable (CV) setting, quantum information is encoded in degrees of freedom with continuous spectra. Concentrating on phase-space formulations because they can be applied beyond particular symmetry groups.

Wehrl entropy, inverse participation ratio, cumulative multipolar distribution, and metrological power, which are linked to the intrinsic properties of any quantum state.





□ VarNote: Ultrafast and scalable variant annotation and prioritization with big functional genomics data

>> https://genome.cshlp.org/content/early/2020/11/17/gr.267997.120

VarNote is a tool to rapidly annotate genome-scale variants from large and complex functional annotation resources. VarNote supports both region-based and allele-specific annotations for different file formats and provides many advanced functions for flexible annotation extraction.

VarNote is equipped with a novel index system and a parallel random-sweep searching algorithm. It shows substantial performance improvements in annotating human genetic variants at different scales.




□ SCNIC: Sparse Correlation Network Investigation for Compositional Data

>> https://www.biorxiv.org/content/10.1101/2020.11.13.380733v1.full.pdf

SCNIC uses two methods: Louvain modularity maximization (LMM) and a novel shared minimum distance (SMD) module detection algorithm. the SMD algorithm aids in dimensionality reduction in 16S rRNA sequencing data while ensuring a minimum strength of association within modules.

SCNIC produces a graph modeling language (GML) format for network visualization in which the edges in the correlation network represent the positive correlations, and a feature table in the Biological Observation Matrix (BIOM) format.




□ Tensor Sketching: Fast Alignment-Free Similarity Estimation

>> https://www.biorxiv.org/content/10.1101/2020.11.13.381814v1.full.pdf

Tensor Sketch had 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster than computing the exact alignment.

While the sketching of rank-1 or super-symmetric tensors is known to admit efficient sketching, the sub-sequence tensor does not satisfy either of these properties. Tensor Sketch completely avoids the need for constructing the ambient space.





□ Proximity Measures as Graph Convolution Matrices for Link Prediction in Biological Networks

>> https://www.biorxiv.org/content/10.1101/2020.11.14.382655v1.full.pdf

GCN-based network embedding algorithms utilize a Laplacian matrix in their convolution layers as the convolution matrix, but the effect of the convolution matrix on the algorithm has not been comprehensively characterized in the context of link prediction in biomedical networks.

Deep Graph Infomax uses a single-layered GCN encoder for the convolution. Node proximity measures used as the convolution matrix in the single-layered GCN encoder deliver much better link prediction results compared to the conventional Laplacian convolution matrix in the encoder.




□ THUNDER: A reference-free deconvolution method to infer cell type proportions from bulk Hi-C data

>> https://www.biorxiv.org/content/10.1101/2020.11.12.379941v1.full.pdf

THUNDER, the Two-step Hi-C UNsupervised DEconvolution appRoach, is constructed from published single-cell Hi-C (scHi-C) data.

THUNDER estimates cell-type-specific chromatin contact profiles for all cell types in bulk Hi-C mixtures. These estimated contact profiles provide a useful exploratory framework to investigate cell-type-specificity of the chromatin interactome while data is still sparse.





□ Achieving large and distant ancestral genome inference by using an improved discrete quantum-behaved particle swarm optimization algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03833-7

an improved discrete quantum-behaved particle swarm optimization algorithm (IDQPSO), which averages two of the fitness values, is proposed to address the discrete search space.

Quantum-behaved particle swarm optimization is a stochastic searching algorithm that was inspired by the movement of particles in quantum space. The behavior of all particles is described by the quantum mechanics presented in the quantum time-space framework.




□ A Markov Random Field Model for Network-based Differential Expression Analysis of Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2020.11.11.378976v1.full.pdf

a Markov Random Field (MRF) model to appropriately accommodate gene network information and dependencies among cell types to identify cell-type specific DE genes.

The MRF model implements an Expectation-Maximization (EM) algorithm with a mean field-like approximation to estimate model parameters and a Gibbs sampler to infer DE status.




□ JPSA: Joint and Progressive Subspace Analysis With Spatial-Spectral Manifold Alignment for Semisupervised Hyperspectral Dimensionality Reduction

>> https://ieeexplore.ieee.org/document/9256351

JPSA spatially and spectrally aligns a manifold structure in each learned latent subspace in order to preserve the same or similar topological property between the compressed data and the original data.

The JPSA learns a high-level, semantically meaningful, joint spatial-spectral feature representation from hyperspectral (HS) data by jointly learning latent subspaces and a linear classifier to find an effective projection direction favorable for classification.





□ CATCaller: An End-to-end Oxford Nanopore Basecaller Using Convolution-augmented Transformer

>> https://www.biorxiv.org/content/10.1101/2020.11.09.374165v1.full.pdf

CATCaller is based on Long-Short Range Attention and a flattened FFN layer, specialized for efficient global and local feature extraction through dynamic convolution.

Dynamic convolution, built on lightweight convolution, dynamically learns a new kernel at every time step. Gated Linear Units and a fully-connected layer are deployed before/after the convolution module, and the kernel sizes are [3, 5, 7, 31×3] across the six encoder blocks.





□ A Bayesian Nonparametric Model for Inferring Subclonal Populations from Structured DNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2020.11.10.330183v1.full.pdf

a hierarchical Dirichlet process (hDP) mixture model that incorporates the correlation structure induced by a structured sampling arrangement.

a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method.




□ iSMNN: Batch Effect Correction for Single-cell RNA-seq data via Iterative Supervised Mutual Nearest Neighbor Refinement

>> https://www.biorxiv.org/content/10.1101/2020.11.09.375659v1.full.pdf

iSMNN, an iterative supervised batch effect correction method that performs multiple rounds of MNN refining and batch effect correction instead of one step correction with the MNN detected from the original expression matrix.

The number of iterations of iSMNN mainly depends on the magnitude and complexity of batch effects. Larger and more complex batch effects usually require more iterations. iSMNN achieved optimal performance with only one round of correction.




□ FASTAFS: file system virtualisation of random access compressed FASTA files

>> https://www.biorxiv.org/content/10.1101/2020.11.11.377689v1.full.pdf

FASTAFS uses a virtual layer to (random access) TwoBit/FourBit compression that provides read-only access to a FASTA file and the guaranteed in-sync FAI, DICT and 2BIT files, through a FUSE file system layer.

FASTAFS guarantees in-sync virtualised metadata files and offers fast random-access decompression using Zstd-seekable.





□ accuEnhancer: Accurate enhancer prediction by integration of multiple cell type data with deep learning

>> https://www.biorxiv.org/content/10.1101/2020.11.10.375717v1.full.pdf

accuEnhancer uses joint training of multiple cell types to boost the model performance in predicting the enhancer activities of an unstudied cell type.

accuEnhancer utilized the pre-trained weights from deepHaem, which predicts chromatin features from DNA sequence, to assist the model training process.





□ D-EE: Distributed software for visualizing intrinsic structure of large-scale single-cell data

>> https://academic.oup.com/gigascience/article/9/11/giaa126/5974979

D-EE, a distributed optimization implementation of the EE algorithm, termed distributed elastic embedding.

D-TSEE, a distributed optimization implementation of time-series elastic embedding, can reveal dynamic gene expression patterns, providing insights for subsequent analysis of molecular mechanisms and dynamic transition progression.




□ Hybrid Clustering of single-cell gene-expression and cell spatial information via integrated NMF and k-means

>> https://www.biorxiv.org/content/10.1101/2020.11.15.383281v1.full.pdf

scHybridNMF (single-cell Hybrid Nonnegative Matrix Factorization), which performs cell type identification by incorporating single cell gene expression data with cell location data.

scHybridNMF combines two classical methods, nonnegative matrix factorization with a k-means clustering scheme, to respectively represent high-dimensional gene expression data and low-dimensional location data together.




□ Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

>> https://www.biorxiv.org/content/10.1101/2020.11.14.382713v1.full.pdf

Set-Min sketch, a new probabilistic data structure capable of representing k-mer count information in small space and with small errors. The expected cumulative error obtained when querying all k-mers of the dataset can be bounded by εN, where N is the number of all k-mers.

Count-Min sketch is a sketching technique for memory efficient representation of high-dimensional vectors. Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, much lower than a Count-Min sketch of similar dimensions.
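
A deliberately simplified Python sketch of the Set-Min idea, in which each bucket stores the set of count values hashed into it and a query intersects the buckets, falling back to the minimum if the intersection is ambiguous; the hashing and sizing below are ad hoc rather than those of the actual data structure.

  # Simplified Set-Min sketch for k-mer counts.
  import hashlib

  class SetMinSketch:
      def __init__(self, depth=4, width=1 << 16):
          self.depth, self.width = depth, width
          self.table = [[set() for _ in range(width)] for _ in range(depth)]

      def _bucket(self, row, kmer):
          digest = hashlib.blake2b(f"{row}:{kmer}".encode()).digest()
          return int.from_bytes(digest[:8], "little") % self.width

      def add(self, kmer, count):
          for row in range(self.depth):
              self.table[row][self._bucket(row, kmer)].add(count)

      def query(self, kmer):
          candidates = set(self.table[0][self._bucket(0, kmer)])
          for row in range(1, self.depth):
              candidates &= self.table[row][self._bucket(row, kmer)]
          return min(candidates) if candidates else 0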





□ ABACUS: A flexible UMI counter that leverages intronic reads for single-nucleus RNAseq analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.13.381624v1.full.pdf

Abacus, a flexible UMI counter software for sNuc-RNAseq analysis. Abacus draws extra information from sequencing reads mapped to introns of pre-mRNAs (~60% of total data) that are ignored by many single-cell RNAseq analysis pipelines.

Abacus parses CellRanger-derived BAM files and extracts the barcodes and corrected UMI sequences from aligned reads, then summarizes UMI counts from intronic and exonic reads in the forward and reverse directions for each gene.




□ Arioc: High-concurrency short-read alignment on multiple GPUs

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008383

Arioc benefits specifically from larger GPU device memory and high-bandwidth peer-to-peer (P2P) memory-access topology among multiple GPUs.

Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run–over 500 million 150nt paired-end reads–in less than 15 minutes.




□ kTWAS: integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa270/5985285

kernel methods such as sequence kernel association test (SKAT) model genotypic and phenotypic variance use various kernel functions that capture genetic similarity between subjects, allowing nonlinear effects to be included.

kTWAS, a novel method called kernel-based TWAS that applies TWAS-like feature selection to a SKAT-like kernel association test, combining the strengths of both approaches.




□ Venice: A new algorithm for finding marker genes in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2020.11.16.384479v1.full.pdf

Venice outperforms all compared methods, including Seurat, ROTS, scDD, edgeR, MAST, limma, the normal t-test, Wilcoxon and Kolmogorov–Smirnov tests. It therefore enables interactive analysis for large single-cell data sets in BioTuring Browser.

Venice devises a new metric to classify genes into up/down-regulated genes: a gene is up-regulated in group 1 iff, for every p ∈ (0, 1), the p-quantile of its expression in group 1 is higher than the p-quantile of its expression in group 2, and vice versa for down-regulated genes.
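
A small numpy sketch of checking this quantile-dominance criterion on a grid of p values; Venice's actual test statistic and its significance computation are not reproduced here.

  # Does expression in group 1 dominate group 2 quantile-by-quantile?
  import numpy as np

  def is_up_regulated(expr_g1, expr_g2, n_grid=99):
      ps = np.linspace(0.01, 0.99, n_grid)
      q1, q2 = np.quantile(expr_g1, ps), np.quantile(expr_g2, ps)
      # Weak dominance across the grid, with strict dominance somewhere.
      return bool(np.all(q1 >= q2) and np.any(q1 > q2))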





□ MegaGO: a fast yet powerful approach to assess functional similarity across meta-omics data sets

>> https://www.biorxiv.org/content/10.1101/2020.11.16.384834v1.full.pdf

Comparing large sets of GO terms is not an easy task due to the deeply branched nature of GO, which limits the utility of exact term matching.

MegaGO relies on semantic similarity between GO terms to compute functional similarity between two data sets. MegaGO allows the comparison of functional annotations derived from DNA, RNA, or protein based methods as well as combinations thereof.




□ Celda: A Bayesian model to perform bi-clustering of genes into modules and cells into subpopulations using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.16.373274v1.full.pdf

Celda - Cellular Latent Dirichlet Allocation, a novel discrete Bayesian hierarchical model to simultaneously perform bi-clustering of genes into modules and cells into subpopulations.

Celda can also quantify the relationship between different levels in a biological hierarchy by determining the contribution of each gene in each module, each module in each cell population, and each cell population in each sample.





□ WEVar: a novel statistical learning framework for predicting noncoding regulatory variants

>> https://www.biorxiv.org/content/10.1101/2020.11.16.385633v1.full.pdf

“Context-free” WEVar is used to predict functional noncoding variants from unknown or heterogeneous context. “Context-dependent” WEVar can further improve the functional prediction when the variants come from the same context in both training and testing set.

WEVar directly integrates the precomputed functional scores from representative scoring methods. It will maximize the usage of integrated methods by automatically learning the relative contribution of each method and produce an ensemble score as the final prediction.





□ CLIMB: High-dimensional association detection in large scale genomic data

>> https://www.biorxiv.org/content/10.1101/2020.11.18.388504v1.full.pdf

CLIMB (Composite LIkelihood eMpirical Bayes) provides a generic framework facilitating a host of analyses, such as clustering genomic features sharing similar condition-specific patterns and identifying which of these features are involved in cell fate commitment.

CLIMB allows us to tractably estimate which latent association vectors are likely to be present in the data. CLIMB is motivated by the observation that the true number of latent classes, each described by a different association vector, cannot be greater than the sample size.




□ Adyar-RS: An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03738-5

Adyar-RS, a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics.

The Adyar-RS algorithm performs both forward and backward extensions to identify a k-mismatch common substring of longer length. Adyar-RS shows considerable improvement over kmacs for longer full genomes that are a few hundred megabases long.




□ Clover: a clustering-oriented de novo assembler for Illumina sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03788-9

Clover integrates the flexibility of the overlap-layout-consensus approach and provides multiple operations based on spectrum, structure and their combination for removing spurious edges from the de Bruijn graph.

Clover constructs a Hamming graph in which it links each pair of k-mers as an edge if the Hamming distance of the pair of k-mers is ≤ p. To accelerate the process, Clover utilizes the indexing technique that partitions a k-mer into (p + 1) substrings.





□ RowDiff: Using Genome Graph Topology to Guide Annotation Matrix Sparsification

>> https://www.biorxiv.org/content/10.1101/2020.11.17.386649v1.full.pdf

RowDiff can be constructed in linear time relative to the number of nodes and labels in the graph, and the construction can be efficiently parallelized and distributed, significantly reducing construction time.

RowDiff can be viewed as an intermediary sparsification step of the initial annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrix representation.




□ Universal annotation of the human genome through integration of over a thousand epigenomic datasets

>> https://www.biorxiv.org/content/10.1101/2020.11.17.387134v1.full.pdf

a large-scale application of the stacked modeling approach with more than a thousand human epigenomic datasets as input, using a version of ChromHMM with enhanced scalability.

the full-stack ChromHMM model directly differentiates constitutive from cell-type-specific activity and is more predictive of locations of external genomic annotations.





□ I-Impute: a self-consistent method to impute single cell RNA sequencing data

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07007-w

I-Impute leverages continuous similarities and dropout probabilities and refines the data iteratively to make the final output "self-consistent". I-Impute exhibits robust imputation ability and follows the “self-consistency” principle.

I-Impute optimizes continuous similarities and dropout probabilities, in iterative refinements until a self-consistent imputation is reached. I-Impute exhibited the highest Pearson correlations for different dropout rates consistently compared with SAVER and scImpute.





□ PCQC: Selecting optimal principal components for identifying clusters with highly imbalanced class sizes in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.19.390542v1.full.pdf

Existing methods for selecting the top principal components, such as a scree plot, are typically biased towards selecting principal components that only describe larger clusters, as the eigenvalues typically scale linearly with the size of the cluster.

PCQC (Principal Component Quantile Check) criteria, a computationally efficient methodology for identifying the optimal principal components based on the tails of the distribution of variance explained for each observation.