lens, align.

Long is the time, but the true comes to pass.

Silentium.

2020-12-24 22:13:36 | Science News




□ pythrahyper_net: Biological network growth in complex environments: A computational framework

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008003

The properties of complex networks are often calculated with respect to canonical random graphs such as the Erdős-Rényi or Watts-Strogatz model. These models are defined by the topological structure in an adjacency matrix, but usually neglect spatial constraints.

pythrahyper_net is a probabilistic agent-based model that describes individual growth processes as biased, correlated random motion. It is a computational framework based on directional statistics that models network formation in space and time under arbitrary spatial constraints.

Probability distributions are modeled as multivariate Gaussian distributions (MGD), with mean and covariance determined from the discrete simulation grid. Individual MGDs are combined by convolution, transformed to spherical coordinates, and projected onto the unit sphere.
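The combination step can be illustrated with a small sketch. It relies on the standard fact that the convolution of two Gaussian densities is again Gaussian, with means and covariances added. The function names and the example values are hypothetical and are not taken from pythrahyper_net, and only the direction of the combined mean is mapped to spherical angles, which is a simplification of projecting the full distribution onto the unit sphere.

```python
import math

def convolve_gaussians(mean_a, cov_a, mean_b, cov_b):
    """Convolution of two Gaussian densities: means add, covariances add."""
    mean = [a + b for a, b in zip(mean_a, mean_b)]
    cov = [[a + b for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(cov_a, cov_b)]
    return mean, cov

def direction_angles(vec):
    """Polar angle theta and azimuth phi of a non-zero 3D vector."""
    x, y, z = vec
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r)   # angle from the +z axis
    phi = math.atan2(y, x)     # angle in the x-y plane
    return theta, phi

# Two illustrative isotropic Gaussians (values are made up).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cov_a = [[0.5 * v for v in row] for row in identity]
cov_b = [[0.25 * v for v in row] for row in identity]

mean, cov = convolve_gaussians([1.0, 0.0, 0.0], cov_a, [0.0, 1.0, 0.0], cov_b)
theta, phi = direction_angles(mean)
```

Here the combined mean is (1, 1, 0), so the preferred direction lies in the x-y plane at 45 degrees between the two input means.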





□ NanosigSim: Simulation of Nanopore Sequencing Signals Based on BiGRU

>> https://www.mdpi.com/1424-8220/20/24/7244

NanosigSim is a signal simulation method based on Bi-directional Gated Recurrent Units (BiGRU). The NanosigSim signal processing model has a novel architecture that couples a three-layer BiGRU and a fully connected layer.

NanosigSim can model the relation between ground-truth signal and real-world sequencing signal through experimental data to accurately filter out the useless high-frequency components. This process can be achieved by using Continuous Wavelet Dynamic Time Warping.





□ Induced and higher-dimensional stable independence

>> https://arxiv.org/pdf/2011.13962v1.pdf

The paper studies stable independence in the context of accessible categories. This notion has its origins in the model-theoretic concept of stable nonforking, which can be thought of on the one hand as a freeness property of type extensions, and on the other as a notion of freeness or independence for amalgams of models.

Given an (n+1)-dimensional stable independence notion Γ_{n+1} = (Γ_n, Γ), define K_{Γ_{n+1}} to be the category (K_{Γ_n})_Γ, whose objects are morphisms of K_{Γ_n} and whose morphisms are the Γ-independent squares. An independence notion ⌣ is λ-accessible, for λ an infinite regular cardinal, if the category K_↓ is λ-accessible.

A stable independence notion immediately yields higher-dimensional independence. Taken to its logical conclusion, this leads to a formulation of stable independence as a property of commutative squares in a general category, described by a family of purely category-theoretic axioms.





□ Characterisations of Variant Transfinite Computational Models: Infinite Time Turing, Ordinal Time Turing, and Blum-Shub-Smale machines

>> https://arxiv.org/pdf/2012.08001.pdf

The paper uses admissibility theory, Σ2-codes and Π3-reflection properties in the constructible hierarchy to classify the halting times of ITTMs with multiple independent heads, and does the same for Ordinal Turing Machines, which have On-length tapes.

Infinite Time Blum-Shub-Smale machines (IBSSMs) have a universality property. This is because ITTMs do, and the two classes of machine are ‘bi-simulable’. This is in contradistinction to the machine using a ‘continuity’ rule, which fails to be universal.




□ Linear space string correction algorithm using the Damerau-Levenshtein distance

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3184-8

The paper presents linear space algorithms to compute the Damerau-Levenshtein (DL) distance between two strings and determine the optimal trace. Lowrance and Wagner developed an O(mn) time, O(mn) space algorithm to find the minimum cost edit sequence between strings of length m and n.

The linear space algorithms use a refined dynamic programming recurrence. The more general algorithm for string correction using the Damerau-Levenshtein distance runs in O(mn) time and uses O(s∗min{m,n}+m+n) space.
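For orientation, here is a minimal sketch of the classic quadratic-space dynamic program for the unrestricted Damerau-Levenshtein distance with unit costs. It is the textbook O(mn) time, O(mn) space baseline, not the linear space algorithm from the paper, and the function name is hypothetical.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance with unit edit costs."""
    m, n = len(a), len(b)
    sentinel = m + n  # larger than any real distance
    # The table has an extra sentinel row and column at index 0.
    d = [[sentinel] * (n + 2) for _ in range(m + 2)]
    for i in range(m + 1):
        d[i + 1][1] = i
    for j in range(n + 1):
        d[1][j + 1] = j

    last_row = {}  # last 1-based row in which each character of a was seen
    for i in range(1, m + 1):
        last_match_col = 0  # last column in this row where a[i-1] matched b
        for j in range(1, n + 1):
            row = last_row.get(b[j - 1], 0)
            col = last_match_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_match_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,      # match or substitution
                d[i + 1][j] + 1,     # insertion
                d[i][j + 1] + 1,     # deletion
                # transposition, plus the edits between the swapped characters
                d[row][col] + (i - row - 1) + 1 + (j - col - 1),
            )
        last_row[a[i - 1]] = i
    return d[m + 1][n + 1]
```

For example, "ca" and "abc" are at distance 2 (transpose to "ac", then insert "b"), which is the case that separates this distance from the restricted optimal string alignment variant.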




□ The Diophantine problem in finitely generated commutative rings

>> https://arxiv.org/pdf/2012.09787.pdf

The paper studies the Diophantine problem, denoted D(R), in infinite finitely generated commutative associative unitary rings R.

There is a polynomial time algorithm that, for a given finite system S of polynomial equations with coefficients in O, constructs a finite system S ̊ of polynomial equations with coefficients in R such that S has a solution in O if and only if S ̊ has a solution in R.




□ MathFeature: Feature Extraction Package for Biological Sequences Based on Mathematical Descriptors

>> https://www.biorxiv.org/content/10.1101/2020.12.19.423610v1.full.pdf

MathFeature provides 20 approaches based on several studies found in the literature, e.g., multiple numeric mappings, genomic signal processing, chaos game theory, entropy, and complex networks.

Various studies have applied concepts from information theory for sequence feature extraction, mainly Shannon’s entropy. Other entropy-based measures have also been explored successfully, e.g., Tsallis entropy, proposed to generalize the traditional Boltzmann/Gibbs entropy.
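As a small illustration of the two entropy measures, the sketch below computes them from k-mer frequencies of a sequence. The helper names are hypothetical and this is not MathFeature's code. The Tsallis entropy of order q is taken as (1 - Σ p_i^q) / (q - 1).

```python
import math
from collections import Counter

def kmer_probabilities(seq, k):
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()]

def shannon_entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def tsallis_entropy(probs, q):
    """Tsallis entropy of order q (q must differ from 1)."""
    if q == 1:
        raise ValueError("q must differ from 1")
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

probs = kmer_probabilities("ACGT", 1)   # four equally frequent 1-mers
h_shannon = shannon_entropy(probs)      # 2.0 bits
h_tsallis = tsallis_entropy(probs, 2)   # 0.75
```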




□ Megadepth: efficient coverage quantification for BigWigs and BAMs

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423317v1.full.pdf

Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor.

Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19,000 GTExV8 BigWig files in approximately one hour using 32 threads.

Megadepth can be configured to use multiple HTSlib threads for reading BAMs, speeding up block-gzip decompression.

Megadepth allocates a per-base counts array across the entirety of the current chromosome before processing the alignments from that chromosome.
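The idea of a chromosome-wide per-base counts array can be sketched as follows. This is an assumption-laden toy, not Megadepth's implementation: alignments are reduced to half-open 0-based (start, end) intervals, a difference array is used to fill the counts, and the function names are hypothetical.

```python
def per_base_coverage(chrom_length, alignments):
    """Coverage at every base of one chromosome.

    alignments: iterable of half-open, 0-based (start, end) intervals.
    """
    diff = [0] * (chrom_length + 1)   # one array for the whole chromosome
    for start, end in alignments:
        diff[start] += 1              # coverage rises at the start
        diff[end] -= 1                # and falls just past the end
    coverage = []
    depth = 0
    for delta in diff[:chrom_length]:
        depth += delta
        coverage.append(depth)
    return coverage

def summarize(coverage, intervals):
    """Sum of per-base coverage within each (start, end) interval."""
    return [sum(coverage[start:end]) for start, end in intervals]

coverage = per_base_coverage(10, [(0, 5), (3, 8)])
totals = summarize(coverage, [(0, 4), (4, 10)])
```

Allocating the array once per chromosome trades memory proportional to chromosome length for constant-time updates per alignment.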





□ VEGA: Biological network-inspired interpretable variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423310v1.full.pdf

VEGA (VAE Enhanced by Gene Annotations) is a novel sparse Variational Autoencoder architecture whose decoder wiring is inspired by a priori characterized biological abstractions, providing direct interpretability to the latent variables.

Composed of a deep non-linear encoder and a masked linear decoder, VEGA encodes single-cell transcriptomics data in an interpretable latent space specified a priori.
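A masked linear decoder can be sketched in a few lines. The mask, weights and latent values below are made up for illustration, and the function name is hypothetical; the point is only that a binary gene-set mask zeroes every weight between a latent variable and genes outside its annotated module.

```python
def masked_linear_decode(latent, weights, mask):
    """Linear decoder whose weights are gated by a binary annotation mask.

    latent:  activities of the latent variables (one per gene module)
    weights: n_modules x n_genes decoder weights
    mask:    n_modules x n_genes, 1 if the gene belongs to the module
    """
    n_genes = len(weights[0])
    return [
        sum(latent[m] * weights[m][g] * mask[m][g] for m in range(len(latent)))
        for g in range(n_genes)
    ]

# Two modules, three genes; gene 1 is annotated to both modules.
mask = [[1, 1, 0],
        [0, 1, 1]]
weights = [[0.5, 1.0, 2.0],
           [3.0, 0.5, 1.0]]
decoded = masked_linear_decode([2.0, 4.0], weights, mask)
```

Because the decoder is linear and sparse, each latent variable can only influence the genes of its own module, which is what makes the latent space readable.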




□ Sfaira accelerates data and model reuse in single cell genomics

>> https://www.biorxiv.org/content/10.1101/2020.12.16.419036v1.full.pdf

Sfaira uses a size factor-normalised, but otherwise non-processed, feature space for models, so that all genes can contribute to embeddings and classification, and the contribution of all genes can be dissected without the issue of removing low variance features.

Sfaira automates exploratory analysis of single-cell data. It allows fitting of cell type classifiers for data sets with different levels of annotation granularity by using cell type ontologies, and allows streamlined training of embedding models across whole atlases.




□ Methrix: Systematic aggregation and efficient summarization of generic bedGraph files from Bisulfite sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa1048/6042753

Core functionality of Methrix includes a comprehensive bedGraph reader, which summarizes methylation calls based on annotated reference indices, infers and collapses strands, and handles uncovered reference CpG sites while facilitating a flexible input file format specification.

Methrix enriches established WGBS workflows by bringing together computational efficiency and versatile functionality.





□ Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

>> https://arxiv.org/pdf/2010.10055.pdf

The paper presents a sparse linear algebra centric approach for distributed memory parallelization of the overlap and layout phases, formulating overlap detection as a distributed Sparse General Matrix Multiply.

Sparse matrix-matrix multiplication allows diBELLA to efficiently parallelize the computation without losing expressiveness, thanks to the semiring abstraction. The paper also introduces a novel distributed memory algorithm for the transitive reduction of the overlap graph.
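The overlap-as-matrix-product formulation can be sketched on a single machine. Assume a sparse binary matrix A with A[read, k-mer] = 1 when the read contains the k-mer; then the entries of A·Aᵀ count the k-mers shared by each pair of reads. The code below is a toy using the ordinary (+, ×) semiring and Python dictionaries; it is not diBELLA's distributed implementation, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def kmer_set(seq, k):
    """Distinct k-mers of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_counts(reads, k):
    """Upper triangle of A·Aᵀ, stored as {(i, j): shared k-mer count}."""
    columns = defaultdict(list)           # k-mer -> read ids (a column of A)
    for read_id, seq in enumerate(reads):
        for kmer in kmer_set(seq, k):
            columns[kmer].append(read_id)

    product = defaultdict(int)
    for read_ids in columns.values():
        # Every pair of reads in the same column contributes 1 to (A·Aᵀ)[i, j].
        for i, j in combinations(sorted(read_ids), 2):
            product[(i, j)] += 1
    return dict(product)

overlaps = shared_kmer_counts(["ACGTAC", "GTACGG", "TTTTTT"], 3)
```

Pairs with a non-zero entry are candidate overlaps; swapping the (+, ×) operations for another semiring is what lets the same product carry richer information, such as k-mer positions.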





□ SOMDE: A scalable method for identifying spatially variable genes with self-organizing map

>> https://www.biorxiv.org/content/10.1101/2020.12.10.419549v1.full.pdf

SOMDE is an efficient method for identifying SVgenes in large-scale spatial expression data. SOMDE uses a self-organizing map (SOM) to cluster neighboring cells into nodes, and then uses a Gaussian Process to fit the node-level spatial gene expression to identify SVgenes.

SOMDE converts the original spatial gene expression to node-level gene meta-expression profiles. SOMDE models the condensed representation of the original spatial transcriptome data with a modified Gaussian process to quantify the relative spatial variability.




□ LISA: Learned Indexes for DNA Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.22.423964v1.full.pdf

LISA (Learned Indexes for Sequence Analysis) accelerates two of the most essential flavors of DNA sequence search—exact search and super-maximal exact match (SMEM) search.

LISA achieves 13.3× higher throughput than the Trans-Omics Acceleration Library (TAL). SMEM search finds, for every position in the read, exact matches of the longest substring of the read that passes through that position and still has a match in the reference sequence.
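The general learned-index idea can be sketched as follows: a model predicts where a key sits in a sorted array, and a bounded local search corrects the prediction. This is a generic toy with a single linear model over distinct integer keys, not LISA's data structure; the class name and example keys are hypothetical.

```python
import bisect

class LinearLearnedIndex:
    """Toy learned index over a sorted list of distinct numeric keys."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        lo, hi = sorted_keys[0], sorted_keys[-1]
        # Fit a line through the first and last key.
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Worst-case prediction error, measured on the indexed keys.
        self.max_error = max(
            abs(self._predict(key) - pos) for pos, key in enumerate(sorted_keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        guess = self._predict(key)
        left = max(0, guess - self.max_error)
        right = min(len(self.keys), guess + self.max_error + 1)
        # Binary search only inside the error window around the guess.
        pos = bisect.bisect_left(self.keys, key, left, right)
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return None

keys = [2, 3, 5, 8, 13, 21, 34]
index = LinearLearnedIndex(keys)
```

The search window has width 2·max_error + 1 regardless of the array size, so a well-fitting model replaces most of the binary search with one arithmetic prediction.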





□ EVE: Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning

>> https://www.biorxiv.org/content/10.1101/2020.12.21.423785v1.full.pdf

EVE (Evolutionary model of Variant Effect) learns a distribution over amino acid sequences from evolutionary data. It enables the computation of the evolutionary index. A global-local mixture of Gaussian Mixture Models separates variants into benign and pathogenic clusters based on that index.

EVE reflects the probabilistic assignment to either pathogenic or benign clusters. The probabilistic nature of the model enables us to quantify the uncertainty on this cluster assignment, so that variants can be binned as Benign or Pathogenic while some variants are assigned as Uncertain.






□ CellVGAE: An unsupervised scRNA-seq analysis workflow with graph attention networks

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423645v1.full.pdf

CellVGAE uses the connectivity between cells (e.g. k-nearest neighbour graphs or KNN) with gene expression values as node features to learn high-quality cell representations in a lower-dimensional space, with applications in downstream analyses like (density-based) clustering.

CellVGAE leverages the connectivity between cells, represented as a graph, to perform convolutions on a non-Euclidean structure, thus subscribing to the geometric deep learning paradigm.





□ Cytopath: Simulation based inference of differentiation trajectories from RNA velocity fields

>> https://www.biorxiv.org/content/10.1101/2020.12.21.423801v1.full.pdf

Because Cytopath is based upon transitions that use the full expression and velocity profiles of cells, it is less prone to projection artifacts distorting expression profile similarity.

The objective of trajectory inference is to estimate trajectories from root to terminal state. Simulations sharing a common terminal state are aligned using Dynamic Time Warping. Root / terminal states can be derived from a Markov random-walk model utilizing the transition probability matrix.
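Dynamic Time Warping itself is a short dynamic program. The sketch below aligns two one-dimensional sequences with absolute difference as the local cost; this is a simplifying assumption, since trajectories in expression space would need a multi-dimensional distance. The function name is hypothetical and this is not Cytopath's code.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its element
                cost[i][j - 1],      # b advances, a repeats its element
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

For instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0, because the repeated 2 is absorbed by the warping, whereas a point-by-point comparison could not even pair sequences of different length.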





□ GCNG: graph convolutional networks for inferring gene interaction from spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02214-w

The GCNG model uses spatial single-cell expression data. A binary cell adjacency matrix and an expression matrix are extracted from the spatial data. After normalization, both matrices are fed into the graph convolutional network.

GCNG consists of two graph convolutional layers, one flatten layer, one 512-dimension dense layer, and one sigmoid function output layer for classification.





□ GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

>> https://arxiv.org/pdf/1908.01407.pdf

GraphBLAST has on average at least an order of magnitude speedup over the previous GraphBLAS implementations SuiteSparse and GBTL, and comparable performance to the fastest GPU hardwired primitives and the shared-memory graph frameworks Ligra and Gunrock.

Currently, direction-optimization is only active for matrix-vector multiplication. However, in the future, the optimization can be extended to matrix-matrix multiplication.





□ DipAsm: Chromosome-scale, haplotype-resolved assembly of human genomes

>> https://www.nature.com/articles/s41587-020-0711-0

DipAsm uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day.

A potential solution is to retain heterozygous events in the initial assembly graph and to scaffold and dissect these events later to generate a phased assembly.

DipAsm accurately reconstructs the two haplotypes in a diploid individual using only PacBio’s long high-fidelity (HiFi) reads and Hi-C data, both at ~30-fold coverage, without any pedigree information.





□ Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

>> https://www.nature.com/articles/s41587-020-0719-5

A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.

The authors examined the whole major histocompatibility complex (MHC) region and found that it was traversed by a single contig in both haplotype assemblies.





□ Characterizing finitely generated fields by a single field axiom

>> https://arxiv.org/pdf/2012.01307v1.pdf

The Elementary Equivalence versus Isomorphism Problem, for short EEIP, asks whether the elementary theory Th(K) of a finitely generated field K (always in the language of rings) encodes the isomorphism type of K in the class of all finitely generated fields.

Every field K is elementarily equivalent to its “constant field” κ (the relative algebraic closure of the prime field in K), and its first-order theory is decidable.

The paper is concerned with fields which are at the centre of (birational) arithmetic geometry, namely the finitely generated fields K, which are the function fields of integral Z-schemes of finite type.





□ PySCNet: A tool for reconstructing and analyzing gene regulatory network from single-cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2020.12.18.423482v1.full.pdf

PySCNet integrates competitive gene regulatory construction methodologies for cell specific or trajectory specific gene regulatory networks (GRNs) and allows for gene co-expression module detection and gene importance evaluation.

PySCNet uses Louvain clustering to detect gene co-expression modules. Node centrality is applied to estimate the importance of a gene / TF in the network. To discover hidden regulating links of a target gene node, graph traversal is utilized to predict indirect regulations.





□ SCMER: Single-Cell Manifold Preserving Feature Selection

>> https://www.biorxiv.org/content/10.1101/2020.12.01.407262v1.full.pdf

SCMER is a novel unsupervised approach which performs UMAP-style dimensionality reduction by selecting a compact set of molecular features with definitive meanings.

A manifold defined by pairwise cell similarity scores sufficiently represents the complexity of the data, encoding both the global relationship between cell groups and the local relationship within cell groups.

While clusters usually reflect distinct cell types, continuums reflect similar cell types and trajectory of transitioning/differentiating cell states. SCMER selects optimal features that preserve the manifold and retain inter- and intra-cluster diversity.

SCMER does not require clusters or trajectories, and thereby circumvents the associated biases. It is sensitive to detect diverse features that delineate common and rare cell types, continuously changing cell states, and multicellular programs shared by multiple cell types.

If a dataset with n cells is separated into b batches, the space complexity will reduce from O(n^2) to O(b * (n/b)^2) = O(n^2 / b).
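The arithmetic behind the batching argument can be checked directly. The sketch assumes equally sized batches and counts entries of the pairwise similarity matrices; the function name and the cell counts are illustrative only.

```python
def pairwise_entries(n_cells, n_batches=1):
    """Total entries in the per-batch pairwise similarity matrices,
    assuming n_cells is split into n_batches equally sized batches."""
    per_batch = n_cells // n_batches
    return n_batches * per_batch ** 2

full = pairwise_entries(10_000)           # one n x n matrix
batched = pairwise_entries(10_000, 10)    # ten (n/b) x (n/b) matrices
```

With n = 10,000 and b = 10, the full matrix holds 100,000,000 entries and the batched version 10,000,000, a reduction by exactly the factor b.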

The Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm solves the l1-regularized regression problem by introducing pseudo-gradients and restricting the optimization to an orthant without discontinuities in the gradient.





□ A Scalable Optimization Mechanism for Pairwise based Discrete Hashing

>> https://ieeexplore.ieee.org/document/9280410

The paper proposes a novel alternative optimization mechanism that reformulates one typical quartic problem, in terms of hash functions in the original objective of Kernel-based Supervised Hashing, into a linear problem by introducing a linear regression model.

It also proposes a scalable symmetric discrete hashing algorithm that gradually and smoothly updates each batch of binary codes, and a greedy symmetric discrete hashing algorithm to update each bit of the batch binary codes.





□ SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405118v1.full.pdf

SpaGCN draws a circle around each spot with a pre-specified radius, and all spots that reside in the circle are considered neighbors of this spot. SpaGCN allows combining multiple domains into one target domain, or specifying which neighboring domains are included in DE analysis.

SpaGCN can identify spatial domains with coherent gene expression and histology and detect SVGs and meta genes that have much clearer spatial expression patterns and biological interpretations than genes detected by SPARK and SpatialDE.





□ GRGNN: Inductive inference of gene regulatory network using supervised and semi-supervised graph neural networks

>> https://www.sciencedirect.com/science/article/pii/S200103702030444X

GRGNN is an end-to-end gene regulatory graph neural network approach that reconstructs GRNs from scratch using gene expression data, in both a supervised and a semi-supervised framework.

One of the time-consuming parts of GRGNN practice is extracting the enclosed subgraphs. The time complexity is O(n|V|h) and the memory complexity is O(n|E|) for extracting n subgraphs in h-hop, where |V| and |E| are numbers of nodes and edges in the whole graph.
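Subgraph extraction of this kind can be sketched with a breadth-first search. The code assumes that the enclosed subgraph of a candidate gene pair is the subgraph induced by all nodes within h hops of either gene, which is one common definition and may differ from GRGNN's exact procedure. The function names and the toy graph are hypothetical.

```python
from collections import deque

def h_hop_nodes(adjacency, source, h):
    """Nodes within h hops of source, found by breadth-first search."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == h:
            continue                      # do not expand past h hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)

def enclosing_subgraph(adjacency, u, v, h):
    """Subgraph induced by the h-hop neighbourhoods of u and v."""
    nodes = h_hop_nodes(adjacency, u, h) | h_hop_nodes(adjacency, v, h)
    return {
        node: sorted(n for n in adjacency.get(node, ()) if n in nodes)
        for node in sorted(nodes)
    }

# Path graph 0 - 1 - 2 - 3 - 4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
subgraph = enclosing_subgraph(path, 0, 1, 1)
```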




□ spVCF: Sparse project VCF: efficient encoding of population genotype matrices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1004/6029516

Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10X size reduction for modern studies with practically minimal information loss.

spVCF interoperates with VCF efficiently, including tabix-based random access. spVCF provides the genotype matrix sparsely, by selectively reducing QC measure entropy and run-length encoding repetitive information about reference coverage.
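Run-length encoding of repetitive cells can be illustrated generically. The sketch below compresses a row of genotype strings into (value, count) runs and restores it; it is not the spVCF file format, and the function names and example row are made up.

```python
from itertools import groupby

def run_length_encode(cells):
    """Collapse consecutive identical cells into (value, count) runs."""
    return [(value, len(list(group))) for value, group in groupby(cells)]

def run_length_decode(runs):
    """Expand (value, count) runs back into the original row."""
    return [value for value, count in runs for _ in range(count)]

row = ["0/0"] * 5 + ["0/1"] + ["0/0"] * 3
runs = run_length_encode(row)
```

A row dominated by reference-identical calls collapses to a handful of runs, which is why sparse encodings pay off for large cohorts.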





□ SDPR: A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405241v1.full.pdf

SDPR (Summary statistics based Dirichlet Process Regression) is a method to compute polygenic risk scores (PRS) from summary statistics. It is the extension of Dirichlet Process Regression (DPR) to the use of summary statistics.

SDPR connects the marginal coefficients in summary statistics with true effect sizes through Bayesian multiple DPR. It utilizes the concept of approximately independent LD blocks and reparameterization to develop a parallel and fast-mixing Markov Chain Monte Carlo algorithm.





□ Maximum Caliber: Inferring a network from dynamical signals at its nodes

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008435

The paper presents an approximate solution to the difficult inverse problem of inferring the topology of an unknown network from given time-dependent signals at the nodes.

The method of choice for inferring dynamical processes from limited information is the Principle of Maximum Caliber. Max Cal can infer both the dynamics and interactions within arbitrarily complex, non-equilibrium systems, albeit in an approximate way.




□ scGMAI: a Gaussian mixture model for clustering single-cell RNA-Seq data based on deep autoencoder

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa316/6029147

scGMAI is a new single-cell Gaussian mixture clustering method based on autoencoder networks and the fast independent component analysis (FastICA).

scGMAI utilizes autoencoder networks to reconstruct gene expression values from scRNA-Seq data and FastICA is used to reduce the dimensions of reconstructed data.




□ Assembling Long Accurate Reads Using de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2020.12.10.420448v1.full.pdf

The paper describes an efficient jumboDB algorithm for constructing the de Bruijn graph for large genomes and large k-mer sizes, and the LJA genome assembler, which error-corrects HiFi reads and uses jumboDB to construct the de Bruijn graph on the error-corrected reads.

Since the de Bruijn graph constructed for a fixed k-mer size is typically either too tangled or too fragmented, LJA uses a new concept of a multiplex de Bruijn graph.




□ SCCNV: A Software Tool for Identifying Copy Number Variation From Single-Cell Whole-Genome Sequencing

>> https://www.frontiersin.org/articles/10.3389/fgene.2020.505441/full

Several statistical models have been developed for analyzing sequencing data of bulk DNA, for example, Circular Binary Segmentation (CBS), Mean Shift-Based (MSB) model, Shifting Level Model (SLM), Expectation Maximization (EM) model, and Hidden Markov Model (HMM).

SCCNV is a read-depth based approach with adjustment for the WGA bias. It controls not only bias during sequencing and alignment, e.g., bias associated with mappability and GC content, but also the locus-specific amplification bias.





□ A generative spiking neural-network model of goal-directed behaviour and one-step planning

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007579

The first hypothesis allows the architecture to learn the world model in parallel with its use for planning: a new arbitration mechanism decides when to explore, for learning the world model, or when to exploit it, for planning, based on the entropy of the world model itself.

The entropy threshold decreases linearly with each planning cycle, so that the exploration component is eventually called to select the action if the planning process fails to reach the goal multiple times.





□ Probabilistic Contrastive Principal Component Analysis

>> https://arxiv.org/pdf/2012.07977.pdf

PCPCA is a model-based alternative to contrastive principal component analysis (CPCA). The model is both generative and discriminative, and PCPCA provides a model-based approach that allows for uncertainty quantification and principled inference.

PCPCA can be applied to a variety of statistical and machine learning problem domains including dimension reduction, synthetic data generation, missing data imputation, and clustering.





□ scCODA: A Bayesian model for compositional single-cell data analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422688v1.full.pdf

scCODA is a Bayesian approach for cell type composition differential abundance analysis that further addresses the low replicate issue.

The scCODA framework models cell type counts with a hierarchical Dirichlet-Multinomial distribution that accounts for the uncertainty in cell type proportions and the negative correlative bias via joint modeling of all measured cell type proportions instead of individual ones.





Every sight I've ever seen.

2020-12-24 22:12:24 | Science News



□ Beyond low-Earth orbit: Characterizing the immune profile following simulated spaceflight conditions for deep space missions

>> https://www.cell.com/iscience/fulltext/S2589-0042(20)30944-5

Circulating immune biomarkers are defined by distinct deep space irradiation types coupled to simulated microgravity and could be targets for future space health initiatives.

Unique immune signatures and microRNA (miRNA) profiles would be produced by distinct experimental conditions of simulated GCR, SPE, and gamma irradiation, singly or in combination with HU.

Linear energy transfer (LET) is defined as the amount of energy that is deposited or transferred in a material from an ion. High-LET irradiation can cause more damaging ionizing tracks and pose a higher relative biological effectiveness (RBE) risk compared to low-LET irradiation.





□ Advancing the Integration of Biosciences Data Sharing to Further Enable Space Exploration

>> https://www.cell.com/cell-reports/fulltext/S2211-1247(20)31430-3

This open access science perspective invites investigators to participate in a transformative collaborative effort for interpreting spaceflight effects by integrating omics and physiological data to the systems level.

Integration of data from GeneLab and ALSDA will enable spaceflight health risk modeling. All data would then benefit from applied FAIR principles.





□ Super-robust data storage in DNA by de Bruijn graph-based decoding

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423642v1.full.pdf

The De Bruijn Graph-based Greedy Path Search (DBG-GPS) algorithm can efficiently reconstruct DNA strands directly from multiple error-rich sequences.

DBG-GPS is designed as an inner decoding mechanism for correction of errors within DNA strands, and is 50 times faster than the clustering and multiple alignment-based methods. The revealed linear decoding complexity makes DBG-GPS a suitable solution for large-scale data storage.





□ STARRPeaker: uniform processing and accurate identification of STARR-seq active regions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02194-x

STARRPeaker, an algorithm optimized for processing and identifying functionally active enhancers from STARR-seq data. This approach statistically models the basal level of transcription, accounting for potential confounding factors, and accurately identifies reproducible enhancers.

The fragment coverage from STARR-seq is modeled using a discrete probability distribution, assuming each genomic bin is independent, as specified in the Bernoulli trials. STARRPeaker calculates fragment coverage and the basal transcription rate using negative binomial regression.





□ RedOak: a reference-free and alignment-free structure for indexing a collection of similar genomes

>> https://www.biorxiv.org/content/10.1101/2020.12.19.423583v1.full.pdf

Parallelizing the construction of the data structure across networked resources allows those genomes to be indexed and queried efficiently. RedOak is inspired by the Bloom Filter Trie, using a probabilistic approach.

RedOak can also be applied to reads from unassembled genomes, and it provides a nucleotide sequence query function. This software is based on a k-mer approach and has been developed to be heavily parallelized and distributed on several nodes of a cluster.




□ TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405589v1.full.pdf

TAPER uses a strategy that combines several k values, each with a different p, q setting: it runs the 2D outlier algorithm on multiple k values and reports their union.

TAPER (Two-dimensional Algorithm for Pinpointing ERrors) takes a multiple sequence alignment as input and outputs outlier sequence positions. TAPER is able to pinpoint errors in multiple sequence alignments without removing large parts of the alignment.




□ WENGAN: Efficient hybrid de novo assembly of human genomes

>> https://www.nature.com/articles/s41587-020-00747-w

WENGAN is a hybrid genome assembler that, unlike most long-read assemblers, entirely avoids the all-versus-all read comparison, does not follow the OLC paradigm, and integrates short reads in the early phases of the assembly process (short-read-first).

WENGAN starts by building short-read contigs using a de Bruijn graph assembler. Then, the paired-end reads are pseudo-aligned back to detect and error-correct chimeric contigs as well as to classify them as repeats or unique sequences.

WENGAN builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges.




□ Learning interpretable latent autoencoder representations with annotations of feature sets

>> https://www.biorxiv.org/content/10.1101/2020.12.02.401182v1.full.pdf

In f-scLVM, deterministic approximate Bayesian inference based on variational methods is used to approximate the posterior over all random variables of the model.

The paper proposes a scalable alternative to f-scLVM that learns latent representations of single-cell RNA-seq data and exploits prior knowledge such as Gene Ontology, resulting in interpretable factors.




□ FastK: A K-mer counter for HQ assembly data sets

>> https://github.com/thegenemyers/FASTK

FastK is a k-mer counter that is optimized for processing high quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode.

FastK is about 2 times faster than KMC3 when counting 40-mers in a 50X HiFi data set. Its relative speedup decreases with increasing error rate or increasing values of k, but regardless is a general program that works for any DNA sequence data set and choice of k.
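For readers unfamiliar with the task, here is a minimal in-memory k-mer counter. It assumes that a k-mer and its reverse complement are counted together under the lexicographically smaller form (canonical k-mers) and that reads contain only A, C, G and T. It says nothing about how FastK is implemented; the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def count_kmers(reads, k):
    """Count canonical k-mers over a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

counts = count_kmers(["ACGT"], 2)
```

In "ACGT" the 2-mers AC and GT are reverse complements of each other, so they are counted under the same key.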





Andrew Carroll

>> https://github.com/google/deepvariant/releases/tag/v1.1.0

Release of DeepVariant v1.1: Introducing DeepTrio, with greater accuracy for trios or duos. Pre-trained models for Illumina WGS, WES, and PacBio HiFi. Also in DV1.1 (non-trio), better speed for long reads. 21% reduction in PacBio Indel Errors.




□ Coupled co-clustering-based unsupervised transfer learning for the integrative analysis of single-cell genomic data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa347/6024740

Clustering similar genomic features reduces the noise in single-cell data and facilitates transfer of knowledge across single-cell datasets.

coupleCoC builds upon the information theoretic co-clustering framework. In co-clustering, both the cells and the genomic features are simultaneously clustered.




□ GeneTerpret: a customizable multilayer approach to genomic variant prioritization and interpretation

>> https://www.biorxiv.org/content/10.1101/2020.12.04.408336v1.full.pdf

The GeneTerpret platform collates data from current interpretation tools and databases, and applies a phenotype-driven query to categorize the variants identified in a given genome.

GeneTerpret improves the GVI process. GeneTerpret is encouragingly accurate when compared with expert-curated datasets in such well-established public records of clinically relevant variants as DECIPHER and ClinGen.




□ Selective Inference for Hierarchical Clustering

>> https://arxiv.org/pdf/2012.02936.pdf

The paper proposes a selective inference framework to test for a difference in means after any type of clustering. This framework exploits ideas from the recent literature on selective inference for regression and changepoint detection.

This framework avoids the need for bootstrap resampling and provides exact finite-sample inference for the difference in means between a single pair of estimated clusters.





□ multiGSEA: a GSEA-based pathway enrichment analysis for multi-omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03910-x

multiGSEA, a highly versatile tool for multi-omics pathway integration that minimizes previous restrictions in terms of omics layer selection and the mapping of feature IDs. Pathway definitions can be downloaded from up to 8 different pathway databases by means of the graphite package.

multiGSEA utilizes three different p value combination methods. By default, combinePvalues() will apply the Z-method (also known as Stouffer’s method), which has no bias towards small or large p values.
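
As a rough illustration of what a Z-method combination does, the sketch below converts one-sided p values to z-scores, sums them, and maps the result back to a p value. It is written in plain Python and is independent of multiGSEA's own R code; the function name stouffer and the optional weights argument are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def stouffer(p_values, weights=None):
    """Combine one-sided p values with Stouffer's Z-method.

    Hypothetical helper for illustration; every p value must lie
    strictly between 0 and 1, otherwise inv_cdf raises an error.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    norm = NormalDist()
    # A small p value becomes a large positive z-score.
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    # Weighted sum of z-scores, rescaled to unit variance.
    combined_z = sum(w * z for w, z in zip(weights, z_scores))
    combined_z /= sqrt(sum(w * w for w in weights))
    return 1.0 - norm.cdf(combined_z)


if __name__ == "__main__":
    # Two omics layers that each report p = 0.05 for the same pathway.
    print(stouffer([0.05, 0.05]))
```

Because the z-scores are symmetric around zero, p values above and below 0.5 pull the combined result in opposite directions with equal strength, which is the sense in which the method has no bias towards small or large p values.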





□ Giraffe: Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit

>> https://www.biorxiv.org/content/10.1101/2020.12.04.412486v1.full.pdf

Giraffe, a new pangenome mapper that focuses on mapping to collections of aligned haplotypes. Giraffe is a short read to graph mapper designed to map to haplotypes, producing alignments embedded within a sequence graph.

The Giraffe algorithm can only find a correct mapping if the read contains instances of minimizers that exactly match minimizers in the true placement in the graph, which then form a cluster, which is then extended to produce an alignment.
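
To make the minimizer requirement concrete, here is a minimal sketch on plain strings rather than on a sequence graph. It assumes lexicographic ordering of k-mers and tiny values of k and w; real mappers typically order k-mers by a hash, and the function names are hypothetical, not part of Giraffe or vg.

```python
def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is kept (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add((best, kmers[best]))
    return chosen


def shared_minimizers(read, ref, k=3, w=2):
    """Reference minimizers whose k-mer also occurs as a read minimizer."""
    read_kmers = {kmer for _, kmer in minimizers(read, k, w)}
    return sorted(
        (pos, kmer) for pos, kmer in minimizers(ref, k, w) if kmer in read_kmers
    )


if __name__ == "__main__":
    # The read is an exact substring of the reference, so seeds are found.
    print(shared_minimizers("ACGTAC", "TTACGTACGG"))
```

Reference positions that share a minimizer with the read are the seeds; a mapper would then cluster nearby seeds and extend them into an alignment. If sequencing errors disrupt every minimizer in the read, no seed exists and the true placement cannot be found this way.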




□ FEATS: feature selection-based clustering of single-cell RNA-seq data

>> https://pubmed.ncbi.nlm.nih.gov/33285568/

FEATS, a univariate feature selection-based approach for clustering, which involves the selection of top informative features to improve clustering performance.

FEATS gives superior performance compared with the current tools, in terms of adjusted Rand index and estimating the number of clusters.





□ constclust: Consistent Clusters for scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2020.12.08.417105v1.full.pdf

constclust is a novel meta-clustering method based on the idea that if the data contains distinct populations which a clustering method can identify, meaningful clusters should be robust to small changes in the parameters used to derive them.

constclust finds labels which match ground truth, as does running the underlying clustering method with default parameters. constclust formalizes the operations by automatically detecting the clusters which are consistently found within contiguous regions of parameter space.




□ Prioritizing genes for systematic variant effect mapping

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1008/6029515

Missense VUS (variant of uncertain significance) collected through clinical testing were extracted from the ClinVar and Invitae databases. The first strategy ranked genes based on their unique VUS count.

The second strategy ranked genes based on their movability- and reappearance-weighted impact score(s) (MARWIS) to give extra weight to reappearing, movable VUS.

The third strategy ranked the genes by their difficulty-adjusted impact score(s) (DAIS), calculated to account for the costs associated with studying longer genes.





□ TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes

>> https://www.biorxiv.org/content/10.1101/2020.12.10.418897v1.full.pdf

ONT Direct RNA-seq has four key limitations. First, up to 30-40% of bases can be called with errors. To tolerate the sequencing errors, the dedicated aligners allow for more mismatches and thus inevitably sacrifice the accuracy of alignments.

TranscriptomeReconstructoR takes three datasets as input: i) full-length RNA-seq (e.g. ONT Direct RNA-seq) to resolve splicing patterns; ii) 5' tag sequencing (e.g. CAGE-seq) to detect TSS; iii) 3' tag sequencing (e.g. PAT-seq) to detect polyadenylation sites (PAS).





□ HiddenVis: a Hidden State Visualization Toolkit to Visualize and Interpret Deep Learning Models for Time Series Data

>> https://www.biorxiv.org/content/10.1101/2020.12.11.422030v1.full.pdf

Hidden State Visualization Toolkit (HiddenVis) visualizes and facilitates the interpretation of sequential models for accelerometer data. HiddenVis can visualize the hidden states, match input samples with similar patterns and explore the potential relation among covariates.

The HiddenVis model is suitable for a wide range of Deep Learning based accelerometer data analyses. It can be easily extended to the visualization and analysis of other temporal data.





□ Unbiased integration of single cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2020.12.11.422014v1.full.pdf

bindSC, a single-cell data integration tool that realizes simultaneous alignment of the rows and the columns between data matrices without making approximations.

The alignment matrix derived from bi-CCA (bi-order canonical correlation analysis) can be utilized to derive in silico multiomics profiles from aligned cells. Bi-CCA outputs canonical correlation vectors (CCVs), which project cells from two datasets onto a shared latent space.




□ FFD: Fast Feature Detector

>> https://ieeexplore.ieee.org/document/9292438

Robust and accurate keypoints exist in a specific scale-space domain. FFD formulates the superimposition problem as a mathematical model and then derives a closed-form solution for multiscale analysis.

The model is formulated via difference-of-Gaussian (DoG) kernels in the continuous scale-space domain, and it is proved that setting the scale-space pyramid’s blurring ratio and smoothness to 2 and 0.627, respectively, facilitates the detection of reliable keypoints.




□ Cytosplore-Transcriptomics: a scalable interactive framework for single-cell RNA sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.11.421883v1.full.pdf

The two-dimensional embeddings of the HSNE hierarchy can be used to cluster and define cell populations at different levels of the hierarchy, or to visualize the expression of selected genes and metadata across cells.

Cytosplore-Transcriptomics, a framework to analyze scRNA-seq data. At its core, it uses a hierarchical, manifold preserving representation of the data that allows the inspection and annotation of scRNA-seq data at different levels of detail.





□ Robustifying genomic classifiers to batch effects via ensemble learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa986/6007261





□ Macarons: Uncovering complementary sets of variants for the prediction of quantitative phenotypes

>> https://www.biorxiv.org/content/10.1101/2020.12.11.419952v1.full.pdf

Macarons takes into account the correlations between SNPs to avoid the selection of redundant pairs of SNPs in linkage disequilibrium.

Macarons features two simple, interpretable parameters to control the time/performance trade-off: The number of SNPs to be selected (k), and maximum intra-chromosomal distance (D, in base pairs) to reduce the search space for redundant SNPs.





□ TraNCE: Scalable Analysis of Multi-Modal Biomedical Data

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422781v1.full.pdf

TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types. TraNCE is capable of outperforming the common alternative, based on “flattening” complex data structures.

TraNCE is a compilation framework that transforms declarative programs over nested collections into distributed execution plans.




□ Hapo-G, Haplotype-Aware Polishing Of Genome Assemblies

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422624v1.full.pdf

Hapo-G maintains two stacks of alignments, the first (all-ali) contains all the alignments that overlap the currently inspected base, and the second (hap-ali) contains only the read alignments that agree with the last selected haplotype.

Hapo-G selects a reference alignment and tries to use it as long as possible to polish the region where it aligns, which will minimize mixing between haplotypes.




□ AdRoit: an accurate and robust method to infer complex transcriptome composition

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422697v1.full.pdf

AdRoit, an accurate and robust method to infer transcriptome composition. The method estimates the proportions of each cell type in the compound RNA-seq data using known single cell data of relevant cell types.


AdRoit uniquely uses an adaptive learning approach to correct the gene-wise bias due to the difference in sequencing techniques. AdRoit also utilizes cell type specific genes while controlling their cross-sample variability.




□ DeMaSk: a deep mutational scanning substitution matrix and its use for variant impact prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1030/6039113

DeMaSk, an intuitive and interpretable method based only upon DMS datasets and sequence homologs that predicts the impact of missense mutations within any protein.

DeMaSk first infers a directional amino acid substitution matrix from DMS datasets and then fits a linear model that combines these substitution scores with measures of per-position evolutionary conservation and variant frequency.




□ HTSlib - C library for reading/writing high-throughput sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.16.423064v1.full.pdf

The HTSlib library is structured as follows: the media access layer is a collection of low-level system and library (libcurl, knet) functions, which facilitate access to files on different storage environments and over multiple protocols to various online storage providers.

Over the lifetime of HTSlib the cost of sequencing has decreased by approximately 100-fold with a corresponding increase in data volume.




□ TIPS: Trajectory Inference of Pathway Significance through Pseudotime Comparison for Functional Assessment of single-cell RNAseq Data

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423360v1.full.pdf

TIPS leverages the common trajectory mapping principle of pseudotime assignment to build pathway-specific trajectories from a pool of single cells.

The pseudotime values for each cell along these pathway-specific trajectories are compared to identify the processes with highest similarity to an overall trajectory. This latter source of variation may have significant ramifications on the accuracy of pseudotime alignment.




□ Minimally-overlapping words for sequence similarity search

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1054/6042707

a simple sparse-seeding method: using seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. In a random sequence, minimally-overlapping words are anti-clumped.

using increasingly long minimum-variance words, with fixed sparsity n, the sensitivity might approach that of every-nth seeding. The seed count of every-nth seeding has zero variance.
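
The word-based seeding described above can be sketched in a few lines. This is an illustration only, not the paper's implementation: the function name seed_positions is hypothetical, and the word set is the example from the text (ac, at, gc, gt).

```python
def seed_positions(seq, words=("ac", "at", "gc", "gt")):
    """Return the start positions in seq where one of the words occurs.

    Assumes all words have the same length; the comparison is
    case-insensitive.
    """
    seq = seq.lower()
    word_length = len(words[0])
    word_set = set(words)
    return [
        i
        for i in range(len(seq) - word_length + 1)
        if seq[i:i + word_length] in word_set
    ]


if __name__ == "__main__":
    print(seed_positions("acgtat"))  # [0, 2, 4]
```

With this particular word set, every word starts with a or g and ends with c or t, so two words can never start at adjacent positions. Seeds therefore cannot pile up next to each other, which is one way to picture the anti-clumping property.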




□ VCFShark: how to squeeze a VCF file

>> https://www.biorxiv.org/content/10.1101/2020.12.18.423437v1.full.pdf

VCFShark, a dedicated fully-fledged compressor of VCF files. It significantly outperforms the universal tools in terms of compression ratio; sometimes its advantage is severalfold.

VCFShark dominates over BCF, pigz, and 7z by a large margin, achieving 3- to 32-fold better compression. It is mainly a result of an algorithm for compression of genotypes. The advantage over genozip, which uses similar compression for genotypes, is up to 5.5-fold for HRC.




□ A monotonicity-based gene clustering algorithm for enhancing clarity in single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423308v1.full.pdf

When clustering genes based on a monotonicity-based metric, it is important to note that uniformly expressed genes (with either very scarce dropout values or very abundant dropout values) are dangerous because they are likely to have high monotonicity values with many genes, even when a meaningful relationship may not exist.

Due to the high dimensionality of scRNA-seq data, genes with high variances, which will tend to serve as the cluster “centroids”, will tend to be well-separated.




□ scTypeR: Framework to accurately classify cell types in single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.22.424025v1.full.pdf

The advantage of scTypeR and other related tools is that the cell type’s properties are learned from a reference dataset, but the reference dataset is no longer necessary to apply the model.

scTypeR uses SVM learning models organised in a tree-like structure to improve the classification of closely related cell types. scTypeR reports classification probabilities for every cell type and reports ambiguous classification results.





□ VarSAn: Associating pathways with a set of genomic variants using network analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.22.424077v1.full.pdf

VarSAn analyzes a configurable network whose nodes represent variants, genes and pathways, using a Random Walk with Restarts algorithm to rank pathways for relevance to the given variants, and reports p-values for pathway relevance.

VarSAn ranks pathways first by their empirical p-values, which represent their connectivity to the query set, and then (to break ties) by their equilibrium probabilities, which are determined by both the connectivity and the network topology.
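
The equilibrium probabilities mentioned above come from a Random Walk with Restarts. The following sketch shows the general technique on a toy network; it is not VarSAn's code, it does not compute empirical p-values, and the node names, the restart probability of 0.5, and the function name are all assumptions for illustration.

```python
def random_walk_with_restarts(adj, seeds, restart=0.5, tol=1e-12, max_iter=10000):
    """Equilibrium visiting probabilities of a walker that, at each step,
    jumps back to a random seed with probability `restart` and otherwise
    moves to a uniformly chosen neighbour.

    adj maps each node to a list of its neighbours.
    """
    nodes = sorted(adj)
    restart_vec = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(restart_vec)
    for _ in range(max_iter):
        nxt = {n: restart * restart_vec[n] for n in nodes}
        for n in nodes:
            moving = (1.0 - restart) * prob[n]
            if adj[n]:
                share = moving / len(adj[n])
                for m in adj[n]:
                    nxt[m] += share
            else:
                # Node without neighbours: send its mass back to the seeds.
                for m in nodes:
                    nxt[m] += moving * restart_vec[m]
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


if __name__ == "__main__":
    # Toy network: variant v -> gene g1 -> pathway pA, and g1 -> g2 -> pB.
    network = {
        "v": ["g1"],
        "g1": ["v", "pA", "g2"],
        "g2": ["g1", "pB"],
        "pA": ["g1"],
        "pB": ["g2"],
    }
    scores = random_walk_with_restarts(network, seeds={"v"})
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

In the toy network the pathway pA sits closer to the query variant than pB and therefore ends up with the higher equilibrium probability.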





□ KATK: fast genotyping of rare variants directly from unmapped sequencing reads

>> https://www.biorxiv.org/content/10.1101/2020.12.23.424124v1.full.pdf

KATK is a fast and accurate software tool for calling variants directly from raw NGS reads. It uses predefined k-mers to retrieve only the reads of interest from the FASTQ file and calls genotypes by aligning retrieved reads locally.

KATK identifies unreliable variant calls and clearly distinguishes them in the output. KATK does not use data about known polymorphisms and has NC (No Call) as default genotype.
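
The read-retrieval idea can be illustrated with a small filter that keeps only reads containing at least one predefined k-mer. This is a simplified sketch, not KATK's implementation: it works on plain sequence strings rather than FASTQ records, ignores reverse complements, and the function name is hypothetical.

```python
def reads_with_kmers(reads, target_kmers):
    """Return the reads that contain at least one of the target k-mers.

    Assumes all target k-mers share the same length k.
    """
    targets = set(target_kmers)
    k = len(next(iter(targets)))
    selected = []
    for read in reads:
        # Slide a window of length k over the read and stop at the first hit.
        if any(read[i:i + k] in targets for i in range(len(read) - k + 1)):
            selected.append(read)
    return selected


if __name__ == "__main__":
    reads = ["TTGACCA", "GGGGGGG", "ACCATTT"]
    print(reads_with_kmers(reads, {"GACC", "CATT"}))
```

Only the retained reads would then go on to local alignment and genotype calling, which avoids mapping the full read set.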




□ ARPIR: automatic RNA-Seq pipelines with interactive report

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03846-2

ARPIR allows the analysis of RNA-Seq data from groups undergoing different treatment allowing multiple comparisons in a single launch and can be used either for paired-end or single-end analysis.

Automatic RNA-Seq Pipelines with Interactive Report (ARPIR) makes a final tertiary-analysis that includes a Gene Ontology and Pathway analysis.




□ glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data

>> https://doi.org/10.1093/bioinformatics/btaa1009

glmGamPoi provides inference of Gamma-Poisson generalized linear models with the following improvements over edgeR / DESeq2. glmGamPoi is more than 5 times faster than edgeR and more than 18 times faster than DESeq2.

glmGamPoi provides a quasi-likelihood ratio test with empirical Bayesian shrinkage to identify differentially expressed genes. glmGamPoi scales sub-linearly with the number of cells, which explains the observed performance benefit.
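
For readers unfamiliar with the Gamma-Poisson (negative binomial) distribution, the sketch below evaluates its log-probability under a common parametrization by mean and overdispersion, where the variance equals mean + overdispersion * mean^2. Treating this as the parametrization relevant here is an assumption, and the function is illustrative rather than part of the glmGamPoi package.

```python
from math import exp, lgamma, log


def gamma_poisson_logpmf(y, mean, overdispersion):
    """Log-probability of count y under a Gamma-Poisson distribution.

    Requires mean > 0 and overdispersion > 0. The variance implied by
    this parametrization is mean + overdispersion * mean**2.
    """
    size = 1.0 / overdispersion
    return (
        lgamma(y + size) - lgamma(size) - lgamma(y + 1)
        + size * log(size / (size + mean))
        + y * log(mean / (size + mean))
    )


if __name__ == "__main__":
    # Numerically check the mean and variance for one parameter setting.
    probs = [exp(gamma_poisson_logpmf(y, 3.0, 0.5)) for y in range(500)]
    mean = sum(y * p for y, p in enumerate(probs))
    variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    print(mean, variance)
```

As the overdispersion approaches zero, the variance approaches the mean and the distribution approaches a Poisson.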