lens, align.

Long is the time, but the true comes to pass.

Awake.

2020-01-31 23:57:58 | Science News

Besides adapting to the flaws of the establishment, or deviating from it, it is also possible to break it open from the inside. For the latter means to be effective, the socially shared perspective on the flaws localized in the organization's inner shell must be fed back into circulation.




□ Exactly solvable models of stochastic gene expression

>> https://www.biorxiv.org/content/10.1101/2020.01.05.895359v1.full.pdf

The method was employed to approximate solutions to a broad class of linear multi-state gene expression models, termed l-switch models, most of which currently have no known analytic solution.

While a characterisation of precisely which models the method is applicable to is not yet known, the results presented here suggest the recurrence method is an accurate and flexible tool for analysing stochastic models of gene expression.

The multistate models of gene expression generalise the canonical Telegraph process and are capable of capturing the joint effects of, e.g., transcription factors, heterochromatin state and DNA accessibility (or, in prokaryotes, sigma-factor activity) on transcript abundance.
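As a point of reference, here is a minimal sketch of the canonical Telegraph process these models generalise, assuming a two-state promoter with illustrative switching, transcription and degradation rates (none of the parameter values come from the paper):

```python
import numpy as np

def telegraph_gillespie(k_on=0.1, k_off=0.2, k_tx=5.0, k_deg=1.0,
                        t_end=200.0, seed=0):
    """Gillespie simulation of the two-state Telegraph model of gene expression.

    The promoter switches ON/OFF, transcribes only while ON, and transcripts
    decay with first-order kinetics; returns the mRNA count at time t_end.
    """
    rng = np.random.default_rng(seed)
    gene_on, mrna, t = False, 0, 0.0
    while t < t_end:
        rates = np.array([
            k_off if gene_on else k_on,   # promoter switching
            k_tx if gene_on else 0.0,     # transcription (ON state only)
            k_deg * mrna,                 # mRNA degradation
        ])
        total = rates.sum()
        if total == 0.0:
            break
        t += rng.exponential(1.0 / total)
        event = rng.choice(3, p=rates / total)
        if event == 0:
            gene_on = not gene_on
        elif event == 1:
            mrna += 1
        else:
            mrna -= 1
    return mrna

# Steady-state mRNA distribution over many independent cells
samples = [telegraph_gillespie(seed=s) for s in range(500)]
print("mean mRNA:", np.mean(samples),
      "Fano factor:", np.var(samples) / np.mean(samples))
```

The l-switch models replace the single ON/OFF promoter with several promoter states, which is where analytic solutions are generally unavailable and the recurrence method becomes useful.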





□ TALC: Transcription-Aware Long Read Correction https://www.biorxiv.org/content/10.1101/2020.01.10.901728v1.full.pdf

In TALC, a path in the DBG is defined as an ordered list of connected k-mers (nodes), which are weighted by their number of occurrences in the SR dataset. Any sequence of transcripts expressed in the RNA-seq sample should appear as a unique path of the graph.

TALC favours coverage-consistent exploration of the DBG. All paths passing the test described above are explored in parallel, according to a breadth-first approach.

Paths in the DBG that successfully bridge two Solid Regions are first ranked according to their sequence similarity with the Weak Region, computed as the edit distance between the sequences.

By integrating coverage information, TALC can efficiently account for the existence of multiple transcript isoforms. By eliminating inconsistent nodes, TALC reduces the exploration space in the graph and thus reduces the probability of exploring false paths.





□ LPWC: Lag Penalized Weighted Correlation for Time Series Clustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3324-1

Countless clustering algorithms group data points with similar characteristics, but the meaning of “similar” is inherently subjective and application-specific. In time series datasets, similarity must account for the temporal structure.

Dynamic Time Warping (DTW) aligns timepoints so that the distance between the aligned samples is minimized. LEAP allows time delays when constructing co-expression networks.

The LPWC similarity score is derived from weighted correlation, but the correlations of lagged temporal profiles are penalized using a Gaussian kernel. The kernel is also used to account for irregular time sampling.

LPWC is designed to identify groups of biological entities that exhibit the same pattern of activity changes over time. A solution can be verified in polynomial time, and the NP-complete Weighted Maximum Cut problem can be reduced to the lag optimization problem in polynomial time.
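A minimal sketch of the scoring idea described above (not the LPWC R package): correlate two lag-shifted expression profiles over their overlapping timepoints and damp the score with a Gaussian kernel over the time offsets the lag introduces, which is also how irregular sampling enters the penalty. The profiles, timepoints and kernel width below are illustrative assumptions.

```python
import numpy as np

def lag_penalized_similarity(x, y, timepoints, lag=0, width=100.0):
    """LPWC-style score: Pearson correlation of the overlapping parts of two
    profiles after delaying y by `lag` sampling indices, multiplied by a
    Gaussian penalty on the time offsets the lag introduces (lag >= 0)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    t = np.asarray(timepoints, float)
    n = len(t)
    xs, ys = x[:n - lag], y[lag:]          # overlapping, lag-aligned parts
    tx, ty = t[:n - lag], t[lag:]          # their (unequal) sampling times
    corr = np.corrcoef(xs, ys)[0, 1]
    penalty = np.exp(-np.mean((ty - tx) ** 2) / width)
    return penalty * corr

t = [0, 2, 4, 8, 16, 24]                   # irregularly sampled hours
g1 = [0.1, 0.8, 1.5, 1.2, 0.6, 0.2]
g2 = [0.0, 0.1, 0.9, 1.4, 1.1, 0.5]        # roughly g1 delayed by one timepoint
print({lag: round(lag_penalized_similarity(g1, g2, t, lag), 3)
       for lag in (0, 1, 2)})
```

Higher scores indicate more similar profiles at that candidate lag; the Gaussian kernel keeps large or widely spaced lags from winning spuriously.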





□ ProphAsm: Simplitigs as an efficient and scalable representation of de Bruijn graphs https://www.biorxiv.org/content/10.1101/2020.01.12.903443v1.full.pdf

ProphAsm is a tool for computing simplitigs from k-mer sets and for k-mer set manipulation. Simplitigs are genomic sequences computed as disjoint paths in a bidirectional vertex-centric de Bruijn graph.

Compared to unitigs, simplitigs provide an improvement in the total number of sequences and their cumulative length, while both representations contain exactly the same k-mers.

ProphAsm implements a greedy heuristic to compute maximal simplitigs. It proceeds by building the associated de Bruijn graph in memory, followed by a greedy enumeration of maximal vertex-disjoint paths.
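A minimal sketch of the greedy idea (not ProphAsm itself, which works on a bidirectional, canonical-k-mer graph in C++): repeatedly take an unused k-mer and extend it forward and backward through unused k-mers until no extension exists; every k-mer is consumed exactly once, so the resulting strings are vertex-disjoint paths spelling exactly the input k-mer set.

```python
def greedy_simplitigs(kmers):
    """Greedily decompose a k-mer set into simplitigs (maximal vertex-disjoint
    paths); each k-mer of the input appears in exactly one output string."""
    k = len(next(iter(kmers)))
    unused = set(kmers)
    simplitigs = []
    while unused:
        contig = unused.pop()
        extended = True
        while extended:                       # extend to the right
            extended = False
            for base in "ACGT":
                cand = contig[-(k - 1):] + base
                if cand in unused:
                    unused.remove(cand)
                    contig += base
                    extended = True
                    break
        extended = True
        while extended:                       # extend to the left
            extended = False
            for base in "ACGT":
                cand = base + contig[:k - 1]
                if cand in unused:
                    unused.remove(cand)
                    contig = base + contig
                    extended = True
                    break
        simplitigs.append(contig)
    return simplitigs

seq = "ACGTACGATGCATGCATT"
k = 5
kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
print(sorted(greedy_simplitigs(kmers), key=len, reverse=True))
```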





□ sstGPLVM: A Bayesian nonparametric semi-supervised model for integration of multiple single-cell experiments

>> https://www.biorxiv.org/content/10.1101/2020.01.14.906313v1.full.pdf

sstGPLVM, the semi-supervised t-distributed Gaussian process latent variable model, projects the data onto a mixture of fixed and latent dimensions and can learn a unified low-dimensional embedding for multiple single-cell experiments with minimal assumptions.

sstGPLVM is a robust semi-supervised Gaussian process latent variable model that estimates a manifold which eliminates variance from unwanted covariates and enables the imputation of missing covariates for other types of multi-modal data.





□ d-PBWT: dynamic positional Burrows-Wheeler transform

>> https://www.biorxiv.org/content/10.1101/2020.01.14.906487v1.full.pdf

d-PBWT is a dynamic version of the PBWT data structure. It can be initialized by direct bulk conversion from an existing PBWT. These algorithms open new research avenues for developing efficient genotype imputation and phasing algorithms.

The authors present two search algorithms for set-maximal matches and long matches with worst-case linear time complexity but requiring multiple passes, and one search algorithm for long matches with average-case linear time complexity in a single pass, without additional LEAP array data structures.





□ Alignment of single-cell RNA-seq samples without over-correction using kernel density matching

>> https://www.biorxiv.org/content/10.1101/2020.01.05.895136v1.full.pdf

Dmatch uses an external panel of primary cells to identify shared pseudo cell-types across scRNA-seq samples, and then finds a set of common alignment parameters that minimize gene expression level differences between cells that are determined to be the same pseudo cell-types.

The consistency of cell-type assignment was generally increased when all Pearson correlations of the 95-dimensional vectors were set to zero except for those between the cell and the five reference-atlas cell types with the highest Pearson correlation coefficients.
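A minimal numpy sketch of that top-five filtering step, assuming `corr` holds each cell's Pearson correlations to 95 reference-atlas cell types (the array shapes and values are illustrative, not Dmatch's actual data structures):

```python
import numpy as np

def keep_top_k_correlations(corr, k=5):
    """Zero every reference-atlas correlation except each cell's top k.

    corr : (n_cells, n_reference_cell_types) array of Pearson correlations.
    """
    corr = np.asarray(corr, dtype=float)
    sparse = np.zeros_like(corr)
    top_idx = np.argsort(corr, axis=1)[:, -k:]            # k largest per row
    rows = np.arange(corr.shape[0])[:, None]
    sparse[rows, top_idx] = corr[rows, top_idx]
    return sparse

rng = np.random.default_rng(1)
corr = rng.uniform(-0.2, 0.9, size=(3, 95))               # 3 cells x 95 atlas types
print((keep_top_k_correlations(corr) != 0).sum(axis=1))   # -> [5 5 5]
```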





□ Supervised Adversarial Alignment of scRNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2020.01.06.896621v1.full.pdf

scDGN, the Single Cell Domain Generalization Network, includes three modules: an scRNA encoder, a label classifier and a domain discriminator.

Gradient Reversal Layers (GRL) have no effect in forward propagation, but flip the sign of the gradients that flow through them during backpropagation.





□ Nanopore Sequencing at Mars, Europa and Microgravity Conditions

>> https://www.biorxiv.org/content/10.1101/2020.01.09.899716v1.full.pdf

Now ubiquitous on Earth and previously demonstrated on the International Space Station (ISS), nanopore sequencing involves translocation of DNA through a biological nanopore on timescales of milliseconds per base.

The results confirm the ability to sequence at Mars, Europa and lunar g levels and under parabolic-flight protocols, with consistent performance across g levels, during dynamic accelerations, and despite vibrations with significant power at translocation-relevant frequencies.




□ ABEMUS: platform specific and data informed detection of somatic SNVs in cfDNA

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa016/5699904

Next generation sequencing assays allow for the simultaneous interrogation of extended sets of somatic single nucleotide variants (SNVs) in circulating cell free DNA (cfDNA), a mixture of DNA molecules originating both from normal and tumor tissue cells.

ABEMUS, the Adaptive Base Error Model in Ultra-deep Sequencing data, combines platform-specific genetic knowledge and empirical signal to readily detect and quantify somatic SNVs in cfDNA.





□ DEcode: Decoding differential gene expression

>> https://www.biorxiv.org/content/10.1101/2020.01.10.894238v1.full.pdf

Combining the strengths of systems biology and deep learning, the DEcode model is able to predict differential expression (DE) more accurately than traditional sequence-based methods, which do not utilize systems biology data.

The DEcode framework integrates a wealth of genomic data into a unified computational model of transcriptome regulation to predict multiple transcriptional effects: absolute expression, differences across genes and transcripts, and tissue- and person-specific transcriptomes.

DEcode builds a prediction model for tissue-specific gene expression for each tissue via XGBoost, based on the training script from ExPecto and using the same XGBoost hyper-parameters as in that script.





□ CaSpER identifies and visualizes CNV events by integrative analysis of single-cell or bulk RNA-sequencing data https://www.nature.com/articles/s41467-019-13779-x

CaSpER utilizes a non-linear median-based filtering of RNA-seq expression and allele-frequency signal. The median filtering preserves the edges of the signal much better compared to the kernel-based linear filters.

After the assignment of HMM states, CaSpER integrates the BAF shift signal with the assigned states to generate the final CNV calls.




□ Fast Zero-Inflated Negative Binomial Mixed Modeling Approach for Analyzing Longitudinal Metagenomics Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz973/5697093

The FZINBMM approach is based on zero-inflated negative binomial mixed models (ZINBMMs) for modeling longitudinal metagenomic count data and a fast EM-IWLS algorithm for fitting ZINBMMs.

FZINBMM takes advantage of a commonly used procedure for fitting linear mixed models (LMMs), which allows us to include various types of fixed and random effects and within-subject correlation structures and quickly analyze many taxa.





□ LiPLike: Towards gene regulatory network predictions of high certainty

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz950/5697090

The Linear Profile Likelihood (LiPLike) assumes a regression model and iteratively searches for interactions that cannot be replaced by a linear combination of other predictors.

LiPLike could successfully remove false positive identifications from the GRN predictions of other methods, a feature that is useful whenever high-accuracy GRN predictions are sought from gene expression data.





□ Probabilistic gene expression signatures identify cell-types from single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.01.05.895441v1.full.pdf

a computationally light statistical approach, based on Naive Bayes, that leverages public datasets to combine information across thousands of genes and probabilistically assign cell-type identity.

To overcome the problem that the estimated rates are zero for hundreds of genes due to the sparsity of the data, the method uses a hierarchical model that defines a cell-type-specific distribution, with the hierarchical aspect providing statistical power in the presence of that sparsity.




□ isONcorrect: Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis

>> https://www.biorxiv.org/content/10.1101/2020.01.07.897512v1.full.pdf

applying isONcorrect to direct RNA reads is a direction for future work that should enable the reference-free use of direct RNA reads.

IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths.

As structural differences and variable coverage are at the heart of transcriptomic error correction, the partitioning problem is solved by formulating it as an optimization problem over anchor depth that is global with respect to the read's k-mer anchors.





□ SparkINFERNO: A scalable high-throughput pipeline for inferring molecular mechanisms of non-coding genetic variants

>> https://www.biorxiv.org/content/10.1101/2020.01.07.897579v1.full.pdf

SparkINFERNO (Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants), a scalable bioinformatics pipeline characterizing noncoding GWAS association findings.

SparkINFERNO algorithm integrates GWAS summary statistics with large-scale functional genomics datasets spanning enhancer activity, transcription factor binding, expression quantitative trait loci, and other functional datasets across more than 400 tissues and cell types.

SparkINFERNO is 61-times faster and scales well with the amount of computational resources. SparkINFERNO identified 1,418 and 15,343 candidate causal variants and 149 and 1,002 co-localized target gene-tissue combinations for IGAP and IBD.





□ BEM: Mining Coregulation Patterns in Transcriptomics via Boolean Matrix Factorization

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz977/5698267

Boolean matrix factorization via expectation maximization (BEM) is more aligned with the molecular mechanism of transcriptomic coregulation and can scale to matrices with over 100 million data points.

BEM is applicable to all kinds of transcriptomic data, including bulk RNAseq, single cell RNAseq, and spatial transcriptomic datasets. Given appropriate binarization, BEM was able to extract coregulation patterns consistent with disease subtypes, cell types, or spatial anatomy.




□ A U-statistics for integrative analysis of multi-layer omics data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa004/5698271

the proposed method is flexible for analyzing different types of outcomes as it makes no assumptions about their distributions, and outperformed the commonly used kernel regression-based methods.

a U-statistics-based non-parametric framework for the association analysis of multi-layer omics data, where consensus and permutation-based weighting schemes are developed to account for various types of disease models.




□ Gapsplit: Efficient random sampling for non-convex constraint-based models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz971/5698269

Combinatorial models may require an astronomical number of solutions to form a stable sampling distribution. For non-convex models, the relationships between variables can change drastically from subspace to subspace.

Gapsplit provides uniform coverage of linear, mixed-integer, and general nonlinear models. Gapsplit generates random samples from convex and non-convex constraint-based models by targeting under-sampled regions of the solution space.




□ Lep-Anchor: Automated construction of linkage map anchored haploid genomes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz978/5698268

Lep-Anchor has been developed to efficiently anchor genomes into chromosomes using Lep-MAP3 and the additional information provided by long reads and contig-contig alignments to link contigs and to collapse haplotypes. Lep-Anchor supports millions of markers over multiple maps.

Lep-Anchor anchors genome assemblies automatically using dense linkage maps. The anchoring accuracy can be improved by utilising information about map-position uncertainty.





□ Multiset sparse partial least squares path modeling for high dimensional omics data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3286-3

Multiset sparse Partial Least Squares path modeling (msPLS), a generalized penalized form of Partial Least Squares path modeling, for the simultaneous modeling of biological pathways across multiple omics domains.

msPLS is a multiset multivariate method for the integrative analysis of multiple high-dimensional omics data sources. It accounts for the relationships between multiple high-dimensional data sources while providing interpretable results through its sparse solutions.





□ GSR: Partitioning gene-based variance of complex traits by gene score regression

>> https://www.biorxiv.org/content/10.1101/2020.01.08.899260v1.full.pdf

The rationale of Gene Score Regression (GSR) is based on the insight that genes that are highly correlated with the causal genes in a causal gene set or pathway will exhibit high marginal TWAS statistics.

Consequently, by regressing on the genes’ marginal statistic using the sum of the gene-gene correlation scores in each gene set, GSR can assess the amount of phenotypic variance explained by the predicted expression of the genes in that gene set.

GSR then calculates the statistical significance of each gene set based on the z-scores of the linear regression coefficients in the GSR model.
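A minimal sketch of that regression step under simplifying assumptions (ordinary least squares, squared correlations as the per-gene-set scores, illustrative toy data); it is meant only to show the shape of the computation, not GSR's actual estimator:

```python
import numpy as np

def gene_score_regression(marginal_stats, corr, gene_sets):
    """Regress marginal TWAS statistics on per-gene-set aggregated gene-gene
    correlation scores and return a z-score per gene set's coefficient.

    marginal_stats : (n_genes,) marginal TWAS statistics
    corr           : (n_genes, n_genes) predicted-expression correlations
    gene_sets      : dict mapping gene-set name -> list of gene indices
    """
    y = np.asarray(marginal_stats, dtype=float)
    n_genes = len(y)
    names = list(gene_sets)
    # column s: for each gene g, the summed squared correlation with set s
    X = np.column_stack(
        [(corr[:, gene_sets[s]] ** 2).sum(axis=1) for s in names])
    X = np.column_stack([np.ones(n_genes), X])        # intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n_genes - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = beta[1:] / np.sqrt(np.diag(cov)[1:])
    return dict(zip(names, z))

rng = np.random.default_rng(0)
corr = rng.uniform(-1, 1, size=(50, 50))
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1.0)
stats = rng.chisquare(1, size=50)                     # toy marginal statistics
print(gene_score_regression(stats, corr, {"setA": [0, 1, 2], "setB": [10, 11]}))
```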




□ DeepECA: an end-to-end learning framework for protein contact prediction from a multiple sequence alignment

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3190-x

DeepECA can obtain good results compared with existing ECA methods such as PSICOV, CCMpred, DeepCOV, and ResPRE when tested on the CASP11 and CASP12 datasets.

DeepECA models achieve improved precision for both shallow and deep MSAs. The model was further expanded into a multi-task model that increases prediction accuracy by incorporating predictions of secondary structures and solvent-accessible surface areas.





□ Independent evolution of transcript abundance and gene regulatory dynamics

>> https://www.biorxiv.org/content/10.1101/2020.01.22.915033v1.full.pdf

Profiling the interspecific hybrid provided insights into the basis of variations, showed that trans-varying alleles interact dominantly, and revealed complementation of cis-variations by variations in trans.

The data suggests that gene expression diverges primarily through changes in promoter strength that do not alter gene positioning within the transcription network.




□ Tagsteady: a metabarcoding library preparation protocol to avoid false assignment of sequences to samples

>> https://www.biorxiv.org/content/10.1101/2020.01.22.915009v1.full.pdf

Tagsteady, a metabarcoding Illumina library preparation protocol for pools of nucleotide-tagged amplicons that enables efficient and cost-effective generation of metabarcoding data with virtually no tag-jumps.

The Tagsteady protocol is developed as a single-tube library preparation protocol, circumventing both the use of T4 DNA Polymerase in the end-repair step and the post-ligation PCR amplification step.




□ PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa042/5714737

PARC can cluster a single cell data set of 1.1M cells within 13 minutes, compared to > 2 hours for the next fastest graph-clustering algorithm.

PARC, “phenotyping by accelerated refined community-partitioning”, is a fast, automated, combinatorial graph-based clustering approach that integrates hierarchical graph construction (HNSW) and data-driven graph pruning with the new Leiden community-detection algorithm.




□ Transfer index, NetUniFrac and some useful shortest path-based distances for community analysis in sequence similarity networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa043/5714743

the shortest path concept can be extended to sequence similarity networks by defining five new distances, NetUniFrac, Spp, Spep, Spelp and Spinp, and the Transfer index, between species communities present in the network.

NetUniFrac and the Transfer index can be computed in linear time with respect to the number of edges in the network.





□ SBOL Visual 2 Ontology

>> https://www.biorxiv.org/content/10.1101/2020.01.24.918417v1.full.pdf

the SBOL Visual 2 Ontology, which provides a machine-readable representation of the constraints attached to genetic circuit glyphs and their relationships to other ontological terms.

SBOL-VO can act as a catalogue that can be converted into different formats and hence can be used to auto-generate portions of the SBOL Visual specification in the future.

Ontological axioms restricting the use of glyphs for different sequence features, molecules, or molecular interactions can directly be utilised in ontological queries and be submitted to existing reasoners, such as HermiT.




□ Pathway Mining and Data Mining in Functional Genomics. An Integrative Approach to Delineate Boolean Relationships Between Src and Its targets

>> https://www.biorxiv.org/content/10.1101/2020.01.25.919639v1.full.pdf

Boolean relationships between molecular components of cells suffer from too much simplicity regarding the complex identity of molecular interactions.

Hierarchical clustering was applied to the expression of all DEGs in Src-overactivated samples only, using the Pearson correlation coefficient as the distance in the clustering method.

Information in the KEGG and OmniPath databases is used to construct pathways from Src to DEGs (differentially expressed genes) and between the DEGs themselves.




□ FitHiC2: Identifying statistically significant chromatin contacts from Hi-C data

>> https://www.nature.com/articles/s41596-019-0273-0

With FitHiC2, it is possible to perform genome-wide analysis of high-resolution Hi-C data, including all intra-chromosomal distances and inter-chromosomal contacts.

FitHiC2 also offers a merging filter module, which eliminates indirect/bystander interactions, leading to significant reduction in the number of reported contacts without sacrificing recovery of key loops such as those between convergent CTCF binding sites.





□ CCSN: Single Cell RNA Sequencing Data Analysis by Conditional Cell-specific Network

>> https://www.biorxiv.org/content/10.1101/2020.01.25.919829v1.full.pdf

To quantify the differentiation state of cells, a new method, “network flow entropy” (NFE), was developed to estimate the differentiation potency of cells by exploiting the gene-gene network constructed by CCSN.

CCSN reveals the network dynamics over the differentiation trajectory. The normalized gene expression profile and the CSN/CCSN are used when computing the network flow entropy.





□ iPAC: A genome-guided assembler of isoforms via phasing and combing paths

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa052/5716327

iPAC, a new genome-guided assembler for reconstruction of isoforms, which revolutionizes the usage of paired-end and sequencing depth information via phasing and combing paths over a newly designed phasing graph.

Another new graph model, the phasing graph, is introduced to resolve the ambiguity of the connections between the in- and out-splicing junctions at each exon by effectively integrating paired-end and sequencing depth information.





□ HASLR: Fast Hybrid Assembly of Long Reads

>> https://www.biorxiv.org/content/10.1101/2020.01.27.921817v1.full.pdf

HASLR, a hybrid assembler which uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. HASLR is capable of assembling large genomes on a single computing node.

HASLR, similar to hybridSPAdes, Unicycler, and Wengan, builds SR contigs using a fast SR assembler (i.e., Minia).

HASLR builds a novel data structure called a backbone graph to place short-read contigs in the order in which they are expected to appear in the genome and to fill the gaps between them using a consensus of long reads.





□ SPARK: Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies

>> https://www.nature.com/articles/s41592-019-0701-7

SPARK identifies spatial expression patterns of genes in data generated from various spatially resolved transcriptomic techniques.

SPARK directly models spatial count data through generalized linear spatial models. It relies on recently developed statistical formulas for hypothesis testing, providing effective control of type I errors and yielding high statistical power.





Whose name is written on water.

2020-01-31 16:18:47 | Science News



□ Miraculix: Efficient Calculation of the Genomic Relationship Matrix

>> https://www.biorxiv.org/content/10.1101/2020.01.12.903146v1.full.pdf

a sequence of distinct algorithms that differ in their speed-up and their SIMD requirements: TwoBit (15×, SIMD not used), Packed (28×, SSE2), Shuffle (35×, SSSE3).

Since the calculation of a genomic relationship matrix needs a large number of arithmetic operations, fast implementations are of interest. The fastest algorithm is more accurate and 25× faster than an Advanced Vector Extensions double-precision floating-point implementation.
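A small illustration of the 2-bit, TwoBit-style packing that these speed-ups build on, in plain numpy rather than the SIMD intrinsics Miraculix actually uses; genotypes are the usual 0/1/2 minor-allele counts:

```python
import numpy as np

def pack_2bit(genotypes):
    """Pack genotype calls (0, 1, 2) into 2 bits each, four per byte."""
    g = np.asarray(genotypes, dtype=np.uint8)
    pad = (-len(g)) % 4
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return (g << shifts).sum(axis=1).astype(np.uint8)

def unpack_2bit(packed, n):
    packed = np.asarray(packed, dtype=np.uint8)[:, None]
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed >> shifts) & 0b11).reshape(-1)[:n]

geno = np.array([0, 1, 2, 0, 2, 2, 1], dtype=np.uint8)
packed = pack_2bit(geno)
print(packed.nbytes, "bytes instead of", geno.nbytes)   # 2 instead of 7
assert np.array_equal(unpack_2bit(packed, len(geno)), geno)
```

The cross-products needed for the relationship matrix can then be evaluated on the packed representation with table look-ups or SIMD shuffles, presumably where the Packed and Shuffle variants get their additional speed-ups.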





□ Hypercluster: a flexible tool for parallelized unsupervised clustering optimization https://www.biorxiv.org/content/10.1101/2020.01.13.905323v1.full.pdf

Hypercluster distributes clustering calculations in parallel and aggregates results using Snakemake. Users can augment the set of hyperparameters to test and investigate several evaluation metrics to choose the optimal clustering.

Hypercluster streamlines the use of unsupervised clustering to derive biologically relevant structure within data.




□ Metalign: Efficient alignment-based metagenomic profiling via containment min hash

>> https://www.biorxiv.org/content/10.1101/2020.01.17.910521v1.full.pdf

Metalign employs a high-speed, high-recall pre-filtering method based on the mathematical concept of Containment Min Hash, which identifies a small number of candidate organisms that are potentially in the sample and creates a subset database consisting of these organisms.

Multi-aligned reads are resolved according to the uniquely-mapped abundances of the organisms that a read is aligned to.

The standard of evidence used to determine the presence of an organism by Metalign could potentially be automatically modulated based on characteristics of the sample such as sequencing depth and estimated alpha diversity.
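A minimal sketch of estimating the containment of one k-mer set in another with a bottom-k MinHash sketch, with a plain Python set standing in for a Bloom filter; which set is sketched, the sequence lengths, k and the sketch size are illustrative choices, not Metalign's.

```python
import hashlib
import random

def kmer_hashes(seq, k=21):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    return {
        int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(),
                                       digest_size=8).digest(), "big")
        for i in range(len(seq) - k + 1)
    }

def containment(query_seq, reference_seq, k=21, sketch_size=200):
    """Estimate |Q intersect R| / |Q| from a bottom-k MinHash sketch of Q."""
    q = kmer_hashes(query_seq, k)
    r = kmer_hashes(reference_seq, k)
    sketch = sorted(q)[:sketch_size]           # bottom-k sketch of the query
    return sum(h in r for h in sketch) / len(sketch)

random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(5000))
query = ref[1000:2000] + "".join(random.choice("ACGT") for _ in range(1000))
print(round(containment(query, ref), 2))        # close to 0.5: half the query is from ref
```

Candidate organisms passing such a containment test form the subset database that is then used for alignment.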





□ Enabling high-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing

>> https://www.biorxiv.org/content/10.1101/645903v3.full.pdf

For rapid testing and iterative development, the ONT UMI approach is attractive due to its low cost and portability.

PB CCS sequencing also performs well for high-accuracy amplicon sequencing, but the presence of low abundant chimeric variants is problematic, especially if they propagate into reference databases.




□ Bayesian differential analysis of gene regulatory networks exploiting genetic perturbations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3314-3

The BFDSEM algorithm is developed for GRNs modeled with structural equation models (SEMs), which makes it possible to incorporate genetic perturbations into models to improve the inference accuracy.

Computer simulations were run to compare the performance of BFDSEM to FSSEM and ReDNet; the results demonstrate that BFDSEM is broadly consistent with FSSEM and performs better than ReDNet.

Compared to FSSEM and ReDNet, the Gibbs sampler in BFDSEM is easy to implement, and not only provides point estimation via the posterior mean or median, but also quantifies the uncertainty via the credible interval automatically.





□ Highly Multiplexed Single-Cell Full-Length cDNA Sequencing of human immune cells with 10X Genomics and R2C2 https://www.biorxiv.org/content/10.1101/2020.01.10.902361v1.full.pdf

At current throughput and accuracy, the combination of ONT sequencers and the R2C2 method allows the analysis of thousands of cells. An increase in read output will make it possible to either analyze more cells or sequence all transcripts reverse transcribed by the 10X workflow.

With about 3,000 R2C2 reads per cell, the method captured about 60% (based on ~5,000 molecules per cell in the Illumina data set) of all reverse-transcribed molecules.





□ HIFI: estimating DNA-DNA interaction frequency from Hi-C data at restriction-fragment resolution

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1913-y

HIFI algorithms aim to reliably estimate Hi-C contact frequencies between all intra-chromosomal pairs of restriction fragments.

HIFI enables restriction-fragment-resolution TAD and subTAD boundary calling, and the identification of potential DNA-DNA contacts and TF enrichments that drive changes in chromatin architecture and gene regulation.

BAM/SAM-mapped read files were then converted (by the BAMtoSparseMatrix.py script) to a raw read-pair count matrix RC, stored using a sparse-matrix TSV file format, before use with HIFI.




□ URMAP: an ultra fast read mapper

>> https://www.biorxiv.org/content/10.1101/2020.01.12.903351v1.full.pdf

This strategy saves index space without compromising search time because in the rare cases where a collided slot is aligned, the alignment will be abandoned quickly due to excessive mismatches in the flanking reference sequence.

URMAP is an order of magnitude faster than the BWT mappers, with URMAP ~9× faster than BWA and Bowtie2 and URMAPv ~20× faster, noting that in practice the speed improvement may be less due to file i/o overhead.




□ SIMPLEs: a single-cell RNA sequencing imputation strategy preserving gene modules and cell clusters variation

>> https://www.biorxiv.org/content/10.1101/2020.01.13.904649v1.full.pdf

SIMPLEs can integrate bulk RNASeq data for estimating dropout rates. In simulations, SIMPLEs performed significantly better than prevailing scRNASeq imputation methods by various metrics.

SIMPLEs, which iteratively identifies correlated gene modules and cell clusters and imputes dropouts customized for individual gene module and cell type. Simultaneously, it quantifies the uncertainty of imputation and cell clustering.

In order to analyze large-scale single-cell sequencing data, a stochastic gradient descent algorithm can be employed. SIMPLEs uses a nested EM algorithm for estimation, but optimizing the factor loading matrix B in the M-step is computationally intensive.





□ CHROMATIX: computing the functional landscape of many-body chromatin interactions in transcriptionally active loci from deconvolved single cells

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1904-z

CHROMATIX enables quantification of the extent of specific 3-, 4-, and higher-order many-body interactions at a large scale.

CHROMATIX also elucidates the functional implications by providing details on how super-enhancers, enhancers, promoters, and other functional units probabilistically assemble into a spatial apparatus with measurable Euclidean distances.




□ Inference of Gene Regulatory Networks Based on Nonlinear Ordinary Differential Equations

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa032/5709036

a nonlinear ordinary differential equations framework to model dynamic gene regulation and an importance measurement strategy to infer all putative regulatory links efficiently. It maintains good robustness and accuracy at a low computational complexity.

The proposed method is a scalable method exploiting time-series and steady-state data jointly, in which nonlinear ODEs and XGBoost are employed to infer gene regulatory networks.




□ A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets

>> https://www.biorxiv.org/content/10.1101/2020.01.17.910513v1.full.pdf

Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis with the hopes that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters.

PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small factor of the optimal size.

Whereas the deterministic algorithm removes the vertex with the maximum hitting number in each iteration, PASHA considers a set of vertices for removal with hitting numbers within an interval and picks vertices in this set independently with constant probability.





□ BREM-SC: A Bayesian Random Effects Mixture Model for Joint Clustering Single Cell Multi-omics Data

>> https://www.biorxiv.org/content/10.1101/2020.01.18.911461v1.full.pdf

BREM-SC, a novel Bayesian Random Effects Mixture model that jointly clusters paired single cell transcriptomic and proteomic data. as a probabilistic model-based approach, BREM-SC is able to quantify the clustering uncertainty for each single cell.

The BREMSC package (with core functions jointDIMMSC and BREMSC) performs joint clustering of droplet-based scRNA-seq and CITE-seq data. jointDIMMSC is developed as a direct extension of DIMMSC, which assumes full independence between single-cell RNA and surface protein data.

BREM-SC uses a computationally intensive MCMC algorithm, which is roughly linear in the number of genes, and utilizes naïve block-wise Gibbs sampling, which divides cells into multiple groups and uses sequential Gibbs sampling when updating cell-specific random effects.





□ PHENSIM: Phenotype Simulator

>> https://www.biorxiv.org/content/10.1101/2020.01.20.912279v1.full.pdf

PHENSIM, a systems biology approach, which can simulate the effects of activation/inhibition of one or multiple biomolecules on cell phenotypes by exploiting signaling pathways.

PHENSIM performs all calculations in the KEGG meta-pathway, obtained by merging KEGG pathways after elimination of duplicates and disease pathways, and integrates information on miRNA-target and transcription factor (TF)-miRNA interactions extracted from online public knowledge bases.





□ LuxHS: DNA methylation analysis with spatially varying correlation structure

>> https://www.biorxiv.org/content/10.1101/2020.01.21.913640v1.full.pdf

This approach builds on a method which combines a generalized linear mixed model (GLMM) with a likelihood that is specific for BS-seq data and that incorporates a spatial correlation for methylation levels.

LuxHS is a novel technique that uses a sparsity-promoting prior to allow for cytosines that deviate from the spatial correlation pattern.






□ Gene Graph Convolutions: Graph biased feature selection of genes is better than random for many genes

>> https://www.biorxiv.org/content/10.1101/2020.01.17.910703v1.full.pdf

The approach takes the k-core decomposition of the STRING network and compares it to a degree-matched random model.

the 400-core of the STRING network does indeed appear to improve single gene inference performance somewhat, as the ∆AUC distribution is positively shifted, and the “long tail” of poorly performing genes is less evident relative to the untransformed STRING network.




□ Fast Lasso method for Large-scale and Ultrahigh-dimensional Cox Model with applications to UK Biobank

>> https://www.biorxiv.org/content/10.1101/2020.01.20.913194v1.full.pdf

a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the L1-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method.

The output of this algorithm is the full Lasso path, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance.




□ The Dynamic Shift Detector: An algorithm to identify changes in parameter values governing populations

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007542

The “Dynamic Shift Detector” is an algorithm to identify changes in parameter values governing temporal fluctuations in populations with nonlinear dynamics.

Dynamic Shift Detector can not only identify and quantify parameter changes but also assess uncertainty in potential break points and help detect time frames where additional research should be focused.




□ Tripal EUtils: a Tripal module to increase exchange and reuse of genome assembly metadata

>> https://academic.oup.com/database/article/doi/10.1093/database/baz143/5709695

Tripal requires mapping of all content types, and all their associated metadata, to ontology terms.

a Tripal extension module, Tripal EUtils, which accesses metadata from the NCBI Assembly, BioProject and BioSample resources using NCBI’s E-utilities and imports it to Chado using the proposed map.





□ RTDT: A new method for inferring timetrees from temporally sampled molecular sequences

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007046

RelTime with Dated Tips [RTDT] estimates pathogen timetrees based on a relative rate framework underlying the RelTime approach that is algebraic in nature and distinct from all other current methods.

The node-by-node accuracy of dates and CIs estimated by RTDT was evaluated together with Bayesian (BEAST and MCMCTree) and non-Bayesian (LSD, TreeTime, and treedater) methods.

RTDT requires orders of magnitude less computational time than other approaches, which makes it feasible to analyze large datasets containing thousands of sequences.





□ From graph topology to ODE models for gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2020.01.22.916114v1.full.pdf

The ODE models that capture the effect of cis-regulatory elements involving protein complex binding, based on the model in the GeneNetWeaver source code, are described in detail and shown to satisfy the Constant Sign Property.

Such ODE models are shown to have great complexity due to many continuous parameters and combinatorial module configurations. It turns out that many ODE models satisfying CSP can correspond to the same graph model.





□ MATCHA: Probing multi-way chromatin interaction with hypergraph representation learning

>> https://www.biorxiv.org/content/10.1101/2020.01.22.916171v1.full.pdf

MATCHA (Multi-wAy inTeracting CHromatin Analysis) is based on hypergraph representation learning, where multi-way chromatin interactions are represented as hyperedges.

MATCHA is a new computational method based on hypergraph representation learning for the analysis of multi-way chromatin interaction data that can provide new insights into nuclear genome structure and function.






□ Transient amplifiers of selection and reducers of fixation for death-Birth updating on graphs

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007529

A perturbative method is developed for the weak-selection regime, meaning that mutations have small fitness effects. The fixation probability under weak selection can be calculated in terms of the coalescence times of random walks.

Using this and other methods, the authors uncover the first known examples of transient amplifiers of selection for the death-Birth process, and also exhibit new families of “reducers of fixation”, which decrease the fixation probability of all mutations.





□ Limits on amplifiers of natural selection under death-Birth updating

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007494

Extensive literature exists on amplifiers of natural selection for the Birth-death Moran process, but no amplifiers are known for the death-Birth Moran process. If amplifiers exist under death-Birth updating, they must be bounded and transient.

Boundedness implies weak amplification, and transience implies amplification for only a limited range of the mutant fitness advantage. These results demonstrate that amplification depends on the specific mechanisms of the evolutionary process.





□ MSMC and MSMC2: The Multiple Sequentially Markovian Coalescent

>> https://link.springer.com/protocol/10.1007%2F978-1-0716-0199-0_7

Using standard HMM algorithms, the hidden state (trees and recombination events) can be integrated out efficiently using dynamic programming.

MSMC2 uses a much simpler pairwise HMM; the pairwise model is, in contrast to MSMC, an exact model under the Sequentially Markovian Coalescent and does not suffer from biases with an increasing number of genomes.

MSMC2 estimates coalescent rates across the entire distribution of pairwise coalescence times, with increasing resolution in more recent times, and importantly without biased estimates.




□ qtQDA: quantile transformed quadratic discriminant analysis for high-dimensional RNA-seq data

>> https://peerj.com/articles/8260/

a new classification method for RNA-seq data based on a model where the counts are marginally negative binomial but dependent.

qtQDA works by first performing a quantile transformation (qt) then applying Gaussian quadratic discriminant analysis (QDA) using regularized covariance matrix estimates.

the regularization approach applied in qtQDA requires no special assumptions for the covariance matrix and requires minimal computation since the regularized estimate is obtained with analytic formulas.
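A minimal sketch of the two-step recipe with stand-ins for both components: a rank-based transform to standard-normal quantiles replaces the paper's negative-binomial marginal transform, and scikit-learn's QDA with its reg_param shrinkage replaces the paper's analytic covariance regularization. The data are toy counts.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def quantile_transform_to_normal(counts):
    """Map each gene's counts to standard-normal quantiles via their ranks
    (a simplification of qtQDA's negative-binomial-based transform)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, counts)   # per-gene ranks
    return norm.ppf(ranks / (n + 1))

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)                              # two classes, 30 samples each
means = np.where(y[:, None] == 0, 5.0, 9.0)
X_counts = rng.poisson(means, size=(60, 20))           # 60 samples x 20 genes

X = quantile_transform_to_normal(X_counts)             # step 1: quantile transform
clf = QuadraticDiscriminantAnalysis(reg_param=0.1)     # step 2: regularized QDA
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```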





□ SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions

>> https://www.sciencedirect.com/science/article/pii/S0933365718302756

SemBioNLQA, a fully automatic system, integrates NLP methods in question classification, document retrieval, passage retrieval and answer extraction modules.

SemBioNLQA provides an unbeatable advantage over AskHERMES, EAGli and Olelo in that it handles a wide variety of question types, including yes/no, factoid, list and summary questions.




□ Bivartect: accurate and memory-saving breakpoint detection by direct read comparison

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa059/5716329

Bivartect achieves high predictive performance with an elaborate memory-saving mechanism, which allows Bivartect to run on a computer with a single node for analyzing small omics data.

Bivartect, a simple yet versatile variant caller based on direct comparison of short sequence reads between normal and mutated samples. Bivartect can detect not only single nucleotide variants but also insertions/deletions, inversions and their complexes.





□ UMI-VarCal: a new UMI-based variant caller that efficiently improves low-frequency variant detection in paired-end sequencing NGS libraries

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa053/5716330

UMI-VarCal stands out from the crowd by being one of the few variant callers that don’t rely on SAMtools to do their pileup. Instead, at its core runs an innovative homemade pileup algorithm specifically designed to treat the UMI tags in the reads.

After the pileup, a Poisson statistical test is applied at every position to determine if the frequency of the variant is significantly higher than the background error noise.





□ Using machine learning to extract coarse-scale PDEs from fine-scale data

>> https://aip.scitation.org/doi/10.1063/10.0000669

Combining techniques such as Gaussian processes and neural networks with feature selection and manifold learning approaches yields broad possibilities for improved data-driven modeling.

This framework is illustrated through the data-driven discovery of the macroscopic, concentration-level PDE resulting from a fine-scale, lattice Boltzmann (LB) model of a reaction/transport process [the FitzHugh-Nagumo (FHN) process in one spatial dimension].





□ Pore-C: New Method Captures Higher-Order Chromatin Contacts With Nanopore Sequencing

>> https://www.genomeweb.com/sequencing/new-method-captures-higher-order-chromatin-contacts-nanopore-sequencing


To investigate the aspect of genome architecture, Pore-C couples chromatin conformation capture with Oxford Nanopore Technologies (ONT) long reads to directly sequence multi-way chromatin contacts without amplification.

Pore-C tools is designed to analyse the data from multi-contact Pore-C reads. It is similar to the pairtools package in scope; however, it is specifically designed to handle multi-contact reads (aka c-walks).





□ DeepTE: a computational method for de novo classification of transposons with convolutional neural network

>> https://www.biorxiv.org/content/10.1101/2020.01.27.921874v1.full.pdf

DeepTE, which classifies unknown TEs using convolutional neural networks. DeepTE transferred sequences into input vectors based on k-mer counts.

DeepTE contains eight models for different classification purposes and also wraps a function to correct false classifications based on domain structure; DeepTE outperforms the current PASTEC tool.





□ WINTF: A New Weighted Imputed Neighborhood-regularized Tri-factorization One-class Collaborative Filtering Algorithm: Application to Target Gene Prediction of Transcription Factors

>> https://ieeexplore.ieee.org/document/8970514

a new weighted imputed neighborhood-regularized tri-factorization algorithm (WINTF), an extension of REMAP, which allows different feature sizes to be set for users and items as well as increasing the power of modeling complex relationships among them.

Increasing the number of low-rank matrices in WINTF to mimic deep learning may be an interesting future study.

The time complexity due to the introduction of an additional low-rank matrix as well as large number of parameters from multilayer neural network can be overcome by factorizing smaller submatrices and projecting to the original feature space.





□ A Systematic Evaluation of Single-cell RNA-sequencing Imputation Methods

>> https://www.biorxiv.org/content/10.1101/2020.01.29.925974v1.full.pdf

While most scRNA-seq imputation methods recover biological expression observed in bulk RNA-seq data, the majority of the methods do not improve performance in downstream analyses compared to no imputation, in particular for clustering and trajectory analysis.

The imputation methods MAGIC, scVI, and DCA resulted in the highest correlation using both UMI and non-UMI plate-based protocols, but SAVER and SAVER-X (without pretraining) resulted in the highest correlation using UMI count data.

Using Monocle 2 trajectory inference, SAVER, kNN-smoothing, mcImpute, and the latent spaces from SAUCIE (SAUCIE_latent) increased both the correlation and overlap compared to no imputation.

In terms of computation, MAGIC, DCA and DeepImpute are among the most efficient methods. kNN-smoothing, ALRA, bayNorm, scImpute, SAUCIE and scScope exhibit high scalability. SAVER and SAVER-X are intermediate, while the remaining methods do not scale well for large datasets.





□ Sapling: Accelerating Suffix Array Queries with Learned Data Models

>> https://www.biorxiv.org/content/10.1101/2020.01.29.925768v1.full.pdf

Sapling, an algorithm for sequence alignment which uses a learned data model to augment the suffix array and enable faster queries.

Sapling uses the idea of learned index structures to model the contents of the suffix array as a function rather than as a data structure, and uses a practical piecewise linear model to efficiently approximate this function.

A simple seed-and-extend aligner is implemented as a proof of concept, using Sapling for seeding and the striped Smith-Waterman algorithm for extending seeds into full alignments.
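A minimal sketch of the learned-index idea Sapling builds on, not Sapling's code: fit a linear model mapping the 2-bit-encoded integer value of a k-mer to its rank in a sorted k-mer list, remember the worst prediction error, and answer queries with a binary search confined to the predicted window. Sapling uses a piecewise linear model over the suffix array; the single global fit and the toy sequence below are simplifications.

```python
import bisect
import numpy as np

def encode(kmer):
    """2-bit encode a k-mer as an integer; preserves lexicographic order."""
    return int("".join(format("ACGT".index(b), "02b") for b in kmer), 2)

class LearnedKmerIndex:
    """Predict a k-mer's rank with a linear model, then binary-search only
    within the model's worst-case error window."""

    def __init__(self, sorted_kmers):
        self.kmers = sorted_kmers
        keys = np.array([encode(k) for k in sorted_kmers], dtype=float)
        ranks = np.arange(len(sorted_kmers), dtype=float)
        self.slope, self.intercept = np.polyfit(keys, ranks, 1)
        pred = self.slope * keys + self.intercept
        self.max_err = int(np.ceil(np.abs(pred - ranks).max()))

    def find(self, kmer):
        guess = int(self.slope * encode(kmer) + self.intercept)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.kmers), guess + self.max_err + 1)
        pos = bisect.bisect_left(self.kmers, kmer, lo, hi)
        return pos if pos < len(self.kmers) and self.kmers[pos] == kmer else -1

seq = "ATGCGTACGTTAGCATGCCGTAATCGGATC"
k = 5
kmers = sorted({seq[i:i + k] for i in range(len(seq) - k + 1)})
index = LearnedKmerIndex(kmers)
assert all(index.find(km) == kmers.index(km) for km in kmers)
print("k-mers:", len(kmers), "max prediction error:", index.max_err)
```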





□ Factorial study of the RNA-seq computational workflow identifies biases as technical gene signatures

>> https://www.biorxiv.org/content/10.1101/2020.01.30.924092v1.full.pdf

Gene-group-specific quantification biases in currently used RNA-seq software and references are identified by using a wide variety of RNA-seq computational pipelines and by decomposing the resulting expression datasets with an independent component analysis matrix factorisation method.


Transcriptome-based software and pseudo-aligners, despite the added difficulty of the combined steps, should be studied considering their growing place in the literature.





Aqua Celestia

2020-01-31 01:31:31 | Cosmetics & Fashion

□ Maison Francis Kurkdjian『Aqua Celestia』

>> https://www.franciskurkdjian.com/

“Aqua Celestia forms a seamless bond between the blue of the sky and the blue of the sea, forging a path toward absolute serenity.”

My favorite perfume lately: Kurkdjian's Aqua Celestia, a fragrance built around the image of the boundary between the endless sky and the sea. The dazzling, crystal-like sparkle of lime, cool mint and blackcurrant wears a soft veil of mimosa and draws ripples across endlessly transparent water… The middle notes bring a supreme, soap-like freshness. A scent that makes you long for summer. Max Richter's piece "Nor earth, Nor boundless sea" fits the image of this perfume perfectly 😌✨





□ Byredo『Blanche』

>> https://www.byredo.com/eu_en/

A luxury fragrance house from Sweden. In recent years its genderless collections, reflecting the perfumer's aesthetic, have quietly been drawing attention. "Blanche" is characterised by a pure, crystal-clear, snow-white scent, with a powdery sweetness emerging from the middle notes. A composition that seems to lay its scent between clothing and skin.





□ Byredo『Mojave Ghost』

Mojave Ghost. Inspired by the ghost flower that blooms in the desert: a somewhat dry amber note creates a dignified oriental character, through which the mellow sweetness of magnolia drifts. A fragrance whose gentleness rests on a strong core, one to reach for when you want to rouse your spirits.





Lately my moment of bliss is to come home, shower, put a perfume chosen by mood on my skin, and go to sleep. I drift off breathing in FUEGUIA1833's "Biblioteca de Babel" ("The Library of Babel"), inspired by a library lined with old books, or "Jacarandá", said to smell like a wooden temple.


□ Max Richter - nor earth, nor boundless sea






Otium.

2020-01-01 00:01:01 | Science News

“Love is a Healer but who heals love?”

The quantity of love is not limitless. It is the sum of the potential accumulated by the process of choices we have constantly been pressed to make at the nodes of interaction leading to coexistence and symbiosis, and by their ripple effects; at times it shifts, at times it reverses.



□ Vargas: heuristic-free alignment for assessing linear and graph read aligners

>> https://www.biorxiv.org/content/10.1101/2019.12.20.884676v1.full.pdf

Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 437 billion cell updates per second.

Vargas calculates all possible alignments of a read to the reference in the course of filling the dynamic programming matrix, and can be used to optimize heuristic alignment accuracy and improve the correctness of difficult ChIP-seq reads by 30% over Bowtie 2's most sensitive alignment mode.





□ ICGRM: integrative construction of genomic relationship matrix combining multiple genomic regions for big dataset

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3319-y

ICGRM splits the genome's SNPs into several parts and calculates summary statistics for each part, which requires very little computer RAM; it then combines the summary statistics of all parts to produce the GRM.

ICGRM uses the Intel Math Kernel Library (Intel MKL) BLAS routines to perform matrix operations. It avoids calculating the GRM for the whole genome at once, which makes construction of the GRM more efficient.

When the number of SNPs is greater than 10 million and the number of individuals is greater than 10,000, ICGRM solves the problem by splitting the dataset and merging the summary statistics, which reduces computer memory usage dramatically.

ICGRM calculates the GRM for each segment separately, where users can optionally define the weights of the SNP effects; a second command then combines the GRMs from all segments/loci to generate the final GRM.
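A small numpy sketch of why the split-and-merge works: a VanRaden-style GRM is a sum of per-segment cross-products over a common denominator, so only one segment ever needs to be held in memory. This is an illustration of the idea, not ICGRM's MKL/BLAS implementation, and the optional SNP weighting is omitted.

```python
import numpy as np

def grm_by_segments(genotypes, n_segments=4):
    """Build a VanRaden-style GRM segment by segment.

    genotypes : (n_individuals, n_snps) array of 0/1/2 allele counts.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    n_ind, n_snp = genotypes.shape
    grm = np.zeros((n_ind, n_ind))
    denom = 0.0
    for seg in np.array_split(np.arange(n_snp), n_segments):
        G = genotypes[:, seg]                  # in practice: load one segment
        p = G.mean(axis=0) / 2.0               # per-SNP allele frequencies
        Z = G - 2.0 * p                        # centred genotypes
        grm += Z @ Z.T                         # per-segment cross-product
        denom += float(np.sum(2.0 * p * (1.0 - p)))
    return grm / denom

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(20, 1000))     # 20 individuals, 1000 SNPs
full = grm_by_segments(geno, n_segments=1)
split = grm_by_segments(geno, n_segments=8)
print(np.allclose(full, split))                # True: identical GRM, less memory
```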






□ NanoCaller for accurate detection of SNPs and small indels from long-read sequencing by deep neural networks:

>> https://www.biorxiv.org/content/10.1101/2019.12.29.890418v1.full.pdf

NanoCaller integrates haplotype structure in deep convolutional neural network for the detection of SNPs/indels from long-read sequencing data, and uses multiple sequence alignment to re-align candidate sites for indels, to improve the performance of variant calling.

NanoCaller uses long-range information to generate predictions for each candidate variant site by considering pileup information of other candidate sites sharing reads. it performs read phasing and carries out local realignment on each set of phased reads to call indels.

NanoCaller uses three convolutional layers with Scaled Exponential Linear Unit (SELU) activation units followed by two different full connection layers for SNP calling. On all genomes, NanoCaller achieves much better performance than Clairvoyante.





□ Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

>> https://www.biorxiv.org/content/10.1101/2019.12.23.887349v1.full.pdf

Smash++ features improved accuracy, obtained by using multiple finite-context models along with substitution-tolerant Markov models to find fine-grained and coarse-grained chromosomal rearrangements.

The Smash++ visualizer allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating a Scalable Vector Graphics (SVG) image.

The Kolmogorov complexity is not computable; hence, an alternative is required to approximate it. Smash++ employs a reference-free compressor to approximate the complexity and, consequently, the redundancy of the similar regions found in the reference and target sequences.
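A minimal illustration of the compression-as-complexity idea, using a general-purpose compressor (lzma) as a stand-in for the finite-context and substitution-tolerant Markov models Smash++ actually uses; the sequences and the 2-bits-per-base baseline are illustrative.

```python
import lzma
import random

def approx_complexity(seq):
    """Approximate Kolmogorov complexity as compressed size in bits per symbol."""
    data = seq.encode()
    return 8 * len(lzma.compress(data)) / len(data)

def redundancy(seq):
    """Redundancy relative to the 2 bits/base of a random DNA sequence."""
    return max(0.0, 1.0 - approx_complexity(seq) / 2.0)

random.seed(0)
repetitive = "ACGT" * 2500                               # highly redundant
rand = "".join(random.choice("ACGT") for _ in range(10000))
for name, s in [("repetitive", repetitive), ("random", rand)]:
    print(name, round(approx_complexity(s), 2), "bits/base,",
          "redundancy", round(redundancy(s), 2))
```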




□ MADOKA: an ultra-fast approach for large-scale protein structure similarity searching

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3235-1

MADOKA performs about 6–100 times faster than existing methods, including TM-align and SAL, in massive alignments.

MADOKA employs a score to select pairs with more similarity to carry out a more accurate fragment-based residue-level alignment.





□ SpecHap: a fast haplotyping method based on spectral graph theory

>> https://www.biorxiv.org/content/biorxiv/early/2019/12/10/870972.full.pdf

SpecHap, a novel approach that adopts spectral graph theory for fast diploid haplotype construction from diverse sequencing protocols.

For 10x linked reads, barcodes are also examined, with their range inferred from the alignment result, so that a linked fragment can cover het-SNV loci separated by thousands of base pairs; for Hi-C, linkages among het-SNVs millions of base pairs apart can be extracted.

The unnormalized graph Laplacian is then calculated on the adjacency matrix of the ladder-shaped graph, and a cut guided by the Fiedler vector, leading to two haplotype strings, is performed.

Instead of calculating posterior probabilities, SpecHap locates a conflicting region when the Fiedler vector fails to provide two haplotypes, and cuts the phase block accordingly. SpecHap outputs a VCF file that records phased variants with block identifiers.

Spectral graph theory states that the multiplicity of zero eigenvalues of the graph Laplacian signifies the number of connected components in the graph, and the eigenvector contains the union of spectral signal for all the connected subgraphs.
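A toy numpy illustration of that machinery (a sketch of the idea, not SpecHap's implementation, which builds a ladder-shaped allele graph from the fragment data): build a small weighted linkage graph, take the unnormalized Laplacian, and split the vertices by the sign of the Fiedler vector, the eigenvector of the second-smallest eigenvalue.

```python
import numpy as np

# Toy weighted graph: two tightly linked groups of variants joined by one
# weak edge (weights are illustrative evidence counts).
W = np.array([
    [0, 5, 4, 0, 0, 0],
    [5, 0, 6, 0, 0, 0],
    [4, 6, 0, 1, 0, 0],   # weight-1 edge weakly bridges the two groups
    [0, 0, 1, 0, 7, 5],
    [0, 0, 0, 7, 0, 6],
    [0, 0, 0, 5, 6, 0],
], dtype=float)

L = np.diag(W.sum(axis=1)) - W             # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
fiedler = eigvecs[:, 1]                    # eigenvector of 2nd-smallest eigenvalue
partition = (fiedler > 0).astype(int)      # sign pattern gives the spectral cut
print("Fiedler vector:", np.round(fiedler, 3))
print("partition:     ", partition)        # separates vertices 0-2 from 3-5
```

The sign of the Fiedler vector gives the two sides of the cut; in SpecHap the analogous split of its graph yields the two haplotype strings, and a Fiedler vector that fails to separate the vertices cleanly flags a conflicting region where the phase block is cut.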





□ UniPath: A uniform approach for pathway and gene-set based analysis of heterogeneity in single-cell epigenome and transcriptome profiles https://www.biorxiv.org/content/biorxiv/early/2019/12/11/864389.full.pdf

UniPath, for representing single-cells using pathway and gene-set enrichment scores by transformation of their open-chromatin or expression profiles. UniPath also provides consistency and scalability in estimating gene-set enrichment scores for every cell.

UniPath also enables exploiting the pathway continuum and dropping known covariate gene-sets for predicting the temporal order of single cells, through a novel pseudo-temporal ordering method that can use pathway scores.





□ Explaining the genetic causality for complex diseases via deep association kernel learning

>> https://www.biorxiv.org/content/10.1101/2019.12.17.879866v1.full.pdf

Deep Association Kernel learning (DAK) model to enable automatic causal genotype encoding for GWAS at pathway level. DAK framework incorporates convolutional layers to encode raw SNPs as latent genetic representation.





□ AdaReg: Data Adaptive Robust Estimation in Linear Regression with Application in GTEx Gene Expressions

>> https://www.biorxiv.org/content/biorxiv/early/2019/12/10/869362.full.pdf

AdaReg constructs a robust likelihood criterion based on weighted densities in a mixture model of a Gaussian population distribution and an unknown outlier distribution, and develops a data-adaptive γ-selection procedure embedded into the robust estimation.





□ Inferring reaction network structure from single-cell, multiplex data, using toric systems theory

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007311

The Effective Stoichiometric Space (ESS) elucidates network structure from the covariances of fixed-time-point, single-cell multiplex data.

eigendecomposition of covariance matrices from sc-data can be interpreted in terms of network stoichiometry and timescales, without model simulation, independent of kinetic parameters, and unhindered by unobserved species.

Simulation of synthetic complex-balanced networks and GRNs suggests ways to tailor reaction network ODEs to better match sc-data. it is possible to predict a partitioning of the eigenspace of Σ without actually simulating the ODE network, under the assumption of toric geometry.

This application of toric theory enables a data-driven mapping of covariance relationships in single-cell measurements into stoichiometric information, one in which each cell subpopulation has its associated Effective Stoichiometric Spaces interpreted in terms of CRN theory.






□ The Euler Characteristic and Topological Phase Transitions in Complex Systems

>> https://www.biorxiv.org/content/biorxiv/early/2019/12/11/871632.full.pdf

The authors theoretically illustrate the emergence of topological phase transitions in three classical network models, namely the Watts-Strogatz model, the Random Geometric Graph, and the Barabasi-Albert model.

Topological phase transitions are characterized by zeros of the Euler characteristic (EC) or by singularities of the Euler entropy, and also signal changes in the mean node curvature of networks and the emergence of a giant k-cycle in the simplicial complex.
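For intuition, a small self-contained computation of the Euler characteristic as the alternating sum of simplex counts (illustrative only; the paper tracks χ along filtrations of far larger complexes):

```python
def euler_characteristic(simplices):
    """chi = sum over dimensions d of (-1)^d * (number of d-simplices)."""
    counts = {}
    for s in simplices:
        d = len(s) - 1
        counts[d] = counts.get(d, 0) + 1
    return sum((-1) ** d * n for d, n in counts.items())

# Hollow triangle: 3 vertices + 3 edges -> chi = 0 (a zero of the EC).
hollow = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
# Filled triangle: adding the 2-simplex gives chi = 1.
filled = hollow + [(0, 1, 2)]
print(euler_characteristic(hollow), euler_characteristic(filled))
```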





□ An Efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets

>> https://www.nature.com/articles/s41598-019-54987-1

A hybrid method based on the IWSSr method and the Shuffled Frog Leaping Algorithm (SFLA) is proposed to select effective features in large-scale gene datasets.

In the filter phase, the Relief method is used to weight the features. Then, in the wrapper step, SFLA combined with the IWSSr algorithm searches for the best subset of features.





□ ASTRAL-Pro: quartet-based species tree inference despite paralogy

>> https://www.biorxiv.org/content/10.1101/2019.12.12.874727v1.full.pdf

ASTRAL-Pro is more accurate than alternative methods when gene trees differ from the species tree due to the simultaneous presence of gene duplication, gene loss, incomplete lineage sorting, and estimation errors.

ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) is a new quartet-based species tree inference method. It defines a quartet-based measure in a principled manner and shows how to optimize it using dynamic programming.





□ Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes

>> https://www.biorxiv.org/content/10.1101/2019.12.20.871939v1.full.pdf

A novel approach overcomes the high error rates in basecalled sequences by integrating a Viterbi error correction decoder with the basecaller, enabling the decoder to exploit the soft information available in the deep learning based basecaller pipeline.

To match the sequential Markov nature of the nanopore sequencing and basecalling process, a convolutional coding scheme is used as the inner code, achieving 3x lower reading cost than state-of-the-art approaches at similar writing cost.

The convolutional code is decoded using the transition probabilities generated by the basecaller rather than the final basecalled sequence: the code structure dictates the possible transitions, while the probabilities from the basecaller score the candidate paths.
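A generic Viterbi sketch of that idea, with a hypothetical toy state space and transition scores (not the paper's decoder): the allowed transitions stand in for the code structure, and the per-step scores stand in for the basecaller's transition probabilities.

```python
import numpy as np

def viterbi(log_emissions, allowed, log_trans):
    """log_emissions: T x S per-step scores from the basecaller;
    allowed[s]: predecessor states permitted by the code structure;
    log_trans[p, s]: score of transitioning from p to s.
    Returns the highest-scoring state path."""
    T, S = log_emissions.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0] = log_emissions[0]
    for t in range(1, T):
        for s in range(S):
            scores = [dp[t - 1, p] + log_trans[p, s] for p in allowed[s]]
            best = int(np.argmax(scores))
            dp[t, s] = scores[best] + log_emissions[t, s]
            back[t, s] = allowed[s][best]
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Hypothetical 2-state toy; both predecessors are allowed for each state.
emis = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]))
trans = np.log(np.array([[0.6, 0.4], [0.5, 0.5]]))
print(viterbi(emis, {0: [0, 1], 1: [0, 1]}, trans))
```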




□ HyPo: Super Fast & Accurate Polisher for Long Read Genome Assemblies

>> https://www.biorxiv.org/content/10.1101/2019.12.19.882506v1.full.pdf

HyPo, a Hybrid Polisher, utilises short as well as long reads within a single run to polish long-read assemblies of small and large genomes.

HyPo generates significantly more accurate polished assemblies in about one-third of the time, with only about half the memory requirements, compared to Racon.

HyPo exploits unique genomic k-mers to selectively polish segments of contigs using Partial Order Alignment of selected read segments.





□ FRASER: Detection of aberrant splicing events in RNA-seq data with FRASER

>> https://www.biorxiv.org/content/10.1101/2019.12.18.866830v1.full.pdf

FRASER is based on a count distribution and multiple testing correction, reducing the number of calls by two orders of magnitude over commonly applied z score cutoffs, with a minor sensitivity loss.

The optimal dimension for the latent space was determined by maximizing the area under the precision-recall curve when calling artificially injected aberrant values independently for each splicing metric.

The method was robust to the exact choice of the encoding dimension, as the performance for recalling artificial outliers typically plateaued around the optimal dimension.
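A hedged sketch of that selection scheme, using PCA as a stand-in for the latent-space model and synthetic injected outliers (all names and numbers below are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import average_precision_score

def choose_dimension(X, dims, n_outliers=200, shift=4.0, seed=0):
    """Pick the latent dimension whose reconstruction best recalls injected outliers."""
    rng = np.random.default_rng(seed)
    X_inj, labels = X.copy(), np.zeros(X.size, dtype=int)
    idx = rng.choice(X.size, n_outliers, replace=False)
    X_inj.ravel()[idx] += rng.choice([-shift, shift], n_outliers)  # inject aberrant values
    labels[idx] = 1
    scores = {}
    for d in dims:
        pca = PCA(n_components=d).fit(X_inj)
        resid = np.abs(X_inj - pca.inverse_transform(pca.transform(X_inj)))
        scores[d] = average_precision_score(labels, resid.ravel())  # area under PR curve
    return max(scores, key=scores.get), scores

# Usage (hypothetical): best_d, curve = choose_dimension(splice_metric_matrix, range(2, 20))
```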




□ M3S: a comprehensive model selection for multi-modal single-cell RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3243-1

For a gene fitted with multiple peaks in a Drop-seq data set, zero expressions and those expressions falling into the lowest peak are treated as insignificant, while the remaining expressions in larger peaks are treated as different levels of true expression.

The Zero-Inflated Mixture Gaussian (ZIMG) model achieved the best fit for 10x Genomics data. Because the error of the non-zero expressions is hard to model due to varied experimental resolution, ZIMG uses a Gaussian distribution to cover the variation in the errors of lowly expressed genes.




□ The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies

>> https://www.biorxiv.org/content/10.1101/2019.12.17.864991v1.full.pdf

The hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to “polish” the consensus built from long reads.

POLCA (POLishing by Calling Alternatives) is more accurate than Pilon, and comparable in accuracy to Racon.






□ Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

>> https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-019-1861-6

The multinomial model adequately describes negative control data, and there is no need to model zero inflation.

Simple multinomial methods are proposed, including generalized principal component analysis (GLM-PCA) for non-normal distributions and feature selection using deviance.
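A minimal sketch of deviance-based feature selection under the binomial approximation to the per-gene multinomial (not the authors' reference implementation; the cutoff in the usage comment is hypothetical):

```python
import numpy as np

def binomial_deviance(counts):
    """Per-gene binomial deviance against a constant-rate null.
    counts: cells x genes UMI matrix (dense ndarray)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)        # total UMIs per cell
    pi = counts.sum(axis=0) / n.sum()            # null per-gene rate
    mu = n * pi                                  # expected counts under the null
    with np.errstate(divide="ignore", invalid="ignore"):
        term1 = counts * np.log(counts / mu)
        term2 = (n - counts) * np.log((n - counts) / (n - mu))
    term1 = np.nan_to_num(term1)                 # convention: 0 * log 0 = 0
    term2 = np.nan_to_num(term2)
    return 2.0 * (term1 + term2).sum(axis=0)     # one deviance value per gene

# Usage (hypothetical): keep the 2000 highest-deviance genes.
# top = np.argsort(binomial_deviance(X))[::-1][:2000]
```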





□ Representation learning of genomic sequence motifs with convolutional neural networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007560

Deep convolutional neural networks designed to foster hierarchical representation learning of sequence motifs—assembling partial features into whole features in deeper layers—tend to learn distributed representations, i.e. partial motifs.

Max-pooling and convolutional filter size modulate information flow, controlling the extent to which deeper layers can build features hierarchically.
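An illustrative PyTorch sketch (hyperparameters hypothetical, not the paper's architectures): small first-layer filters with modest pooling leave room for deeper layers to assemble partial motifs hierarchically, whereas large pooling windows force whole-motif learning in the first layer.

```python
import torch.nn as nn

# One-hot DNA input: 4 channels (A, C, G, T), arbitrary sequence length.
hierarchical = nn.Sequential(
    nn.Conv1d(4, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),   # modest pooling
    nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),  # deeper layers can
    nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),                  # assemble partial motifs
    nn.AdaptiveMaxPool1d(1), nn.Flatten(), nn.Linear(128, 1),
)
```

Replacing `nn.MaxPool1d(2)` with a much larger window would collapse spatial detail early, pushing the network toward learning whole motifs in the first layer instead of distributed partial-motif representations.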






□ Distance Indexing and Seed Clustering in Sequence Graphs

>> https://www.biorxiv.org/content/10.1101/2019.12.20.884924v1.full.pdf

A minimum-distance algorithm with run time linear in the depth of the snarl tree. The distance is also more strictly defined than in the previous implementation in vg, and the algorithm is faster than other distance algorithms on queries of arbitrary distance.

The minimum distance algorithm will also work with any sequence graph, whereas the preexisting vg distance algorithm required pre-specified paths.




□ A Metric Space of Ranked Tree Shapes and Ranked Genealogies

>> https://www.biorxiv.org/content/10.1101/2019.12.23.887125v1.full.pdf

The metrics provide the basis for a decision-theoretic statistical inference that can be constructed by finding the best-ranked genealogy that minimizes the expected error or loss function, which, in turn, is a function of the tree distance.

This metric space provides a tool for evaluating convergence and the mixing of Markov chain Monte Carlo procedures on ranked genealogies.

The proposed distances inherit the properties of L1 and L2 distances of symmetric positive definite (SPD) matrices, and define a distance on ranked unlabeled isochronous trees, with all samples obtained at the same point in time.





□ TandemMapper and TandemQUAST: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

>> https://www.biorxiv.org/content/10.1101/2019.12.23.887158v1.full.pdf

The tandemMapper algorithm is inspired by the minimap2 and Flye mappers. TandemQUAST uses general metrics for evaluating ETRs of any kind, and centromeric metrics designed specifically to account for the HOR structure of centromeric ETRs.

TandemQUAST consists of a read mapping module that identifies positions of read alignments to the assembly, a polishing module that improves the quality of the assembly based on the identified read alignments, and a quality assessment module.





□ GCNG: Graph convolutional networks for inferring cell-cell interactions https://www.biorxiv.org/content/10.1101/2019.12.23.887133v1.full.pdf

GCNG first encodes the single-cell spatial expression data as one matrix of cell locations and another matrix of gene expression, and then feeds them into a five-layer graph convolutional neural network to predict gene relationships between cells.

The core structure of GCN is its graph convolutional layer, which enables it to combine graph structure (cell location) and node information (gene expression in specific cell) as inputs to a neural network.

On spatial transcriptomics data, GCNG improves upon prior methods suggested for this task and can propose novel pairs of extracellularly interacting genes; the output of GCNG can also be used for downstream analysis, including functional assignment.
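A minimal NumPy sketch of a single graph-convolution step of the kind GCNG stacks, assuming a cell-adjacency matrix built from spatial proximity and an expression matrix as node features (all data below are random placeholders):

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(A_hat @ X @ W) with symmetric normalisation.
    adjacency: cells x cells (from spatial proximity); features: cells x genes."""
    a_tilde = adjacency + np.eye(adjacency.shape[0])        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    a_hat = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_hat @ features @ weights, 0.0)      # ReLU

# Hypothetical toy: 5 cells, 3 genes, 8 hidden units.
rng = np.random.default_rng(1)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                              # symmetric, no self-loops yet
H = gcn_layer(A, rng.random((5, 3)), rng.random((3, 8)))
print(H.shape)
```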





□ Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1

An unconstrained negative binomial model may overfit scRNA-seq data; this is overcome by pooling information across genes with similar abundances to obtain stable parameter estimates.

This procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression.

The Pearson residuals of this model represent a variance-stabilizing transformation that removes the inherent dependence between a gene's average expression and cell-to-cell variation.

A generalized linear model (GLM) is constructed for each gene, with UMI counts as the response and sequencing depth as the explanatory variable.
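A hedged sketch of the resulting transformation using a common analytic approximation (a fixed dispersion and a depth-times-gene-fraction mean), rather than sctransform's regularized per-gene GLM fit:

```python
import numpy as np

def nb_pearson_residuals(counts, theta=100.0):
    """Approximate negative-binomial Pearson residuals for a cells x genes UMI matrix.
    mu_ij = n_i * p_j (depth times gene fraction); variance = mu + mu^2 / theta."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)                   # sequencing depth per cell
    p = counts.sum(axis=0, keepdims=True) / counts.sum()    # gene fraction across cells
    mu = n @ p
    resid = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)
    # Clipping to +/- sqrt(n_cells) is a common stabilisation step.
    clip = np.sqrt(counts.shape[0])
    return np.clip(resid, -clip, clip)
```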





□ VEF: a Variant Filtering tool based on Ensemble methods

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz952/5686383

VEF is a variant filtering tool based on decision-tree ensemble methods that overcomes the main drawbacks of VQSR and hard filtering (HF).

VEF is based on ensemble methods that use decision trees as base learners, and trains the model on a variant call set from a sample for which a high-confidence set of “true” variants (i.e., a ground truth or gold standard) is available.
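A minimal sketch of that training setup using a random-forest ensemble from scikit-learn; the annotation features and labels below are hypothetical placeholders, not VEF's actual feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-variant annotation vectors (e.g. QD, FS, MQ from the VCF INFO field).
X_train = np.array([[12.3,  1.2, 60.0],
                    [ 2.1, 35.0, 41.0],
                    [15.8,  0.5, 59.0]])
y_train = np.array([1, 0, 1])   # 1 = present in the truth set, 0 = not

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Filter a new call set by thresholding the predicted probability of being a true variant.
keep = clf.predict_proba(X_train)[:, 1] > 0.5
```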




□ SMART: SuperMaximal Approximate Repeats Tool

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz953/5686382

Supermaximal k-mismatch repeats are linear in n and capture all maximal k-mismatch repeats: every maximal k-mismatch repeat is a substring of some supermaximal k-mismatch repeat.

SMART employs recent algorithmic advances in approximate string matching under Hamming distance to compute supermaximal k-mismatch repeats without explicitly computing all maximal repeated pairs.




□ A robustness metric for biological data clustering algorithms

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3089-6

If ground truth is largely unknown, or if hierarchical structure is implicit in the data under study, then hierarchical clustering can serve at least as a good starting candidate given its excellent robustness, relative simplicity and intuitive appeal.

For more complex clustering tasks, however, the authors endorse instead a graph-theoretical method such as paraclique, due to its solid overall robustness and its much improved potential for biological fidelity.





□ The transcriptome dynamics of single cells during the cell cycle

>> https://www.biorxiv.org/content/10.1101/2019.12.23.887570v1.full.pdf

The linearity of this algorithm is in contrast to non-linear analysis and visualization methods (k-nearest-neighbors, UMAP, tSNE), which can be used to flatten more complex manifolds onto a two-dimensional space.

It is generally accepted that single-cell transcriptomic profiles characterize an expression manifold embedded in the expression space of all genes.

If cells have evolved “optimality principles” to traverse the cell cycle, it is tempting to speculate that similar optimality principles of gene expression trajectories may have evolved for a large variety of biological systems.





□ VariantStore: A Large-Scale Genomic Variant Search Index

>> https://www.biorxiv.org/content/10.1101/2019.12.24.888297v1.full.pdf

The scalability of VariantStore is demonstrated by indexing genomic variants from the TCGA-BRCA project (8640 samples, 5M variants) in ≈4 hours, and from the 1000 Genomes project (2500 samples, 924M variants) in ≈3 hours.

VariantStore outperformed the VG toolkit by 3× in terms of memory usage and construction time, and uses 25% less disk space, although the VG toolkit does not support variant queries. In the future, VariantStore will support dynamic updates following the LSM-tree design.




□ Efficient computation of stochastic cell-size transient dynamics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3213-7

A continuous-time Markov chain (CTMC) whose transient dynamics can be estimated numerically using the finite state projection (FSP) approach.

Finite state projection maps the infinite set of states n∈ℕ of a Markov chain onto a set with a finite number of states. The transient probability distribution of such a finite-state Markov chain can be approximated using standard numerical ODE solvers.

Continuous-rate models consider, besides discrete division events, the cell-cycle dynamics. This class of models describes division as a continuous-time stochastic process with an associated division rate that sets the probability of division within an infinitesimal time interval.
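A minimal FSP sketch for a toy birth-death process (hypothetical rates, not the paper's cell-size model): truncate the state space, build the generator matrix, and integrate the master equation dp/dt = Qp with a standard ODE solver.

```python
import numpy as np
from scipy.integrate import solve_ivp

N, birth, death = 50, 5.0, 0.2          # truncation bound and hypothetical rates
Q = np.zeros((N + 1, N + 1))            # generator: column j holds outflow from state j
for n in range(N + 1):
    if n < N:
        Q[n + 1, n] += birth            # n -> n + 1
        Q[n, n] -= birth
    if n > 0:
        Q[n - 1, n] += death * n        # n -> n - 1
        Q[n, n] -= death * n

p0 = np.zeros(N + 1)
p0[0] = 1.0                             # start with zero molecules
sol = solve_ivp(lambda t, p: Q @ p, (0.0, 10.0), p0, t_eval=[1.0, 5.0, 10.0])

# 1 - total probability mass bounds the error introduced by the truncation (FSP error).
print(1.0 - sol.y.sum(axis=0))
```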





□ Sanity: Bayesian inference of the gene expression states of single cells from scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2019.12.28.889956v1.full.pdf

The prior distribution over LTQs for a gene is the least assuming, i.e. maximum-entropy, distribution consistent with only a given mean and variance.

Indeed, even though methods such as SAVER, DCA, MAGIC, and scVI specifically normalize for the total UMI count per cell, their normalized expression levels show strong correlations (and anti-correlations) with total UMI count.

Sanity (SAmpling Noise corrected Inference of Transcription activitY) is deterministic, has zero tunable parameters, and provides error-bars for all its estimates.





□ A novel algorithm for alignment of multiple PPI networks based on simulated annealing

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6302-0

An efficient algorithm based on graph feature vectors globally aligns multiple PPI networks. A target scoring function, integrating both topology and sequence information, is used to evaluate the quality of the network alignment.
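A generic simulated-annealing skeleton of the kind such an aligner could use (a sketch under stated assumptions, not the paper's implementation): the scoring and neighbour-proposal functions are supplied by the caller and would encode the topology-plus-sequence objective.

```python
import math
import random

def simulated_annealing(initial, score, neighbour, t0=1.0, cooling=0.995, steps=5000):
    """Maximise `score` over candidate alignments.
    `neighbour` proposes a small change, e.g. swapping two mapped node pairs."""
    current, best = initial, initial
    t = t0
    for _ in range(steps):
        candidate = neighbour(current)
        delta = score(candidate) - score(current)
        # Accept improvements always; accept worsening moves with probability exp(delta / t).
        if delta >= 0 or random.random() < math.exp(delta / t):
            current = candidate
            if score(current) > score(best):
                best = current
        t *= cooling                     # geometric cooling schedule
    return best
```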




□ Empirical decomposition of the explained variation in the variance components form of the mixed model

>> https://www.biorxiv.org/content/10.1101/2019.12.28.890061v1.full.pdf

A novel coefficient of determination is proposed which is dimensionless, has an intuitive and simple definition in terms of variance explained, is additive over several random effects, and reduces to the adjusted coefficient of determination in the linear model.