"7 is the only prime followed by a cube."
There is no world partitioned into life and death; we are beings carved out of a single world. Life is like a tributary of matter and signals, and death resembles the severing of a tributary. A dammed river stagnates, grows turbid, and finds another course through the inertia welling up from its source.
The behavior of mind and mass is the dynamics of light itself.
□ TIMELAPSE OF THE FUTURE: A Journey to the End of Time (4K)
How's it all gonna end? This experience takes us on a journey to the end of time, trillions of years into the future, to discover what the fate of our planet and our universe may ultimately be.
□ Confidence reports in decision-making with multiple alternatives violate the Bayesian confidence hypothesis
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/21/583963.full.pdf
The Max model (which corresponds to the Bayesian confidence hypothesis) and the Entropy model (in which confidence is derived from the entropy of the posterior distribution) fell short in accounting for the data.
The results were robust to changes in stimulus configuration and to the provision of trial-by-trial feedback, and demonstrate that the posterior probabilities of the unchosen categories impact confidence in decision-making.
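The contrast between the two models can be made concrete. A minimal sketch (function names and example posteriors are mine, not the paper's): the Max model scores confidence by the posterior of the chosen category alone, while the Entropy model also reacts to how probability is spread over the unchosen categories.

```python
import numpy as np

def max_confidence(posterior):
    """'Max' model (Bayesian confidence hypothesis): confidence is the
    posterior probability of the chosen, maximum a posteriori category."""
    return float(np.max(posterior))

def entropy_confidence(posterior):
    """'Entropy' model: confidence falls with the Shannon entropy of the
    whole posterior (1 = certainty, 0 = uniform over all categories)."""
    p = np.asarray(posterior, dtype=float)
    p = p[p > 0]
    h = -np.sum(p * np.log2(p))
    return 1.0 - h / np.log2(len(posterior))

# Same maximum, different runner-up mass: Max cannot tell these apart,
# Entropy can, because it sees the unchosen categories.
a, b = [0.6, 0.2, 0.2], [0.6, 0.39, 0.01]
print(max_confidence(a) == max_confidence(b))         # True
print(entropy_confidence(b) > entropy_confidence(a))  # True
```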
□ lionessR: single-sample network reconstruction in R
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/21/582098.full.pdf
LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) estimates individual sample networks by applying linear interpolation to the predictions made by existing aggregate network inference approaches.
The default network reconstruction method we use here is based on Pearson correlation. However, lionessR can run on any network reconstruction algorithm that returns a complete, weighted adjacency matrix.
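The interpolation idea fits in a few lines. This follows the published LIONESS formula, net(q) = N·(net(all) − net(all−q)) + net(all−q); the code below is an illustrative NumPy sketch, not the lionessR implementation.

```python
import numpy as np

def lioness(data, net_fn):
    """LIONESS-style single-sample networks (sketch of the published
    interpolation formula, not the lionessR package).
    data: genes x samples matrix; net_fn: any aggregate network
    estimator mapping such a matrix to an adjacency matrix."""
    n = data.shape[1]
    agg = net_fn(data)                            # network over all samples
    nets = []
    for q in range(n):
        loo = net_fn(np.delete(data, q, axis=1))  # leave sample q out
        # sample q's network: what the aggregate gains by including q,
        # scaled back up to a single-sample estimate
        nets.append(n * (agg - loo) + loo)
    return nets

pearson = np.corrcoef            # default reconstruction method, as in lionessR
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 20))  # 5 genes, 20 samples
single = lioness(expr, pearson)
print(len(single), single[0].shape)
```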
□ Analysis of error profiles in deep next-generation sequencing data
>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1659-6
a comprehensive analysis of the substitution errors in deep sequencing data and discovered that the substitution error rate can be computationally suppressed to 10^−5 to 10^−4, which is 10- to 100-fold lower than generally considered achievable (10^−3) in the current literature.
To measure substitution errors, the authors took advantage of the high-depth sequencing data generated from the flanking sequences in amplicons known to be devoid of genetic variation.
□ ChIPulate: A comprehensive ChIP-seq simulation pipeline
>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006921
ChIPulate simulates key steps of the ChIP-seq protocol with the aim of estimating the relative effects of various sources of variation on motif inference and binding affinity estimation.
Besides providing specific insights and recommendations, it provides a general framework to simulate sequence reads in a ChIP-seq experiment, which should considerably aid the development of software for analyzing ChIP-seq data.
□ Multiple Sequentially Markovian Coalescent (MSMC)-IM: Tracking human population structure through time from whole genome sequences
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/21/585265.full.pdf
MSMC-IM uses an improved implementation of the MSMC (MSMC2) to estimate coalescence rates within and across pairs of populations, and then fits a continuous Isolation-Migration model to these rates to obtain a time-dependent estimate of gene flow.
An important direction for future work is to achieve a generalisation of the continuous concept of population separation to multiple populations, which might help to better understand and quantify the processes that shaped human population diversity in the deep history of our species.
□ TAD fusion score: discovery and ranking the contribution of deletions to genome structure
>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1666-7
The proposed method for TAD fusion discovery has several applications; for example, it provides biologists a way to rank and pick deletions that potentially cause a significant disruption of genome structure.
the approach presented here for deletions can be extended to consider other types of structural variants, such as inversions and translocations.
□ Error, noise and bias in de novo transcriptome assemblies
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/22/585745.full.pdf
Much of the bias and noise is due to incorrect estimation of the effective length of transcripts and genes, which is fundamental to abundance calculations.
Length-scaled abundance estimators partly alleviate this problem, and more pipelines should be developed to leverage them.
□ OMGS: Optical Map-based Genome Scaffolding:
>> https://www.biorxiv.org/content/10.1101/585794v1
OMGS is a fast genome scaffolding tool that takes advantage of one or multiple Bionano optical maps to accurately generate scaffolds. Instead of using single optical maps alternately, OMGS uses multiple optical maps at the same time, exploiting the redundancy contained in multiple maps to generate "optimal" scaffolds that make the smartest tradeoff between contiguity and correctness.
□ ngsLD: evaluating linkage disequilibrium using genotype likelihoods
>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz200/5418793
ngsLD is a program to estimate pairwise linkage disequilibrium (LD) while taking the uncertainty of genotype assignment into account. It does so by avoiding genotype calling and using genotype likelihoods or posterior probabilities instead.
This method makes use of the full information available from sequencing data and provides accurate estimates of linkage disequilibrium patterns compared to approaches based on genotype calling.
□ SArKS: de novo discovery of gene expression regulatory motif sites and domains by suffix array kernel smoothing
>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz198/5418797
SArKS applies nonparametric kernel smoothing to uncover promoter motif sites that correlate with elevated differential expression scores. SArKS detects motif k-mers by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs).
□ Superlets: time-frequency super-resolution using wavelet sets:
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/21/583732.full.pdf
Classical spectral estimators, like the short-time Fourier transform (STFT) or the continuous wavelet transform (CWT), optimize either temporal or frequency resolution, or find a tradeoff that is suboptimal in both dimensions. Superlets are able to resolve temporal and frequency details with unprecedented precision, revealing transient oscillation events otherwise hidden in averaged time-frequency analyses.
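A superlet combines wavelets of several cycle counts geometrically: short wavelets contribute temporal precision, long ones frequency precision. A toy sketch (my own minimal Morlet implementation and parameter choices, not the authors' code):

```python
import numpy as np

def morlet_response(x, fs, freq, cycles):
    """Magnitude response of a complex Morlet wavelet with a given
    centre frequency (Hz) and number of cycles."""
    sd = cycles / (2 * np.pi * freq)              # Gaussian width in seconds
    t = np.arange(-4 * sd, 4 * sd, 1.0 / fs)
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sd**2))
    wavelet /= np.sum(np.abs(wavelet))            # unit l1 norm
    return np.abs(np.convolve(x, wavelet, mode="same"))

def superlet(x, fs, freq, orders=range(1, 6), base_cycles=3):
    """Superlet sketch: geometric mean of Morlet responses whose cycle
    counts grow with order, so short wavelets sharpen time and long
    wavelets sharpen frequency localization."""
    resps = np.stack([morlet_response(x, fs, freq, base_cycles * o)
                      for o in orders])
    return np.exp(np.log(resps + 1e-12).mean(axis=0))
```

A transient 40 Hz burst embedded in silence produces a superlet response that peaks inside the burst window, which is the kind of event averaged spectrograms tend to smear out.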
□ Newest Methods for Detecting Structural Variations
>> https://www.cell.com/trends/biotechnology/fulltext/S0167-7799(19)30036-8#%20
Strand-seq is the most suitable detection method for chromosomal inversions, a particularly challenging group of structural variants.
□ Melissa: Bayesian clustering and imputation of single-cell methylomes
>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1665-8
Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical method to cluster cells based on local methylation patterns, discovering patterns of epigenetic variability between cells.
The Melissa and DeepCpG models showed substantially better imputation performance than rival methods, and comparable performance to each other when analyzed on real data sets, demonstrating their flexibility in capturing complex patterns of methylation.
□ High-throughput Multimodal Automated Phenotyping (MAP) with Application to PheWAS
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/23/587436.full.pdf
The MAP algorithm achieved higher or similar AUC and F-scores compared to the ICD code across all 16 phenotypes. The features assembled via the automated approach had comparable accuracy to those assembled via manual curation (AUC_MAP = 0.943, AUC_manual = 0.941).
The PheWAS results suggest that the MAP approach detected previously validated associations with higher power when compared to the standard PheWAS method based on ICD codes.
□ The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data
>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1377-x
proBAM and proBed are adaptations of the well-defined, widely used file formats SAM/BAM and BED, respectively, and both have been extended to meet the specific requirements entailed by proteomics data.
□ Parallel clustering of single cell transcriptomic data with split-merge sampling on Dirichlet process mixtures
>> https://academic.oup.com/bioinformatics/article-abstract/35/6/953/5085373
Parallelized Split Merge Sampling on Dirichlet Process Mixture Model (the Para-DPMM model). The split-merge mechanism samples at the cluster level, which significantly improves convergence and optimality of the result.
The model is highly parallelized and can utilize the computing power of high performance computing (HPC) clusters, enabling massive inference on huge datasets.
□ The Distance Precision Matrix: computing networks from non-linear relationships
>> https://academic.oup.com/bioinformatics/article/35/6/1009/5079333
Distance Precision Matrix, a network reconstruction method aimed at both linear and non-linear data. Like partial distance correlation, it builds on distance covariance, a measure of possibly non-linear association, and on the idea of full-order partial correlation, which allows indirect associations to be discarded.
the Distance Precision Matrix method can successfully compute networks from linear and non-linear data, and consistently so across different datasets, even if sample size is low. The method is fast enough to compute networks on hundreds of nodes.
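Distance covariance, the building block mentioned above, is easy to compute for two univariate samples. Below is a sketch of the standard sample estimator (Székely's dCov; the Distance Precision Matrix method adds full-order partialization on top of this):

```python
import numpy as np

def distance_covariance(x, y):
    """Sample distance covariance: double-centre the pairwise distance
    matrices of x and y and average their elementwise product. Nonzero
    for essentially any dependence, linear or not."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return float(np.sqrt(np.mean(A * B)))

# y = x^2 on a symmetric grid: Pearson correlation is ~0, yet distance
# covariance detects the (purely non-linear) dependence.
x = np.linspace(-1, 1, 101)
y = x ** 2
print(abs(np.corrcoef(x, y)[0, 1]) < 1e-8, distance_covariance(x, y) > 0.001)
```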
□ Transmission dynamics study of tuberculosis isolates with whole genome sequencing in southern Sweden
>> https://www.nature.com/articles/s41598-019-39971-z
MIRU-VNTR and WGS clustered the same isolates, although the distribution differed depending on MIRU-VNTR limitations. Both genotyping techniques identified clusters where epidemiologic linking was insufficient, although WGS had higher correlation with epidemiologic data.
□ Demonstration of End-to-End Automation of DNA Data Storage
>> https://www.nature.com/articles/s41598-019-41228-8
The device encodes data into a DNA sequence, which is then written to a DNA oligonucleotide using a custom DNA synthesizer, pooled for liquid storage, and read using a nanopore sequencer and a novel, minimal preparation protocol.
This resulting system has three core components that accomplish the write and read operations: an encode/decode software module, a DNA synthesis module, and a DNA preparation and sequencing module.
□ Genetic Research Could Be Suffering From Racial Bias To Detriment Of Science
>> https://www.techtimes.com/articles/240108/20190323/genetic-research-could-be-suffering-from-racial-bias-to-detriment-of-science.htm
"The lack of ethnic diversity in human genomic studies means that our ability to translate genetic research into clinical practice or public health policy may be dangerously incomplete, or worse, mistaken,"
□ Jujujajáki networks: The emergence of communities in weighted networks
>> http://www.complexity-explorables.org/slides/
This explorable illustrates a dynamic network model that was designed to capture the emergence of community structures, heterogeneities and clusters that are frequently observed in social networks.
Jujujajáki written in Japanese is 呪呪邪邪鬼, which, according to Google Translate, means "curse evil evil demon."
The Jujujajáki Network is a dynamic, weighted network. Existing links between nodes i and j have weights w_{ij} > 0 that quantify the connection strength.
If you now increase the local search probability, strong links will appear, as well as tightly knit groups of triangles. This structure will eventually come to a dynamic equilibrium, exhibiting structures observed in real networks.
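The mechanism can be sketched as a toy simulation in the spirit of the Kumpula-style model the explorable builds on (all names, parameters, and simplifications here are mine; link deletion and node turnover from the full model are omitted): with probability p_local a node searches its neighbourhood and closes a triangle, reinforcing the weights used, and otherwise attaches globally at random.

```python
import random

def toy_jujujajaki(n=60, steps=2000, p_local=0.8, dw=0.5, seed=1):
    """Toy weighted-network model: local triangle closure with weight
    reinforcement vs. random global attachment."""
    rng = random.Random(seed)
    w = {}  # undirected weights, keyed by (min(i, j), max(i, j))

    def link(a, b, inc):
        if a != b:
            key = (min(a, b), max(a, b))
            w[key] = w.get(key, 0.0) + inc

    for _ in range(steps):
        i = rng.randrange(n)
        nbrs = [b if a == i else a for (a, b) in w if i in (a, b)]
        if nbrs and rng.random() < p_local:
            j = rng.choice(nbrs)                   # weight-biased step omitted
            second = [b if a == j else a
                      for (a, b) in w if j in (a, b) and i not in (a, b)]
            if second:
                k = rng.choice(second)
                link(i, j, dw); link(j, k, dw); link(i, k, dw)  # close triangle
                continue
        link(i, rng.randrange(n), 1.0)             # global attachment
    return w
```

Raising p_local makes the reinforced triangles accumulate into tightly knit clusters, mirroring the behaviour described above.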
□ Cell BLAST: Searching large-scale scRNA-seq database via unbiased cell embedding
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/24/587360.full.pdf
The deep generative model combined with a posterior-based latent-space similarity metric enables Cell BLAST to model a continuous spectrum of cell states accurately.
Jensen-Shannon divergence between prediction and ground truth shows that our prediction is again more accurate than scmap.
□ On Transformative Adaptive Activation Functions in Neural Networks for Gene Expression Inference
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/24/587287.full.pdf
The authors analyzed the D–GEX method and determined that the inference can be improved by using a logistic sigmoid activation function instead of the hyperbolic tangent.
The original method used linear regression for profile reconstruction due to its simplicity and scalability; this was later improved by a deep learning method for gene expression inference called D–GEX, which allows reconstruction of non-linear patterns.
The improved neural network achieves an average mean absolute error of 0.1340, a significant improvement over our reimplementation of the original D–GEX, which achieves an average mean absolute error of 0.1637.
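Worth noting for context: the logistic sigmoid and tanh are affinely related, sigmoid(x) = (1 + tanh(x/2)) / 2, so swapping one for the other changes no expressive power by itself; any gain must come from interaction with weight initialization, output scaling, and optimization. A quick check of the identity:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 101)
# The two activations are affine transforms of each other:
#   sigmoid(x) = (1 + tanh(x / 2)) / 2
assert np.allclose(sigmoid(x), (1 + np.tanh(x / 2)) / 2)
```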
□ Supervised dimension reduction for large-scale "omics" data with censored survival outcomes under possible non-proportional hazards
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/24/586529.full.pdf
This approach can handle censored observations using robust Buckley-James estimation in this high-dimensional setting, and the parametric version employs the flexible generalized F model, which encompasses a wide spectrum of well-known survival models.
□ A Divide-and-Conquer Method for Scalable Phylogenetic Network Inference from Multi-locus Data
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/24/587725.full.pdf
A novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa.
To reduce the number of trinets to infer, the authors formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it.
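Hitting Set asks for a small set of elements that intersects every given subset. A generic greedy heuristic (a standard sketch; the paper's actual heuristic may differ) repeatedly takes the element that hits the most remaining subsets:

```python
def greedy_hitting_set(subsets):
    """Greedy Hitting Set heuristic: pick the element covering the most
    not-yet-hit subsets, remove the subsets it hits, repeat."""
    remaining = [set(s) for s in subsets]
    chosen = []
    while remaining:
        counts = {}
        for s in remaining:
            for x in s:
                counts[x] = counts.get(x, 0) + 1
        best = max(counts, key=counts.get)       # most frequent element
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen

print(greedy_hitting_set([{1, 2}, {2, 3}, {3, 4}]))
```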
□ Population divergence time estimation using individual lineage label switching
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/24/587832.full.pdf
A new Bayesian inference method that treats the divergence time as a random variable. The divergence time is calculated from an assembly of splitting events on individual lineages in a genealogy.
High immigration rates lead to a time of the most recent common ancestor of the sample that predates the divergence time, thus losing any potential signal of the divergence event in the sample data.
□ Systematic Evaluation of Statistical Methods for Identifying Looping Interactions in 5C Data
>> https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30067-5
Chromosome-Conformation-Capture-Carbon-Copy (5C) is a molecular technology based on proximity ligation that enables high-resolution and high-coverage inquiry of long-range looping interactions.
a comparative assessment of method performance at each step in the 5C analysis pipeline, including sequencing depth and library complexity correction, bias mitigation, spatial noise reduction, distance-dependent expectation and variance estimation, statistical modeling, and loop detection.
□ GMASS: a novel measure for genome assembly structural similarity
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2710-z
The GMASS score was developed based on the distribution pattern of the number and coverage of similar regions between a pair of assemblies.
The GMASS score represents the structural similarity of a pair of genome assemblies based on the length and number of similar genomic regions defined as consensus segment blocks (CSBs) in the assemblies.
□ Kermit: linkage map guided long read assembly
>> https://link.springer.com/article/10.1186/s13015-019-0143-x
Kermit is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the amount of misassemblies when compared to de novo assembly.
Colouring the reads also leads naturally to non-overlapping bins of reads that can be assembled independently. This allows massive parallelism in the assembly and could make more sophisticated assembly algorithms practical.
Kermit is heavily based on miniasm and as such shares most advantages and disadvantages with it. minimap2 is used to provide all-vs-all read self-mappings to kermit. Kermit outputs an assembly graph in Graphical Fragment Assembly (GFA) Format.
□ CRAM: The Genomics Compression Standard
>> https://www.ga4gh.org/news/cram-compression-for-genomics/
CRAM is really mature now. It is a drop-in replacement for BAM for htslib (C) and htsjdk (Java), meaning GATK, BioPerl and BioPython, Ensembl, ENA, ANVIL, TopMed and many other software stacks.
□ tailfindr: Alignment-free poly(A) length measurement for Oxford Nanopore RNA and DNA sequencing
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/25/588343.full.pdf
tailfindr, an R package to estimate poly(A) tail length on ONT long-read sequencing data. tailfindr operates on unaligned, basecalled data.
The resulting processed raw signal is smoothed by a moving-average filter in both directions separately. Both smoothed signal vectors are then merged by point-by-point maximum calculation.
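The smoothing-and-merge step reads like this in outline (a NumPy sketch of the described operation; the window size and edge padding are my choices, not tailfindr's):

```python
import numpy as np

def causal_moving_average(x, k):
    """One-sided moving average over the last k samples, padding the
    left edge with the first value."""
    pad = np.concatenate([np.full(k - 1, x[0]), x])
    kernel = np.ones(k) / k
    return np.convolve(pad, kernel, mode="valid")

def bidirectional_smooth(signal, k=25):
    """Smooth the raw signal with a moving-average filter run in each
    direction separately, then merge the two smoothed vectors by a
    point-by-point maximum."""
    x = np.asarray(signal, dtype=float)
    fwd = causal_moving_average(x, k)
    bwd = causal_moving_average(x[::-1], k)[::-1]
    return np.maximum(fwd, bwd)
```

Taking the pointwise maximum of the forward and backward passes keeps the smoothed trace from sagging on both sides of a sharp transition, which matters when locating poly(A) tail boundaries in the raw signal.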
□ PSI : Fully-sensitive Seed Finding in Sequence Graphs Using a Hybrid Index
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/25/587717.full.pdf
the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads.
The seed finding step can be fundamentally more challenging on graphs than on sequences, because complex regions in the graph can give rise to a combinatorial explosion in the number of possible paths.
□ Automated Markov state models for molecular dynamics simulations of aggregation and self-assembly
>> https://aip.scitation.org/doi/full/10.1063/1.5083915
Molecular dynamics (MD) simulations have become a fundamental tool for understanding the behavior of both biological and non-biological molecules at full atomic resolution. The authors extend the applicability of automated Markov state modeling to simulation data of molecular self-assembly and aggregation by constructing collective coordinates from molecular descriptors that are invariant to permutations of molecular indexing.
□ Magnus Representation of Genome Sequences
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/25/588582.full.pdf
In the field of combinatorial group theory, Wilhelm Magnus studied representations of free groups by noncommutative power series. For a free group F with basis x_1, ..., x_n and a power series ring Π in indeterminates ξ_1, ..., ξ_n, Magnus showed that the map μ : x_i → 1 + ξ_i defines an isomorphism from F into the multiplicative group Π× of units in Π.
The Magnus Representation, an alignment-free method, captures higher-order information in DNA/RNA sequences; the authors combine the approach with the idea of k-mers to define an effectively computable Mean Magnus Vector.
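The map μ(x_i) = 1 + ξ_i expands a word into one noncommutative monomial per subsequence, so for short words the Magnus vector is simply a subsequence count table. A toy sketch (my simplified reading; the paper's Mean Magnus Vector is defined more carefully):

```python
from collections import Counter
from itertools import combinations

def magnus_vector(word):
    """Expanding mu(w) = (1 + xi_1)...(1 + xi_m) yields one noncommutative
    monomial per subsequence of w, so the Magnus vector of a short word
    is a count table over its subsequences."""
    counts = Counter()
    for r in range(1, len(word) + 1):
        for idx in combinations(range(len(word)), r):
            counts[tuple(word[i] for i in idx)] += 1
    return counts

def mean_magnus_vector(seq, k):
    """Average the Magnus vectors over all k-mer windows of seq (a
    simplified, hypothetical reading of the paper's construction)."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = Counter()
    for w in windows:
        total.update(magnus_vector(w))
    return {mono: c / len(windows) for mono, c in total.items()}

# Order matters: "AB" and "BA" have different Magnus vectors, unlike
# plain composition counts.
assert magnus_vector("AB") != magnus_vector("BA")
```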
□ SVCurator: A Crowdsourcing app to visualize evidence of structural variants for the human genome
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/25/581264.full.pdf
a crowdsourcing app - SVCurator - to help curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator is a Python Flask-based web platform that displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002].
□ DARTS: Deep-learning augmented RNA-seq analysis of transcript splicing
>> https://www.nature.com/articles/s41592-019-0351-9
DARTS, a computational framework that integrates deep-learning-based predictions with empirical RNA-seq evidence to infer differential alternative splicing between biological samples. DARTS leverages public RNA-seq big data to provide a knowledge base of splicing regulation via deep learning, thereby helping researchers better characterize alternative splicing using RNA-seq datasets even with modest coverage.
□ Single particle diffusion characterization by deep learning
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/26/588533.full.pdf
Using deep learning to infer the underlying process behind anomalous diffusion: a neural network classifies single particle trajectories according to diffusion type (Brownian motion, fractional Brownian motion, and Continuous Time Random Walk).
Future work in this field should expand the set of networks to include other models, e.g. estimation of Continuous Time Random Walk parameters, identification of motion on fractals, and Lévy flights, and should address cases of a hierarchy of transport modes manifested in the same trajectory.
□ SNeCT: Scalable Network Constrained Tucker Decomposition for Multi-Platform Data Profiling
>> https://ieeexplore.ieee.org/document/8669882
SNeCT adopts parallel stochastic gradient descent approach on the proposed parallelizable network constrained optimization function. SNeCT decomposition is applied to a tensor constructed from a large scale multi-platform data.
The decomposed factor matrices are applied to stratify cancers, to search for top-k similar patients given a new patient, and to illustrate how the matrices can be used to identify significant genomic patterns in each patient.
□ Posterior-based proposals for speeding up Markov chain Monte Carlo
>> https://arxiv.org/pdf/1903.10221.pdf
PBPs generate large joint updates in parameter and latent variable space whilst retaining good acceptance rates. Applications include an individual-based model for disease diagnostic test data, a financial stochastic volatility model, and mixed and generalised linear mixed models used in statistical genetics.
PBPs are competitive with similarly targeted state-of-the-art approaches such as Hamiltonian MCMC and particle MCMC, and importantly work under scenarios where these approaches do not.
□ Cliques in projective space and construction of Cyclic Grassmannian Codes
>> https://arxiv.org/pdf/1903.09334v1.pdf
The construction of Grassmannian codes in projective space is of a highly mathematical nature and requires strong computational power for the resulting searches. Using the GAP System for Computational Discrete Algebra and Wolfram Mathematica, the authors find cliques in the projective space P_q(n) and then use these to produce cyclic Grassmannian codes.
C ⊆ G_q(n, k) is an (n, M, d, k)_q Grassmannian code if |C| = M and d(X, Y) ≥ d for all distinct X, Y ∈ C. Such a code is also called a constant dimension code.
□ Accounting for missing data in statistical analyses: multiple imputation is not always the answer
>> https://academic.oup.com/ije/advance-article/doi/10.1093/ije/dyz032/5382162
□ Denoising of Aligned Genomic Data
>> https://www.biorxiv.org/content/biorxiv/early/2019/03/26/590372.full.pdf
Based on the Discrete Universal Denoiser (DUDE) algorithm: DUDE is a sliding-window discrete denoising scheme that is universally optimal in the limit of input sequence length when applied to an unknown source with finite alphabet size corrupted by a known discrete memoryless channel.
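For intuition, here is a toy binary DUDE for a binary symmetric channel (a textbook-style sketch, not the paper's genomic-data variant): a first pass counts symbol occurrences per context, and a second pass flips a symbol whenever the opposite symbol dominates its context beyond a channel-dependent threshold.

```python
def dude_bsc(z, delta, k=1):
    """Toy binary DUDE for a BSC(delta). Pass 1: count symbols per
    length-2k context. Pass 2: flip z[i] when the count ratio of the
    opposite symbol exceeds ((1-d)^2 + d^2) / (2 d (1-d)), the
    threshold the DUDE rule reduces to for a binary symmetric channel
    under Hamming loss."""
    n = len(z)
    counts = {}
    for i in range(k, n - k):
        ctx = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        counts.setdefault(ctx, [0, 0])[z[i]] += 1
    thr = ((1 - delta) ** 2 + delta ** 2) / (2 * delta * (1 - delta))
    out = list(z)
    for i in range(k, n - k):
        ctx = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        m = counts[ctx]
        same, other = m[z[i]], m[1 - z[i]]
        if (same == 0 and other > 0) or (same > 0 and other / same >= thr):
            out[i] = 1 - z[i]
    return out
```

On a long run of zeros with a couple of isolated flips, the all-zero context overwhelmingly votes 0, so the isolated errors are corrected while genuine context-supported symbols are left alone.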