(photo by @KAGAYA_11949)
Sun/
Never to be the same/
if not the eternal flame/
Daily rounds to make/
off to sleep then to wake/
What then to learn/
will it ever burn/
- John Mullinax (@johnm1w)
□ SC3 - consensus clustering of single-cell RNA-Seq data:
>> http://biorxiv.org/content/early/2016/01/13/036558
The output of a scRNAseq experiment is typically represented as an expression matrix (M) consisting of g rows (genes) and N columns (cells). The SC3 algorithm consists of several steps: a cell filter, a gene filter, distance calculations, nonlinear spectral transformations, and k-means clustering, followed by the consensus step. The distance calculation reflects a change of coordinate space, as we go from the expression matrix (g x N) to a cell-to-cell matrix (N x N).
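The change of coordinate space in the distance step can be sketched in a few lines. This is a minimal illustration, not SC3's implementation: the function name `cell_to_cell_distances` and the toy matrix are hypothetical, and Euclidean distance is assumed here as one plausible metric.

```python
import math

def cell_to_cell_distances(expression):
    """Return the N x N Euclidean distance matrix between the columns (cells)
    of a g x N expression matrix given as a list of g rows (genes)."""
    n_cells = len(expression[0])
    # Transpose so that each cell becomes one vector of g gene values.
    cells = [[row[j] for row in expression] for j in range(n_cells)]
    return [[math.dist(a, b) for b in cells] for a in cells]

# Toy matrix (made-up values): g = 3 genes (rows), N = 4 cells (columns).
M = [
    [0.0, 1.0, 5.0, 6.0],
    [2.0, 2.0, 0.0, 1.0],
    [1.0, 0.0, 4.0, 4.0],
]
D = cell_to_cell_distances(M)  # 4 x 4, symmetric, zero diagonal
```

The result no longer depends on g: whatever the number of genes, downstream steps work on an N x N matrix.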
□ Fast and accurate single-cell RNA-Seq by clustering of transcript-compatibility counts:
>> http://biorxiv.org/content/biorxiv/early/2016/01/15/036863.full.pdf
Clustering based on scRNA-Seq expression matrices can also require domain-specific information, temporal information or functional constraints, so that in some cases hand curation of clusters is performed after unsupervised clustering. The authors used the "pseudo" option of the kallisto RNA-Seq program, which computes equivalence classes of reads after pseudoalignment. The equivalence-class counts are normalized by the total number of mapped reads to obtain a probability distribution called the transcript-compatibility count, or TCC. The square root of the Jensen-Shannon divergence between the TCC distributions is then computed for each pair of cells.
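The normalization and distance steps described above can be sketched as follows. The function names and the toy counts are illustrative assumptions; the use of base-2 logarithms (so that the distance lies in [0, 1]) is also an assumption, not something stated in the excerpt.

```python
import math

def normalize(counts):
    """Turn equivalence-class counts into a probability distribution (TCC)."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding

# Toy equivalence-class counts for three cells (made-up numbers).
cell_a = normalize([10, 0, 30, 60])
cell_b = normalize([20, 0, 60, 120])   # same proportions as cell_a
cell_c = normalize([0, 100, 0, 0])     # no equivalence class shared with cell_a

d_ab = js_distance(cell_a, cell_b)
d_ac = js_distance(cell_a, cell_c)
```

Because the counts are normalized first, cells with the same proportions but different sequencing depth end up at distance zero.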
□ Comparative analysis of single-cell RNA-sequencing methods:
>> http://biorxiv.org/content/early/2016/01/05/035758
Smart-seq on a microfluidic platform is the most sensitive method, CEL-seq is the most precise, SCRB-seq & Drop-seq are the most efficient. More recent scRNA-seq protocols have used unique molecular identifiers (UMIs) that tag mRNA molecules with a random barcode sequence during reverse transcription in order to identify sequence reads that originated during amplification.
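How UMIs separate molecules from amplification copies can be sketched as below. This is a simplified illustration: the tuple layout and the function name `count_molecules` are hypothetical, and real pipelines also correct for sequencing errors in the UMI, which is ignored here.

```python
from collections import defaultdict

def count_molecules(reads):
    """Count distinct UMIs per (cell, gene).

    reads: iterable of (cell_barcode, gene, umi) tuples. Reads sharing all
    three values are treated as amplification copies of one mRNA molecule.
    """
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}

# Toy reads (made-up barcodes): five reads, but only three original molecules.
reads = [
    ("cell1", "GAPDH", "ACGT"),
    ("cell1", "GAPDH", "ACGT"),  # amplification duplicate
    ("cell1", "GAPDH", "TTAG"),
    ("cell2", "GAPDH", "ACGT"),
    ("cell2", "GAPDH", "ACGT"),  # amplification duplicate
]
molecule_counts = count_molecules(reads)
```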
□ A high-quality reference panel reveals the complexity of structural genome changes in a human population:
>> http://biorxiv.org/content/early/2016/01/18/036897
The SV-integrated panel allows accurate imputation of structural variants (SVs) in an independent group of individuals based solely on their SNP genotype status.
□ Machine learning approach improves CRISPR-Cas9 guide pairing:
>> http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3437.html
□ A statistical companion to “Stochastic Gene Expression in a Single Cell”:
>> https://liorpachter.wordpress.com/2016/01/18/a-statistical-companion-to-stochastic-gene-expression-in-a-single-cell/
The statistical framework also highlights the need for quantile normalization, and provides justification for the use of sample correlation between the two reporter expression levels to estimate the percent contribution of extrinsic noise to the total noise. It also provides a geometric interpretation of these results that clarifies the current interpretation.
The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased.
□ Cross-platform normalization of microarray and RNA-seq data for machine learning applications:
>> https://peerj.com/articles/1621/
Training Distribution Matching (TDM) transforms RNA-seq data for use with models constructed from legacy platforms. Some methods exist for machine learning under certain types of dataset shift:
$$P_{\text{train}}(y \mid x) \neq P_{\text{test}}(y \mid x) \;\land\; P_{\text{train}}(x) \neq P_{\text{test}}(x)$$
It refers to the fact that the probability of the dependent variable may not be the same in the training and test set for a given value of an independent variable and that the probability of that value occurring is different in both datasets.
(The green comet represents the true solution, also the best solution π∗ found by OPTIMA (p-value p∗ = 2.16e-9), while the blue comet belongs to a false alignment with the lowest number of cut errors (p = 7.35e-6).)
□ OPTIMA: sensitive whole-genome alignment of genomic maps by combinatorial indexing & technology-agnostic statistics:
>> http://www.gigasciencejournal.com/content/5/1/2
The final Z-score θ(π) for each solution π:
$$\theta(\pi \in \Pi) = \text{Z-score}\big(-\text{Z-score}(\pi, \text{matches}) + \text{Z-score}(\pi, \text{cuterrors}) + \text{Z-score}(\pi, \text{WHT}(\chi^2, \text{matches}))\big)$$
The computational cost of extending a seed (c-tuple) of an experimental map with m fragments is thus, in the worst case, O((m-c) δ^3) time, where δ is the bandwidth of the dynamic programming, and O((m-c)^2) space for allocating the dynamic programming matrix for each side of the seed.
public Pair dynamicProgrammingSOMAv2 (Fragment[] opticalMap, int locationInReference, boolean reverse, int highestLocationInReference)
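The final score for each candidate solution can be sketched under stated assumptions. The sketch reads the formula as θ(π) = Z-score(−Z-score(matches) + Z-score(cut errors) + Z-score(WHT(χ², matches))), takes WHT to be the Wilson-Hilferty transformation with the number of matches as degrees of freedom, and computes Z-scores with the population standard deviation over the candidate set. All function names, dictionary keys and toy values are hypothetical and are not taken from the OPTIMA source.

```python
import math
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values across the set of candidate solutions."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def wilson_hilferty(chi2, dof):
    """Map a chi-square statistic with dof degrees of freedom to an
    approximately standard normal value (Wilson-Hilferty transformation)."""
    var = 2.0 / (9.0 * dof)
    return ((chi2 / dof) ** (1.0 / 3.0) - (1.0 - var)) / math.sqrt(var)

def final_scores(solutions):
    """Combine the three features into one Z-score per solution."""
    z_matches = z_scores([s["matches"] for s in solutions])
    z_cuts = z_scores([s["cut_errors"] for s in solutions])
    z_wht = z_scores([wilson_hilferty(s["chi2"], s["matches"]) for s in solutions])
    combined = [-m + c + w for m, c, w in zip(z_matches, z_cuts, z_wht)]
    return z_scores(combined)

# Toy candidate alignments (made-up feature values).
solutions = [
    {"matches": 20, "cut_errors": 2, "chi2": 15.0},
    {"matches": 12, "cut_errors": 6, "chi2": 30.0},
    {"matches": 10, "cut_errors": 5, "chi2": 25.0},
]
theta = final_scores(solutions)
```

Under this reading, more matches, fewer cut errors and a smaller χ² all push the score down, so the first toy solution receives the lowest value.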
□ Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks:
>> http://biorxiv.org/content/biorxiv/early/2015/10/05/028399.full.pdf
Basset predicts the cell-specific DNaseI hypersensitivity of sequences. Fully connected layers perform a linear transformation of the input vector and apply a ReLU. The final layer performs a linear transformation to a vector of 164 elements that represents the target cells. A sigmoid nonlinearity maps this vector to the range 0-1, where the elements serve as probability predictions of DNaseI hypersensitivity, to be compared via a loss function to the true hypersensitivity vector.
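The last stages of the network described above (fully connected layer with ReLU, linear map to 164 outputs, sigmoid, loss) can be sketched in plain Python. This is a toy forward pass, not Basset's code: the layer sizes other than 164, the random weights and the choice of binary cross-entropy as the loss are assumptions, and the convolutional layers are omitted.

```python
import math
import random

def linear(x, weights, bias):
    """Affine map: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def binary_cross_entropy(pred, target, eps=1e-12):
    """Mean binary cross-entropy between predictions and 0/1 labels."""
    terms = (t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
             for p, t in zip(pred, target))
    return -sum(terms) / len(pred)

def random_layer(rng, n_out, n_in):
    """Small random weights and zero biases (stand-in for trained values)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
N_FEATURES, N_HIDDEN, N_CELLS = 8, 16, 164  # only 164 comes from the text
w1, b1 = random_layer(rng, N_HIDDEN, N_FEATURES)
w2, b2 = random_layer(rng, N_CELLS, N_HIDDEN)

features = [rng.random() for _ in range(N_FEATURES)]  # stand-in for conv output
hidden = relu(linear(features, w1, b1))               # fully connected + ReLU
probs = sigmoid(linear(hidden, w2, b2))               # 164 values in (0, 1)
truth = [rng.choice((0, 1)) for _ in range(N_CELLS)]  # made-up labels
loss = binary_cross_entropy(probs, truth)
```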
□ Assessment of megabase-scale somatic copy number variation using single cell sequencing:
>> http://genome.cshlp.org/content/early/2016/01/15/gr.198937.115.full.pdf
Hidden Markov Model (HMM) and Circular Binary Segmentation (CBS) identify different private CNVs, suggesting that employing both algorithms simultaneously could enhance specificity. The overlap of HMM and CBS at E = 0.995 and α = 0.0001, respectively, afforded the best combination of sensitivity and specificity.
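Taking the overlap of two callers amounts to intersecting their call sets. A minimal sketch, assuming each call is a half-open interval with a copy-number state; the tuple layout, the function name `overlap_calls` and the toy coordinates are hypothetical, and the paper's exact overlap rule may differ.

```python
def overlap_calls(calls_a, calls_b):
    """Keep the regions called by both methods with the same state.

    Each call is (chrom, start, end, state) with half-open coordinates.
    """
    shared = []
    for chrom_a, start_a, end_a, state_a in calls_a:
        for chrom_b, start_b, end_b, state_b in calls_b:
            if chrom_a != chrom_b or state_a != state_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # non-empty intersection
                shared.append((chrom_a, start, end, state_a))
    return sorted(shared)

# Toy megabase-scale calls (made-up coordinates).
hmm_calls = [("chr1", 1_000_000, 5_000_000, "gain"), ("chr2", 0, 3_000_000, "loss")]
cbs_calls = [("chr1", 2_000_000, 6_000_000, "gain"), ("chr3", 0, 2_000_000, "loss")]
consensus = overlap_calls(hmm_calls, cbs_calls)
```

Calls private to one method (chr2 and chr3 in the toy data) drop out, which is where the gain in specificity would come from.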
□ cellTree: Inference and visualisation of Single-Cell RNA-seq Data as a hierarchical tree structure
>> http://bioconductor.org/packages/devel/bioc/html/cellTree.html
cellTree fits a Latent Dirichlet Allocation (LDA) model and builds a compact tree modelling the relationship between individual cells over time or space. cellTree orders genes by per-topic probability and uses a Kolmogorov-Smirnov test to compute a p-value on the matching nodes in the GO graph. Three annotation categories are available: biological processes, cellular components and molecular functions.
□ Integration of ATAC-seq and RNA-seq identifies human alpha cell and beta cell signature genes:
>> http://www.sciencedirect.com/science/article/pii/S221287781600003X
To determine whether cell type-selective open chromatin regions from the ATAC-seq analysis correlated w/ cell type-selective gene expression, the authors integrated α- and β-cell ATAC-seq data with α- and β-cell mRNA-seq data. 785 genes that were expressed at significantly higher levels in α- versus β-cells (≥2-fold difference, w/ a false discovery rate [FDR] <0.1) had at least one associated α-cell-specific open chromatin region that was not identified in β- or acinar cells, which accounted for 78% of differentially-expressed α-cell genes.
□ INC-Seq: Accurate single molecule reads using nanopore sequencing:
>> http://biorxiv.org/content/early/2016/01/27/038042
INC-Seq (for Intramolecular-ligated Nanopore Consensus Sequencing) employs rolling circle amplification (RCA) of circularized templates to generate linear products (with tandem copies of the template) that can then be sequenced on the nanopore platform. Chimeras from inter-molecular ligation were observed to be rare under the experimental conditions used in INC-Seq.
□ DamID-seq: Genome-wide Mapping of Protein-DNA Interactions by NGS Sequencing of Adenine-methylated DNA Fragments:
>> http://www.jove.com/video/53620/damid-seq-genome-wide-mapping-protein-dna-interactions-high
The DamID-seq approach enables probing nuclear lamina (NL) associations within gene structures and allows comparing genome-NL interaction maps with other functional genomic data. DamID-seq will complement ChIP-seq, which is better suited for mapping histone modifications, transcription factors and other proteins that intimately interact with the DNA.
□ Fluctuating Nonlinear Spring Model of Mechanical Deformation of Biological Particles:
>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004729
The in-depth analysis of the structure and energy output from MD simulations for this specific example of a thick-shelled nanoparticle has enabled us to identify the most important types of mechanical excitations that contribute to the deformation of biological particles. The FNS model can be used to interpret the FX-curves for biological particles of different regular geometries, including cylindrical or ellipsoidal shapes, as long as the particles are subjected to a uniaxial compressive force induced by a spherical-like indenter.
□ The Northern Arizona SNP Pipeline (NASP): accurate, flexible, and rapid identification of SNPs in WGS datasets:
>> http://biorxiv.org/content/early/2016/01/25/037267
To understand how the SNPs called would affect the overall tree topology, a phylogeny was inferred for each set of SNPs with RAxML. The UPGMA dendrogram demonstrates that the NASP results return a phylogeny that is more representative of the true phylogeny than other methods. NASP was developed to work on job management systems including Torque, Slurm, and Sun/Oracle Grid Engine (SGE).
□ ALVIS: interactive non-aggregative visualization and explorative analysis of multiple sequence alignments:
>> http://nar.oxfordjournals.org/content/early/2016/01/26/nar.gkw022.short
Sequences and sites can be selected manually or by entering queries in a search box using a simple query language. The query language accepts arbitrary boolean combinations of regular expressions on both the sequence labels as well as the actual sequences. Alvis imports phylogenetic trees or builds its own tree based on pairwise distances computed from an evolutionary sequence kernel.
□ Combining computational models, semantic annotations and simulation in a graph database:
>> http://www.ncbi.nlm.nih.gov/pubmed/25754863
(MaSyMoS is a Neo4J instance containing SBML and CellML Models as well as Simulation Descriptions: https://sems.uni-rostock.de/projects/masymos/)
The graph database allows CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. Dijkstra's algorithm, directed path traversal, spanning trees or sophisticated graph matching patterns are hardly applicable to RDF triples.
□ Automating biomedical data science through tree-based pipeline optimization:
>> http://arxiv.org/abs/1601.07925
TPOT is capable of building machine learning pipelines that achieve competitive classification accuracy and of discovering novel pipeline operators. TPOT's genetic programming (GP) algorithm continually tinkered with the pipelines, adding new pipeline operators that improve fitness and removing redundant or detrimental operators, in an intelligent, guided search for high-performing pipelines.
(Schematic diagram of modular distributed computation and the scalability of this architecture in logic circuits)
□ Implementation of Complex Biological Logic Circuits Using Spatially Distributed Multicellular Consortia:
>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004685
In the biological context, easier implementations can be achieved by modifying the canonical expression of the Boolean function to obtain an expression involving only OR logic. The authors propose a formulation of the Boolean function based on the Inverted Logic Formulation that minimizes biological constraints while ensuring scalability.
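The idea of rewriting a Boolean function so that the only combining gate is OR can be illustrated with De Morgan's laws. This is a generic sketch, assuming inversion (NOT) of signals is available; it is not the paper's Inverted Logic Formulation itself, and the function names are hypothetical.

```python
from itertools import product

def NOT(a):
    return not a

def OR(*inputs):
    return any(inputs)

def and_via_or(a, b):
    """a AND b rewritten with De Morgan: NOT(NOT a OR NOT b)."""
    return NOT(OR(NOT(a), NOT(b)))

def xor_via_or(a, b):
    """a XOR b = (a OR b) AND (NOT a OR NOT b), with the AND rewritten
    through De Morgan so that OR is the only combining gate."""
    return NOT(OR(NOT(OR(a, b)), NOT(OR(NOT(a), NOT(b)))))

# Truth tables over all input combinations.
and_table = {(a, b): and_via_or(a, b) for a, b in product((False, True), repeat=2)}
xor_table = {(a, b): xor_via_or(a, b) for a, b in product((False, True), repeat=2)}
```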
□ Approximating high-dimensional dynamics by barycentric coordinates with linear programming:
>> http://scitation.aip.org/content/aip/journal/chaos/25/1/10.1063/1.4906746
This extension of the barycentric coordinates offers enough accuracy and versatility to reproduce the behavior of target dynamical systems. It naturally solves the problem of under-fitting often observed when dynamical systems are modeled. Furthermore, this extension partially solves the curse of dimensionality in a simple way, because the number of parameters does not increase even if the dimension of phase space increases. The method can be broadly applied, from helping to improve weather forecasting to comprehensively understanding complex biological data.
□ r2VIM: A new variable selection method for random forests in genome-wide association studies:
>> http://biodatamining.biomedcentral.com/articles/10.1186/s13040-016-0087-3
r2VIM is able to identify interaction effects in situations with purely epistatic effects, i.e. when no marginal effects exist. The simulations show that a fairly stringent parameter is needed to fully control the number of false-positive SNPs that are identified. RF identified a much smaller region in TRINITY data compared to the large number of SNPs with similar p-values based on logistic regression.
□ Real time selective sequencing on a MinION device with 'Read Until':
>> http://biorxiv.org/content/early/2016/02/03/038760
The authors apply Dynamic Time Warping to match short query current traces to references, demonstrating selection of specific regions of small genomes, individual amplicons from a group of targets, or normalisation of amplicons in a set. They also point to the increasing speed of nanopore sequencing and the scaling up of the MinION to 3,000 channels and the PromethION to 144,000 channels.
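Dynamic Time Warping itself is a short dynamic program. The sketch below is the textbook global DTW with an absolute-difference cost, used to pick the closest of several references; the function names and toy current values are hypothetical, and the real method works on normalized squiggle data and searches within much longer references.

```python
def dtw_distance(query, reference):
    """Classic O(n*m) dynamic time warping cost between two numeric traces."""
    n, m = len(query), len(reference)
    inf = float("inf")
    # cost[i][j]: best cost of aligning query[:i] with reference[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(query[i - 1] - reference[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # query sample repeated
                                    cost[i][j - 1],      # reference sample repeated
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def best_match(query, references):
    """Name of the reference trace with the lowest DTW cost."""
    return min(references, key=lambda name: dtw_distance(query, references[name]))

# Toy current traces (made-up values).
query = [60.0, 80.0, 100.0]
references = {
    "target": [60.0, 80.0, 80.0, 100.0],  # same shape, one level held longer
    "off_target": [100.0, 60.0, 100.0],
}
hit = best_match(query, references)
```

The warping is what makes the match tolerant to a level being held for a different number of samples, as in the "target" trace.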
□ Real-time, portable genome sequencing for Ebola surveillance
>> http://www.nature.com/nature/journal/vaop/ncurrent/full/nature16996.html
>> http://simpsonlab.github.io/2016/02/03/ebola-snps/
The approach simply generates all possible haplotypes (2^n for n SNPs) and calculates the likelihood of each haplotype using the Hidden Markov Model.
Results of real-time analysis of Ebola virus using MinION nanopore sequencing were published at almost the same time.
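The exhaustive search over haplotypes can be sketched as follows. The scoring function is passed in as a parameter because the Hidden Markov Model likelihood is not reproduced here; `toy_log_likelihood`, the SNP tuples and the function names are hypothetical stand-ins.

```python
from itertools import product

def all_haplotypes(snps):
    """Yield every combination of ref/alt alleles: 2^n haplotypes for n SNPs.

    snps: list of (position, ref_allele, alt_allele).
    """
    for choice in product((0, 1), repeat=len(snps)):
        yield tuple(snp[1 + c] for snp, c in zip(snps, choice))

def best_haplotype(snps, log_likelihood):
    """Return the haplotype with the highest score under log_likelihood."""
    return max(all_haplotypes(snps), key=log_likelihood)

# Toy example (made-up SNPs and a stand-in score instead of the HMM).
snps = [(101, "A", "G"), (250, "C", "T"), (733, "G", "A")]
observed = ("G", "C", "A")

def toy_log_likelihood(haplotype):
    # Hypothetical score: number of alleles agreeing with `observed`.
    return sum(h == o for h, o in zip(haplotype, observed))

candidates = list(all_haplotypes(snps))
called = best_haplotype(snps, toy_log_likelihood)
```

Since the number of candidates doubles with every SNP, this brute-force enumeration is only practical for small n.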
□ A new view of transcriptome complexity through the lens of local splicing variations:
>> http://elifesciences.org/content/early/2016/02/01/eLife.11752
To detect, quantify and visualize LSVs, the authors developed Modeling Alternative Junction Inclusion Quantification (MAJIQ). MAJIQ uses a combination of read rate modeling, Bayesian Ψ modeling and bootstrapping to report posterior Ψ and ΔΨ distributions for each quantified LSV.
□ An annotation agnostic algorithm for detecting nascent RNA transcripts in GRO-seq:
>> http://www.ncbi.nlm.nih.gov/pubmed/26829802?dopt=Abstract
Fast Read Stitcher (FStitch) takes advantage of two popular machine-learning techniques, hidden Markov models (HMMs) and logistic regression.
□ Cataloging Splice Junctions in all Human RNA-seq data from Sequence Read Archive:
>> http://nextgenseek.com/2016/02/cataloging-splice-junctions-in-all-human-rna-seq-data-from-sequence-read-archive/
Rail-RNA uses the MapReduce framework and alternates between aggregating and computing steps. intropolis is a list of exon-exon junctions found across 21,504 human RNA-seq samples on SRA from spliced read alignment to hg19 w/ Rail-RNA. Of junctions found by Rail-RNA in at least 80 SEQC samples, as many as 97.5% are found by at least one SEQC alignment protocol, and 90.1% are found by all three. 80 SEQC samples is 4.7% of 1,720, comparable to a 1,000-sample threshold for the 21,504 SRA samples.
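The sample-count thresholds quoted above can be checked with a line of arithmetic, and the filter itself is a one-liner. The junction identifiers and the function name below are hypothetical; only the numbers 80, 1,720, 1,000 and 21,504 come from the text.

```python
def junctions_above_threshold(junction_samples, min_samples):
    """Keep junctions observed in at least min_samples distinct samples.

    junction_samples: dict mapping a junction id to the set of samples in
    which the junction was found.
    """
    return {j for j, samples in junction_samples.items() if len(samples) >= min_samples}

# The two thresholds expressed as fractions of their datasets.
seqc_fraction = 80 / 1720        # 80 of 1,720 SEQC samples
sra_fraction = 1000 / 21504      # 1,000 of 21,504 SRA samples

# Toy junction table (made-up ids and sample names).
junction_samples = {
    "chr1:1000-2000": {"s1", "s2", "s3"},
    "chr1:5000-9000": {"s1"},
}
kept = junctions_above_threshold(junction_samples, min_samples=2)
```

Both fractions come out at roughly 4.65%, which is why the two thresholds are described as comparable.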
□ The Hunt for the Algorithms That Drive Life on Earth
>> https://www.quantamagazine.org/20160128-ecorithm-computers-and-life/