(SERGEY NIVENS / THE DNA OF A NATION)
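□ The concept of a biological species means nothing more than a label. Things shaped with traits close to our own, that parted ways with us and were divided into self and other. They perceive one another, and their signals act on the topology, eroding it and drawing the terrain in with them. Being self and other is no greater a difference than being human and animal, rock and sand, star and dust.
□ If seeking someone can be given meaning, then I want to find meaning in the fact that the someone I chose is that very person. If we are nothing but drifting dust, then our rising and sinking is a musical scale, and meaning resembles gravity.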
□ Using deep learning to predict important non-coding sequence variants
>> http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3547.html
□ DeepSEA: a deep learning method to predict the chromatin effects of sequence alterations with single-nucleotide sensitivity.
>> http://deepsea.princeton.edu/job/analysis/create/
DeepSEA used the formula:
1/(1+exp(-( log(P/(1-P))+log(5%/(1-5%))-log(c_train/(1-c_train ))) ))
where P is the probability given by the DeepSEA model and c_train is the proportion of positive examples for this chromatin feature in the training data.
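The rescaling above can be sketched in a few lines of Python. This is a minimal illustration of the formula as quoted, not DeepSEA's own code; the function names and the `target` parameter (defaulting to the 5% in the formula) are assumptions made for this example.

```python
import math


def logit(x):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(x / (1.0 - x))


def rescale_probability(p, c_train, target=0.05):
    """Shift the model's log-odds from the training prior to a target prior.

    p       -- probability given by the model for one chromatin feature
    c_train -- proportion of positive examples for that feature in training
    target  -- positive proportion to rescale to (5% in the quoted formula)
    """
    z = logit(p) + logit(target) - logit(c_train)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical feature that was positive in 20% of the training examples:
# a raw prediction equal to that training proportion maps back to about 5%.
print(rescale_probability(0.20, c_train=0.20))
```

Two sanity checks follow from the formula: when c_train is already 5% the probability is left unchanged, and a prediction equal to c_train is mapped to 5%.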
□ A novel statistical approach identifies feedback interactions for the construction of robust stochastic transcriptional oscillators
>> http://biorxiv.org/content/early/2015/08/19/025056
The authors define a measure of robustness that coincides with the Bayesian model evidence, which makes it possible to exploit Bayesian model selection to calculate the relative structural robustness of gene network models governed by stochastic dynamics. The novelty of this approach is that the algorithm spends time in models and parameters in direct proportion to their robustness, and thus homes in on interesting regions of the joint topology-parameter space.
□ A Bayesian Attractor Model for Perceptual Decision Making:
>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004442
The model predicts state-dependent, within-trial gain modulation of sensory processing by top-down feedback of the decision state. Hopfield networks were originally suggested as neurobiologically plausible firing-rate models of recurrently connected neurons.
□ Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models
>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004182
Statistical inference methods using partial correlations in the context of graphical Gaussian models (GGMs) have led to similar results and provide a more intuitive understanding of direct versus indirect interactions by employing the concept of conditional independence.
The binary embedding 1_σ: Ω → {0, 1}^(Lq) maps each vector of categorical random variables, x ∈ Ω^L, here represented by a sequence of amino acids, to a binary vector of length Lq.
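As a rough illustration of such a binary embedding, the sketch below one-hot encodes an amino-acid sequence of length L into a 0/1 vector of length L·q. The 21-letter alphabet (20 amino acids plus a gap symbol, so q = 21) and the function name `binary_embedding` are assumptions for this example, not taken from the paper.

```python
# Assumed alphabet: 20 standard amino acids plus "-" for an alignment gap.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def binary_embedding(sequence, alphabet=ALPHABET):
    """Map a sequence of L categorical states to a 0/1 vector of length L*q."""
    q = len(alphabet)
    index = {state: i for i, state in enumerate(alphabet)}
    vector = [0] * (len(sequence) * q)
    for position, state in enumerate(sequence):
        # Exactly one entry per position is set: the one matching its state.
        vector[position * q + index[state]] = 1
    return vector


x = binary_embedding("MK-")
print(len(x), sum(x))  # length L*q = 3*21 = 63, with one 1 per position
```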
□ sleuth: a new RNA-Seq analysis method and software program for samples quantified with kallisto:
>> http://pachterlab.github.io/sleuth/
>> https://liorpachter.wordpress.com/2015/08/17/a-sleuth-for-rna-seq/
The main contribution of sleuth is an intuitive yet powerful model that bridges the gap between count-based methods and quantification algorithms.
To understand sleuth, it is helpful to start with the general linear model:
Y_t = X_t \beta_t + \epsilon_t
In sleuth, the Y_t are modeled as unobserved: they are the logarithms of true counts for each transcript across samples and are assumed to be normally distributed.
A separate noise term represents the technical noise due to the random sequencing of fragments from a cDNA library and the uncertainty introduced in estimating transcript counts. Intuitively, sleuth teases apart the two sources of variance by examining both technical and biological replicates, and in doing so directly estimates the “true” biological variance.
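The idea of separating the two variance sources can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions (one transcript, one condition, a known technical variance per sample, and a plain subtraction estimator); it is not sleuth's actual estimation procedure, and all names and numbers here are hypothetical.

```python
import random
import statistics


def biological_variance(observed, technical_variances):
    """Total variance of the observations minus the mean technical variance."""
    total = statistics.variance(observed)
    technical = statistics.mean(technical_variances)
    return max(total - technical, 0.0)  # a variance cannot be negative


rng = random.Random(0)
n_samples = 2000
bio_sd, tech_sd = 0.5, 0.3

# Unobserved log "true" counts: normally distributed around a common mean.
y = [5.0 + rng.gauss(0.0, bio_sd) for _ in range(n_samples)]
# Observed log counts: the unobserved values perturbed by technical noise.
d = [value + rng.gauss(0.0, tech_sd) for value in y]

estimate = biological_variance(d, [tech_sd ** 2] * n_samples)
print(round(estimate, 2))  # should land near bio_sd**2 = 0.25
```

In practice the technical variance is not known and has to be estimated from technical replicates; here it is simply passed in to keep the example short.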
kallisto can quantify 30 million human reads in less than 3 min on a Mac using only the read sequences and a transcriptome index, which itself takes less than 10 min to build. The sleuth Shiny interface is much more than just a GUI for kallisto output: with it, kallisto quantifications can be turned into a complete analysis in a matter of minutes.
□ Machine Learning in Cancer Research & Clinical Applications
>> http://bit.ly/1gt5Ypq
□ rpca: RobustPCA: Decompose a Matrix into Low-Rank and Sparse Components:
>> https://mran.revolutionanalytics.com/packages/info/?rpca
□ Robust Principal Component Analysis?:
>> http://www.columbia.edu/~jw2966/papers/CLMW11-JACM.pdf
It is possible to recover both the low-rank and the sparse components exactly by solving a very convenient convex program.
Principal Components Pursuit:
minimize ∥L∥_* + λ∥S∥_1
subject to L + S = M
where ∥L∥∗ is the nuclear norm of L (sum of singular values).
rpca(M,
lambda = 1/sqrt(max(dim(M))), mu = prod(dim(M))/(4 * sum(abs(M))),
term.delta = 10^(-7), max.iter = 5000, trace = FALSE,
□ An Introduction to Distributed Machine Learning:
>> http://blog.dato.com/an-introduction-to-distributed-machine-learning-1
The computational complexity of each iteration is O(MN/K), while the cost of communication is O(M log K) bits. When M ≅ N (or M >> N), things can be very different: as the number of machines increases, the communication costs start to dominate. The benefits of scaling out distributed linear model training to several machines when M ≅ N (or M >> N) are filled with qualifiers and caveats.
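To see why communication starts to dominate, the two costs can be compared in a toy model. The sketch below assumes M is the number of features, N the number of examples and K the number of machines, and it ignores all constant factors (so the two costs are treated as if they were in the same unit); the function names are made up for illustration.

```python
import math


def compute_cost(m, n, k):
    """Per-iteration computation on each machine: O(MN/K)."""
    return m * n / k


def communication_cost(m, k):
    """Per-iteration communication: O(M log K)."""
    return m * math.log2(k)


def communication_share(m, n, k):
    """Fraction of a (unit-free) iteration cost spent on communication."""
    comm = communication_cost(m, k)
    return comm / (comm + compute_cost(m, n, k))


# With many more examples than features, communication stays negligible;
# with M >> N its share grows quickly as machines are added.
for m, n in [(1_000, 10_000_000), (1_000_000, 1_000)]:
    shares = [round(communication_share(m, n, k), 3) for k in (2, 16, 128)]
    print(f"M={m}, N={n}: {shares}")
```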
□ Genes genie: Oxford Nanopore’s Gordon Sanghera:
>> http://www.ft.com/intl/cms/s/0/0873b28a-4269-11e5-b98b-87c7270955cf.html
“Internal start-ups” are already exploring related areas. The first, called Metrichor, is a cloud-based analysis service for MinION. Metrichor is designed to make sense of data emerging from MinION, for example by identifying pathogens from their DNA as it appears on screen. Metrichor is also designed to track related metadata of these experiments like stock market information, labelled time series data.
>> https://metrichor.com/
□ miRTarVis: an interactive visual analysis tool for microRNA-mRNA expression profile data:
>> http://www.biomedcentral.com/1753-6561/9/S6/S2
miRTarVis applies GenMiR++, a Bayesian inference model, and MINE (Maximal Information-based Nonparametric Exploration) analysis. However, the node-link diagram in miRTarVis changes (often significantly) whenever the user changes the filtering or prediction parameters.
□ CoSREM: a graph mining algorithm for discovery of combinatorial splicing regulatory elements
>> http://www.biomedcentral.com/1471-2105/16/285
For k ≥ 1, the k-dimensional de Bruijn graph G = (V, E) over Σ is a directed graph with vertex set V = Σ^k and edge set
E = { (σw, wτ) ∣ w ∈ Σ^(k-1), σ, τ ∈ Σ }.
The SRE graph G_U = (U, E′) for G and U is the vertex-induced subgraph of G with edge set
E′ = { (u, v) ∈ E ∣ u, v ∈ U }.
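The two definitions above translate almost directly into code. The following is a small sketch, not CoSREM itself; the two-letter alphabet in the usage lines is chosen only to keep the output short, and the function names are hypothetical.

```python
from itertools import product


def de_bruijn_graph(alphabet, k):
    """Vertices are all k-mers; edges join k-mers overlapping in k-1 letters."""
    vertices = {"".join(kmer) for kmer in product(alphabet, repeat=k)}
    edges = set()
    for core in product(alphabet, repeat=k - 1):
        w = "".join(core)
        for sigma in alphabet:
            for tau in alphabet:
                edges.add((sigma + w, w + tau))
    return vertices, edges


def induced_subgraph(edges, subset):
    """Keep only the edges whose two endpoints both lie in the subset U."""
    u_set = set(subset)
    return u_set, {(u, v) for (u, v) in edges if u in u_set and v in u_set}


V, E = de_bruijn_graph("AC", k=2)
U, E_prime = induced_subgraph(E, {"AA", "AC", "CA"})
print(len(V), len(E), sorted(E_prime))
```

Every vertex has one outgoing edge per letter of Σ, so the full graph has |Σ|^(k+1) edges; the induced subgraph simply filters that edge set.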
□ BigBWA: a Hadoop-based tool approaching the Burrows-Wheeler Aligner to Big Data technologies (Bioinformatics):
>> http://omictools.com/bigbwa-s10919.html
hadoop jar BigBWA.jar -archives bwa.zip -D mapreduce.input.fileinputformat.split.minsize=123641127 -D mapreduce.input.fileinputformat.split.maxsize=123641127 -mem -paired -index /Data/HumanBase/hg19 -r ERR000589.fqBDP ExitERR000589
□ EMSAR: estimation of transcript abundance from RNA-seq data by mappability-based segmentation and reclustering
>> http://www.biomedcentral.com/1471-2105/16/278
Although the underlying Poisson-based model is equivalent to the multinomial model, partitioning of transcripts and parameter estimation from a joint Poisson model with no hidden variables is conceptually different, and therefore provides a unique opportunity for optimization that is not possible for the multinomial-based model. The authors compared the numbers in the log scale:
log(σM_e + 1) versus log(M_t + 1),
where M_e is the estimated TPM (FPKM_i / ∑_{i∈T} FPKM_i × 10^6).
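The TPM conversion and the log(x + 1) transform are simple enough to write out. The sketch below uses made-up FPKM values and hypothetical function names; the σ factor in the first expression is left out because the excerpt does not define it.

```python
import math


def tpm_from_fpkm(fpkm):
    """Rescale FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm.values())
    return {name: value / total * 1e6 for name, value in fpkm.items()}


def log_scale(value):
    """The log(x + 1) transform used for the comparison."""
    return math.log(value + 1.0)


fpkm = {"tx1": 10.0, "tx2": 30.0, "tx3": 0.0}  # hypothetical values
tpm = tpm_from_fpkm(fpkm)
print(tpm)
print({name: round(log_scale(value), 3) for name, value in tpm.items()})
```

Adding 1 before taking the logarithm keeps transcripts with zero abundance at exactly 0 instead of minus infinity.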
□ At what sample size do correlations stabilize?:
>> http://www.sciencedirect.com/science/article/pii/S0092656613000858
>> https://vimeo.com/57127001
As sample size increases, correlations wiggle up and down. In typical situations, stable estimates can be expected when n approaches 250.
The “evolution of correlations” is illustrated with a bivariate correlation between two scales, “hope of power” and “fear of losing control”. It can be seen that the correlation evolved from r = .69 (n = 20, p < .001) to r = .26 (n = 274, p < .001).
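A trajectory of this kind is easy to reproduce on simulated data. The sketch below recomputes the Pearson correlation each time one more observation is added; the data are random draws with an assumed population correlation of 0.26 (chosen only to echo the final estimate quoted above), not the data from the paper.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_trajectory(xs, ys, start=20):
    """Correlation recomputed each time one more observation is added."""
    return [pearson(xs[:n], ys[:n]) for n in range(start, len(xs) + 1)]


rng = random.Random(1)
true_r = 0.26  # assumed population correlation for the simulation
xs = [rng.gauss(0.0, 1.0) for _ in range(274)]
ys = [true_r * x + math.sqrt(1 - true_r ** 2) * rng.gauss(0.0, 1.0) for x in xs]

trajectory = correlation_trajectory(xs, ys)
print(round(trajectory[0], 2), round(trajectory[-1], 2))
```

Plotting `trajectory` against n shows the wiggling at small sample sizes and the gradual settling as n grows.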
... Bayesian confidence statements are conservative in the sense that claims based on 95% posterior intervals have Type S error rates between 0 and 2.5%.
□ Multivariate State Hidden Markov Models for Mark-Recapture Data:
>> http://biorxiv.org/content/early/2015/08/26/025569
By using the multivariate state framework, the authors were able to directly extend the double-tag loss model to account for movement between different areas.
□ WiseScaffolder: an algorithm for the semi-automatic scaffolding of NGS data:
>> http://www.biomedcentral.com/1471-2105/16/281
□ Set up a bioinformatics version of gitxiv
>> https://github.com/samim23/GitXiv/issues/25
□ Single-cell ATAC-seq: strength in numbers:
>> http://www.genomebiology.com/2015/16/1/172
One approach relied on physical isolation of single cells, and the other avoided single-cell reaction volumes by using a two-step combinatorial indexing strategy. This combinatorial indexing strategy enabled the recovery of 500–1500 cells with unique tags per experiment.
□ erlichya:
Wonderful @Sophie_Zaaijer just completed a @nanopore run on CRISPRed fragments. You can't fit more buzzwords to this experiment :-)
□ ewanbirney:
@tuuliel @erlichya @Sophie_Zaaijer ...for big data analysis using deep learning ?
□ FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem: A convex formulation for joint RNA isoform detection
>> http://cbio.ensmp.fr/flipflop/
Multi-dimensional splicing graph with 3 samples. Each candidate isoform is a path from source node s to sink node t. G = (V, E) is the multi-dimensional splicing graph, where V is the set of vertices and E the set of edges.
□ SpeedSeq: ultra-fast personal genome analysis and interpretation
>> http://bit.ly/1f5kdQ8
Parallelized SNV and indel calling is achieved by running FreeBayes on 34,123 variably-sized genomic windows that have been selected to contain similar numbers of reads based on aggregate aligned read depth of the 17 CEPH individuals used in this study.
SpeedSeq uses LUMPY to detect structural variant breakpoints from paired-end and split-read signals, and CNVnator to detect copy number variants by read-depth analysis.
□ cyNeo4j - Connecting Neo4j and Cytoscape:
>> http://bioinformatics.oxfordjournals.org/content/early/2015/08/12/bioinformatics.btv460.full.pdf
cyNeo4j makes it possible to speed up network analysis algorithms and to use the Cypher query language to navigate and explore networks too large for typical desktop computers.
□ deeplearning4j:
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method:
>> http://arxiv.org/pdf/1402.3722v1.pdf
□ Bayesian Financial Models
>> http://toddmoses.com/articles/read/bayesian_financial_models
□ ainh_z:
Sharing big biomedical data http://www.journalofbigdata.com/content/2/1/7 #journalofbigdata --useful discussion of practical issues: platforms, policy, security
By 2015 more than 10^6 whole human genomes will be sequenced, totaling over 100 PB, and many neuroimaging studies will generate over 1 TB/day. The critical problems with many of these services include the barriers involved in transferring terabytes of data and the lack of efficient mechanisms for agile deployment and management of innovative analytics platforms, including open-source machine learning, data wrangling, classification and visualization tools.
□ Efficient Integrative Multi-SNP QTL Mapping using Deterministic Approximation of Posteriors:
>> http://biorxiv.org/content/early/2015/09/09/026450
The DAP algorithm presents a significant performance improvement compared with MCMC in both accuracy and computational efficiency. If Pr(γ_k = 1 | y, G, α, ||γ|| = s)
□ HipMer: An Extreme-Scale De Novo Genome Assembler via efficient parallelization of the Meraculous code:
>> http://gauss.cs.ucsb.edu/~aydin/sc15_genome.pdf
The authors parallelize the Meraculous scaffolding modules by leveraging the one-sided communication capabilities of Unified Parallel C (UPC). This improves Meraculous performance by orders of magnitude, enabling the complete assembly of the human genome in 8.4 min on 15K cores of the Cray XC30.
□ An O(m log m)-Time Algorithm for Detecting Superbubbles: an important subgraph class for analyzing assembly graphs
>> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6998850
an average-case linear time algorithm (i.e., O(n+m) for a graph with n vertices and m edges) for graphs with a reasonable model, though the worst-case time complexity of our algorithm is quadratic (i.e., O(n(n + m))).
□ BAGEL: Bayesian based inference of missing time series values using Genetic Algorithm:
>> http://content.iospress.com/articles/international-journal-of-hybrid-intelligent-systems/his207
□ Simultaneous Modeling of Multiple Diseases for Mortality Prediction in Acute Hospital Care via Multi-task Learning:
>> http://delivery.acm.org/10.1145/2790000/2783308/p855-nori.pdf
The paper proposes a method that effectively integrates medical domain knowledge into a data-driven approach by using multi-task learning, incorporating a graph Laplacian that encodes the similarities among diseases into the regularization term.
φ_n^(t): M-dimensional feature vector for the n-th patient with the t-th disease, where n ∈ {1,...,N_t} and t ∈ {1,...,T}
Φ^(t): N_t × M design matrix, Φ^(t) ≡ [φ_1^(t), ..., φ_{N_t}^(t)]^⊤
each patient is associated with a binary class label.
domain knowledge about similarities among diseases and similarities, encoded as symmetric similarity matrices.
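A graph Laplacian regularizer of this kind can be sketched as follows. The example assumes the common form tr(W^T L W) with L = D − S, where the rows of W are per-disease weight vectors and S is a symmetric similarity matrix; the paper's exact regularization term may differ, and the names and numbers below are hypothetical.

```python
def graph_laplacian(similarity):
    """L = D - S, where D is the diagonal matrix of row sums of S."""
    size = len(similarity)
    return [[(sum(similarity[i]) if i == j else 0.0) - similarity[i][j]
             for j in range(size)] for i in range(size)]


def laplacian_penalty(weights, similarity):
    """tr(W^T L W) for task weight vectors stored as the rows of W."""
    lap = graph_laplacian(similarity)
    tasks = range(len(weights))
    return sum(lap[s][t] * sum(a * b for a, b in zip(weights[s], weights[t]))
               for s in tasks for t in tasks)


# Three diseases; only the first two are declared similar to each other.
S = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
W = [[1.0, 0.0],   # weight vector for disease 1
     [0.0, 1.0],   # weight vector for disease 2
     [5.0, 5.0]]   # disease 3 is unconstrained by the graph
print(laplacian_penalty(W, S))
```

For a symmetric S the penalty equals the similarity-weighted sum of squared distances between weight vectors, so similar diseases are pushed towards similar models while unrelated ones are left alone.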
□ Epiviz: a view inside the design of an integrated visual analysis software for genomics:
>> http://www.biomedcentral.com/1471-2105/16/S11/S4
□ Recycling Deep Learning Models with Transfer Learning:
>> http://www.kdnuggets.com/2015/08/recycling-deep-learning-representations-transfer-ml.html
Initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset. It remains to be seen how these results hold up when the new task has far fewer examples than the original task. This imbalance in the number of labeled examples between the original task and the new one is often what motivates transfer learning.
□ yag_ays:
Read "Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!" (Wolftail Bounds)
>> http://yag.xyz/blog/2015/08/18/paper-information-extraction-system/
Papers describing a mixture of rule-based and machine learning techniques were generally written so as to obfuscate the use of rules, emphasizing the machine learning aspect of the work.
□ MIT announces vision for “affordable, robust, compact” fusion power plant
>> http://ow.ly/R2U2l
□ c_z:
>> http://www.graphistry.com/
>> http://on-demand.gputechconf.com/gtc/2015/presentation/S5589-Leo-Meyerovich.pdf
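The CEO of this startup is someone I met at a meetup. Nothing has shipped yet, but judging from the performance of the sample at the top of the page, the production version could be quite impressive. It makes full use of WebGL/CL. They seem to want to use WebGL/CL on the client and GPUs on the cluster as well, so that exploratory analysis can be done without friction on data sizes that could not previously be browsed in real time.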
□ beam2d:
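Convolution reduces the number of parameters by sharing weights in a fixed pattern along the spatial dimensions, but it does not share weights for the transformation applied at a given spatial position. This is a particular cause of parameter growth around the middle of a CNN. I suspect the HashedNets technique could be used to compress this part.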
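The weight-sharing idea behind HashedNets can be sketched briefly: every entry of a large "virtual" weight matrix is looked up in a small array of shared parameters through a hash of its indices. The code below is a simplified illustration on a fully connected layer, using `zlib.crc32` as a stand-in hash and omitting the sign hash of the original method; the function names and values are hypothetical.

```python
import zlib


def bucket(i, j, n_buckets, seed=0):
    """Deterministically hash a weight position (i, j) to a shared bucket."""
    key = f"{seed}:{i}:{j}".encode()
    return zlib.crc32(key) % n_buckets


def virtual_weight(shared, i, j):
    """Entry (i, j) of the virtual weight matrix, read from shared storage."""
    return shared[bucket(i, j, len(shared))]


def hashed_layer(inputs, shared, n_outputs):
    """y_i = sum_j W[i][j] * x_j, with W[i][j] looked up from shared buckets."""
    return [sum(virtual_weight(shared, i, j) * x for j, x in enumerate(inputs))
            for i in range(n_outputs)]


# An 8 x 16 virtual matrix (128 weights) backed by only 4 real parameters.
shared = [0.5, -1.0, 0.25, 2.0]
outputs = hashed_layer([1.0] * 16, shared, n_outputs=8)
print(len(outputs))
```

Applied to the channel-to-channel weights at each spatial position of a convolution, the same lookup would let many positions in the kernel share one stored parameter.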