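□ As a test case, I finished writing up the design of a customer-oriented database that combines API-driven Deep Learning with natural language processing. Handling the training dataset is the hard part.
□ On the user-oriented database built on Deep Learning and natural language processing, for which I submitted a design proposal the other day: the goal is to raise development accuracy by combining an analysis of the correlation between market trends and tag calls with actual demand data. The bottleneck is that suggestions optimized for individual users interfere with this correlation, so the correlation itself becomes biased.
□ As long as the features of the data can be designed, running PCA in pre-processing for dimensionality reduction would work, but throwing everything at a DNN really is more convenient.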
□ Once sufficiently many layers have been learned the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.
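□ At the board meeting the week after next, I will announce the relocation of our base that comes with the joint venture in question. America!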
![](https://blogimg.goo.ne.jp/user_image/3e/90/80c819359ec24d6cdb1635b86f74dc43.jpg)
![](https://blogimg.goo.ne.jp/user_image/67/7f/91cb7a58b427f41098069eec408de42e.jpg)
![](https://blogimg.goo.ne.jp/user_image/00/8c/bb5bffa4a834748e9f8522e2b1383b35.jpg)
□ Finding long-range interactions in the genome with Capture Hi-C:
>> http://genome.cshlp.org/content/early/2015/03/07/gr.185272.114.full.pdf
an interaction-calling algorithm called GOTHiC, that accounts for biases in Hi-C experiments by considering that these will be represented by the total coverage of the interacting fragments. GOTHiC detected 317,271 genomic fragments engaged in 548,551 significant, reproducible interactions with 21,748 promoters in ESCs.
![](https://blogimg.goo.ne.jp/user_image/04/f3/ebaa3de06241c769bd3a0f37ee5d24b6.png)
□ sense - agile data science: a collaborative platform to accelerate data science from exploration to production.
>> https://sense.io
Sense provides a powerful workbench supporting multiple tools - R, Python, Julia, Spark, Impala, Redshift, and more.
□ BayesPy: Variational Bayesian Inference in Python:
>> http://arxiv.org/pdf/1410.0870v2.pdf …
BayesPy has two types of nodes: stochastic and deterministic. Stochastic nodes correspond to probability distributions and deterministic nodes correspond to functions.
Future plans include support for non- conjugate models and non-parametric models (e.g., Gaussian and Dirichlet processes).
import numpy as np
from bayespy.nodes import Dirichlet, Categorical
alpha = Dirichlet(1e-5 * np.ones(K), name='alpha')
Z = Categorical(alpha, plates=(N,), name='z')
![](https://blogimg.goo.ne.jp/user_image/3d/a1/da291deb40d8b163cf0d17fb7d75c4b6.png)
![](https://blogimg.goo.ne.jp/user_image/10/ba/927fd1a38352e909c61e6fed06f576b4.png)
□ Artificial Neurons and Single-Layer Neural Networks: How Machine Learning Algorithms Work:
>> http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
stochastic gradient descent converges much faster than gradient descent since the updates are applied immediately after each training sample,
Perceptron Rule in Python:
for _ in range(self.epochs):
    errors = 0
    for xi, target in zip(X, y):
        update = self.eta * (target - self.predict(xi))
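The fragment above is only the inner loop of the perceptron rule. A minimal, self-contained sketch of the whole classifier in plain Python might look like the following; the class layout, the bias stored in `w_[0]`, and the toy data are assumptions made for illustration and are not copied from the linked article.

```python
class Perceptron:
    """Minimal perceptron classifier for labels in {-1, 1} (illustrative sketch)."""

    def __init__(self, eta=0.1, epochs=10):
        self.eta = eta        # learning rate
        self.epochs = epochs  # number of passes over the training set

    def fit(self, X, y):
        self.w_ = [0.0] * (len(X[0]) + 1)  # w_[0] is the bias term
        self.errors_ = []                  # misclassifications per epoch
        for _ in range(self.epochs):
            errors = 0
            for xi, target in zip(X, y):
                # Weights are updated immediately after each training sample.
                update = self.eta * (target - self.predict(xi))
                self.w_[0] += update
                for j, xij in enumerate(xi):
                    self.w_[j + 1] += update * xij
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def net_input(self, xi):
        return self.w_[0] + sum(w * x for w, x in zip(self.w_[1:], xi))

    def predict(self, xi):
        return 1 if self.net_input(xi) >= 0.0 else -1


# Hypothetical, linearly separable toy data.
X = [[2.0, 1.0], [3.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]]
y = [1, 1, -1, -1]
model = Perceptron(eta=0.1, epochs=10).fit(X, y)
predictions = [model.predict(xi) for xi in X]
```

Because each sample triggers its own weight update, this is the per-sample (stochastic) style of learning mentioned in the quote, as opposed to accumulating one update over the full training set.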
□ Training an LSTM language model: recurrent neural networks (RNNs), a powerful model for sequential data
>> https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/practicals/practical6.pdf …
The proposed GF-RNN was evaluated with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions by learning to gate these interactions.
□ In-depth introduction to machine learning in 15 hours of expert videos:
>> http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/ …
![](https://blogimg.goo.ne.jp/user_image/42/26/b179b819d8438577462a79a56cc8f7fe.png)
□ An Integrated Approach to Reconstructing Genome-Scale Transcriptional Regulatory Networks
>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004103
Transcriptional regulatory networks (TRNs) program cells to dynamically alter their gene expression in response to changing internal or environmental conditions. In this study, we develop a novel workflow for generating large-scale TRN models that integrates comparative genomics data, global gene expression analyses, and intrinsic properties of transcription factors (TFs).
□ SC3-seq: a method for highly parallel, quantitative measurement of single-cell gene expression: 1-cell level analysis
>> http://nar.oxfordjournals.org/content/early/2015/02/26/nar.gkv134.long
10,000-cell level versus 1-cell level: R² = 0.776–0.797; 1-cell level versus 1-cell level: R² = 0.677–0.708
SC3-seq exhibits a superior quantitative performance in that it does not underestimate the expression LVs of relatively longer transcripts and it detects much larger numbers of transcripts with smaller sequence depth.
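PCA uses the prcomp function without scaling. DEGs across multiple groups are identified with ANOVA, and qvalue is used to compute p-values and false positives. Gene ontology analysis is run with the DAVID web tool. Let's try turning that part into a pipeline with Deep Learning. The 2D plots and graph mining could also be handled with a DCNN.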
□ SimSeq: A Nonparametric Approach to Simulation of RNA-Sequence Datasets:
>> http://bioinformatics.oxfordjournals.org/content/early/2015/02/26/bioinformatics.btv124.abstract
Methods based on parametric modeling assumptions seem to perform better with respect to false discovery rate (FDR) control when data are simulated from parametric models rather than with a more realistic nonparametric simulation strategy.
□ how to perform Kolmogorov-Smirnov statistic in GSEA in R?
>> http://buff.ly/17S2ZCF
□ CauseMap: fast inference of causality from complex time series:
>> https://peerj.com/articles/824/
CauseMap is a method for establishing causality from long time series data (≳25 observations) using convergent cross mapping (CCM). CCM builds on Takens’ Theorem, a well-established result from dynamical systems theory that requires only mild assumptions. This theorem allows us to reconstruct high dimensional system dynamics using a time series of only a single variable.
Whitney’s Theorem tells us that the dimensionality of the full causal system is generically between (Emax - 1)/2 and Emax.
Causal inference from complex time series data. Convergent cross mapping. A causal system rests on the concept of directionality.
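The reconstruction from a single variable that Takens’ Theorem licenses is usually done with time-delay coordinates. The sketch below shows only that embedding step, not CCM or CauseMap itself; the function name `delay_embed`, the lag parameter `tau`, and the sample series are assumptions for illustration.

```python
def delay_embed(series, E, tau=1):
    """Return the E-dimensional time-delay embedding of a scalar series.

    Each point is (x[t], x[t - tau], ..., x[t - (E - 1) * tau]).
    """
    if E < 1 or tau < 1:
        raise ValueError("E and tau must be positive integers")
    start = (E - 1) * tau  # first index with a full history available
    return [
        tuple(series[t - k * tau] for k in range(E))
        for t in range(start, len(series))
    ]


# Hypothetical observations of a single variable.
series = [0.1, 0.4, 0.9, 0.3, 0.8, 0.6]
shadow = delay_embed(series, E=3, tau=1)
```

Each tuple in `shadow` is one point of the reconstructed ("shadow") manifold; a cross-mapping method would then look up nearest neighbours among these points.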
□ BGI Plans to Launch Two NGS Systems This Year Based on Complete Genomics Technology:
>> https://www.genomeweb.com/business-news/bgi-plans-launch-two-ngs-systems-year-based-complete-genomics-technology …
□ BGI Sequencer Buzz:
>> https://storify.com/OmicsOmicsBlog/bgi-sequencer-buzz …
BGI/Complete Genomics technology relied on sequencing-by-ligation, with the innovative "rolony" (aka nanoball) template amplification. It potentially offers an accuracy advantage, as ligases are extremely finicky about insisting on correct base-pairing at the ligation junction.
□ GigaScience:
Tuatara genome assembly v1.0 challenges: 120x coverage, estimated genome size 4.6GB. High GC content and 46% of the genome repetitive #G10K
□ Next Generation Sequencing (NGS) Markets 2015: information and financial information for over 100 companies
>> https://www.reportbuyer.com/product/2736754/next-generation-sequencing-ngs-markets-2015.html …
Driven by the promise of clinical diagnostic applications, the DNA sequencing market is growing rapidly and companies are raising money.
Sequencing systems are starting to be developed and sold to clinical laboratories following the traditional in vitro diagnostic (IVD) model. Illumina's MiSeqDx was the first cleared IVD next generation sequencing system. Kalorama expects the market to grow from its current size of 2.2 billion to 5.6 billion.
□ smllmp:
Interesting comments on the "BioFabric - Network vizualisation w/o hairballs" thread https://news.ycombinator.com/item?id=9159522 #dataviz
□ “Data-processing and machine learning with Python”
>> http://kachkach.com/data-processing-and-machine-learning-with-python/ …
Numerical variables
Categorical variables
Boolean variables
represent all these variables in the vector space model to train models.
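As a rough illustration of putting numerical, categorical, and Boolean variables into one vector space, here is a minimal sketch in plain Python. The record fields (`height`, `colour`, `is_member`) and the `encode` helper are hypothetical and are not taken from the linked article.

```python
def encode(record, colour_levels):
    """Map one mixed-type record to a flat list of floats."""
    vector = [float(record["height"])]                     # numerical: used as-is
    vector += [1.0 if record["colour"] == level else 0.0   # categorical: one-hot
               for level in colour_levels]
    vector.append(1.0 if record["is_member"] else 0.0)     # Boolean: 0 or 1
    return vector


colour_levels = ["red", "green", "blue"]
row = {"height": 1.75, "colour": "green", "is_member": True}
features = encode(row, colour_levels)
```

One-hot encoding is used for the categorical field so that no artificial ordering between levels leaks into the model.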
Example: SVM
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
# Training the model
clf = svm.SVC(kernel='rbf')
clf.fit(X, y)
# Doing predictions
new_data = [[4.85, 3.1], [5.61, 3.02], [6.63, 3.13]]
print(clf.predict(new_data))
Most scikit-learn classifiers have a score function that takes a list of inputs and target outputs and is used to evaluate the model; accuracy, precision/recall, etc. can also be calculated.
TruncatedSVD implements a variant of singular value decomposition. When applied to term-document matrices (as returned by CountVectorizer or TfidfVectorizer), it is known as latent semantic analysis (LSA), because it transforms such matrices to a “semantic” space of low dimensionality.
![](https://blogimg.goo.ne.jp/user_image/0a/04/1670b613bba899a5afe37b6fd85fdf09.jpg)
(sensitivities of RIEMS, Clinical PathoScope, Kraken & Megablast against the MetaPhlAn clade specific marker database )
□ RIEMS: a software pipeline for sensitive & comprehensive taxonomic classification of reads from metagenomics datasets
>> http://www.biomedcentral.com/1471-2105/16/69
Although various tools and strategies were published and are publicly accessible, the available capacities are not sufficient.
□ SFA-SPA: a suffix array based short peptide assembler for metagenomic data:
>> http://bioinformatics.oxfordjournals.org/content/early/2015/03/01/bioinformatics.btv052.short
The improved computational efficiency is achieved using a suffix array data structure that allows for fast querying during the assembly process and a significant redesign of assembly steps that enables multi-threaded execution.
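To make the "fast querying" point concrete, here is a minimal sketch of a suffix array with binary-search lookup in plain Python. It is not SFA-SPA's implementation; the naive construction, the function names, and the toy peptide string are assumptions for illustration.

```python
def build_suffix_array(text):
    """Sorted start positions of all suffixes (naive construction for small inputs)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern, found by binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # leftmost suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


# Hypothetical peptide sequence.
peptide = "MKVLAAGKVLA"
sa = build_suffix_array(peptide)
hits = find_occurrences(peptide, sa, "KVLA")
```

Once the array is built, each query needs only a logarithmic number of string comparisons, which is what makes repeated overlap lookups during assembly cheap.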
□ Clive_G_Brown:
@nanopore 1 PromethION produces about 1.5 Gb/s of raw data, similar to LHC, but that is raw. I think this 1Gb/s number must be processed
□ A genomic data viewer for iPad http://bit.ly/1aLJYn8 via @GenomeBiology
□ Now everybody can do their part to advance medical research.
>> https://www.apple.com/researchkit/
□ Apple Introduces ResearchKit, Giving Medical Researchers the Tools to Revolutionize Medical Studies
>> http://www.apple.com/pr/library/2015/03/09Apple-Introduces-ResearchKit-Giving-Medical-Researchers-the-Tools-to-Revolutionize-Medical-Studies.html
□ ResearchKit – how iPhone is transforming medical research
>> https://www.youtube.com/watch?v=VyY2qPb6c0c
□ Sagebio:
Learn more about @sagebio and our two clinical iPhone apps here: http://sagebase.org #AppleLive #AppleEvent
Developed by Sage Bionetworks and the University of Rochester, the Parkinson mPower app helps people living with Parkinson’s disease track their symptoms by recording activities using sensors in iPhone. These activities include a memory game, finger tapping, speaking & walking.
So ResearchKit was jointly developed with UCLA, Mount Sinai School of Medicine, and LifeMap! SageBio is on the list too. The barrier to entry looks high.
Apple announces ResearchKit, a tool that supports medical research and researchers:
http://www.apple.com/jp/pr/library/2015/03/09Apple-Introduces-ResearchKit-Giving-Medical-Researchers-the-Tools-to-Revolutionize-Medical-Studies.html …
"World-leading medical research institutions" are seriously developing apps. There is already an app that can analyze DNA samples directly on an iPhone, and if the Deep Learning crowd joins in, all the ingredients for a phase transition in genomics are in place. An Illumina sequencing app may appear before long.
□ MetaCompass: a software package for comparative assembly of metagenomic reads
>> http://metacompass.cbcb.umd.edu
□ Metassembler: Merging and optimizing de novo genome assemblies:
>> http://biorxiv.org/content/early/2015/03/10/016352 …
The only data requirement is that at least one jumping library be available to evaluate the presence of compression/expansion mis-assemblies, although that data type need not be used in any or all of the assemblies.
metassembly evaluated the presence of core eukaryotic genes using the CEGMA algorithm, as well as the concordance of the metassembly sequence with remapped pair-end and mate-pair reads using REAPR.
It turned out to be a product of the CEGMA pipeline.
□ Modelling Computational Resources for Next Generation Sequencing Bioinformatics Analysis of 16S rRNA Samples:
>> http://arxiv.org/pdf/1503.02974v1.pdf …
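Building bioinformatics computational resources through multiple linear regression analysis. Denoising with a pipeline using AmpliconNoise, Perseus, and ChimeraSlayer, and accurate modeling of processing time. Parallelization scripts built around grid services such as DIAG.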
MLR models were able to accurately predict clock time for denoising sequences from a naturally assembled community dataset, but not an artificial community.
□ HISAT: a fast spliced aligner with low memory requirements
>> http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3317.html
HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ~64,000 bp. HISAT requires only 4.3 gigabytes of memory. HISAT supports genomes of any size, including those larger than 4 billion bases.
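For readers unfamiliar with FM indexes, the sketch below builds a single small FM index and counts exact matches by backward search. It illustrates the general data structure only, not HISAT's hierarchical scheme of global and local indexes; the naive construction, the function names, and the toy sequence are assumptions.

```python
def bwt_from_text(text):
    """Burrows-Wheeler transform of text + '$' via sorted rotations (naive)."""
    text += "$"  # '$' is a sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt):
    """Return (C, occ) where C[c] counts characters smaller than c
    and occ[c][i] counts occurrences of c in bwt[:i]."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return C, occ


def count_matches(pattern, C, occ, n):
    """Count exact occurrences of pattern using backward search."""
    lo, hi = 0, n  # half-open range of rows in the sorted rotation matrix
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Hypothetical reference sequence.
reference = "ACGTACGTTACG"
bwt = bwt_from_text(reference)
C, occ = build_fm_index(bwt)
n_acg = count_matches("ACG", C, occ, len(bwt))
```

The search cost depends on the pattern length rather than the reference length, which is why splitting a genome into many small local indexes can still give fast lookups.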
□ Scaffold assembly based on genome rearrangement analysis
>> http://www.sciencedirect.com/science/article/pii/S1476927115000225
a new method for scaffold assembly based on the analysis of gene orders and genome rearrangements in multiple related genomes (some or even all of which may be fragmented). Evaluation of the proposed method on artificially fragmented mammalian genomes demonstrates its high reliability.
□ 10X uses an expensive box and fancy microfluidics; Dovetail uses just some molecular biology steps. 10X also has an advantage for input material, requiring only 1 nanogram versus multiple micrograms for Dovetail.
10X and Dovetail are trying to tackle a similar space as those mapping companies, to resolve long range structure but via leveraging an Illumina sequencer.
□ DANN: a deep learning approach for annotating the pathogenicity of genetic variants
>> http://bioinformatics.oxfordjournals.org/content/early/2014/10/22/bioinformatics.btu703.abstract
DNNs can capture nonlinear relationships among features and are better suited than SVMs for problems with a large number of samples and features.
□ CosmosID Adds Cloud Option with the Release of Metagenomics App on Illumina's BaseSpace:
>> https://www.genomeweb.com/informatics/cosmosid-adds-cloud-option-release-metagenomics-app-illuminas-
□ Hidden meaning and 'speed limits' found within genetic code:
>> http://www.sciencedaily.com/releases/2015/03/150312173800.htm
□ Second-generation PLINK: rising to the challenge of larger and richer datasets:
>> http://www.gigasciencejournal.com/content/4/1/7
[sum of null hypothesis likelihoods of at-least-as-extreme tables] / [sum of null hypothesis likelihoods of all tables]
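The ratio quoted above is the shape of an exact-test p-value. As one concrete instance, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table in plain Python. It is not PLINK's optimized implementation; the function name and the choice of "at least as extreme" as "null likelihood no larger than the observed table's" are assumptions stated here for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)  # feasible top-left cell values

    def weight(x):
        # Unnormalised null likelihood of the table whose top-left cell is x
        # (hypergeometric numerator; the shared denominator cancels in the ratio).
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    all_tables = sum(weight(x) for x in range(lo, hi + 1))
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / all_tables


p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

Working with integer binomial coefficients keeps the comparison `weight(x) <= observed` exact, at the cost of speed on large tables.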
□ Hadoop as a Platform for Genomics - Strata 2015, San Jose:
>> http://www.slideshare.net/allenday/hadoop-as-a-platform-for-genomics-strata-2015-san-jose …
Genome × Phenome Tensor
• Aggregating over individuals with matrix could ignore the correlations among genotypes and phenotypes
Aadhaar Biometric ID Creation
900MM people loaded in 4 years
1MM registrations/day
200+ trillion lookups/day
All built on MapR-DB (HBase)
![](https://blogimg.goo.ne.jp/user_image/5a/df/38be7c66f927d7f34c47b83753542d41.jpg)
![](https://blogimg.goo.ne.jp/user_image/62/e2/28cc9188158d7281fad17bcea06b6723.jpg)
□ Introducing DataFrames in Spark for Large Scale Data: RDD, Machine Learning, GraphX, Spark R
>> http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science …
![](https://blogimg.goo.ne.jp/user_image/78/fe/714f76f09cbb0cc467e55045740cfc45.jpg)
![](https://blogimg.goo.ne.jp/user_image/32/54/1983c1a808a7da2a3c6e898ac41bb42f.jpg)
□ NoDe: a fast error-correction algorithm for pyrosequencing amplicon reads:
>> http://www.biomedcentral.com/1471-2105/16/88/abstract …
The positive effect of NoDe in 16S rRNA studies was confirmed on the precision of the clustering of pyrosequencing reads in taxonomic units. In NoDe, the pre-cluster-like algorithm is preceded by a machine learning approach identifying potentially erroneous nucleotides.
□ LINKS: Scaffolding genome assemblies with kilobase-long nanopore reads:
>> http://biorxiv.org/content/biorxiv/early/2015/03/13/016519.full.pdf
The predecessor of LINKS is the unpublished scaffolding engine in the SSAKE assembler and foundation of the SSPACE-LongRead scaffolder. LINKS has two main stages: contig pairing & scaffold layout. Cycling through k-mer pairs, those that are uniquely placed on contigs are identified.
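The idea of cycling through k-mer pairs drawn from long reads can be sketched as follows. This is an illustration under stated assumptions, not the LINKS code: the function name `kmer_pairs`, the definition of `distance` as the offset between the two k-mer start positions, and the `step` parameter are all hypothetical.

```python
def kmer_pairs(read, k, distance, step=1):
    """Yield (kmer1, kmer2) pairs whose start positions are `distance` bases apart."""
    last_start = len(read) - distance - k  # last position where both k-mers fit
    for i in range(0, last_start + 1, step):
        yield read[i:i + k], read[i + distance:i + distance + k]


# Hypothetical (very short) long read.
read = "ACGTACGTTAGC"
pairs = list(kmer_pairs(read, k=3, distance=6, step=2))
```

In a scaffolder, each pair whose two k-mers map uniquely to different contigs would count as evidence that those contigs lie roughly `distance` bases apart.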
□ SCIEX Announces OneOmics Collaborators:
>> http://www.businesswire.com/news/home/20150316005134/en/SCIEX-Announces-OneOmics-Collaborators
Advaita Bioinformatics, ISB and Yale Univ.: cloud integration of NGS applications is advancing. So Advaita has arrived.
OneOmics: SCIEX and Illumina have partnered to create the world's first multi-omics applications within a SWATH & BaseSpace cloud-computing environment.