Blog posts in the "Paper Notes" category (page 5) - ぴかりんの頭の中味

【Paper】Kohavi, 1996, Wrappers for Feature Subset Select～

2007-10-17 22:08:47 | Paper Notes

Ron Kohavi, George H.John
Wrappers for Feature Subset Selection
Artificial Intelligence (1996)
[PDF][Web Site]

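・A summary of the wrapper method, one of the techniques for feature subset selection in data mining. Covers the details of the algorithm, its known problems, and a performance comparison with the filter method.
・Data: 14 datasets (8 real / 6 artificial)
・Induction algorithms: 1. C4.5 (ID3), 2. Naive-Bayes
・Search engine: 1. Hill-climbing, 2. Best-first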

・Problem「The problem of feature subset selection is that of finding a subset of the original features of a dataset, such that an induction algorithm that is run on data containing only these features generates a classifier with the highest possible accuracy.」
・Problem「In learning scenarios, however, we are faced with two problems: the learning algorithms are not given access to the underlying distribution, and most practical algorithms attempt to find a hypothesis by approximating NP-hard optimization problems.」
・Note「However, it is important to realize that relevance according to these definitions does not imply membership in the optimal feature subset, and that irrelevance does not imply that a feature cannot be in the optimal feature subset.」
・Characteristic「The main disadvantage of the filter approach is that it totally ignores the effects of the selected feature subset on the performance of the induction algorithm.」
・「We shall investigate two hypotheses: first, that using a filter method will improve the accuracy of ID3 and Naive-Bayes on real datasets but will be fairly erratic (often hurting performance), and second, that improvements from using the wrapper approach will surpass the gains from the filter and will be more consistent.」
・Result「In summary, feature subset selection using the wrapper approach significantly improves ID3, C4.5 and Naive-Bayes on the datasets tested. On the real datasets, the wrapper approach is clearly superior to the filter method. Perhaps the most surprising result is how well Naive-Bayes performs on real datasets once discretization and feature subset selection are done.」
・Issues「These problems include: inability to remove a feature in symmetric target concepts such as m-of-n-3-7-10 where removal of one feature improves performance (Section 4), inability to include irrelevant features that may actually help performance (Example 3), and inability to remove correlated features that may hurt performance (Section 2.4).」
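The wrapper idea in the notes above (score each candidate feature subset by running the induction algorithm itself, and let a search engine such as hill-climbing walk through the subsets) can be sketched as follows. This is only an illustration under assumptions: a 1-nearest-neighbour classifier with leave-one-out accuracy stands in for the C4.5 and Naive-Bayes learners used in the paper, and the function names `loocv_accuracy` and `wrapper_forward_selection` are made up for this example.

```python
def loocv_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-NN classifier that only sees `features`.

    Stand-in for the induction algorithm (assumption: the paper uses
    C4.5 / Naive-Bayes, not 1-NN).
    """
    if not features:
        return 0.0  # convention for the empty subset in this sketch
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def wrapper_forward_selection(X, y):
    """Hill-climbing search: start from the empty set, greedily add one feature."""
    n_features = len(X[0])
    selected, best = [], loocv_accuracy(X, y, [])
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension with the learner itself.
        scored = [(loocv_accuracy(X, y, selected + [f]), f) for f in candidates]
        score, f = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best:
            break  # local optimum: no extension improves the estimate
        selected.append(f)
        best = score
    return selected, best
```

Best-first search differs from this hill-climbing loop mainly in that it keeps a list of unexpanded subsets and can back up to them instead of stopping at the first local optimum.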

・A paper from the engineering field, not a DNA-related one.
・I cannot find good Japanese equivalents for any of "Feature Subset Selection", "Wrapper", "Induction algorithms", or "Search Engine".

【Paper】Hayasaki, 2007, Analysis of pharmacological effe～

2007-10-05 19:03:02 | Paper Notes

T.Hayasaki, M.Sakurai, T.Hayashi, K.Murakami and T.Hanawa
Analysis of pharmacological effect and molecular mechanisms of a traditional herbal medicine by global gene expression analysis: an exploratory study.
Journal of Clinical Pharmacy & Therapeutics, Volume 32, Number 3, June 2007, pp.247-252(6)
[PDF][Web Site]

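・Investigates the effects of the Kampo (traditional Japanese herbal) medicine Kososan (香蘇散) on the human body using DNA microarrays.
・Data: 14 subjects (healthy volunteers) took the medicine for 2 weeks; blood was drawn before administration and after 2 weeks and run on DNA microarrays (Agilent Whole Human Genome Oligo Microarray G4112A).
・Results
1.Based on questionnaire answers about health status before and after administration, the subjects were divided into three groups: Responders (6), Non-responders (4), Ineligible (4).
2.Changes in gene expression in Responders
　Genes that had been at 2-fold or higher and fell to the normal level (?) → 70
　Genes that had been at one half or lower and rose to the normal level → 24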

・Summary「We examined the pharmacological effect and mechanism of action of a traditional herbal medicine (Kososan) with global gene expression analysis using a DNA chip.」
・Issue「Although the effects of traditional medicines are recognized worldwide, the scientific bases and the mechanisms of action are not well understood.」
・Uses「Kososan is commonly used for: (i) for the common cold, and continuous ingestion is said to decrease the chances of catching a cold, (ii) gastro-intestinal disturbances linked to food poisoning and (iii) allergic disorders such as urticaria, and (iv) alleviation of depression.」

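・Unlike heavily studied diseases such as cancer, this applies microarrays to a previously untouched area, which sets it apart from the papers I have read so far.
・As a preliminary study, it goes only as far as collecting the data; the data analysis methods and the interpretation of the extracted gene lists are still at an early stage.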

【Paper】Dettling, 2002, Supervised clustering of genes

2007-10-02 22:07:32 | Paper Notes

Marcel Dettling and Peter Bühlmann
Supervised clustering of genes
Genome Biology 2002, 3:research0069.1-0069.15
[PDF][Web Site]

・Proposes a supervised classification method for microarray samples.
・Data
1.Leukemia dataset [Golub]
2.Breast cancer dataset [West]
3.Colon cancer dataset [Alon]
4.Prostate cancer dataset
5.SRBCT dataset [Khan]
6.Lymphoma dataset [Alizadeh]
7.Brain tumor dataset [Pomeroy]
8.National Cancer Institute (NCI) dataset [Ross]
・Classification methods (used for comparison?)
1.Nearest-neighbor classification
2.Aggregated trees
・Evaluation of classification results
1.Leave-one-out cross validation
2.Random splitting

・Aim「The identification of these functional groups is crucial for tissue classification in medical diagnostics, as well as for understanding how the genome as a whole works.」
・Issue「but as with all other unsupervised techniques, it usually fails to reveal functional groups of genes that are of special interest in tissue classification. This is because genes are clustered by similarity only, without using any information about the experiment's response variables.」
・Method「Here we present a promising new method for searching functional groups, each made up of only a few genes whose consensus expression profiles provide useful information for tissue discrimination. Like PLS, it is a one-step approach that directly incorporates the response variables Y into the grouping process, and is thus an algorithm for supervised clustering of genes.」
・Method「Our approach is algorithmically similar and also relies on growing the cluster incrementally by adding one gene after the other.」
・Method「In summary, our cluster algorithm is a combination of variable (gene) selection for cluster membership and formation of a new predictor by possible sign-flipping and averaging the gene expressions within a cluster as in Equation 2.」
・「We assume that problem-dependent solutions that utilize deeper knowledge about the biological relation between the tissue types could be even more accurate for reducing multicategory problems to binary problems.」
・Result「In all eight datasets we analyzed, comprising a total of 24 binary class distinctions, the average cluster expression x_c always perfectly discriminates the two response classes (in multiclass problems, this is one class against the rest).」
・Outlook「An important task that remains to be addressed in future research is the generalization of the supervised clustering algorithm to quantitative response variables and to censored survival data.」
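To make the quoted description more concrete (grow a cluster one gene at a time, sign-flip genes so they point the same way, and use the within-cluster average as a new predictor), here is a simplified sketch. It is not the authors' exact algorithm: the objective used here (number of misordered class-0/class-1 sample pairs, with the margin between the classes as tie-breaker) and all function names are assumptions chosen for illustration, and the sketch has no pruning step.

```python
def split_by_class(values, y):
    lo = [v for v, c in zip(values, y) if c == 0]
    hi = [v for v, c in zip(values, y) if c == 1]
    return lo, hi


def misordered_pairs(values, y):
    """Pairs (class-0 sample, class-1 sample) where class 0 is not strictly lower."""
    lo, hi = split_by_class(values, y)
    return sum(1 for a in lo for b in hi if a >= b)


def margin(values, y):
    """Gap between the lowest class-1 value and the highest class-0 value."""
    lo, hi = split_by_class(values, y)
    return min(hi) - max(lo)


def sign_flip(gene, y):
    """Flip a gene that is higher in class 0, so all genes point the same way."""
    lo, hi = split_by_class(gene, y)
    return gene if sum(hi) / len(hi) >= sum(lo) / len(lo) else [-v for v in gene]


def cluster_mean(genes):
    """Average expression profile of a cluster (one value per sample)."""
    return [sum(col) / len(col) for col in zip(*genes)]


def grow_cluster(expr, y):
    """expr[g][s] = expression of gene g in sample s; y[s] in {0, 1}."""
    flipped = [sign_flip(g, y) for g in expr]
    members, best = [], None  # best = (misordered pairs, -margin); lower is better
    while True:
        trial = None
        for idx, g in enumerate(flipped):
            if idx in members:
                continue
            avg = cluster_mean([flipped[m] for m in members] + [g])
            key = (misordered_pairs(avg, y), -margin(avg, y))
            if trial is None or key < trial[0]:
                trial = (key, idx)
        if trial is None or (best is not None and trial[0] >= best):
            break  # no remaining gene improves the cluster profile
        best = trial[0]
        members.append(trial[1])
    return members, cluster_mean([flipped[m] for m in members])
```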

・I did not really understand the algorithm.

【Paper】Michaud, 2003, eXPatGen: generating dynamic exp～

2007-09-26 22:05:17 | Paper Notes

Dennis J. Michaud, Adam G. Marsh and Prasad S. Dhurjati
eXPatGen: generating dynamic expression patterns for the systematic evaluation of analytical methods
Bioinformatics Vol.19 no.9 2003 Pages 1140-1146
[PDF][Web Site]

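・Introduces "eXPatGen", an analysis tool for microarray data. It generates a network structure among genes from the given data. It can be run through a web browser.
・Data: artificially generated time-series data
1.Example #1: 5 groups, 5 genes each
2.Example #2: 10 groups, 10 genes each
・Methods compared: Clustering, PCA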

・What eXPatGen is「We have developed an on-line simulator, called eXPatGen, to generate dynamic gene expression patterns typical of microarray experiments. eXPatGen provides a quantitative network structure to represent key biological features, including the induction, repression, and cascade regulation of messenger RNA (mRNA).」
・Significance「The large number of methods of analysis and the lack of a standard data set with known network connections provided a strong motivation for the development of eXPatGen.」

【Paper】Truntzer, 2007, Importance of data structure in ～

2007-09-20 11:26:57 | Paper Notes

Caroline Truntzer, Catherine Mercier, Jacques Estève, Christian Gautier and Pascal Roy
Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data
BMC Bioinformatics 2007, 8:90
[PDF]

・Proposes a way of discriminating samples based on multiple indices: plot BGA (Between-Group Analysis) on the horizontal axis against the first or second component of PCA (Principal Components Analysis) on the vertical axis.
・As a preliminary experiment, the following three methods are compared on artificial data
1.PLS(Partial Least Squares) + DA(Discriminant Analysis)
2.PCA + DA
3.BGA
・Data: all obtained from Bioconductor ("R") packages
1.DLBCL : 58 patients (32 "cured" and 26 "fatal/refractory"), 6149 genes [Shipp]
2.Prostate : 102 patients (50 without and 52 with tumor), 12625 genes [Singh]
3.ALL : 125 patients (24 with and 101 without Multi Drug Resistance -MDR-), 12625 genes [Chiaretti]
4.Leukemia : 72 patients (47 Acute Lymphoblastic Leukemia -ALL- and 25 Acute Myeloid Leukemia -AML-), 7129 genes [Golub]
・The program for the proposed method can be run in "R"

・Summary「This study evaluates the influence of gene expression variance structure on the performance of methods that describe the relationship between gene expression levels and a given phenotype through projection of data onto discriminant axes.」
・Method「To examine the structure of a dataset before analysis and preselect an a priori appropriate method for its analysis, we proposed a two-graph preliminary visualization tool: plotting patients on the Between-Group Analysis discriminant axis (x-axis) and on the first and the second within-group Principal Components Analysis component (y-axis), respectively.」
・Issue「However, in microarray experiments, there are more variables (genes) than samples (patients); if not taken into account, this dimension problem leads to trivial results with no statistical identifiability or biological significance.」
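As a rough illustration of the x-axis in the two-graph tool quoted above: with only two groups, a PCA of the group centroids has a single axis, namely the line through the two centroids, so the BGA coordinate of each patient can be sketched as a projection onto that line. This is a minimal sketch under that two-group assumption; it leaves out the within-group PCA used for the y-axis, and the function names are hypothetical (the authors' own program is in R).

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(col) for col in zip(*rows)]


def bga_axis_two_groups(X, y):
    """Unit vector through the two group centroids (groups labelled 0 and 1)."""
    g0 = centroid([x for x, c in zip(X, y) if c == 0])
    g1 = centroid([x for x, c in zip(X, y) if c == 1])
    diff = [b - a for a, b in zip(g0, g1)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(X, axis):
    """Coordinate of each sample on `axis`, after centring on the overall mean."""
    mean = centroid(X)
    return [sum((v - m) * a for v, m, a in zip(x, mean, axis)) for x in X]
```

For example, four samples at the corners of a 4 by 2 rectangle, grouped by their first coordinate, give the axis `[1.0, 0.0]` and projections of -2 and +2 for the two groups.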

【Paper】Duan, 2005, Multiple SVM-RFE for gene selection～

2007-09-12 22:09:48 | Paper Notes

Kai-Bo Duan, Jagath C.Rajapakse, Haiying Wang, Francisco Azuaje
Multiple SVM-RFE for gene selection in cancer classification with expression data
Nanobioscience, IEEE Transactions on, Volume:4, Issue:3, page(s):228-234
[PDF][Web Site]

・Introduces Multiple SVM-RFE, an improved version of the gene selection method SVM-RFE
・Data
1.Breast cancer [Hedenfalk]
2.Colon Tumor [Alon]
3.ALL-AML Leukemia [Golub]
4.Lung Cancer [Gavin]
・Experiments
1.Compare and evaluate SVM-RFE and MSVM-RFE by cross validation
2.Evaluate the selected genes with GO

・Method「This paper proposes a new feature selection method that uses a backward elimination procedure similar to that implemented in support vector machine recursive feature elimination (SVM-RFE). Unlike the SVM-RFE method, at each step, the proposed approach computes the feature ranking score from a statistical analysis of weight vectors of multiple linear SVMs trained on subsamples of the original training data.」
・What SVM-RFE is「Nested subsets of features are selected in a sequential backward elimination manner, which starts with all the feature variables and removes one feature variable at a time. At each step, the coefficients of the weight vector w of a linear SVM are used to compute the feature ranking score.」
・Issue「Due to computing efficiency reasons, the algorithm can be generalized to remove more than one feature per step [9]. However, the removal of several features at a time may degrade the performance of the feature selection method.」
・Key point「The bootstrap stabilization idea can be applied to SVM-RFE. However, instead of applying this idea on SVM-RFE as a whole, we may apply it on each step of the recursive procedure of SVM-RFE.」
・Characteristic「The proposed MSVM-RFE is computationally more expensive than SVM-RFE. However, as feature selection is a prestep for building a good classifier, it is worthwhile to go through a computationally more expensive way if a better feature subset can be selected.」
・About GO「It comprises three hierarchies, sometimes referred to as taxonomies or "aspects," that respectively hold terms describing the molecular function (MF), biological process (BP), and cellular component (CC). The vocabularies (one for each ontology) and their relationships are represented in the form of directed acyclic graphs (DAGs),」
・Significance of GO「Thus, relationships between GO-based similarity and gene expression correlation may offer a new approach to assessing the relevance of a set of genes selected.」
・Conclusion「We conclude that: 1) the proposed MSVM-RFE method can select better gene subsets than SVM-RFE and improve the cancer classification accuracy; 2) gene selection also improves the performance of SVMs and is a necessary step for cancer classification with gene expression data; and 3) GO-based similarity values of pairs of genes belonging to subsets selected by MSVM-RFE are significantly low, which may be seen as an indicator of functional diversity (or redundancy reduction).」
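The backward elimination loop described in the quotes above can be sketched in a few dozen lines. Everything specific here is an assumption made for illustration: the linear SVM is trained with a plain batch subgradient method, the subsamples are simple leave-one-out subsets (chosen for determinism, not taken from the paper), and the ranking score is the mean of the normalised squared weights divided by their standard deviation across subsamples, which is my reading of "a statistical analysis of weight vectors". Labels are expected in {-1, +1}.

```python
import math


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Weights of a linear SVM: batch subgradient descent on
    lam/2 * ||w||^2 + mean hinge loss (bias term fitted but not returned)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0
        for x, t in zip(X, y):
            if t * (sum(wj * xj for wj, xj in zip(w, x)) + b) < 1:
                for j in range(d):
                    gw[j] -= t * x[j] / n
                gb -= t / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w


def msvm_rfe_scores(X, y):
    """One MSVM-RFE step: score features from SVMs trained on several subsamples."""
    squared = []  # one row of normalised squared weights per subsample
    for left_out in range(len(X)):
        Xs = [x for i, x in enumerate(X) if i != left_out]
        ys = [t for i, t in enumerate(y) if i != left_out]
        w = train_linear_svm(Xs, ys)
        norm2 = sum(wj * wj for wj in w) or 1.0
        squared.append([wj * wj / norm2 for wj in w])
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in squared]
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        scores.append(mean / (std + 1e-12))  # assumed score: mean over std
    return scores


def msvm_rfe(X, y, n_keep=1):
    """Remove the lowest-scoring feature, one at a time, until n_keep remain."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        Xr = [[x[j] for j in remaining] for x in X]
        scores = msvm_rfe_scores(Xr, y)
        worst = min(range(len(remaining)), key=lambda k: scores[k])
        del remaining[worst]
    return remaining
```

Plain SVM-RFE corresponds to replacing `msvm_rfe_scores` with the squared weights of a single SVM trained on all samples; the extra cost of the multiple version is the repeated training at every elimination step, which matches the trade-off noted in the "Characteristic" quote.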

【Paper】Handl, 2005, Computational cluster validation in～

2007-09-07 22:07:35 | Paper Notes

Julia Handl, Joshua Knowles and Douglas B.Kell
Computational cluster validation in post-genomic data analysis
Bioinformatics 2005 21(15):3201-3212
[PDF][Web Site]

・Surveys the many analysis methods available for microarray data and discusses how to choose among them, describing their properties and differences with many examples and offering guidelines.
＜Categories of clustering methods＞
1.Compactness : k-means, average-link agglomerative clustering, SOMs, model-based clustering
2.Connectedness : density-based methods, single-link agglomerative clustering
3.Spatial separation : simulated annealing, tabu search, evolutionary algorithms
＜Categories of measures for evaluating clustering results＞
[A]External measures
1.Unary measures : F-measure, 'enrichment'
2.Binary measures : Rand Index, Jaccard coefficient, Minkowski Score
[B]Internal measures
1.Compactness : graph-based approaches
2.Connectedness : k-nearest neighbor consistency, connectivity
3.Separation : average weighted inter-cluster distance
4.Combinations : SD-validity Index, Dunn Index, Dunn-like Indices, Davies-Bouldin Index, Silhouette Width
5.predictive power/stability
6.Compliance between a partitioning and distance information : Pearson correlation, Spearman rank correlation
7.Specialized measures for highly correlated data : figure of merit, jackknife approach, figure of merit of Yeung
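The external "binary" measures in the list above compare a clustering with a known class labelling by looking at every pair of samples. A minimal sketch of the Rand Index and the Jaccard coefficient follows; the function names are made up for this example.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count sample pairs by agreement between the two labelings.

    a: same class and same cluster      b: same class, different cluster
    c: different class, same cluster    d: different class and cluster
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1
        elif same_true:
            b += 1
        elif same_pred:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(labels_true, labels_pred):
    """Fraction of pairs on which the two labelings agree."""
    a, b, c, d = pair_counts(labels_true, labels_pred)
    return (a + d) / (a + b + c + d)


def jaccard_coefficient(labels_true, labels_pred):
    """Like the Rand Index, but ignoring pairs separated in both labelings."""
    a, b, c, _ = pair_counts(labels_true, labels_pred)
    return a / (a + b + c)
```

With true classes `[0, 0, 1, 1]` and clusters `[0, 0, 0, 1]` the pair counts are a=1, b=1, c=2, d=2, so the Rand Index is 0.5 and the Jaccard coefficient is 0.25.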

・Data
1.Artificial data: 'Long', 'Square'
2.Leukemia data [Golub]
・Clustering methods
1.K-means
2.Average-link
3.Single-link
4.SOM
5.SOTA
・Clustering evaluation measures (vertical axis)
1.F-measure
2.Adjusted F-measure
3.Silhouette Width
4.Dunn Index
5.Variance
6.Connectivity
7.Stability
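Two of the internal measures listed above, Silhouette Width and the Dunn Index, need only the data and the cluster labels. The sketch below uses the common textbook definitions with Euclidean distance (via `math.dist`, Python 3.8+); the paper may use variants, and the function names are assumptions. Both functions expect at least two clusters and at least one pair of distinct points.

```python
import math


def group_by_cluster(points, labels):
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    return clusters


def silhouette_width(points, labels):
    """Mean over samples of (b - a) / max(a, b), where a is the mean distance
    to the sample's own cluster and b the mean distance to the closest other
    cluster. Singleton clusters contribute 0 by convention."""
    clusters = group_by_cluster(points, labels)
    total = 0.0
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != c)
        total += (b - a) / max(a, b)
    return total / len(points)


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    groups = list(group_by_cluster(points, labels).values())
    min_between = min(math.dist(p, q)
                      for i, g in enumerate(groups)
                      for h in groups[i + 1:]
                      for p in g for q in h)
    max_diameter = max(math.dist(p, q) for g in groups for p in g for q in g)
    return min_between / max_diameter
```

Higher values are better for both: Silhouette Width lies between -1 and 1, while the Dunn Index is unbounded above and is sensitive to a single outlying point because it uses a minimum and a maximum.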

・Aim「In particular, the paper attempts to familiarize researchers with some of the fundamental concepts behind cluster-validation techniques, and to assist them in making more informed choices of the measures to be used.」
・Issue「There are several valid properties that may be ascribed to a good partitioning, but these are partly in conflict and are generally difficult to express in terms of objective functions.」
・Issue「However, there is hardly any consensus on the best distance function, clustering method or method of feature selection to be used for the different types of post-genomic data.」

【Paper】Giordano, 2001, Organ-specific molecular classif～

2007-08-30 22:49:25 | Paper Notes

Thomas J.Giordano, Kerby A.Shedden, Donald R.Schwartz, Rork Kuick, Jeremy M.G.Taylor, Nana Lee, David E.Misek, Joel K.Greenson, Sharon L.R.Kardia, David G.Beer, Gad Rennert, Kathleen R.Cho, Stephen B.Gruber, Eric R.Fearon and Samir Hanash
Organ-Specific Molecular Classification of Primary Lung, Colon, and Ovarian Adenocarcinomas Using Gene Expression Profiles
American Journal of Pathology. 2001;159:1231-1238.
[PDF]

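・Analyzes microarray data from cancers of several tissue types (lung, colon, ovary) rather than a single tissue, and extracts the genes expressed specifically in each tissue.
・Data: 57 lung, 51 colon and 46 ovary adenocarcinomas, 7129 genes
・Analysis: PCA ("five-nearest neighbors with majority voting") and cross-validated prediction
・Results: lists the top 20 genes for each tissue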
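The "five-nearest neighbors with majority voting" rule and its cross-validated evaluation can be sketched as below. This is a generic illustration, not the paper's pipeline: it works directly on whatever coordinates are passed in (the paper applies the rule after PCA), uses leave-one-out as the cross-validation scheme, and the function names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=5):
    """Label of `x` by majority vote among its k nearest training samples."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def knn_loocv_accuracy(X, y, k=5):
    """Leave-one-out estimate: predict each sample from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)
```

Usage would look like `knn_loocv_accuracy(coords, organs)`, where `coords` holds one coordinate list per tumor and `organs` the matching labels such as "lung", "colon", "ovary".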

・Significance「The establishment of organ-specific gene expression patterns represents a crucial first step in the clinical application of the molecular approach.」
・Summary「In this study, we compared the gene expression profiles of 154 primary adenocarcinomas of lung, colon, and ovary and demonstrated these profiles could discriminate the tumors in an organ-specific manner. In addition, we identified genes that are potentially useful as diagnostic markers for these tumors.」

【Paper】Inza, 2004, Filter versus wrapper gene selection～

2007-08-24 21:05:42 | Paper Notes

I.Inza, P.Larrañaga, R.Blanco, A.Cerrolaza
Filter versus wrapper gene selection approaches in DNA microarray domains.
Artificial Intelligence in Medicine, Volume 31, Issue 2, Pages 91-103
[PDF][Web Site]

・Compares the performance of the filter and wrapper methods for selecting the genes used in sample classification, across several datasets and classifiers.
・Data
1.Colon dataset, 62 samples (22 tumor/40 normal), selected 2000 genes [Alon]
2.Leukemia dataset, 72 samples (25 AML/47 ALL), 7129 genes [Golub]
・Gene selection methods
[A]Filter approach
(a)For continuous data
1.P-metric
2.t-score
(b)For discrete data
1.Shannon-entropy
2.Euclidean-distance
3.Kolmogorov-dependence
4.Kullback-Leibler
[B]Wrapper approach: Sequential forward selection (SFS) : {3,5,10,20} genes of highest scoring value
・Sample classification methods (supervised classifiers)
1.IB1 : Nearest-neighbor (K-NN) classifier
2.Naive-Bayes (NB) rule : Bayes theorem
3.C4.5 : Decision tree
4.CN2 : Set of IF-THEN rules
・Evaluation of classification results: LOOCV
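For contrast with the wrapper approach, a filter ranks each gene on its own, without running any classifier. The sketch below ranks genes by the absolute value of a Welch-style two-sample t statistic and keeps the top k; this particular form of the t-score is an assumption and may differ from the exact definition used in the paper, and the function names are made up.

```python
import math
from statistics import mean, variance


def t_score(values, y):
    """Two-sample t statistic (unequal variances) of one gene; y[s] in {0, 1}."""
    a = [v for v, c in zip(values, y) if c == 0]
    b = [v for v, c in zip(values, y) if c == 1]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if se == 0 else (mean(a) - mean(b)) / se


def top_genes(expr, y, k):
    """Indices of the k genes with the largest |t|; expr[g][s] = gene g, sample s."""
    ranked = sorted(range(len(expr)),
                    key=lambda g: abs(t_score(expr[g], y)),
                    reverse=True)
    return ranked[:k]
```

Because every gene is scored independently, this costs one pass over the data, whereas a wrapper has to retrain the classifier for each candidate subset.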

・Summary「In this work, a comparison between a group of different filter metrics and a wrapper sequential search procedure is carried out.」
・「Although the wrapper approach mainly shows a more accurate behavior than filter metrics, this improvement is coupled with considerable computer-load necessities.」
・Aim「By an extensive comparison with more popular filter techniques, we would like to make contributions in the expansion and study of the wrapper approach in this type of domains.」
・Issue「For most biological problems, information about the class (or type) of each cell-line exists: reflecting whether the tissue is diseased or healthy, the distinction of the specific tumor type, etc.」
・「To avoid this 'curse of dimensionality' [12], feature selection plays a crucial role in DNA microarray analysis.」
・Issue「Most of the supervised learning algorithms perform rather poorly when faced with many irrelevant or redundant (depending on the specific characteristics of the classifier) features.」
・Note「It must be noted that there are few coincidences in both datasets among the genes selected by the filter and wrapper approaches. It seems that the wrapper approach, by its multivariate selection search procedure, prefers genes which directly cause high accuracy levels in the induced classifiers. On the other hand, the filter approach does not directly take the predictive power of the genes into account, and it univariately selects the genes that are closely related with the class label. Thus, there are no large coincidences between the 'accurate' genes multivariately selected by the wrapper approach and the class-related genes univariately proposed by the filter metrics.」
・Future work「As future work, we envision to use new filter metrics which, by the use of statistical hypothesis tests, automatically fix the number of genes to induce the classifier. We also plan to use population-based, randomized search algorithms, such as genetic algorithms or estimation of distribution algorithms for the selection of discriminative genes in DNA microarray tasks:」

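・I am interested in the discrete-data technique of splitting expression values into three levels, {under-expressed, baseline, over-expressed}.
・As one might expect from authors based outside the English-speaking world, the English is easy to read.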

【Paper】Zhao, 2001, Statistical modeling of large microa～

2007-08-19 22:03:51 | Paper Notes

Lue Ping Zhao, Ross Prentice, and Linda Breeden
Statistical modeling of large microarray data sets to identify stimulus-response profiles
PNAS,May 8,2001,vol.98,no.10,5631-5636
[PDF][Web Site]

・Models the cell cycle from time-series yeast microarray data with a statistical technique, the Single-pulse model (SPM), and uses it to estimate how the cell cycle is affected when a stimulus is applied.
・Data: Yeast
1.Temperature-sensitive cdc28 mutation [Cho]
2.Alpha factor-mediated G₁ arrest [Spellman]
3.Temperature-sensitive cdc15 mutation [Spellman]
～～～～～～～
・I only "read" it and understood almost none of the content. Perhaps because I am not very familiar with yeast or with modeling time-series data?
