Ron Kohavi, George H. John
Wrappers for Feature Subset Selection
Artificial Intelligence (1996)
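・A summary of the wrapper method, one approach to feature subset selection in data mining. Covers the details of the algorithm, a discussion of its problems, and a performance comparison with the filter method.
・Data: 14 datasets (8 real / 6 artificial)
・Induction algorithms: 1. C4.5 (ID3), 2. Naive-Bayes
・Search engines: 1. Hill-climbing, 2. Best-first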
・Problem: "The problem of feature subset selection is that of finding a subset of the original features of a dataset, such that an induction algorithm that is run on data containing only these features generates a classifier with the highest possible accuracy."
・Problem: "In learning scenarios, however, we are faced with two problems: the learning algorithms are not given access to the underlying distribution, and most practical algorithms attempt to find a hypothesis by approximating NP-hard optimization problems."
・Note: "However, it is important to realize that relevance according to these definitions does not imply membership in the optimal feature subset, and that irrelevance does not imply that a feature cannot be in the optimal feature subset."
・Characteristic: "The main disadvantage of the filter approach is that it totally ignores the effects of the selected feature subset on the performance of the induction algorithm."
・"We shall investigate two hypotheses: first, that using a filter method will improve the accuracy of ID3 and Naive-Bayes on real datasets but will be fairly erratic (often hurting performance), and second, that improvements from using the wrapper approach will surpass the gains from the filter and will be more consistent."
・Results: "In summary, feature subset selection using the wrapper approach significantly improves ID3, C4.5 and Naive-Bayes on the datasets tested. On the real datasets, the wrapper approach is clearly superior to the filter method. Perhaps the most surprising result is how well Naive-Bayes performs on real datasets once discretization and feature subset selection are done."
・Issues: "These problems include: inability to remove a feature in symmetric target concepts such as m-of-n-3-7-10 where removal of one feature improves performance (Section 4), inability to include irrelevant features that may actually help performance (Example 3), and inability to remove correlated features that may hurt performance (Section 2.4)."
・This is an engineering paper, not a DNA-related one.
・There do not seem to be good Japanese equivalents for "Feature Subset Selection", "Wrapper", "Induction algorithms", or "Search Engine".
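
To make the wrapper idea in these notes concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes forward hill-climbing from the empty feature set, a small categorical Naive-Bayes learner with Laplace smoothing as the induction algorithm, and k-fold cross-validation as the accuracy estimate. All function names and the toy dataset are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def train_naive_bayes(rows, labels, subset):
    """Fit a categorical Naive-Bayes model using only the features in `subset`."""
    class_counts = Counter(labels)
    # value_counts[(feature, label)][value] = number of matching training rows
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for f in subset:
            value_counts[(f, label)][row[f]] += 1
    # Number of distinct values per feature, used for Laplace smoothing
    n_values = {f: len({row[f] for row in rows}) for f in subset}
    return class_counts, value_counts, n_values


def predict(model, row, subset):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        for f in subset:
            count = value_counts[(f, label)][row[f]]
            score += math.log((count + 1) / (class_counts[label] + n_values[f]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cv_accuracy(rows, labels, subset, k=4):
    """Estimate the accuracy of the learner on `subset` by k-fold cross-validation."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(rows)) if i % k != fold]
        test = [i for i in range(len(rows)) if i % k == fold]
        model = train_naive_bayes(
            [rows[i] for i in train], [labels[i] for i in train], subset
        )
        correct += sum(predict(model, rows[i], subset) == labels[i] for i in test)
    return correct / len(rows)


def hill_climb(rows, labels, k=4):
    """Forward hill-climbing: add one feature at a time while accuracy improves."""
    n_features = len(rows[0])
    current = ()
    current_score = cv_accuracy(rows, labels, current, k)
    while True:
        candidates = [
            tuple(sorted(current + (f,)))
            for f in range(n_features)
            if f not in current
        ]
        if not candidates:
            break
        scored = [(cv_accuracy(rows, labels, c, k), c) for c in candidates]
        best_score, best = max(scored, key=lambda sc: sc[0])
        if best_score <= current_score:
            break  # no child strictly improves the estimate, so stop
        current, current_score = best, best_score
    return list(current), current_score


# Toy data (an assumption for illustration): 4 binary features,
# and the label simply copies feature 0; features 1-3 are irrelevant.
rows = list(product([0, 1], repeat=4))
labels = [row[0] for row in rows]

selected, score = hill_climb(rows, labels)
print(selected, score)
```

The key point is that `cv_accuracy` treats the induction algorithm as a black box: the search only sees the estimated accuracy of each candidate subset, which is exactly what a filter method ignores. Swapping the stopping rule in `hill_climb` for a best-first queue would give the second search engine listed above.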