Susmita Datta and Somnath Datta
Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes
BMC Bioinformatics 2006, 7:397
・Introduces two measures for evaluating clustering methods, both drawing on knowledge from the Gene Ontology (GO) database: the biological homogeneity index (BHI) and the biological stability index (BSI).
・Data
1. Human breast cancer progression data, 258 genes, 4 normal/7 ductal carcinoma [Abba]
2. Yeast sporulation data, 513 genes [Chu]
・Clustering methods (note: all of them can be run in R)
1, 2. UPGMA (Pearson's correlation coefficient, Euclidean distance), an agglomerative hierarchical clustering algorithm
3, 4. Diana (Pearson's correlation coefficient, Euclidean distance)
5, 6. Fanny (Pearson's correlation coefficient, Euclidean distance)
7. K-means
8. SOM (Self-organizing maps)
9. Model-based clustering
10. SOTA (Self-organising tree algorithm)
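UPGMA, Diana and Fanny each appear twice in the list because they are run with two different dissimilarities. As a rough illustration of how the two can disagree, here is a minimal sketch in plain Python. Using 1 - r as the correlation-based dissimilarity is my assumption (a common convention), and the function names and toy profiles are hypothetical, not taken from the paper.

```python
import math


def pearson_dissimilarity(x, y):
    """Return 1 - Pearson correlation of two expression profiles.

    0 means perfectly positively correlated, 2 means perfectly anti-correlated.
    Assumes both profiles have non-zero variance (otherwise this divides by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Return the ordinary Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two toy genes with the same shape of profile but very different magnitudes.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

print(pearson_dissimilarity(gene_a, gene_b))
print(euclidean_distance(gene_a, gene_b))
```

The two toy genes are essentially identical under the correlation-based dissimilarity but far apart under Euclidean distance, which is one reason the same algorithm can produce different clusters in the two settings.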
・Problem: "One potential difficulty with this approach is that a quantitative conversion of biological attributes is needed (which may not be natural and may not preserve the information content)."
・Summary: "In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. [...] The second performance measure is called a biological stability index (BSI). For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. [...] We evaluated the performance of ten well known clustering algorithms using this dual measures approach on two gene expression data sets and identified the optimal algorithm in each case."
・Significance: "However, for clustering biological data such as the gene expression profiles, it would be reasonable to consider external measures that employ the existing biological knowledge (which can be taken as the 'ground truth')."
・Problem: "Such conclusions are inherently incomplete unless one can quantify the agreement between the clusters produced via the expression profiles and the biological classes because it is likely that many biologically unrelated genes will be grouped together as well."
・Advantages: "The proposed indices are easy to interpret and easy to implement. They are also useful in identifying the optimal clustering algorithm for a given data set in its ability to cluster biologically similar genes."
・I may need to study GO further. It is not obvious to me how annotation information can be plugged into a formula.
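On that last point, one way annotation can enter a formula is through an indicator: for each pair of annotated genes in a cluster, score 1 if they share at least one functional class and 0 otherwise, then average. The sketch below computes a BHI-style score along those lines, as far as I understand the definition. The function name, gene IDs and class labels are hypothetical, and skipping clusters with fewer than two annotated genes is my own assumption, not something confirmed from the paper.

```python
from itertools import combinations


def biological_homogeneity_index(clusters, annotations):
    """BHI-style score: mean over clusters of the fraction of annotated gene
    pairs that share at least one functional class.

    clusters    : list of lists of gene IDs
    annotations : dict mapping gene ID -> set of functional classes
    """
    per_cluster = []
    for cluster in clusters:
        # Only genes with at least one annotation take part in the comparison.
        annotated = [g for g in cluster if annotations.get(g)]
        n = len(annotated)
        if n < 2:
            # Assumption: clusters without a comparable pair are skipped.
            continue
        # Indicator: 1 when the two annotation sets intersect, else 0.
        agreeing = sum(
            1 for x, y in combinations(annotated, 2)
            if annotations[x] & annotations[y]
        )
        per_cluster.append(agreeing / (n * (n - 1) / 2))
    return sum(per_cluster) / len(per_cluster) if per_cluster else 0.0


# Toy example with made-up genes and classes.
annotations = {
    "g1": {"cell cycle"},
    "g2": {"cell cycle", "DNA repair"},
    "g3": {"DNA repair"},
    "g4": {"sporulation"},
    "g5": {"transport"},
}
clusters = [["g1", "g2", "g3"], ["g4", "g5"]]

bhi = biological_homogeneity_index(clusters, annotations)
print(bhi)
```

In the toy data the first cluster has two agreeing pairs out of three and the second has none, so the score lands between 0 (no shared classes anywhere) and 1 (every annotated pair in every cluster shares a class). No numeric conversion of the GO terms themselves is needed; only set membership is used.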