evalclusters

Evaluate clustering solutions

Syntax

  • eva = evalclusters(x,clust,criterion)
  • eva = evalclusters(x,clust,criterion,Name,Value)

Description

eva = evalclusters(x,clust,criterion) creates a clustering evaluation object containing data used to evaluate the optimal number of data clusters.

eva = evalclusters(x,clust,criterion,Name,Value) creates a clustering evaluation object using additional options specified by one or more name-value pair arguments.

Examples

Evaluate the Clustering Solution Using Calinski-Harabasz Criterion

Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.

Load the sample data.

load fisheriris;

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion. Cluster the data using kmeans.

rng('default');  % For reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',[1:6])
eva = 

  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [1x6 double]
           OptimalK: 3

The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.
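
As a short follow-up sketch, the criterion value for each tested number of clusters can be read from the CriterionValues property shown in the display above, and the evaluation object's plot method graphs those values against K. The layout of the printed table is only one illustrative choice.

```matlab
load fisheriris;            % provides the 150-by-4 data matrix meas
rng('default');             % for reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:6);

% Pair each tested K (first row) with its criterion value (second row).
% The value for K = 1 is NaN because the Calinski-Harabasz index is
% undefined for a single cluster.
disp([eva.InspectedK; eva.CriterionValues]);

% Plot the criterion values against the number of clusters.
plot(eva);
```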

Evaluate a Matrix of Clustering Solutions

Use an input matrix of proposed clustering solutions to evaluate the optimal number of clusters.

Load the sample data.

load fisheriris;

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Use kmeans to create an input matrix of proposed clustering solutions for the measurement data, using 1, 2, 3, 4, 5, and 6 clusters.

clust = zeros(size(meas,1),6);  % One column per clustering solution
for i = 1:6
    clust(:,i) = kmeans(meas,i,'EmptyAction','singleton',...
        'Replicates',5);
end

Each row of clust corresponds to one observation in meas. Each of the six columns corresponds to a clustering solution containing 1 to 6 clusters.

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion.

eva = evalclusters(meas,clust,'CalinskiHarabasz')
eva = 

  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.7658 459.5058 473.6577]
           OptimalK: 3

The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

Specify Clustering Algorithm with a Function Handle

Use a function handle to specify the clustering algorithm, then evaluate the optimal number of clusters.

Load the sample data.

load fisheriris;

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Use a function handle to specify the clustering algorithm.

myfunc = @(X,K) kmeans(X,K,'EmptyAction','singleton',...
    'Replicates',5);

Evaluate the optimal number of clusters for the measurement data using the Calinski-Harabasz criterion.

eva = evalclusters(meas,myfunc,'CalinskiHarabasz',...
    'klist',[1:6])
eva = 

  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.7658 459.5058 473.6577]
           OptimalK: 3

The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

Input Arguments

x — Input data
matrix

Input data, specified as an N-by-P matrix. N is the number of observations, and P is the number of variables.

Data Types: single | double

clust — Clustering algorithm
'kmeans' | 'linkage' | 'gmdistribution' | matrix of clustering solutions | function handle

Clustering algorithm, specified as one of the following.

  • 'kmeans': Cluster the data in x using the kmeans clustering algorithm, with 'EmptyAction' set to 'singleton' and 'Replicates' set to 5.
  • 'linkage': Cluster the data in x using the clusterdata agglomerative clustering algorithm, with 'Linkage' set to 'ward'.
  • 'gmdistribution': Cluster the data in x using the gmdistribution Gaussian mixture distribution algorithm, with 'SharedCov' set to true and 'Replicates' set to 5.

If Criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can specify a clustering algorithm using the function_handle (@) operator. The function must be of the form C = clustfun(DATA,K), where DATA is the data to be clustered, and K is the number of clusters. The output of clustfun must be one of the following:

  • A vector of integers representing the cluster index for each observation in DATA. There must be K unique values in this vector.

  • A numeric n-by-K matrix of scores for n observations and K classes. In this case, the cluster index for each observation is determined by taking the largest score value in each row.

If Criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can also specify clust as an n-by-K matrix containing the proposed clustering solutions. n is the number of observations in the sample data, and K is the number of proposed clustering solutions. Column j contains the cluster indices for each of the n points in the jth clustering solution.
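
The function-handle form can be sketched as follows. This example assumes a handle that wraps clusterdata with average linkage and returns a vector of cluster indices. The handle name avgLinkage and the choice of the Davies-Bouldin criterion are illustrative only.

```matlab
load fisheriris;            % provides the 150-by-4 data matrix meas

% Hypothetical handle of the form C = clustfun(DATA,K). It returns one
% cluster index per observation, using average-linkage hierarchical
% clustering cut at K clusters.
avgLinkage = @(X,K) clusterdata(X,'Linkage','average','MaxClust',K);

% A function handle requires KList, because evalclusters calls the
% handle once for each listed number of clusters.
eva = evalclusters(meas,avgLinkage,'DaviesBouldin','KList',2:6);
disp(eva.OptimalK);
```

Because the handle receives only DATA and K, any other options of the underlying algorithm have to be fixed inside the anonymous function.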

criterion — Clustering evaluation criterion
'CalinskiHarabasz' | 'DaviesBouldin' | 'gap' | 'silhouette'

Clustering evaluation criterion, specified as one of the following.

  • 'CalinskiHarabasz': Create a CalinskiHarabaszEvaluation clustering evaluation object containing Calinski-Harabasz index values.
  • 'DaviesBouldin': Create a DaviesBouldinEvaluation cluster evaluation object containing Davies-Bouldin index values.
  • 'gap': Create a GapEvaluation cluster evaluation object containing gap criterion values.
  • 'silhouette': Create a SilhouetteEvaluation cluster evaluation object containing silhouette values.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'KList',[1:5],'Distance','cityblock' specifies to test 1, 2, 3, 4, and 5 clusters using the sum of absolute differences distance measure.

For All Criteria

'KList' — List of number of clusters to evaluate
vector

List of number of clusters to evaluate, specified as the comma-separated pair consisting of 'KList' and a vector of positive integer values. You must specify KList when clust is a clustering algorithm name string or a function handle. When criterion is 'gap', clust must be a string or a function handle, and you must specify KList.

Example: 'KList',[1:6]

For Silhouette and Gap

'Distance' — Distance metric
'sqEuclidean' (default) | 'Euclidean' | 'cityblock' | vector | function | ...

Distance metric used for computing the criterion values, specified as the comma-separated pair consisting of 'Distance' and one of the following.

  • 'sqEuclidean': Squared Euclidean distance
  • 'Euclidean': Euclidean distance
  • 'cityblock': Sum of absolute differences
  • 'cosine': One minus the cosine of the included angle between points (treated as vectors)
  • 'correlation': One minus the sample correlation between points (treated as sequences of values)
  • 'Hamming': Percentage of coordinates that differ
  • 'Jaccard': Percentage of nonzero coordinates that differ

For detailed information about each distance metric, see pdist.

You can also specify a function for the distance metric by using the function_handle (@) operator. The distance function must be of the form d2 = distfun(XI,XJ), where XI is a 1-by-n vector corresponding to a single row of the input matrix X, and XJ is an m2-by-n matrix corresponding to multiple rows of X. distfun must return an m2-by-1 vector of distances d2, whose kth element is the distance between XI and XJ(k,:).

If Criterion is 'silhouette', you can also specify Distance as the output vector created by the function pdist.

When Clust is a string representing a built-in clustering algorithm, evalclusters uses the distance metric specified for Distance to cluster the data, except for the following:

  • If Clust is 'linkage', and Distance is either 'sqEuclidean' or 'Euclidean', then the clustering algorithm uses Euclidean distance and Ward linkage.

  • If Clust is 'linkage' and Distance is any other metric, then the clustering algorithm uses the specified distance metric and average linkage.

In all other cases, the distance metric specified for Distance must match the distance metric used in the clustering algorithm to obtain meaningful results.

Example: 'Distance','Euclidean'
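
A minimal sketch of the 'Distance' option with the silhouette criterion follows. Because clust is the built-in 'kmeans' algorithm, the same city block metric is used both to cluster the data and to compute the silhouette values, as described above. The range 2:6 for KList is an arbitrary choice for illustration.

```matlab
load fisheriris;            % provides the 150-by-4 data matrix meas
rng('default');             % for reproducibility

% Cluster with kmeans and score each solution with silhouette values,
% both using the sum of absolute differences as the distance.
eva = evalclusters(meas,'kmeans','silhouette','KList',2:6, ...
    'Distance','cityblock');
disp(eva.OptimalK);
```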

For Silhouette Only

'ClusterPriors' — Prior probabilities for each cluster
'empirical' (default) | 'equal'

Prior probabilities for each cluster, specified as the comma-separated pair consisting of 'ClusterPriors' and one of the following.

  • 'empirical': Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the overall silhouette value proportionally to its size.
  • 'equal': Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Each cluster contributes equally to the overall silhouette value, regardless of its size.

Example: 'ClusterPriors','empirical'
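
To see how the two settings differ in practice, the following sketch evaluates the same data under both priors and prints the criterion values side by side. The variable names evaEmp and evaEq are illustrative, and the random seed is reset before each call so that both runs start kmeans from the same state.

```matlab
load fisheriris;            % provides the 150-by-4 data matrix meas

rng('default');             % same random state for both evaluations
evaEmp = evalclusters(meas,'kmeans','silhouette','KList',2:6, ...
    'ClusterPriors','empirical');

rng('default');
evaEq = evalclusters(meas,'kmeans','silhouette','KList',2:6, ...
    'ClusterPriors','equal');

% Row 1: size-weighted averages. Row 2: equal weight per cluster.
disp([evaEmp.CriterionValues; evaEq.CriterionValues]);
```

The two rows tend to diverge most when the cluster sizes are unbalanced, since small clusters count for more under 'equal'.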

For Gap Only

'B' — Number of reference data sets
100 (default) | positive integer value

Number of reference data sets generated from the reference distribution ReferenceDistribution, specified as the comma-separated pair consisting of 'B' and a positive integer value.

Example: 'B',150

'ReferenceDistribution' — Reference data generation method
'PCA' (default) | 'uniform'

Reference data generation method, specified as the comma-separated pair consisting of 'ReferenceDistribution' and one of the following.

  • 'PCA': Generate reference data from a uniform distribution over a box aligned with the principal components of the data matrix x.
  • 'uniform': Generate reference data uniformly over the range of each feature in the data matrix x.

Example: 'ReferenceDistribution','uniform'
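
The gap-specific options can be combined in one call, as in this sketch. The value 50 for 'B' is an arbitrary choice that shortens the run time relative to the default of 100. Fewer reference data sets generally make the gap estimates noisier.

```matlab
load fisheriris;            % provides the 150-by-4 data matrix meas
rng('default');             % for reproducibility

% Gap criterion with 50 reference data sets drawn uniformly over the
% range of each feature. KList is required when criterion is 'gap'.
eva = evalclusters(meas,'kmeans','gap','KList',1:6, ...
    'B',50,'ReferenceDistribution','uniform');
disp(eva.OptimalK);
```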

'SearchMethod' — Method for selecting optimal number of clusters
'globalMaxSE' (default) | 'firstMaxSE'

Method for selecting the optimal number of clusters, specified as the comma-separated pair consisting of 'SearchMethod' and one of the following.

  • 'globalMaxSE': Evaluate each proposed number of clusters in KList and select the smallest number of clusters satisfying

        Gap(K) ≥ GAPMAX − SE(GAPMAX)

    where K is the number of clusters, Gap(K) is the gap value for the clustering solution with K clusters, GAPMAX is the largest gap value, and SE(GAPMAX) is the standard error corresponding to the largest gap value.
  • 'firstMaxSE': Evaluate each proposed number of clusters in KList and select the smallest number of clusters satisfying

        Gap(K) ≥ Gap(K + 1) − SE(K + 1)

    where K is the number of clusters, Gap(K) is the gap value for the clustering solution with K clusters, and SE(K + 1) is the standard error of the clustering solution with K + 1 clusters.

Example: 'SearchMethod','globalMaxSE'
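
The two search methods can be illustrated on a small set of made-up numbers. The sketch below assumes the selection rules are Gap(K) ≥ GAPMAX − SE(GAPMAX) for 'globalMaxSE' and Gap(K) ≥ Gap(K + 1) − SE(K + 1) for 'firstMaxSE'. The gap and standard-error values are hypothetical, not output from evalclusters, and the code is not the toolbox's internal implementation.

```matlab
% Hypothetical gap values and standard errors for K = 1,...,5.
klist = 1:5;
gap   = [0.10 0.40 0.42 0.60 0.62];
se    = [0.02 0.03 0.05 0.04 0.05];

% globalMaxSE: smallest K whose gap is within one standard error
% (taken at the maximum) of the largest gap value.
[gapMax, iMax] = max(gap);
kGlobal = klist(find(gap >= gapMax - se(iMax), 1, 'first'));

% firstMaxSE: smallest K whose gap is at least the next gap value
% minus the next standard error. kFirst is empty if no K qualifies.
iFirst = find(gap(1:end-1) >= gap(2:end) - se(2:end), 1, 'first');
kFirst = klist(iFirst);

fprintf('globalMaxSE picks K = %d, firstMaxSE picks K = %d\n', ...
    kGlobal, kFirst);
```

With these numbers the two rules disagree: the first rule waits until the gap curve is close to its overall maximum, while the second stops at the first local plateau.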

Output Arguments

eva — Clustering evaluation data
clustering evaluation object

Clustering evaluation data, returned as a clustering evaluation object.
