KMEANS - Routines for the K-Means Problem
KMEANS is a collection of routines for the K-Means problem.
In the K-Means problem, a set of N points X(I) in M-dimensions is
given. The goal is to arrange these points into K clusters,
with each cluster having a representative point Z(J), usually
chosen as the centroid of the points in the cluster. The energy
of each cluster is
E(J) = Sum ( all points X(I) in cluster J ) || X(I) - Z(J) ||^2
For a given set of clusters, the total energy is then simply
the sum of the cluster energies E(J). The goal is to choose the
clusters in such a way that the total energy is minimized.
Usually, a point X(I) goes into the cluster with the closest
representative point Z(J). So to define the clusters, it's
enough simply to specify the locations of the cluster representatives.
This is actually a fairly hard problem. Most algorithms do
reasonably well, but cannot guarantee that the best solution
has been found. It is very common for algorithms to get
stuck at a solution which is merely a "local minimum".
For such a local minimum, every slight rearrangement of
the solution makes the energy go up; however a major
rearrangement would result in a big drop in energy.
A simple algorithm for the problem is known as "HMEANS".
It alternates between two procedures:
-
Using the given cluster centers, assign each point to the
cluster with the nearest center;
-
Using the given cluster assignments, replace each cluster
center by the centroid or average of the points in the cluster.
These steps are repeated until no points are moved, or some
other termination criterion is reached.
A more sophisticated algorithm, known as "KMEANS", takes advantage
of the fact that it is possible to quickly determine the decrease in
energy caused by moving a point from its current cluster to another.
It repeats the following procedure:
-
For each point, move it to another cluster if that would lower
the energy.
This procedure is repeated until no points are moved, or some
other termination criterion is reached.
-
Reference 1:
-
J A Hartigan, M A Wong,
Algorithm AS 136: A K-Means Clustering Algorithm,
Applied Statistics,
Volume 28, Number 1, 1979, pages 100-108.
-
Reference 2:
-
Wendy Martinez and Angel Martinez,
Computational Statistics Handbook with MATLAB,
pages 373-376,
Chapman and Hall / CRC, 2002.
-
Reference 3:
-
D N Sparks,
Algorithm AS 58: Euclidean Cluster Analysis,
Applied Statistics,
Volume 22, Number 1, 1973,
pages 126-130.
Files you may copy include:
-
kmeans.f90, the source code.
-
kmeans_prb.f90, a sample problem.
-
kmeans_prb.out, sample problem output.
-
points_100.xy, 100 2D points,
used as a case study by the sample problem. (Can be
plotted using
PLOT_POINTS).
-
test01_clusters.xy, the
100 point data set, clustered by test #1 in the sample problem.
(Can be plotted using
PLOT_POINTS)
-
test01_clusters.eps, a
PostScript image of the clusters.
-
asa_58.f, the original text of the Sparks
Applied Statistics Algorithm #58.
-
asa_136.f, the original text of the
Hartigan and Wong Applied Statistics Algorithm #136.
The list of routines includes:
-
CH_CAP capitalizes a single character.
-
CH_EQI is a case insensitive comparison of two characters for equality.
-
CH_TO_DIGIT returns the integer value of a base 10 digit.
-
CLUSTER_ENERGY_COMPUTE computes the energy of the clusters.
-
CLUSTER_INITIALIZE_1 initializes the clusters to data points.
-
CLUSTER_INITIALIZE_2 initializes the cluster centers to random values.
-
CLUSTER_INITIALIZE_3 initializes the cluster centers to random values.
-
CLUSTER_INITIALIZE_4 initializes the cluster centers to random values.
-
CLUSTER_INITIALIZE_5 initializes the cluster centers to random values.
-
CLUSTER_PRINT_SUMMARY prints a summary of data about a clustering.
-
CLUSTER_WRITE writes points and cluster centers to a file.
-
FILE_COLUMN_COUNT counts the number of columns in the first line of a file.
-
GET_UNIT returns a free FORTRAN unit number.
-
HMEANS_01 is a simple implementation of the H-Means algorithm.
-
I_RANDOM returns a random integer in a given range.
-
KMEANS_01 attempts to solve the K-Means problem.
-
KMEANS_02 seeks a solution of the K-Means problem.
-
KMEANS_03 seeks a solution of the K-Means problem.
-
KMEANS_03_OPTRA carries out the optimal transfer stage.
-
KMEANS_03_QTRAN carries out the quick transfer stage.
-
KMEANS_04 tries to improve an initial K-Means partition of points.
-
POINTS_COUNT counts the valid point coordinates in a file.
-
POINTS_READ reads point coordinates from a file.
-
R_RANDOM returns a random real in a given range.
-
RANDOM_INITIALIZE initializes the FORTRAN 90 random number seed.
-
S_TO_R reads a real number from a string.
-
S_TO_RVEC reads a real vector from a string.
-
TIMESTAMP prints the current YMDHMS date as a time stamp.
-
WORD_COUNT counts the number of "words" in a string.
Return to the FORTRAN software page.
Last revised on 21 March 2002.