# CLUSTER

```
CLUSTER is a sublibrary of Fortran subroutines for cluster analysis
and related line printer graphics.  It includes routines for clustering
variables and/or observations using algorithms such as direct joining
and splitting, Fisher's exact optimization, single-link, K-means, and
minimum mutation fits, and routines for estimating missing values.

The subroutines in CLUSTER are described in the book "CLUSTERING
ALGORITHMS" by J. A. Hartigan.  Major modifications that have been
made to the original code found in the book include:

The elimination of Hartigan's "bordered matrix".  The data matrix
has been placed in a separate matrix, the borders have been placed in
their own character vectors, and the title has been placed in its own
character variable.

The amalgamation of extra vectors and matrices into work vectors
and matrices for shorter call sequences.

The incorporation of setup and plotting subroutines into the primary
subroutines, rather than their being called explicitly from the user's
main program.  Results are automatically produced on a user-specified
unit.

A cluster is a set of similar objects.  One principal function of
clustering is to summarize data by referring to properties of clusters
rather than to properties of individual objects.  If some objects in a
cluster have a certain property, other objects in the cluster will be
expected to have the same property.

The standard data structure of the input is a set of observations
on each of which a number of variables is measured.  Some routines
require the values to be categorical, while others allow continuous
data.

Two clustering structures that are used are the partition and the
tree.  A partition is a family of clusters such that each observation
lies in exactly one member of the partition.  A tree is a family of
clusters, which includes the set of all observations, for which any
two clusters are disjoint or one includes the other.  The structure can
be drawn as a tree with clusters as the nodes, where the direct ancestor
of a node is the smallest cluster that properly contains it, and the cluster of
all observations is at the root.

The types of clustering algorithms used are:

Sorting --   An important variable is chosen somehow, and the observations
are partitioned according to the values taken by this
variable.  Within each of the clusters of the partition,
further partitioning takes place according to further
important variables.  Example: SPARSE

Switching -- An initial partition of the observations is given, and new
partitions are obtained by switching an observation from one
cluster to another, with the algorithm terminating when no
further switch improves some criterion.  Examples:  BUILD,
MIX, and DITTO

Joining   -- Initially each observation is a cluster and the closest pair
of clusters are joined to form a new cluster, continuing
this step until a single cluster containing all the original
observations is obtained.  Examples:  SLINK, JOIN, JOIN2,
and DOT

Splitting -- Initially all observations are one cluster and then a cluster
is chosen according to some criterion and split into
smaller clusters.  Examples:  SPLIT1, SPLIT2, and MIXIND

Adding    -- An initial set of clusters is given, and each object is
added to the closest cluster by some criterion in turn.
Examples:  QUICK, LETREE, SEARCH, and LINK

Searching -- Search over a subset of all possible clusters for the optimal
one.  Example:  FISH

CLUSTERING ROUTINES
-------------------

Unconstrained

Partitions (non-nested)

QUICK    produces a quick partition based on a threshold

BUILD    creates clusters by K-means algorithm

MIX      creates clusters by fitting a maximum likelihood criterion
to the mixture model under various constraints on the
within-cluster covariances

MIXIND   creates clusters by fitting the mixture model where the
variables have constant variances and zero within-cluster
covariances

DITTO    clusters category data to maximize the matches between
observations in a cluster and the cluster mode

SPLIT1   clusters observations within each variable by splitting the
cluster with the largest variance at each iteration

SPLIT2   clusters the data matrix by splitting both observations and
variables in the cluster with the largest variance at each
iteration.  Contains user-controlled constraint options

JOIN2    clusters the data matrix by joining either the observations
or the  variables of the two closest clusters at each iteration


Trees (nested)

LETREE   creates tree of clusters based on a set of threshold variances

SEARCH   creates clusters by finding the tree which best predicts the data

JOIN     creates tree of clusters of observations by joining the two
closest clusters at each iteration

DOT      creates tree of clusters of category data using minimum
mutation fits to the variables

Constrained

FISH     partitions sequence of observations into subsequences by
Fisher's method of exact optimization

SPLIT2   clusters the data matrix by splitting both observations and
variables in the cluster with the largest variance at each
iteration.  Contains user-controlled constraint options

The cluster approach to prediction is to partition the data into
clusters according to the variable to be predicted.  An observation is
assigned to a cluster, and the value associated with that cluster is
used as the predicted value of that variable for the observation.

PREDICTION ROUTINES
-------------------

AID      uses the automatic interaction detection technique to split
data to best predict a variable for an existing observation

VARCO    applies the variance components algorithm to a set of clusters
to assign values for the mean and variance of the predicted
variable for each cluster

Profiles can be used as an inexpensive method of suggesting possible
clusters, or they may suggest reasonable weights for variables.  Profiles
may be best described as histograms on each variable, connected between
variables by identifying observations.

One model for clusters in multivariate data is that the data are
sampled from a density with many modes, one mode for each cluster.
Methods of estimating multivariate densities may therefore be converted
to clustering techniques.  Univariate, bivariate, and multivariate
histograms are useful in suggesting modes.

Another approach is to represent each observation as a three-dimensional
box.  If the observations are ordered by some other clustering technique,
the similarity of neighboring boxes suggests or denies clusters.

STATISTICAL PLOTTING ROUTINES
-----------------------------

PROF     constructs linearly optimal profiles of the variables

RANK     constructs rank profiles of the variables

LINE     constructs line profiles of the cases

HIST     constructs a univariate histogram

MODAL    constructs a bivariate block histogram for a pair of variables

MHIST    constructs multivariate block histograms

BOXES    constructs multivariate plots of the observations in the form
of three-dimensional boxes

PLOT     constructs scatter plot of a sequence of paired variables

PMANY    constructs a matrix of pairwise plots of the variables

MISCELLANEOUS ROUTINES
----------------------

TWO      computes overall means and covariances, replacing missing values
by the overall mean

WCOV     computes within-cluster covariances

DIST     computes Euclidean distance between two observations or variables

WDIST    computes Mahalanobis distance between two observations

MISS     estimates missing values by the cluster mean

STAND    standardizes each variable to have zero mean and unit variance

SCALE    discretizes the data into classes

ASSIGN   assigns an observation to its closest cluster

RELOC    sets each cluster center equal to the cluster mean

INVERT   computes the inverse of a covariance matrix and its determinant

IN       reads a data matrix with appropriate labels

OUT      outputs a data matrix with appropriate labels

NOTE:  "The subroutines in 'Clustering Algorithms' were written as
research tools.  Speed of execution and storage requirements
are not necessarily optimized.  These subroutines should NOT
be used routinely to analyze large data sets."

Programs giving examples of each of the above routines (except IN
which every sample program uses) exist in the file SAMPLE under the
CMLIB account.  The commands

ATTACH,SAMPLE/UN=CMLIB.
GTR,SAMPLE,<lfn>.<subroutine>.

will place a program which calls <subroutine> into the local file <lfn>.
All sample programs use the CLUSTER subroutine IN which expects input
data on Fortran unit 5 (see documentation of IN for format of the input
data file), and places output on a user-specified unit.  The sample
data set that was used in each sample program can be obtained with the
commands:

ATTACH,SMPIN/UN=CMLIB.
GTR,SMPIN,TAPE5.<subroutine>.

All tests were run on the data set in the SMPIN file.  The
corresponding output can be found in the SMPOUT file.  The commands

ATTACH,SMPOUT/UN=CMLIB.
GTR,SMPOUT,<lfn>.<subroutine>.

will place the output from the sample program which calls <subroutine>
into the local file <lfn>.

Parts of the above text were taken from the following (the last two
are available from Greg Rhoads of the Scientific Computing Division
FTS 921-3395).

Hartigan, J. A. (1975).  Clustering Algorithms, John Wiley &
Sons, Inc., New York.

Hartigan, J. A. (1975).  Printer Graphics for Clustering.  Journal of
Statistical Computation and Simulation, Volume 4, Pages 187-213.

Dallal, Gerard E. A User's Guide to J. A. Hartigan's "Clustering
Algorithms".  Yale University.

```

Last updated: August 18, 1996