CLUSTER

 
    CLUSTER is a sublibrary of Fortran subroutines for cluster analysis
  and related line printer graphics.  It includes routines for
  clustering variables and/or observations using algorithms such as
  direct joining and splitting, Fisher's exact optimization,
  single-link, K-means, and minimum mutation fits, as well as routines
  for estimating missing values.
 
    The subroutines in CLUSTER are described in the book "CLUSTERING
  ALGORITHMS" by J. A. Hartigan.  Major modifications that have been
  made to the original code found in the book include:
 
       The elimination of Hartigan's "bordered matrix".  The data have
     been placed in a separate matrix, the borders in their own
     character vectors, and the title in its own character variable.
 
       The amalgamation of extra vectors and matrices into work vectors
     and matrices for shorter call sequences.
 
       The incorporation of setup and plotting subroutines into the
     primary subroutines, rather than having them called explicitly
     from the user's main program.  Results are automatically produced
     on a user-specified unit.
 
 
    A cluster is a set of similar objects.  One principal function of
  clustering is to summarize data by referring to properties of clusters
  rather than to properties of individual objects.  If some objects in a
  cluster have a certain property, other objects in the cluster will be
  expected to have the same property.
 
     The standard data structure of the input is a set of observations
   on each of which a number of variables is measured.  Some routines
   require the values to be categorical; others allow continuous data.
 
    Two clustering structures that are used are the partition and the
  tree.  A partition is a family of clusters such that each observation
  lies in exactly one member of the partition.  A tree is a family of
  clusters, which includes the set of all observations, for which any
  two clusters are disjoint or one includes the other.  The structure can
  be drawn as a tree with clusters as the nodes, where the direct ancestor
  of a node is the smallest cluster that contains it, and the cluster of
  all observations is at the root.
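
     For example, over three observations A, B, and C, the family
   {A}, {B,C} is a partition, while {A}, {B}, {C}, {B,C}, {A,B,C} is
   a tree.  The sketch below (an illustration only, not part of
   CLUSTER, with hypothetical names) stores clusters as rows of a
   logical membership matrix and checks the tree condition:

C     ILLUSTRATIVE SKETCH, NOT PART OF CLUSTER: CLUSTERS OVER N
C     OBSERVATIONS ARE STORED AS ROWS OF A LOGICAL MEMBERSHIP
C     MATRIX C(K,N).  THE FAMILY IS A TREE IF EVERY TWO CLUSTERS
C     ARE DISJOINT OR ONE CONTAINS THE OTHER.
      PROGRAM TREECK
      INTEGER K, N, I, J, M
      PARAMETER (K=5, N=3)
      LOGICAL C(K,N), ISTREE, BOTH, IONLY, JONLY
C     THE TREE {A}, {B}, {C}, {B,C}, {A,B,C} ON OBSERVATIONS A,B,C
      DATA C / .TRUE., .FALSE., .FALSE., .FALSE., .TRUE.,
     *         .FALSE., .TRUE., .FALSE., .TRUE., .TRUE.,
     *         .FALSE., .FALSE., .TRUE., .TRUE., .TRUE. /
      ISTREE = .TRUE.
      DO 30 I = 1, K-1
         DO 20 J = I+1, K
            BOTH  = .FALSE.
            IONLY = .FALSE.
            JONLY = .FALSE.
            DO 10 M = 1, N
               IF (C(I,M) .AND. C(J,M))      BOTH  = .TRUE.
               IF (C(I,M) .AND. .NOT.C(J,M)) IONLY = .TRUE.
               IF (C(J,M) .AND. .NOT.C(I,M)) JONLY = .TRUE.
   10       CONTINUE
C           OVERLAP WITHOUT CONTAINMENT VIOLATES THE TREE CONDITION
            IF (BOTH .AND. IONLY .AND. JONLY) ISTREE = .FALSE.
   20    CONTINUE
   30 CONTINUE
      WRITE (6,*) 'FAMILY IS A TREE: ', ISTREE
      END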
 
    The types of clustering algorithms used are:
 
  Sorting --   An important variable is chosen by some criterion, and
                  the observations are partitioned according to the
                  values taken by this variable.  Within each of the
                  clusters of the partition, further partitioning takes
                  place according to additional important variables.
                  Example:  SPARSE
 
  Switching -- An initial partition of the observations is given, and
                  new partitions are obtained by switching an
                  observation from one cluster to another, with the
                  algorithm terminating when no further switch improves
                  some criterion.  Examples:  BUILD, MIX, and DITTO
 
  Joining   -- Initially each observation is a cluster, and the closest
                  pair of clusters is joined to form a new cluster;
                  this step continues until a single cluster containing
                  all the original observations is obtained (see the
                  sketch following this list).  Examples:  SLINK, JOIN,
                  JOIN2, and DOT
 
  Splitting -- Initially all observations are one cluster and then a cluster
                  is chosen according to some criterion and split into
                  smaller clusters.  Examples:  SPLIT1, SPLIT2, and MIXIND
 
  Adding    -- A clustering structure (partition or tree) already
                  exists, and each object in turn is added to the
                  closest cluster by some criterion.  Examples:  QUICK,
                  LETREE, SEARCH, and LINK
 
  Searching -- A search is made over a subset of all possible clusters
                  for the optimal one.  Example:  FISH
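
     As an illustration of the joining strategy, the sketch below
   builds a single-linkage tree of the kind SLINK produces:  at each
   step the closest pair of observations lying in distinct clusters
   is found, and their clusters are merged.  This is a hypothetical
   sketch, not the calling sequence of any CLUSTER routine:

C     ILLUSTRATIVE SKETCH OF THE JOINING STRATEGY (SINGLE LINKAGE);
C     NOT THE CALLING SEQUENCE OF ANY CLUSTER ROUTINE.
      PROGRAM SLJOIN
      INTEGER N, I, J, STEP, IBEST, JBEST
      PARAMETER (N=4)
      REAL X(N), DBEST
      INTEGER LABEL(N)
C     ONE-DIMENSIONAL DATA FOR BREVITY
      DATA X / 0.0, 0.2, 5.0, 5.3 /
C     START WITH EACH OBSERVATION IN ITS OWN CLUSTER
      DO 10 I = 1, N
         LABEL(I) = I
   10 CONTINUE
C     N-1 MERGES BUILD THE FULL TREE
      DO 50 STEP = 1, N-1
         DBEST = 1.0E30
C        FIND THE CLOSEST PAIR OF OBSERVATIONS LYING IN DISTINCT
C        CLUSTERS; MERGING THEIR CLUSTERS IS SINGLE LINKAGE
         DO 30 I = 1, N-1
            DO 20 J = I+1, N
               IF (LABEL(I) .NE. LABEL(J) .AND.
     *             ABS(X(I)-X(J)) .LT. DBEST) THEN
                  DBEST = ABS(X(I)-X(J))
                  IBEST = LABEL(I)
                  JBEST = LABEL(J)
               END IF
   20       CONTINUE
   30    CONTINUE
C        MERGE BY RELABELING MEMBERS OF CLUSTER JBEST AS IBEST
         DO 40 I = 1, N
            IF (LABEL(I) .EQ. JBEST) LABEL(I) = IBEST
   40    CONTINUE
         WRITE (6,*) 'STEP', STEP, ': MERGED AT DISTANCE', DBEST
   50 CONTINUE
      END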
 
  CLUSTERING ROUTINES
  -------------------
 
  Unconstrained
 
    Partitions (non-nested)
 
      QUICK    produces a quick partition based on a threshold
 
      BUILD    creates clusters by the K-means algorithm
 
      MIX      creates clusters by fitting a maximum likelihood criterion
               to the mixture model under various constraints on the
               within-cluster covariances
 
      MIXIND   creates clusters by fitting the mixture model where the
               variables have constant variances and zero within-cluster
               covariances
 
      DITTO    clusters category data to maximize the matches between
               observations in a cluster and the cluster mode
 
      SPLIT1   clusters observations within each variable by splitting the
               cluster with the largest variance at each iteration
 
      SPLIT2   clusters the data matrix by splitting both observations and
               variables in the cluster with the largest variance at each
               iteration.  Contains user-controlled constraint options
 
      JOIN2    clusters the data matrix by joining either the observations
               or the variables of the two closest clusters at each
               iteration
 
      SPARSE   clusters variables to approximate the loading matrix for factor
               analysis
 
    Trees (nested)
 
      LETREE   creates tree of clusters based on a set of threshold variances
 
      SEARCH   creates clusters by finding the tree which predicts the most
               correct triads (expensive to run)
 
      SLINK    creates tree of clusters by single-linkage algorithm
 
      JOIN     creates tree of clusters of observations by joining the two
               closest clusters at each iteration
 
      DOT      creates tree of clusters of category data using minimum
               mutation fits to the variables
 
      LINK     creates tree by adding observations which minimize the sum of
               the link distances
 
  Constrained
 
      FISH     partitions sequence of observations into subsequences by
               Fisher's method of exact optimization
 
      SPLIT2   clusters the data matrix by splitting both observations and
               variables in the cluster with the largest variance at each
               iteration.  Contains user-controlled constraint options
 
 
    The cluster approach to prediction is to partition the data into
  clusters according to the variable to be predicted.  An observation
  is then assigned to a cluster, and the value associated with that
  cluster is used as the predicted value of the variable for that
  observation.
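
     A minimal sketch of this idea follows (hypothetical code, not
   the interface of AID or VARCO):  each cluster carries a center on
   the predictor variable and a mean of the variable to be predicted,
   and an observation receives the mean of the cluster whose center
   is nearest:

C     ILLUSTRATIVE SKETCH OF PREDICTION BY CLUSTERS; NOT THE
C     INTERFACE OF AID OR VARCO.  EACH CLUSTER HAS A CENTER ON THE
C     PREDICTOR VARIABLE AND A MEAN OF THE VARIABLE TO BE PREDICTED;
C     AN OBSERVATION TAKES THE MEAN OF THE NEAREST CLUSTER.
      PROGRAM CLPRED
      INTEGER K, J, JBEST
      PARAMETER (K=2)
      REAL CENTER(K), YMEAN(K), XNEW, YHAT, DBEST
      DATA CENTER / 1.0, 6.0 /
      DATA YMEAN / 10.0, 40.0 /
      XNEW = 5.2
      DBEST = 1.0E30
C     ASSIGN THE OBSERVATION TO THE CLUSTER WITH THE NEAREST CENTER
      DO 10 J = 1, K
         IF (ABS(XNEW - CENTER(J)) .LT. DBEST) THEN
            DBEST = ABS(XNEW - CENTER(J))
            JBEST = J
         END IF
   10 CONTINUE
C     PREDICT THAT CLUSTER'S MEAN OF THE TARGET VARIABLE
      YHAT = YMEAN(JBEST)
      WRITE (6,*) 'CLUSTER', JBEST, ': PREDICTED VALUE', YHAT
      END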
 
  PREDICTION ROUTINES
  -------------------
 
    AID      uses the automatic interaction detection technique to split
             data to best predict a variable for an existing observation
 
    VARCO    applies the variance components algorithm to a set of
             clusters to assign values for the mean and variance of the
             predicted variable for each cluster
 
 
    Profiles can be used as an inexpensive method of suggesting possible
  clusters, or they may suggest reasonable weights for variables.  Profiles
  may be best described as histograms on each variable, connected between
  variables by identifying observations.
 
    One model for clusters in multivariate data is that the data are
  sampled from a density with many modes, one mode for each cluster.
  Methods of estimating multivariate densities may therefore be converted
  to clustering techniques.  Univariate, bivariate, and multivariate
  histograms are useful in suggesting modes.
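
     As a small illustration (not part of CLUSTER), a univariate
   histogram can be binned and its interior local maxima reported as
   candidate modes:

C     ILLUSTRATIVE SKETCH, NOT PART OF CLUSTER: A UNIVARIATE
C     HISTOGRAM WHOSE INTERIOR LOCAL MAXIMA SUGGEST MODES, AND
C     HENCE CANDIDATE CLUSTERS.
      PROGRAM HMODE
      INTEGER N, NB, I, IB
      PARAMETER (N=8, NB=6)
      REAL X(N), XMIN, XMAX, W
      INTEGER KOUNT(NB)
      DATA X / 0.1, 0.3, 0.2, 0.4, 2.1, 2.2, 2.0, 2.3 /
      XMIN = 0.0
      XMAX = 3.0
      W = (XMAX - XMIN) / NB
      DO 10 IB = 1, NB
         KOUNT(IB) = 0
   10 CONTINUE
C     BIN THE OBSERVATIONS
      DO 20 I = 1, N
         IB = INT((X(I) - XMIN) / W) + 1
         IF (IB .GT. NB) IB = NB
         KOUNT(IB) = KOUNT(IB) + 1
   20 CONTINUE
C     A BIN HIGHER THAN BOTH NEIGHBORS IS A CANDIDATE MODE
      DO 30 IB = 2, NB-1
         IF (KOUNT(IB) .GT. KOUNT(IB-1) .AND.
     *       KOUNT(IB) .GT. KOUNT(IB+1))
     *      WRITE (6,*) 'CANDIDATE MODE NEAR', XMIN + (IB-0.5)*W
   30 CONTINUE
      END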
 
    Another approach is to represent each observation as a
  three-dimensional box.  If the observations are ordered by some other
  clustering technique, the similarity of neighboring boxes supports or
  argues against the proposed clusters.
 
  STATISTICAL PLOTTING ROUTINES
  -----------------------------
 
    PROF     constructs linearly optimal profiles of the variables
 
    RANK     constructs rank profiles of the variables
 
    LINE     constructs line profiles of the cases
 
    HIST     constructs a univariate histogram
 
    MODAL    constructs a bivariate block histogram for a pair of variables
 
    MHIST    constructs multivariate block histograms
 
    BOXES    constructs multivariate plots of the observations in the form
             of three-dimensional boxes
 
    PLOT     constructs scatter plot of a sequence of paired variables
 
    PMANY    constructs a matrix of pairwise plots of the variables
 
  MISCELLANEOUS ROUTINES
  ----------------------
 
    TWO      computes overall means and covariances, replacing missing values
                by the overall mean
 
    WCOV     computes within-cluster covariances
 
    DIST     computes Euclidean distance between two observations or
                variables (see the sketch following this list)

    WDIST    computes Mahalanobis distance between two observations
 
    MISS     estimates missing values by the cluster mean
 
    STAND    standardizes each variable to have zero mean and unit variance
 
    SCALE    discretizes the data into classes
 
    ASSIGN   assigns an observation to its closest cluster
 
    RELOC    sets each cluster center equal to the cluster mean
 
    INVERT   computes the inverse of a covariance matrix and its determinant
 
    IN       reads a data matrix with appropriate labels
 
    OUT      outputs a data matrix with appropriate labels
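
     For reference, the two distances used above can be sketched as
   follows (hypothetical functions, not the actual DIST or WDIST
   source):  the Euclidean distance is the square root of the sum of
   squared differences, and the Mahalanobis distance is
   SQRT((X-Y)' SINV (X-Y)), where SINV is the inverse of a covariance
   matrix:

C     ILLUSTRATIVE SKETCH, NOT THE CLUSTER SOURCE: EUCLIDEAN
C     DISTANCE BETWEEN TWO OBSERVATIONS OVER M VARIABLES.
      REAL FUNCTION EUCDST(M, X, Y)
      INTEGER M, I
      REAL X(M), Y(M), S
      S = 0.0
      DO 10 I = 1, M
         S = S + (X(I) - Y(I))**2
   10 CONTINUE
      EUCDST = SQRT(S)
      END

C     ILLUSTRATIVE SKETCH: MAHALANOBIS DISTANCE, GIVEN THE INVERSE
C     SINV OF A COVARIANCE MATRIX (SUCH AS INVERT COMPUTES).
      REAL FUNCTION WDST(M, X, Y, SINV)
      INTEGER M, I, J
      REAL X(M), Y(M), SINV(M,M), S
      S = 0.0
      DO 20 I = 1, M
         DO 10 J = 1, M
            S = S + (X(I) - Y(I)) * SINV(I,J) * (X(J) - Y(J))
   10    CONTINUE
   20 CONTINUE
      WDST = SQRT(S)
      END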
 
 
  NOTE:  "The subroutines in 'Clustering Algorithms' were written as
          research tools.  Speed of execution and storage requirements
          are not necessarily optimized.  These subroutines should NOT
          be used routinely to analyze large data sets."
 
    Programs giving examples of each of the above routines (except IN,
  which every sample program uses) exist in the file SAMPLE under the
  CMLIB account.  The commands
 
    ATTACH,SAMPLE/UN=CMLIB.
    GTR,SAMPLE,<lfn>.<subroutine>.
 
  will place a program which calls <subroutine> into the local file <lfn>.
  All sample programs use the CLUSTER subroutine IN, which expects input
  data on Fortran unit 5 (see the documentation of IN for the format of
  the input data file) and places output on a user-specified unit.  The
  sample data set used in each sample program can be obtained with the
  commands:
 
    ATTACH,SMPIN/UN=CMLIB.
    GTR,SMPIN,TAPE5.<subroutine>.
 
    All tests were run on the data set in the SMPIN file.  The
  corresponding output can be found in the SMPOUT file.  The commands
 
    ATTACH,SMPOUT/UN=CMLIB.
    GTR,SMPOUT,<lfn>.<subroutine>.
 
  will place the output from the sample program which calls <subroutine>
  into the local file <lfn>.
 
  Parts of the above text were taken from the following (the last two
     are available from Greg Rhoads of the Scientific Computing
     Division, FTS 921-3395).

  Hartigan, J. A. (1975).  Clustering Algorithms.  John Wiley & Sons,
     New York.

  Hartigan, J. A. (1975).  Printer graphics for clustering.  Journal
     of Statistical Computation and Simulation, 4, 187-213.

  Dallal, Gerard E.  A User's Guide to J. A. Hartigan's "Clustering
     Algorithms".  Yale University.
 
