Many computation problems use data types that are variations on abstract graphs. As a data type, an abstract graph will have an ordering of its nodes. This simple fact means that it can be very difficult to tell whether two graphs are equal, that is, represent the same physical object. If two graphs differ, it might be in an "important" way, indicating a difference in the corresponding physical objects, or an unimportant way, simply a numbering difference in the way the same information was recorded two different times.
Suppose our job, then, is to determine whether two graphs represent the same physical object. A useful technique is to come up with some kind of identification number which reflects only the important information, so that two graphs have the same ID if and only if they represent the same physical object.
For some biomedical applications, we need to consider a collection of objects called "chemical subsystems". Each such object will be a set of atoms with some chemical bonds between certain pairs. We will start with a standard set of such objects, and then we will "encounter" new objects. However, many of these new objects will be essentially identical to one of the standard objects we defined earlier. Our task is then, for each new object, to determine if it is the same as one of the standard ones.
To carry out this task, we need to generalize the notion of a graph to that of a directed color multigraph. Then we consider the notion of how two objects can be "essentially identical" or "isomorphic". We then discuss how to set up an "identification number", here called a code, that can settle the question of when two objects are the same.
Only some of the properties of a chemical subsystem need to be considered to carry out a comparison. These qualities can be abstracted and conveniently represented in terms of an abstract data structure called a directed color multigraph. The properties of this data structure can be defined in terms of the following sequence of definitions:
A directed color multigraph can be used to represent a chemical subsystem. The representation will have the property that two chemical subsystems are "the same" exactly when the corresponding directed color multigraphs are "the same".
The relevant properties of a chemical subsystem are
The numbering of the nodes in a directed color multigraph is considered to be arbitrary, and not a fundamental feature of the object. Indeed, if we were to draw a directed color multigraph on a sheet of paper, and ask two different people to create the corresponding abstract arrays, the form of these quantities would differ, depending on the order in which the nodes and arcs were numbered. Thus, it is possible that two directed color multigraphs G1 and G2, which are "different" when considered as data objects, may represent identical physical objects; in this case, the differences between the data objects are of no physical significance. This is precisely the relationship that needs to be determined. Such a relationship is called an isomorphism.
Consider two directed color multigraphs G1 and G2, with associated sets of nodes N1 and N2, color functions COLOR1 and COLOR2, and directed adjacency relationships ARCS1 and ARCS2. We say that G1 and G2 are isomorphic if there is a renumbering of the nodes of G1 which makes all the properties of the two objects correspond. In particular, there must be a renumbering of the nodes, or more precisely, a permutation P that can be applied to the indexes of the nodes of G1, so that
N1(P(I)) <=> N2(I)
COLOR1(N1(P(I)) = COLOR2(N2(I)) for all I;
ARCS1(N1(P(I)),N1(P(J)) = ARCS2(N2(I),N2(J)) for every I and J.
The question of determining whether two objects are isomorphic usually requires a great deal of computation and checking. On the other hand, it is often possible to quickly recognize certain cases in which isomorphism is impossible, in which case the isomorphism question can be answered very quickly. This concept will now be formalized.
A partial signature function for a set of objects is a function f(G) which has the property that if G1 and G2 are isomorphic, then f(G1) = f(G2).
It is possible for objects that are not isomorphic to have equal values of f, but if two objects have distinct values of f, they cannot be isomorphic. Simple examples of partial signature functions for our objects include:
In some cases, a partial signature function may in fact exactly distinguish between isomorphic and nonisomorphic objects. In such a case, we can remove the qualifer "partial" from the name.
A signature function for a set of objects is a function f(G) which has the property that f(G1) = f(G2) if and only if G1 and G2 are isomorphic.
Is it possible to determine whether two objects, G1 and G2, are isomorphic? There is surely one way, known as the brute force method. If G1 and G2 are isomorphic, then the only thing disguising this fact is that the nodes of G1 have been permuted. It is possible to consider every possible permutation of the nodes of G1, checking whether any such permutation makes the relevant properties of the two graphs match up.
The brute force method is not considered an effective procedure, because the amount of work required to solve the problem increases explosively as the problem size increases. To estimate the amount of work, measure the problem size in terms of the number of nodes N. Then the number of permutations of those nodes is the factorial of N, represented as "N!". If we make the reasonable assumption that there is some constant quantity C of work to examine whether any particular permutation makes the graphs coincide, then the total amount of work required for a single isomorphism determination is of the order of C * N!.
For the general graph isomorphism problem, no algorithm is known that has a significantly better performance than the brute force method, and in fact, the problem is considered to be NP hard, that is, inherently unsolvable for sufficiently large problems.
To determine whether two directed color multigraphs are isomorphic will require, as part of the computation, that we verify that the underlying graphs are isomorphic. Hence, this problem is in some sense at least as hard as the graph isomorphism problem, and we should only expect to be able to solve problems with a limited number of nodes, or with other properties that allow the determination to be made more rapidly.
The adjacency matrix for a graph of N nodes is an N by N matrix A whose offdiagonal element A(I,J) is 1 if there is an undirected arc between the nodes I and J, and whose diagonal element A(I,I) is zero.
The adjacency matrix for a directed color multigraph is an extension of this idea; the offdiagonal element A(I,J) is the number of arcs directed from node I to node J; each diagonal elements A(I,I) is the index of the color of node I.
The adjacency matrix completely summarizes the information in the data structure of the directed color multigraph. Hence, we can determine the question of isomorphism by comparing adjacency matrices. However, the adjacency matrix carries with it the numbering of the nodes, and hence is not an immediate answer to our isomorphism question. We need a way to "factor out" the node numbering.
We first need to define an order relation on N by N integer matrices. We need to choose an ordering so that we can compare two matrices. The particular ordering we choose here will greatly simplify our backtracking procedure later.
When comparing two matrix indices, we will say that
(I1,J1) < (I2,J2)if either:
1 2 5 10 17
3 4 7 12 19
6 8 9 14 21
11 13 15 16 23
18 20 22 24 25
We now say that matrix A is less than or equal to matrix B if, under the given index ordering, either every entry of the two matrices is equal, or else, at the first entry (I,J) for which the matrices differ, it is the case that A(I,J) < B(I,J).
For a given directed color multigraph, we now consider each possible ordering O of the nodes. For each ordering, there is a corresponding adjacency matrix A(O). We define the directed color multigraph code to be the maximum element A* of all the adjacency matrices defined in this way.
Now we note that a directed color multigraph code is a signature function. That is, two directed color multigraphs are isomorphic if and only if their codes are equal. In this setting, the cost of an isomorphism comparison becomes equal to the cost of two code computations.
The problem of isomorphism determination has now been recast into the problem of computing the signature function, that is, the directed color multigraph code. One way to do this, in turn, would be to generate every possible permutation of the nodes of the graph, in order to find the maximal adjacency matrix. It has been noted that such an approach becomes computationally infeasible, since the associated work load increases like the factorial function, as the number of nodes increases.
While a brute force approach will still work for small problems, it is necessary to come up with another approach for problems with a moderate number of nodes, and to identify any features of our problem that make a factorial work load less likely.
Rather than a brute force computation, we will use backtracking. In other words, we will start with the identity permutation of the nodes, and compute the associated adjacency matrix, which will be our initial estimate for the directed color multigraph code. Then we will consider all permutations in which the first node is swapped with the second. It is possible to compute a portion of the adjacency matrix that will result, and in some cases, it is possible to determine if this adjacency matrix is going to be smaller than our current best. If so, we immediately abandon the current search (which actually represents an enormous number of possibilities), and move to the next possibility.
In most cases, a backtracking approach is extremely efficient when compared to a brute force method. The cases where backtracking cannot offer much of an improvement will occur typically when the underlying graph is close to maximally connected, that is, almost every pair of nodes is connected. However, this will not happen in our problems, because the individual chemical components have a fixed and low valence. The total valence of any component (and correspondingly, the maximum degree of a node) will be 4 or 5.
Other features that will help us avoid the factorial explosion include the "color" of the nodes, the differing valences of the chemical constituents, and the fact that the typical total number of components in an object will be less than 40.
There are two advantages to the use of the code. First, it shows an organized and systematic way to approach the isomorphism question using a convenient data structure. Secondly, it allows us to precompute codes for any special or preferred objects.
The precomputation of codes is extremely useful in the case at hand. We expect to classify a set of standard objects, and then analyze, one at a time, many newly encountered objects. We expect each of the new objects to be equal to one of the standards. If we compute the code for the new object, we can inexpensively compare it to the precomputed codes of each of the standards, and we have our answer. Thus, a new object can be classified at the cost of "half" of a single isomorphism comparison, rather than the potential cost of carrying out a full isomorphism comparison between the new object and every one of the standards.
FORTRAN software for computing signature codes for graphs, multigraphs, directed graphs, directed multigraphs, color graphs, color digraphs, and color directed multigraphs is available in CODEPACK.
Back to the home page.