Chaos Games and Genetics

A semi-random tutorial by Dan Ashlock

Contents:


Introduction.

The original chaos game moved a point by averaging toward random vertices of a triangle. Plotting the point as it moves yields a Serpinski triangle, below.

If you use the corners of a square instead of a triangle, then you just fill in the square. The example below uses color to show you what happens as the points move.

But what if you don't chose the corners of the square uniformly at random? Lots of things:


DNA Sequences Driving Chaos Games

The alert reader with some grounding in biology will have already made the connection between the four-cornered chaos game and DNA. Since the DNA alphabet has four letters in it, we can drive a chaos game with DNA and get a fractal picture of the DNA. This idea is due to Jim Golden.

It turns out that the four cornered chaos game gives a useful representation of a large set of genetic data (you need lots of of data to get the chaos game to fill in all the points it is going to hit). When you look at a fractal, the spaces left blank are the image of short sequences of DNA codons that don't appear in the data. A click-and-tell-me-whats-here tool permits you to use this fractal to browse the short subsequences of your information and get some idea what's in it. Compare this with direct examination of the sequence or histograms of short sequence frequency and you will perceive a substantial advantage to the fractal view. Below are the chaos game fractals, colored with pixels and then with shaded disks for DNA drawn from several different organisms.

Gene Source

Chaos/Pixels

Chaos/Disks

Human immunodeficiency virus
type 1 (HXB2)
complete genome

Zea mays
opaque-2 gene

Chloroplast
Nicotiana tabacum

Drosophila melanogaster
dual specificity kinase
DYRK2

Tobacco mosaic virus
complete genome

Mouse MHC class I
cell surface glycoprotein
(H-2T24)

These chaos-fractal representations of the different sorts of genetic material do in fact look different. They also have a lot of appearance of similarity. This similarity comes from the fact that any four-cornered chaos-game fractal will look somewhat like a subset of a ramified cross.

All of the fractals above used the same number of iterations of the chaos game, but they had to deal with differing amounts of input data. For the shorter DNA strings the sequence data was reused until 2,000,000 points (or 20,000 shaded disks) were plotted. The gaps in the fractals may be real but they may also be small sample size effects (if we are thinking of them as part of a type of DNA). This suggests it would be nice to get a model of the data and then plot a chaos fractal based on the model. A simple Markov chain produced the fractal

,

and so we will look at deriving Markov models from DNA and then plotting the chaos fractal of the Markov chain. This serves to ``smooth'' the data.


Markov Models of DNA and their Chaos Fractals

We will use a simple Markov model, defined as follows. For a window of size k, we will record all the patterns of k consecutive nucleotides that occur, together with the distribution of the next nucleotide that follows that pattern. The number k is the order of the Markov model. Once we have fit a kth order Markov model to a given genetic sample we will then use that Markov chain to generate a chaos fractal. Starting with the pattern of k Cs, we generate a "next" nucleotide from the Markov chain. We then move this new nucleotide into our k-nucleotide window, bumping the oldest member of the window, and get a new distribution of "next" nucleotides from the Markov chain. By iterating this process we synthesizing DNA with the same (k+1)st order statistics as the original sample. The result of using this fake DNA to draw chaos fractals, for 2nd, 3rd, 4th, and 5th order Markov models, are shown below.

Gene
Source

Markov Chain Order

2

3

4

5

HIV

Maize

Chloro-
plast

Drosophila

Tobacco
mosaic
virus

Mouse
MHC


Markov Models of DNA and their Chaos Fractals
for Six Whole Microbial Genomes

In this section we begin by simply repeating the process of extracting a Markov model from DNA and using it to drive a chaos game for six whole microbial genomes. To show more detail these are 240x240 gif files: widen your browser.

Gene
Source
(Whole
Genome)

Markov Chain Order

3

4

5

Archaeoglobus
fulgidus

Borrelia
burgdorferi

Haemophilus
influenzae

Heliobacter
pylori

Methanococcus
jannaschii

Mycobacterium
tuberculosis


Markov Models of DNA and the Differences in
Chaos Fractals for Six Whole Microbial Genomes

In this section, we compute the fifth order Markov models of the DNA for all six of our whole microbial genomes, create chaos fractals with hit counts, and then let the differential hit count dictate the color of a difference fractal. The color that is ahead at any given pixel is dictated by the red and green row/column labels, as follows:

Archaeoglobus fulgidus

AF

Borrelia burgdorferi

BB

Haemophilus influenzae

HI

Heliobacter pylori

HP

Methanococcus jannaschii

MJ

Mycobacterium tuberculosis

MT

The difference fractals are indexed with half size version of the fifth-order Markov chaos game. The difference for two microbial genomes is given by the intersection of the row and column where their fifth order Markov fractals intersect.

Fifth order Markov chaos fractal differences

AF

BB

HI

HP

MJ

BB

HI

HP

MJ

MT


Similitude Based Barnsley Fractals
That Detect the Presence of Stop Codons

At this point we drop the four cornered chaos game and look at other sorts of fractal-generating processes. The fractals at the end of this section are a restricted type of contraction map fractal (as is the chaos game). There are three issues here: the generation method for the fractals, the way we drive the fractals with our DNA data, and finally the way we located the fractals so that they had the property of separating DNA with stop codons from DNA without stop codons. Disclaimer: the DNA used for these fractals was maximally random simulated DNA - which is probably the most difficult case, of which we will say more later.

The Fractals

A contraction map is a map f(x), in this case from the plane to itself, that has the property that for any two points x,y, the distance from x to y is at least the distance from f(x) to f(y). In other words, all distances are contracted by this map. It is a theorem that such a map has a unique fixed point p so that f(p)=p and the map contracts the whole plane toward that point.

Michael Barnsley showed that you can use contraction maps to generate fractals in the following way. Much as with the chaos game, you plot moving points. Instead of moving toward the corners of a square or a triangle selected at random, you instead select contraction maps at random from a list of contraction maps. The unique fixed point replaces the corners of the polygon. The 20 fractals at the end of this section were generated in this fashion.

For our DNA driven fractals we need a large source of random contraction maps. This is because we will search for fractals that can separate different types of DNA and so need a space of fractals which in turn requires a space of contraction maps. We chose to use similitudes. Similitudes, in the plane, are a four-parameter family of contraction maps. The first parameter is an angle t through which the plane is rotated. The second and third parameters are the coordinates of a vector through which the plane is displaced. The fourth parameter is a shrink factor in the range from 0 to 1 which is used to contract the plane toward the origin. Since rotation and translation are distance preserving, it follows that a rotation, a displacement, and then a contraction form a contraction map. The mathematical form of a similitude is:

The fractals we are using have eight similitudes. We measure the average position of the moving point and the maximum distance the point gets from that average position to get a "center" and "radius" that can be used to normalize the resulting fractal.

Driving the fractals with DNA

The "DNA" used to drive these fractals comes from two random sources. One is a maximum entropy source with 25% each of CGAT. The other is a maximum entropy source that avoids the triples TAA, TGA, and TAG, i.e. stop codons. This latter source is implemented by generating, with equal probability, any letter that does not create a stop codon in the reading frame starting with the first character generated.

We read the DNA in triples mapping the 64 possible triples onto one of right possible similitudes with a lookup table. For the first of the 20 fractals below the lookup table was:

CCCCCCCCCCCCCCCCGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTT
CCCCGGGGTTTTAAAACCCCCGGGGTTTTAAAACCCCGGGGTTTTAAAACCCGGGGTTTTAAAA
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT
1333714743627132617277121322763137276232223623277726627337713321
So, for example, the triple GCC caused the fractal to use similitude 6. This scheme lest us drive the eight similitudes with 64 DNA based features. At this point it is probably worth noting that this selection of features, in this case DNA triples, is a major area for exploration. Using different features to drive the similitudes opens up many possible fractal tools.

Searching for separating fractals.

The data structure we used to store fractals is a list of eight similitudes and a 64 element lookup table mapping DNA triples onto those similitudes. This means each fractal has 8x4=32 real parameters and 64 three-bit integer parameters. This is a very large search space. Because of this we used a genetic algorithm to search the space of possible fractals.

Genetic algorithms operate by starting with a collection of random objects (in this case fractals). Our genetic algorithm would these select four fractals and sort them from best to worst in a manner described below. The two best fractals were then copied over the two worst. Within those copies a random suffix of the list of eight similitudes and a random middle segment of the lookup table were exchanged, a process called crossover. One of the two new fractals then has a small amount added or subtracted to one of its similitude parameters, selected at random, while the other had one entry of its lookup table changed, two different sorts of mutation. Repeated many times, this process locates high quality fractals.

So, what quality measure are we using? In this case we sent in the stop codon free and stop codon containing DNA in blocks of from 50-400 triples. The mean position of points plotted for each sort of random source were computed and the measure of quality was taken to be the Euclidean distance between these means. Red points result from stop-codon free DNA, green points from stop-codon containing DNA. While the means are separate, the red and green domains mix: they don't overlap much though - overlap would be yellow in our coding scheme. We give the best-of-run fractals from 20 different runs of the genetic algorithm. Enjoy the fractals.

(No) Stop

Full Fractal


Similitude Based Barnsley Fractals
That Detect the Frame of DNA

In this section we take the whole genome of mycobacterium tuberculosis and use it, both in frame and out of frame, to replace the two random sources in the previous section. Below, we show the output of three runs of the genetic algorithm in three forms. The leftmost gif shows the raw separation, red for in frame and green for out-of-frame. The middle gif displays this same image run through a smoothing filter to widen the yellow boundary areas where the fractal system is not in the attractor of either reading frame. The third image is the fractal with each of the eight similitudes color coded with on of the eight possible RGB colors. All of the below are thumbnails, click on them to obtain full-sized pictures.


Data versus Markov Differences

In this section we take the whole genome of mycobacterium tuberculosis and a good chunk of human chromosome 22 and compute the difference between the 5th order Markov model of the DNA and the DNA. To do this two chaos games are run and hit counts for pixels compiled. These chaos games are then used as strengths of a red and a green channel. The red channel is the Markov chain chaos game, the green channel one driven directly by data. Red and green blend to make yellow. Finally areas hit by neither chaos game are set to a constant blue value. To prevent outliers from dominating, the data are truncated to twice the sample variance of the hit count in each channel before they are superimposed.

Tuberculosi

H. Sapiens


Mycobacterium tuberculosis Markov-driven distiguishing fractal.

Here we give a pair of images (click for double size) that use a histogrammatic count of the 6-mers (length six DNA substrings) of the Mycobacterium tuberculosis genome in the correct reading frame and in an incorrect reading frame to drive a six similitude Barnsley type contraction may fractal. The fractal pictures look different with the yellow lobe being larger for the out-of-frame data while the cyan lobe is larger got the in-frame data.

In frame.

Out of frame.