SS_GG_ALIGN - Sequence/sequence Global Gap Alignment
SS_GG_ALIGN is a draft implementation of some of the string
matching algorithms described in the reference [Chao]. These
algorithms carry out the computation in linear space, and
compute not just the optimal alignment score, but also the corresponding
optimal alignment.
It's important to be able to compute alignments using "linear space",
that is, just a few vectors whose length N is equal to that
of a typical string. A quadratic space algorithm requires a two
dimensional array with N*N entries. Realistic alignment
problems can involve strings of N=100,000 elements or more, for which
such an array would have 10^10 entries, so a quadratic space algorithm
would be expensive or impossible to use.
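The storage comparison can be made concrete with a little arithmetic.
The following Python sketch is only an illustration: the helper name,
the 4 bytes per entry, and the count of 3 work vectors are all
assumptions, not values taken from SS_GG_ALIGN.

```python
def alignment_storage_bytes(n, bytes_per_entry=4, linear_vectors=3):
    """Estimate storage for strings of length n.

    bytes_per_entry and linear_vectors are illustrative assumptions.
    """
    # Quadratic space: one full N by N table.
    quadratic = n * n * bytes_per_entry
    # Linear space: a few work vectors of length N.
    linear = linear_vectors * n * bytes_per_entry
    return quadratic, linear


quadratic, linear = alignment_storage_bytes(100_000)
print(f"quadratic: {quadratic / 1e9:.0f} GB, linear: {linear / 1e6:.1f} MB")
```

Under these assumptions, the quadratic table needs about 40 GB, while
the linear space vectors need about 1.2 MB.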
The "matching" being considered does not actually require that every
element of string A match an identical element of string B.
Instead, the matching algorithm is essentially looking for the cheapest
way of transforming one string into the other, allowing the operations of
"mutation" (change one letter to another), "deletion" (drop a string
of consecutive letters) and "insertion" (insert a string of consecutive
letters). Thus, at the same time, we are measuring a sort of
evolutionary distance between two strings and a formal "editing
distance" between them.
This set of routines assumes that an insertion or deletion of length
K is penalized using an "affine gap penalty formula" of the form:
Penalty = Gap_Open + K * Gap_Extend
This choice of penalty function has a major effect on the form
of the matching algorithms, particularly in the linear space case.
(Compare the simpler "global distance" routines in
SS_GD_ALIGN.)
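The penalty formula can be written as a small function. This is a
minimal Python sketch; the function name and the numeric values used
below are hypothetical, and treating a gap of length 0 as free is an
assumption.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Penalty = Gap_Open + K * Gap_Extend for a gap of length k."""
    if k < 0:
        raise ValueError("gap length must be nonnegative")
    if k == 0:
        # Assumption: no gap means no penalty at all.
        return 0
    return gap_open + k * gap_extend


# One gap of length 4 versus two separate gaps of length 2
# (illustrative values gap_open = 2, gap_extend = 0.5).
one_long_gap = affine_gap_penalty(4, 2, 0.5)
two_short_gaps = 2 * affine_gap_penalty(2, 2, 0.5)
print(one_long_gap, two_short_gaps)
```

With these values the single long gap costs 4.0 and the two short gaps
cost 6.0, because Gap_Open is charged once per gap. This is why the
algorithm must keep track of whether a gap is being opened or extended.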
The score for the actual best matching is determined without explicitly
constructing the best matching. It is a matter of some
difficulty to recover the matching corresponding to the best score.
This is particularly true if the algorithm is a linear space one, which
discards a great deal of intermediate information. However, it is
possible to set up a recursive algorithm which determines the best
alignment, using only linear space.
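The score-only computation in linear space can be sketched as follows.
This is not the SS_GG_FSL routine itself, only a minimal Python
illustration of a standard affine gap recurrence that keeps a single
row of values. It assumes that similarity scores are maximized and gap
penalties are subtracted; the function names and the simple +1/-1
scoring function are hypothetical.

```python
def gg_forward_score_linear(a, b, score, gap_open, gap_extend):
    """Best global alignment score of a and b, using O(len(b)) storage."""
    neg_inf = float("-inf")
    n = len(b)
    # s[j]: best score for a[:i] versus b[:j] (row i = 0 to start).
    s = [0.0] + [-(gap_open + j * gap_extend) for j in range(1, n + 1)]
    # f[j]: best score for a[:i] versus b[:j] ending with a[i-1] deleted.
    f = [neg_inf] * (n + 1)
    for i in range(1, len(a) + 1):
        diag = s[0]                        # value from row i-1, column 0
        s[0] = -(gap_open + i * gap_extend)
        e = neg_inf                        # best score ending with b[j-1] inserted
        for j in range(1, n + 1):
            # Extend a deletion, or open one from row i-1 (s[j] is still old).
            f[j] = max(f[j], s[j] - gap_open) - gap_extend
            # Extend an insertion, or open one from column j-1 of row i.
            e = max(e, s[j - 1] - gap_open) - gap_extend
            best = max(diag + score(a[i - 1], b[j - 1]), e, f[j])
            diag = s[j]                    # save row i-1 value for next column
            s[j] = best
    return s[n]


def simple_match(x, y):
    # Hypothetical scoring: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1


print(gg_forward_score_linear("ACGT", "AT", simple_match, 2, 1))
```

Only the final score survives this computation. Recovering the
alignment itself requires extra work, such as the recursive approach
described above.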
Routines that use quadratic space are included as well, so the algorithms
can be compared for storage, speed, and correctness.
The names of the scoring and path routines end in three letters that
indicate whether they use a forward (F), backward (B), or recursive (R)
algorithm, whether they compute the score (S) or the path (P), and
whether they use linear (L) or quadratic (Q) space. Thus, the routine
SS_GG_FSQ uses the forward algorithm to compute the score,
with quadratic space requirements.
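The naming convention can be illustrated with a small decoder. This
Python helper is hypothetical and is not part of SS_GG_ALIGN; it simply
reads the last three letters of a routine name according to the
convention described above.

```python
DIRECTION = {"F": "forward", "B": "backward", "R": "recursive"}
RESULT = {"S": "score", "P": "path"}
SPACE = {"L": "linear", "Q": "quadratic"}


def decode_routine_name(name):
    """Decode a name such as SS_GG_FSQ into (direction, result, space)."""
    prefix = "SS_GG_"
    suffix = name[len(prefix):]
    if not name.startswith(prefix) or len(suffix) != 3:
        raise ValueError(f"not a scoring or path routine name: {name}")
    d, r, s = suffix
    if d not in DIRECTION or r not in RESULT or s not in SPACE:
        raise ValueError(f"unrecognized suffix in routine name: {name}")
    return DIRECTION[d], RESULT[r], SPACE[s]


print(decode_routine_name("SS_GG_FSQ"))
print(decode_routine_name("SS_GG_RPL"))
```

The first call yields ('forward', 'score', 'quadratic') and the second
yields ('recursive', 'path', 'linear').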
-
Reference 1:
-
Kun-Mao Chao, Ross Hardison, Webb Miller,
Recent Developments in Linear-Space Alignment Methods: A Survey,
Journal of Computational Biology,
Volume 1, Number 4, 1994, pages 271-291.
-
Reference 2:
-
Eugene Myers and Webb Miller,
Optimal Alignments in Linear Space,
CABIOS, volume 4, number 1, 1988, pages 11-17.
-
Reference 3:
-
Michael Waterman,
Introduction to Computational Biology,
Chapman and Hall, 1995.
The list of routines includes:
-
A_INDEX sets up a reverse index for the amino acid codes.
-
A_TO_I returns the index of an alphabetic character.
-
C_CAP capitalizes a single character.
-
CVEC2_PRINT prints two vectors of characters.
-
CVEC_PRINT prints a vector of characters.
-
GET_SEED returns a seed for the random number generator.
-
I_RANDOM returns a random integer in a given range.
-
I_SWAP switches two integer values.
-
I_TO_A returns the I-th alphabetic character.
-
I_TO_AMINO_CODE converts an integer to an amino code.
-
IVEC2_COMPARE compares pairs of integers stored in two vectors.
-
IVEC2_PRINT prints a pair of integer vectors, with an optional title.
-
IVEC2_SORT_A sorts a vector of pairs of integers into ascending order.
-
IVEC_REVERSE reverses the elements of an integer vector.
-
MUTATE applies a few mutations to a sequence.
-
PAM120 returns the PAM 120 substitution matrix.
-
PAM120_SCORE computes a single entry sequence/sequence matching score.
-
PAM200 returns the PAM 200 substitution matrix.
-
PAM200_SCORE computes a single entry sequence/sequence matching score.
-
RVEC2_SUM_IMAX returns the index of the maximum sum of two real vectors.
-
S_EQI is a case insensitive comparison of two strings for equality.
-
S_TO_CVEC converts a string to a character vector.
-
S_TO_I reads an integer value from a string.
-
SIMPLE_SCORE computes a single entry sequence/sequence matching score.
-
SORT_HEAP_EXTERNAL externally sorts a list of items into linear order.
-
SS_GG_BPQ does a global gap backward path quadratic alignment.
-
SS_GG_BSL does a global gap partial backward score linear alignment.
-
SS_GG_BSQ does a global gap partial backward score quadratic alignment.
-
SS_GG_FPQ does a global gap forward path quadratic alignment.
-
SS_GG_FSL does a global gap forward score linear alignment.
-
SS_GG_FSQ does a global gap forward score quadratic computation.
-
SS_GG_MATCH_PRINT prints a global gap alignment.
-
SS_GG_MATCH_SCORE scores a global gap alignment.
-
SS_GG_RPL determines an alignment using recursion and linear space.
-
SS_GG_RPL_POP pops the data describing a subproblem off of the stack.
-
SS_GG_RPL_PUSH pushes the data describing a subproblem onto the stack.
-
UNIFORM_01_SAMPLE is a portable random number generator.
-
WORD_LAST_READ returns the last word from a string.
-
WORD_NEXT_READ "reads" words from a string, one at a time.
Last revised on 13 March 2001.