SS_LG_ALIGN - Sequence/sequence Local Gap Alignment
SS_LG_ALIGN is a draft implementation of some of the string
matching algorithms described in the reference [Chao].
These algorithms look for optimal local alignments of two strings
using linear space. (Compare the global alignment routines in
SS_GG_ALIGN.)
It's important to be able to compute alignments using "linear space",
that is, just a few vectors whose length N is equal to that
of a typical string. A quadratic space algorithm would require a two
dimensional array with N*N entries. Realistic alignment
problems can involve strings of N=100,000 elements or more,
so a quadratic space algorithm would be expensive or impossible to use.
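To give a rough sense of the scale involved, the following Python sketch compares the storage for one N*N array against a few vectors of length N+1. The 4-byte entry size and the count of three vectors are assumptions made for illustration only; they are not taken from the SS_LG_ALIGN source.

```python
def quadratic_bytes(n, bytes_per_entry=4):
    """Storage for one N-by-N array (the entry size is an assumed value)."""
    return n * n * bytes_per_entry


def linear_bytes(n, vectors=3, bytes_per_entry=4):
    """Storage for a few vectors of length N+1 (the vector count is assumed)."""
    return vectors * (n + 1) * bytes_per_entry


if __name__ == "__main__":
    n = 100_000
    print(f"quadratic: {quadratic_bytes(n) / 1e9:.1f} GB")
    print(f"linear:    {linear_bytes(n) / 1e6:.1f} MB")
```

Under these assumptions, the single quadratic array needs about 40 GB for N=100,000, while the vectors need about 1.2 MB.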
The "matching" being considered does not actually require that every
element of string A match an identical element of string B.
Instead, the matching algorithm is essentially looking for the highest
scoring way of associating a portion of one string with a portion of the
other, allowing the operations of "mutation" (change one letter to
another), "deletion" (drop a string of consecutive letters) and
"insertion" (insert a string of consecutive letters). Thus, at the same
time, we are measuring a sort of evolutionary distance between two
strings and a formal "editing distance" between them.
This set of routines assumes that an insertion or deletion of length
K is penalized using an "affine gap penalty formula" of the form:
Penalty = Gap_Open + K * Gap_Extend
This choice of penalty function has a major effect on the form
of the matching algorithms, particularly in the linear space case.
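The penalty formula above can be written as a small Python sketch. The function name and the numeric values of Gap_Open and Gap_Extend are hypothetical, chosen only to illustrate the formula; the library's own defaults are not stated here.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Penalty = Gap_Open + K * Gap_Extend for a gap of K >= 1 letters."""
    if k < 1:
        raise ValueError("gap length K must be at least 1")
    return gap_open + k * gap_extend


if __name__ == "__main__":
    gap_open, gap_extend = 2.0, 0.5  # illustrative values only
    one_long_gap = affine_gap_penalty(4, gap_open, gap_extend)
    two_short_gaps = 2 * affine_gap_penalty(2, gap_open, gap_extend)
    print(one_long_gap, two_short_gaps)
```

With these values, one gap of length 4 is penalized less than two gaps of length 2, because Gap_Open is charged once per gap. A matching algorithm therefore has to keep track of whether it is opening a new gap or extending an existing one.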
The score of the best matching is determined without explicitly
constructing the matching itself. Recovering the matching that
corresponds to the best score is more difficult.
This is particularly true if the algorithm is a linear space one, which
discards a great deal of intermediate information. However, it is
possible to set up a recursive algorithm which determines the best
alignment, using only linear space.
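As an illustration of how a best local score can be found without storing a full table, here is a minimal Python sketch of a forward, score-only computation with affine gap penalties. It is a standard dynamic programming formulation and is not claimed to be the code of SS_LG_FSL; the names local_score_linear and simple_score, and the scoring values used, are assumptions for this example.

```python
def simple_score(x, y):
    """Hypothetical stand-in scoring function: +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


def local_score_linear(a, b, score, gap_open, gap_extend):
    """Best local alignment score of a and b, keeping only vectors of length len(b)+1.

    A gap of length K costs gap_open + K * gap_extend.
    """
    n = len(b)
    neg_inf = float("-inf")
    h_prev = [0.0] * (n + 1)   # best scores for the previous row
    e = [neg_inf] * (n + 1)    # best scores ending in a gap that skips letters of a
    best = 0.0
    for i in range(1, len(a) + 1):
        h_curr = [0.0] * (n + 1)
        f = neg_inf            # best score ending in a gap that skips letters of b
        for j in range(1, n + 1):
            # Either extend an existing gap or open a new one.
            e[j] = max(e[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            f = max(f - gap_extend, h_curr[j - 1] - gap_open - gap_extend)
            # A local alignment may also restart from score 0 at any cell.
            h_curr[j] = max(0.0, h_prev[j - 1] + score(a[i - 1], b[j - 1]), e[j], f)
            best = max(best, h_curr[j])
        h_prev = h_curr        # the older row is discarded
    return best


if __name__ == "__main__":
    print(local_score_linear("ACGTACGT", "ACGTTACGT", simple_score, 2.0, 1.0))
```

Only the best score survives in this sketch. The information needed to trace the path back is discarded row by row, which is why recovering the alignment itself calls for the recursive approach.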
Routines that use quadratic space are included as well, so the algorithms
can be compared for storage, speed, and correctness.
The names of the scoring and path routines include information
about whether they use a forward, backward, or recursive algorithm,
whether they compute the score or the path, and whether they use
linear or quadratic space. Thus, the routine
SS_LG_FSQ uses the forward algorithm to compute the score,
with quadratic space requirements.
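The naming convention can be made concrete with a short Python sketch that decodes the three-letter suffix of a routine name. The helper decode_routine_name is hypothetical and is not part of the library; the letter meanings are inferred from the routine descriptions listed in this document.

```python
DIRECTION = {"F": "forward", "B": "backward", "R": "recursive"}
RESULT = {"S": "score", "P": "path"}
SPACE = {"L": "linear", "Q": "quadratic"}


def decode_routine_name(name):
    """Return (algorithm, result, space) for names such as 'SS_LG_FSQ'."""
    suffix = name.rsplit("_", 1)[-1]
    if (len(suffix) != 3 or suffix[0] not in DIRECTION
            or suffix[1] not in RESULT or suffix[2] not in SPACE):
        raise ValueError(f"{name!r} does not follow the naming pattern")
    return DIRECTION[suffix[0]], RESULT[suffix[1]], SPACE[suffix[2]]


if __name__ == "__main__":
    print(decode_routine_name("SS_LG_FSQ"))
    print(decode_routine_name("SS_LG_RPL"))
```

Helper routines whose names do not end in such a suffix, for example SS_LG_RPL_POP, are rejected with a ValueError.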
-
Reference 1:
-
Kun-Mao Chao, Ross Hardison, Webb Miller,
Recent Developments in Linear-Space Alignment Methods: A Survey,
Journal of Computational Biology,
Volume 1, Number 4, 1994, pages 271-291.
-
Reference 2:
-
Eugene Myers and Webb Miller,
Optimal Alignments in Linear Space,
CABIOS, Volume 4, Number 1, 1988, pages 11-17.
-
Reference 3:
-
Michael Waterman,
Introduction to Computational Biology,
Chapman and Hall, 1995.
The list of routines includes:
-
A_INDEX sets up a reverse index for the amino acid codes.
-
A_TO_I returns the index of an alphabetic character.
-
C_CAP capitalizes a single character.
-
CVEC2_PRINT prints two vectors of characters.
-
CVEC_PRINT prints a vector of characters.
-
GET_SEED returns a seed for the random number generator.
-
I_RANDOM returns a random integer in a given range.
-
I_SWAP switches two integer values.
-
I_TO_A returns the I-th alphabetic character.
-
I_TO_AMINO_CODE converts an integer to an amino code.
-
IVEC2_COMPARE compares pairs of integers stored in two vectors.
-
IVEC2_PRINT prints a pair of integer vectors.
-
IVEC2_SORT_A sorts a vector of pairs of integers into ascending order.
-
IVEC_REVERSE reverses the elements of an integer vector.
-
MUTATE applies a few mutations to a sequence.
-
PAM120 returns the PAM 120 substitution matrix.
-
PAM120_SCORE computes a single entry sequence/sequence matching score.
-
PAM200 returns the PAM 200 substitution matrix.
-
PAM200_SCORE computes a single entry sequence/sequence matching score.
-
RMAT_IMAX returns the location of the maximum of a real M by N matrix.
-
RVEC2_SUM_IMAX returns the index of the maximum sum of two real vectors.
-
S_EQI is a case insensitive comparison of two strings for equality.
-
S_TO_CVEC converts a string to a character vector.
-
S_TO_I reads an integer value from a string.
-
SIMPLE_SCORE computes a single entry sequence/sequence matching score.
-
SORT_HEAP_EXTERNAL externally sorts a list of items into linear order.
-
SS_GG_BSL determines a global gap backward alignment score in linear space.
-
SS_GG_FSL determines a global gap forward alignment score in linear space.
-
SS_LG_BPQ determines a local gap backward alignment path in quadratic space.
-
SS_LG_BSL determines a local gap backward alignment score in linear space.
-
SS_LG_BSQ determines a local gap backward alignment score in quadratic space.
-
SS_LG_CORNERS determines the "corners" of an optimal local alignment.
-
SS_LG_FPQ determines a local gap forward alignment path in quadratic space.
-
SS_LG_FSL determines a local gap forward alignment score in linear space.
-
SS_LG_FSQ determines a local gap forward alignment score in quadratic space.
-
SS_LG_MATCH_PRINT prints a local gap alignment.
-
SS_LG_MATCH_SCORE scores a local gap alignment.
-
SS_LG_RPL determines a local gap recursive alignment path in linear space.
-
SS_LG_RPL_POP pops the data describing a subproblem off of the stack.
-
SS_LG_RPL_PUSH pushes the data describing a subproblem onto the stack.
-
UNIFORM_01_SAMPLE is a portable random number generator.
-
WORD_LAST_READ returns the last word from a string.
-
WORD_NEXT_READ "reads" words from a string, one at a time.
Last revised on 13 March 2001.