The longest common subsequence problem[1] (LCS) is a computer science problem to find the longest subsequence common to a set of sequences. This paper will address the special problem of comparing the words or whole lines of 2 text files where the goal is to discover how the second file differs from the first. It is useful in comparing versions of text files to determine what changes were made in subsequent versions. We will call this special case of the problem to have a bias for the first sequence.
Using Bias
editThe problem can be solved in polynomial time with memory for the two sequences and memory for a binding vector for each sequence. The binding vectors are vectors that remember which element of one sequence is common to the other. A matrix of common sequence lengths is not required as in the classical LCS solution[2].
Although in the classical problem, the LCS is not unique, it is unique in this special case of bias.
Suffixes
editLet us define a suffix of a sequence as the sequence with the top cut off. Let X be the sequence {GCAT}, Then there are 4 suffixes, S1, S2, S3, S4 which are subsequences of X where the subscript denotes the number of elements in the subsequence. The possible suffixes of X are
- S1 = {T}
- S2 = {AT}
- S3 = {CAT}
- S4 = {GCAT} = X
CS function
editThe CS function is a function that yields a common subsequence, but not necessarily the longest. It is a guess of how the second sequence is a modification of the first. Given two sequences, and , their binding vectors, Xbind[1..m] and Ybind[1..n], and their corresponding suffixes, and the CS function is given by
LCS function
editNow, the longest common subsequence can be found using the CS function and is
Example
editGiven the two sequences, and the longest common sequence is . A Difference Report of the two sequences side by side commenting on the additions, changes and deletions made to the left (bias) sequence, might be
Version 1 Version 2 --------- --------- A Added B B C Deleted D D F E Changed K K L N Changed P Deleted
Now, consider the Versions reversed. and . Although the same longest common sequence is found, the additions, changes and deletions are different since the bias has changed.
Version 1 Version 2 --------- --------- A Deleted B B C Added D D F E Changed K K L N Changed P Added
Non-Recursive Code for LCS with Bias
editfunction forwardTrace( X[1..m], Y[1..n]) ) array Xbind[1..m], Ybind[1..n] Xbind, Ybind := Zero Vector mtop, ntop := 1 j := 1 while j ≤ n j, j0 := ntop Find some CS(X,Y) scanning the first sequence X for i := mtop..m j := j0 while j ≤ n and Y[j] ≠ X[i] j := j+1 end if j ≤ n Xbind[i] := j Ybind[j] := i j0 := j+1 else Xbind[i] := 0 end next i i, i0 := mtop Scan the second sequence to see if CS(Y,X) = CS(X,Y), i.e., symmetrical isSymmetrical := true j = ntop while j ≤ n and isSymmetrical i := i0 while i ≤ m and X[i] ≠ Y[j] i := i+1 end if i ≤ m if Xbind[i] = 0 or ( Ybind[j] > 0 and Xbind(Ybind[j]) ≠ j ) CS(Y,X) ≠ CS(X,Y) at this j so… erase all bindings from here down for Ybind for j2 := j+1..n if Ybind[j2] > 0 Xbind[Ybind[j2]] := 0 Ybind[j2] := 0 end next j2 Set new bindings Ybind[j] := i Xbind[Ybind[j]] := i mtop := i+1 ntop := j+1 isSymmetrical := false end end j := j+1 end end return {Xbind, Ybind} For languages that cannot return sets, include the arrays as parameters by reference.
Print the Differences
editThis function will print the two sequences side by side commenting on the additions, changes and deletions made to the left (first) sequence, the bias sequence. Appropriate column formatting may be added.
function PrintDiff( X[1..m], Y[1..n], Xbind[1..m], Ybind[1..n] ) i, j := 1 while i ≤ m and j ≤ m if Xbind[i] = 0 and Ybind[j] > 0 Print X[i] + " Deleted" i := i+1 else if Xbind[i] > 0 and Ybind[y] = 0 Print Space(10) + Y[j] + " Added" j := j+1 else if Xbind[i] = 0 and Ybind[j] = 0 Print X[i] + " " + Y[j] + " Changed" i := i+1 j := j+1 else Print X[i] + " " + Y[j] i := i+1 j := j+1 end end if i ≤ m for i := i..m Print X[i] + " Deleted" end end if j ≤ n for j := j..n Print Space(10) + Y[j] + " Added" end end return 0