Edit distance

We studied the edit distance problem in class. I will repeat the problem definition and the basic facts that we derived.

The edit distance problem

A basic edit operation on a string is either to insert a character anywhere, to delete any character, or to replace any one character by another.

You are given two strings X and Y. You want to know the smallest number of basic edit operations that can be used to change X into Y. Define that to be distance(X, Y). For example, distance("cat", "cute") = 2, since you can change "cat" into "cute" by performing two basic edit operations. (Replace the a by u, then insert e at the end.)

Prefixes and prefix distances

Suppose that X and Y are two strings. Define X(i) to be the length i prefix of X, and Y(j) to be the length j prefix of Y. For example, if X is "kangaroo", then X(5) is "kanga".

Define pdist(X, Y, i, j) to be distance(X(i), Y(j)). That is, pdist(X, Y, i, j) is the edit distance between the length i prefix of X and the length j prefix of Y. For example, pdist("category", "cuter", 3, 4) = distance("cat", "cute") = 2.
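In Java, one way to form the prefix X(i) is with the substring method. The sketch below is only an illustration; the class name PrefixDemo and the helper prefix are made up here and are not part of the assignment.

```java
// Hypothetical illustration: PrefixDemo and prefix are names invented for this example.
class PrefixDemo {
    // Returns X(i), the length-i prefix of x.
    static String prefix(String x, int i) {
        // substring(0, i) keeps the characters at indices 0 through i-1.
        return x.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(prefix("kangaroo", 5));  // prints kanga
        System.out.println(prefix("category", 3));  // prints cat
        System.out.println(prefix("cuter", 4));     // prints cute
    }
}
```

Notice that X(0) is the empty string and X(X.length()) is all of X.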

Facts

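Suppose that the characters in strings X and Y are numbered starting at 0. So the first character of X is X0, the second character is X1, etc. In the facts below, Xi-1 means the character of X at index i-1, which is the last character of the prefix X(i). Similarly, Yj-1 is the last character of Y(j).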

We derived the following facts in class.

  1. pdist(X, Y, 0, j) = j

  2. pdist(X, Y, i, 0) = i

  3. pdist(X, Y, i, j) = pdist(X, Y, i-1, j-1)  provided Xi-1 = Yj-1

  4. pdist(X, Y, i, j) = 1 + min(pdist(X, Y, i-1, j-1), pdist(X, Y, i-1, j), pdist(X, Y, i, j-1))  provided Xi-1 =/= Yj-1

Algorithm 1

A straightforward algorithm is to use the above facts about the pdist function in a recursive function definition. Here is pseudo-code (not Java) for the recursive definition.

  pdist(X,Y,i,j)

      if i = 0 return j
      else if j = 0 return i
      else if Xi-1 = Yj-1 return pdist(X,Y,i-1,j-1)
      else return 1 + min(pdist(X,Y,i-1,j-1), pdist(X,Y,i-1,j), pdist(X,Y,i,j-1))

  distance(X,Y)
      return pdist(X, Y, X.length(), Y.length())

Algorithm 2

Algorithm 1 has the problem that it recomputes the same thing many times. That makes it very slow. Recomputation can be avoided by writing things down and using the results that you wrote down earlier rather than recomputing them.
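To get a feeling for how much repeated work there can be, here is a rough sketch that counts calls without computing any distances. It assumes the worst case, where the two strings have no characters in common, so every call with i > 0 and j > 0 makes three more calls (fact 4). The class name CallCountDemo and the function calls are invented for this illustration.

```java
import java.math.BigInteger;

// Hypothetical illustration: counts how many times the recursive pdist would be
// called on strings of lengths m and n, ASSUMING no two characters ever match.
class CallCountDemo {
    static BigInteger calls(int m, int n) {
        // c[i][j] = number of calls caused by one call pdist(X, Y, i, j).
        BigInteger[][] c = new BigInteger[m + 1][n + 1];
        for (int i = 0; i <= m; i++) {
            for (int j = 0; j <= n; j++) {
                if (i == 0 || j == 0) {
                    // Facts 1 and 2: the call returns right away.
                    c[i][j] = BigInteger.ONE;
                } else {
                    // The call itself, plus the three calls it makes.
                    c[i][j] = BigInteger.ONE
                            .add(c[i - 1][j - 1])
                            .add(c[i - 1][j])
                            .add(c[i][j - 1]);
                }
            }
        }
        return c[m][n];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 13; n++) {
            System.out.println("length " + n + ": " + calls(n, n) + " calls");
        }
    }
}
```

For example, calls(1, 1) is 4 and calls(2, 2) is 19. BigInteger is used because the counts soon become too large for an int.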

You can create a two-dimensional array (a grid) D where Di,j (the entry in row i and column j) will be set to hold pdist(X, Y, i, j). Then just look in the two-dimensional array to get the pdist values that you need. (Of course, you must have stored the values earlier.)

Java two-dimensional arrays are described below. In Java, you write D[i][j] to mean Di,j.

The following algorithm uses the same facts as the first algorithm, but avoids repeated computation of the same thing. Again, it is written in pseudo-code.

  distance(X,Y)

    xlen = X.length()
    ylen = Y.length()

    --Create a two-dimensional array D with xlen+1 rows and ylen+1 columns.
    --The idea is that D[i][j] will be set to hold pdist(X,Y,i,j) for
    --all i from 0 to xlen and j from 0 to ylen.

    D = new two-dimensional array with xlen + 1 rows and ylen + 1 columns

    --Using fact 1:

    for j = 0,...,ylen
      D[0][j] = j
    end for
          
    --Using fact 2:
   
    for i = 0,...,xlen
      D[i][0] = i
    end for
          
    --Using facts 3 and 4:

    for i = 1,...,xlen
      for j = 1,...,ylen
        if Xi-1 = Yj-1
          D[i][j] = D[i-1][j-1]
        else
          D[i][j] = 1 + min(D[i-1][j-1], D[i-1][j], D[i][j-1])
        end if
      end for
    end for
  
    return D[xlen][ylen]
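
As a check on the pseudo-code, here is the grid D worked out by hand for X = "cat" and Y = "cute". Row i is labeled by the last character of X(i) and column j by the last character of Y(j). The label "" marks the empty prefix.

          ""   c   u   t   e
    ""     0   1   2   3   4
    c      1   0   1   2   3
    a      2   1   1   2   3
    t      3   2   2   1   2

For instance, D[3][3] = D[2][2] = 1 by fact 3, since both prefixes end in t. Then D[3][4] = 1 + min(D[2][3], D[2][4], D[3][3]) = 1 + min(2, 3, 1) = 2 by fact 4, which agrees with distance("cat", "cute") = 2.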

The assignment

  1. Implement algorithm 1 in Java. Call your function distance1. Function Math.min(a, b) returns the smaller of two numbers a and b. How can you use it to get the smallest of three numbers?

    Write a short main program to test distance1. For example, a start might be

       public static void main(String[] args)
       {
         System.out.println("The distance between \"cat\" and \"cute\" is "
                            + distance1("cat", "cute"));
       }
    
    Try at least two examples, but keep them reasonably short. Do not proceed until you believe that distance1 is working correctly.

     

  2. Implement algorithm 2 in Java. Call your function distance2. Do not remove or modify distance1. Leave your distance1 function in your program. Just add this new one.

    Change your main program to call distance2. Test it. Do not proceed until it works.

     

  3. Comment out your main program that performs tests. Do not remove it.

     

  4. For this program, you will interact with the user using message boxes. They are described in Section 2.5 (page 100) of your text. The things needed here are also described at the bottom of this page.

    Write a main program that gets two strings X and Y from the user using message boxes. Using another message box, your program should ask the user for a number, either 1 or 2, telling which algorithm to use. Your main program should then pop up a box telling the edit distance between X and Y, using algorithm 1 if the number is 1, and algorithm 2 if the number is 2.

    Test your program before continuing.

     

  5. Try your algorithms on some long strings. Your algorithms will take the most time when the two strings do not contain any characters in common. For example, compute the edit distance between abcdefghijklm and nopqrstuvwxyz. Which algorithm appears to be more efficient? Is the difference significant? What do you think will happen if you use two strings of length 30, such as aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa and bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb?

Reminders about strings

You can get the character at index i of a string x using Java expression x.charAt(i). The characters are numbered starting at 0. So, where you see Xi in the pseudo-code, write X.charAt(i). Likewise, where you see Xi-1, write X.charAt(i-1).
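
Here is a small sketch of the kind of test that facts 3 and 4 call for. The class name CharAtDemo and the helper lastCharsMatch are invented for this example. Values of type char can be compared with ==.

```java
// Hypothetical illustration: CharAtDemo and lastCharsMatch are invented names.
class CharAtDemo {
    // True when the last character of the length-i prefix of x equals
    // the last character of the length-j prefix of y. Requires i > 0 and j > 0.
    static boolean lastCharsMatch(String x, String y, int i, int j) {
        return x.charAt(i - 1) == y.charAt(j - 1);
    }

    public static void main(String[] args) {
        System.out.println("cat".charAt(0));                      // prints c
        System.out.println(lastCharsMatch("cat", "cute", 1, 1));  // prints true  (c and c)
        System.out.println(lastCharsMatch("cat", "cute", 2, 2));  // prints false (a and u)
        System.out.println(lastCharsMatch("cat", "cute", 3, 3));  // prints true  (t and t)
    }
}
```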

Two-dimensional arrays

For algorithm 2, you need to use a two-dimensional array. A two-dimensional array of integers has type int[][]. So

  int[][] d;
creates a variable d that can refer to a two-dimensional array of integers. Now you need to build an actual array, and make d refer to it. Statement
  d = new int[m][n];
builds a two-dimensional array of integers with m rows and n columns, and puts that array into variable d. For example, assignment
  d = new int[2][3];
creates an array that looks like this:

  +---+---+---+
  |   |   |   |
  +---+---+---+
  |   |   |   |
  +---+---+---+

Each slot in the grid can hold one integer.

To use the value at row i and column j in array d, write d[i][j]. But remember that numbering is from 0, so the first row of the array is row 0, and the first column is column 0.

What a two-dimensional array really is

A two-dimensional array is really just an array of arrays. It is an array of its rows. If d is a two-dimensional array, then d.length is the number of rows and d[0].length is the number of columns in row 0.
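
The following sketch puts these pieces together. The class name GridDemo and the function makeGrid are invented for this example, and the values stored (10*i + j) are chosen only so that each slot shows its own row and column.

```java
// Hypothetical illustration: GridDemo and makeGrid are invented names.
class GridDemo {
    // Builds a grid with m rows and n columns where slot (i, j) holds 10*i + j.
    static int[][] makeGrid(int m, int n) {
        int[][] d = new int[m][n];   // every slot starts out holding 0
        for (int i = 0; i < d.length; i++) {          // d.length is the number of rows
            for (int j = 0; j < d[i].length; j++) {   // d[i].length is the number of columns in row i
                d[i][j] = 10 * i + j;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = makeGrid(2, 3);
        System.out.println(d.length);     // prints 2
        System.out.println(d[0].length);  // prints 3
        System.out.println(d[1][2]);      // prints 12
    }
}
```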

Message boxes using the Swing library

The Java library has been built up over time, and has different sections that were contributed by different groups, or at different times. One part is the Swing library, which helps you do graphics, and is popular and easy to use. To use Swing, you should put

  import javax.swing.*;
in your program.

Java statement

  String xstr = JOptionPane.showInputDialog("What is the first string?");
causes a message box to pop up that contains the text "What is the first string?", and also has a box where the user can type a string. Method showInputDialog returns when the user presses OK. The value returned (and here put into variable xstr) is the string that the user typed.

You will need to do three calls to JOptionPane.showInputDialog, one to get the first string, one to get the second string and one to get the algorithm number. The answer that showInputDialog produces will always be a string. Just ask whether the algorithm is "1" or "2". Remember to use the equals method to ask whether two strings are equal.
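
Here is one possible way to turn the user's answer into a number. The class name ChoiceDemo and the function algorithmNumber are invented for this sketch, and returning 0 for an unusable answer is just an assumption made here; you may prefer to handle bad input differently. Note that showInputDialog returns null if the user cancels the box.

```java
// Hypothetical illustration: ChoiceDemo and algorithmNumber are invented names.
class ChoiceDemo {
    // Returns 1 or 2 according to the answer, or 0 if the answer is neither.
    static int algorithmNumber(String answer) {
        if (answer == null) {
            return 0;                 // the user cancelled the box
        }
        if (answer.equals("1")) {     // use equals, not ==, to compare strings
            return 1;
        }
        if (answer.equals("2")) {
            return 2;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(algorithmNumber("1"));    // prints 1
        System.out.println(algorithmNumber("2"));    // prints 2
        System.out.println(algorithmNumber("two"));  // prints 0
    }
}
```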

When you have the answer, you just want to show a message, not ask for information. Use method JOptionPane.showMessageDialog. If msg is a string that you want to show, then

  JOptionPane.showMessageDialog(null, msg);
will pop up a box showing message msg, with an OK button to allow the user to say when he or she is done reading the message.

Important. When you use graphics, you need to shut down the graphics support when the program is done. At the end of main, add statement

  System.exit(0);
to shut everything down. If this statement is performed anywhere in a program, the entire program is stopped. You cannot do anything after this statement.
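
A minimal sketch showing the three pieces (an input box, a message box, and the shutdown) in one small program follows. It is not the program that the assignment asks for; it only reports the length of a string. The class name EchoDemo and the helper describe are invented for this example.

```java
import javax.swing.JOptionPane;

// Hypothetical illustration: EchoDemo and describe are invented names.
class EchoDemo {
    // Builds the message to show, for example: "cat" has 3 characters.
    static String describe(String s) {
        return "\"" + s + "\" has " + s.length() + " characters.";
    }

    public static void main(String[] args) {
        String s = JOptionPane.showInputDialog("Type a string:");
        if (s != null) {   // null means the user cancelled
            JOptionPane.showMessageDialog(null, describe(s));
        }
        System.exit(0);    // shut down the graphics support
    }
}
```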

Submitting your work

Test your program. Look at what it produces. Are the results correct?

When you are satisfied that it works, paste it into the box below and push the submit button.