Tags

, , , ,

Often you need to create a Spell-checker as a part of larger project, maybe to indicate if user is writing something wrong, or try to parse the data correctly from the web. In this blogpost, we will see what all we need to create a simple spell check-cum-suggestion program that will find out if there’s something wrong, and suggest best possible matches.

Here’s how a Spell Checker works:

Image

City Checker

To simplify this blog post, we will take name of an Indian city from the user, compare it to the cities saved in a Text File, if incorrect, we will show the best possible matches.

Edit Distance

We use the concept of Edit Distance to find how different two strings are. Here, we’ll have to compare user’s entered city with each city in our list. We’ll find edit distance between them.

Levenshtein distance is a string metric for measuring the difference between two sequences. Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.

Eg, Levenshtein distance between Hello and Hallo is 1, as we just need to substitute a in place of e.

Levenshtein distance between “kitten” and “sitting” is 3 [wiki], since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of “s” for “k”)
  2. sitten → sittin (substitution of “i” for “e”)
  3. sittin → sitting (insertion of “g” at the end).

We create a simple JAVA class to compute Edit Distance

public class LevenshteinDistance {
  public static int computeDistance(String s1, String s2) {
      /**
       * Returns the edit distance needed to convert string s1 to s2
       * If returns 0, the strings are same
       * If returns 1, that means either a character is added, removed or replaced
       */
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }
}

To find distance between two strings, you will call

 LevenshteinDistance.computeDistance(string1, string2)

It will return zero if the strings are same.

To proceed, we ensure that we have a text file (dictionary) containing one city name per line, that we will use to verify and correct new names.

    static String cityfile = "citynames.txt";
    static ArrayList <String>citylist = new ArrayList<String>();
    static HashMap <String, Integer>newlist = new HashMap<String, Integer>();
    public static void loadcities(String filename) throws IOException{
        String line;
        BufferedReader br = new BufferedReader(new FileReader(filename));
        while((line = br.readLine()) != null){
            if(line.length()>2 && line.length()<26)
                citylist.add(line);
        }
    }

Currently we have a method called load cities that adds all the cities to an arraylist. Now comes the real part.

We will now create a method of signature

public static void citySuggester(String cityName)

This method has following tasks:
Compare the supplied cityName with all the cities in citylist using the Edit Distance measure we created earlier.  Then we create a HashMap called newlist that will store city names that are closer to the given cityName. Finally we sort the cities in newlist based on their distance and display them to the user.

Here’s the complete method:

    public static void citySuggester(String cityName){
        /**
         * Prints a list of cities that can replace the city name in a 
         * sorted list, along with their edit distances (difference in name)
         */
        int i;
        try {
            loadcities(cityfile);
        } catch (IOException e3) {
            // TODO Auto-generated catch block
            e3.printStackTrace();
        }
        for(String s : citylist){
            i = LevenshteinDistance.computeDistance(s, cityName);
            if(i<3){
                // Adjust i. The lesser, the more precise number of options
                newlist.put(s, i);
            }
        }

        // Sorting to get the best cities in the top
        // Don't play with this part
        List<Entry<String, Integer>> entries = new ArrayList<Entry<String, Integer>>(newlist.entrySet());
        Collections.sort(entries, new Comparator<Entry<String, Integer>>() {
            public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
                return e1.getValue().compareTo(e2.getValue());
            }
        });

        Map<String, Integer> orderedMap = new LinkedHashMap<String, Integer>();
        for (Entry<String, Integer> entry : entries) {
            orderedMap.put(entry.getKey(), entry.getValue());
        }

        // Display the list of cities        for (Entry<String, Integer> e : orderedMap.entrySet()){
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }

This was a genre-specific spell-checker. To create a generic spell-checker, you would create a dictionary comprising of all the possible words the user may use. This will include words from English language, user’s preferred jargaon, and if needed, names and other proper nouns.

This is a perfectly naive approach to create a spell-checker. Though this would suffice most academic needs, we have to build robust ways to optimize spell-checking and correction. Advanced spell and grammar checkers use the concept of language models to find the probability that a sentence is correct. To learn more, check the references.

References

Peter Norwig – How to Write a Spelling Corrector
Manning – Language Models for IR