PHP Levenshtein distance

Posted by on December 5, 2011 in PHP, Programming | 0 comments

Levenshtein distance is a way of calculating how different two strings are. This function is extremely helpful in determining if a user made a mistake in entry, trying to find similar inputs, or comparing strings for any reason. The Levenshtein distance is a numeric value of how many changes are necessary to turn the one string into the other. There are default weights assigned to different operations like Capitalizing a letter or adding/removing letters.

The syntax is below:

$difference = levenshtein($string1, $string2);

Tips:

  • If you don’t care about capitalization, then just turn both strings into a lowercase version before comparing.
  • You can set the “cost” of the different operations with the three optional arguments.  ($insertion, $replacement, $deletion).   They’re all integers.  Replacement counts if you have to replace a lowercase letter with a capital one.  It’s the same cost as replacing the letter with a completely different letter.
  • I often find that the Levenshtein distance is not as useful for small strings, so I usually put a percentage that it has to meet in my comparison code that’s based off the string length.  For instance if the distance must be less than 20% of the string length, and the string is 10 characters, the Levenshtein distance must be less than 2.  This means for shorter strings, it actually makes the Levenshtein distance 0, which stops the comparison from thinking that “ask” and “task” are the same thing.  I’m not quite sure if there’s a solution to short words, but I’d love to hear it if anyone has one.
If you enjoyed this post, please consider leaving a comment or subscribing to the RSS feed to have future articles delivered to your feed reader.

Leave a Comment

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>