Free Trial

Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.


  • Create BookmarkCreate Bookmark
  • Create Note or TagCreate Note or Tag
  • DownloadDownload
  • PrintPrint
Share this Page URL
Help

Chapter 17. String Matching > Understanding Levenshtein Word Distance

17.2. Understanding Levenshtein Word Distance

While phonetic coding such as Soundex is excellent for fuzzy-matching misspelled English names and even some minor spelling mistakes, it isn't very good at detecting large typing errors. For example, the Soundex values for "mistakes" and "msitakes" are the same, but the values for "shop" and "sjop" are not, even though transposing a "j" for an "h" is a common mistake—both letters are next to each other on a standard QWERTY keyboard.

The Levenshtein word distance (also known as edit distance) algorithm compares words for similarity by calculating the smallest number of insertions, deletions, and substitutions required to transform one string into another. You can then choose some limit—say, 4—below which the distance between two words is short enough to consider. Thus, the algorithm presented often forms the basis for a number of other techniques used in word processor spell-checking, DNA matching, and plagiarism detection.


  

You are currently reading a PREVIEW of this book.

                                                                                        

Get instant access to over
$1 million worth of books and videos.

  

Start a Free Trial