{"id":2637,"date":"2017-08-14T12:00:43","date_gmt":"2017-08-14T17:00:43","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=2637"},"modified":"2017-08-13T16:06:06","modified_gmt":"2017-08-13T21:06:06","slug":"levenshtein-edit-distance-between-strings","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2017\/08\/14\/levenshtein-edit-distance-between-strings\/","title":{"rendered":"Levenshtein Edit Distance Between Strings"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>How can you measure the distance between two words? How can you find the closest match to a word in a list of words? The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the <i>edit distance<\/i>.<\/p><p>Vladimir Levenshtein is a Russian mathematician who published this notion in 1966.  I am using his distance measure in a project that I will describe in a future post.  Other applications include optical character recognition (OCR), spell checking, and DNA sequencing.<\/p><p>I learned about Levenshtein distance from the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\">Wikipedia page<\/a>.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#1be47676-fc05-41a0-8a18-2c363c1e4db9\">\"Hi ho, hi ho\"<\/a><\/li><li><a href=\"#0d02b633-677c-4401-a1a6-d667d1c11e69\">Recursive<\/a><\/li><li><a href=\"#1251cb0c-96bd-4920-933c-774c9f41c07a\">Matrix Memory<\/a><\/li><li><a href=\"#c5c79cf6-4073-422d-9b0b-cbbecbda0808\">Details<\/a><\/li><li><a href=\"#072efa50-5775-4472-b560-9fe44a547166\">Best Program<\/a><\/li><li><a href=\"#d5678626-ea93-4f5f-876b-afc9486c3411\">Metric Space<\/a><\/li><li><a href=\"#dfdc149c-26a6-4b35-9650-2b5377bbfca1\">44 Programming Languages<\/a><\/li><\/ul><\/div><h4>\"Hi ho, hi ho\"<a name=\"1be47676-fc05-41a0-8a18-2c363c1e4db9\"><\/a><\/h4><p>For my examples, I will use these seven famous names.<\/p><pre class=\"codeinput\">   T = {<span class=\"string\">\"Doc\"<\/span>,<span class=\"string\">\"Grumpy\"<\/span>,<span class=\"string\">\"Happy\"<\/span>,<span class=\"string\">\"Sleepy\"<\/span>,<span class=\"string\">\"Bashful\"<\/span>,<span class=\"string\">\"Sneezy\"<\/span>,<span class=\"string\">\"Dopey\"<\/span>}\r\n<\/pre><pre class=\"codeoutput\">T =\r\n  1&times;7 cell array\r\n  Columns 1 through 5\r\n    [\"Doc\"]    [\"Grumpy\"]    [\"Happy\"]    [\"Sleepy\"]    [\"Bashful\"]\r\n  Columns 6 through 7\r\n    [\"Sneezy\"]    [\"Dopey\"]\r\n<\/pre><p>The strings \"Sleepy\" and \"Sneezy\" are close to each other because they are the same length, and we can transform \"Sleepy\" to \"Sneezy\" by two edit operations, change 'l' to 'n' and 'p' to 'z'.  The Levenshtein distance between these two words is 2.<\/p><p>On the other hand, \"Bashful\" is not close to his friends.  His name is longer and the only letter he shares with another is an 'a' with \"Happy\".  So his distance to \"Happy\" is 6, while the distance to any of the others is 7, the length of his name.<\/p><p>Here is the matrix of pair-wise distances<\/p><pre class=\"codeinput\">   <span class=\"keyword\">for<\/span> i = 1:7\r\n       <span class=\"keyword\">for<\/span> j = 1:7\r\n           D(i,j) = lev(T{i},T{j});\r\n       <span class=\"keyword\">end<\/span>\r\n   <span class=\"keyword\">end<\/span>\r\n   D\r\n<\/pre><pre class=\"codeoutput\">D =\r\n     0     6     5     6     7     6     3\r\n     6     0     4     4     7     5     5\r\n     5     4     0     4     6     5     3\r\n     6     4     4     0     7     2     4\r\n     7     7     6     7     0     7     7\r\n     6     5     5     2     7     0     4\r\n     3     5     3     4     7     4     0\r\n<\/pre><p>Bashful's column is all 7's, except for a 6 in Happy's row and a 0 on the diagonal.<\/p><p>A bar graph of the total distances shows Bashful's name is the furthest from all the others, and that Dopey's is the closest, but just barely.<\/p><pre class=\"codeinput\">   bar(sum(D))\r\n   title(<span class=\"string\">'Total Levenshtein Distance'<\/span>)\r\n   set(gca,<span class=\"string\">'xticklabels'<\/span>,T)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/levenshtein_blog_01.png\" alt=\"\"> <h4>Recursive<a name=\"0d02b633-677c-4401-a1a6-d667d1c11e69\"><\/a><\/h4><p>Here is a recursive program that provides one way to compute <tt>lev(s,t)<\/tt>. It compares each character in a string with all of the characters in the other string.  For each comparison, a cost <tt>c<\/tt> is added to a total that is accumulated by the recursion.  The cost of one comparison is 0 if the pair is a match and 1 if it isn't.<\/p><pre class=\"codeinput\">   type <span class=\"string\">levr<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n\r\nfunction d = levr(s,t,m,n)\r\n% Levenshtein distance between strings, recursive implementation.\r\n% levr(s,t) is the number of deletions, insertions,\r\n% or substitutions required to transform s to t.\r\n% m = length(s), n = length(t), levr(s,t,m,n) initiates recursion.\r\n% https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\r\n\r\n    if nargin == 2\r\n        s = char(s);\r\n        t = char(t);\r\n        d = levr(s,t,length(s),length(t));\r\n    elseif m == 0\r\n        d = n;\r\n    elseif n == 0\r\n        d = m;\r\n    else\r\n        c = s(m) ~= t(n); % c = 0 if chars match, 1 if not.\r\n        d = min([levr(s,t,m-1,n) + 1\r\n                 levr(s,t,m,n-1) + 1\r\n                 levr(s,t,m-1,n-1) + c]);\r\n    end\r\nend\r\n        \r\n        \r\n<\/pre><p>Like all recursive programs, this code is impractical to use in practice with long strings or long lists of words because it repeatedly compares the same pairs of characters.  The complexity is exponential in the lengths of the strings.<\/p><h4>Matrix Memory<a name=\"1251cb0c-96bd-4920-933c-774c9f41c07a\"><\/a><\/h4><p>A time-memory tradeoff can be made by allocating storage to save the results of individual comparisons.  The memory involved is a matrix of size (m+1)-by-(n+1) where m and n are the lengths of the two strings, so it's O(m*n). The time complexity is also O(m*n).<\/p><pre class=\"codeinput\">    type <span class=\"string\">levm<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n\r\nfunction d = levm(s,t)\r\n% Levenshtein distance between strings, matrix implementation.\r\n% levr(s,t) is the number of deletions, insertions,\r\n% or substitutions required to transform s to t.\r\n% https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\r\n\r\n    s = char(s);\r\n    t = char(t);\r\n    m = length(s);\r\n    n = length(t);\r\n    D = zeros(m+1,n+1);\r\n    D(:,1) = (0:m)';\r\n    D(1,:) = (0:n);\r\n    for i = 1:m\r\n        for j = 1:n\r\n            c = s(i) ~= t(j); % c = 0 if chars match, 1 if not.\r\n            D(i+1,j+1) = min([D(i,j+1) + 1\r\n                              D(i+1,j) + 1\r\n                              D(i,j)  +  c]);\r\n        end\r\n    end\r\n    levm_print(s,t,D)    \r\n    d = D(m+1,n+1);\r\nend          \r\n<\/pre><h4>Details<a name=\"c5c79cf6-4073-422d-9b0b-cbbecbda0808\"><\/a><\/h4><p>I've included a print routine so we can see some detail. Let's begin by finding the distance between a single letter and a word that doesn't contain that letter.  The distance is the length n of the word because one substitution and n-1 deletions are required.<\/p><pre class=\"codeinput\">    d = levm(<span class=\"string\">\"S\"<\/span>,<span class=\"string\">\"Dopey\"<\/span>)\r\n<\/pre><pre class=\"codeoutput\">\r\n         D    o    p    e    y\r\n    S    1    2    3    4    5\r\n\r\nd =\r\n     5\r\n<\/pre><p>Now let's have one character match.  In this case the character is 'e'.<\/p><pre class=\"codeinput\">    d = levm(<span class=\"string\">\"Sle\"<\/span>,<span class=\"string\">\"Dopey\"<\/span>)\r\n<\/pre><pre class=\"codeoutput\">\r\n         D    o    p    e    y\r\n    S    1    2    3    4    5\r\n    l    2    2    3    4    5\r\n    e    3    3    3    3    4\r\n\r\nd =\r\n     4\r\n<\/pre><p>Finally, two full words. The distance is the last entry in the last row or column of the matrix.<\/p><pre class=\"codeinput\">    d = levm(<span class=\"string\">\"Sleepy\"<\/span>,<span class=\"string\">\"Dopey\"<\/span>)\r\n<\/pre><pre class=\"codeoutput\">\r\n         D    o    p    e    y\r\n    S    1    2    3    4    5\r\n    l    2    2    3    4    5\r\n    e    3    3    3    3    4\r\n    e    4    4    4    3    4\r\n    p    5    5    4    4    4\r\n    y    6    6    5    5    4\r\n\r\nd =\r\n     4\r\n<\/pre><h4>Best Program<a name=\"072efa50-5775-4472-b560-9fe44a547166\"><\/a><\/h4><p>We don't need storage for the whole matrix, just two rows. The storage cost is now linear in the lengths of the strings. This is the most efficient of my three functions.<\/p><pre class=\"codeinput\">   type <span class=\"string\">lev<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction d = lev(s,t)\r\n% Levenshtein distance between strings or char arrays.\r\n% lev(s,t) is the number of deletions, insertions,\r\n% or substitutions required to transform s to t.\r\n% https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\r\n\r\n    s = char(s);\r\n    t = char(t);\r\n    m = length(s);\r\n    n = length(t);\r\n    x = 0:n;\r\n    y = zeros(1,n+1);   \r\n    for i = 1:m\r\n        y(1) = i;\r\n        for j = 1:n\r\n            c = (s(i) ~= t(j)); % c = 0 if chars match, 1 if not.\r\n            y(j+1) = min([y(j) + 1\r\n                          x(j+1) + 1\r\n                          x(j) + c]);\r\n        end\r\n        % swap\r\n        [x,y] = deal(y,x);\r\n    end\r\n    d = x(n+1);\r\nend\r\n        \r\n        \r\n<\/pre><h4>Metric Space<a name=\"d5678626-ea93-4f5f-876b-afc9486c3411\"><\/a><\/h4><p>The function <tt>lev<\/tt> makes the set of strings a <i>metric space<\/i>. That's because of four properties.  The distance from an element to itself is zero.<\/p><pre class=\"language-matlab\">d(x,x) = 0\r\n<\/pre><p>Otherwise the distance is positive.<\/p><pre class=\"language-matlab\">d(x,y) &gt; 0 <span class=\"keyword\">if<\/span> x ~= y\r\n<\/pre><p>Distance is <i>symmetric<\/i>.<\/p><pre class=\"language-matlab\">d(x,y) = d(y,x)\r\n<\/pre><p>The triangle inequality, for any three elements,<\/p><pre class=\"language-matlab\">d(x,y) &lt;= d(x,w) + d(w,y)\r\n<\/pre><h4>44 Programming Languages<a name=\"dfdc149c-26a6-4b35-9650-2b5377bbfca1\"><\/a><\/h4><p>I usually avoid programming language debates, but <a href=\"https:\/\/en.wikibooks.org\/wiki\/Algorithm_Implementation\/Strings\/Levenshtein_distance\">here<\/a> are implementations in 44 different programming languages, including one in MATLAB.  I prefer my own code.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_6edce509ea604c6fbef7aee4ee314bbd() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='6edce509ea604c6fbef7aee4ee314bbd ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 6edce509ea604c6fbef7aee4ee314bbd';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_6edce509ea604c6fbef7aee4ee314bbd()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\n6edce509ea604c6fbef7aee4ee314bbd ##### SOURCE BEGIN #####\r\n%% Levenshtein Edit Distance Between Strings\r\n% How can you measure the distance between two words?\r\n% How can you find the closest match to a word in a list of words?\r\n% The Levenshtein distance between two strings is the number of\r\n% single character deletions, insertions, or substitutions required \r\n% to transform one string into the other.\r\n% This is also known as the _edit distance_.\r\n%\r\n% Vladimir Levenshtein is a Russian mathematician who published this\r\n% notion in 1966.  I am using his distance measure in a project that I\r\n% will describe in a future post.  Other applications include optical\r\n% character recognition (OCR), spell checking, and DNA sequencing.\r\n%\r\n% I learned about Levenshtein distance from the\r\n% <https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance Wikipedia page>.\r\n\r\n%% \"Hi ho, hi ho\"\r\n% For my examples, I will use these seven famous names.\r\n\r\n   T = {\"Doc\",\"Grumpy\",\"Happy\",\"Sleepy\",\"Bashful\",\"Sneezy\",\"Dopey\"}\r\n   \r\n%%\r\n% The strings \"Sleepy\" and \"Sneezy\" are close to each other because they\r\n% are the same length, and we can transform \"Sleepy\" to \"Sneezy\" by two\r\n% edit operations, change 'l' to 'n' and 'p' to 'z'.  The Levenshtein\r\n% distance between these two words is 2.\r\n\r\n%%\r\n% On the other hand, \"Bashful\" is not close to his friends.  His name\r\n% is longer and the only letter he shares with another is an 'a' with\r\n% \"Happy\".  So his distance to \"Happy\" is 6, while the distance to\r\n% any of the others is 7, the length of his name.\r\n\r\n%%\r\n% Here is the matrix of pair-wise distances\r\n\r\n   for i = 1:7\r\n       for j = 1:7\r\n           D(i,j) = lev(T{i},T{j});\r\n       end\r\n   end\r\n   D\r\n\r\n%%\r\n% Bashful's column is all 7's, except for a 6 in Happy's row and a 0\r\n% on the diagonal.\r\n\r\n%%\r\n% A bar graph of the total distances shows Bashful's name is the\r\n% furthest from all the others, and that Dopey's is the closest,\r\n% but just barely.\r\n\r\n   bar(sum(D))\r\n   title('Total Levenshtein Distance')\r\n   set(gca,'xticklabels',T)\r\n   \r\n%% Recursive\r\n% Here is a recursive program that provides one way to compute |lev(s,t)|.\r\n% It compares each character in a string with all of the characters in\r\n% the other string.  For each comparison, a cost |c| is added to a\r\n% total that is accumulated by the recursion.  The cost of one comparison\r\n% is 0 if the pair is a match and 1 if it isn't.\r\n\r\n   type levr\r\n   \r\n%%\r\n% Like all recursive programs, this code is impractical to use in\r\n% practice with long strings or long lists of words because it \r\n% repeatedly compares the same pairs of characters.  The complexity\r\n% is exponential in the lengths of the strings.\r\n   \r\n%% Matrix Memory\r\n% A time-memory tradeoff can be made by allocating storage to save the\r\n% results of individual comparisons.  The memory involved is a matrix of \r\n% size (m+1)-by-(n+1) where m and n are the lengths of the two strings,\r\n% so it's O(m*n). The time complexity is also O(m*n).\r\n\r\n    type levm\r\n    \r\n%% Details\r\n% I've included a print routine so we can see some detail.\r\n% Let's begin by finding the distance between a single letter and a word\r\n% that doesn't contain that letter.  The distance is the length n of the\r\n% word because one substitution and n-1 deletions are required.\r\n\r\n    d = levm(\"S\",\"Dopey\")\r\n    \r\n%%\r\n% Now let's have one character match.  In this case the character is 'e'.\r\n\r\n    d = levm(\"Sle\",\"Dopey\")\r\n    \r\n%%\r\n% Finally, two full words. The distance is the last entry in the last\r\n% row or column of the matrix.\r\n    \r\n    d = levm(\"Sleepy\",\"Dopey\")\r\n    \r\n%% Best Program\r\n% We don't need storage for the whole matrix, just two rows.\r\n% The storage cost is now linear in the lengths of the strings.\r\n% This is the most efficient of my three functions.\r\n\r\n   type lev\r\n   \r\n%% Metric Space\r\n% The function |lev| makes the set of strings a _metric space_.\r\n% That's because of four properties.  The distance from an element to\r\n% itself is zero.\r\n%\r\n%   d(x,x) = 0\r\n%\r\n% Otherwise the distance is positive.\r\n%\r\n%   d(x,y) > 0 if x ~= y\r\n%\r\n% Distance is _symmetric_.\r\n%\r\n%   d(x,y) = d(y,x)\r\n%\r\n% The triangle inequality, for any three elements,\r\n%\r\n%   d(x,y) <= d(x,w) + d(w,y)\r\n\r\n%% 44 Programming Languages\r\n% I usually avoid programming language debates, but\r\n% <https:\/\/en.wikibooks.org\/wiki\/Algorithm_Implementation\/Strings\/Levenshtein_distance\r\n% here> are implementations in 44 different programming languages,\r\n% including one in MATLAB.  I prefer my own code.\r\n##### SOURCE END ##### 6edce509ea604c6fbef7aee4ee314bbd\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/levenshtein_blog_01.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>How can you measure the distance between two words? How can you find the closest match to a word in a list of words? The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the <i>edit distance<\/i>.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/08\/14\/levenshtein-edit-distance-between-strings\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[5,4,6,31],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2637"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=2637"}],"version-history":[{"count":6,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2637\/revisions"}],"predecessor-version":[{"id":2644,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2637\/revisions\/2644"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=2637"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=2637"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=2637"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}