{"id":1231,"date":"2015-10-14T10:14:19","date_gmt":"2015-10-14T15:14:19","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1231"},"modified":"2016-01-10T10:42:43","modified_gmt":"2016-01-10T15:42:43","slug":"40-year-old-algorithm-that-cannot-be-improved","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2015\/10\/14\/40-year-old-algorithm-that-cannot-be-improved\/","title":{"rendered":"40-year-old Algorithm That Cannot Be Improved"},"content":{"rendered":"\r\n<div class=\"content\"><!--introduction--><p>It is interesting to see how an obscure algorithm sometimes gets a break in news headlines like <a href=\"https:\/\/www.bostonglobe.com\/ideas\/2015\/08\/10\/computer-scientists-have-looked-for-solution-that-doesn-exist\/tXO0qNRnbKrClfUPmavifK\/story.html\">For 40 years, computer scientists looked for a solution that doesn&#8217;t exist<\/a> in the Boston Globe. Today's guest blogger, Toshi Takeuchi, introduces us to this curious computer algorithm.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/screen_shot.jpg\" alt=\"\"> <\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#25749fb8-6184-4c9a-9c1c-30d3f7a3d624\">Edit Distance<\/a><\/li><li><a href=\"#151caa68-d6db-411d-9e63-69e504313cf3\">Examples<\/a><\/li><li><a href=\"#6c43461c-87ca-4e6f-bac1-69242cbc74e7\">The Algorithm<\/a><\/li><li><a href=\"#b3000b05-ed70-44d3-8c08-70d5540b233e\">How about Sentences?<\/a><\/li><\/ul><\/div><h4>Edit Distance<a name=\"25749fb8-6184-4c9a-9c1c-30d3f7a3d624\"><\/a><\/h4><p>The Boston Globe article talks about an algorithm known as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Edit_distance\">edit distance<\/a> . This is actually an important algorithm used in natural language processing and computational biology for such applications as spell checking, OCR, and genomic sequencing. But this discovery is perhaps most interesting to computational biologists because the computational intensity grows in quadratic time with respect to the length of strings, and the human genome contains about 3 billion base pairs (and quadratic time of that to compute)! So it would have been enormously useful if there were faster alternatives.<\/p><p>Too bad there isn't a faster algorithm, but the article says this has been received with <b>\"something like relief among computer scientists, who can now stop beating their heads against a wall in search of a faster method that doesn't exist.\"<\/b><\/p><p>Are you ready to see how this fastest possible algorithm works?<\/p><h4>Examples<a name=\"151caa68-d6db-411d-9e63-69e504313cf3\"><\/a><\/h4><p>It is a distance measure between a pair of strings by the minimum number of edit operations required to transform one into another, as a way to quantify how similar or dissimilar they are.<\/p><p>For example, 'string' can be transformed into 'bring' by two edit operations &#8211; delete 's', and substitute 'b' for 't'.<\/p><pre>s t r i n g\r\n* b r i n g<\/pre><p>If you count each edit as 1, then this pair has the edit distance of 2.<\/p><p>Edit distances comes in different variants, but let's try this with the <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/editdist.m\"><tt>editdist<\/tt><\/a> function I created based on pseudocode I found on Wikipedia.<\/p><pre class=\"codeinput\">str1 = <span class=\"string\">'string'<\/span>;\r\nstr2 = <span class=\"string\">'bring'<\/span>;\r\neditdist(str1, str2)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n     2\r\n<\/pre><p>You can see this can be useful to determine the closest proper spelling of a word for spell correction or in a noisy scan using OCR.<\/p><p>Now let's try this for DNA sequence analysis to count the number of mutations, which are essentially 'typos'.<\/p><pre class=\"codeinput\">str1 = <span class=\"string\">'ACAAGATGCCATTGTCCACCGGCCTCCTGCCTGCTGCTCTCGGCCACCGCTGCCCTGCC'<\/span>;\r\nstr2 = <span class=\"string\">'ACCTGGAGGGTTTGTCCACCGGCCTCCTGCCTGCTCTCTCCGGCCACCGCTGCCCTGCC'<\/span>;\r\neditdist(str1, str2)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n     9\r\n<\/pre><h4>The Algorithm<a name=\"6c43461c-87ca-4e6f-bac1-69242cbc74e7\"><\/a><\/h4><p>Let's go back to the first example to see how the algorithm works. There are three edit operations you consider: insert, delete or substitute. You have seen the delete and substitute. If you want to go from 'bring' to 'string' instead, then you would have to insert 's'. The algorithm must search for a minimum effort path through a sequence of possible edits from the first character to the last, and this could be a huge search space!<\/p><p>The solution to this issue is a dynamic programming approach that uses, of course, a matrix! Let's use an example from the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Wagner%E2%80%93Fischer_algorithm\">Wagner&#8211;Fischer algorithm<\/a> Wikipedia page, which also provides the pseudocode I mentioned earlier.<\/p><p>In this example, I will reproduce the example matrix from Wikipedia using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/tables.html\"><tt>table<\/tt><\/a> in MATLAB. I substitute 'a' in 'saturday' with 'o' because you cannot repeat the same letter as row names or variable names in tables.<\/p><pre class=\"codeinput\">str1 = <span class=\"string\">'sunday'<\/span>;\r\nstr2 = <span class=\"string\">'soturday'<\/span>;\r\n[~, tbl, rows, cols] = editdist(str1, str2);\r\ndisp(array2table(tbl, <span class=\"string\">'RowNames'<\/span>, rows, <span class=\"string\">'VariableNames'<\/span>, cols))\r\n<\/pre><pre class=\"codeoutput\">            Null    s    o    t    u    r    d    a    y\r\n            ____    _    _    _    _    _    _    _    _\r\n    Null    0       1    2    3    4    5    6    7    8\r\n    s       1       0    1    2    3    4    5    6    7\r\n    u       2       1    1    2    2    3    4    5    6\r\n    n       3       2    2    2    3    3    4    5    6\r\n    d       4       3    3    3    3    4    3    4    5\r\n    a       5       4    4    4    4    4    4    3    4\r\n    y       6       5    5    5    5    5    5    4    3\r\n<\/pre><p>The first row or column represents a null string where each character is empty. You see 1 in the 's' row and 'Null' column because it represents a deletion of letter 's' in order to make it an empty character, and that counts as 1 edit operation, and 'u' row and 'Null' column is 2 because it is another deletion and now we have done 2 edits to turn 'su' to '**' if you consider '*' to be an empty character. So the first column and row increases from 0 to the length of the string because we are just adding up the number of delete (column) or insert (row) operations.<\/p><p>Now let's work on the rest of the matrix. For each element in the matrix, you just look at the 3 neighbor elements, add 1 to their values, and use the minimum of the three for the given element.<\/p><div><ul><li>the element to the left = previous cost for a delete operation + 1 ... this is the current deletion cost<\/li><li>the element above = previous cost for an insert operation + 1 ... this is the current insertion cost<\/li><li>the element to the upper left = previous cost for a substitute operation + 1 if the letters differ, or + 0 if the letters are the same ... this is the current substitution cost<\/li><\/ul><\/div><p>Let's work out the case of going from 's' to 's'.<\/p><div><ul><li>the element to the left is 1, so the current delete cost = 1+1 =2<\/li><li>the one above is also 1, so the current insertion cost = 1+1 =2<\/li><li>the one to the upper left is 0, and we are not changing the letter, so the current substitution cost = 0 + 0 = 0. This is also the minimum of the three, so this is 0.<\/li><\/ul><\/div><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/algorithm.png\" alt=\"\"> <\/p><p>Now, from 's' to 'so', delete is 3, insert is 1, and substitute is 2, so it is 1. You continue filling out the matrix and the value you have in the last element is the minimum edit distance.<\/p><p>If you insert 'o' and 't', and substitute 'n' with 'r', then 'sunday' is transformed to 'soturday' with a minimum of three edit operations, so this matches with the computed result.<\/p><pre>s * * u n d a y\r\ns o t u r d a y<\/pre><p>Check out my <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/editdist.m\"><tt>editdist<\/tt><\/a> function that implements this algorithm.<\/p><h4>How about Sentences?<a name=\"b3000b05-ed70-44d3-8c08-70d5540b233e\"><\/a><\/h4><p>We happen to use edit distance at a character level, but we can use it on a string level, say, to compare two sentences generated by machine translation or a speech recognition system, or perhaps to detect plagiarism.<\/p><pre>edit distance is a way of quantifying how dissimilar two strings\r\nedit distance is a string metric for measuring the difference between two sequences.<\/pre><p>Try modifying <tt>editdist<\/tt> to get it work on a string level and share your experience <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1231#respond\">here<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_755b7f7f3ff04aa5a9fc08cede2ab663() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='755b7f7f3ff04aa5a9fc08cede2ab663 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 755b7f7f3ff04aa5a9fc08cede2ab663';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2015 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_755b7f7f3ff04aa5a9fc08cede2ab663()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2015b<br><\/p><\/div><!--\r\n755b7f7f3ff04aa5a9fc08cede2ab663 ##### SOURCE BEGIN #####\r\n%% 40-year-old Algorithm That Cannot Be Improved\r\n% It is interesting to see how an obscure algorithm sometimes gets a break\r\n% in news headlines like\r\n% <http:\/\/www.bostonglobe.com\/ideas\/2015\/08\/10\/computer-scientists-have-looked-for-solution-that-doesn-exist\/\r\n% For 40 years, computer scientists looked for a solution that doesn\u00e2\u20ac\u2122t\r\n% exist> in the Boston Globe. Today's guest blogger, Toshi Takeuchi,\r\n% introduces us to this curious computer algorithm.\r\n%\r\n% <<screen_shot.jpg>>\r\n% \r\n%% Edit Distance\r\n% The Boston Globe article talks about an algorithm known as\r\n% <https:\/\/en.wikipedia.org\/wiki\/Edit_distance edit distance> . This is\r\n% actually an important algorithm used in natural language processing and\r\n% computational biology for such applications as spell checking, OCR, and\r\n% genomic sequencing. But this discovery is perhaps most interesting to\r\n% computational biologists because the computational intensity grows in\r\n% quadratic time with respect to the length of strings, and the human\r\n% genome contains about 3 billion base pairs (and quadratic time of that to\r\n% compute)! So it would have been enormously useful if there were faster\r\n% alternatives.\r\n%\r\n% Too bad there isn't a faster algorithm, but the article says this has\r\n% been received with *\"something like relief among computer scientists, who\r\n% can now stop beating their heads against a wall in search of a faster\r\n% method that doesn't exist.\"*\r\n%\r\n% Are you ready to see how this fastest possible algorithm works?\r\n%\r\n%% Examples\r\n% It is a distance measure between a pair of strings by the minimum number\r\n% of edit operations required to transform one into another, as a way to\r\n% quantify how similar or dissimilar they are.\r\n%\r\n% For example, 'string' can be transformed into 'bring' by two edit\r\n% operations \u00e2\u20ac\u201c delete 's', and substitute 'b' for 't'.\r\n%\r\n%  s t r i n g\r\n%  * b r i n g\r\n%  \r\n% If you count each edit as 1, then this pair has the edit distance of 2.\r\n%\r\n% Edit distances comes in different variants, but let's try this with the\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/editdist.m |editdist|>\r\n% function I created based on pseudocode I found on Wikipedia.\r\n\r\nstr1 = 'string';\r\nstr2 = 'bring';\r\neditdist(str1, str2)\r\n\r\n%%\r\n% You can see this can be useful to determine the closest proper spelling\r\n% of a word for spell correction or in a noisy scan using OCR.\r\n% \r\n% Now let's try this for DNA sequence analysis to count the number of\r\n% mutations, which are essentially 'typos'.\r\n\r\nstr1 = 'ACAAGATGCCATTGTCCACCGGCCTCCTGCCTGCTGCTCTCGGCCACCGCTGCCCTGCC';\r\nstr2 = 'ACCTGGAGGGTTTGTCCACCGGCCTCCTGCCTGCTCTCTCCGGCCACCGCTGCCCTGCC';\r\neditdist(str1, str2)\r\n\r\n%% The Algorithm\r\n% Let's go back to the first example to see how the algorithm works. There\r\n% are three edit operations you consider: insert, delete or substitute. You\r\n% have seen the delete and substitute. If you want to go from 'bring' to\r\n% 'string' instead, then you would have to insert 's'. The algorithm must\r\n% search for a minimum effort path through a sequence of possible edits\r\n% from the first character to the last, and this could be a huge search\r\n% space!\r\n%\r\n% The solution to this issue is a dynamic programming approach that uses,\r\n% of course, a matrix! Let's use an example from the\r\n% <https:\/\/en.wikipedia.org\/wiki\/Wagner%E2%80%93Fischer_algorithm\r\n% Wagner\u00e2\u20ac\u201cFischer algorithm> Wikipedia page, which also provides the\r\n% pseudocode I mentioned earlier.  \r\n%\r\n% In this example, I will reproduce the example matrix from Wikipedia using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/tables.html |table|> in MATLAB. I\r\n% substitute 'a' in 'saturday' with 'o' because you cannot repeat the same\r\n% letter as row names or variable names in tables.\r\n\r\nstr1 = 'sunday';                            \r\nstr2 = 'soturday'; \r\n[~, tbl, rows, cols] = editdist(str1, str2);\r\ndisp(array2table(tbl, 'RowNames', rows, 'VariableNames', cols))\r\n\r\n%%\r\n% The first row or column represents a null string where each character is\r\n% empty. You see 1 in the 's' row and 'Null' column because it represents a\r\n% deletion of letter 's' in order to make it an empty character, and that\r\n% counts as 1 edit operation, and 'u' row and 'Null' column is 2 because it\r\n% is another deletion and now we have done 2 edits to turn 'su' to '**' if\r\n% you consider '*' to be an empty character. So the first column and row\r\n% increases from 0 to the length of the string because we are just adding\r\n% up the number of delete (column) or insert (row) operations. \r\n%\r\n% Now let's work on the rest of the matrix. For each element in the matrix,\r\n% you just look at the 3 neighbor elements, add 1 to their values, and use\r\n% the minimum of the three for the given element. \r\n%\r\n% * the element to the left = previous cost for a delete operation + 1 ...\r\n% this is the current deletion cost\r\n% * the element above = previous cost for an insert operation + 1 ... \r\n% this is the current insertion cost\r\n% * the element to the upper left = previous cost for a substitute\r\n% operation + 1 if the letters differ, or + 0 if the letters are the same\r\n% ... this is the current substitution cost\r\n%\r\n% Let's work out the case of going from 's' to 's'.\r\n%\r\n% * the element to the left is 1, so the current delete cost = 1+1 =2\r\n% * the one above is also 1, so the current insertion cost = 1+1 =2\r\n% * the one to the upper left is 0, and we are not changing the letter, so\r\n% the current substitution cost = 0 + 0 = 0. This is also the minimum of\r\n% the three, so this is 0.\r\n%\r\n% <<algorithm.png>>\r\n%\r\n% Now, from 's' to 'so', delete is 3, insert is 1, and substitute is 2, so\r\n% it is 1. You continue filling out the matrix and the value you have in\r\n% the last element is the minimum edit distance.\r\n%\r\n% If you insert 'o' and 't', and substitute 'n' with 'r', then 'sunday' is\r\n% transformed to 'soturday' with a minimum of three edit operations, so\r\n% this matches with the computed result.\r\n%\r\n%  s * * u n d a y\r\n%  s o t u r d a y\r\n%\r\n% Check out my <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/editdist.m\r\n% |editdist|> function that implements this algorithm. \r\n\r\n%% How about Sentences?\r\n% We happen to use edit distance at a character level, but we can use it on\r\n% a string level, say, to compare two sentences generated by machine\r\n% translation or a speech recognition system, or perhaps to detect\r\n% plagiarism.\r\n% \r\n%  edit distance is a way of quantifying how dissimilar two strings\r\n%  edit distance is a string metric for measuring the difference between two sequences.\r\n%\r\n% Try modifying |editdist| to get it work on a string level and share your\r\n% experience <https:\/\/blogs.mathworks.com\/loren\/?p=1231#respond here>! \r\n \r\n\r\n##### SOURCE END ##### 755b7f7f3ff04aa5a9fc08cede2ab663\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/algorithm.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>It is interesting to see how an obscure algorithm sometimes gets a break in news headlines like <a href=\"https:\/\/www.bostonglobe.com\/ideas\/2015\/08\/10\/computer-scientists-have-looked-for-solution-that-doesn-exist\/tXO0qNRnbKrClfUPmavifK\/story.html\">For 40 years, computer scientists looked for a solution that doesn&#8217;t exist<\/a> in the Boston Globe. Today's guest blogger, Toshi Takeuchi, introduces us to this curious computer algorithm.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/10\/14\/40-year-old-algorithm-that-cannot-be-improved\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,33,1],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1231"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1231"}],"version-history":[{"count":4,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1231\/revisions"}],"predecessor-version":[{"id":1301,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1231\/revisions\/1301"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1231"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1231"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1231"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}