{"id":189,"date":"2007-12-21T13:03:26","date_gmt":"2007-12-21T18:03:26","guid":{"rendered":"https:\/\/blogs.mathworks.com\/steve\/2007\/12\/21\/cleaning-up-scanned-text-revisited\/"},"modified":"2019-10-23T15:30:10","modified_gmt":"2019-10-23T19:30:10","slug":"cleaning-up-scanned-text-revisited","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/steve\/2007\/12\/21\/cleaning-up-scanned-text-revisited\/","title":{"rendered":"Cleaning up scanned text &#8211; revisited"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <p>Have you ever used the distance transform?<\/p>\r\n   <p>For a binary image, the distance transform is the distance from every pixel to the nearest foreground (nonzero) pixel. (Sometimes\r\n      you'll see it defined the other way around.  It doesn't really matter that much; you just need to pay attention to whatever\r\n      convention is being used and complement the image as needed.)\r\n   <\/p>\r\n   <p><a href=\"https:\/\/blogs.mathworks.com\/steve\/2007\/12\/07\/cleaning-up-scanned-text\/\">Last week I posted<\/a> a method for cleaning up scanned text.  The method distinguished between small dots that were far away from or were close\r\n      to pixels belonging to large characters.  In that post I used dilation and logical operators to identify small dots that were\r\n      far away from characters.  Later it occurred to me that one could also use the distance transform for this purpose.\r\n   <\/p>\r\n   <p>Here again are the first few steps.  
Use <tt>bwareaopen<\/tt> to remove the small dots, and use a logical operator to identify the pixels that got removed.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">url = <span style=\"color: #A020F0\">'https:\/\/blogs.mathworks.com\/images\/steve\/186\/scanned_page.png'<\/span>;\r\npage = imread(url);\r\nbw = page(735:1280, 11:511);\r\nimshow(bw)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/189\/clean_dots_bwdist_demo_01.png\"> <pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">bw2 = imcomplement(bw);\r\nbw3 = bwareaopen(bw2, 8);\r\nremoved = xor(bw2, bw3);<\/pre><p>Next, use the distance transform to identify all pixels that are within a certain distance from foreground pixels in the image\r\n      <tt>bw3<\/tt>.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">D = bwdist(bw3);\r\nwithin_hailing_distance = D &lt;= 10;\r\nimshow(within_hailing_distance)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/189\/clean_dots_bwdist_demo_02.png\"> <p>Which removed pixels do we want to put back?<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">put_back_pixels = removed &amp; within_hailing_distance;<\/pre><p>Use a logical OR to put the pixels back.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">gotta_go_the_holiday_party_is_about_to_start = bw3 | put_back_pixels;\r\nimshow(~gotta_go_the_holiday_party_is_about_to_start)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/189\/clean_dots_bwdist_demo_03.png\"> <p>I suspect that not many people know about the \"extra\" feature that's tucked inside <tt>bwdist<\/tt>.  
Not only can <tt>bwdist<\/tt> compute the distance transform for you, but it can also compute a result sometimes called the \"feature transform.\"  For each\r\n      pixel, the feature transform tells you <i>which<\/i> foreground pixel is nearest.  You get the feature transform simply by using a second output argument when you call <tt>bwdist<\/tt>.\r\n   <\/p>\r\n   <p>Do you use this capability?  Or do you think you might have a use for it?  Please let me know.  I'd love to hear about it.<\/p><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\">Published with MATLAB&reg; 7.5<br><\/p>\r\n<\/div>","protected":false},"excerpt":{"rendered":"<p>\r\n   Have you ever used the distance transform?\r\n   For a binary image, the distance transform is the distance from every pixel to the nearest foreground (nonzero) pixel. (Sometimes\r\n      you'll see... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/steve\/2007\/12\/21\/cleaning-up-scanned-text-revisited\/\">read more >><\/a><\/p>","protected":false},"author":42,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[138,456,146,76,36,454],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/189"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/users\/42"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/comments?post=189"}],"version-history":[{"count":1,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/189\/revisions"}],"predecessor-version":[{"id":3580,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/189\/revisions\/3580"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/media?parent=189"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/categories?post=189"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/tags?post=189"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}