{"id":186,"date":"2007-12-07T19:52:35","date_gmt":"2007-12-08T00:52:35","guid":{"rendered":"https:\/\/blogs.mathworks.com\/steve\/2007\/12\/07\/cleaning-up-scanned-text\/"},"modified":"2019-10-23T15:28:56","modified_gmt":"2019-10-23T19:28:56","slug":"cleaning-up-scanned-text","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/steve\/2007\/12\/07\/cleaning-up-scanned-text\/","title":{"rendered":"Cleaning up scanned text"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <p>Earlier this year I exchanged e-mail with blog reader Craig Doolittle. Craig was writing MATLAB scripts to clean up scanned\r\n      pages from old manuscripts.  One of the samples he sent me was a <a href=\"https:\/\/blogs.mathworks.com\/images\/steve\/186\/scanned_page.png\">page<\/a> from \"Fragmentation of Service Projectiles,\" N.F. Mott, J. H. Wilkinson, and T.H. Wise, Ministry of Supply, Armament Research\r\n      Department, Theoretical Research Report No. 
37\/44, December 1944.\r\n   <\/p>\r\n   <p>The image is too big to show at full resolution in this blog, so here's a thumbnail view.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">url = <span style=\"color: #A020F0\">'https:\/\/blogs.mathworks.com\/images\/steve\/186\/scanned_page.png'<\/span>;\r\npage = imread(url);\r\nthumbnail = imresize(im2uint8(page), <span style=\"color: #A020F0\">'OutputSize'<\/span>, [256 NaN]);\r\nimshow(thumbnail)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/186\/clean_dots_demo_01.png\"> <p>Craig wanted suggestions on how to clean up isolated \"noise\" dots without removing small dots that are part of characters.\r\n      Let's look closely at a cropped portion of the page.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">bw = page(735:1280, 11:511);\r\nimshow(bw)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/186\/clean_dots_demo_02.png\"> <p>We could start by using <tt>bwareaopen<\/tt> to remove small dots.  For example:\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">bw2 = imcomplement(bw);\r\nbw3 = bwareaopen(bw2, 8);\r\nimshow(imcomplement(bw3))<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/186\/clean_dots_demo_03.png\"> <p>Unfortunately, this approach has removed portions of some of the characters.  
Here's a method using <tt>bwlabel<\/tt> and <tt>regionprops<\/tt> to highlight the pixels that were removed.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">removed = xor(bw2, bw3);\r\nL = bwlabel(removed);\r\ns = regionprops(L, <span style=\"color: #A020F0\">'Centroid'<\/span>);\r\ncentroids = cat(1, s.Centroid);\r\nimshow(bw)\r\nhold <span style=\"color: #A020F0\">on<\/span>\r\nplot(centroids(:,1), centroids(:,2), <span style=\"color: #A020F0\">'ro'<\/span>)\r\nhold <span style=\"color: #A020F0\">off<\/span><\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/186\/clean_dots_demo_04.png\"> <p>You can see that some of the removed dots were noise, while others were parts of the characters \"e\", \"i\", \"m\", etc.<\/p>\r\n   <p>My suggestion to Craig was to restore removed dots that are \"close\" to the characters remaining after <tt>bwareaopen<\/tt>.  We can do this using dilation and some logical operators.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">bw4 = imdilate(bw3, strel(<span style=\"color: #A020F0\">'disk'<\/span>, 5));\r\nimshow(bw4)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/186\/clean_dots_demo_05.png\"> <p>Now do a logical AND of the dilated characters with the pixels removed by <tt>bwareaopen<\/tt>.  
These are the pixels we are going to put back.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">overlaps = bw4 &amp; removed;\r\nimshow(overlaps)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/186\/clean_dots_demo_06.png\"> <p>Use a logical OR to restore the removed pixels.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">bwout = imcomplement(bw3 | overlaps);\r\nimshow(bwout)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/steve\/186\/clean_dots_demo_07.png\"> <p>I also suggested using morphological reconstruction to get all the pixels connected to the overlapping pixels found above.\r\n       It doesn't seem to be really necessary here, though, so I'm going to save this technique for a future blog post, using a\r\n      better example.\r\n   <\/p>\r\n   <p>I'm sure there are lots of different ways to approach this text clean-up problem.  
Does anyone have suggestions for other\r\n      approaches?\r\n   <\/p>\r\n   <p><i>Thanks for letting me use your example, Craig.<\/i><\/p><script language=\"JavaScript\">\r\n<!--\r\n\r\n    function grabCode_da06d98513e74b4a80b71699e59acb2e() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='da06d98513e74b4a80b71699e59acb2e ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' da06d98513e74b4a80b71699e59acb2e';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Steve Eddins';\r\n        copyright = 'Copyright 2007 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }   \r\n      \r\n-->\r\n<\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_da06d98513e74b4a80b71699e59acb2e()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n            the MATLAB code \r\n            <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; 7.5<br><\/p>\r\n<\/div>\r\n<!--\r\nda06d98513e74b4a80b71699e59acb2e ##### SOURCE BEGIN #####\r\n%%\r\n% Earlier this year I exchanged e-mail with blog reader Craig\r\n% Doolittle. Craig was writing MATLAB scripts to clean up scanned\r\n% pages from old manuscripts.  One of the samples he sent me was \r\n% a <https:\/\/blogs.mathworks.com\/images\/steve\/186\/scanned_page.png \r\n% page> from \"Fragmentation of Service Projectiles,\" N.F. \r\n% Mott, J. H. Wilkinson, and T.H. Wise, Ministry of Supply, \r\n% Armament Research Department, Theoretical Research Report \r\n% No. 
37\/44, December 1944.\r\n%\r\n% The image is too big to show at full resolution in this blog,\r\n% so here's a thumbnail view.\r\n\r\nurl = 'https:\/\/blogs.mathworks.com\/images\/steve\/186\/scanned_page.png';\r\npage = imread(url);\r\nthumbnail = imresize(im2uint8(page), 'OutputSize', [256 NaN]);\r\nimshow(thumbnail)\r\n\r\n%%\r\n% Craig wanted suggestions on how to clean up isolated \"noise\"\r\n% dots without removing small dots that are part of characters.\r\n% Let's look closely at a cropped portion of the page.\r\n\r\nbw = page(735:1280, 11:511);\r\nimshow(bw)\r\n\r\n%%\r\n% We could start by using |bwareaopen| to remove small dots.\r\n% For example:\r\nbw2 = imcomplement(bw);\r\nbw3 = bwareaopen(bw2, 8);\r\nimshow(imcomplement(bw3))\r\n\r\n%%\r\n% Unfortunately, this approach has removed portions of some of\r\n% the characters.  Here's a method using |bwlabel| and\r\n% |regionprops| to highlight the pixels that were removed.\r\n\r\nremoved = xor(bw2, bw3);\r\nL = bwlabel(removed);\r\ns = regionprops(L, 'Centroid');\r\ncentroids = cat(1, s.Centroid);\r\nimshow(bw)\r\nhold on\r\nplot(centroids(:,1), centroids(:,2), 'ro')\r\nhold off\r\n\r\n%%\r\n% You can see that some of the removed dots were noise, while\r\n% others were parts of the characters \"e\", \"i\", \"m\", etc.\r\n%\r\n% My suggestion to Craig was to restore removed dots that are\r\n% \"close\" to the characters remaining after |bwareaopen|.  We \r\n% can do this using dilation and some logical operators.\r\n\r\nbw4 = imdilate(bw3, strel('disk', 5));\r\nimshow(bw4)\r\n\r\n%%\r\n% Now do a logical AND of the dilated characters with the pixels\r\n% removed by |bwareaopen|.  
These are the pixels we are going to\r\n% put back.\r\n\r\noverlaps = bw4 & removed;\r\nimshow(overlaps)\r\n\r\n%%\r\n% Use a logical OR to restore the removed pixels.\r\n\r\nbwout = imcomplement(bw3 | overlaps);\r\nimshow(bwout)\r\n\r\n%%\r\n% I also suggested using morphological reconstruction to get all\r\n% the pixels connected to the overlapping pixels found above.  It\r\n% doesn't seem to be really necessary here, though, so I'm going \r\n% to save this technique for a future blog post, using a better \r\n% example.\r\n%\r\n% I'm sure there are lots of different ways to approach this text\r\n% clean-up problem.  Does anyone have suggestions for other\r\n% approaches?\r\n%\r\n% _Thanks for letting me use your example, Craig._\r\n##### SOURCE END ##### da06d98513e74b4a80b71699e59acb2e\r\n-->","protected":false},"excerpt":{"rendered":"<p>\r\n   Earlier this year I exchanged e-mail with blog reader Craig Doolittle. Craig was writing MATLAB scripts to clean up scanned\r\n      pages from old manuscripts.  One of the samples he sent me was... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/steve\/2007\/12\/07\/cleaning-up-scanned-text\/\">read more >><\/a><\/p>","protected":false},"author":42,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[138,166,46,90,452,146,124,76,156,36,68,168,454],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/186"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/users\/42"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/comments?post=186"}],"version-history":[{"count":1,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/186\/revisions"}],"predecessor-version":[{"id":3578,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/posts\/186\/revisions\/3578"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/media?parent=186"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/categories?post=186"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/steve\/wp-json\/wp\/v2\/tags?post=186"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}