Steve on Image Processing

December 7th, 2007

Cleaning up scanned text

Earlier this year I exchanged e-mail with blog reader Craig Doolittle. Craig was writing MATLAB scripts to clean up scanned pages from old manuscripts. One of the samples he sent me was a page from "Fragmentation of Service Projectiles," N.F. Mott, J. H. Wilkinson, and T.H. Wise, Ministry of Supply, Armament Research Department, Theoretical Research Report No. 37/44, December 1944.

The image is too big to show at full resolution in this blog, so here's a thumbnail view.

url = 'http://blogs.mathworks.com/images/steve/186/scanned_page.png';
page = imread(url);
thumbnail = imresize(im2uint8(page), 'OutputSize', [256 NaN]);
imshow(thumbnail)

Craig wanted suggestions on how to clean up isolated "noise" dots without removing small dots that are part of characters. Let's look closely at a cropped portion of the page.

bw = page(735:1280, 11:511);
imshow(bw)

We could start by using bwareaopen to remove small dots. For example:

bw2 = imcomplement(bw);
bw3 = bwareaopen(bw2, 8);
imshow(imcomplement(bw3))

Unfortunately, this approach has removed portions of some of the characters. Here's a method using bwlabel and regionprops to highlight the pixels that were removed.

removed = xor(bw2, bw3);
L = bwlabel(removed);
s = regionprops(L, 'Centroid');
centroids = cat(1, s.Centroid);
imshow(bw)
hold on
plot(centroids(:,1), centroids(:,2), 'ro')
hold off

You can see that some of the removed dots were noise, while others were parts of the characters "e", "i", "m", etc.

My suggestion to Craig was to restore removed dots that are "close" to the characters remaining after bwareaopen. We can do this using dilation and some logical operators.

bw4 = imdilate(bw3, strel('disk', 5));
imshow(bw4)

Now do a logical AND of the dilated characters with the pixels removed by bwareaopen. These are the pixels we are going to put back.

overlaps = bw4 & removed;
imshow(overlaps)

Use a logical OR to restore the removed pixels.

bwout = imcomplement(bw3 | overlaps);
imshow(bwout)

I also suggested using morphological reconstruction to get all the pixels connected to the overlapping pixels found above. It doesn't seem to be really necessary here, though, so I'm going to save this technique for a future blog post, using a better example.
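For readers who want to experiment in the meantime, here is a minimal sketch of the reconstruction idea. It is not the worked example promised for the future post. It assumes imreconstruct from the Image Processing Toolbox, with the overlapping pixels as the marker and the removed pixels as the mask. The tiny synthetic image and the names fg, kept, near, restored, and cleaned are made up for illustration.

```matlab
% Synthetic stand-in for the cropped page (true = ink); this is not Craig's scan.
fg = false(40, 60);
fg(10:30, 10:14) = true;    % large stroke that bwareaopen keeps
fg(6:7, 11:13)   = true;    % small dot just above the stroke, like the dot of an "i"
fg(33:34, 50:51) = true;    % isolated noise dot far from any character

kept     = bwareaopen(fg, 8);                 % drop components smaller than 8 pixels
removed  = xor(fg, kept);                     % pixels that bwareaopen took away
near     = imdilate(kept, strel('disk', 5));  % neighborhood of the surviving characters
overlaps = near & removed;                    % may cover only part of a removed dot

% Grow each overlap back to the full removed component it belongs to.
restored = imreconstruct(overlaps, removed);
cleaned  = kept | restored;                   % characters plus their nearby dots
```

The difference from the logical AND alone is that a removed dot only partly covered by the dilated characters comes back whole, while removed components that the dilation never touches stay removed.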

I'm sure there are lots of different ways to approach this text clean-up problem. Does anyone have suggestions for other approaches?

Thanks for letting me use your example, Craig.



Published with MATLAB® 7.5

26 Responses to “Cleaning up scanned text”

  1. Etamar Laron replied on :

    Hi Craig & Steve,

    Sounds like a good approach if you have the computation power to support it. How about achieving the same result with a microprocessor and/or limited resources such as memory and processing power?

    I am under the impression it will pose many difficulties trying to perform it using inexpensive hardware, or am I wrong?

  2. Steve replied on :

    Etamar—I don’t see any operation here that couldn’t be performed with a fixed-point processor.

  3. Jason Brown replied on :

    I’ve used many of the operations in this example on a laptop from 3 years ago that wasn’t exactly top of the line when I purchased it. Don’t think there would be any issues as long as it’s being handled a page at a time.

    Are there any plans to further this project by filling in spaces where the ink has been rubbed off over the years? It would definitely make the end product more legible.

  4. Steve replied on :

    Jason—I don’t know what Craig’s plans are for this project. He and I last corresponded about this about 8 months ago.

  5. Hans replied on :

    How about using OCR techniques to clean up the text?
    Although I’m not sure how OCR can be used in combination with formulas. Just a thought…

  6. Renji replied on :

    Hi Steve,
    I am a final-year student in computer engineering. We are doing a main project in device control using hand gestures. The gestures are captured by a webcam and the image is converted to the YCbCr model before skin segmentation. Can you please give me some guidance on how to obtain the Cb and Cr values? Thank you, and I hope for a reply at the earliest.

  7. Steve replied on :

    Renji—See rgb2ycbcr.
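A minimal sketch of that suggestion, assuming an RGB frame of class uint8. The tiny synthetic image and the variable names are invented for illustration and do not come from Renji's project.

```matlab
% Tiny synthetic RGB image standing in for a webcam frame (hypothetical data).
rgb = zeros(2, 2, 3, 'uint8');
rgb(1, 1, :) = [128 128 128];   % neutral gray
rgb(1, 2, :) = [200 150 120];   % an arbitrarily chosen skin-like tone
rgb(2, 1, :) = [255 0 0];       % pure red
rgb(2, 2, :) = [0 0 255];       % pure blue

ycc = rgb2ycbcr(rgb);   % output has the same size and class as the input
cb  = ycc(:, :, 2);     % blue-difference chroma plane
cr  = ycc(:, :, 3);     % red-difference chroma plane
```

The second and third planes of the result are the Cb and Cr values; a skin segmentation step would then threshold those two planes.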

  8. Kevin Lan replied on :

    Hi Craig and Steve,
    I am currently analyzing an image that has a lot of layers. I used your method above to get rid of the distracting objects and I have only the layers now. However right now I want to measure the thickness of the layers. Each layer has inconsistent thickness over its different sections. Is there a way that I can measure the average thickness of the layer? What about measuring the average thickness of all the layers in the image? Please email me if you have a solution. Thank you so much!

  9. Steve replied on :

    Kevin—Can you segment the layers successfully? If so, you might try thinning the segmented layers using bwmorph, then compute the distance transform of the complemented segmentation, then evaluate the distance transform at the pixel locations in the thinned image.
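The steps in that reply could be sketched roughly as follows. This is only an illustration on a synthetic one-layer segmentation, not Kevin's data, and the conversion from distance value to thickness (2*D - 1) is an assumption about how to count the centerline pixel.

```matlab
% Synthetic segmentation: one horizontal layer, 7 pixels thick (hypothetical data).
layers = false(50, 80);
layers(20:26, :) = true;

centerline = bwmorph(layers, 'thin', Inf);  % thin the layer to a one-pixel-wide curve
D = bwdist(~layers);                        % distance from each layer pixel to the nearest background pixel
halfWidths = D(centerline);                 % sample the distance transform along the centerline

% Assumed conversion: the centerline pixel counts once, so thickness is about 2*D - 1.
meanThickness = mean(2 * halfWidths - 1);
```

For an image with several layers, labeling the segmentation first (for example with bwlabel) and repeating the measurement per label would give one average per layer.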

  10. Svendsen replied on :

    Hi Steve et al., after viewing this post I am wondering if Matlab/IPT could be used to somehow help clean up scanned texts that have been underlined by pen/pencil. Some ideas: 1. determine ‘channels’ between lines of text (where most underlines are done) and restrict cleaning to these channels (but many characters protrude into these channels); 2. look for pixels that can form horizontal straight line segments…
    Any idea?
    Thanks, JS

  11. Steve replied on :

    Svendsen—It sounds like some of the morphological operators might be useful, given the right structuring elements, but I don’t have specific suggestions for you.

  12. Alfian replied on :

    Hi Steve,

    At the moment, I’m working on how to extract text from images/video frames. I’ve managed to get the text areas out (plus a few false positives, but that doesnt matter)
    My problem now is… after binarization… most of the text is what I want them to be (BLACK TEXT on WHITE BACKGROUND)… but there are also quite important text that are the opposite (WHITE TEXT on BLACK BACKGROUND).

    I was playing around with the im2bw function, and also the ait_imneg function I downloaded somewhere on the Internet.

    Can you possibly help me out here Steve?

    Thanks a million :)

  13. Alfian replied on :

    hi there Steve :) I’ve been trying around to solve the question I posted. I was looking at the mean intensity within the text region, as well as outside the region. One paper I read says that this can be done to determine whether it’s inverse text (bright colored text with dark bground) or not. But still I am not getting anywhere. Sorry if its a bit off topic… but just thought u might be able to give any pointers. anyway, thanks for letting me post here. Cheers!

  14. smitha replied on :

    is there any matlab function to segment a portion of an image?

  15. Steve replied on :

    Alfian—You might try doing a hole-filling operation (imfill) after thresholding. Inverse text will show up as hole regions. Use bwareaopen to eliminate the holes inside normal text.
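One possible reading of that suggestion, sketched on synthetic data. It assumes the thresholded image marks dark pixels as true, and the 20-pixel cutoff is an arbitrary value picked for this toy image; neither detail comes from the reply itself.

```matlab
% Synthetic thresholded frame (true = dark pixel); hypothetical data.
dark = false(40, 80);
dark(10:16, 5:11)  = true;    % normal dark letter "o" ...
dark(12:14, 7:9)   = false;   % ... with a 9-pixel opening inside it
dark(5:30, 30:70)  = true;    % dark panel ...
dark(12:20, 40:50) = false;   % ... containing a bright 99-pixel "letter"

filled = imfill(dark, 'holes');        % fill every hole in the dark regions
holes  = filled & ~dark;               % keep only the hole pixels themselves
inverseText = bwareaopen(holes, 20);   % assumed cutoff: drop small holes inside normal letters
```

What remains in inverseText is the bright text on a dark background, which can then be combined with the ordinary dark text.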

  16. Steve replied on :

    Smitha—Image segmentation is really an algorithm development problem that usually must be customized for your own data. There are no “canned” routines that work for everything. For possible algorithms you might be able to use, take a look at the image analysis and segmentation chapters in Digital Image Processing Using MATLAB.

  17. Svendsen replied on :

    Hi Steve, after much trial & error, I’m closer to getting rid of underlines in scanned text, mostly with bwhitmiss. If you’re interested, I can show you a sample image and the Matlab code

  18. Alfian replied on :

    Thanks Steve. Ill give it a go… will hopefully let U know if I get results.

  19. paras replied on :

    i need to do skeletonization….if i use the command
    "bwmorph('img','skel',Inf)" it gives skeleton of white part…actually i hav a image in which thr is a black line on white sheet of paper ..so i need skeleton of black part…plz help me…

  20. DP replied on :

    Hi steve:
    is there any function in image processing toolbox to visualize a 3D image (image size 512×512×22)?

    Thank U.
    sincerely yours;
    damodar

  21. Steve replied on :

    Para—Complement the image before calling bwmorph. Use ~ or imcomplement.
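A minimal sketch of that answer, using a synthetic black-line-on-white image in place of the real scan. The image and variable names are hypothetical.

```matlab
% Synthetic "black line on white paper" image (true = white); hypothetical data.
img = true(30, 40);
img(14:16, 5:35) = false;            % a dark line, 3 pixels thick

% Complement first so that the dark line becomes the foreground that bwmorph skeletonizes.
skel = bwmorph(~img, 'skel', Inf);
```

Note that the image variable is passed directly, not as a quoted string.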

  22. Steve replied on :

    DP—You could try montage. You should also look at the 3-D visualization material in the MATLAB documentation.

  23. Bob replied on :

    DP, you might also check out the following File Exchange submission for ideas.
    http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=4879

    Cheers
    Bob

  24. DP replied on :

    Thank U.

  25. Etamar Laron replied on :

    Hi guys,

    To reopen my old question: do you think this algorithm can be performed in a reasonable time (several microseconds) using inexpensive hardware?

    By inexpensive I mean a $1-or-less 8051 or similar, not a laptop from three years ago.

    Appreciate your thoughts!

  26. Steve replied on :

    Etamar—I don’t have anything to add to my original answer. The operations could be readily implemented on a fixed-point processor, but I don’t know how long the operation might take. Did you really mean “several microseconds”? I assume you’re not really hoping to process hundreds of thousands of images per second. :-)



Steve Eddins manages the Image & Geospatial development team at The MathWorks and coauthored Digital Image Processing Using MATLAB. He writes here about image processing concepts, algorithm implementations, and MATLAB.


These postings are the author's and don't necessarily represent the opinions of The MathWorks.