Steve on Image Processing

December 7th, 2007

Cleaning up scanned text

Earlier this year I exchanged e-mail with blog reader Craig Doolittle. Craig was writing MATLAB scripts to clean up scanned pages from old manuscripts. One of the samples he sent me was a page from "Fragmentation of Service Projectiles," N.F. Mott, J. H. Wilkinson, and T.H. Wise, Ministry of Supply, Armament Research Department, Theoretical Research Report No. 37/44, December 1944.

The image is too big to show at full resolution in this blog, so here's a thumbnail view.

url = 'http://blogs.mathworks.com/images/steve/186/scanned_page.png';
page = imread(url);
thumbnail = imresize(im2uint8(page), 'OutputSize', [256 NaN]);
imshow(thumbnail)

Craig wanted suggestions on how to clean up isolated "noise" dots without removing small dots that are part of characters. Let's look closely at a cropped portion of the page.

bw = page(735:1280, 11:511);
imshow(bw)

We could start by using bwareaopen to remove small dots. For example:

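% bwareaopen treats true (white) pixels as foreground, so complement the
% page to make the black text the foreground.
bw2 = imcomplement(bw);
% Remove connected components that have fewer than 8 pixels.
bw3 = bwareaopen(bw2, 8);
imshow(imcomplement(bw3))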

Unfortunately, this approach has removed portions of some of the characters. Here's a method using bwlabel and regionprops to highlight the pixels that were removed.

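% Pixels present before bwareaopen but absent afterward.
removed = xor(bw2, bw3);
% Label each removed blob and find its centroid.
L = bwlabel(removed);
s = regionprops(L, 'Centroid');
centroids = cat(1, s.Centroid);
% Mark each removed blob with a red circle on the original crop.
imshow(bw)
hold on
plot(centroids(:,1), centroids(:,2), 'ro')
hold off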

You can see that some of the removed dots were noise, while others were parts of the characters "e", "i", "m", etc.

My suggestion to Craig was to restore removed dots that are "close" to the characters remaining after bwareaopen. We can do this using dilation and some logical operators.

bw4 = imdilate(bw3, strel('disk', 5));
imshow(bw4)

Now do a logical AND of the dilated characters with the pixels removed by bwareaopen. These are the pixels we are going to put back.

overlaps = bw4 & removed;
imshow(overlaps)

Use a logical OR to restore the removed pixels.

bwout = imcomplement(bw3 | overlaps);
imshow(bwout)

I also suggested using morphological reconstruction to get all the pixels connected to the overlapping pixels found above. It doesn't seem to be really necessary here, though, so I'm going to save this technique for a future blog post, using a better example.
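For readers curious about the reconstruction idea, here is a minimal sketch. It is not the code from the correspondence with Craig. It assumes imreconstruct is used with the overlapping pixels as the marker and the removed pixels as the mask, and it runs on a small synthetic image (a hypothetical stand-in for the scanned page) so that it is self-contained.

```matlab
% Synthetic foreground (true = ink), standing in for bw2 in the post.
bw2 = false(40, 60);
bw2(10:20, 10:20) = true;     % large "character" (121 pixels)
bw2(15, 23:29)    = true;     % thin 7-pixel fragment next to the character
bw2(35:36, 45:46) = true;     % isolated 4-pixel noise dot

% Same steps as in the post.
bw3      = bwareaopen(bw2, 8);              % drops the fragment and the noise dot
removed  = xor(bw2, bw3);
bw4      = imdilate(bw3, strel('disk', 5));
overlaps = bw4 & removed;                   % only the near end of the fragment

% Reconstruction: grow the overlapping pixels (marker) inside the removed
% pixels (mask), recovering every removed blob that touches the dilation.
restored = imreconstruct(overlaps, removed);
bwout    = bw3 | restored;
```

In this synthetic case the dilation reaches only part of the thin fragment, so the plain logical AND restores it partially, while reconstruction brings back the whole fragment and still leaves the distant noise dot out.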

I'm sure there are lots of different ways to approach this text clean-up problem. Does anyone have suggestions for other approaches?

Thanks for letting me use your example, Craig.



Published with MATLAB® 7.5

28 Responses to “Cleaning up scanned text”

  1. Etamar Laron replied on :

    Hi Craig & Steve,

    Sounds like a good approach if you have the computation power to support it. How about achieving the same result with a microprocessor and/or limited resources such as memory and processing power?

    I am under the impression it will pose many difficulties trying to perform it using inexpensive hardware, or am I wrong?

  2. Steve replied on :

    Etamar—I don’t see any operation here that couldn’t be performed with a fixed-point processor.

  3. Jason Brown replied on :

    I’ve used many of the operations in this example on a laptop from 3 years ago that wasn’t exactly top of the line when I purchased it. Don’t think there would be any issues as long as it’s being handled a page at a time.

    Are there any plans to further this project by filling in spaces where the ink has been rubbed off over the years? It would definitely make the end product more legible.

  4. Steve replied on :

    Jason—I don’t know what Craig’s plans are for this project. He and I last corresponded about this about 8 months ago.

  5. Hans replied on :

    How about using OCR techniques to clean up the text?
    Although I’m not sure how OCR can be used in combination with formulas. Just a thought…

  6. Renji replied on :

    Hi Steve,
    I am a final-year student in computer engineering. We are doing a main project on device control using hand gestures. The gestures are captured by a webcam, and the image is converted into the YCbCr model before skin segmentation. Can you please give me guidance on how to obtain the Cb and Cr values? Thanking you, and expecting a reply at the earliest.

  7. Steve replied on :

    Renji—See rgb2ycbcr.
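A minimal sketch of that suggestion follows. The tiny 2-by-2 array is a hypothetical stand-in for a webcam frame; any uint8 RGB image would be handled the same way.

```matlab
% Hypothetical 2-by-2 RGB frame standing in for a webcam capture.
rgb = zeros(2, 2, 3, 'uint8');
rgb(:,:,1) = [255 0; 0 128];   % red channel
rgb(:,:,2) = [0 255; 0 128];   % green channel
rgb(:,:,3) = [0 0; 255 128];   % blue channel

% Convert to YCbCr; the output has the same size and class as the input.
ycbcr = rgb2ycbcr(rgb);

% The chroma planes are the second and third channels.
cb = ycbcr(:,:,2);
cr = ycbcr(:,:,3);
```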

  8. Kevin Lan replied on :

    Hi Craig and Steve,
    I am currently analyzing an image that has a lot of layers. I used your method above to get rid of the distracting objects and I have only the layers now. However right now I want to measure the thickness of the layers. Each layer has inconsistent thickness over its different sections. Is there a way that I can measure the average thickness of the layer? What about measuring the average thickness of all the layers in the image? Please email me if you have a solution. Thank you so much!

  9. Steve replied on :

    Kevin—Can you segment the layers successfully? If so, you might try thinning the segmented layers using bwmorph, then compute the distance transform of the complemented segmentation, then evaluate the distance transform at the pixel locations in the thinned image.
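The steps in that reply can be sketched as follows, on a synthetic segmentation with a single layer (a hypothetical stand-in for real data). The conversion from distance to thickness, 2*d - 1, is an assumption added here: it is exact for a centerline pixel in a layer of odd thickness and only approximate otherwise.

```matlab
% Synthetic segmentation: one horizontal layer, 7 pixels thick.
layers = false(50, 100);
layers(20:26, 10:90) = true;

% Thin the segmented layer down to a one-pixel-wide centerline.
centerline = bwmorph(layers, 'thin', Inf);

% Distance from every pixel to the nearest background pixel.
dist = bwdist(~layers);

% Sample the distance transform along the centerline.
halfWidths = dist(centerline);

% Assumed conversion: a centerline pixel at distance d from the background
% sits in a layer roughly 2*d - 1 pixels thick.
meanThickness = 2 * mean(halfWidths) - 1;
```

For several layers, labeling the segmentation with bwlabel and repeating the sampling step per label would give one average per layer.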

  10. Svendsen replied on :

    Hi Steve et al., after viewing this post I am wondering if Matlab/IPT could be used to somehow help clean up scanned texts that have been underlined by pen/pencil. Some ideas: 1. determine ‘channels’ between lines of text (where most underlines are done) and restrict cleaning to these channels (but many characters protrude into these channels); 2. look for pixels that can form horizontal straight line segments…
    Any idea?
    Thanks, JS

  11. Steve replied on :

    Svendsen—It sounds like some of the morphological operators might be useful, given the right structuring elements, but I don’t have specific suggestions for you.

  12. Alfian replied on :

    Hi Steve,

    At the moment, I’m working on how to extract text from images/video frames. I’ve managed to get the text areas out (plus a few false positives, but that doesnt matter)
    My problem now is… after binarization… most of the text is what I want them to be (BLACK TEXT on WHITE BACKGROUND)… but there are also quite important text that are the opposite (WHITE TEXT on BLACK BACKGROUND).

    I was playing around with the im2bw function, and also the ait_imneg function I downloaded somewhere on the Internet.

    Can you possibly help me out here Steve?

    Thanks a million :)

  13. Alfian replied on :

    hi there Steve :) I’ve been trying around to solve the question I posted. I was looking at the mean intensity within the text region, as well as outside the region. One paper I read says that this can be done to determine whether it’s inverse text (bright colored text with dark bground) or not. But still I am not getting anywhere. Sorry if its a bit off topic… but just thought u might be able to give any pointers. anyway, thanks for letting me post here. Cheers!

  14. smitha replied on :

    is there any matlab function to segment a portion of an image?

  15. Steve replied on :

    Alfian—You might try doing a hole-filling operation (imfill) after thresholding. Inverse text will show up as hole regions. Use bwareaopen to eliminate the holes inside normal text.
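One possible reading of that suggestion, as a sketch only. It assumes the thresholded image is stored so that true means a dark pixel, and the 50-pixel area cutoff is a hypothetical value that would need tuning for real frames.

```matlab
% Synthetic thresholded frame: true = dark pixel.
dark = false(60, 120);

% Normal text: a dark glyph with a small light hole inside it.
dark(10:20, 10:20) = true;
dark(14:16, 14:16) = false;    % 9-pixel hole

% Inverse text: a dark banner containing a large light glyph.
dark(30:55, 40:110) = true;
dark(35:50, 50:60)  = false;   % 176-pixel light glyph

% Fill holes, then keep only the pixels that the filling added.
filled = imfill(dark, 'holes');
holes  = filled & ~dark;

% Discard small holes (the insides of normal characters); what remains
% is a candidate mask for inverse text. The cutoff of 50 is hypothetical.
inverseText = bwareaopen(holes, 50);
```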

  16. Steve replied on :

    Smitha—Image segmentation is really an algorithm development problem that usually must be customized for your own data. There are no “canned” routines that work for everything. For possible algorithms you might be able to use, take a look at the image analysis and segmentation chapters in Digital Image Processing Using MATLAB.

  17. Svendsen replied on :

    Hi Steve, after much trial & error, I’m closer to getting rid of underlines in scanned text, mostly with bwhitmiss. If you’re interested, I can show you a sample image and the Matlab code.

  18. Alfian replied on :

    Thanks Steve. Ill give it a go… will hopefully let U know if I get results.

  19. paras replied on :

    i need to do skeletonization….if i use the command
    “bwmorph(’img’,’skel’,Inf)” it gives skeleton of white part…actually i have an image in which there is a black line on white sheet of paper ..so i need skeleton of black part…plz help me…

  20. DP replied on :

    Hi steve:
    is there any function in image processing toolbox to visualize a 3D image (512×512×22 image size)?

    Thank U.
    sincerely yours;
    damodar

  21. Steve replied on :

    Paras, complement the image before calling bwmorph. Use ~ or imcomplement.
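A minimal sketch of that advice, using a synthetic image (a hypothetical black line on a white sheet). Note that the image variable itself is passed to bwmorph, not its name in quotes.

```matlab
% Synthetic page: white sheet (true) with a black line (false).
img = true(21, 41);
img(10:12, 6:36) = false;

% Complement so the black line becomes the foreground, then skeletonize.
skel = bwmorph(~img, 'skel', Inf);
```

To view the result as black on white, display the complement of skel with imshow.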

  22. Steve replied on :

    DP—You could try montage. You should also look at the 3-D visualization material in the MATLAB documentation.

  23. Bob replied on :

    DP, you might also check out the following File Exchange submission for ideas.
    http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=4879

    Cheers
    Bob

  24. DP replied on :

    Thank U.

  25. Etamar Laron replied on :

    Hi guys,

    To reopen my old question - do you think this algorithm can be performed in a reasonable time (several microseconds) using inexpensive hardware?

    By inexpensive I mean a $1 or less 8051 or similar……… not a laptop from three years ago.

    Appreciate your thoughts!

  26. Steve replied on :

    Etamar—I don’t have anything to add to my original answer. The operations could be readily implemented on a fixed-point processor, but I don’t know how long the operation might take. Did you really mean “several microseconds”? I assume you’re not really hoping to process hundreds of thousands of images per second. :-)

  27. alosha replied on :

    hi steve,
    could you please let me know whether i can map some text data to the image .. thank you very much

  28. Steve replied on :

    Alosha—I don’t know what that means. Can you clarify?


Steve Eddins manages the Image & Geospatial development team at The MathWorks and coauthored Digital Image Processing Using MATLAB. He writes here about image processing concepts, algorithm implementations, and MATLAB.


These postings are the author's and don't necessarily represent the opinions of The MathWorks.