Steve on Image Processing

December 7th, 2007

Cleaning up scanned text

Earlier this year I exchanged e-mail with blog reader Craig Doolittle. Craig was writing MATLAB scripts to clean up scanned pages from old manuscripts. One of the samples he sent me was a page from "Fragmentation of Service Projectiles," N.F. Mott, J. H. Wilkinson, and T.H. Wise, Ministry of Supply, Armament Research Department, Theoretical Research Report No. 37/44, December 1944.

The image is too big to show at full resolution in this blog, so here's a thumbnail view.

url = 'http://blogs.mathworks.com/images/steve/186/scanned_page.png';
page = imread(url);
thumbnail = imresize(im2uint8(page), 'OutputSize', [256 NaN]);
imshow(thumbnail)

Craig wanted suggestions on how to clean up isolated "noise" dots without removing small dots that are part of characters. Let's look closely at a cropped portion of the page.

bw = page(735:1280, 11:511);
imshow(bw)

We could start by using bwareaopen to remove small dots. For example:

bw2 = imcomplement(bw);
bw3 = bwareaopen(bw2, 8);
imshow(imcomplement(bw3))

Unfortunately, this approach has removed portions of some of the characters. Here's a method using bwlabel and regionprops to highlight the pixels that were removed.

removed = xor(bw2, bw3);
L = bwlabel(removed);
s = regionprops(L, 'Centroid');
centroids = cat(1, s.Centroid);
imshow(bw)
hold on
plot(centroids(:,1), centroids(:,2), 'ro')
hold off

You can see that some of the removed dots were noise, while others were parts of the characters "e", "i", "m", etc.

My suggestion to Craig was to restore removed dots that are "close" to the characters remaining after bwareaopen. We can do this using dilation and some logical operators.

bw4 = imdilate(bw3, strel('disk', 5));
imshow(bw4)

Now do a logical AND of the dilated characters with the pixels removed by bwareaopen. These are the pixels we are going to put back.

overlaps = bw4 & removed;
imshow(overlaps)

Use a logical OR to restore the removed pixels.

bwout = imcomplement(bw3 | overlaps);
imshow(bwout)
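
For readers without the scanned page, the whole sequence can be tried on a small synthetic image. This is only a sketch: the test image is invented for illustration (an "i" with a detached dot, plus one stray speck), and the variable name restored does not appear in the code above. The area threshold of 8 and the disk radius of 5 are the values used above.

```matlab
% Synthetic foreground image (true = ink); a stand-in for imcomplement(bw).
bw2 = false(20, 30);
bw2(8:16, 5:6)    = true;   % stem of an "i" (18 pixels)
bw2(5:6, 5:6)     = true;   % dot of the "i" (4 pixels), detached from the stem
bw2(10:11, 24:25) = true;   % isolated noise speck (4 pixels), far from the "i"

bw3      = bwareaopen(bw2, 8);               % drop components under 8 pixels
removed  = xor(bw2, bw3);                    % everything bwareaopen took out
bw4      = imdilate(bw3, strel('disk', 5));  % grow the surviving characters
overlaps = bw4 & removed;                    % removed pixels near a character
restored = bw3 | overlaps;                   % put those pixels back

imshow(imcomplement(restored))
```

In this toy case the dot of the "i" comes back because it lies inside the dilated stem, while the speck stays removed. On a real page the threshold and radius would have to be tuned to the scan resolution.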

I also suggested using morphological reconstruction to get all the pixels connected to the overlapping pixels found above. It doesn't seem to be really necessary here, though, so I'm going to save this technique for a future blog post, using a better example.
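
In the meantime, here is a rough sketch of how the reconstruction step might look. It is not the example being saved for the future post. It assumes imreconstruct is called with the overlapping pixels as the marker and the removed pixels as the mask, and the synthetic image and the name recovered are made up for illustration.

```matlab
% Synthetic foreground image (true = ink).
bw2 = false(30, 30);
bw2(5:14, 5:6)    = true;   % character stroke, 20 pixels: survives bwareaopen
bw2(16:22, 5)     = true;   % 7-pixel fragment just below the stroke: removed
bw2(25:26, 20:21) = true;   % isolated noise speck: removed

bw3      = bwareaopen(bw2, 8);
removed  = xor(bw2, bw3);
overlaps = imdilate(bw3, strel('disk', 5)) & removed;

% The AND keeps only the part of the fragment that falls inside the
% dilated stroke. Reconstruction grows that seed through every removed
% pixel connected to it, so the whole fragment is recovered.
recovered = imreconstruct(overlaps, removed);
bwout     = bw3 | recovered;

imshow(imcomplement(bwout))
```

The fragment is only partly covered by the dilated stroke, so the logical AND alone would restore just its top few pixels. The reconstruction restores all of it and still leaves the speck out, because the speck is not connected to any marker pixel.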

I'm sure there are lots of different ways to approach this text clean-up problem. Does anyone have suggestions for other approaches?

Thanks for letting me use your example, Craig.


Published with MATLAB® 7.5

24 Responses to “Cleaning up scanned text”

  1. Etamar Laron replied on :

    Hi Craig & Steve,

    Sounds like a good approach if you have the computation power to support it. How about achieving the same result with a microprocessor and/or limited resources such as memory and processing power?

    I am under the impression it will pose many difficulties trying to perform it using inexpensive hardware, or am I wrong?

  2. Steve replied on :

    Etamar—I don’t see any operation here that couldn’t be performed with a fixed-point processor.

  3. Jason Brown replied on :

    I’ve used many of the operations in this example on a laptop from 3 years ago that wasn’t exactly top of the line when I purchased it. Don’t think there would be any issues as long as it’s being handled a page at a time.

    Are there any plans to further this project by filling in spaces where the ink has been rubbed off over the years? It would definitely make the end product more legible.

  4. Steve replied on :

    Jason—I don’t know what Craig’s plans are for this project. He and I last corresponded about this about 8 months ago.

  5. Hans replied on :

    How about using OCR techniques to clean up the text?
    Although I’m not sure how OCR can be used in combination with formulas. Just a thought…

  6. Renji replied on :

    Hi Steve,
    I am a final-year student in computer engineering. We are doing a main project in device control using hand gestures. The gestures are taken by a webcam and the image is converted into the YCbCr model before skin segmentation. Can you please provide me with proper guidance on how to obtain the Cb and Cr values? Thanking you, expecting a reply at the earliest.

  7. Steve replied on :

    Renji—See rgb2ycbcr.
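
A minimal sketch of that suggestion, assuming a uint8 RGB frame. The two-pixel test frame and the names cb and cr are made up for illustration.

```matlab
% Hypothetical 1-by-2 RGB frame: one mid-gray pixel and one pure-red pixel.
rgb = cat(3, uint8([128 255]), uint8([128 0]), uint8([128 0]));

ycbcr = rgb2ycbcr(rgb);   % same size and class as the input
cb = ycbcr(:, :, 2);      % blue-difference chroma plane
cr = ycbcr(:, :, 3);      % red-difference chroma plane
```

For uint8 input a neutral gray pixel has Cb and Cr of 128, and the red pixel has Cr above 128.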

  8. Kevin Lan replied on :

    Hi Craig and Steve,
    I am currently analyzing an image that has a lot of layers. I used your method above to get rid of the distracting objects and I have only the layers now. However right now I want to measure the thickness of the layers. Each layer has inconsistent thickness over its different sections. Is there a way that I can measure the average thickness of the layer? What about measuring the average thickness of all the layers in the image? Please email me if you have a solution. Thank you so much!

  9. Steve replied on :

    Kevin—Can you segment the layers successfully? If so, you might try thinning the segmented layers using bwmorph, then compute the distance transform of the complemented segmentation, then evaluate the distance transform at the pixel locations in the thinned image.
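
The steps in that reply could be sketched roughly as follows. The segmentation seg is a hypothetical stand-in (two horizontal bands), and the conversion from distance to thickness is an added assumption that is only accurate to about a pixel.

```matlab
% Hypothetical segmentation: true = layer pixels.
seg = false(40, 60);
seg(10:16, :) = true;    % band 7 pixels thick
seg(25:29, :) = true;    % band 5 pixels thick

centerline = bwmorph(seg, 'thin', Inf);  % thin each layer toward its midline
d = bwdist(~seg);                        % distance to nearest background pixel

[L, n] = bwlabel(seg);                   % one label per layer
thickness = zeros(n, 1);
for k = 1:n
    onLayer = centerline & (L == k);
    % A midline pixel at distance d from the background sits in a layer
    % roughly 2*d - 1 pixels thick.
    thickness(k) = mean(2 * double(d(onLayer)) - 1);
end
disp(thickness)
```

Averaging 2*d - 1 over all centerline pixels at once, rather than per label, would give a single figure for the whole image.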

  10. Svendsen replied on :

    Hi Steve et al., after viewing this post I am wondering if Matlab/IPT could be used to somehow help clean up scanned texts that have been underlined by pen/pencil. Some ideas: 1. determine ‘channels’ between lines of text (where most underlines are done) and restrict cleaning to these channels (but many characters protrude into these channels); 2. look for pixels that can form horizontal straight line segments…
    Any idea?
    Thanks, JS

  11. Steve replied on :

    Svendsen—It sounds like some of the morphological operators might be useful, given the right structuring elements, but I don’t have specific suggestions for you.

  12. Alfian replied on :

    Hi Steve,

    At the moment, I’m working on how to extract text from images/video frames. I’ve managed to get the text areas out (plus a few false positives, but that doesn’t matter).
    My problem now is… after binarization… most of the text is what I want it to be (BLACK TEXT on WHITE BACKGROUND)… but there is also some quite important text that is the opposite (WHITE TEXT on BLACK BACKGROUND).

    I was playing around with the im2bw function, and also the ait_imneg function I downloaded somewhere on the Internet.

    Can you possibly help me out here Steve?

    Thanks a million :)

  13. Alfian replied on :

    hi there Steve :) I’ve been trying to solve the question I posted. I was looking at the mean intensity within the text region, as well as outside the region. One paper I read says that this can be done to determine whether it’s inverse text (bright colored text with dark background) or not. But still I am not getting anywhere. Sorry if it’s a bit off topic… but just thought you might be able to give some pointers. anyway, thanks for letting me post here. Cheers!

  14. smitha replied on :

    is there any matlab function to segment a portion of an image?

  15. Steve replied on :

    Alfian—You might try doing a hole-filling operation (imfill) after thresholding. Inverse text will show up as hole regions. Use bwareaopen to eliminate the holes inside normal text.

  16. Steve replied on :

    Smitha—Image segmentation is really an algorithm development problem that usually must be customized for your own data. There are no “canned” routines that work for everything. For possible algorithms you might be able to use, take a look at the image analysis and segmentation chapters in Digital Image Processing Using MATLAB.

  17. Svendsen replied on :

    Hi Steve, after much trial & error, I’m closer to getting rid of underlines in scanned text, mostly with bwhitmiss. If you’re interested, I can show you a sample image and the Matlab code.

  18. Alfian replied on :

    Thanks Steve. I’ll give it a go… will hopefully let you know if I get results.

  19. paras replied on :

    i need to do skeletonization….if i use the command
    bwmorph('img','skel',Inf) it gives skeleton of white part…actually i have an image in which there is a black line on white sheet of paper ..so i need skeleton of black part…plz help me…

  20. DP replied on :

    Hi steve:
    is there any function in image processing toolbox to visualize a 3D image (image size 512×512×22)?

    Thank U.
    sincerely yours;
    damodar

  21. Steve replied on :

    Paras, complement the image before calling bwmorph. Use ~ or imcomplement.
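
As a small sketch of that advice, assuming the scan has already been thresholded to a logical image in which true is white paper. The test image and the name skel are invented for illustration.

```matlab
% Hypothetical scan: true = white paper, false = black line.
img = true(15, 25);
img(6:10, 4:22) = false;

% Pass the image variable itself (not the quoted name 'img'), complemented
% so that the black line becomes the foreground that bwmorph skeletonizes.
skel = bwmorph(~img, 'skel', Inf);

imshow(skel)
```

For a logical image, imcomplement(img) gives the same result as ~img.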

  22. Steve replied on :

    DP—You could try montage. You should also look at the 3-D visualization material in the MATLAB documentation.

  23. Bob replied on :

    DP, you might also check out the following File Exchange submission for ideas.
    http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=4879

    Cheers
    Bob

  24. DP replied on :

    Thank U.


Steve Eddins manages the Image & Geospatial development team at The MathWorks and coauthored Digital Image Processing Using MATLAB. He writes here about image processing concepts, algorithm implementations, and MATLAB.

These postings are the author's and don't necessarily represent the opinions of The MathWorks.
