# Cleaning up scanned text

Earlier this year I exchanged e-mail with blog reader Craig Doolittle. Craig was writing MATLAB scripts to clean up scanned pages from old manuscripts. One of the samples he sent me was a page from "Fragmentation of Service Projectiles," N.F. Mott, J. H. Wilkinson, and T.H. Wise, Ministry of Supply, Armament Research Department, Theoretical Research Report No. 37/44, December 1944.

The image is too big to show at full resolution in this blog, so here's a thumbnail view.

url = 'https://blogs.mathworks.com/images/steve/186/scanned_page.png';
page = imread(url);
thumbnail = imresize(im2uint8(page), 'OutputSize', [256 NaN]);
imshow(thumbnail)

Craig wanted suggestions on how to clean up isolated "noise" dots without removing small dots that are part of characters. Let's look closely at a cropped portion of the page.

bw = page(735:1280, 11:511);
imshow(bw)

We could start by using bwareaopen to remove small dots. For example:

bw2 = imcomplement(bw);
bw3 = bwareaopen(bw2, 8);
imshow(imcomplement(bw3))
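
To make the size threshold concrete, here is a tiny synthetic sketch. The `toy` matrix is made up for illustration and is not taken from the scanned page. Note that bwareaopen treats true (white) pixels as the foreground, which is why the page, with its black text on a white background, gets complemented first.

```matlab
% Hypothetical test mask: one 3-by-3 blob (9 pixels) and one isolated pixel.
toy = false(7, 7);
toy(2:4, 2:4) = true;   % 9-pixel component, at least 8 pixels, so it is kept
toy(6, 6)     = true;   % 1-pixel component, fewer than 8 pixels, so it is removed

% Remove connected components with fewer than 8 pixels.
cleaned = bwareaopen(toy, 8);

fprintf('Foreground pixels before: %d\n', nnz(toy));
fprintf('Foreground pixels after:  %d\n', nnz(cleaned));
```

Any component smaller than the threshold disappears, whether it is a noise speck or the dot of an "i". The function has no way to tell the two apart.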

Unfortunately, this approach has removed portions of some of the characters. Here's a method using bwlabel and regionprops to highlight the pixels that were removed.

removed = xor(bw2, bw3);
L = bwlabel(removed);
s = regionprops(L, 'Centroid');
centroids = cat(1, s.Centroid);
imshow(bw)
hold on
plot(centroids(:,1), centroids(:,2), 'ro')
hold off

You can see that some of the removed dots were noise, while others were parts of the characters "e", "i", "m", etc.

My suggestion to Craig was to restore removed dots that are "close" to the characters remaining after bwareaopen. We can do this using dilation and some logical operators.

bw4 = imdilate(bw3, strel('disk', 5));
imshow(bw4)

Now do a logical AND of the dilated characters with the pixels removed by bwareaopen. These are the pixels we are going to put back.

overlaps = bw4 & removed;
imshow(overlaps)

Use a logical OR to restore the removed pixels.

bwout = imcomplement(bw3 | overlaps);
imshow(bwout)
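
The whole sequence can also be tried on a small synthetic image, where it is easy to check which dots come back. This is only a sketch under made-up assumptions: the `fg` image, with a crude letter "i" and a single noise speck, is hypothetical, and the foreground is already white, so no complement step is needed.

```matlab
% Hypothetical test image: a crude letter "i" plus one distant noise speck.
fg = false(40, 40);
fg(15:30, 10:12) = true;   % stem of the "i" (48 pixels)
fg(12:13, 10:11) = true;   % dot of the "i" (4 pixels), one blank row above the stem
fg(35, 35)       = true;   % isolated noise speck, far from the character

kept    = bwareaopen(fg, 8);                  % drops both the dot and the speck
removed = xor(fg, kept);                      % everything bwareaopen took away
near    = imdilate(kept, strel('disk', 5));   % neighborhood of surviving strokes
restored = kept | (near & removed);           % put back removed pixels near strokes

fprintf('Dot of the "i" restored: %d\n', all(all(restored(12:13, 10:11))));
fprintf('Noise speck restored:    %d\n', restored(35, 35));
```

The dot sits within a few pixels of the stem, so it falls inside the dilated region and is restored. The speck is far from any surviving stroke and stays removed. The disk radius sets how far "close" reaches, so it would need tuning for a different scan resolution or font size.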

I also suggested using morphological reconstruction to get all the pixels connected to the overlapping pixels found above. It doesn't seem to be really necessary here, though, so I'm going to save this technique for a future blog post, using a better example.
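
For the curious, here is a rough sketch of what that reconstruction step might look like, using imreconstruct with the overlapping pixels as the marker and the removed pixels as the mask. The synthetic `fg` image is an assumption made up for illustration, and this is not necessarily the form the technique would take in a fuller treatment.

```matlab
% Hypothetical test image, not from the scanned page.
fg = false(40, 40);
fg(15:30, 10:12) = true;   % large stroke (48 pixels), survives bwareaopen
fg(12, 12:18)    = true;   % 7-pixel fragment; only its left end is near the stroke
fg(35, 35)       = true;   % isolated noise speck

kept     = bwareaopen(fg, 8);
removed  = xor(fg, kept);
overlaps = imdilate(kept, strel('disk', 5)) & removed;

% Grow the overlapping pixels out to every removed pixel connected to them.
% The marker (overlaps) is a subset of the mask (removed), as imreconstruct requires.
recovered = imreconstruct(overlaps, removed);

fprintf('Pixels found by the AND step: %d\n', nnz(overlaps));
fprintf('Pixels after reconstruction:  %d\n', nnz(recovered));
```

In this toy case the far end of the fragment lies outside the dilated region, so the AND step alone restores only part of it. Reconstruction recovers the whole fragment while still leaving the distant speck out.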

I'm sure there are lots of different ways to approach this text clean-up problem. Does anyone have suggestions for other approaches?

Thanks for letting me use your example, Craig.

Published with MATLAB® 7.5
