Loren on the Art of MATLAB

Turn ideas into MATLAB

Finding Patterns in Arrays 19

Posted by Loren Shure,

Recently, my colleague Jeff asked me if I would look at some code he wrote to find a pattern of numbers in a larger array. Without looking at his code, I asked if he had tried using strfind, despite his data not being strings. He found that it solved his problem and was faster than his M-file. In the meantime, I wanted to see what it took for me to write a simple algorithm I was thinking of in an M-file. I show and discuss the results here.

Contents

Simple Test Data

Let me start off with really simple test data to be sure all algorithms are getting the right answers.

a = [0 1 4 9 16 4 9];
b = double('The year is 2003.');

First Algorithm : findpattern

Here's the first findpattern algorithm.

type findpattern
function idx = findpattern(in_array, pattern)
%FINDPATTERN  Find a pattern in an array.
%
%    K = FINDPATTERN(ARRAY, PATTERN) returns the starting indices
%    of any occurences of the PATTERN in the ARRAY.  ARRAY and PATTERN
%    can be any mixture of character and numeric types.
%
%    Examples:
%        a = [0 1 4 9 16 4 9];
%        b = double('The year is 2003.');
%        findpattern(a, [4 9])         returns [3 6]
%        findpattern(a, [9 4])         returns []
%        findpattern(b, '2003')        returns 13
%        findpattern(b, uint8('2003')) returns 13
%
%    See also STRFIND, STRCMP, STRNCMP, STRMATCH.

% Algorithm:
% Find all of the occurrences of each number of the pattern.
%
% For an n element pattern, the result is an n element cell array.  The
% i-th cell contains the positions in the input array that match the i-th
% element of the pattern.
%
% When the pattern exists in the input stream, there will be a sequence
% of consecutive integers in consecutive cells.

% As currently implemented, this routine has poor performance for patterns
% with more than half a dozen elements where the first element in the
% pattern matches many positions in the array.

locations = cell(1, numel(pattern));
for p = 1:(numel(pattern))
    locations{p} = find(in_array == pattern(p));
end

% Find instances of the pattern in the array.
idx = [];
for p = 1:numel(locations{1})
    
    % Look for a consecutive progression of locations.
    start_value = locations{1}(p);
    for q = 2:numel(locations)
        
        found = true;
        
        if (~any((start_value + q - 1) == locations{q}))
            found = false;
            break;
        end
        
    end
    
    if (found)
        idx(end + 1) = locations{1}(p);
    end
    
end

You'll notice that Jeff chooses to store derived information on the pattern being present in a cell array, and then looks for consecutive locations.

Here are some results using findpattern. First I set f to be a function handle to the function in question. Then I can reuse the same code for the other cases simply by redefining the function.

f = @findpattern
t(4) = false;
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f = 
    @findpattern
AOK =
     1

My Homebrew Algorithm : findPattern2

Here's my own algorithm. The idea here is to find possible pattern locations first, and winnow them out, marching through the pattern, which I am assuming is generally smaller, and often a lot smaller, than the data.

type findPattern2
function start = findPattern2(array, pattern)
%findPattern2 Locate a pattern in an array.
%
%   indices = findPattern2(array, pattern) finds the starting indices of
%   pattern within array.
%
%   Example:
%   a = [0 1 4 9 16 4 9];
%   patt = [4 9];
%   indices = findPattern2(a,patt)
%   indices =
%        3     6

% Let's assume for now that both the pattern and the array are non-empty
% VECTORS, but there's no checking for this. 
% For this algorithm, I loop over the pattern elements.
len = length(pattern);
% First, find candidate locations; i.e., match the first element in the
% pattern.
start = find(array==pattern(1));
% Next remove start values that are too close to the end to possibly match
% the pattern.
endVals = start+len-1;
start(endVals>length(array)) = [];
% Next, loop over elements of pattern, usually much shorter than length of
% array, to check which possible locations are valid still.
for pattval = 2:len
    % check viable locations in array
    locs = pattern(pattval) == array(start+pattval-1);
    % delete false ones from indices
    start(~locs) = [];
end

Get results and time it.

f = @findPattern2
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f = 
    @findPattern2
AOK =
     1

Using strfind

Next I test using the same data with strfind. Despite its name, strfind can happily handle non-character datatypes, and particularly doubles and integers.

f = @strfind
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f = 
    @strfind
AOK =
     1

Use Case and Performance

Jeff described the problem he was solving in more detail. He has a file with multiple images in it, with data stored as uint8. The images are separated by a particular bit pattern. Let me show you one of the images in the sequence, after processing and extracting the frames.

load forloren
image(X(:,:,:,17)), axis off
whos X
  Name      Size               Bytes  Class    Attributes

  X         4-D             26726400  uint8              

Now let me show and time finding the pattern in the raw data. The data contain 29 images.

clear
load imrawdata
whos
pattern = [254 255 0 224];
f = @()findpattern(rawdata, pattern);
tfind = timeit(f);
f = @()findPattern2(rawdata, pattern);
tfind(2) = timeit(f);
f = @()strfind(rawdata, pattern);
tfind(3) = timeit(f)
  Name         Size                   Bytes  Class    Attributes

  rawdata      1x1259716            1259716  uint8              

tfind =
      0.80941     0.011273     0.019194

Puzzle and Next Steps

In the case of the larger dataset, strfind is not the fastest algorithm, though I found with much smaller data, strfind outperformed findPattern2. Some possible reasons why findPattern2 is the fastest of the three algorithms:

  • it is not as general purpose and does no error checking,
  • it was only written to work on vectors,
  • it does nothing to handle cases where |NaN|s might be involved.
  • If I find out why, I will let you know. In the meantime, if you have any thoughts to add, please comment here.


    Get the MATLAB code

    Published with MATLAB® 7.6

    166 views (last 30 days)  | |

    Comments

    To leave a comment, please click here to sign in to your MathWorks Account or create a new one.