# Finding Patterns in Arrays

Recently, my colleague Jeff asked me if I would look at some code he wrote to find a pattern of numbers in a larger array. Without looking at his code, I asked if he had tried using strfind, despite his data not being strings. He found that it solved his problem and was faster than his M-file. In the meantime, I wanted to see what it took for me to write a simple algorithm I was thinking of in an M-file. I show and discuss the results here.

### Simple Test Data

Let me start off with really simple test data to be sure all algorithms are getting the right answers.

a = [0 1 4 9 16 4 9];
b = double('The year is 2003.');

### First Algorithm : findpattern

Here's the first findpattern algorithm.

type findpattern
function idx = findpattern(in_array, pattern)
%FINDPATTERN  Find a pattern in an array.
%
%    K = FINDPATTERN(ARRAY, PATTERN) returns the starting indices
%    of any occurences of the PATTERN in the ARRAY.  ARRAY and PATTERN
%    can be any mixture of character and numeric types.
%
%    Examples:
%        a = [0 1 4 9 16 4 9];
%        b = double('The year is 2003.');
%        findpattern(a, [4 9])         returns [3 6]
%        findpattern(a, [9 4])         returns []
%        findpattern(b, '2003')        returns 13
%        findpattern(b, uint8('2003')) returns 13
%

% Algorithm:
% Find all of the occurrences of each number of the pattern.
%
% For an n element pattern, the result is an n element cell array.  The
% i-th cell contains the positions in the input array that match the i-th
% element of the pattern.
%
% When the pattern exists in the input stream, there will be a sequence
% of consecutive integers in consecutive cells.

% As currently implemented, this routine has poor performance for patterns
% with more than half a dozen elements where the first element in the
% pattern matches many positions in the array.

locations = cell(1, numel(pattern));
for p = 1:(numel(pattern))
locations{p} = find(in_array == pattern(p));
end

% Find instances of the pattern in the array.
idx = [];
for p = 1:numel(locations{1})

% Look for a consecutive progression of locations.
start_value = locations{1}(p);
for q = 2:numel(locations)

found = true;

if (~any((start_value + q - 1) == locations{q}))
found = false;
break;
end

end

if (found)
idx(end + 1) = locations{1}(p);
end

end


You'll notice that Jeff chooses to store derived information on the pattern being present in a cell array, and then looks for consecutive locations.

Here are some results using findpattern. First I set f to be a function handle to the function in question. Then I can reuse the same code for the other cases simply by redefining the function.

f = @findpattern
t(4) = false;
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f =
@findpattern
AOK =
1


### My Homebrew Algorithm : findPattern2

Here's my own algorithm. The idea here is to find possible pattern locations first, and winnow them out, marching through the pattern, which I am assuming is generally smaller, and often a lot smaller, than the data.

type findPattern2
function start = findPattern2(array, pattern)
%findPattern2 Locate a pattern in an array.
%
%   indices = findPattern2(array, pattern) finds the starting indices of
%   pattern within array.
%
%   Example:
%   a = [0 1 4 9 16 4 9];
%   patt = [4 9];
%   indices = findPattern2(a,patt)
%   indices =
%        3     6

% Let's assume for now that both the pattern and the array are non-empty
% VECTORS, but there's no checking for this.
% For this algorithm, I loop over the pattern elements.
len = length(pattern);
% First, find candidate locations; i.e., match the first element in the
% pattern.
start = find(array==pattern(1));
% Next remove start values that are too close to the end to possibly match
% the pattern.
endVals = start+len-1;
start(endVals>length(array)) = [];
% Next, loop over elements of pattern, usually much shorter than length of
% array, to check which possible locations are valid still.
for pattval = 2:len
% check viable locations in array
locs = pattern(pattval) == array(start+pattval-1);
% delete false ones from indices
start(~locs) = [];
end


Get results and time it.

f = @findPattern2
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f =
@findPattern2
AOK =
1


### Using strfind

Next I test using the same data with strfind. Despite its name, strfind can happily handle non-character datatypes, and particularly doubles and integers.

f = @strfind
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f =
@strfind
AOK =
1


### Use Case and Performance

Jeff described the problem he was solving in more detail. He has a file with multiple images in it, with data stored as uint8. The images are separated by a particular bit pattern. Let me show you one of the images in the sequence, after processing and extracting the frames.

load forloren
image(X(:,:,:,17)), axis off
whos X
  Name      Size               Bytes  Class    Attributes

X         4-D             26726400  uint8



Now let me show and time finding the pattern in the raw data. The data contain 29 images.

clear
whos
pattern = [254 255 0 224];
f = @()findpattern(rawdata, pattern);
tfind = timeit(f);
f = @()findPattern2(rawdata, pattern);
tfind(2) = timeit(f);
f = @()strfind(rawdata, pattern);
tfind(3) = timeit(f)
  Name         Size                   Bytes  Class    Attributes

rawdata      1x1259716            1259716  uint8

tfind =
0.80941     0.011273     0.019194


### Puzzle and Next Steps

In the case of the larger dataset, strfind is not the fastest algorithm, though I found with much smaller data, strfind outperformed findPattern2. Some possible reasons why findPattern2 is the fastest of the three algorithms:

• it is not as general purpose and does no error checking,
• it was only written to work on vectors,
• it does nothing to handle cases where |NaN|s might be involved.
• If I find out why, I will let you know. In the meantime, if you have any thoughts to add, please comment here.

Published with MATLAB® 7.6

|