Finding Patterns in Arrays

作者 Loren Shure, September 8, 2008

5 次查看（过去 30 天） | 0 个赞 | 19 个评论

Recently, my colleague Jeff asked me if I would look at some code he wrote to find a pattern of numbers in a larger array. Without looking at his code, I asked if he had tried using strfind, despite his data not being strings. He found that it solved his problem and was faster than his M-file. In the meantime, I wanted to see what it took for me to write a simple algorithm I was thinking of in an M-file. I show and discuss the results here.

Simple Test Data
First Algorithm : findpattern
My Homebrew Algorithm : findPattern2
Using strfind
Use Case and Performance
Puzzle and Next Steps

Simple Test Data

Let me start off with really simple test data to be sure all algorithms are getting the right answers.

a = [0 1 4 9 16 4 9];
b = double('The year is 2003.');

First Algorithm : findpattern

Here's the first findpattern algorithm.

type findpattern

function idx = findpattern(in_array, pattern)
%FINDPATTERN  Find a pattern in an array.
%
%    K = FINDPATTERN(ARRAY, PATTERN) returns the starting indices
%    of any occurences of the PATTERN in the ARRAY.  ARRAY and PATTERN
%    can be any mixture of character and numeric types.
%
%    Examples:
%        a = [0 1 4 9 16 4 9];
%        b = double('The year is 2003.');
%        findpattern(a, [4 9])         returns [3 6]
%        findpattern(a, [9 4])         returns []
%        findpattern(b, '2003')        returns 13
%        findpattern(b, uint8('2003')) returns 13
%
%    See also STRFIND, STRCMP, STRNCMP, STRMATCH.

% Algorithm:
% Find all of the occurrences of each number of the pattern.
%
% For an n element pattern, the result is an n element cell array.  The
% i-th cell contains the positions in the input array that match the i-th
% element of the pattern.
%
% When the pattern exists in the input stream, there will be a sequence
% of consecutive integers in consecutive cells.

% As currently implemented, this routine has poor performance for patterns
% with more than half a dozen elements where the first element in the
% pattern matches many positions in the array.

locations = cell(1, numel(pattern));
for p = 1:(numel(pattern))
    locations{p} = find(in_array == pattern(p));
end

% Find instances of the pattern in the array.
idx = [];
for p = 1:numel(locations{1})
    
    % Look for a consecutive progression of locations.
    start_value = locations{1}(p);
    for q = 2:numel(locations)
        
        found = true;
        
        if (~any((start_value + q - 1) == locations{q}))
            found = false;
            break;
        end
        
    end
    
    if (found)
        idx(end + 1) = locations{1}(p);
    end
    
end

You'll notice that Jeff chooses to store derived information on the pattern being present in a cell array, and then looks for consecutive locations.

Here are some results using findpattern. First I set f to be a function handle to the function in question. Then I can reuse the same code for the other cases simply by redefining the function.

f = @findpattern
t(4) = false;
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)

f = 
    @findpattern
AOK =
     1

My Homebrew Algorithm : findPattern2

Here's my own algorithm. The idea here is to find possible pattern locations first, and winnow them out, marching through the pattern, which I am assuming is generally smaller, and often a lot smaller, than the data.

type findPattern2

function start = findPattern2(array, pattern)
%findPattern2 Locate a pattern in an array.
%
%   indices = findPattern2(array, pattern) finds the starting indices of
%   pattern within array.
%
%   Example:
%   a = [0 1 4 9 16 4 9];
%   patt = [4 9];
%   indices = findPattern2(a,patt)
%   indices =
%        3     6

% Let's assume for now that both the pattern and the array are non-empty
% VECTORS, but there's no checking for this. 
% For this algorithm, I loop over the pattern elements.
len = length(pattern);
% First, find candidate locations; i.e., match the first element in the
% pattern.
start = find(array==pattern(1));
% Next remove start values that are too close to the end to possibly match
% the pattern.
endVals = start+len-1;
start(endVals>length(array)) = [];
% Next, loop over elements of pattern, usually much shorter than length of
% array, to check which possible locations are valid still.
for pattval = 2:len
    % check viable locations in array
    locs = pattern(pattval) == array(start+pattval-1);
    % delete false ones from indices
    start(~locs) = [];
end

Get results and time it.

f = @findPattern2
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)

f = 
    @findPattern2
AOK =
     1

Using strfind

Next I test using the same data with strfind. Despite its name, strfind can happily handle non-character datatypes, and particularly doubles and integers.

f = @strfind
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)

f = 
    @strfind
AOK =
     1

Use Case and Performance

Jeff described the problem he was solving in more detail. He has a file with multiple images in it, with data stored as uint8. The images are separated by a particular bit pattern. Let me show you one of the images in the sequence, after processing and extracting the frames.

load forloren
image(X(:,:,:,17)), axis off
whos X

  Name      Size               Bytes  Class    Attributes

  X         4-D             26726400  uint8

Now let me show and time finding the pattern in the raw data. The data contain 29 images.

clear
load imrawdata
whos
pattern = [254 255 0 224];
f = @()findpattern(rawdata, pattern);
tfind = timeit(f);
f = @()findPattern2(rawdata, pattern);
tfind(2) = timeit(f);
f = @()strfind(rawdata, pattern);
tfind(3) = timeit(f)

  Name         Size                   Bytes  Class    Attributes

  rawdata      1x1259716            1259716  uint8              

tfind =
      0.80941     0.011273     0.019194

Puzzle and Next Steps

In the case of the larger dataset, strfind is not the fastest algorithm, though I found with much smaller data, strfind outperformed findPattern2. Some possible reasons why findPattern2 is the fastest of the three algorithms:

it is not as general purpose and does no error checking,

it was only written to work on vectors,

it does nothing to handle cases where |NaN|s might be involved.

If I find out why, I will let you know. In the meantime, if you have any thoughts to add, please comment here.

Published with MATLAB® 7.6