Loren on the Art of MATLAB

Finding Patterns in Arrays 19

Posted by Loren Shure,

Recently, my colleague Jeff asked me if I would look at some code he wrote to find a pattern of numbers in a larger array. Without looking at his code, I asked if he had tried using strfind, despite his data not being strings. He found that it solved his problem and was faster than his M-file. In the meantime, I wanted to see what it took for me to write a simple algorithm I was thinking of in an M-file. I show and discuss the results here.

Contents

Simple Test Data

Let me start off with really simple test data to be sure all algorithms are getting the right answers.

a = [0 1 4 9 16 4 9];
b = double('The year is 2003.');

First Algorithm : findpattern

Here's the first findpattern algorithm.

type findpattern
function idx = findpattern(in_array, pattern)
%FINDPATTERN  Find a pattern in an array.
%
%    K = FINDPATTERN(ARRAY, PATTERN) returns the starting indices
%    of any occurences of the PATTERN in the ARRAY.  ARRAY and PATTERN
%    can be any mixture of character and numeric types.
%
%    Examples:
%        a = [0 1 4 9 16 4 9];
%        b = double('The year is 2003.');
%        findpattern(a, [4 9])         returns [3 6]
%        findpattern(a, [9 4])         returns []
%        findpattern(b, '2003')        returns 13
%        findpattern(b, uint8('2003')) returns 13
%
%    See also STRFIND, STRCMP, STRNCMP, STRMATCH.

% Algorithm:
% Find all of the occurrences of each number of the pattern.
%
% For an n element pattern, the result is an n element cell array.  The
% i-th cell contains the positions in the input array that match the i-th
% element of the pattern.
%
% When the pattern exists in the input stream, there will be a sequence
% of consecutive integers in consecutive cells.

% As currently implemented, this routine has poor performance for patterns
% with more than half a dozen elements where the first element in the
% pattern matches many positions in the array.

locations = cell(1, numel(pattern));
for p = 1:(numel(pattern))
    locations{p} = find(in_array == pattern(p));
end

% Find instances of the pattern in the array.
idx = [];
for p = 1:numel(locations{1})
    
    % Look for a consecutive progression of locations.
    start_value = locations{1}(p);
    for q = 2:numel(locations)
        
        found = true;
        
        if (~any((start_value + q - 1) == locations{q}))
            found = false;
            break;
        end
        
    end
    
    if (found)
        idx(end + 1) = locations{1}(p);
    end
    
end

You'll notice that Jeff chooses to store derived information on the pattern being present in a cell array, and then looks for consecutive locations.

Here are some results using findpattern. First I set f to be a function handle to the function in question. Then I can reuse the same code for the other cases simply by redefining the function.

f = @findpattern
t(4) = false;
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f = 
    @findpattern
AOK =
     1

My Homebrew Algorithm : findPattern2

Here's my own algorithm. The idea here is to find possible pattern locations first, and winnow them out, marching through the pattern, which I am assuming is generally smaller, and often a lot smaller, than the data.

type findPattern2
function start = findPattern2(array, pattern)
%findPattern2 Locate a pattern in an array.
%
%   indices = findPattern2(array, pattern) finds the starting indices of
%   pattern within array.
%
%   Example:
%   a = [0 1 4 9 16 4 9];
%   patt = [4 9];
%   indices = findPattern2(a,patt)
%   indices =
%        3     6

% Let's assume for now that both the pattern and the array are non-empty
% VECTORS, but there's no checking for this. 
% For this algorithm, I loop over the pattern elements.
len = length(pattern);
% First, find candidate locations; i.e., match the first element in the
% pattern.
start = find(array==pattern(1));
% Next remove start values that are too close to the end to possibly match
% the pattern.
endVals = start+len-1;
start(endVals>length(array)) = [];
% Next, loop over elements of pattern, usually much shorter than length of
% array, to check which possible locations are valid still.
for pattval = 2:len
    % check viable locations in array
    locs = pattern(pattval) == array(start+pattval-1);
    % delete false ones from indices
    start(~locs) = [];
end

Get results and time it.

f = @findPattern2
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f = 
    @findPattern2
AOK =
     1

Using strfind

Next I test using the same data with strfind. Despite its name, strfind can happily handle non-character datatypes, and particularly doubles and integers.

f = @strfind
t(1) = isequal(f(a, [4 9]), [3 6]);
t(2) = isempty(f(a, [9 4]));
t(3) = isequal(f(b, '2003'),13);
t(4) = isequal(f(b, uint8('2003')),13);
AOK = all(t==true)
f = 
    @strfind
AOK =
     1

Use Case and Performance

Jeff described the problem he was solving in more detail. He has a file with multiple images in it, with data stored as uint8. The images are separated by a particular bit pattern. Let me show you one of the images in the sequence, after processing and extracting the frames.

load forloren
image(X(:,:,:,17)), axis off
whos X
  Name      Size               Bytes  Class    Attributes

  X         4-D             26726400  uint8              

Now let me show and time finding the pattern in the raw data. The data contain 29 images.

clear
load imrawdata
whos
pattern = [254 255 0 224];
f = @()findpattern(rawdata, pattern);
tfind = timeit(f);
f = @()findPattern2(rawdata, pattern);
tfind(2) = timeit(f);
f = @()strfind(rawdata, pattern);
tfind(3) = timeit(f)
  Name         Size                   Bytes  Class    Attributes

  rawdata      1x1259716            1259716  uint8              

tfind =
      0.80941     0.011273     0.019194

Puzzle and Next Steps

In the case of the larger dataset, strfind is not the fastest algorithm, though I found with much smaller data, strfind outperformed findPattern2. Some possible reasons why findPattern2 is the fastest of the three algorithms:

  • it is not as general purpose and does no error checking,
  • it was only written to work on vectors,
  • it does nothing to handle cases where |NaN|s might be involved.
  • If I find out why, I will let you know. In the meantime, if you have any thoughts to add, please comment here.


    Get the MATLAB code

    Published with MATLAB® 7.6

    19 CommentsOldest to Newest

    Hi, welcome back. Been a while since we saw something of your writing here.

    Anyways, good pointer about the fact that strfind isn’t limited to strings. Would this extend to other string functions (sprintf etc) and would bere be any possible use for them in a non-string context?

    –DA

    For example, could one write the entire image-splitting-problem as a 3-liner as follows:

    
    for i = 1:numberOfImages
     images{i} = strread(data, ['%c' deliminatorsequence] )
    end
    

    Daniel-

    You can use non-string inputs to sprintf for printing out numbers, etc. But strcmp, for example, expects string inputs. If not, you get an automatic false for the answer.

    Nice algorithm for getting the images in this case. You could rewrite it so it didn’t require you to know up front how many there were without much more code.

    –Loren

    One more note of interest. Yesterday I tried a different use case where the pattern to find was present much more often than for the case of the separated images. In that case, strfind was 5 times faster than my code, instead of 2 times slower.

    So, some of the performance depends very much on what the data landscape is like and what the expectations are for different conditions, e.g., lots of found patterns vs. a few.

    –Loren

    As an aside to Daniel’s solution, that wouldn’t work for Jeff’s cawse, because the images between the tags are compressed .jpg. So he writes out each found jpg and then reads them in for real using imread.

    –Loren

    Hello,

    I continue to be impressed with the gains Matlab has made in speeding up loops. I solved this problem using a brute force loop written for the JIT, and it is only 3 times slower than findPattern2.m for a random vector the same length as rawdata. This is obviously faster than findpattern.m.
    I then mexed the function and it is now faster than strfind and findPattern2. Neat puzzle.

    Hi Loren,

    a little off-topic: I tried to grab the matlab-code and repeat to publish this blog.
    Unfortunately I was missing the code for the functions findpattern.
    Is this a known issue for you? And further: how about adding the referenced (not built-in) functions in the code? Same goes for using existing workspace variables: they could be recreated in the code as well.

    keep up the good work, Han

    Han-

    Thanks for the comments. The reason I type out the code for findpattern in the blog is so you can copy it to run the example. I don’t know of a way, with grabcode, to get ancillary files.

    I’m not sure what you mean about adding the referenced, not built-in functions. Perhaps an example would help me out here. But again, grabcode only grabs the code for the published file itself.

    –Loren

    Hello,

    Nice blog, and it proved quite useful to me as a student.

    I have a query, that if I am not sure of the exact no. of values that may occur for a range based pattern, then how will it may be done.

    for. e.g.

    I need to check for a pattern whose values may be between -3.7 to +3.7 and the no. of values in between them too aren’t fixed as well.

    I have data from accelerometer against time. Which reads at 100Hz. So, I have alot data of the form for e.g:

    time = [0.0000 0.0001 0.0002 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0009.. ]
    and
    acc = [ 2.345 3.345 6.454 6.545 -3.5 1.5 2.6 6 2.5 99 32 -3.7 1.3 -2 2.7 … ]

    Now I want to check for a pattern of values which stays between -3.7 to + 3.7 for a duration 0.05 seconds or greater not less that 0.05 seconds. I want of find the start and end time of the pattern.

    thanks and kind regards,

    /Rameez.

    Rameez-

    That’s not the sort of pattern I was testing here initially. What I would do with your situation is something like this. Convert the array to logicals, where true represents data in the correct range. Then look for a long-enough string on true values. Get that index. Voila.

    –Loren

    I found this really helpful, thanks. Is there a fast and easy way for searching patterns that extent over multiple rows and colums?

    such as pattern=[0 1;0 0] in

    [0 1 1 1; 0 0 1 1; 0 0 0 1; 0 0 0 0]. The search should fined three locations for such a pattern.

    Any ideas?
    Phil

    Suppose I had a data matrix with 8 rows and 800 columns, representing the results of 800 experiments with 8 variables each – call the 8 variables age, height, weight, etc. Now suppose someone hands me a new column containing 8 data, in the same order as those in the matrix. How do I find which of the other 800 columns best “matches” the new column?

    Stuart-

    You have to define the criterion for choosing which one fits best – then compare the new data to each row in the dateset you already have.

    –Loren

    Can you also find patterns without specifying the input?
    If a = [0 1 4 9 16 4 9];
    suppose I just wanted to find any pattern in which two adjacent characters repeat. The answer here would be: Pattern 4 9 and placement 3 6.
    How would you do that?
    I’m actually working on a music analysis problem looking for melodic patterns in midi data.

    Thanks,
    MN

    MN-

    I can’t think of an elegant way to do that. Is it only 2 character matches you are looking for?

    Just thinking. Here are 2 possible approaches.
    1) You might create a new matrix with one row offset by one location (left or right, and padded to get all elements to be comparable) from the other row. Transposing those into a nx2 array, you could then look for matched rows by calling intersect with the rows option perhaps. But I haven’t thought hard about this.
    2)Something similar where you use diff to check…

    –Loren

    @MN
    If the original data are only 2 or 3 bytes long, which I think is the case with MIDI, could you not typecast a to get 2 values per element then use find:
    You may need to pad the end of a with zeros to align the data, then e.g

    a = [0 1 4 9 16 4 9 0 7 8 4 9 4 9 7 8 9 4 9 10];
    a=uint32(a);
    b=typecast(a,'uint64');
    % find odd numbered elements...
    idxodd=(find(b==typecast(uint32([4,9]),'uint64'))-1)*2+1;
    %then even
    b=typecast([uint32(0) a uint32(0)], 'uint64');
    idxeven=(find(b==typecast(uint32([4,9]),'uint64'))-1)*2;
    % Both
    idx=sort([idxodd idxeven])
    

    Hi Loren et al.,
    As always, thanks for the great post! How about extending to the case where you don’t know the exact pattern, but do know something about the pattern. For example, find the longest set of sequential numbers in a monotonically increasing array of integers. For example, in the sequence

    a = [1 2 14 15 16 17 18 29 30 41 44 45],
    ,
    the longest sequential set is 14:18. One vectorized way I came up with is:

    b = find([1 ~(diff(a)==1) 1])
    [~,c] = max(diff(b))
    a(b(c):b(c+1)-1)

    But I’d love to hear of more elegant solutions. Thanks!

    These postings are the author's and don't necessarily represent the opinions of MathWorks.