Finding Patterns in Arrays
Recently, my colleague Jeff asked me if I would look at some code he wrote to find a pattern of numbers in a larger array. Without looking at his code, I asked if he had tried using strfind, despite his data not being strings. He found that it solved his problem and was faster than his M-file. In the meantime, I wanted to see what it took for me to write a simple algorithm I was thinking of in an M-file. I show and discuss the results here.
Contents
Simple Test Data
Let me start off with really simple test data to be sure all algorithms are getting the right answers.
a = [0 1 4 9 16 4 9];
b = double('The year is 2003.');
First Algorithm : findpattern
Here's the first findpattern algorithm.
type findpattern
function idx = findpattern(in_array, pattern) %FINDPATTERN Find a pattern in an array. % % K = FINDPATTERN(ARRAY, PATTERN) returns the starting indices % of any occurences of the PATTERN in the ARRAY. ARRAY and PATTERN % can be any mixture of character and numeric types. % % Examples: % a = [0 1 4 9 16 4 9]; % b = double('The year is 2003.'); % findpattern(a, [4 9]) returns [3 6] % findpattern(a, [9 4]) returns [] % findpattern(b, '2003') returns 13 % findpattern(b, uint8('2003')) returns 13 % % See also STRFIND, STRCMP, STRNCMP, STRMATCH. % Algorithm: % Find all of the occurrences of each number of the pattern. % % For an n element pattern, the result is an n element cell array. The % i-th cell contains the positions in the input array that match the i-th % element of the pattern. % % When the pattern exists in the input stream, there will be a sequence % of consecutive integers in consecutive cells. % As currently implemented, this routine has poor performance for patterns % with more than half a dozen elements where the first element in the % pattern matches many positions in the array. locations = cell(1, numel(pattern)); for p = 1:(numel(pattern)) locations{p} = find(in_array == pattern(p)); end % Find instances of the pattern in the array. idx = []; for p = 1:numel(locations{1}) % Look for a consecutive progression of locations. start_value = locations{1}(p); for q = 2:numel(locations) found = true; if (~any((start_value + q - 1) == locations{q})) found = false; break; end end if (found) idx(end + 1) = locations{1}(p); end end
You'll notice that Jeff chooses to store derived information on the pattern being present in a cell array, and then looks for consecutive locations.
Here are some results using findpattern. First I set f to be a function handle to the function in question. Then I can reuse the same code for the other cases simply by redefining the function.
f = @findpattern t(4) = false; t(1) = isequal(f(a, [4 9]), [3 6]); t(2) = isempty(f(a, [9 4])); t(3) = isequal(f(b, '2003'),13); t(4) = isequal(f(b, uint8('2003')),13); AOK = all(t==true)
f = @findpattern AOK = 1
My Homebrew Algorithm : findPattern2
Here's my own algorithm. The idea here is to find possible pattern locations first, and winnow them out, marching through the pattern, which I am assuming is generally smaller, and often a lot smaller, than the data.
type findPattern2
function start = findPattern2(array, pattern) %findPattern2 Locate a pattern in an array. % % indices = findPattern2(array, pattern) finds the starting indices of % pattern within array. % % Example: % a = [0 1 4 9 16 4 9]; % patt = [4 9]; % indices = findPattern2(a,patt) % indices = % 3 6 % Let's assume for now that both the pattern and the array are non-empty % VECTORS, but there's no checking for this. % For this algorithm, I loop over the pattern elements. len = length(pattern); % First, find candidate locations; i.e., match the first element in the % pattern. start = find(array==pattern(1)); % Next remove start values that are too close to the end to possibly match % the pattern. endVals = start+len-1; start(endVals>length(array)) = []; % Next, loop over elements of pattern, usually much shorter than length of % array, to check which possible locations are valid still. for pattval = 2:len % check viable locations in array locs = pattern(pattval) == array(start+pattval-1); % delete false ones from indices start(~locs) = []; end
Get results and time it.
f = @findPattern2 t(1) = isequal(f(a, [4 9]), [3 6]); t(2) = isempty(f(a, [9 4])); t(3) = isequal(f(b, '2003'),13); t(4) = isequal(f(b, uint8('2003')),13); AOK = all(t==true)
f = @findPattern2 AOK = 1
Using strfind
Next I test using the same data with strfind. Despite its name, strfind can happily handle non-character datatypes, and particularly doubles and integers.
f = @strfind t(1) = isequal(f(a, [4 9]), [3 6]); t(2) = isempty(f(a, [9 4])); t(3) = isequal(f(b, '2003'),13); t(4) = isequal(f(b, uint8('2003')),13); AOK = all(t==true)
f = @strfind AOK = 1
Use Case and Performance
Jeff described the problem he was solving in more detail. He has a file with multiple images in it, with data stored as uint8. The images are separated by a particular bit pattern. Let me show you one of the images in the sequence, after processing and extracting the frames.
load forloren image(X(:,:,:,17)), axis off whos X
Name Size Bytes Class Attributes X 4-D 26726400 uint8
Now let me show and time finding the pattern in the raw data. The data contain 29 images.
clear
load imrawdata
whos
pattern = [254 255 0 224];
f = @()findpattern(rawdata, pattern);
tfind = timeit(f);
f = @()findPattern2(rawdata, pattern);
tfind(2) = timeit(f);
f = @()strfind(rawdata, pattern);
tfind(3) = timeit(f)
Name Size Bytes Class Attributes rawdata 1x1259716 1259716 uint8 tfind = 0.80941 0.011273 0.019194
Puzzle and Next Steps
In the case of the larger dataset, strfind is not the fastest algorithm, though I found with much smaller data, strfind outperformed findPattern2. Some possible reasons why findPattern2 is the fastest of the three algorithms:
If I find out why, I will let you know. In the meantime, if you have any thoughts to add, please comment here.
- Category:
- Efficiency,
- Less Used Functionality