Loren on the Art of MATLAB

January 20th, 2009

More Ways to Find Matching Data

Today on the newsgroup, a user wanted help finding when values in a matrix matched some other values (see the post). There was already a solution posted when I was reading, but something about this problem kept nagging at me. So I've invested a little bit of time thinking more about the problem.

Contents

Sample Data

Here's sample data, and the user wants to find all the places in A which have values that match values in B. Simple enough statement.

A = [11 22 34 56 89
23 44 11 20 66
79 54 32 17 89
11 66 21 45 90]
B = [11 66 44 40 90]
A =
    11    22    34    56    89
    23    44    11    20    66
    79    54    32    17    89
    11    66    21    45    90
B =
    11    66    44    40    90

Bruno's Solution - Column-wise Solution

As I said, there was already a solution when I was reading. Here it is.

RESULTS = zeros(size(A));
for i = 1: size(B,2)
    RESULTS = RESULTS + ( A == B(1,i) );
end
RESULTS
RESULTS =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1

The idea here is to create the right size output, and cycle through the values in B (the smaller array for the user's example). Check to see where a given value in B matches one in A, and add a 1 to the RESULTS when those hits are found.

Find Exact Location Matches

To be honest, I misread the question at first, and came up with the following code. However, it does not solve the problem as stated!

C = ~(A-repmat(B,size(A,1),1))
C =
     1     0     0     0     0
     0     0     0     0     0
     0     0     0     0     0
     1     1     0     0     1

The reason I tried the solution with repmat was so I could first get the right answer, and then find a solution instead with bsxfun. Instead, this solution (which could be quite costly due to the repmat) looks to match values in specific column locations. In other words, the wrong problem solved.

Find Presence - Row-wise Solution

Wising up a very tiny amount, that I was solving the wrong problem, I next tried finding matches by row in A.

Z = zeros(size(A));
for k = 1:size(A,1)
    Z(k,:) = ismember(A(k,:),B);
end
Z
Z =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1

At least this time I get the right answer and am solving the right problem! But this normally would take longer than Bruno's approach because the user premise was that A was huge and B wasn't nearly so large.

isequal(Z,RESULTS)
ans =
     1

The reason I tried this approach was to then see if could convert it to something using arrayfun, or perhaps cellfun.

Another Solution

Finally I had some coffee however! And thank goodness. The answer was in front of me all along. And I was already using it before: ismember.

FinalAnswer = ismember(A,B)
FinalAnswer =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1
isequal(FinalAnswer, RESULTS)
ans =
     1

In one fell swoop I can compute the entire result because I don't care about matching those locations. It's enough to say that a value in A matches some value in B. Voila!

Other Approaches

I just showed you a few approaches for solving this problem. Do you have similar problems, perhaps ones that don't yield as simple a solution? Or do you have other approaches to solving this problem that might be useful, especially in more complicated situations? Please share them here.


Get the MATLAB code

Published with MATLAB® 7.7

27 Responses to “More Ways to Find Matching Data”

  1. matt fig replied on :

    Seems like the OP knew ismember was the way to go, but the arrays under consideration were to large.

  2. Mathias Francke replied on :

    Hi,

    I have two comments to this post:

    First, the user’s problem was memory related, not a question of method. On my computer, neither of the proposed solutions solve this problem (which could be taken care of by splitting A into submatrices).

    Second, if B should have repeated values (i.e. B not unique), Bruno’s solution is invalid because it adds the result of each comparison and thus would produce a 2 instead of a 1 if some value was represented twice in B.

    Nice blog, by the way.

    regards
    Mathias

  3. Daniel replied on :

    Well, it is hard to beat a one-command solution, but I allways begin with looking at something vectorized these days. In this case it is obvious that the data in question will be too large for this to work, but I like the clean workings of thie kinds of solutions.

    The idea is to make a multidimensional matrix that contains all possible combinations and then flatten it down to give us the answer we want.

    A = [11 22 34 56 89
    23 44 11 20 66
    79 54 32 17 89
    11 66 21 45 90];
    B = [11 66 44 40 90];
    
    A1 = repmat(A,[1 1 length(B)]);
    B1 = reshape(B,1,1,[]);
    B2 = repmat(B1,[size(A,1) size(A,2) 1] );
    ResultFull = (A1 == B2);
    FinalResult = (sum(F,3) >= 1);
    

  4. Daniel replied on :

    Speaking of bsxfun, Even after many years of daily Matlab, I feel very uncomfortable with bsxfun. However, I get the impression that properly used it is extremely powerful. Has someone done a crash-course in bsxfun?

  5. Marco replied on :

    Hi Loren,
    your final answer was my initial answer.
    But in this way you don’t have problem of memory?
    A = uint8
    B = uint8

    FinalAnswer = ismember(A,B);

    ??? Error using ==> double
    Out of memory. Type HELP MEMORY for your options.

    Error in ==> ismember at 88
    a = double(a);

    I have XP with 2GB of memory

  6. Loren replied on :

    To all,

    Yes, memory was an issue. Bruno’s solution produced doubles while ismember produces logicals, which take up less space. Plus the issue with duplicate values that was mentioned.

    I think if arrays are really that large, they probably need to be broken down into smaller chunks, either literally, or via use of the Parallel Computing Toolbox and codistributed arrays.

    –Loren

  7. Gautam Vallabha replied on :

    If memory is the limiting factor, then Bruno’s solution with a sparse matrix (& logical indexing rather than additions to accumulate the results) may do the trick:

    A = round(rand(10,10)*100);
    B = [10:10:100];
    
    spA = sparse(size(A,1),size(A,2));
    for i=1:length(B)
       spA( A == B(i) ) = 1;
    end
    all(all(ismember(A,B) == full(spA)))
    
  8. matt fig replied on :

    Daniel,

    I too had a little difficulty understanding bsxfun. I think the best way
    to understand it’s operation is in analogy to scalar expansion, which is
    fairly intuitive for most Matlab users. Given that we have:

    A = magic(5);
    B = 3;

    Matlab returns the result of

    A - B

    as if it did this:

    A - repmat(B,size(A,1),size(A,2))

    That is, Matlab expanded B to the same size as A, then performed the
    operation. Note that the requirement for this to automatically work is
    that all dimensions of B are equal to one.
    As for bsxfun, it works when all dimensions of B are equal to the
    corresponding dimensions of A or are equal to one. So given

    A = magic(5);
    B = [1 2 3 4 5];

    We see that all of the dimensions of B are equal to the corresponding
    dimensions of A except one. The first dimension of B (which is not equal
    to the first dimension of A) is equal to 1, so we have a candidate for
    bsxfun. Matlab produces the result of

    bsxfun(@minus,A,B)

    as if it had done (but it doesn’t do this!):

    A - repmat(B,size(A,1),1)

    We see that bsxfun expanded B along the first dimension. This extends to
    higher dimensions.

    A(:,:,1) = magic(5);
    A(:,:,2) = magic(5) - 3; %A is 5×5x2
    B = [1 2 3 4 5];

    bsxfun(@minus,A,B) == (A - repmat(B,[size(A,1),1,size(A,3)]))

    Hope that helps.

  9. Daniel replied on :

    Thanks matt fig. That was enlightening. Now I just need to do some excercises to help get used to the broader concept.

  10. Fabio Gori replied on :

    I have another similar problem to solve: I don’t want, only, to find out if elements of a vector/matrix A belong to another vector/matrix B; I’d like also to find out all the (multiple) locations of all the elements of A in B.

    Till now I have used something like this (a, b are vectors)
    for i = 1 : length(a)
    index{i} = find(a(i)==b) ;
    end

    Unfortunatly it is not so fast, but ismember takes only the highest location.

    Thanks, Fabio

  11. Jos replied on :

    To Fabio:

    This will do the job:

    %data
    A = [1 2 3 4] ;
    B = [1 2 1 1 4 2 2 2 4] ;

    % engine
    [r,c] = find(bsxfun(@eq,B(:),A))
    n = histc(c,1:numel(A))
    index = mat2cell(r,n,1)

    hth
    Jos

    http://www.mathworks.com/matlabcentral/fileexchange/authors/10584

  12. Daniel replied on :

    I notice noone seems to have actually presented a bsxfun solution to the initial problem, although it becomes very neet. Here it is for completeness

    clear;
    A = [11 22 34 56 89
    23 44 11 20 66
    79 54 32 17 89
    11 66 21 45 90];
    B = [11 66 44 40 90];

    B1 = reshape ( B, 1, 1, [] );
    ResultFull = bsxfun( @eq, A, B1 );
    FinalResult = any(ResultFull,3);

    Unfortunately, it still takes twice the time of ismember and the memory is still an issue as ResultFull is monsterously large. Of course, on my computer, making A a 100k*10 array doen’t even make the computer pause for memory and takes about 0.1s to compute.

  13. George Kapodistrias replied on :

    Hi Loren,

    Thank you for your post. I would not have thought to use the ismember function to solve the problem. I would have solved it using solution 1 (for loop, column-wise comparison). Out of curiosity I decided to time (tic-toc)the 3 solutions expecting to find the MATLAB function to produce the fastest result. This is what I got:

    For loop- column-wise: 0.00021 sec
    For loop- row-wise: 0.046 sec
    ismember: 0.055 sec

    Interestingly ismember was the slowest by quite a bit. I have been using MATLAB for a long time and I have always tried to vectorize and use TMW generated functions since they are (were?) most efficient. But in the last 1-2 years I have found examples that it is either faster to use a brute force method (e.g. a for loop) or that vectorization is not the time saver it used to be.

    Is that something that you have observed? Does it make any sense with what is under the hood of MATLAB now?

    Cheers,

    gk.

  14. Loren replied on :

    George-

    It’s hard to give good advice. We have been working hard on for-loop performance. Sometimes, as you observe in this case, it’s harder for a general purpose function, even vectorized, (such as ismember) to outperform an efficient single-purpose piece of code which does not have to do error checking, etc.

    –Loren

  15. matt fig replied on :

    George K,

    I see this all of the time, especially on the newsgroup. People tend to want to vectorize for performance when a loop is actually faster.

    Here is one current example: http://www.mathworks.com/matlabcentral/newsreader/view_thread/243067#623897

    Part of the problem is that, when the speed of execution is the primary concern, it is often hard to know beforehand if a for loop will be faster. There are several guides available which indicate what kinds of things a for loop needs to be accelerated efficiently. I find that if all of these conditions are met, a for loop will often be as fast or faster than even a built-in function. I personally view vectorization of loops built with acceleration in mind to be more of a fun challenge than the critical skill it once was. That doesn’t mean I think vectorization has no functional purpose, by the way.

  16. benh replied on :

    Why can “find” only work on integers, not floating point values? Is there another function that will do the same thing to non-integer values, or do i need to convert all the numbers in the search to integers? Thanks,

  17. Daniel replied on :

    George Kapodistrias,

    Just out of curiosity. How did you actually perform the timings? Timing things as fast as your results indicate means you have to do some tricks (Or use timeit which supposedly knows all the tricks). Otherwise, the numbers you get will be almost pure noise.

    As for the vecotrisation vs for-loops, I have noticed that vectorisation sometimes introduces massive dynamic memory allocation which is Slow(tm). In those cases a for-loop is the way to go.

    There is, in my opinion, another benefit of vectorisation, though, and that is clean and clear code. I find multiple for-loops quickly get me confused when reading code.

    –DA

  18. Loren replied on :

    Benh-

    find works on logical expressions. So you can make one that incorporates floating point values - but you have to be careful since equality is may not be exactly what you mean with floating point values.

    –Loren

  19. beljsl replied on :

    Actually if you look inside the function ‘ismember’, it is still using the iteration inside. That makes not a big difference with Bruno’s Solution.

    Here i give out another solution without iteration and can show exactly the location of each member of B in A.

    [ma,na] = size(A);
    [mb,nb] = size(B);
    [ia,ib] = meshgrid(1:ma,1:mb);
    [ja,jb] = meshgrid(1:na,1:nb);
    I = ~(A(ia,ja)-B(ib,jb));

  20. Loren replied on :

    Beljsl-

    I think your solution will run into memory issues even earlier than some of the others becuase of the large arrays you create with meshgrid.

    –loren

  21. beljsl replied on :

    dear loren-
    Yes. You are right. Just show my new idea.
    Vectorized operation always requires more memory than the others. It is really a trade-off. The advantage of this method is to produce exact location of each member of B in A.

  22. Chris Sulprizio replied on :

    Dear Loren,

    I have had a similar data mathcing question that I have been struggling to solve efficiently.

    Let us say I have a cell array S filled with strings. For this example I will use:

    S={'L';'6';'D';'D';'1';'9';'5';'T';'L';'G';'Z';'4';'5';'6';'7'};
    

    I also have another cell C array full strings.

    C={'6','D';'T','L';'4','5'}';
    

    I would like to search and match the columns of C within S. Eventually what I would like to do is build the cell array X which includes the search terms and some following terms(for this example I took the next two).

    X={'6','D','D','1';'T','L','G','Z';'4','5','6','7'}';
    

    In reality, S and C are very large so an efficient solution is important.

  23. Loren replied on :

    Chris-

    I haven’t tried anything out yet. Are the strings always a single character? If so, you could convert to character arrays and loop using strfind perhaps.

    –Loren

  24. matt fig replied on :

    Loren,
    That is the approach I took. I didn’t post to the blog because I was hoping Chris would take the question to the newsgroup. One problem is that it is not entirely clear what to do if more than one match is found. That is why I have a try catch in my code, which should be replaced once the appropriate action is clear.

     
    
    [r,c] = size(C);
    Sc = char(S)';
    Cc = reshape(char(C),r,c);
    nm2kp = 2;  % The number to keep after the match.
    C(r + 1:r + nm2kp,:) = {[]}; % Re-use C to make X.
    
    for ii = 1:c
        try
        idx = strfind(Sc,Cc(:,ii)');
        % This next line will error if more than one match.
        C(r+1 : r+nm2kp,ii) = S(idx+r : idx+r+1);
        catch
        end
    end
    
     
  25. Chris S replied on :

    Loren,

    Good question about the single letter thing. I am quickly discovering that if the strings of S and C were more than a single letter, the problem becomes a bit more difficult.

    Chris

  26. Vijay Pappu replied on :

    I just happened to see this post today and thought of giving it a try:
    My solution is:

    C = zeros(numel(A),1);
    for i = 1:numel(A)
    C(i) = (any(B == A(i)));
    end
    C = reshape(C,size(A));

    I am sure there are more elegant solutions for this but this is one of the ways of doing it.

  27. Ben replied on :

    Users working with larger arrays could consider the following algorithm. The complexity O(N*log(N)) is about same as ismember if the arrays have similar size, but surprisingly it seems to run up to 2x faster considering it’s up against mex code. If one of the arrays is small then ismember is faster. I haven’t tested it on distributed arrays but it should work if sort does.

     function y=ismember2(a,b)
    
    [as,ia]=sort(a(:));
    bs=sort(b(:));
    iis=false(size(as));
    n=numel(as);
    m=numel(bs);
    i=1;
    j=1;
    while (i<=n)&&(j<=m),
      if (as(i)>bs(j)),
        j=j+1;
      else
        iis(i)=(as(i)==bs(j));
        i=i+1;
      end;
    end;
    y=false(size(a));
    y(ia)=iis;
    end 

    Timings:

    >> n=2e7;a=ceil(n*rand(n,1));b=ceil(n*rand(n,1));
    >> tic;ismember(a,b);toc;
    Elapsed time is 39.229270 seconds.
    >> tic;ismember2(a,b);toc;
    Elapsed time is 21.876701 seconds.

Leave a Reply

Wrap code fragments inside <pre> tags, like this:

<pre class="code">
a = magic(3);
sum(a)
</pre>

If you have a "<" character in your code, either follow it with a space or replace it with "&lt;" (including the semicolon).


Loren Shure works on design of the MATLAB language at The MathWorks. She writes here about once a week on MATLAB programming and related topics.

  • Loren: Paul- There *are* issues depending on the sizes of ii and jj. And it’s a bit complicated, but really...
  • Loren: Bob- You don’t say what happens when you run your code. Can you please explain more. It looks like you...
  • Loren: Kishore- It is not clear to me what you are trying to actually achieve. If you want to concatenate the 4...
  • Kishore: sorry, in the previous code mat2cell(c,[19 121],[19 134],[19 84],[19 107])
  • Kishore: Hi Loren, Why does the following not work? data_classwise = [19x121 double] [19x134 double] [19x84 double]...
  • Paul Jackson: Loren, Are there any aspects of empty matrices that may be tricky when they are used as indices into...
  • Bob: Hi Lori, Im trying to process Unicode text files from more than one different locales than the standard latin...
  • Loren: Ben- The reference link in my post documents the behavior of sum([]) and prod([]) (although the prod part only...
  • Ben: Loren/Andrey, A further advantage of having sum([])==0 and prod([])==1 is that it’s consistent with array...
  • Loren: OysterEngineer- I will SO take you up on that offer. Can’t wait for a good reason to visit now....

These postings are the author's and don't necessarily represent the opinions of The MathWorks.