Loren on the Art of MATLAB

January 20th, 2009

More Ways to Find Matching Data

Today on the newsgroup, a user wanted help finding when values in a matrix matched some other values (see the post). There was already a solution posted when I was reading, but something about this problem kept nagging at me. So I've invested a little bit of time thinking more about the problem.

Contents

Sample Data

Here's sample data, and the user wants to find all the places in A which have values that match values in B. Simple enough statement.

A = [11 22 34 56 89
23 44 11 20 66
79 54 32 17 89
11 66 21 45 90]
B = [11 66 44 40 90]
A =
    11    22    34    56    89
    23    44    11    20    66
    79    54    32    17    89
    11    66    21    45    90
B =
    11    66    44    40    90

Bruno's Solution - Column-wise Solution

As I said, there was already a solution when I was reading. Here it is.

RESULTS = zeros(size(A));
for i = 1: size(B,2)
    RESULTS = RESULTS + ( A == B(1,i) );
end
RESULTS
RESULTS =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1

The idea here is to create the right size output, and cycle through the values in B (the smaller array for the user's example). Check to see where a given value in B matches one in A, and add a 1 to the RESULTS when those hits are found.

Find Exact Location Matches

To be honest, I misread the question at first, and came up with the following code. However, it does not solve the problem as stated!

C = ~(A-repmat(B,size(A,1),1))
C =
     1     0     0     0     0
     0     0     0     0     0
     0     0     0     0     0
     1     1     0     0     1

The reason I tried the solution with repmat was so I could first get the right answer, and then find a solution instead with bsxfun. Instead, this solution (which could be quite costly due to the repmat) looks to match values in specific column locations. In other words, the wrong problem solved.

Find Presence - Row-wise Solution

Wising up a very tiny amount, that I was solving the wrong problem, I next tried finding matches by row in A.

Z = zeros(size(A));
for k = 1:size(A,1)
    Z(k,:) = ismember(A(k,:),B);
end
Z
Z =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1

At least this time I get the right answer and am solving the right problem! But this normally would take longer than Bruno's approach because the user premise was that A was huge and B wasn't nearly so large.

isequal(Z,RESULTS)
ans =
     1

The reason I tried this approach was to then see if could convert it to something using arrayfun, or perhaps cellfun.

Another Solution

Finally I had some coffee however! And thank goodness. The answer was in front of me all along. And I was already using it before: ismember.

FinalAnswer = ismember(A,B)
FinalAnswer =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1
isequal(FinalAnswer, RESULTS)
ans =
     1

In one fell swoop I can compute the entire result because I don't care about matching those locations. It's enough to say that a value in A matches some value in B. Voila!

Other Approaches

I just showed you a few approaches for solving this problem. Do you have similar problems, perhaps ones that don't yield as simple a solution? Or do you have other approaches to solving this problem that might be useful, especially in more complicated situations? Please share them here.


Get the MATLAB code

Published with MATLAB® 7.7

33 Responses to “More Ways to Find Matching Data”

  1. matt fig replied on :

    Seems like the OP knew ismember was the way to go, but the arrays under consideration were to large.

  2. Mathias Francke replied on :

    Hi,

    I have two comments to this post:

    First, the user’s problem was memory related, not a question of method. On my computer, neither of the proposed solutions solve this problem (which could be taken care of by splitting A into submatrices).

    Second, if B should have repeated values (i.e. B not unique), Bruno’s solution is invalid because it adds the result of each comparison and thus would produce a 2 instead of a 1 if some value was represented twice in B.

    Nice blog, by the way.

    regards
    Mathias

  3. Daniel replied on :

    Well, it is hard to beat a one-command solution, but I allways begin with looking at something vectorized these days. In this case it is obvious that the data in question will be too large for this to work, but I like the clean workings of thie kinds of solutions.

    The idea is to make a multidimensional matrix that contains all possible combinations and then flatten it down to give us the answer we want.

    A = [11 22 34 56 89
    23 44 11 20 66
    79 54 32 17 89
    11 66 21 45 90];
    B = [11 66 44 40 90];
    
    A1 = repmat(A,[1 1 length(B)]);
    B1 = reshape(B,1,1,[]);
    B2 = repmat(B1,[size(A,1) size(A,2) 1] );
    ResultFull = (A1 == B2);
    FinalResult = (sum(F,3) >= 1);
    

  4. Daniel replied on :

    Speaking of bsxfun, Even after many years of daily Matlab, I feel very uncomfortable with bsxfun. However, I get the impression that properly used it is extremely powerful. Has someone done a crash-course in bsxfun?

  5. Marco replied on :

    Hi Loren,
    your final answer was my initial answer.
    But in this way you don’t have problem of memory?
    A = uint8
    B = uint8

    FinalAnswer = ismember(A,B);

    ??? Error using ==> double
    Out of memory. Type HELP MEMORY for your options.

    Error in ==> ismember at 88
    a = double(a);

    I have XP with 2GB of memory

  6. Loren replied on :

    To all,

    Yes, memory was an issue. Bruno’s solution produced doubles while ismember produces logicals, which take up less space. Plus the issue with duplicate values that was mentioned.

    I think if arrays are really that large, they probably need to be broken down into smaller chunks, either literally, or via use of the Parallel Computing Toolbox and codistributed arrays.

    –Loren

  7. Gautam Vallabha replied on :

    If memory is the limiting factor, then Bruno’s solution with a sparse matrix (& logical indexing rather than additions to accumulate the results) may do the trick:

    A = round(rand(10,10)*100);
    B = [10:10:100];
    
    spA = sparse(size(A,1),size(A,2));
    for i=1:length(B)
       spA( A == B(i) ) = 1;
    end
    all(all(ismember(A,B) == full(spA)))
    
  8. matt fig replied on :

    Daniel,

    I too had a little difficulty understanding bsxfun. I think the best way
    to understand it’s operation is in analogy to scalar expansion, which is
    fairly intuitive for most Matlab users. Given that we have:

    A = magic(5);
    B = 3;

    Matlab returns the result of

    A - B

    as if it did this:

    A - repmat(B,size(A,1),size(A,2))

    That is, Matlab expanded B to the same size as A, then performed the
    operation. Note that the requirement for this to automatically work is
    that all dimensions of B are equal to one.
    As for bsxfun, it works when all dimensions of B are equal to the
    corresponding dimensions of A or are equal to one. So given

    A = magic(5);
    B = [1 2 3 4 5];

    We see that all of the dimensions of B are equal to the corresponding
    dimensions of A except one. The first dimension of B (which is not equal
    to the first dimension of A) is equal to 1, so we have a candidate for
    bsxfun. Matlab produces the result of

    bsxfun(@minus,A,B)

    as if it had done (but it doesn’t do this!):

    A - repmat(B,size(A,1),1)

    We see that bsxfun expanded B along the first dimension. This extends to
    higher dimensions.

    A(:,:,1) = magic(5);
    A(:,:,2) = magic(5) - 3; %A is 5×5x2
    B = [1 2 3 4 5];

    bsxfun(@minus,A,B) == (A - repmat(B,[size(A,1),1,size(A,3)]))

    Hope that helps.

  9. Daniel replied on :

    Thanks matt fig. That was enlightening. Now I just need to do some excercises to help get used to the broader concept.

  10. Fabio Gori replied on :

    I have another similar problem to solve: I don’t want, only, to find out if elements of a vector/matrix A belong to another vector/matrix B; I’d like also to find out all the (multiple) locations of all the elements of A in B.

    Till now I have used something like this (a, b are vectors)
    for i = 1 : length(a)
    index{i} = find(a(i)==b) ;
    end

    Unfortunatly it is not so fast, but ismember takes only the highest location.

    Thanks, Fabio

  11. Jos replied on :

    To Fabio:

    This will do the job:

    %data
    A = [1 2 3 4] ;
    B = [1 2 1 1 4 2 2 2 4] ;

    % engine
    [r,c] = find(bsxfun(@eq,B(:),A))
    n = histc(c,1:numel(A))
    index = mat2cell(r,n,1)

    hth
    Jos

    http://www.mathworks.com/matlabcentral/fileexchange/authors/10584

  12. Daniel replied on :

    I notice noone seems to have actually presented a bsxfun solution to the initial problem, although it becomes very neet. Here it is for completeness

    clear;
    A = [11 22 34 56 89
    23 44 11 20 66
    79 54 32 17 89
    11 66 21 45 90];
    B = [11 66 44 40 90];

    B1 = reshape ( B, 1, 1, [] );
    ResultFull = bsxfun( @eq, A, B1 );
    FinalResult = any(ResultFull,3);

    Unfortunately, it still takes twice the time of ismember and the memory is still an issue as ResultFull is monsterously large. Of course, on my computer, making A a 100k*10 array doen’t even make the computer pause for memory and takes about 0.1s to compute.

  13. George Kapodistrias replied on :

    Hi Loren,

    Thank you for your post. I would not have thought to use the ismember function to solve the problem. I would have solved it using solution 1 (for loop, column-wise comparison). Out of curiosity I decided to time (tic-toc)the 3 solutions expecting to find the MATLAB function to produce the fastest result. This is what I got:

    For loop- column-wise: 0.00021 sec
    For loop- row-wise: 0.046 sec
    ismember: 0.055 sec

    Interestingly ismember was the slowest by quite a bit. I have been using MATLAB for a long time and I have always tried to vectorize and use TMW generated functions since they are (were?) most efficient. But in the last 1-2 years I have found examples that it is either faster to use a brute force method (e.g. a for loop) or that vectorization is not the time saver it used to be.

    Is that something that you have observed? Does it make any sense with what is under the hood of MATLAB now?

    Cheers,

    gk.

  14. Loren replied on :

    George-

    It’s hard to give good advice. We have been working hard on for-loop performance. Sometimes, as you observe in this case, it’s harder for a general purpose function, even vectorized, (such as ismember) to outperform an efficient single-purpose piece of code which does not have to do error checking, etc.

    –Loren

  15. matt fig replied on :

    George K,

    I see this all of the time, especially on the newsgroup. People tend to want to vectorize for performance when a loop is actually faster.

    Here is one current example: http://www.mathworks.com/matlabcentral/newsreader/view_thread/243067#623897

    Part of the problem is that, when the speed of execution is the primary concern, it is often hard to know beforehand if a for loop will be faster. There are several guides available which indicate what kinds of things a for loop needs to be accelerated efficiently. I find that if all of these conditions are met, a for loop will often be as fast or faster than even a built-in function. I personally view vectorization of loops built with acceleration in mind to be more of a fun challenge than the critical skill it once was. That doesn’t mean I think vectorization has no functional purpose, by the way.

  16. benh replied on :

    Why can “find” only work on integers, not floating point values? Is there another function that will do the same thing to non-integer values, or do i need to convert all the numbers in the search to integers? Thanks,

  17. Daniel replied on :

    George Kapodistrias,

    Just out of curiosity. How did you actually perform the timings? Timing things as fast as your results indicate means you have to do some tricks (Or use timeit which supposedly knows all the tricks). Otherwise, the numbers you get will be almost pure noise.

    As for the vecotrisation vs for-loops, I have noticed that vectorisation sometimes introduces massive dynamic memory allocation which is Slow(tm). In those cases a for-loop is the way to go.

    There is, in my opinion, another benefit of vectorisation, though, and that is clean and clear code. I find multiple for-loops quickly get me confused when reading code.

    –DA

  18. Loren replied on :

    Benh-

    find works on logical expressions. So you can make one that incorporates floating point values - but you have to be careful since equality is may not be exactly what you mean with floating point values.

    –Loren

  19. beljsl replied on :

    Actually if you look inside the function ‘ismember’, it is still using the iteration inside. That makes not a big difference with Bruno’s Solution.

    Here i give out another solution without iteration and can show exactly the location of each member of B in A.

    [ma,na] = size(A);
    [mb,nb] = size(B);
    [ia,ib] = meshgrid(1:ma,1:mb);
    [ja,jb] = meshgrid(1:na,1:nb);
    I = ~(A(ia,ja)-B(ib,jb));

  20. Loren replied on :

    Beljsl-

    I think your solution will run into memory issues even earlier than some of the others becuase of the large arrays you create with meshgrid.

    –loren

  21. beljsl replied on :

    dear loren-
    Yes. You are right. Just show my new idea.
    Vectorized operation always requires more memory than the others. It is really a trade-off. The advantage of this method is to produce exact location of each member of B in A.

  22. Chris Sulprizio replied on :

    Dear Loren,

    I have had a similar data mathcing question that I have been struggling to solve efficiently.

    Let us say I have a cell array S filled with strings. For this example I will use:

    S={'L';'6';'D';'D';'1';'9';'5';'T';'L';'G';'Z';'4';'5';'6';'7'};
    

    I also have another cell C array full strings.

    C={'6','D';'T','L';'4','5'}';
    

    I would like to search and match the columns of C within S. Eventually what I would like to do is build the cell array X which includes the search terms and some following terms(for this example I took the next two).

    X={'6','D','D','1';'T','L','G','Z';'4','5','6','7'}';
    

    In reality, S and C are very large so an efficient solution is important.

  23. Loren replied on :

    Chris-

    I haven’t tried anything out yet. Are the strings always a single character? If so, you could convert to character arrays and loop using strfind perhaps.

    –Loren

  24. matt fig replied on :

    Loren,
    That is the approach I took. I didn’t post to the blog because I was hoping Chris would take the question to the newsgroup. One problem is that it is not entirely clear what to do if more than one match is found. That is why I have a try catch in my code, which should be replaced once the appropriate action is clear.

     
    
    [r,c] = size(C);
    Sc = char(S)';
    Cc = reshape(char(C),r,c);
    nm2kp = 2;  % The number to keep after the match.
    C(r + 1:r + nm2kp,:) = {[]}; % Re-use C to make X.
    
    for ii = 1:c
        try
        idx = strfind(Sc,Cc(:,ii)');
        % This next line will error if more than one match.
        C(r+1 : r+nm2kp,ii) = S(idx+r : idx+r+1);
        catch
        end
    end
    
     
  25. Chris S replied on :

    Loren,

    Good question about the single letter thing. I am quickly discovering that if the strings of S and C were more than a single letter, the problem becomes a bit more difficult.

    Chris

  26. Vijay Pappu replied on :

    I just happened to see this post today and thought of giving it a try:
    My solution is:

    C = zeros(numel(A),1);
    for i = 1:numel(A)
    C(i) = (any(B == A(i)));
    end
    C = reshape(C,size(A));

    I am sure there are more elegant solutions for this but this is one of the ways of doing it.

  27. Ben replied on :

    Users working with larger arrays could consider the following algorithm. The complexity O(N*log(N)) is about same as ismember if the arrays have similar size, but surprisingly it seems to run up to 2x faster considering it’s up against mex code. If one of the arrays is small then ismember is faster. I haven’t tested it on distributed arrays but it should work if sort does.

     function y=ismember2(a,b)
    
    [as,ia]=sort(a(:));
    bs=sort(b(:));
    iis=false(size(as));
    n=numel(as);
    m=numel(bs);
    i=1;
    j=1;
    while (i<=n)&&(j<=m),
      if (as(i)>bs(j)),
        j=j+1;
      else
        iis(i)=(as(i)==bs(j));
        i=i+1;
      end;
    end;
    y=false(size(a));
    y(ia)=iis;
    end 

    Timings:

    >> n=2e7;a=ceil(n*rand(n,1));b=ceil(n*rand(n,1));
    >> tic;ismember(a,b);toc;
    Elapsed time is 39.229270 seconds.
    >> tic;ismember2(a,b);toc;
    Elapsed time is 21.876701 seconds.

  28. Ghazanfar Ali replied on :

    Hi Miss Loren
    I am in need of an algo to count cosecutive duplicate values in a one dimensional matrix. e.g A=[1 1 1 2 2 3 3 3 3 4 4 4 4 4 5 6 6] should give me another matrix with the exact count of these repeated values e.g B=[3 2 4 5 1 2] . If u kindly send me the code at my email address i.e ghazanfar.ali@live.com I shall be very obliged. Thanks and Best Regards. Ghazanfar Ali. Pakistan

  29. Loren replied on :

    Hi Ghazanfar-

    Please look at the help for the function hist (and perhaps histc). They will do what you want for these integer values.

    –Loren

  30. dpath2o replied on :

    Using ismember is a novel technique and reading through this discussion has been very informative. I’m wondering if someone here has had to deal with tolerances in there matching? What I mean to say is:

    A = [ 5.1 9.5 20 31];
    B = [ 5 10 15 20 25 30];

    My of finding of near and exact matches:
    i1=[];
    i2=[];
    for l1=1:numel(B)
    for l2=1:numel(A)
    if (B(l1)=A(l2)-2.5)
    i1 = [i1;l1];
    i2 = [i2;l2];
    end
    end
    end

    This works fine for small arrays but starts taking LONG time when A and B are considerably large, which they are in my real-world case.

    Anyone have a suggestion as to vectorize or speed this up?

    Thanks

  31. Loren replied on :

    Dpath2o-

    You can try bsxfun with an anonymous function (@(x,y) x-y+2.5) and turn one of the arrays into a column. You will end up with a (possibly large) matrix with 1s where the corresponding row and column have the relationship you want. With floating point though, checking for equality is not likely to get you results you really want. See my little example here. You probably want to build in some tolerance.

    A = 1:3
    B = [1.1 3.5 6]
    f = @(x,y) (x-y == 2.5)
    bsxfun(f,A',B)
    ans =
         0     0     0
         0     0     0
         0     0     0
    

    Do you really mean to compare all elements of A with all elements of B?

    –Loren

  32. dpath2o replied on :

    Loren,

    Thanks for help. Though this approach constructs a length(A)xlength(B(1,:)) matrix and that is not what the job requires.

    What I have is a time array of matlab datenums. Sometimes there are gaps in this time array. The gaps are not filled with nan’s or zeros — i.e. the gap is merely missing.

    So what I need to construct is a complete array of times and then find then find the times that match from the in-situ data. Using the indexes of the found times I will put the in-situ results back into a large matrix that is filled with nan’s so that gaps in time will be represented by NaN.

    This is my hurdle.

    Cheers,
    Dan

  33. dpath2o replied on :

    From the above example here is my solution that works well but doesn’t allow for near misses — i.e. times that are close but not exact:

    x=[5.1 12.23 0.567 12.01 0.555;
          9.5  12.03 0.578 12.11 0.595;
          20   12.06 0.588 12.12 0.596;
          31   12.09  0.591 12.20 0.601];
    A = x(:,1);
    B = 5:5:30;
    tmp=ismember(A,B);
    X=repmat(nan,length(B),length(A(1,:)));
    for l1=1:length(tmp)
    X(tmp(l1),:) = A(l1,2:end);
    end
    

    Not that pretty and it doesn’t allow for near misses, but I can’t think of a different approach.

    Any other thoughtful clues?

    Thanks!

    Cheers,
    Dan

Leave a Reply

Wrap code fragments inside <pre> tags, like this:

<pre class="code">
a = magic(3);
sum(a)
</pre>

If you have a "<" character in your code, either follow it with a space or replace it with "&lt;" (including the semicolon).


Loren Shure works on design of the MATLAB language at The MathWorks. She writes here about once a week on MATLAB programming and related topics.

  • Jun: I totally can not believe it, Loren. You are really helpful. Thank you so much, MATLAB master!
  • Loren: Wow folks- Always lots of interest when there’s a quickie to try out! I will only make 2 general...
  • Loren: Jun- ismember is your friend here: >> [aa,ind] = ismember(Array2,Arra y1) aa = 1 1 1 1 1 1 1 ind = 1 2 1 4 4 3...
  • Dan: I like the first way better than the second way. Combining the arrays into one and running any is nice, although...
  • James Myatt: How about I = (a == 0 | b == 0); a(I) = []; b(I) = [];
  • Tunc: Hello Loren, love your blog because of such inspiring and challenging comments to such ’small’...
  • Pekka Kumpulainen: Here is my tradeoff. I usually want to keep the original variables as they are most probably...
  • Iain: Followup: Of course, to allow NaNs (counting them as non-zero): mask = (a~=0) & (b~=0); The mask says “a...
  • Matt Fig: I would usually go with something like this: y = a&b; x = a(y); y = b(y); But I was surprised to find...
  • kk: c=all([a;b]) a(c) a(b)

These postings are the author's and don't necessarily represent the opinions of The MathWorks.