Loren on the Art of MATLAB

More Ways to Find Matching Data 55

Posted by Loren Shure,

Today on the newsgroup, a user wanted help finding when values in a matrix matched some other values (see the post). There was already a solution posted when I was reading, but something about this problem kept nagging at me. So I've invested a little bit of time thinking more about the problem.

Contents

Sample Data

Here's sample data, and the user wants to find all the places in A which have values that match values in B. Simple enough statement.

A = [11 22 34 56 89
23 44 11 20 66
79 54 32 17 89
11 66 21 45 90]
B = [11 66 44 40 90]
A =
    11    22    34    56    89
    23    44    11    20    66
    79    54    32    17    89
    11    66    21    45    90
B =
    11    66    44    40    90

Bruno's Solution - Column-wise Solution

As I said, there was already a solution when I was reading. Here it is.

RESULTS = zeros(size(A));
for i = 1: size(B,2)
    RESULTS = RESULTS + ( A == B(1,i) );
end
RESULTS
RESULTS =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1

The idea here is to create the right size output, and cycle through the values in B (the smaller array for the user's example). Check to see where a given value in B matches one in A, and add a 1 to the RESULTS when those hits are found.

Find Exact Location Matches

To be honest, I misread the question at first, and came up with the following code. However, it does not solve the problem as stated!

C = ~(A-repmat(B,size(A,1),1))
C =
     1     0     0     0     0
     0     0     0     0     0
     0     0     0     0     0
     1     1     0     0     1

The reason I tried the solution with repmat was so I could first get the right answer, and then find a solution instead with bsxfun. Instead, this solution (which could be quite costly due to the repmat) looks to match values in specific column locations. In other words, the wrong problem solved.

Find Presence - Row-wise Solution

Wising up a very tiny amount, that I was solving the wrong problem, I next tried finding matches by row in A.

Z = zeros(size(A));
for k = 1:size(A,1)
    Z(k,:) = ismember(A(k,:),B);
end
Z
Z =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1

At least this time I get the right answer and am solving the right problem! But this normally would take longer than Bruno's approach because the user premise was that A was huge and B wasn't nearly so large.

isequal(Z,RESULTS)
ans =
     1

The reason I tried this approach was to then see if could convert it to something using arrayfun, or perhaps cellfun.

Another Solution

Finally I had some coffee however! And thank goodness. The answer was in front of me all along. And I was already using it before: ismember.

FinalAnswer = ismember(A,B)
FinalAnswer =
     1     0     0     0     0
     0     1     1     0     1
     0     0     0     0     0
     1     1     0     0     1
isequal(FinalAnswer, RESULTS)
ans =
     1

In one fell swoop I can compute the entire result because I don't care about matching those locations. It's enough to say that a value in A matches some value in B. Voila!

Other Approaches

I just showed you a few approaches for solving this problem. Do you have similar problems, perhaps ones that don't yield as simple a solution? Or do you have other approaches to solving this problem that might be useful, especially in more complicated situations? Please share them here.


Get the MATLAB code

Published with MATLAB® 7.7

55 CommentsOldest to Newest

Hi,

I have two comments to this post:

First, the user’s problem was memory related, not a question of method. On my computer, neither of the proposed solutions solve this problem (which could be taken care of by splitting A into submatrices).

Second, if B should have repeated values (i.e. B not unique), Bruno’s solution is invalid because it adds the result of each comparison and thus would produce a 2 instead of a 1 if some value was represented twice in B.

Nice blog, by the way.

regards
Mathias

Well, it is hard to beat a one-command solution, but I allways begin with looking at something vectorized these days. In this case it is obvious that the data in question will be too large for this to work, but I like the clean workings of thie kinds of solutions.

The idea is to make a multidimensional matrix that contains all possible combinations and then flatten it down to give us the answer we want.

A = [11 22 34 56 89
23 44 11 20 66
79 54 32 17 89
11 66 21 45 90];
B = [11 66 44 40 90];

A1 = repmat(A,[1 1 length(B)]);
B1 = reshape(B,1,1,[]);
B2 = repmat(B1,[size(A,1) size(A,2) 1] );
ResultFull = (A1 == B2);
FinalResult = (sum(F,3) >= 1);

Speaking of bsxfun, Even after many years of daily Matlab, I feel very uncomfortable with bsxfun. However, I get the impression that properly used it is extremely powerful. Has someone done a crash-course in bsxfun?

Hi Loren,
your final answer was my initial answer.
But in this way you don’t have problem of memory?
A = uint8
B = uint8

FinalAnswer = ismember(A,B);

??? Error using ==> double
Out of memory. Type HELP MEMORY for your options.

Error in ==> ismember at 88
a = double(a);

I have XP with 2GB of memory

To all,

Yes, memory was an issue. Bruno’s solution produced doubles while ismember produces logicals, which take up less space. Plus the issue with duplicate values that was mentioned.

I think if arrays are really that large, they probably need to be broken down into smaller chunks, either literally, or via use of the Parallel Computing Toolbox and codistributed arrays.

–Loren

If memory is the limiting factor, then Bruno’s solution with a sparse matrix (& logical indexing rather than additions to accumulate the results) may do the trick:

A = round(rand(10,10)*100);
B = [10:10:100];

spA = sparse(size(A,1),size(A,2));
for i=1:length(B)
   spA( A == B(i) ) = 1;
end
all(all(ismember(A,B) == full(spA)))

Daniel,

I too had a little difficulty understanding bsxfun. I think the best way
to understand it’s operation is in analogy to scalar expansion, which is
fairly intuitive for most Matlab users. Given that we have:

A = magic(5);
B = 3;

Matlab returns the result of

A – B

as if it did this:

A – repmat(B,size(A,1),size(A,2))

That is, Matlab expanded B to the same size as A, then performed the
operation. Note that the requirement for this to automatically work is
that all dimensions of B are equal to one.
As for bsxfun, it works when all dimensions of B are equal to the
corresponding dimensions of A or are equal to one. So given

A = magic(5);
B = [1 2 3 4 5];

We see that all of the dimensions of B are equal to the corresponding
dimensions of A except one. The first dimension of B (which is not equal
to the first dimension of A) is equal to 1, so we have a candidate for
bsxfun. Matlab produces the result of

bsxfun(@minus,A,B)

as if it had done (but it doesn’t do this!):

A – repmat(B,size(A,1),1)

We see that bsxfun expanded B along the first dimension. This extends to
higher dimensions.

A(:,:,1) = magic(5);
A(:,:,2) = magic(5) – 3; %A is 5x5x2
B = [1 2 3 4 5];

bsxfun(@minus,A,B) == (A – repmat(B,[size(A,1),1,size(A,3)]))

Hope that helps.

Thanks matt fig. That was enlightening. Now I just need to do some excercises to help get used to the broader concept.

I have another similar problem to solve: I don’t want, only, to find out if elements of a vector/matrix A belong to another vector/matrix B; I’d like also to find out all the (multiple) locations of all the elements of A in B.

Till now I have used something like this (a, b are vectors)
for i = 1 : length(a)
index{i} = find(a(i)==b) ;
end

Unfortunatly it is not so fast, but ismember takes only the highest location.

Thanks, Fabio

I notice noone seems to have actually presented a bsxfun solution to the initial problem, although it becomes very neet. Here it is for completeness

clear;
A = [11 22 34 56 89
23 44 11 20 66
79 54 32 17 89
11 66 21 45 90];
B = [11 66 44 40 90];

B1 = reshape ( B, 1, 1, [] );
ResultFull = bsxfun( @eq, A, B1 );
FinalResult = any(ResultFull,3);

Unfortunately, it still takes twice the time of ismember and the memory is still an issue as ResultFull is monsterously large. Of course, on my computer, making A a 100k*10 array doen't even make the computer pause for memory and takes about 0.1s to compute.

Hi Loren,

Thank you for your post. I would not have thought to use the ismember function to solve the problem. I would have solved it using solution 1 (for loop, column-wise comparison). Out of curiosity I decided to time (tic-toc)the 3 solutions expecting to find the MATLAB function to produce the fastest result. This is what I got:

For loop- column-wise: 0.00021 sec
For loop- row-wise: 0.046 sec
ismember: 0.055 sec

Interestingly ismember was the slowest by quite a bit. I have been using MATLAB for a long time and I have always tried to vectorize and use TMW generated functions since they are (were?) most efficient. But in the last 1-2 years I have found examples that it is either faster to use a brute force method (e.g. a for loop) or that vectorization is not the time saver it used to be.

Is that something that you have observed? Does it make any sense with what is under the hood of MATLAB now?

Cheers,

gk.

George-

It’s hard to give good advice. We have been working hard on for-loop performance. Sometimes, as you observe in this case, it’s harder for a general purpose function, even vectorized, (such as ismember) to outperform an efficient single-purpose piece of code which does not have to do error checking, etc.

–Loren

George K,

I see this all of the time, especially on the newsgroup. People tend to want to vectorize for performance when a loop is actually faster.

Here is one current example: http://www.mathworks.com/matlabcentral/newsreader/view_thread/243067#623897

Part of the problem is that, when the speed of execution is the primary concern, it is often hard to know beforehand if a for loop will be faster. There are several guides available which indicate what kinds of things a for loop needs to be accelerated efficiently. I find that if all of these conditions are met, a for loop will often be as fast or faster than even a built-in function. I personally view vectorization of loops built with acceleration in mind to be more of a fun challenge than the critical skill it once was. That doesn’t mean I think vectorization has no functional purpose, by the way.

Why can “find” only work on integers, not floating point values? Is there another function that will do the same thing to non-integer values, or do i need to convert all the numbers in the search to integers? Thanks,

George Kapodistrias,

Just out of curiosity. How did you actually perform the timings? Timing things as fast as your results indicate means you have to do some tricks (Or use timeit which supposedly knows all the tricks). Otherwise, the numbers you get will be almost pure noise.

As for the vecotrisation vs for-loops, I have noticed that vectorisation sometimes introduces massive dynamic memory allocation which is Slow(tm). In those cases a for-loop is the way to go.

There is, in my opinion, another benefit of vectorisation, though, and that is clean and clear code. I find multiple for-loops quickly get me confused when reading code.

–DA

Benh-

find works on logical expressions. So you can make one that incorporates floating point values – but you have to be careful since equality is may not be exactly what you mean with floating point values.

–Loren

Actually if you look inside the function ‘ismember’, it is still using the iteration inside. That makes not a big difference with Bruno’s Solution.

Here i give out another solution without iteration and can show exactly the location of each member of B in A.

[ma,na] = size(A);
[mb,nb] = size(B);
[ia,ib] = meshgrid(1:ma,1:mb);
[ja,jb] = meshgrid(1:na,1:nb);
I = ~(A(ia,ja)-B(ib,jb));

Beljsl-

I think your solution will run into memory issues even earlier than some of the others becuase of the large arrays you create with meshgrid.

–loren

dear loren-
Yes. You are right. Just show my new idea.
Vectorized operation always requires more memory than the others. It is really a trade-off. The advantage of this method is to produce exact location of each member of B in A.

Dear Loren,

I have had a similar data mathcing question that I have been struggling to solve efficiently.

Let us say I have a cell array S filled with strings. For this example I will use:

S={'L';'6';'D';'D';'1';'9';'5';'T';'L';'G';'Z';'4';'5';'6';'7'};

I also have another cell C array full strings.

C={'6','D';'T','L';'4','5'}';

I would like to search and match the columns of C within S. Eventually what I would like to do is build the cell array X which includes the search terms and some following terms(for this example I took the next two).

X={'6','D','D','1';'T','L','G','Z';'4','5','6','7'}';

In reality, S and C are very large so an efficient solution is important.

Chris-

I haven’t tried anything out yet. Are the strings always a single character? If so, you could convert to character arrays and loop using strfind perhaps.

–Loren

Loren,
That is the approach I took. I didn’t post to the blog because I was hoping Chris would take the question to the newsgroup. One problem is that it is not entirely clear what to do if more than one match is found. That is why I have a try catch in my code, which should be replaced once the appropriate action is clear.

 

[r,c] = size(C);
Sc = char(S)';
Cc = reshape(char(C),r,c);
nm2kp = 2;  % The number to keep after the match.
C(r + 1:r + nm2kp,:) = {[]}; % Re-use C to make X.

for ii = 1:c
    try
    idx = strfind(Sc,Cc(:,ii)');
    % This next line will error if more than one match.
    C(r+1 : r+nm2kp,ii) = S(idx+r : idx+r+1);
    catch
    end
end

 

Loren,

Good question about the single letter thing. I am quickly discovering that if the strings of S and C were more than a single letter, the problem becomes a bit more difficult.

Chris

I just happened to see this post today and thought of giving it a try:
My solution is:

C = zeros(numel(A),1);
for i = 1:numel(A)
C(i) = (any(B == A(i)));
end
C = reshape(C,size(A));

I am sure there are more elegant solutions for this but this is one of the ways of doing it.

Users working with larger arrays could consider the following algorithm. The complexity O(N*log(N)) is about same as ismember if the arrays have similar size, but surprisingly it seems to run up to 2x faster considering it’s up against mex code. If one of the arrays is small then ismember is faster. I haven’t tested it on distributed arrays but it should work if sort does.

 function y=ismember2(a,b)

[as,ia]=sort(a(:));
bs=sort(b(:));
iis=false(size(as));
n=numel(as);
m=numel(bs);
i=1;
j=1;
while (i<=n)&&(j<=m),
  if (as(i)>bs(j)),
    j=j+1;
  else
    iis(i)=(as(i)==bs(j));
    i=i+1;
  end;
end;
y=false(size(a));
y(ia)=iis;
end 

Timings:

>> n=2e7;a=ceil(n*rand(n,1));b=ceil(n*rand(n,1));
>> tic;ismember(a,b);toc;
Elapsed time is 39.229270 seconds.
>> tic;ismember2(a,b);toc;
Elapsed time is 21.876701 seconds.

Hi Miss Loren
I am in need of an algo to count cosecutive duplicate values in a one dimensional matrix. e.g A=[1 1 1 2 2 3 3 3 3 4 4 4 4 4 5 6 6] should give me another matrix with the exact count of these repeated values e.g B=[3 2 4 5 1 2] . If u kindly send me the code at my email address i.e ghazanfar.ali@live.com I shall be very obliged. Thanks and Best Regards. Ghazanfar Ali. Pakistan

Hi Ghazanfar-

Please look at the help for the function hist (and perhaps histc). They will do what you want for these integer values.

–Loren

Using ismember is a novel technique and reading through this discussion has been very informative. I’m wondering if someone here has had to deal with tolerances in there matching? What I mean to say is:

A = [ 5.1 9.5 20 31];
B = [ 5 10 15 20 25 30];

My of finding of near and exact matches:
i1=[];
i2=[];
for l1=1:numel(B)
for l2=1:numel(A)
if (B(l1)=A(l2)-2.5)
i1 = [i1;l1];
i2 = [i2;l2];
end
end
end

This works fine for small arrays but starts taking LONG time when A and B are considerably large, which they are in my real-world case.

Anyone have a suggestion as to vectorize or speed this up?

Thanks

Dpath2o-

You can try bsxfun with an anonymous function (@(x,y) x-y+2.5) and turn one of the arrays into a column. You will end up with a (possibly large) matrix with 1s where the corresponding row and column have the relationship you want. With floating point though, checking for equality is not likely to get you results you really want. See my little example here. You probably want to build in some tolerance.

A = 1:3
B = [1.1 3.5 6]
f = @(x,y) (x-y == 2.5)
bsxfun(f,A',B)
ans =
     0     0     0
     0     0     0
     0     0     0

Do you really mean to compare all elements of A with all elements of B?

–Loren

Loren,

Thanks for help. Though this approach constructs a length(A)xlength(B(1,:)) matrix and that is not what the job requires.

What I have is a time array of matlab datenums. Sometimes there are gaps in this time array. The gaps are not filled with nan’s or zeros — i.e. the gap is merely missing.

So what I need to construct is a complete array of times and then find then find the times that match from the in-situ data. Using the indexes of the found times I will put the in-situ results back into a large matrix that is filled with nan’s so that gaps in time will be represented by NaN.

This is my hurdle.

Cheers,
Dan

From the above example here is my solution that works well but doesn’t allow for near misses — i.e. times that are close but not exact:

x=[5.1 12.23 0.567 12.01 0.555;
      9.5  12.03 0.578 12.11 0.595;
      20   12.06 0.588 12.12 0.596;
      31   12.09  0.591 12.20 0.601]; 
A = x(:,1);
B = 5:5:30;
tmp=ismember(A,B);
X=repmat(nan,length(B),length(A(1,:)));
for l1=1:length(tmp)
X(tmp(l1),:) = A(l1,2:end);
end

Not that pretty and it doesn’t allow for near misses, but I can’t think of a different approach.

Any other thoughtful clues?

Thanks!

Cheers,
Dan

Hi Loren,
Can a ‘find’ function be used with a string dataset? Actually I want to get the index/location when the strings from different rows does not match.
Rite now the result is not what I expected.
Thanx for your response.

%% 
% Read from xls and put in dataset
ds = dataset('xlsfile', 'iris.xls');
[nobs,nvars] = size(ds);
%% 
% Sort based on 1st column
span = 5;
for i = 1 : nvars-1
    sample = ds(1:nobs,[i nvars]);
    answer = sortrows(sample);
    disp(answer); 
    m = 1;
    cutpoint = 0;
    while (m < nobs)  
        m = m  + span;
            obs1 = answer{m , 2};
            obs2 = answer{m + 1 , 2};
            tf=strcmpi(obs1,obs2);
            if (tf == 0)
                cutpoint = 1;
                span=5;
                %when str not same want to return the index
                [row, col, v] = find(answer{m,2});
                disp(row);
                disp(col);
                disp(v);
            elseif (tf == 1)
                cutpoint = 0;
                span=0;
            end
            m = m + span + 1;
    end
end

Hi Loren,
Hope you can help me regarding string and dataset problem as mentioned in query dated 2nd July.I am doing an algorithm for data mining from iris data. The idea is whenever there is changing class e.g iris-setosa vs. iris-versicolor, I want to display which row is it.
Really hope for your guidance.
Thanx a lot.

Yasmin, MATLAB does not ship with the file iris.xls, so it’s pretty hard to guess what you are trying to do. My best guess is that you’re using Fisher’s iris data as your example, but since there’s only one string variable in those data, I don’t follow why you’d need all that code when you could refer to the species name as, for example

>> ds = dataset({meas,’sepalLen’,'sepalWidth,’petalLen’,'petalWidth”,},species);
>> ds.species
ans =
‘setosa’
‘setosa’
‘setosa’
‘setosa’
[snip]

and do whatever you need to do on that one variable.

Hi Loren,
how can i compare the rows of two matrixes? My problem is to find the indexes of egual rows of the two corresponding matrixes.
Suppose that the two matrixes are XYZ and xyz I tried something like this:

for i = 1:N
ii = find(XYZ(:,1) == xyz(i,1) & XYZ(:,2) == xyz(i,2) & XYZ(:,3) == xyz(i,3));
end

Can I improve my code in a faster way?

Thank you a lot
Vince

Hi,
Is there any faster way to execute the piece of code below?

r=rand(4,1);
V=[ .1,.5,.7,1
.3,.5,.7,1
.4,.5,.8,1
.1,.5,.9,1];
for j=1:4
loc(j)=find(r(j)<V(j,:),1,’first’);
end

Patrick-

You’ll have to check the timing, but using ismember appropriately might be faster.

–Loren

Hi Loren:
Thanks for the reply. I do not understand where I could use ismember in this problem as you are suggesting.

There is an INEQUALITY sign in the find statement, and so the problem is not to find matches. The find call in the loop is

loc(j)=find(r(j)<V(j,:),1,’first’);

- Patrick

Hello Loren,
I am trying to array indexing to create an array from a row vector,say A=1:5, when I wanted to generate 10 rows array by using B=(1:10,:)=A; it generated error; Is there better way not to use ‘for loop’ like below to generate the array from a vector.

a = 1:5; % create a vector
for n=1:10
    b=(n,:)=a;
end

Patrick-

ismember might not be faster, but you can use it on your logical expression to look for the true values perhaps.

–Loren

The following code finds the specified value in A and gives its location,

A = [11 22 34 56 89
23 44 11 20 66
79 54 32 17 89
11 66 21 45 90];


value = 22; %value to be searched

[row,col]=find(ismember(A,[value]))

hi, I have an vector like x=[0.2000 0.2000 0.7375 1.0000 1.8125 2.0000 2.8875 3.0000 3.9625 4.0000 4.5000 4.5000]

unique does not stop repition of 0.2000 in this case.But in another similar case of 0.2000, it stops repetition.What is the solution

Mahbub-

Welcome to the world of floating point arithmetic. The numbers you are printing out are not the full picture. If you use format long, you would likely see differences between the first 2 entities in your vector. == is only true if the numbers are non NaN, and are bitwise exactly equal. It’s much better to allow a tolerance when comparing floating point numbers. If you look through the blog for items with category numerical accuracy, you should see several relevant posts.

–Loren

Ok read thru this several times as it seems appropriate to my problem, however ismember does not take into account the duplicates and I need these to be true also.

I have A= 80,000×2 i.e.

[275.0919,-13.6656;275.1045,-13.8508...... ]

and B= 10,000×2 i.e.

[275.1045,-13.8508;.... ]

and need to find every occurance where the value in B occurs in A prefereable with the index for both included. My task is then to compare other values relating to these indexes in other arrays. These are actually coordinates and I need the ones that match in both, but there could always be a slight difference maybe the last decimal that would still need to be accepted. This sounds simple to do in my head and I am not a MATLAB expert by any means so any help in simple English would help. Thanks.

Slart-

Exact equality for floating point numbers is very iffy. You will need to build in a tolerance test.

Could you use unique looking for rows somehow with 3 outputs A and B?

–Loren

Hmm note sure how unique could help as I need every occurance not just he first and last. I really need to be able to find where in A the corresponding two values from B occur. I.E. matching both the first column and the second for an exact coordinate match.

Could you elaborate on how you think unique could help?

Ahh well there could be multiple matches for column 1 but then the column 2 would probably be different there should only be one match where both 1 and 2 match the same for both which would indicate an exact coordinate match. If that makes sense.

These postings are the author's and don't necessarily represent the opinions of MathWorks.