Speeding Up MATLAB Applications25

Posted by Loren Shure,

Today I’d like to introduce a guest blogger, Sarah Wait Zaranek, who is an application engineer here at The MathWorks. Today she talks about speeding up code from a customer to get acceptable performance.

Introduction

A customer ran into slow performance issues with her code in MATLAB. She saw such slow performance in that she decided to recode her algorithm in another language. I wanted to show her some simple techniques in MATLAB that could bring her code down to a more reasonable running time.

I enlisted the help of my fellow application engineers to suggest possible changes. Many of them chimed in with some really great suggestions. I have integrated most of our thoughts into a few code versions to share with you. These versions illustrate our thought processes as we worked through the code collaboratively to improve its speed.

In this post, I focus on

• preallocation
• vectorization
• elimination of repeated calculations

What Does This Code Do?

For security reasons, the user couldn't give me her exact code. So the code we will use in this blog serves as an approximation of the code she is using.

The code generates locations on a 2D grid with dimensions nx1 by nx2. These grid locations serve as values for initial and final positions used by the main part of the calculation. The code iterates through all possible combinations of these initial and final positions.

Given an initial and final position, the code calculates the quantity exponent . If exponent is less than a particular threshold value, gausThresh, exponent is then used to calculate out. subs contains the indices corresponding to the initial and final positions used to calculate out.

For some of you, this probably is better understood just by looking at the code itself.

The Initial Code

% Initialize the grid and initial and final points
nx1 = 10; nx2 =  10;
x1l =  0; x1u = 100;
x2l =  0; x2u = 100;

x1 = linspace(x1l,x1u,nx1+1);
x2 = linspace(x2l,x2u,nx2+1);

limsf1 = 1:nx1+1; limsf2 = 1:nx2+1;

% Initalize other variables
t = 1;
sigmax1 = 0.5; sigmax2 = 1;
sigma = t * [sigmax1^2 0; 0 sigmax2^2];
invSig = inv(sigma);
detSig = det(sigma);

expF = [1 0; 0 1];
n = size (expF, 1);
gausThresh = 10;

small = 0; subs = []; vals = [];

% Iterate through all possible initial
% and final positions and calculate
% the values of exponent and out
% if exponent > gausThresh.

for i1 = 1:nx1+1
for i2 = 1:nx2+1
for f1 = limsf1
for f2 = limsf2

% Initial and final position
xi = [x1(i1) x2(i2)]';
xf = [x1(f1) x2(f2)]';

exponent = 0.5 * (xf - expF * xi)'...
* invSig * (xf - expF * xi);

if exponent > gausThresh
small = small + 1;
else
out = 1 / (sqrt((2 * pi)^n * detSig))...
* exp(-exponent);
subs = [subs; i1 i2 f1 f2];
vals = [vals; out];
end

end
end
end
end

Here is a graphical representation of the results for nx1=nx2=100. Because the data is quite dense, I have magnified a portion of the total graph. The red lines connect the initial and final points where exponent < gausThresh and the thickness of the line reflects the value of vals.

open('results.fig');

Here are the runtimes on my computer, a T60 Lenovo dual-core laptop with multithreading enabled.

displayRunTimes(1)
    nx1      nx2      time
50       50       296 seconds
100      100      9826 seconds


Listen to M-Lint

The first and easiest step is to listen to the suggestions from M-Lint, the static code analyzer that ships with core MATLAB. You can either access it via the editor or command line. In this case, I use the command line and define a simple function displayMlint so that the display is compact.

output = mlint('initial.m');
displayMlint(output);
'subs' might be growing inside a
loop. Consider preallocating
for speed.

'vals' might be growing inside a
loop. Consider preallocating
for speed.



Following these suggestions, I know I need to preallocate some of the arrays.

Why?

MATLAB needs contiguous blocks of memory for each array. Therefore, by repeatedly resizing the array, I keep asking MATLAB to look for a new contiguous blocks of memory and move the old array elements into that new block. If I preallocate the maximum amount of space that would be required by the array, I do not have to suffer the additional cost to do this searching and moving.

However, since I am unsure of the final size of subs and vars, I don't know what size I should preallocate. I could make them the maximum possible size. However for the case of nx1 = nx2 = 100, that requires a vector of 100^4 elements, which gets me into trouble!

try
zeros(1,100^4)
catch ME
end
display(ME.message)
Out of memory. Type HELP MEMORY for your options.


Therefore, I need to block allocate the variable.

I do this by first preallocating a reasonable size (that doesn't create an out of memory error) for subs and vars and then increase that allocation by a reasonable amount, chunk by chunk, if subs and vars go beyond their preallocated sizes.

This way, I only suffer the cost of resizing the array a few times instead of every time a new element is added.

I have also introduced a function, initialize, to do the variable initializations so that you don't have to repeatedly look at the part of the code that does not change.

The Code with Preallocation

% Initialization
nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

% Initial guess for preallocation
mm = min((nx1+1)^2*(nx2+1)^2, 10^6);
subs = zeros(mm,4);
vals = zeros(mm,1);

counter = 0;

% Iterate through all possible initial
% and final positions
for i1 = 1:nx1+1
for i2 = 1:nx2+1
for f1 = limsf1
for f2 = limsf2

xi = [x1(i1) x2(i2)]'; %% Initial position
xf = [x1(f1) x2(f2)]'; %% Final position

exponent = 0.5 * (xf - expF * xi)'...
* invSig * (xf - expF * xi);

% Increase preallocation if necessary
if counter == length(vals)
subs = [subs; zeros(mm, 4)];
vals = [vals; zeros(mm, 1)];
end

if exponent > gausThresh
small = small + 1;
else
% Counter introduced
counter=counter + 1;
out = 1 / (sqrt((2 * pi)^n * detSig))...
* exp(-exponent);
subs(counter,:) = [i1 i2 f1 f2];
vals(counter) = out;
end

end
end
end
end

% Remove zero components that came from preallocation
vals = vals(vals > 0);
subs = subs(vals > 0);

The preallocation makes for slightly faster code.

displayRunTimes(2)
    nx1      nx2      time
50       50       267 seconds
100      100      4228 seconds


However, this isn't exactly the speedup I was looking for. So, let's try to tackle the code using vectorization.

Vectorization

Because MATLAB is a array-based language, code written in terms of matrix and array operations is often faster than code which relies on for loops, especially in the case of nested for loops. In vectorizing, matrix and vector operations substitute for the equivalent for loop calculations.

There are several possibilities for vectorizing this code. I am going to explore two in this blog entry.

• Vectorize the Inner Two loops
• Vectorize the Inner Three loops

The Code with Preallocation and Vectorization of the Inner Two Loops

Try 1: Vectorize the Inner Two loops

% Initialization

nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

vals = cell(nx1+1,nx2+1); % Cell preallocation
subs = cell(nx1+1,nx2+1); % Cell preallocation

[xind,yind] = meshgrid(limsf1,limsf2);
xyindices = [xind(:)' ; yind(:)'];

[x,y] = meshgrid(x1(limsf1),x2(limsf2));
xyfinal = [x(:)' ; y(:)'];

exptotal = zeros(length(xyfinal),1);

% Loop over all possible combinations of positions
for i1 = 1:nx1+1
for i2 = 1:nx2+1

xyinitial = repmat([x1(i1);x2(i2)],1,length(xyfinal));

expa = 0.5 * (xyfinal - expF * xyinitial);
expb = invSig * (xyfinal - expF * xyinitial);
exptotal(:,1) = expa(1,:).*expb(1,:)+expa(2,:).*expb(2,:);

index = find(exptotal < gausThresh);
expreduced = exptotal(exptotal < gausThresh);

out = 1 / (sqrt((2 * pi)^n * detSig)) * exp(-(expreduced));
vals{i1,i2} = out;
subs{i1,i2} = [i1*ones(1,length(index)) ; ...
i2*ones(1,length(index)); xyindices(1,index); ...
xyindices(2,index)]' ;

end
end

% Reshape and convert output so it is in a
% simple matrix format
vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

Wow! With just this little bit of vectorization, the run time decreased tremendously.

displayRunTimes(3)
    nx1      nx2      time
50       50       1.51 seconds
100      100      19.28 seconds


Let's go over the main techniques I've illustrated here.

• Use of meshgrid and repmat to create matrices
• Use of vector and matrix operations
• Use of element-wise operations when appropriate

meshgrid produces two vectors of the x and y positions for all possible combinations of the x and y input vectors. In other words, given the vectors x and y, meshgrid gives all possible positions in a mesh or grid defined on a side by those x and y input vectors. The n-dimensional version of meshgrid is ndgrid, which I use it in the next example.

The function repmat replicates and tiles arrays.

Generally I use matrices as part of the vectorization, so whenever possible I use vector and matrix operations.

However, in one segment of the code it is actually faster to explicitly write out a vector-vector multiply element by element. This is somewhat unintuitive, so let's discuss this a bit more in detail.

To calculate exponent in the original code, there is a series of vector and matrix multiplies with the original and final positions, each represented by two-element vectors. The values of the initial and final position are then changed every step through the loop.

When the code is vectorized, all the initial positions and final positions are now contained in two matrices xyinitial and xyfinal. Therefore, to calculate the value of exponent, only the diagonal terms of the multiplication of expa and expb are relevant. However, in this case, taking the diagonal of a matrix multiply turns out to be a rather expensive operation. Instead I choose to write it as an element-by-element operation.

I have only shown you a few functions that are useful for vectorization in this post. Here is a more extensive list of useful functions for vectorization.

The Code with Preallocation and Vectorization of the Inner Three Loops

Try 2: Vectorize the Inner Three Loops

% Initialization
nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

limsi1 = limsf1;
limsi2 = limsf2;

% ndgrid gives a matrix of all the possible combinations
[aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2);
[a,b,c] = ndgrid(x2,x1,x2);

vals = cell(nx1+1,nx2+1);  % Cell preallocation
subs = cell(nx1+1,nx2+1);  % Cell preallocation

% Convert grids to single vector to use in a single loop
b = b(:); aind = aind(:); bind = bind(:); cind = cind(:);

expac = a(:)-c(:); % Calculate x2-x1

% Iterate through initial x1 positions (i1)
for i1 = limsi1

exbx1= b-x1(i1);
expaux = invSig(2)*exbx1.*expac;
exponent = 0.5*(invSig(1)*exbx1.*exbx1+expaux);

index = find(exponent < gausThresh);
expreduced = exponent(exponent < gausThresh);

vals{i1} = 1 / (sqrt((2 * pi)^n * detSig))...
.*exp(-expreduced);

subs{i1} = [i1*ones(1,length(index));
aind(index)' ; bind(index)';...
cind(index)']';

end

vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

Now the code runs in even less time.

displayRunTimes(4)
    nx1      nx2      time
50       50       0.658 seconds
100      100      8.77 seconds


Why don't I try to vectorize all the loops?

I certainly could. In this case, I didn't see much of a benefit with total vectorization, and the code became a bit harder to read. So, I am not going to show the fully vectorized code. Considering performance and readability of the code, try 2 seems to be the best solution.

Once the code is vectorized, you can easily see that I repeatedly calculate values that only need to be calculated once. Also, assuming some of the vectors keep the same form (such as sigma), other calculations (e.g., InvSig) are unnecessary.

Some of these changes are quite easy to do, and some are a bit more advanced, but I wanted to finish this entry with the fastest version of the code from the application engineering participants.

The Code with Preallocation, Vectorization and Only Calculating Once

nx1 =  100; nx2 = 100;
x1l =   0; x1u = 100;
x2l =   0; x2u = 100;

x1 = linspace(x1l,x1u,nx1+1);
x2 = linspace(x2l,x2u,nx2+1);

limsi1 = 1:nx1+1;
limsi2 = 1:nx2+1;

limsf1 = 1:nx1+1;
limsf2 = 1:nx2+1;

t = 1;

sigmax1 = 0.5;
sigmax2 = 1;

sigma = t * [sigmax1^2 sigmax2^2];
detSig = sigma(1)*sigma(2);
invSig = [1/sigma(1) 1/sigma(2)];

gausThresh = 10;
n=3;
const=1 / (sqrt((2 * pi)^n * detSig));

% ndgrid gives a matrix of all the possible combinations
% of position, except limsi1 which we iterate over

[aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2);
[a,b,c] = ndgrid(x2,x1,x2);

vals = cell(nx1+1,nx2+1);  % Cell preallocation
subs = cell(nx1+1,nx2+1);  % Cell preallocation

% Convert grids to single vector to
% use in a single for-loop
b = b(:);
aind = aind(:);
bind = bind(:);
cind = cind(:);

expac= a(:)-c(:);
expaux = invSig(2)*expac.*expac;

% Iterate through initial x1 positions

for i1 = limsi1

expbx1= b-x1(i1);
exponent = 0.5*(invSig(1)*expbx1.*expbx1+expaux);

% Find indices where exponent < gausThresh
index = find(exponent < gausThresh);

% Find and keep values where exp < gausThresh

expreduced = exponent(exponent < gausThresh);

vals{i1} = const.*exp(-expreduced);

subs{i1} = [i1*ones(1,length(index));
aind(index)' ; bind(index)';...
cind(index)']';
end

vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

With this, here are the magical times.

displayRunTimes(5)
    nx1      nx2      time
50       50       0.568 seconds
100      100      8.36 seconds


Pretty good speedup. In fact, you see the run-time reduced to about 2% of the original times.

In this post, I focused on preallocation, vectorization, and only doing calculations a single time. However, there are also additional techniques to help you speed up your MATLAB code such as using the profiler to diagnose bottlenecks, using MEX-files, using bsxfun, and enabling multithreading. The documentation for MATLAB has a very helpful section entitled Performance. I encourage you to go there to read more about these and other relevant topics.

Tell me about some of the ways you speed up MATLAB applications or feel free to post a example of a technique I didn't illustrate here.

Get the MATLAB code

Published with MATLAB® 7.6

Note

Jared Jensen replied on : 1 of 25

In grad school I was working on a GSM receiver in Matlab. I found some source online that had been open-sourced that used a Viterbi Equalizer. It worked well, but was extremely slow. (Especially forming the channel estimate.) It looked like the original author was very prone to Computer-Science-Like FOR loops and lots of IF logic. So I took some time to rewrite the code using vectorization, meshgrid and linear indexing. You mentioned the first two, but I have found big speedups in using linear indexed arrays and matrices for my operations. (Mainly because I don’t have to pick elements out with loop variables and such.) The original was about 5*real time. (5 seconds to demod 1 second of signal.) The final rewrite with a .mex file for the actual Viterbi was right at 1/10 of real time. Here’s an example from the code using some meshgrid, vectorization, linear indexing and reshaping matrices for easy multiplies. This was fast enough that I could calculate a channel estimate for all 8 of the GSM training sequences at once. (I used that mainly for debugging purposes.) Meshgrid was the hottest line in the profiler, and was still down in the noise.

Original : (Short, sweet and slow…)

% ESTIMATE CHANNEL IMPULSE RESPONSE USING ONLY EVERY
% OSR’th SAMPLE IN THE RECEIVED SIGNAL
%

for n = 1:length(chan_est),
chan_est(n)=r_sub(n:OSR:n+(length(T16)-1)*OSR)*T16.’;
end

Rewrite : (Longer, a little harder to read, 1000xfaster)

% NO NO NO NO NO NO NO NO NO NO NO NO
% % % % % % for n = 1:length(chan_est),
% % % % % % chan_est(n)=r_sub(n:OSR:n+15*OSR)*TRAINING(:,3);
% % % % % % end

% Make my index matrices.
% X increments by OSR up to length(chan_est)*4 and then repeats that row to
% make the matrix square.
%
% Y is a column incrementing from 1 to the length of the X matrix row by 1.
% Then that column is repeated to make the matrix square.
[ X , Y ] = meshgrid([1:OSR:length(chan_est)*OSR],[1:length([1:OSR:length(chan_est)*OSR])]);

% Add them together, and subtract 1. This gives us rows of lags for the
% signal sample points. We only want 16 sample points, so we kill off
% everything after the 16th lag of each row and transpose. (Since the
% indices are all real, we can use the ‘ conjugate transpose without
% ill-effect, but to be disciplined we won’t.)
ind = X+Y-1;
ind = ind(:,1:16).’;

% Here we use linear indexing into the r_sub signal to create a long vector
% of lagged samples of r_sub. Every 16 samples of the new r_sub is going
% to be a row in our matrix. We reshape it into a matrix of 16 rows by
% length(chan_est) columns and then non-conjugate transpose to get the
% correct shape. (You have to do it this way due to the way reshape fills
% in values. Column-wise.)
r_sub = r_sub(ind);
r_sub = reshape(r_sub, 16, length(chan_est)).’;

% Now we simply do a matrix multiply and we create chan_est in one
% operation. In fact, we generate a chan_est for all 8 training sequences
% in one multiply.

chan_est = (r_sub*TRAINING).’;

Love the blog. Thanks for the great post.

Dan K replied on : 2 of 25

An excellent blog. There is an area of acceleration which you didn’t cover which I would very much like to get some clarification on. What is the fastest way to extract a subset of an array? For example, if I am calculating a power spectral density I only need the first half of the results, is there any way to know which of the following will be fastest?

nPts = 1e5;
half = nPts/2;
A = rand(1e5,1);
B = fft(A); % I know I can do lots of things to accelerate this

B=B(1:half); % method 1
B(half+1:end)=[]; %method 2
C = B(1:half); %method 3

Also, how does this change when I’m dealing with more than 1 dimension? Is there a relative speed difference between linear indexing and : notation? Does using “end” slow me down vs explicitly determining the number of rows?

I know this seems like a trivial area to worry about speed, but when you are dealing with lots of these operations, they can take up a lot of my execution time.

Thanks,
Dan

Jacob replied on : 3 of 25

Thanks for all the helpful tips. In particular, I have not seen bsxfun before, and I’ve wanted that functionality.

I have a question, though. I certainly try to preallocate and vectorize in my own code, but I was under the impression that MATLAB had radically sped up for-loops a few versions ago, as long as the data types and operations within the loop were part of an “accelerated” subset. This wouldn’t, of course, change the need to preallocate, but the operations in this example all seem to be things that should be accelerated. So why is your first try (with preallocation and no vectorization) still so slow? Would it have been that much slower in an old version of MATLAB, before acceleration?

Loren replied on : 4 of 25

Dan-

Good questions about indexing. I fixed your example to be more fair, I believe. Here’s the code and the experiment I ran.

First the code:

%% Extracting Parts of an Array
nPts = 1e7;
half = nPts/2;
A = rand(nPts,1);
B = fft(A);
Borig = B;
% make sure B and Borig occupy different memory locations
% so avoid copy on write by writing it out first!
Borig(1) = Borig(1);
%%
% be sure to reset B to the original size before each timing
t1 = tic;
B=B(1:half); % method 1
toc(t1)
B = Borig;
t2 = tic;
B(half+1:end)=[]; %method 2
toc(t2)
B = Borig;
t3 = tic;
C = B(1:half); %method 3
toc(t3)


And the results – which I got by running the first cell once and the second cell 3 times:

Elapsed time is 0.148372 seconds.
Elapsed time is 0.600757 seconds.
Elapsed time is 0.196193 seconds.
>>
>>
>>
Elapsed time is 0.155245 seconds.
Elapsed time is 0.550005 seconds.
Elapsed time is 0.198151 seconds.
>>
>>
>>
Elapsed time is 0.131419 seconds.
Elapsed time is 0.549656 seconds.
Elapsed time is 0.194549 seconds.


This is only 1-D and linear indexing both for this case. It seems like creating the part to keep and assigning it is faster than removing the ones you don’t want to keep. Not sure if that extends to ND. Expect me to think about this for a post of the future!

–Loren

Oliver A. Chapman, P.E. replied on : 5 of 25

This is a wonderfully deep example and I think it will take me a month to digest.

I guess my first comment is that you could have significantly improved the readability by simplifying the span of the nested for loops since for the purposes of this column, you left them fixed at all times.

i.e.,
for i1 = 1 : 11
rather than
for i1 = 1 : nx1+1

Next is a question. In your 1st attempt at preallocation, you have this in the code:

vals = vals(vals > 0);

Thanks.

Loren replied on : 6 of 25

Oliver-

vals = vals(vals > 0);


is just using logical indexing – a topic that is in the documentation here: http://www.mathworks.com/help/matlab/math/matrix-indexing.html#bq7egb6-1

and in several of my blogs on indexing: http://blogs.mathworks.com/loren/category/indexing/
particularly:

http://blogs.mathworks.com/loren/2006/08/09/essence-of-indexing/

http://blogs.mathworks.com/loren/2007/10/11/a-way-to-account-for-missing-data/

–Loren

Mark Hoemmen replied on : 7 of 25

Great article! Would love to see more performance tuning examples :-)

Dan K replied on : 8 of 25

Loren,
Good catch on the code error I made! I later caught the fact that I was reusing a variable incorrectly, but missed the copy on write part. I guess one more option to try would be logical indexing as a way of extracting a subset. Especially surpising to me was that B(half+1:end) =[]; is FASTER than B(half+1:nPts)=[]; That one I really can’t figure.

Dan

Loren replied on : 9 of 25

Dan-

I haven’t looked into the innards of MATLAB, but I think what’s happening is that the expression with nPts creates the vector with indices first and then uses it. The expression with ‘end’ has more oppportunties for optimizing since ‘end’ depends on context and doesn’t just get calculated without knowing the lay of the land. Once MATLAB has the expression in context, MATLAB might not need to create the full index vector first, but can construct the output without the intermediate vector. That guess is predicated on end and using :. If you did something else like [1 2 3 end], I don’t know if you’d see similar performance results. In any case, this is all speculation on my part.

–Loren

Yakov replied on : 10 of 25

In response to jhofman, for this application bsxfun is only slightly slower than that website describes, but uses much less memory. Additional advantage is the easier to read code.

Steve Eddins replied on : 11 of 25

Sarah Zaranek replied on : 12 of 25

Hi Jacob,

Sorry about the slow response. You are correct that the code would be slower without the JIT/accelerator.

The JIT/accelerator was first introduced in MATLAB 6.5 (Release 13). When you first run code, it is analyzed and, when possible, machine code is generated and used in place of the interpreted code. The JIT/accelerator initially sped up some calculations in loops that obeyed certain characteristics, e.g., scalar operations on double values. The JIT/accelerator has continued to evolve and handles much more of the MATLAB language, but still incurs costs the first time (or few times) through a loop so assumptions about variables being scalar, for example, can be verified. If the assumptions are ever broken, MATLAB does know to use the interpreted version of code. But if the assumptions hold, the loops can execute quickly.

Let’s look at the first case with nx1=nx2=50. When the JIT/accelerator is turned off, we get a speed around 371 seconds. Compared to the case I showed in the blog post of 296 seconds. Interestingly, you could put a tic-toc pair within the inner-most for-loop itself to see what effect the JIT has on each loop. The subsequent loops after the first loop are on average 27 times faster.

If you are working on elements of a matrix and iterating through them (such as one of our later iterations of the code), a single for-loop is for certain cases almost just as fast as the matrix operation, especially if you remember that MATLAB stores matrix columns in monotonically increasing memory locations. This means that processing data column-wise results in faster times then processing data row-wise.

We recommend not writing specifically for the JIT since the JIT is constantly evolving. You should though keep in mind the general good MATLAB-coding practices, such as those listed above.

Oktay replied on : 13 of 25

Hello,

Is there any significant difference between using:
– Vectorization inside a subfunction
– Benefiting from arrayfun, cellfun, or maybe some Mex functions outside the subfunction

For example:

I have two functions f1 and f2 which has subfunctions f1sub and f2sub respectively:

function y = f1
x = [1 2 3];
y = f1sub(x);
function z = f1sub(x)
– – do some validations on x here – –
z = x.^2;

function y = f2
x = [1 2 3];
y = arrayfun(@f2sub,x)
function z = f2sub(x)
– – do some validations on x, (x should have only a single element) – –
z = x^2;

Thank you.

Oktay

Loren replied on : 14 of 25

Oktay-

It very much depends on the details of the calculations you are doing. Vectorization can sometimes require large intermediate arrays that might not be required by using cellfun, for example. Grabbing the extra memory can cost time. On the other hand, with cellfun, etc., you have the overhead of error checking/validation for each element instead of once for the whole array. So it depends on the balance of what you are doing and what the typical sizes are. I suggest you use the profiler to try out your specific example.

–Loren

B Cook replied on : 15 of 25

I have some finite difference time domain code for solving maxwells equations – pretty basic stuff, and on a 100 by 100 by 100 grid it ran/runs very slowly the loop part of the code looks like

for t=2:1:T-2
tic
for i=2:1:gridxs-2
for j=2:1:gridys-2
for k=2:1:gridzs-2
%%%%%%%%%

dExdx=Ex(i+1,j,k)-Ex(i,j,k);
dExdy=Ex(i,j+1,k)-Ex(i,j,k);
dExdz=Ex(i,j,k+1)-Ex(i,j,k);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
dEydx=Ey(i+1,j,k)-Ey(i,j,k);
dEydy=Ey(i,j+1,k)-Ey(i,j,k);
dEydz=Ey(i,j,k+1)-Ey(i,j,k);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
dEzdx=Ez(i+1,j,k)-Ez(i,j,k);
dEzdy=Ez(i,j+1,k)-Ez(i,j,k);
dEzdz=Ez(i,j,k+1)-Ez(i,j,k);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Hx2(i,j,k)= Hx(i,j,k)-deltaH*(dEzdy-dEydz);
Hy2(i,j,k)= Hy(i,j,k)-deltaH*(dExdz-dEzdx);
Hz2(i,j,k)= Hz(i,j,k)-deltaH*(dEydx-dExdy);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
dHxdx=Hx2(i,j,k)-Hx2(i-1,j,k);
dHxdy=Hx2(i,j,k)-Hx2(i,j-1,k);
dHxdz=Hx2(i,j,k)-Hx2(i,j,k-1);                %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
dHydx=Hy2(i,j,k)-Hy2(i-1,j,k);
dHydy=Hy2(i,j,k)-Hy2(i,j-1,k);
dHydz=Hy2(i,j,k)-Hy2(i,j,k-1);                %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
dHzdx=Hz2(i,j,k)-Hz2(i-1,j,k);
dHzdy=Hz2(i,j,k)-Hz2(i,j-1,k);
dHzdz=Hz2(i,j,k)-Hz2(i,j,k-1);
end
end
end
toc
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Ex2(i,j,k)= Ex(i,j,k)+deltaE*(dHzdy-dHydz);
Ey2(i,j,k)= Ey(i,j,k)+deltaE*(dHxdz-dHzdx);
Ez2(i,j,k)= Ez(i,j,k)+deltaE*(dHydx-dHxdy);


I am sure there are bugs in the code,but it runs so slowly that I can’t be bothered waiting for it to run to find out

tic toc gives 2.8 seconds….
I tried to vectorise my code (probably added some extra bugs along the way)

it now looks like


for t=2:1:T-2
tic

dExdx(2:iiend,:,:)=Ex1(3:iiend+1,:,:)-Ex1(2:iiend,:,:);
dExdy(:,2:jjend,:)=Ex1(:,3:jjend+1,:)-Ex1(:,2:jjend,:);
dExdz(:,:,2:kkend)=Ex1(:,:,3:kkend+1)-Ex1(:,:,2:kkend);

dEydx(2:iiend,:,:)=Ey1(3:iiend+1,:,:)-Ey1(2:iiend,:,:);
dEydy(:,2:jjend,:)=Ey1(:,3:jjend+1,:)-Ey1(:,2:jjend,:);
dEydz(:,:,2:kkend)=Ey1(:,:,3:kkend+1)-Ey1(:,:,2:kkend);

dEzdx(2:iiend,:,:)=Ez1(3:iiend+1,:,:)-Ez1(2:iiend,:,:);
dEzdy(:,2:jjend,:)=Ez1(:,3:jjend+1,:)-Ez1(:,2:jjend,:);
dEzdz(:,:,2:kkend)=Ez1(:,:,3:kkend+1)-Ez1(:,:,2:kkend);

Hx2(:,:,:)= Hx1(:,:,:)-deltaH*(dEzdy(:,:,:)-dEydz(:,:,:));
Hy2(:,:,:)= Hy1(:,:,:)-deltaH*(dExdz(:,:,:)-dEzdx(:,:,:));
Hz2(:,:,:)= Hz1(:,:,:)-deltaH*(dEydx(:,:,:)-dExdy(:,:,:));

dHxdx(2:iiend,:,:)=Hx1(3:iiend+1,:,:)-Hx1(2:iiend,:,:);
dHxdy(:,2:jjend,:)=Hx1(:,3:jjend+1,:)-Hx1(:,2:jjend,:);
dHxdz(:,:,2:kkend)=Hx1(:,:,3:kkend+1)-Hx1(:,:,2:kkend);

dHydx(2:iiend,:,:)=Hy1(3:iiend+1,:,:)-Hy1(2:iiend,:,:);
dHydy(:,2:jjend,:)=Hy1(:,3:jjend+1,:)-Hy1(:,2:jjend,:);
dHydz(:,:,2:kkend)=Hy1(:,:,3:kkend+1)-Hy1(:,:,2:kkend);

dHzdx(2:iiend,:,:)=Hz1(3:iiend+1,:,:)-Hz1(2:iiend,:,:);
dHzdy(:,2:jjend,:)=Hz1(:,3:jjend+1,:)-Hz1(:,2:jjend,:);
dHzdz(:,:,2:kkend)=Hz1(:,:,3:kkend+1)-Hz1(:,:,2:kkend);

Ex2(:,:,:)= (Ex1(:,:,:))+deltaE*(dHzdy(:,:,:)-dHydz(:,:,:));
Ey2(:,:,:)= (Ey1(:,:,:))+deltaE*(dHxdz(:,:,:)-dHzdx(:,:,:));
Ez2(:,:,:)= (Ez1(:,:,:))+deltaE*(dHydx(:,:,:)-dHxdy(:,:,:));

toc
end


result of tic toc– exactly the same 2.5 secs ???

I found that matlab has a numerical diff and numerical curl, in a speed test my code was at least twice as fast (though looking at the code clearly not as versatile)…

Eventually there was nothing left to do except make a mex file. Now it runs without hick-up in about 0.2 secs or so.

Why didn’t vecorisation work for me?

What can I have done wrong? I declared all my variables before hand with code such as

Ex(gridxs,gridzs,gridys)=0;

and analysed everything with profile

My only guess is memory problems, but as these are predeclared and allocated why should that matter?

-Ben

B Cook replied on : 16 of 25

Ooops in my previous post the below code should be within the inner loop like all the rest of it

           %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Ex2(i,j,k)= Ex(i,j,k)+deltaE*(dHzdy-dHydz);
Ey2(i,j,k)= Ey(i,j,k)+deltaE*(dHxdz-dHzdx);
Ez2(i,j,k)= Ez(i,j,k)+deltaE*(dHydx-dHxdy);


Sarah Z replied on : 17 of 25

Ben —

So, I played around with your code a bit. Running your initial code on my laptop, it takes ~1.06 seconds. If I alter your vectorized code by changing all cases where you indexing into the whole matrix into just using the matrix itself:

% old code
% Ex2(:,:,:)=(Ex1(:,:,:))+deltaE*(dHzdy(:,:,:)-dHydz(:,:,:));
%
% new code
Ex2 = Ex1+deltaE*(dHzdy -dHydz);


Your vectorized code runs in ~0.78 seconds.

This is my first shot at vectorization:



gridxs = 100;
gridys = 100;
gridzs = 100;
deltaH = 0.1;
deltaE = 0.1;

%%  Vectorized Version
RandStream.setDefaultStream(RandStream('mt19937ar','seed',1));

Hx = rand(gridxs,gridys,gridzs);
Hy = rand(gridxs,gridys,gridzs);
Hz = rand(gridxs,gridys,gridzs);

Ex = rand(gridxs,gridys,gridzs);
Ey = rand(gridxs,gridys,gridzs);
Ez = rand(gridxs,gridys,gridzs);

Hx22 = zeros(gridxs,gridys,gridzs);
Hy22 = zeros(gridxs,gridys,gridzs);
Hz22 = zeros(gridxs,gridys,gridzs);

Ex22 = zeros(gridxs,gridys,gridzs);
Ey22 = zeros(gridxs,gridys,gridzs);
Ez22 = zeros(gridxs,gridys,gridzs);

tic

% dExdx2 = diff(Ex(2:end-1,2:end-2,2:end-2),1,1);
dExdy2 = diff(Ex(2:end-2,2:end-1,2:end-2),1,2);
dExdz2 = diff(Ex(2:end-2,2:end-2,2:end-1),1,3);

dEydx2 = diff(Ey(2:end-1,2:end-2,2:end-2),1,1);
% dEydy2 = diff(Ey(2:end-2,2:end-1,2:end-2),1,2);
dEydz2 = diff(Ey(2:end-2,2:end-2,2:end-1),1,3);

dEzdx2 = diff(Ez(2:end-1,2:end-2,2:end-2),1,1);
dEzdy2 = diff(Ez(2:end-2,2:end-1,2:end-2),1,2);
% dEzdz2 = diff(Ez(2:end-2,2:end-2,2:end-1),1,3);

Hx22(2:end-2,2:end-2,2:end-2)= Hx(2:end-2,2:end-2,2:end-2)-deltaH*(dEzdy2-dEydz2);
Hy22(2:end-2,2:end-2,2:end-2)= Hy(2:end-2,2:end-2,2:end-2)-deltaH*(dExdz2-dEzdx2);
Hz22(2:end-2,2:end-2,2:end-2)= Hz(2:end-2,2:end-2,2:end-2)-deltaH*(dEydx2-dExdy2);

% dHxdx2 = diff(Hx22(1:end-2,2:end-2,2:end-2),1,1);
dHxdy2 = diff(Hx22(2:end-2,1:end-2,2:end-2),1,2);
dHxdz2 = diff(Hx22(2:end-2,2:end-2,1:end-2),1,3);

dHydx2 = diff(Hy22(1:end-2,2:end-2,2:end-2),1,1);
% dHydy2 = diff(Hy22(2:end-2,1:end-2,2:end-2),1,2);
dHydz2 = diff(Hy22(2:end-2,2:end-2,1:end-2),1,3);

dHzdx2 = diff(Hz22(1:end-2,2:end-2,2:end-2),1,1);
dHzdy2 = diff(Hz22(2:end-2,1:end-2,2:end-2),1,2);
% dHzdz2 = diff(Hz22(2:end-2,2:end-2,1:end-2),1,3);

Ex22(2:end-2,2:end-2,2:end-2)= Ex(2:end-2,2:end-2,2:end-2)+deltaE*(dHzdy2-dHydz2);
Ey22(2:end-2,2:end-2,2:end-2)= Ey(2:end-2,2:end-2,2:end-2)+deltaE*(dHxdz2-dHzdx2);
Ez22(2:end-2,2:end-2,2:end-2)= Ez(2:end-2,2:end-2,2:end-2)+deltaE*(dHydx2-dHxdy2);



My code takes ~0.44 seconds to run. With more tweaking I could perhaps get it slower, but vectorization does improve the performance considerably. I also compare the answer from your code to mine and they give the same outputs. I used diff because it made for cleaner code.

Cheers,
Sarah

B Cook replied on : 18 of 25

OK that’s interesting the changes don’t look like much at the moment, I will have to compare side by side to really see the difference I think. My first trial makes me think your laptop runs a bit faster than mine! I made the first change you suggested and it improved to around 1.8s, so not bad I think there shouldn’t be a big problem to improve on this now.

Thanks for the suggestions!

Robert Crampton Ph.D. replied on : 19 of 25

I have some code that I inherited at my job that works fine
but gave me the warning:

The variable ‘data’ is growing inside a loop preallocate for speed.

the code is simple and is for getting spectra from a spectrometer output format into Matlab for processing:

the code is included below. Since I am doing many directories with 8000 or more spectra this week I noticed that it slowed down as it progressed. I had never had a speed problem before so I added lines 31-35 to preallocate.
It got considerably slower so I added % before each line to go back to the original and it sped back up. Did I preallocate wrong?



%This function reads in data from files in the Cerex spectrometer format.
%It puts the data and wavelength scale into numeric matrices and saves the
%headers in a cell to re-append later.
%
%where dir can either be the Matlab dir function, or can be a variable
%created using that function (e.g., x=dir;).
%Outputs:
%data = spectral data arranged columnwise
%nm = wavelength vector

%**********************************************************AC 8/2005

data=[];

filenames={directory.name};
[index,j]=listdlg('Name','Select the files to be read in.','ListString',filenames);
drawnow;

%NEED TO PREALLOCATE BEFORE LOOP USE ZEROS COMMAND TEST??
%GET number of spec
specnum=index(end)-index(1);
%GET  number of pix always 2048 for this instrument
pix=2048;

%preallocate data
%data = zeros(pix,specnum);
%START LOOP
for i=index(1):index(end);
e=i
fid=fopen(filenames{i});
%changed delimiter for OOI
[stuff]=textscan(fid,'%s %s',20,'delimiter',':');
%changed delimiter for spectral data after line 20
[stuff2]=textscan(fid,'%s %s',2048);
fclose(fid);

%save headers % changed 11 to 19 for OOI

%create data matrix
% changed 12 to 20 for OOI
singledata=strvcat(stuff2{2}(1:end));%creates array of strings from cell
spectrum=str2num(singledata);%converts strings to numbers
data=[data spectrum];

end;

%save wavelength vector
nmcell=stuff2{1,1}(1:end);
nmcat=strvcat(nmcell);
nm=str2num(nmcat);


Loren replied on : 20 of 25

Robert-

You preallocated data, but you didn’t take advantage of it. So you created a large array, and then had it grow even larger when you did

data=[data spectrum];

the last line in the for-loop. Since data started large after preallocation, you were asking for successively larger amounts, vs. starting small.

You should replace that final line with something that indexes into data and replaces the zeros with the right number of values. To do this, you will need to know each time through the loop how many elements you are adding, and where you left off.

data(idx+(1:numThispiece)) = spectrum;

–Loren

Robert Crampton Ph.D. replied on : 21 of 25

Thank you!

Jiang replied on : 22 of 25

Loren,
I am not very good at MATLAB and don’t know how to get the “Results.fig”. Would you please give the commands of plotting the figure “Results.fig”?
Thank you.

Sarah Wait Zaranek replied on : 23 of 25

Hi Jiang.

I quickly wrote this visualization so we could have something fun to look at! Here is the code below. It probably could be more efficent.



function  plotVals( x1,x2,subs,vals )

width = 1+ 1* vals/max(vals);

for i=1:length(vals)

if x1(subs(i,1))<=10 && x1(subs(i,3))<=10 && ...
x2(subs(i,2))<=10 && x2(subs(i,4))<=10

plot([x1(subs(i,1)) x1(subs(i,3))],[x2(subs(i,2)) x2(subs(i,4))]...
,'-rs','LineWidth',width(i),...
'MarkerSize',2, 'MarkerEdgeColor','b',...
'MarkerFaceColor','b');
hold on
end
end

hXLabel=xlabel('x1 values');
hYLabel=ylabel('x2 values');
hTitle= title('Initial and Final Positions where exponent < gausThresh')

set( gca                             , 'FontName'   , 'Helvetica' );
set([hTitle, hXLabel, hYLabel]       , 'FontName'   , 'AvantGarde');
% set([hLegend, gca]                   , 'FontSize'   , 8           );
set([hXLabel, hYLabel]               , 'FontSize'   , 10          );
set( hTitle                          , 'FontSize'   , 12          , ...
'FontWeight' , 'bold'      );



Cheers,
Sarah

prasanna replied on : 24 of 25

Although I made use of vectorisation in the first code it runs slower than the second code. I would like to know if it is because of 3 dimensional matrices which i am creating in the vectorised code.

FIRST CODE:

clc;
clear all;
close all;
tic;
fs = 100;
fftl = 128;
noi = 50;
maxnof = 1000;
snr  = -5:-1:-25;
reqnof = ones(1,size(snr,2));
prevnof = 0;
h= waitbar(0);
noisepeak = zeros(1,noi);
signoipeak = zeros(1,noi);
wi = 0;
step = 1;

for k = 1:size(snr,2)
disp(snr(k));
for nos = 1:1000
nof = prevnof + step;
t = 0:1/fs:(nof*fftl*noi - 1)/fs;
signal = cos(2*pi*25*t);
noise = sqrt(0.5/(10^(snr(k)/10)))*randn(1,nof*fftl*noi);
fftnoise = abs(fft( reshape(noise,fftl,nof,noi) ));
tempfftnoise = sum(fftnoise,2)/nof;
noisepeak = fftl*max(tempfftnoise,[],1)./sum(tempfftnoise,1);
signoi = noise + signal;
fftsignoi = abs(fft( reshape(signoi,fftl,nof,noi) ));
tempfftsignoi = sum(fftsignoi,2)/nof;
signoipeak = fftl*max(tempfftsignoi,[],1)./sum(tempfftsignoi,1);

sort_signoi = sort(reshape(signoipeak,1,noi));
sort_noise = sort(reshape(noisepeak,1,noi));
if sort_signoi(noi*10/100 + 1) > sort_noise(noi*90/100 + 1)
reqnof(k) = nof;
prevnof = nof - step;
wi = 0;
break;
else
wi = wi + 1;
if wi == 1
step = step + 1;
wi = 0;
end
end
end
waitbar(k/size(snr,2),h);
end
semilogy(snr,reqnof,'o');
close(h);
save farvar;
toc,


SECOND CODE:

clc;
clear all;
close all;
tic;
fs = 100;
fftl = 128;
noi = 50;
maxnof = 1000;
snr  = -5:-1:-25;
reqnof = ones(1,size(snr,2));
prevnof = 0;
h= waitbar(0);
noisepeak = zeros(1,noi);
signoipeak = zeros(1,noi);
wi = 0;
step = 1;

for k = 1:size(snr,2)
disp(snr(k));
for nos = 1:1000
nof = prevnof + step;
t = 0:1/fs:(nof*fftl - 1)/fs;
signal = cos(2*pi*25*t);
for j = 1:noi
noise = sqrt(0.5/(10^(snr(k)/10)))*randn(1,nof*fftl);
tempfftnoise = sum((abs(fft( reshape(noise,fftl,nof) )))')/nof;
noisepeak(j) = fftl*max(tempfftnoise)/sum(tempfftnoise);
signoi = noise + signal;
tempfftsignoi = sum((abs(fft( reshape(signoi,fftl,nof) )))')/nof;
signoipeak(j) = fftl*max(tempfftsignoi)/sum(tempfftsignoi);
end

sort_signoi = sort(signoipeak);
sort_noise = sort(noisepeak);
if sort_signoi(noi*10/100 + 1) > sort_noise(noi*90/100 + 1)
reqnof(k) = nof;
prevnof = nof - step;
wi = 0;
break;
else
wi = wi + 1;
if wi == 1
step = step + 1;
wi = 0;
end
end
end
waitbar(k/size(snr,2),h);
end
semilogy(snr,reqnof,'o');
close(h);
save farvar;
toc,


Sarah replied on : 25 of 25

Hi all.

I found a small mistake in the final version, n should be set to 2 instead of 3. I apologize for the mistake.

Cheers,
Sarah