# Speeding Up MATLAB Applications

Today I’d like to introduce a guest blogger, Sarah Wait Zaranek, who is an application engineer here at The MathWorks. Today she talks about speeding up code from a customer to get acceptable performance.

### Introduction

A customer ran into slow performance issues with her code in MATLAB. She saw such slow performance in that she decided to recode her algorithm in another language. I wanted to show her some simple techniques in MATLAB that could bring her code down to a more reasonable running time.

I enlisted the help of my fellow application engineers to suggest possible changes. Many of them chimed in with some really great suggestions. I have integrated most of our thoughts into a few code versions to share with you. These versions illustrate our thought processes as we worked through the code collaboratively to improve its speed.

In this post, I focus on

• preallocation
• vectorization
• elimination of repeated calculations

### What Does This Code Do?

For security reasons, the user couldn't give me her exact code. So the code we will use in this blog serves as an approximation of the code she is using.

The code generates locations on a 2D grid with dimensions nx1 by nx2. These grid locations serve as values for initial and final positions used by the main part of the calculation. The code iterates through all possible combinations of these initial and final positions.

Given an initial and final position, the code calculates the quantity exponent . If exponent is less than a particular threshold value, gausThresh, exponent is then used to calculate out. subs contains the indices corresponding to the initial and final positions used to calculate out.

For some of you, this probably is better understood just by looking at the code itself.

### The Initial Code

% Initialize the grid and initial and final points
nx1 = 10; nx2 =  10;
x1l =  0; x1u = 100;
x2l =  0; x2u = 100;

x1 = linspace(x1l,x1u,nx1+1);
x2 = linspace(x2l,x2u,nx2+1);

limsf1 = 1:nx1+1; limsf2 = 1:nx2+1;

% Initalize other variables
t = 1;
sigmax1 = 0.5; sigmax2 = 1;
sigma = t * [sigmax1^2 0; 0 sigmax2^2];
invSig = inv(sigma);
detSig = det(sigma);

expF = [1 0; 0 1];
n = size (expF, 1);
gausThresh = 10;

small = 0; subs = []; vals = [];

% Iterate through all possible initial
% and final positions and calculate
% the values of exponent and out
% if exponent > gausThresh.

for i1 = 1:nx1+1
for i2 = 1:nx2+1
for f1 = limsf1
for f2 = limsf2

% Initial and final position
xi = [x1(i1) x2(i2)]';
xf = [x1(f1) x2(f2)]';

exponent = 0.5 * (xf - expF * xi)'...
* invSig * (xf - expF * xi);

if exponent > gausThresh
small = small + 1;
else
out = 1 / (sqrt((2 * pi)^n * detSig))...
* exp(-exponent);
subs = [subs; i1 i2 f1 f2];
vals = [vals; out];
end

end
end
end
end

Here is a graphical representation of the results for nx1=nx2=100. Because the data is quite dense, I have magnified a portion of the total graph. The red lines connect the initial and final points where exponent < gausThresh and the thickness of the line reflects the value of vals.

open('results.fig'); Here are the runtimes on my computer, a T60 Lenovo dual-core laptop with multithreading enabled.

displayRunTimes(1)
    nx1      nx2      time
50       50       296 seconds
100      100      9826 seconds


### Listen to M-Lint

The first and easiest step is to listen to the suggestions from M-Lint, the static code analyzer that ships with core MATLAB. You can either access it via the editor or command line. In this case, I use the command line and define a simple function displayMlint so that the display is compact.

output = mlint('initial.m');
displayMlint(output);
'subs' might be growing inside a
loop. Consider preallocating
for speed.

'vals' might be growing inside a
loop. Consider preallocating
for speed.



Following these suggestions, I know I need to preallocate some of the arrays.

Why?

MATLAB needs contiguous blocks of memory for each array. Therefore, by repeatedly resizing the array, I keep asking MATLAB to look for a new contiguous blocks of memory and move the old array elements into that new block. If I preallocate the maximum amount of space that would be required by the array, I do not have to suffer the additional cost to do this searching and moving.

However, since I am unsure of the final size of subs and vars, I don't know what size I should preallocate. I could make them the maximum possible size. However for the case of nx1 = nx2 = 100, that requires a vector of 100^4 elements, which gets me into trouble!

try
zeros(1,100^4)
catch ME
end
display(ME.message)
Out of memory. Type HELP MEMORY for your options.


Therefore, I need to block allocate the variable.

I do this by first preallocating a reasonable size (that doesn't create an out of memory error) for subs and vars and then increase that allocation by a reasonable amount, chunk by chunk, if subs and vars go beyond their preallocated sizes.

This way, I only suffer the cost of resizing the array a few times instead of every time a new element is added.

I have also introduced a function, initialize, to do the variable initializations so that you don't have to repeatedly look at the part of the code that does not change.

### The Code with Preallocation

% Initialization
nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

% Initial guess for preallocation
mm = min((nx1+1)^2*(nx2+1)^2, 10^6);
subs = zeros(mm,4);
vals = zeros(mm,1);

counter = 0;

% Iterate through all possible initial
% and final positions
for i1 = 1:nx1+1
for i2 = 1:nx2+1
for f1 = limsf1
for f2 = limsf2

xi = [x1(i1) x2(i2)]'; %% Initial position
xf = [x1(f1) x2(f2)]'; %% Final position

exponent = 0.5 * (xf - expF * xi)'...
* invSig * (xf - expF * xi);

% Increase preallocation if necessary
if counter == length(vals)
subs = [subs; zeros(mm, 4)];
vals = [vals; zeros(mm, 1)];
end

if exponent > gausThresh
small = small + 1;
else
% Counter introduced
counter=counter + 1;
out = 1 / (sqrt((2 * pi)^n * detSig))...
* exp(-exponent);
subs(counter,:) = [i1 i2 f1 f2];
vals(counter) = out;
end

end
end
end
end

% Remove zero components that came from preallocation
vals = vals(vals > 0);
subs = subs(vals > 0);

The preallocation makes for slightly faster code.

displayRunTimes(2)
    nx1      nx2      time
50       50       267 seconds
100      100      4228 seconds


However, this isn't exactly the speedup I was looking for. So, let's try to tackle the code using vectorization.

### Vectorization

Because MATLAB is a array-based language, code written in terms of matrix and array operations is often faster than code which relies on for loops, especially in the case of nested for loops. In vectorizing, matrix and vector operations substitute for the equivalent for loop calculations.

There are several possibilities for vectorizing this code. I am going to explore two in this blog entry.

• Vectorize the Inner Two loops
• Vectorize the Inner Three loops

### The Code with Preallocation and Vectorization of the Inner Two Loops

Try 1: Vectorize the Inner Two loops

% Initialization

nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

vals = cell(nx1+1,nx2+1); % Cell preallocation
subs = cell(nx1+1,nx2+1); % Cell preallocation

[xind,yind] = meshgrid(limsf1,limsf2);
xyindices = [xind(:)' ; yind(:)'];

[x,y] = meshgrid(x1(limsf1),x2(limsf2));
xyfinal = [x(:)' ; y(:)'];

exptotal = zeros(length(xyfinal),1);

% Loop over all possible combinations of positions
for i1 = 1:nx1+1
for i2 = 1:nx2+1

xyinitial = repmat([x1(i1);x2(i2)],1,length(xyfinal));

expa = 0.5 * (xyfinal - expF * xyinitial);
expb = invSig * (xyfinal - expF * xyinitial);
exptotal(:,1) = expa(1,:).*expb(1,:)+expa(2,:).*expb(2,:);

index = find(exptotal < gausThresh);
expreduced = exptotal(exptotal < gausThresh);

out = 1 / (sqrt((2 * pi)^n * detSig)) * exp(-(expreduced));
vals{i1,i2} = out;
subs{i1,i2} = [i1*ones(1,length(index)) ; ...
i2*ones(1,length(index)); xyindices(1,index); ...
xyindices(2,index)]' ;

end
end

% Reshape and convert output so it is in a
% simple matrix format
vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

Wow! With just this little bit of vectorization, the run time decreased tremendously.

displayRunTimes(3)
    nx1      nx2      time
50       50       1.51 seconds
100      100      19.28 seconds


Let's go over the main techniques I've illustrated here.

• Use of meshgrid and repmat to create matrices
• Use of vector and matrix operations
• Use of element-wise operations when appropriate

meshgrid produces two vectors of the x and y positions for all possible combinations of the x and y input vectors. In other words, given the vectors x and y, meshgrid gives all possible positions in a mesh or grid defined on a side by those x and y input vectors. The n-dimensional version of meshgrid is ndgrid, which I use it in the next example.

The function repmat replicates and tiles arrays.

Generally I use matrices as part of the vectorization, so whenever possible I use vector and matrix operations.

However, in one segment of the code it is actually faster to explicitly write out a vector-vector multiply element by element. This is somewhat unintuitive, so let's discuss this a bit more in detail.

To calculate exponent in the original code, there is a series of vector and matrix multiplies with the original and final positions, each represented by two-element vectors. The values of the initial and final position are then changed every step through the loop.

When the code is vectorized, all the initial positions and final positions are now contained in two matrices xyinitial and xyfinal. Therefore, to calculate the value of exponent, only the diagonal terms of the multiplication of expa and expb are relevant. However, in this case, taking the diagonal of a matrix multiply turns out to be a rather expensive operation. Instead I choose to write it as an element-by-element operation.

I have only shown you a few functions that are useful for vectorization in this post. Here is a more extensive list of useful functions for vectorization.

### The Code with Preallocation and Vectorization of the Inner Three Loops

Try 2: Vectorize the Inner Three Loops

% Initialization
nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

limsi1 = limsf1;
limsi2 = limsf2;

% ndgrid gives a matrix of all the possible combinations
[aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2);
[a,b,c] = ndgrid(x2,x1,x2);

vals = cell(nx1+1,nx2+1);  % Cell preallocation
subs = cell(nx1+1,nx2+1);  % Cell preallocation

% Convert grids to single vector to use in a single loop
b = b(:); aind = aind(:); bind = bind(:); cind = cind(:);

expac = a(:)-c(:); % Calculate x2-x1

% Iterate through initial x1 positions (i1)
for i1 = limsi1

exbx1= b-x1(i1);
expaux = invSig(2)*exbx1.*expac;
exponent = 0.5*(invSig(1)*exbx1.*exbx1+expaux);

index = find(exponent < gausThresh);
expreduced = exponent(exponent < gausThresh);

vals{i1} = 1 / (sqrt((2 * pi)^n * detSig))...
.*exp(-expreduced);

subs{i1} = [i1*ones(1,length(index));
aind(index)' ; bind(index)';...
cind(index)']';

end

vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

Now the code runs in even less time.

displayRunTimes(4)
    nx1      nx2      time
50       50       0.658 seconds
100      100      8.77 seconds


Why don't I try to vectorize all the loops?

I certainly could. In this case, I didn't see much of a benefit with total vectorization, and the code became a bit harder to read. So, I am not going to show the fully vectorized code. Considering performance and readability of the code, try 2 seems to be the best solution.

Once the code is vectorized, you can easily see that I repeatedly calculate values that only need to be calculated once. Also, assuming some of the vectors keep the same form (such as sigma), other calculations (e.g., InvSig) are unnecessary.

Some of these changes are quite easy to do, and some are a bit more advanced, but I wanted to finish this entry with the fastest version of the code from the application engineering participants.

### The Code with Preallocation, Vectorization and Only Calculating Once

nx1 =  100; nx2 = 100;
x1l =   0; x1u = 100;
x2l =   0; x2u = 100;

x1 = linspace(x1l,x1u,nx1+1);
x2 = linspace(x2l,x2u,nx2+1);

limsi1 = 1:nx1+1;
limsi2 = 1:nx2+1;

limsf1 = 1:nx1+1;
limsf2 = 1:nx2+1;

t = 1;

sigmax1 = 0.5;
sigmax2 = 1;

sigma = t * [sigmax1^2 sigmax2^2];
detSig = sigma(1)*sigma(2);
invSig = [1/sigma(1) 1/sigma(2)];

gausThresh = 10;
n=3;
const=1 / (sqrt((2 * pi)^n * detSig));

% ndgrid gives a matrix of all the possible combinations
% of position, except limsi1 which we iterate over

[aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2);
[a,b,c] = ndgrid(x2,x1,x2);

vals = cell(nx1+1,nx2+1);  % Cell preallocation
subs = cell(nx1+1,nx2+1);  % Cell preallocation

% Convert grids to single vector to
% use in a single for-loop
b = b(:);
aind = aind(:);
bind = bind(:);
cind = cind(:);

expac= a(:)-c(:);
expaux = invSig(2)*expac.*expac;

% Iterate through initial x1 positions

for i1 = limsi1

expbx1= b-x1(i1);
exponent = 0.5*(invSig(1)*expbx1.*expbx1+expaux);

% Find indices where exponent < gausThresh
index = find(exponent < gausThresh);

% Find and keep values where exp < gausThresh

expreduced = exponent(exponent < gausThresh);

vals{i1} = const.*exp(-expreduced);

subs{i1} = [i1*ones(1,length(index));
aind(index)' ; bind(index)';...
cind(index)']';
end

vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

With this, here are the magical times.

displayRunTimes(5)
    nx1      nx2      time
50       50       0.568 seconds
100      100      8.36 seconds


Pretty good speedup. In fact, you see the run-time reduced to about 2% of the original times.

In this post, I focused on preallocation, vectorization, and only doing calculations a single time. However, there are also additional techniques to help you speed up your MATLAB code such as using the profiler to diagnose bottlenecks, using MEX-files, using bsxfun, and enabling multithreading. The documentation for MATLAB has a very helpful section entitled Performance. I encourage you to go there to read more about these and other relevant topics.