Speeding Up MATLAB Applications
Today I’d like to introduce a guest blogger, Sarah Wait Zaranek, who is an application engineer here at The MathWorks. Today she talks about speeding up code from a customer to get acceptable performance.
Contents
- Introduction
- What Does This Code Do?
- The Initial Code
- Listen to M-Lint
- The Code with Preallocation
- Vectorization
- The Code with Preallocation and Vectorization of the Inner Two Loops
- The Code with Preallocation and Vectorization of the Inner Three Loops
- The Code with Preallocation, Vectorization and Only Calculating Once
- Additional Resources
- Your Examples
Introduction
A customer ran into slow performance issues with her code in MATLAB. She saw such slow performance in that she decided to recode her algorithm in another language. I wanted to show her some simple techniques in MATLAB that could bring her code down to a more reasonable running time.
I enlisted the help of my fellow application engineers to suggest possible changes. Many of them chimed in with some really great suggestions. I have integrated most of our thoughts into a few code versions to share with you. These versions illustrate our thought processes as we worked through the code collaboratively to improve its speed.
In this post, I focus on
- preallocation
- vectorization
- elimination of repeated calculations
What Does This Code Do?
For security reasons, the user couldn't give me her exact code. So the code we will use in this blog serves as an approximation of the code she is using.
The code generates locations on a 2D grid with dimensions nx1 by nx2. These grid locations serve as values for initial and final positions used by the main part of the calculation. The code iterates through all possible combinations of these initial and final positions.
Given an initial and final position, the code calculates the quantity exponent . If exponent is less than a particular threshold value, gausThresh, exponent is then used to calculate out. subs contains the indices corresponding to the initial and final positions used to calculate out.
For some of you, this probably is better understood just by looking at the code itself.
The Initial Code
% Initialize the grid and initial and final points nx1 = 10; nx2 = 10; x1l = 0; x1u = 100; x2l = 0; x2u = 100; x1 = linspace(x1l,x1u,nx1+1); x2 = linspace(x2l,x2u,nx2+1); limsf1 = 1:nx1+1; limsf2 = 1:nx2+1; % Initalize other variables t = 1; sigmax1 = 0.5; sigmax2 = 1; sigma = t * [sigmax1^2 0; 0 sigmax2^2]; invSig = inv(sigma); detSig = det(sigma); expF = [1 0; 0 1]; n = size (expF, 1); gausThresh = 10; small = 0; subs = []; vals = []; % Iterate through all possible initial % and final positions and calculate % the values of exponent and out % if exponent > gausThresh. for i1 = 1:nx1+1 for i2 = 1:nx2+1 for f1 = limsf1 for f2 = limsf2 % Initial and final position xi = [x1(i1) x2(i2)]'; xf = [x1(f1) x2(f2)]'; exponent = 0.5 * (xf - expF * xi)'... * invSig * (xf - expF * xi); if exponent > gausThresh small = small + 1; else out = 1 / (sqrt((2 * pi)^n * detSig))... * exp(-exponent); subs = [subs; i1 i2 f1 f2]; vals = [vals; out]; end end end end end
Here is a graphical representation of the results for nx1=nx2=100. Because the data is quite dense, I have magnified a portion of the total graph. The red lines connect the initial and final points where exponent < gausThresh and the thickness of the line reflects the value of vals.
open('results.fig');
Here are the runtimes on my computer, a T60 Lenovo dual-core laptop with multithreading enabled.
displayRunTimes(1)
nx1 nx2 time 50 50 296 seconds 100 100 9826 seconds
Listen to M-Lint
The first and easiest step is to listen to the suggestions from M-Lint, the static code analyzer that ships with core MATLAB. You can either access it via the editor or command line. In this case, I use the command line and define a simple function displayMlint so that the display is compact.
output = mlint('initial.m');
displayMlint(output);
'subs' might be growing inside a loop. Consider preallocating for speed. 'vals' might be growing inside a loop. Consider preallocating for speed.
Following these suggestions, I know I need to preallocate some of the arrays.
Why?
MATLAB needs contiguous blocks of memory for each array. Therefore, by repeatedly resizing the array, I keep asking MATLAB to look for a new contiguous blocks of memory and move the old array elements into that new block. If I preallocate the maximum amount of space that would be required by the array, I do not have to suffer the additional cost to do this searching and moving.
However, since I am unsure of the final size of subs and vars, I don't know what size I should preallocate. I could make them the maximum possible size. However for the case of nx1 = nx2 = 100, that requires a vector of 100^4 elements, which gets me into trouble!
try zeros(1,100^4) catch ME end display(ME.message)
Out of memory. Type HELP MEMORY for your options.
Therefore, I need to block allocate the variable.
I do this by first preallocating a reasonable size (that doesn't create an out of memory error) for subs and vars and then increase that allocation by a reasonable amount, chunk by chunk, if subs and vars go beyond their preallocated sizes.
This way, I only suffer the cost of resizing the array a few times instead of every time a new element is added.
I have also introduced a function, initialize, to do the variable initializations so that you don't have to repeatedly look at the part of the code that does not change.
The Code with Preallocation
% Initialization nx1 = 10; nx2 = 10; [x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ... initialize(nx1,nx2); % Initial guess for preallocation mm = min((nx1+1)^2*(nx2+1)^2, 10^6); subs = zeros(mm,4); vals = zeros(mm,1); counter = 0; % Iterate through all possible initial % and final positions for i1 = 1:nx1+1 for i2 = 1:nx2+1 for f1 = limsf1 for f2 = limsf2 xi = [x1(i1) x2(i2)]'; %% Initial position xf = [x1(f1) x2(f2)]'; %% Final position exponent = 0.5 * (xf - expF * xi)'... * invSig * (xf - expF * xi); % Increase preallocation if necessary if counter == length(vals) subs = [subs; zeros(mm, 4)]; vals = [vals; zeros(mm, 1)]; end if exponent > gausThresh small = small + 1; else % Counter introduced counter=counter + 1; out = 1 / (sqrt((2 * pi)^n * detSig))... * exp(-exponent); subs(counter,:) = [i1 i2 f1 f2]; vals(counter) = out; end end end end end % Remove zero components that came from preallocation vals = vals(vals > 0); subs = subs(vals > 0);
The preallocation makes for slightly faster code.
displayRunTimes(2)
nx1 nx2 time 50 50 267 seconds 100 100 4228 seconds
However, this isn't exactly the speedup I was looking for. So, let's try to tackle the code using vectorization.
Vectorization
Because MATLAB is a array-based language, code written in terms of matrix and array operations is often faster than code which relies on for loops, especially in the case of nested for loops. In vectorizing, matrix and vector operations substitute for the equivalent for loop calculations.
There are several possibilities for vectorizing this code. I am going to explore two in this blog entry.
- Vectorize the Inner Two loops
- Vectorize the Inner Three loops
The Code with Preallocation and Vectorization of the Inner Two Loops
Try 1: Vectorize the Inner Two loops
% Initialization nx1 = 10; nx2 = 10; [x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ... initialize(nx1,nx2); vals = cell(nx1+1,nx2+1); % Cell preallocation subs = cell(nx1+1,nx2+1); % Cell preallocation [xind,yind] = meshgrid(limsf1,limsf2); xyindices = [xind(:)' ; yind(:)']; [x,y] = meshgrid(x1(limsf1),x2(limsf2)); xyfinal = [x(:)' ; y(:)']; exptotal = zeros(length(xyfinal),1); % Loop over all possible combinations of positions for i1 = 1:nx1+1 for i2 = 1:nx2+1 xyinitial = repmat([x1(i1);x2(i2)],1,length(xyfinal)); expa = 0.5 * (xyfinal - expF * xyinitial); expb = invSig * (xyfinal - expF * xyinitial); exptotal(:,1) = expa(1,:).*expb(1,:)+expa(2,:).*expb(2,:); index = find(exptotal < gausThresh); expreduced = exptotal(exptotal < gausThresh); out = 1 / (sqrt((2 * pi)^n * detSig)) * exp(-(expreduced)); vals{i1,i2} = out; subs{i1,i2} = [i1*ones(1,length(index)) ; ... i2*ones(1,length(index)); xyindices(1,index); ... xyindices(2,index)]' ; end end % Reshape and convert output so it is in a % simple matrix format vals = cell2mat(vals(:)); subs = cell2mat(subs(:)); small = ((nx1+1)^2*(nx2+1)^2)-length(subs);
Wow! With just this little bit of vectorization, the run time decreased tremendously.
displayRunTimes(3)
nx1 nx2 time 50 50 1.51 seconds 100 100 19.28 seconds
Let's go over the main techniques I've illustrated here.
meshgrid produces two vectors of the x and y positions for all possible combinations of the x and y input vectors. In other words, given the vectors x and y, meshgrid gives all possible positions in a mesh or grid defined on a side by those x and y input vectors. The n-dimensional version of meshgrid is ndgrid, which I use it in the next example.
The function repmat replicates and tiles arrays.
Generally I use matrices as part of the vectorization, so whenever possible I use vector and matrix operations.
However, in one segment of the code it is actually faster to explicitly write out a vector-vector multiply element by element. This is somewhat unintuitive, so let's discuss this a bit more in detail.
To calculate exponent in the original code, there is a series of vector and matrix multiplies with the original and final positions, each represented by two-element vectors. The values of the initial and final position are then changed every step through the loop.
When the code is vectorized, all the initial positions and final positions are now contained in two matrices xyinitial and xyfinal. Therefore, to calculate the value of exponent, only the diagonal terms of the multiplication of expa and expb are relevant. However, in this case, taking the diagonal of a matrix multiply turns out to be a rather expensive operation. Instead I choose to write it as an element-by-element operation.
I have only shown you a few functions that are useful for vectorization in this post. Here is a more extensive list of useful functions for vectorization.
The Code with Preallocation and Vectorization of the Inner Three Loops
Try 2: Vectorize the Inner Three Loops
% Initialization nx1 = 10; nx2 = 10; [x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ... initialize(nx1,nx2); limsi1 = limsf1; limsi2 = limsf2; % ndgrid gives a matrix of all the possible combinations [aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2); [a,b,c] = ndgrid(x2,x1,x2); vals = cell(nx1+1,nx2+1); % Cell preallocation subs = cell(nx1+1,nx2+1); % Cell preallocation % Convert grids to single vector to use in a single loop b = b(:); aind = aind(:); bind = bind(:); cind = cind(:); expac = a(:)-c(:); % Calculate x2-x1 % Iterate through initial x1 positions (i1) for i1 = limsi1 exbx1= b-x1(i1); expaux = invSig(2)*exbx1.*expac; exponent = 0.5*(invSig(1)*exbx1.*exbx1+expaux); index = find(exponent < gausThresh); expreduced = exponent(exponent < gausThresh); vals{i1} = 1 / (sqrt((2 * pi)^n * detSig))... .*exp(-expreduced); subs{i1} = [i1*ones(1,length(index)); aind(index)' ; bind(index)';... cind(index)']'; end vals = cell2mat(vals(:)); subs = cell2mat(subs(:)); small = ((nx1+1)^2*(nx2+1)^2)-length(subs);
Now the code runs in even less time.
displayRunTimes(4)
nx1 nx2 time 50 50 0.658 seconds 100 100 8.77 seconds
Why don't I try to vectorize all the loops?
I certainly could. In this case, I didn't see much of a benefit with total vectorization, and the code became a bit harder to read. So, I am not going to show the fully vectorized code. Considering performance and readability of the code, try 2 seems to be the best solution.
Once the code is vectorized, you can easily see that I repeatedly calculate values that only need to be calculated once. Also, assuming some of the vectors keep the same form (such as sigma), other calculations (e.g., InvSig) are unnecessary.
Some of these changes are quite easy to do, and some are a bit more advanced, but I wanted to finish this entry with the fastest version of the code from the application engineering participants.
The Code with Preallocation, Vectorization and Only Calculating Once
nx1 = 100; nx2 = 100; x1l = 0; x1u = 100; x2l = 0; x2u = 100; x1 = linspace(x1l,x1u,nx1+1); x2 = linspace(x2l,x2u,nx2+1); limsi1 = 1:nx1+1; limsi2 = 1:nx2+1; limsf1 = 1:nx1+1; limsf2 = 1:nx2+1; t = 1; sigmax1 = 0.5; sigmax2 = 1; sigma = t * [sigmax1^2 sigmax2^2]; detSig = sigma(1)*sigma(2); invSig = [1/sigma(1) 1/sigma(2)]; gausThresh = 10; n=3; const=1 / (sqrt((2 * pi)^n * detSig)); % ndgrid gives a matrix of all the possible combinations % of position, except limsi1 which we iterate over [aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2); [a,b,c] = ndgrid(x2,x1,x2); vals = cell(nx1+1,nx2+1); % Cell preallocation subs = cell(nx1+1,nx2+1); % Cell preallocation % Convert grids to single vector to % use in a single for-loop b = b(:); aind = aind(:); bind = bind(:); cind = cind(:); expac= a(:)-c(:); expaux = invSig(2)*expac.*expac; % Iterate through initial x1 positions for i1 = limsi1 expbx1= b-x1(i1); exponent = 0.5*(invSig(1)*expbx1.*expbx1+expaux); % Find indices where exponent < gausThresh index = find(exponent < gausThresh); % Find and keep values where exp < gausThresh expreduced = exponent(exponent < gausThresh); vals{i1} = const.*exp(-expreduced); subs{i1} = [i1*ones(1,length(index)); aind(index)' ; bind(index)';... cind(index)']'; end vals = cell2mat(vals(:)); subs = cell2mat(subs(:)); small = ((nx1+1)^2*(nx2+1)^2)-length(subs);
With this, here are the magical times.
displayRunTimes(5)
nx1 nx2 time 50 50 0.568 seconds 100 100 8.36 seconds
Pretty good speedup. In fact, you see the run-time reduced to about 2% of the original times.
Additional Resources
In this post, I focused on preallocation, vectorization, and only doing calculations a single time. However, there are also additional techniques to help you speed up your MATLAB code such as using the profiler to diagnose bottlenecks, using MEX-files, using bsxfun, and enabling multithreading. The documentation for MATLAB has a very helpful section entitled Performance. I encourage you to go there to read more about these and other relevant topics.
Your Examples
Tell me about some of the ways you speed up MATLAB applications or feel free to post a example of a technique I didn't illustrate here.