Today I’d like to introduce a guest blogger, Sarah Wait Zaranek, who is an application engineer here at The MathWorks. Today she talks about speeding up code from a customer to get acceptable performance.
Contents
- Introduction
- What Does This Code Do?
- The Initial Code
- Listen to M-Lint
- The Code with Preallocation
- Vectorization
- The Code with Preallocation and Vectorization of the Inner Two Loops
- The Code with Preallocation and Vectorization of the Inner Three Loops
- The Code with Preallocation, Vectorization and Only Calculating Once
- Additional Resources
- Your Examples
Introduction
A customer ran into slow performance issues with her code in MATLAB. She saw such slow performance in that she decided to recode her algorithm in another language. I wanted to show her some simple techniques in MATLAB that could bring her code down to a more reasonable running time.
I enlisted the help of my fellow application engineers to suggest possible changes. Many of them chimed in with some really great suggestions. I have integrated most of our thoughts into a few code versions to share with you. These versions illustrate our thought processes as we worked through the code collaboratively to improve its speed.
In this post, I focus on
- preallocation
- vectorization
- elimination of repeated calculations
What Does This Code Do?
For security reasons, the user couldn't give me her exact code. So the code we will use in this blog serves as an approximation of the code she is using.
The code generates locations on a 2D grid with dimensions nx1 by nx2. These grid locations serve as values for initial and final positions used by the main part of the calculation. The code iterates through all possible combinations of these initial and final positions.
Given an initial and final position, the code calculates the quantity exponent . If exponent is less than a particular threshold value, gausThresh, exponent is then used to calculate out. subs contains the indices corresponding to the initial and final positions used to calculate out.
For some of you, this probably is better understood just by looking at the code itself.
The Initial Code
% Initialize the grid and initial and final points nx1 = 10; nx2 = 10; x1l = 0; x1u = 100; x2l = 0; x2u = 100; x1 = linspace(x1l,x1u,nx1+1); x2 = linspace(x2l,x2u,nx2+1); limsf1 = 1:nx1+1; limsf2 = 1:nx2+1; % Initalize other variables t = 1; sigmax1 = 0.5; sigmax2 = 1; sigma = t * [sigmax1^2 0; 0 sigmax2^2]; invSig = inv(sigma); detSig = det(sigma); expF = [1 0; 0 1]; n = size (expF, 1); gausThresh = 10; small = 0; subs = []; vals = []; % Iterate through all possible initial % and final positions and calculate % the values of exponent and out % if exponent > gausThresh. for i1 = 1:nx1+1 for i2 = 1:nx2+1 for f1 = limsf1 for f2 = limsf2 % Initial and final position xi = [x1(i1) x2(i2)]'; xf = [x1(f1) x2(f2)]'; exponent = 0.5 * (xf - expF * xi)'... * invSig * (xf - expF * xi); if exponent > gausThresh small = small + 1; else out = 1 / (sqrt((2 * pi)^n * detSig))... * exp(-exponent); subs = [subs; i1 i2 f1 f2]; vals = [vals; out]; end end end end end
Here is a graphical representation of the results for nx1=nx2=100. Because the data is quite dense, I have magnified a portion of the total graph. The red lines connect the initial and final points where exponent < gausThresh and the thickness of the line reflects the value of vals.
open('results.fig');
Here are the runtimes on my computer, a T60 Lenovo dual-core laptop with multithreading enabled.
displayRunTimes(1)
nx1 nx2 time
50 50 296 seconds
100 100 9826 seconds
Listen to M-Lint
The first and easiest step is to listen to the suggestions from M-Lint, the static code analyzer that ships with core MATLAB. You can either access it via the editor or command line. In this case, I use the command line and define a simple function displayMlint so that the display is compact.
output = mlint('initial.m');
displayMlint(output);'subs' might be growing inside a
loop. Consider preallocating
for speed.
'vals' might be growing inside a
loop. Consider preallocating
for speed.
Following these suggestions, I know I need to preallocate some of the arrays.
Why?
MATLAB needs contiguous blocks of memory for each array. Therefore, by repeatedly resizing the array, I keep asking MATLAB to look for a new contiguous blocks of memory and move the old array elements into that new block. If I preallocate the maximum amount of space that would be required by the array, I do not have to suffer the additional cost to do this searching and moving.
However, since I am unsure of the final size of subs and vars, I don't know what size I should preallocate. I could make them the maximum possible size. However for the case of nx1 = nx2 = 100, that requires a vector of 100^4 elements, which gets me into trouble!
try zeros(1,100^4) catch ME end display(ME.message)
Out of memory. Type HELP MEMORY for your options.
Therefore, I need to block allocate the variable.
I do this by first preallocating a reasonable size (that doesn't create an out of memory error) for subs and vars and then increase that allocation by a reasonable amount, chunk by chunk, if subs and vars go beyond their preallocated sizes.
This way, I only suffer the cost of resizing the array a few times instead of every time a new element is added.
I have also introduced a function, initialize, to do the variable initializations so that you don't have to repeatedly look at the part of the code that does not change.
The Code with Preallocation
% Initialization nx1 = 10; nx2 = 10; [x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ... initialize(nx1,nx2); % Initial guess for preallocation mm = min((nx1+1)^2*(nx2+1)^2, 10^6); subs = zeros(mm,4); vals = zeros(mm,1); counter = 0; % Iterate through all possible initial % and final positions for i1 = 1:nx1+1 for i2 = 1:nx2+1 for f1 = limsf1 for f2 = limsf2 xi = [x1(i1) x2(i2)]'; %% Initial position xf = [x1(f1) x2(f2)]'; %% Final position exponent = 0.5 * (xf - expF * xi)'... * invSig * (xf - expF * xi); % Increase preallocation if necessary if counter == length(vals) subs = [subs; zeros(mm, 4)]; vals = [vals; zeros(mm, 1)]; end if exponent > gausThresh small = small + 1; else % Counter introduced counter=counter + 1; out = 1 / (sqrt((2 * pi)^n * detSig))... * exp(-exponent); subs(counter,:) = [i1 i2 f1 f2]; vals(counter) = out; end end end end end % Remove zero components that came from preallocation vals = vals(vals > 0); subs = subs(vals > 0);
The preallocation makes for slightly faster code.
displayRunTimes(2)
nx1 nx2 time
50 50 267 seconds
100 100 4228 seconds
However, this isn't exactly the speedup I was looking for. So, let's try to tackle the code using vectorization.
Vectorization
Because MATLAB is a array-based language, code written in terms of matrix and array operations is often faster than code which relies on for loops, especially in the case of nested for loops. In vectorizing, matrix and vector operations substitute for the equivalent for loop calculations.
There are several possibilities for vectorizing this code. I am going to explore two in this blog entry.
- Vectorize the Inner Two loops
- Vectorize the Inner Three loops
The Code with Preallocation and Vectorization of the Inner Two Loops
Try 1: Vectorize the Inner Two loops
% Initialization nx1 = 10; nx2 = 10; [x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ... initialize(nx1,nx2); vals = cell(nx1+1,nx2+1); % Cell preallocation subs = cell(nx1+1,nx2+1); % Cell preallocation [xind,yind] = meshgrid(limsf1,limsf2); xyindices = [xind(:)' ; yind(:)']; [x,y] = meshgrid(x1(limsf1),x2(limsf2)); xyfinal = [x(:)' ; y(:)']; exptotal = zeros(length(xyfinal),1); % Loop over all possible combinations of positions for i1 = 1:nx1+1 for i2 = 1:nx2+1 xyinitial = repmat([x1(i1);x2(i2)],1,length(xyfinal)); expa = 0.5 * (xyfinal - expF * xyinitial); expb = invSig * (xyfinal - expF * xyinitial); exptotal(:,1) = expa(1,:).*expb(1,:)+expa(2,:).*expb(2,:); index = find(exptotal < gausThresh); expreduced = exptotal(exptotal < gausThresh); out = 1 / (sqrt((2 * pi)^n * detSig)) * exp(-(expreduced)); vals{i1,i2} = out; subs{i1,i2} = [i1*ones(1,length(index)) ; ... i2*ones(1,length(index)); xyindices(1,index); ... xyindices(2,index)]' ; end end % Reshape and convert output so it is in a % simple matrix format vals = cell2mat(vals(:)); subs = cell2mat(subs(:)); small = ((nx1+1)^2*(nx2+1)^2)-length(subs);
Wow! With just this little bit of vectorization, the run time decreased tremendously.
displayRunTimes(3)
nx1 nx2 time
50 50 1.51 seconds
100 100 19.28 seconds
Let's go over the main techniques I've illustrated here.
meshgrid produces two vectors of the x and y positions for all possible combinations of the x and y input vectors. In other words, given the vectors x and y, meshgrid gives all possible positions in a mesh or grid defined on a side by those x and y input vectors. The n-dimensional version of meshgrid is ndgrid, which I use it in the next example.
The function repmat replicates and tiles arrays.
Generally I use matrices as part of the vectorization, so whenever possible I use vector and matrix operations.
However, in one segment of the code it is actually faster to explicitly write out a vector-vector multiply element by element. This is somewhat unintuitive, so let's discuss this a bit more in detail.
To calculate exponent in the original code, there is a series of vector and matrix multiplies with the original and final positions, each represented by two-element vectors. The values of the initial and final position are then changed every step through the loop.
When the code is vectorized, all the initial positions and final positions are now contained in two matrices xyinitial and xyfinal. Therefore, to calculate the value of exponent, only the diagonal terms of the multiplication of expa and expb are relevant. However, in this case, taking the diagonal of a matrix multiply turns out to be a rather expensive operation. Instead I choose to write it as an element-by-element operation.
I have only shown you a few functions that are useful for vectorization in this post. Here is a more extensive list of useful functions for vectorization.
The Code with Preallocation and Vectorization of the Inner Three Loops
Try 2: Vectorize the Inner Three Loops
% Initialization nx1 = 10; nx2 = 10; [x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ... initialize(nx1,nx2); limsi1 = limsf1; limsi2 = limsf2; % ndgrid gives a matrix of all the possible combinations [aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2); [a,b,c] = ndgrid(x2,x1,x2); vals = cell(nx1+1,nx2+1); % Cell preallocation subs = cell(nx1+1,nx2+1); % Cell preallocation % Convert grids to single vector to use in a single loop b = b(:); aind = aind(:); bind = bind(:); cind = cind(:); expac = a(:)-c(:); % Calculate x2-x1 % Iterate through initial x1 positions (i1) for i1 = limsi1 exbx1= b-x1(i1); expaux = invSig(2)*exbx1.*expac; exponent = 0.5*(invSig(1)*exbx1.*exbx1+expaux); index = find(exponent < gausThresh); expreduced = exponent(exponent < gausThresh); vals{i1} = 1 / (sqrt((2 * pi)^n * detSig))... .*exp(-expreduced); subs{i1} = [i1*ones(1,length(index)); aind(index)' ; bind(index)';... cind(index)']'; end vals = cell2mat(vals(:)); subs = cell2mat(subs(:)); small = ((nx1+1)^2*(nx2+1)^2)-length(subs);
Now the code runs in even less time.
displayRunTimes(4)
nx1 nx2 time
50 50 0.658 seconds
100 100 8.77 seconds
Why don't I try to vectorize all the loops?
I certainly could. In this case, I didn't see much of a benefit with total vectorization, and the code became a bit harder to read. So, I am not going to show the fully vectorized code. Considering performance and readability of the code, try 2 seems to be the best solution.
Once the code is vectorized, you can easily see that I repeatedly calculate values that only need to be calculated once. Also, assuming some of the vectors keep the same form (such as sigma), other calculations (e.g., InvSig) are unnecessary.
Some of these changes are quite easy to do, and some are a bit more advanced, but I wanted to finish this entry with the fastest version of the code from the application engineering participants.
The Code with Preallocation, Vectorization and Only Calculating Once
nx1 = 100; nx2 = 100; x1l = 0; x1u = 100; x2l = 0; x2u = 100; x1 = linspace(x1l,x1u,nx1+1); x2 = linspace(x2l,x2u,nx2+1); limsi1 = 1:nx1+1; limsi2 = 1:nx2+1; limsf1 = 1:nx1+1; limsf2 = 1:nx2+1; t = 1; sigmax1 = 0.5; sigmax2 = 1; sigma = t * [sigmax1^2 sigmax2^2]; detSig = sigma(1)*sigma(2); invSig = [1/sigma(1) 1/sigma(2)]; gausThresh = 10; n=3; const=1 / (sqrt((2 * pi)^n * detSig)); % ndgrid gives a matrix of all the possible combinations % of position, except limsi1 which we iterate over [aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2); [a,b,c] = ndgrid(x2,x1,x2); vals = cell(nx1+1,nx2+1); % Cell preallocation subs = cell(nx1+1,nx2+1); % Cell preallocation % Convert grids to single vector to % use in a single for-loop b = b(:); aind = aind(:); bind = bind(:); cind = cind(:); expac= a(:)-c(:); expaux = invSig(2)*expac.*expac; % Iterate through initial x1 positions for i1 = limsi1 expbx1= b-x1(i1); exponent = 0.5*(invSig(1)*expbx1.*expbx1+expaux); % Find indices where exponent < gausThresh index = find(exponent < gausThresh); % Find and keep values where exp < gausThresh expreduced = exponent(exponent < gausThresh); vals{i1} = const.*exp(-expreduced); subs{i1} = [i1*ones(1,length(index)); aind(index)' ; bind(index)';... cind(index)']'; end vals = cell2mat(vals(:)); subs = cell2mat(subs(:)); small = ((nx1+1)^2*(nx2+1)^2)-length(subs);
With this, here are the magical times.
displayRunTimes(5)
nx1 nx2 time
50 50 0.568 seconds
100 100 8.36 seconds
Pretty good speedup. In fact, you see the run-time reduced to about 2% of the original times.
Additional Resources
In this post, I focused on preallocation, vectorization, and only doing calculations a single time. However, there are also additional techniques to help you speed up your MATLAB code such as using the profiler to diagnose bottlenecks, using MEX-files, using bsxfun, and enabling multithreading. The documentation for MATLAB has a very helpful section entitled Performance. I encourage you to go there to read more about these and other relevant topics.
Your Examples
Tell me about some of the ways you speed up MATLAB applications or feel free to post a example of a technique I didn't illustrate here.
Get
the MATLAB code
Published with MATLAB® 7.6



In grad school I was working on a GSM receiver in Matlab. I found some source online that had been open-sourced that used a Viterbi Equalizer. It worked well, but was extremely slow. (Especially forming the channel estimate.) It looked like the original author was very prone to Computer-Science-Like FOR loops and lots of IF logic. So I took some time to rewrite the code using vectorization, meshgrid and linear indexing. You mentioned the first two, but I have found big speedups in using linear indexed arrays and matrices for my operations. (Mainly because I don’t have to pick elements out with loop variables and such.) The original was about 5*real time. (5 seconds to demod 1 second of signal.) The final rewrite with a .mex file for the actual Viterbi was right at 1/10 of real time. Here’s an example from the code using some meshgrid, vectorization, linear indexing and reshaping matrices for easy multiplies. This was fast enough that I could calculate a channel estimate for all 8 of the GSM training sequences at once. (I used that mainly for debugging purposes.) Meshgrid was the hottest line in the profiler, and was still down in the noise.
Original : (Short, sweet and slow…)
% ESTIMATE CHANNEL IMPULSE RESPONSE USING ONLY EVERY
% OSR’th SAMPLE IN THE RECEIVED SIGNAL
%
for n = 1:length(chan_est),
chan_est(n)=r_sub(n:OSR:n+(length(T16)-1)*OSR)*T16.’;
end
Rewrite : (Longer, a little harder to read, 1000xfaster)
% NO NO NO NO NO NO NO NO NO NO NO NO
% % % % % % for n = 1:length(chan_est),
% % % % % % chan_est(n)=r_sub(n:OSR:n+15*OSR)*TRAINING(:,3);
% % % % % % end
% Make my index matrices.
% X increments by OSR up to length(chan_est)*4 and then repeats that row to
% make the matrix square.
%
% Y is a column incrementing from 1 to the length of the X matrix row by 1.
% Then that column is repeated to make the matrix square.
[ X , Y ] = meshgrid([1:OSR:length(chan_est)*OSR],[1:length([1:OSR:length(chan_est)*OSR])]);
% Add them together, and subtract 1. This gives us rows of lags for the
% signal sample points. We only want 16 sample points, so we kill off
% everything after the 16th lag of each row and transpose. (Since the
% indices are all real, we can use the ‘ conjugate transpose without
% ill-effect, but to be disciplined we won’t.)
ind = X+Y-1;
ind = ind(:,1:16).’;
% Here we use linear indexing into the r_sub signal to create a long vector
% of lagged samples of r_sub. Every 16 samples of the new r_sub is going
% to be a row in our matrix. We reshape it into a matrix of 16 rows by
% length(chan_est) columns and then non-conjugate transpose to get the
% correct shape. (You have to do it this way due to the way reshape fills
% in values. Column-wise.)
r_sub = r_sub(ind);
r_sub = reshape(r_sub, 16, length(chan_est)).’;
% Now we simply do a matrix multiply and we create chan_est in one
% operation. In fact, we generate a chan_est for all 8 training sequences
% in one multiply.
chan_est = (r_sub*TRAINING).’;
Love the blog. Thanks for the great post.
see also this great link for an (often faster) indexing version of repmat: http://www.psi.toronto.edu/~vincent/matlabindexrepmat.html
An excellent blog. There is an area of acceleration which you didn’t cover which I would very much like to get some clarification on. What is the fastest way to extract a subset of an array? For example, if I am calculating a power spectral density I only need the first half of the results, is there any way to know which of the following will be fastest?
nPts = 1e5;
half = nPts/2;
A = rand(1e5,1);
B = fft(A); % I know I can do lots of things to accelerate this
B=B(1:half); % method 1
B(half+1:end)=[]; %method 2
C = B(1:half); %method 3
Also, how does this change when I’m dealing with more than 1 dimension? Is there a relative speed difference between linear indexing and : notation? Does using “end” slow me down vs explicitly determining the number of rows?
I know this seems like a trivial area to worry about speed, but when you are dealing with lots of these operations, they can take up a lot of my execution time.
Thanks,
Dan
Thanks for all the helpful tips. In particular, I have not seen bsxfun before, and I’ve wanted that functionality.
I have a question, though. I certainly try to preallocate and vectorize in my own code, but I was under the impression that MATLAB had radically sped up for-loops a few versions ago, as long as the data types and operations within the loop were part of an “accelerated” subset. This wouldn’t, of course, change the need to preallocate, but the operations in this example all seem to be things that should be accelerated. So why is your first try (with preallocation and no vectorization) still so slow? Would it have been that much slower in an old version of MATLAB, before acceleration?
Dan-
Good questions about indexing. I fixed your example to be more fair, I believe. Here’s the code and the experiment I ran.
First the code:
And the results - which I got by running the first cell once and the second cell 3 times:
This is only 1-D and linear indexing both for this case. It seems like creating the part to keep and assigning it is faster than removing the ones you don’t want to keep. Not sure if that extends to ND. Expect me to think about this for a post of the future!
–Loren
This is a wonderfully deep example and I think it will take me a month to digest.
I guess my first comment is that you could have significantly improved the readability by simplifying the span of the nested for loops since for the purposes of this column, you left them fixed at all times.
i.e.,
for i1 = 1 : 11
rather than
for i1 = 1 : nx1+1
Next is a question. In your 1st attempt at preallocation, you have this in the code:
vals = vals(vals > 0);
This is the first I’ve seen this indexing syntax of vals > 0. Where can I learn more about this?
Thanks.
Oliver-
The code you ask about:
is just using logical indexing - a topic that is in the documentation here: http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_prog/f1-85462.html#bq7egb6-1
and in several of my blogs on indexing: http://blogs.mathworks.com/loren/category/indexing/
particularly:
http://blogs.mathworks.com/loren/2006/01/18/think-about-your-algorithm/
http://blogs.mathworks.com/loren/2006/08/09/essence-of-indexing/
http://blogs.mathworks.com/loren/2007/10/11/a-way-to-account-for-missing-data/
–Loren
Great article! Would love to see more performance tuning examples :-)
Loren,
Good catch on the code error I made! I later caught the fact that I was reusing a variable incorrectly, but missed the copy on write part. I guess one more option to try would be logical indexing as a way of extracting a subset. Especially surpising to me was that B(half+1:end) =[]; is FASTER than B(half+1:nPts)=[]; That one I really can’t figure.
Dan
Dan-
I haven’t looked into the innards of MATLAB, but I think what’s happening is that the expression with nPts creates the vector with indices first and then uses it. The expression with ‘end’ has more oppportunties for optimizing since ‘end’ depends on context and doesn’t just get calculated without knowing the lay of the land. Once MATLAB has the expression in context, MATLAB might not need to create the full index vector first, but can construct the output without the intermediate vector. That guess is predicated on end and using :. If you did something else like [1 2 3 end], I don’t know if you’d see similar performance results. In any case, this is all speculation on my part.
–Loren
In response to jhofman, for this application bsxfun is only slightly slower than that website describes, but uses much less memory. Additional advantage is the easier to read code.
Oliver—In addition to the links Loren provided about logical indexing, I have posted about logical indexing as well.
Hi Jacob,
Sorry about the slow response. You are correct that the code would be slower without the JIT/accelerator.
The JIT/accelerator was first introduced in MATLAB 6.5 (Release 13). When you first run code, it is analyzed and, when possible, machine code is generated and used in place of the interpreted code. The JIT/accelerator initially sped up some calculations in loops that obeyed certain characteristics, e.g., scalar operations on double values. The JIT/accelerator has continued to evolve and handles much more of the MATLAB language, but still incurs costs the first time (or few times) through a loop so assumptions about variables being scalar, for example, can be verified. If the assumptions are ever broken, MATLAB does know to use the interpreted version of code. But if the assumptions hold, the loops can execute quickly.
Let’s look at the first case with nx1=nx2=50. When the JIT/accelerator is turned off, we get a speed around 371 seconds. Compared to the case I showed in the blog post of 296 seconds. Interestingly, you could put a tic-toc pair within the inner-most for-loop itself to see what effect the JIT has on each loop. The subsequent loops after the first loop are on average 27 times faster.
If you are working on elements of a matrix and iterating through them (such as one of our later iterations of the code), a single for-loop is for certain cases almost just as fast as the matrix operation, especially if you remember that MATLAB stores matrix columns in monotonically increasing memory locations. This means that processing data column-wise results in faster times then processing data row-wise.
Stuart McGarrity has a great News and Notes article written about this here:
http://www.mathworks.com/company/newsletters/news_notes/june07/patterns.html
We recommend not writing specifically for the JIT since the JIT is constantly evolving. You should though keep in mind the general good MATLAB-coding practices, such as those listed above.
Hello,
Is there any significant difference between using:
- Vectorization inside a subfunction
- Benefiting from arrayfun, cellfun, or maybe some Mex functions outside the subfunction
For example:
I have two functions f1 and f2 which has subfunctions f1sub and f2sub respectively:
function y = f1
x = [1 2 3];
y = f1sub(x);
function z = f1sub(x)
- - do some validations on x here - -
z = x.^2;
function y = f2
x = [1 2 3];
y = arrayfun(@f2sub,x)
function z = f2sub(x)
- - do some validations on x, (x should have only a single element) - -
z = x^2;
Thank you.
Oktay
Oktay-
It very much depends on the details of the calculations you are doing. Vectorization can sometimes require large intermediate arrays that might not be required by using cellfun, for example. Grabbing the extra memory can cost time. On the other hand, with cellfun, etc., you have the overhead of error checking/validation for each element instead of once for the whole array. So it depends on the balance of what you are doing and what the typical sizes are. I suggest you use the profiler to try out your specific example.
–Loren
I have some finite difference time domain code for solving maxwells equations - pretty basic stuff, and on a 100 by 100 by 100 grid it ran/runs very slowly the loop part of the code looks like
for t=2:1:T-2 tic for i=2:1:gridxs-2 for j=2:1:gridys-2 for k=2:1:gridzs-2 %%%%%%%%% dExdx=Ex(i+1,j,k)-Ex(i,j,k); dExdy=Ex(i,j+1,k)-Ex(i,j,k); dExdz=Ex(i,j,k+1)-Ex(i,j,k); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% dEydx=Ey(i+1,j,k)-Ey(i,j,k); dEydy=Ey(i,j+1,k)-Ey(i,j,k); dEydz=Ey(i,j,k+1)-Ey(i,j,k); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% dEzdx=Ez(i+1,j,k)-Ez(i,j,k); dEzdy=Ez(i,j+1,k)-Ez(i,j,k); dEzdz=Ez(i,j,k+1)-Ez(i,j,k); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Hx2(i,j,k)= Hx(i,j,k)-deltaH*(dEzdy-dEydz); Hy2(i,j,k)= Hy(i,j,k)-deltaH*(dExdz-dEzdx); Hz2(i,j,k)= Hz(i,j,k)-deltaH*(dEydx-dExdy); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% dHxdx=Hx2(i,j,k)-Hx2(i-1,j,k); dHxdy=Hx2(i,j,k)-Hx2(i,j-1,k); dHxdz=Hx2(i,j,k)-Hx2(i,j,k-1); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% dHydx=Hy2(i,j,k)-Hy2(i-1,j,k); dHydy=Hy2(i,j,k)-Hy2(i,j-1,k); dHydz=Hy2(i,j,k)-Hy2(i,j,k-1); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% dHzdx=Hz2(i,j,k)-Hz2(i-1,j,k); dHzdy=Hz2(i,j,k)-Hz2(i,j-1,k); dHzdz=Hz2(i,j,k)-Hz2(i,j,k-1); end end end toc end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Ex2(i,j,k)= Ex(i,j,k)+deltaE*(dHzdy-dHydz); Ey2(i,j,k)= Ey(i,j,k)+deltaE*(dHxdz-dHzdx); Ez2(i,j,k)= Ez(i,j,k)+deltaE*(dHydx-dHxdy);I am sure there are bugs in the code,but it runs so slowly that I can’t be bothered waiting for it to run to find out
tic toc gives 2.8 seconds….
I tried to vectorise my code (probably added some extra bugs along the way)
it now looks like
for t=2:1:T-2 tic dExdx(2:iiend,:,:)=Ex1(3:iiend+1,:,:)-Ex1(2:iiend,:,:); dExdy(:,2:jjend,:)=Ex1(:,3:jjend+1,:)-Ex1(:,2:jjend,:); dExdz(:,:,2:kkend)=Ex1(:,:,3:kkend+1)-Ex1(:,:,2:kkend); dEydx(2:iiend,:,:)=Ey1(3:iiend+1,:,:)-Ey1(2:iiend,:,:); dEydy(:,2:jjend,:)=Ey1(:,3:jjend+1,:)-Ey1(:,2:jjend,:); dEydz(:,:,2:kkend)=Ey1(:,:,3:kkend+1)-Ey1(:,:,2:kkend); dEzdx(2:iiend,:,:)=Ez1(3:iiend+1,:,:)-Ez1(2:iiend,:,:); dEzdy(:,2:jjend,:)=Ez1(:,3:jjend+1,:)-Ez1(:,2:jjend,:); dEzdz(:,:,2:kkend)=Ez1(:,:,3:kkend+1)-Ez1(:,:,2:kkend); Hx2(:,:,:)= Hx1(:,:,:)-deltaH*(dEzdy(:,:,:)-dEydz(:,:,:)); Hy2(:,:,:)= Hy1(:,:,:)-deltaH*(dExdz(:,:,:)-dEzdx(:,:,:)); Hz2(:,:,:)= Hz1(:,:,:)-deltaH*(dEydx(:,:,:)-dExdy(:,:,:)); dHxdx(2:iiend,:,:)=Hx1(3:iiend+1,:,:)-Hx1(2:iiend,:,:); dHxdy(:,2:jjend,:)=Hx1(:,3:jjend+1,:)-Hx1(:,2:jjend,:); dHxdz(:,:,2:kkend)=Hx1(:,:,3:kkend+1)-Hx1(:,:,2:kkend); dHydx(2:iiend,:,:)=Hy1(3:iiend+1,:,:)-Hy1(2:iiend,:,:); dHydy(:,2:jjend,:)=Hy1(:,3:jjend+1,:)-Hy1(:,2:jjend,:); dHydz(:,:,2:kkend)=Hy1(:,:,3:kkend+1)-Hy1(:,:,2:kkend); dHzdx(2:iiend,:,:)=Hz1(3:iiend+1,:,:)-Hz1(2:iiend,:,:); dHzdy(:,2:jjend,:)=Hz1(:,3:jjend+1,:)-Hz1(:,2:jjend,:); dHzdz(:,:,2:kkend)=Hz1(:,:,3:kkend+1)-Hz1(:,:,2:kkend); Ex2(:,:,:)= (Ex1(:,:,:))+deltaE*(dHzdy(:,:,:)-dHydz(:,:,:)); Ey2(:,:,:)= (Ey1(:,:,:))+deltaE*(dHxdz(:,:,:)-dHzdx(:,:,:)); Ez2(:,:,:)= (Ez1(:,:,:))+deltaE*(dHydx(:,:,:)-dHxdy(:,:,:)); toc endresult of tic toc– exactly the same 2.5 secs ???
I found that matlab has a numerical diff and numerical curl, in a speed test my code was at least twice as fast (though looking at the code clearly not as versatile)…
Eventually there was nothing left to do except make a mex file. Now it runs without hick-up in about 0.2 secs or so.
Why didn’t vecorisation work for me?
What can I have done wrong? I declared all my variables before hand with code such as
Ex(gridxs,gridzs,gridys)=0;
and analysed everything with profile
My only guess is memory problems, but as these are predeclared and allocated why should that matter?
-Ben
Ooops in my previous post the below code should be within the inner loop like all the rest of it
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Ex2(i,j,k)= Ex(i,j,k)+deltaE*(dHzdy-dHydz); Ey2(i,j,k)= Ey(i,j,k)+deltaE*(dHxdz-dHzdx); Ez2(i,j,k)= Ez(i,j,k)+deltaE*(dHydx-dHxdy);Ben –
So, I played around with your code a bit. Running your initial code on my laptop, it takes ~1.06 seconds. If I alter your vectorized code by changing all cases where you indexing into the whole matrix into just using the matrix itself:
Your vectorized code runs in ~0.78 seconds.
This is my first shot at vectorization:
gridxs = 100; gridys = 100; gridzs = 100; deltaH = 0.1; deltaE = 0.1; %% Vectorized Version RandStream.setDefaultStream(RandStream('mt19937ar','seed',1)); Hx = rand(gridxs,gridys,gridzs); Hy = rand(gridxs,gridys,gridzs); Hz = rand(gridxs,gridys,gridzs); Ex = rand(gridxs,gridys,gridzs); Ey = rand(gridxs,gridys,gridzs); Ez = rand(gridxs,gridys,gridzs); Hx22 = zeros(gridxs,gridys,gridzs); Hy22 = zeros(gridxs,gridys,gridzs); Hz22 = zeros(gridxs,gridys,gridzs); Ex22 = zeros(gridxs,gridys,gridzs); Ey22 = zeros(gridxs,gridys,gridzs); Ez22 = zeros(gridxs,gridys,gridzs); tic % dExdx2 = diff(Ex(2:end-1,2:end-2,2:end-2),1,1); dExdy2 = diff(Ex(2:end-2,2:end-1,2:end-2),1,2); dExdz2 = diff(Ex(2:end-2,2:end-2,2:end-1),1,3); dEydx2 = diff(Ey(2:end-1,2:end-2,2:end-2),1,1); % dEydy2 = diff(Ey(2:end-2,2:end-1,2:end-2),1,2); dEydz2 = diff(Ey(2:end-2,2:end-2,2:end-1),1,3); dEzdx2 = diff(Ez(2:end-1,2:end-2,2:end-2),1,1); dEzdy2 = diff(Ez(2:end-2,2:end-1,2:end-2),1,2); % dEzdz2 = diff(Ez(2:end-2,2:end-2,2:end-1),1,3); Hx22(2:end-2,2:end-2,2:end-2)= Hx(2:end-2,2:end-2,2:end-2)-deltaH*(dEzdy2-dEydz2); Hy22(2:end-2,2:end-2,2:end-2)= Hy(2:end-2,2:end-2,2:end-2)-deltaH*(dExdz2-dEzdx2); Hz22(2:end-2,2:end-2,2:end-2)= Hz(2:end-2,2:end-2,2:end-2)-deltaH*(dEydx2-dExdy2); % dHxdx2 = diff(Hx22(1:end-2,2:end-2,2:end-2),1,1); dHxdy2 = diff(Hx22(2:end-2,1:end-2,2:end-2),1,2); dHxdz2 = diff(Hx22(2:end-2,2:end-2,1:end-2),1,3); dHydx2 = diff(Hy22(1:end-2,2:end-2,2:end-2),1,1); % dHydy2 = diff(Hy22(2:end-2,1:end-2,2:end-2),1,2); dHydz2 = diff(Hy22(2:end-2,2:end-2,1:end-2),1,3); dHzdx2 = diff(Hz22(1:end-2,2:end-2,2:end-2),1,1); dHzdy2 = diff(Hz22(2:end-2,1:end-2,2:end-2),1,2); % dHzdz2 = diff(Hz22(2:end-2,2:end-2,1:end-2),1,3); Ex22(2:end-2,2:end-2,2:end-2)= Ex(2:end-2,2:end-2,2:end-2)+deltaE*(dHzdy2-dHydz2); Ey22(2:end-2,2:end-2,2:end-2)= Ey(2:end-2,2:end-2,2:end-2)+deltaE*(dHxdz2-dHzdx2); Ez22(2:end-2,2:end-2,2:end-2)= Ez(2:end-2,2:end-2,2:end-2)+deltaE*(dHydx2-dHxdy2);My code takes ~0.44 seconds to run. With more tweaking I could perhaps get it slower, but vectorization does improve the performance considerably. I also compare the answer from your code to mine and they give the same outputs. I used diff because it made for cleaner code.
Cheers,
Sarah
OK that’s interesting the changes don’t look like much at the moment, I will have to compare side by side to really see the difference I think. My first trial makes me think your laptop runs a bit faster than mine! I made the first change you suggested and it improved to around 1.8s, so not bad I think there shouldn’t be a big problem to improve on this now.
Thanks for the suggestions!