I'm pleased to introduce Tim Davis as this week's guest blogger. Tim is a professor at the University of Florida, and is the author or co-author of many of our sparse matrix functions (lu, chol, much of sparse backslash, ordering methods such as amd and colamd, and other functions such as etree and symbfact). He is also the author of a recent book, Direct Methods for Sparse Linear Systems, published by SIAM, where more details of MATLAB sparse matrices are discussed ( http://www.cise.ufl.edu/~davis ).
Contents
- MATLAB is Slow? Only If You Use It Incorrectly
- How Not to Create a Finite-Element Matrix
- What's Wrong with It?
- How MATLAB Stores Sparse Matrices
- A Better Way to Create a Finite-Element Matrix
- Moral: Do Not Abuse A(i,j)=... for Sparse A; Use sparse Instead
- Try a Faster Sparse Function
- When Fast is Not Fast Enough...
MATLAB is Slow? Only If You Use It Incorrectly
From time to time, I hear comments such as "MATLAB is slow for large finite-element problems." When I look at the details, the problem is typically due to an overuse of A(i,j)= ... when creating the sparse matrix (assembling the finite elements). This can be seen in typical user's code, MATLAB code in books on the topic, and even in MATLAB itself. The problem is widespread. MATLAB can be very fast for finite-element problems, but not if it's used incorrectly.
How Not to Create a Finite-Element Matrix
A good example of what not to do can be found in the wathen.m function, in MATLAB. A=gallery('wathen',200,200) takes a huge amount of time; a very minor modification cuts the run time drastically. I'm not intending to single out this one function for critique; this is a very common issue that I see over and over again. This function is built into MATLAB, which makes it an accessible example. It was first written when MATLAB did not support sparse matrices, and was modified only slightly to exploit sparsity. It was never meant for generating large sparse finite-element matrices. You can see the entire function with the command:
type private/wathen.m
Below is an excerpt of the relevant parts of wathen.m. The function wathen1.m creates a finite-element matrix of an nx-by-ny mesh. Each i loop creates a single 8-by-8 finite-element matrix, and adds it into A.
type wathen1
tic
A = wathen1 (200,200) ;
toc
function A = wathen1 (nx,ny)
rand('state',0)
e1 = [6 -6 2 -8;-6 32 -6 20;2 -6 6 -6;-8 20 -6 32];
e2 = [3 -8 2 -6;-8 16 -8 20;2 -8 3 -8;-6 20 -8 16];
e = [e1 e2; e2' e1]/45;
n = 3*nx*ny+2*nx+2*ny+1;
A = sparse(n,n);
RHO = 100*rand(nx,ny);
nn = zeros(8,1);
for j=1:ny
for i=1:nx
nn(1) = 3*j*nx+2*i+2*j+1;
nn(2) = nn(1)-1;
nn(3) = nn(2)-1;
nn(4) = (3*j-1)*nx+2*j+i-1;
nn(5) = 3*(j-1)*nx+2*i+2*j-3;
nn(6) = nn(5)+1;
nn(7) = nn(6)+1;
nn(8) = nn(4)+1;
em = e*RHO(i,j);
for krow=1:8
for kcol=1:8
A(nn(krow),nn(kcol)) = A(nn(krow),nn(kcol))+em(krow,kcol);
end
end
end
end
Elapsed time is 305.832709 seconds.
What's Wrong with It?
The above code is fine for generating a modest sized matrix, but the A(i,j) = ... statement is quite slow when A is large and sparse, particularly when i and j are also scalars. The inner two for-loops can be vectorized so that i and j are vectors of length 8. Each A(i,j) = ... statement is assembling an entire finite-element matrix into A. However, this leads to very minimal improvement in run time.
type wathen1b
tic
A1b = wathen1b (200,200) ;
toc
function A = wathen1b (nx,ny)
rand('state',0)
e1 = [6 -6 2 -8;-6 32 -6 20;2 -6 6 -6;-8 20 -6 32];
e2 = [3 -8 2 -6;-8 16 -8 20;2 -8 3 -8;-6 20 -8 16];
e = [e1 e2; e2' e1]/45;
n = 3*nx*ny+2*nx+2*ny+1;
A = sparse(n,n);
RHO = 100*rand(nx,ny);
nn = zeros(8,1);
for j=1:ny
for i=1:nx
nn(1) = 3*j*nx+2*i+2*j+1;
nn(2) = nn(1)-1;
nn(3) = nn(2)-1;
nn(4) = (3*j-1)*nx+2*j+i-1;
nn(5) = 3*(j-1)*nx+2*i+2*j-3;
nn(6) = nn(5)+1;
nn(7) = nn(6)+1;
nn(8) = nn(4)+1;
em = e*RHO(i,j);
A (nn,nn) = A (nn,nn) + em ;
end
end
Elapsed time is 282.945593 seconds.
disp (norm (A-A1b,1))
0
How MATLAB Stores Sparse Matrices
To understand why the above examples are so slow, you need to understand how MATLAB stores its sparse matrices. An n-by-n MATLAB sparse matrix is stored as three arrays; I'll call them p, i, and x. These three arrays are not directly accessible from M, but they can be accessed by a mexFunction. The nonzero entries in column j are stored in i(p(j):p(j+1)-1) and x(p(j):p(j+1)-1), where x holds the numerical values and i holds the corresponding row indices. Below is a very small example. First, I create a full matrix and convert it into a sparse one. This is only so that you can easily see the matrix C and how it's stored in sparse form. You should never create a sparse matrix this way, except for tiny examples.
C = [ 4.5 0.0 3.2 0.0 3.1 2.9 0.0 0.9 0.0 1.7 3.0 0.0 3.5 0.4 0.0 1.0 ] ; C = sparse (C)
C = (1,1) 4.5000 (2,1) 3.1000 (4,1) 3.5000 (2,2) 2.9000 (3,2) 1.7000 (4,2) 0.4000 (1,3) 3.2000 (3,3) 3.0000 (2,4) 0.9000 (4,4) 1.0000
Notice that the nonzero entries in C are stored in column order, with sorted row indices. The internal p, i, and x arrays can be reconstructed as follows. The find(C) statement returns a list of "triplets," where the kth triplet is i(k), j(k), and x(k). This specifies that C(i(k),j(k)) is equal to x(k). Next, find(diff(...)) constructs the column pointer array p (this only works if there are no all-zero columns in the matrix).
[i j x] = find (C) ; n = size(C,2) ; p = find (diff ([0 ; j ; n+1])) ; for col = 1:n fprintf ('column %d:\n k row index value\n', col) ; disp ([(p(col):p(col+1)-1)' i(p(col):p(col+1)-1) x(p(col):p(col+1)-1)]) end
column 1:
k row index value
1.0000 1.0000 4.5000
2.0000 2.0000 3.1000
3.0000 4.0000 3.5000
column 2:
k row index value
4.0000 2.0000 2.9000
5.0000 3.0000 1.7000
6.0000 4.0000 0.4000
column 3:
k row index value
7.0000 1.0000 3.2000
8.0000 3.0000 3.0000
column 4:
k row index value
9.0000 2.0000 0.9000
10.0000 4.0000 1.0000
Now consider what happens when one entry is added to C:
C(3,1) = 42 ; [i j x] = find (C) ; n = size(C,2) ; p = find (diff ([0 ; j ; n+1])) ; for col = 1:n fprintf ('column %d:\n k row index value\n', col) ; disp ([(p(col):p(col+1)-1)' i(p(col):p(col+1)-1) x(p(col):p(col+1)-1)]) end
column 1:
k row index value
1.0000 1.0000 4.5000
2.0000 2.0000 3.1000
3.0000 3.0000 42.0000
4.0000 4.0000 3.5000
column 2:
k row index value
5.0000 2.0000 2.9000
6.0000 3.0000 1.7000
7.0000 4.0000 0.4000
column 3:
k row index value
8.0000 1.0000 3.2000
9.0000 3.0000 3.0000
column 4:
k row index value
10.0000 2.0000 0.9000
11.0000 4.0000 1.0000
and you can see that nearly every entry in C has been moved down by one in the i and x arrays. In general, the single statement C(3,1)=42 takes time proportional to the number of entries in matrix. Thus, looping nnz(A) times over the statement A(i,j)=A(i,j)+... takes time proportional to nnz(A)^2.
A Better Way to Create a Finite-Element Matrix
The version below is only slightly different. It could be improved, but I left it nearly the same to illustrate how simple it is to write fast MATLAB code to solve this problem, via a minor tweak. The idea is to create a list of triplets, and let MATLAB convert them into a sparse matrix all at once. If there are duplicates (which a finite-element matrix always has) the duplicates are summed, which is exactly what you want when assembling a finite-element matrix. In MATLAB 7.3 (R2006b), sparse uses quicksort, which takes nnz(A)*log(nnz(A)) time. This is slower than it could be (a linear-time bucket sort can be used, taking essentially nnz(A) time). However, it's still much faster than nnz(A)^2. For this matrix, nnz(A) is about 1.9 million.
type wathen2.m
tic
A2 = wathen2 (200,200) ;
toc
function A = wathen2 (nx,ny)
rand('state',0)
e1 = [6 -6 2 -8;-6 32 -6 20;2 -6 6 -6;-8 20 -6 32];
e2 = [3 -8 2 -6;-8 16 -8 20;2 -8 3 -8;-6 20 -8 16];
e = [e1 e2; e2' e1]/45;
n = 3*nx*ny+2*nx+2*ny+1;
ntriplets = nx*ny*64 ;
I = zeros (ntriplets, 1) ;
J = zeros (ntriplets, 1) ;
X = zeros (ntriplets, 1) ;
ntriplets = 0 ;
RHO = 100*rand(nx,ny);
nn = zeros(8,1);
for j=1:ny
for i=1:nx
nn(1) = 3*j*nx+2*i+2*j+1;
nn(2) = nn(1)-1;
nn(3) = nn(2)-1;
nn(4) = (3*j-1)*nx+2*j+i-1;
nn(5) = 3*(j-1)*nx+2*i+2*j-3;
nn(6) = nn(5)+1;
nn(7) = nn(6)+1;
nn(8) = nn(4)+1;
em = e*RHO(i,j);
for krow=1:8
for kcol=1:8
ntriplets = ntriplets + 1 ;
I (ntriplets) = nn (krow) ;
J (ntriplets) = nn (kcol) ;
X (ntriplets) = em (krow,kcol) ;
end
end
end
end
A = sparse (I,J,X,n,n) ;
Elapsed time is 1.594073 seconds.
disp (norm (A-A2,1))
1.4211e-014
If you do not know how many entries your matrix will have, you may not be able to preallocate the I, J, and X arrays, as done in wathen2.m. In that case, start them at a reasonable size (anything larger than zero will do) and add this code to the innermost loop, just after ntriplets is incremented:
len = length (X) ;
if (ntriplets > len)
I (2*len) = 0 ;
J (2*len) = 0 ;
X (2*len) = 0 ;
endand when done, use sparse(I(1:ntriplets),J(1:ntriplets),X(1:ntriplets),n,n).
Moral: Do Not Abuse A(i,j)=... for Sparse A; Use sparse Instead
Avoid statements such as
C(4,2) = C(4,2) + 42 ;
in a loop. Create a list of triplets (i,j,x) and use sparse instead. This advice holds for any sparse matrix, not just finite-element ones.
Try a Faster Sparse Function
CHOLMOD includes a sparse2 mexFunction which is a replacement for sparse. It uses a linear-time bucket sort. The MATLAB 7.3 (R2006b) sparse accounts for about 3/4ths the total run time of wathen2.m. For this matrix sparse2 in CHOLMOD is about 10 times faster than the MATLAB sparse. CHOLMOD can be found in the SuiteSparse package, in MATLAB Central.
If you would like to see a short and concise C mexFunction implementation of the method used by sparse2, take a look at CSparse, which was written for a concise textbook-style presentation. The cs_sparse mexFunction uses cs_compress.c, to convert the triplets to a compressed column form of A', cs_dupl.c to sum up duplicate entries, and then cs_transpose.c to transpose the result (which also sorts the columns).
When Fast is Not Fast Enough...
Even with this dramatic improvement in constructing the matrix A, MATLAB could still use additional features for faster construction of sparse finite-element matrices. Constructing the matrix should be much faster than x=A\b, since chol is doing about 700 times more work as sparse for this matrix (1.3 billion flops, vs 1.9 million nonzeros in A). The run times are not that different, however:
tic A = wathen2 (200,200) ; toc b = rand (size(A,1),1) ; tic x=A\b ; toc
Elapsed time is 1.720397 seconds. Elapsed time is 3.125791 seconds.
Get
the MATLAB code
Published with MATLAB® 7.4

You suggested that users do effecient dynamic allocation with the code snippet
len = length (X) ;
if (ntriplets > len)
I (2*len) = 0 ;
…
What are the pros and cons of having Matlab arrays do this automatically, like C++ std::vector?
Hi,
I added the following line in ‘wathen2′ after the nested loops
mymat = sortrows([ I J X ], [ 2 1 ]);
and then created the sparse matrix
A = sparse(mymat(:, 1), mymat(:, 2), mymat(:, 3), n, n);
as suggested.
I was curious if sorting the index vectors would affect performance (although one might expect ’sparse’ to do something akin to this by default).
I timed the original and modified codes for 10 consecutive runs and got a surprising (?) result:
>> test
Elapsed time is 45.761231 seconds.
>> test
Elapsed time is 21.771210 seconds.
The code using sorted index vectors were about twice as fast.
Any comments on this behavior?
Best regards,
David
David,
I tried your suggested change. In both MATLAB 7.3 (R2006b) and 7.4 (R2007a), the original non-sorted version was twice as fast as the version with the sort.
What version of MATLAB and platform are you running the tests with?
Regards,
Eric Sampson
The MathWorks,Inc.
Hi Eric,
I can reproduce these times (or similar) in MATLAB-versions R2006a and R2006b.
My platform is an Intel Pentium 4 660 Prescott (3.6 GHz), running Fedora Core 6, so I use the 64-bit version of MATLAB for Linux.
In my runs (R2006a) a sort using ’sortrows’ takes almost 1.5 s, followed by ’sparse’ which barley needs 0.3 s. Without presorting ’sparse’ takes about 3.9 s to complete instead.
I can provide my function (’wathen3.m’) and script (’test.m’), should it be of any interest to you.
I tried my sorting recipe in my FE-code, which provided a decent speedup for larger matrices.
Best regards,
David
I can confirm that the situation is reversed, i.e., like you described it with non-sorting being faster, on a 32-bit system using R2006b (AMD Mobile CPU, running Ubuntu Linux 6.10).
Why I got the opposite before—which indeed I did—bets me.
Best regards,
David
Reply to Tom Richardson:
If you allow MATLAB to grow the size of the I, J,
and X arrays, then they will grow by just one entry
each iteration, and the total work will be quadratic
(nnz(A)^2). The method I propose doubles the size
of I, J and X, and the total work will be O(nnz(A)),
using the ‘dynamic table’ method.
I’m not a C++ expert, but looking over the documentation
of the std::vector class, it seems that this class
makes a distinction between the size of the vector
(size(X,1) in MATLAB), and the “capacity” of the vector.
The capacity is how much extra space has been allocated,
in “elbow room” that allows the vector to grow in size
without reallocating memory. MATLAB does not have any
distinction between “size” and “capacity”, with the
hard-to-use exception of the difference between nnz(A)
and nzmax(A) for a sparse A.
What you are really asking for is for MATLAB to take
O(n) for the following loop:
n = 1e5 ;
tic
x = 0 ;
for k = 1:n
x (k) = k ;
end
toc
(elapsed time: 90 seconds)
To do so, MATLAB would have to implement some notion
of “capacity”, and allocate more space than needed
when an entry outside the current size if x is accessed.
I don’t think that’s necessary, when the following will
work just as well (with a fprintf added to see what’s
happening):
y = 0 ;
tic
for k = 1:n
if (k > length (y))
y (2 * length (y)) = 0 ;
fprintf (’k %d, y now has length %d\n’, k, length (y)) ;
end
y (k) = k ;
end
y = y (1:n) ;
toc
(elapsed time: 0.7 seconds, even with the printf’s).
x and y are now equal.
The output is:
k 2, y now has length 2
k 3, y now has length 4
k 5, y now has length 8
k 9, y now has length 16
k 17, y now has length 32
k 33, y now has length 64
k 65, y now has length 128
k 129, y now has length 256
k 257, y now has length 512
k 513, y now has length 1024
k 1025, y now has length 2048
k 2049, y now has length 4096
k 4097, y now has length 8192
k 8193, y now has length 16384
k 16385, y now has length 32768
k 32769, y now has length 65536
k 65537, y now has length 131072
If the time to reallocate y is equal to the size of y,
then the total work is a sum of powers of two, or 2*n total.
The same thing can be done with a C++ std::vector, if you
resize the vector by doubling its capacity anytime its
capacity is insufficient.
Reference: Cormen, Leiserson, Rivest, “Introduction to
Algorithms”, MIT Press, section 18.4 on Dynamic Tables.
Regarding the sortrows issue: in MATLAB 7.4 (released today) and in all previous versions of MATLAB with sparse matrices, “sparse” uses quicksort to sort the triplets, taking O(K*log(K)) time to do so, if it has K triplets. If the triplets are already sorted on input, you can tell from your runtimes that this initial sort is skipped. I’m speaking in generalities, about things that can be observed from “sparse” just by looking at its run time. I’m not looking at the code. You can measure it to be O(K*log(K)) for creating a matrix from K triplets (if the triplets are sorted), and O(K) otherwise. Quite a bit can be determined about how an algorithm works with just tic and toc, without looking at the code.
The CHOLMOD “sparse2″ is asymptotically faster than the “sparse” in MATLAB; it’s about 10 times faster for this wathen(200,200) matrix. sparse2 uses a linear-time bucket sort. Since it’s so fast, it doesn’t bother to check if the input is sorted or not.
When I, J, X are not sorted, “sparse” takes 3.0 seconds on my machine. sortrows takes 2.4 seconds, and then “sparse” on the result takes 0.5 seconds. The total time is about the same.
Using sparse2 on the unsorted I,J,X takes 0.4 seconds; it takes the same time on the sorted I,J,X.
These timings are on a 32-bit Pentium 4, with a 32-bit MATLAB. I don’t yet have a 64-bit port for any of my sparse mexFunctions, because The MathWorks has made it quite difficult to write a portable sparse mexFunction that seemlessly works on 32-bit/64-bit, and current/older versions of MATLAB. My codes do work on 64-bit MATLAB (x=A\b works), but not my own mexFunctions. So I can’t tell you how fast “sparse2″ is on a 64-bit MATLAB.
The fact that you’re seeing an improvement in “sparse” by sorting the triplets first is really an artifact of the “sparse” algorithm itself. A better algorithm (”sparse2″) is unaffected by the initial order of the triplets. If The MathWorks were to switch to the algorithm used by “sparse2″ then using “sortrows” first would actually slow your code down, since sortrows must take O(K*log(k)) time.
I suppose the reason that sortrows+sparse is faster than sparse on a 64-bit platform, is that internally, sparse is dealing with a 64-bit integer. sortrows is dealing with a 64-bit floating-point value. Perhaps 64-bit integer comparisons are slower than 64-bit fl.pt. comparisons. This is just a guess.
Oops. I got it backwards in post #7. “You can measure it to be O(K*log(K)) for creating a matrix from K triplets (if the triplets are sorted), and O(K) otherwise.” is backwards.
It should be “O(K log (K)) if the triplets are UNsorted, and O(K) otherwise.”
In my original post, I should have cited the paper by John Gilbert, Cleve Moler, and Rob Schreiber, “Sparse matrices in MATLAB”, SIAM J. Matrix Analysis & Applications, vol 13, 1992, pp. 333-356. They clearly point out the problem that A(i,j)=B and B=A(i,j) are both intrinsically slow. This is not a “bug” in MATLAB; it is intrinsically impossible to provide a data structure that does all things well. If MATLAB were to use a data structure that made A(i,j)=B fast, then (say), C=A*B where A and B are sparse, or x=A\b, would be horribly slow. Take your pick; I think John, Cleve, and Rob made the right choice.
You can read their paper at http://www.mathworks.com/access/helpdesk/help/pdf_doc/otherdocs/simax.pdf
Half tongue-in-cheek … but perhaps mlint could flag A(i,j)=B and B=A(i,j) usages and ask “Please read the Gilbert/Moler/Schreiber paper, page 10, Section 3.1.4; after reading it click OK to continue” …
First I would like to thank Tim for all help his blog post has provided me with… thank you!
Next, as you might have guessed (?), I’ll ask for yet some advice:
The problem is my index vectors, should we call them ‘row’, ‘col’ and ‘val’, grow too large. In fact that large I run out of RAM (3 x 16e6 x 8 / 1024^2 ~ 366 MB).
Alas, any uint-format seems incompatible with ’sparse’ or ’sparse2′, which otherwise would have brought down memory consumption. I suppose another remedy could be—besides the obvious one of changing to improved hardware—to simply update the sparse matrix once the index vectors start draining memory (that way they could be reset and then the FE assembly process allowed to carry on). But that would, in some sense, be out of line with the spirit of this discussion … :)
So, finally, my question becomes if it is possible to sum duplicate entries of the index vectors instead, i.e., if
row = [1 2 2 4 4 5]
col = [2 4 4 3 3 1]
val = [1 2 3 4 5 6]
could be transferred into
row = [1 2 4 5]
col = [2 4 3 1]
val = [1 5 9 6]
in an efficient manner? Could that prove better performing than updating the sparse matrix?
This is only a matter for large scale problems, but still something to consider, I presume.
The line below seems to do the work, but I have not put it to any serious test:
[row, col, val] = find(accumarray([row, col], val, [], @sum, [], true));
Maybe it is all inefficient.
Hi!
I’m trying to understand why the Matlab function LU.m takes almost 20 times more time to factorize a sparse matrix with nnz=784418 than another one with nnz=1238796, the two matrices being numerically the same as showed by the code below
>> [IindSCL,JindSCL,SnzSCL,nbrequaSCL]=Stokes3D_SCL(8,8,1,7,0,1);
>> [Iindcor,Jindcor,Snzcor,nbrequacor]=Stokes3Dcor(8,8,1,1,1,0,1);
>> a=(IindSCL==Iindcor);b=(JindSCL==Jindcor);c=(nbrequaSCL==nbrequacor);
>> find(a==0)
ans =
Empty matrix: 0-by-1
>> find(b==0), find(c==0)
ans =
Empty matrix: 0-by-1
ans =
[]
>> max(max(abs(SnzSCL-Snzcor)))
ans =
3.552713678800501e-15
>> tic;matSCL=sparse2(IindSCL,JindSCL,SnzSCL,nbrequaSCL+1,nbrequaSCL+1);toc
Elapsed time is 0.993688 seconds.
>> tic;matcor=sparse2(Iindcor,Jindcor,Snzcor,nbrequacor+1,nbrequacor+1);toc
Elapsed time is 0.961044 seconds.
>> tic;matSCL=sparse(IindSCL,JindSCL,SnzSCL,nbrequaSCL+1,nbrequaSCL+1);toc
Elapsed time is 1.882607 seconds.
>> tic;matcor=sparse(Iindcor,Jindcor,Snzcor,nbrequacor+1,nbrequacor+1);toc
Elapsed time is 1.893259 seconds.
>>
>>
>> tic;[L_SCL,U_SCL,P_SCL,Q_SCL,R_SCL]=lu(matSCL);toc
Elapsed time is 4.642097 seconds.
>> tic;[L_cor,U_cor,P_cor,Q_cor,R_cor]=lu(matcor);toc
Elapsed time is 84.450131 seconds.
>>
>>
>> nnz(matSCL),nnz(matcor)
ans =
1238796
ans =
784418
>> max(max(abs(matSCL-matcor)))
ans =
(1,1) 3.552713678800501e-15
>>
May be someone could help
Here’s Cleve’s reply to Etienne:
The crucial factor is the number and location of the nonzero elements in the LU factors, not in the original matrices. This doesn’t show us the original matrices. “Numerically the same” apparently means the difference is small. That’s not enough. The magnitude of the nonzeros is irrelevant; it’s their existence that matters.
— Cleve