I'm pleased to introduce Tim Davis as this week's guest blogger. Tim is a professor at the University of Florida, and is the
author or co-author of many of our sparse matrix functions (lu, chol, much of sparse backslash, ordering methods such as amd
and colamd, and other functions such as etree and symbfact). He is also the author of a recent book, Direct Methods for Sparse Linear Systems, published by SIAM, where more details of MATLAB sparse matrices are discussed ( http://www.cise.ufl.edu/~davis ).
- MATLAB is Slow? Only If You Use It Incorrectly
- How Not to Create a Finite-Element Matrix
- What's Wrong with It?
- How MATLAB Stores Sparse Matrices
- A Better Way to Create a Finite-Element Matrix
- Moral: Do Not Abuse A(i,j)=... for Sparse A; Use sparse Instead
- Try a Faster Sparse Function
- When Fast is Not Fast Enough...
From time to time, I hear comments such as "MATLAB is slow for large finite-element problems." When I look at the details,
the problem is typically due to an overuse of A(i,j)= ... when creating the sparse matrix (assembling the finite elements). This can be seen in typical users' code, MATLAB code in
books on the topic, and even in MATLAB itself. The problem is widespread. MATLAB can be very fast for finite-element problems,
but not if it's used incorrectly.
A good example of what not to do can be found in the wathen.m function, in MATLAB. A=gallery('wathen',200,200) takes a huge amount of time; a very minor modification cuts the run time drastically. I'm not intending to single out this
one function for critique; this is a very common issue that I see over and over again. This function is built into MATLAB,
which makes it an accessible example. It was first written when MATLAB did not support sparse matrices, and was modified
only slightly to exploit sparsity. It was never meant for generating large sparse finite-element matrices. You can list the entire function source in MATLAB with the type command.
Below is an excerpt of the relevant parts of wathen.m. The function wathen1.m creates a finite-element matrix of an nx-by-ny mesh. Each i loop creates a single 8-by-8 finite-element matrix, and adds it into A.
type wathen1
tic
A = wathen1 (200,200) ;
toc
function A = wathen1 (nx,ny)
rand('state',0)
e1 = [6 -6 2 -8;-6 32 -6 20;2 -6 6 -6;-8 20 -6 32];
e2 = [3 -8 2 -6;-8 16 -8 20;2 -8 3 -8;-6 20 -8 16];
e = [e1 e2; e2' e1]/45;
n = 3*nx*ny+2*nx+2*ny+1;
A = sparse(n,n);
RHO = 100*rand(nx,ny);
nn = zeros(8,1);
for j=1:ny
    for i=1:nx
        nn(1) = 3*j*nx+2*i+2*j+1;
        nn(2) = nn(1)-1;
        nn(3) = nn(2)-1;
        nn(4) = (3*j-1)*nx+2*j+i-1;
        nn(5) = 3*(j-1)*nx+2*i+2*j-3;
        nn(6) = nn(5)+1;
        nn(7) = nn(6)+1;
        nn(8) = nn(4)+1;
        em = e*RHO(i,j);
        for krow=1:8
            for kcol=1:8
                A(nn(krow),nn(kcol)) = A(nn(krow),nn(kcol)) + em(krow,kcol);
            end
        end
    end
end

Elapsed time is 305.832709 seconds.
The above code is fine for generating a modest-sized matrix, but the A(i,j) = ... statement is quite slow when A is large and sparse, particularly when i and j are scalars. The inner two for-loops can be vectorized so that i and j are vectors of length 8; each A(i,j) = ... statement then assembles an entire 8-by-8 finite-element matrix into A at once. However, this leads to only a minimal improvement in run time.
type wathen1b
tic
A1b = wathen1b (200,200) ;
toc
function A = wathen1b (nx,ny)
rand('state',0)
e1 = [6 -6 2 -8;-6 32 -6 20;2 -6 6 -6;-8 20 -6 32];
e2 = [3 -8 2 -6;-8 16 -8 20;2 -8 3 -8;-6 20 -8 16];
e = [e1 e2; e2' e1]/45;
n = 3*nx*ny+2*nx+2*ny+1;
A = sparse(n,n);
RHO = 100*rand(nx,ny);
nn = zeros(8,1);
for j=1:ny
    for i=1:nx
        nn(1) = 3*j*nx+2*i+2*j+1;
        nn(2) = nn(1)-1;
        nn(3) = nn(2)-1;
        nn(4) = (3*j-1)*nx+2*j+i-1;
        nn(5) = 3*(j-1)*nx+2*i+2*j-3;
        nn(6) = nn(5)+1;
        nn(7) = nn(6)+1;
        nn(8) = nn(4)+1;
        em = e*RHO(i,j);
        A (nn,nn) = A (nn,nn) + em ;
    end
end

Elapsed time is 282.945593 seconds.
disp (norm (A-A1b,1))
To understand why the above examples are so slow, you need to understand how MATLAB stores its sparse matrices. An n-by-n
MATLAB sparse matrix is stored as three arrays; I'll call them p, i, and x. These three arrays are not directly accessible from M-code, but they can be accessed by a mexFunction. The nonzero entries in
column j are stored in i(p(j):p(j+1)-1) and x(p(j):p(j+1)-1), where x holds the numerical values and i holds the corresponding row indices. Below is a very small example. First, I create a full matrix and convert it into a sparse one. This is only so that you can easily see the matrix C and how it's stored in sparse form. You should never create a sparse matrix this way, except for tiny examples.
C = [ 4.5 0.0 3.2 0.0
      3.1 2.9 0.0 0.9
      0.0 1.7 3.0 0.0
      3.5 0.4 0.0 1.0 ] ;
C = sparse (C)
C =
   (1,1)       4.5000
   (2,1)       3.1000
   (4,1)       3.5000
   (2,2)       2.9000
   (3,2)       1.7000
   (4,2)       0.4000
   (1,3)       3.2000
   (3,3)       3.0000
   (2,4)       0.9000
   (4,4)       1.0000
Notice that the nonzero entries in C are stored in column order, with sorted row indices. The internal p, i, and x arrays can be reconstructed as follows. The find(C) statement returns a list of "triplets," where the kth triplet is i(k), j(k), and x(k). This specifies that C(i(k),j(k)) is equal to x(k). Next, find(diff(...)) constructs the column pointer array p (this only works if there are no all-zero columns in the matrix).
[i j x] = find (C) ;
n = size (C,2) ;
p = find (diff ([0 ; j ; n+1])) ;
for col = 1:n
    fprintf ('column %d:\n k row index value\n', col) ;
    disp ([(p(col):p(col+1)-1)' i(p(col):p(col+1)-1) x(p(col):p(col+1)-1)])
end
column 1:
 k row index value
    1.0000    1.0000    4.5000
    2.0000    2.0000    3.1000
    3.0000    4.0000    3.5000
column 2:
 k row index value
    4.0000    2.0000    2.9000
    5.0000    3.0000    1.7000
    6.0000    4.0000    0.4000
column 3:
 k row index value
    7.0000    1.0000    3.2000
    8.0000    3.0000    3.0000
column 4:
 k row index value
    9.0000    2.0000    0.9000
   10.0000    4.0000    1.0000
Now consider what happens when one entry is added to C:
C(3,1) = 42 ;
[i j x] = find (C) ;
n = size (C,2) ;
p = find (diff ([0 ; j ; n+1])) ;
for col = 1:n
    fprintf ('column %d:\n k row index value\n', col) ;
    disp ([(p(col):p(col+1)-1)' i(p(col):p(col+1)-1) x(p(col):p(col+1)-1)])
end
column 1:
 k row index value
    1.0000    1.0000    4.5000
    2.0000    2.0000    3.1000
    3.0000    3.0000   42.0000
    4.0000    4.0000    3.5000
column 2:
 k row index value
    5.0000    2.0000    2.9000
    6.0000    3.0000    1.7000
    7.0000    4.0000    0.4000
column 3:
 k row index value
    8.0000    1.0000    3.2000
    9.0000    3.0000    3.0000
column 4:
 k row index value
   10.0000    2.0000    0.9000
   11.0000    4.0000    1.0000
and you can see that nearly every entry in C has been moved down by one position in the i and x arrays. In general, the single statement C(3,1)=42 takes time proportional to the number of entries in the matrix. Thus, looping nnz(A) times over statements of the form A(i,j)=A(i,j)+... takes time proportional to nnz(A)^2.
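You can see this effect directly with a small experiment of my own (not from the original post); the sizes n and nz below are arbitrary choices. The same matrix is built twice, once by repeated scalar assignment and once with a single call to sparse:

```matlab
% My own illustrative experiment: build the same random sparse matrix
% one entry at a time, and then all at once with sparse.
n  = 1000 ;                        % matrix dimension (arbitrary)
nz = 5000 ;                        % number of (possibly duplicate) entries
i = floor (n * rand (nz, 1)) + 1 ; % random row indices in 1..n
j = floor (n * rand (nz, 1)) + 1 ; % random column indices in 1..n
x = rand (nz, 1) ;

tic
A = sparse (n, n) ;
for k = 1:nz
    % each assignment may shift every later entry in the i and x arrays
    A (i(k), j(k)) = A (i(k), j(k)) + x (k) ;
end
t_loop = toc ;

tic
B = sparse (i, j, x, n, n) ;       % duplicates are summed automatically
t_once = toc ;

fprintf ('loop: %g sec, sparse: %g sec\n', t_loop, t_once) ;
disp (norm (A - B, 1))             % the two matrices agree (up to roundoff)
```

The gap between the two times widens rapidly as nz grows, reflecting the O(nnz(A)^2) versus O(nnz(A) log nnz(A)) behavior.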
The version below is only slightly different. It could be improved, but I left it nearly the same to illustrate how simple
it is to write fast MATLAB code to solve this problem, via a minor tweak. The idea is to create a list of triplets, and let
MATLAB convert them into a sparse matrix all at once. If there are duplicates (which a finite-element matrix always has)
the duplicates are summed, which is exactly what you want when assembling a finite-element matrix. In MATLAB 7.3 (R2006b),
sparse uses quicksort, which takes nnz(A)*log(nnz(A)) time. This is slower than it could be (a linear-time bucket sort can be used, taking essentially nnz(A) time). However, it's still much faster than nnz(A)^2. For this matrix, nnz(A) is about 1.9 million.
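The duplicate-summing behavior of sparse is easy to verify on a tiny example of my own (not from the post); the entry (1,1) is specified twice:

```matlab
% sparse sums duplicate triplets: entry (1,1) appears twice below,
% with values 3 and 4.
i = [1 1 2]' ;
j = [1 1 2]' ;
x = [3 4 5]' ;
S = sparse (i, j, x, 2, 2) ;
full (S)
% S(1,1) is 3 + 4 = 7; this summation is exactly the assembly
% operation a finite-element code needs.
```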
type wathen2.m
tic
A2 = wathen2 (200,200) ;
toc
function A = wathen2 (nx,ny)
rand('state',0)
e1 = [6 -6 2 -8;-6 32 -6 20;2 -6 6 -6;-8 20 -6 32];
e2 = [3 -8 2 -6;-8 16 -8 20;2 -8 3 -8;-6 20 -8 16];
e = [e1 e2; e2' e1]/45;
n = 3*nx*ny+2*nx+2*ny+1;
ntriplets = nx*ny*64 ;
I = zeros (ntriplets, 1) ;
J = zeros (ntriplets, 1) ;
X = zeros (ntriplets, 1) ;
ntriplets = 0 ;
RHO = 100*rand(nx,ny);
nn = zeros(8,1);
for j=1:ny
    for i=1:nx
        nn(1) = 3*j*nx+2*i+2*j+1;
        nn(2) = nn(1)-1;
        nn(3) = nn(2)-1;
        nn(4) = (3*j-1)*nx+2*j+i-1;
        nn(5) = 3*(j-1)*nx+2*i+2*j-3;
        nn(6) = nn(5)+1;
        nn(7) = nn(6)+1;
        nn(8) = nn(4)+1;
        em = e*RHO(i,j);
        for krow=1:8
            for kcol=1:8
                ntriplets = ntriplets + 1 ;
                I (ntriplets) = nn (krow) ;
                J (ntriplets) = nn (kcol) ;
                X (ntriplets) = em (krow,kcol) ;
            end
        end
    end
end
A = sparse (I,J,X,n,n) ;

Elapsed time is 1.594073 seconds.
disp (norm (A-A2,1))
If you do not know how many entries your matrix will have, you may not be able to preallocate the I, J, and X arrays, as done in wathen2.m. In that case, start them at a reasonable size (anything larger than zero will do) and add this code to the innermost loop,
just after ntriplets is incremented:
len = length (X) ;
if (ntriplets > len)
    I (2*len) = 0 ;
    J (2*len) = 0 ;
    X (2*len) = 0 ;
end
and when done, use sparse(I(1:ntriplets),J(1:ntriplets),X(1:ntriplets),n,n).
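Putting those pieces together, here is a complete sketch of my own (the problem size and the diagonal "entries" are just placeholders for a real assembly loop) of triplet accumulation when the final count is unknown:

```matlab
% Hypothetical sketch (mine): accumulate triplets when the final count
% is unknown, doubling the I, J, X arrays whenever they run out of room.
n = 100 ;                               % stand-in problem size
I = zeros (16, 1) ;                     % any nonzero initial size will do
J = zeros (16, 1) ;
X = zeros (16, 1) ;
ntriplets = 0 ;
for k = 1:n                             % stand-in for a loop whose trip
    ntriplets = ntriplets + 1 ;         % count is not known a priori
    len = length (X) ;
    if (ntriplets > len)
        I (2*len) = 0 ;                 % writing entry 2*len doubles the array
        J (2*len) = 0 ;
        X (2*len) = 0 ;
    end
    I (ntriplets) = k ;                 % this demo just builds a diagonal
    J (ntriplets) = k ;
    X (ntriplets) = 1 ;
end
% keep only the triplets actually used
A = sparse (I (1:ntriplets), J (1:ntriplets), X (1:ntriplets), n, n) ;
```

Each doubling copies the arrays once, so the total copying cost stays linear in the final number of triplets.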
Avoid statements such as
C(4,2) = C(4,2) + 42 ;
in a loop. Create a list of triplets (i,j,x) and use sparse instead. This advice holds for any sparse matrix, not just finite-element ones.
CHOLMOD includes a sparse2 mexFunction, which is a replacement for sparse that uses a linear-time bucket sort. The MATLAB 7.3 (R2006b) sparse accounts for about three-quarters of the total run time of wathen2.m. For this matrix, sparse2 in CHOLMOD is about 10 times faster than the MATLAB sparse. CHOLMOD can be found in the SuiteSparse package, on MATLAB Central.
If you would like to see a short and concise C mexFunction implementation of the method used by sparse2, take a look at CSparse, which was written for a concise textbook-style presentation. The cs_sparse mexFunction uses cs_compress.c, to convert the triplets to a compressed column form of A', cs_dupl.c to sum up duplicate entries, and then cs_transpose.c to transpose the result (which also sorts the columns).
Even with this dramatic improvement in constructing the matrix A, MATLAB could still use additional features for faster construction of sparse finite-element matrices. Constructing the
matrix should be much faster than x=A\b, since chol does about 700 times as much work as sparse for this matrix (1.3 billion flops, versus 1.9 million nonzeros in A). The run times are not that different, however:
tic
A = wathen2 (200,200) ;
toc
b = rand (size(A,1),1) ;
tic
x = A\b ;
toc
Elapsed time is 1.720397 seconds.
Elapsed time is 3.125791 seconds.
Published with MATLAB® 7.4
Comments
val = eigs (Kr, Mr, 10, 'sm')

My reduced matrices Kr and Mr are finite-element stiffness and mass matrices. They are obtained from the sparse, positive semidefinite matrices K and M as follows:

Kr = H'*K*H
Mr = H'*M*H

The matrix H is mostly zeros off the diagonal. The diagonal terms are 1's. Some off-diagonal terms are complex, exp(1i*pi/4) for example. H is not square; for example, K is 10-by-10 and H is 10-by-6. Nominally, I expect Kr and Mr to be positive semidefinite and Hermitian. This is what I obtain if M, K, and H are defined as full. If, however, I do this operation with sparse matrices (K, M, and H), I get very small real terms of the order of 1e-9 (max(K) ~ 1e9) off the diagonal. This triggers the following error in eigs:

Generalized matrix B must be the same size as A and either a symmetric positive (semi-)definite matrix or its Cholesky factor.

This is a very common problem that many users encounter. I traced the error to line 805 of the function eigs. If I comment out this error, the function works properly and I get the correct eigenvalues for my problem. Could you comment on this? Is it possible the sparse-matrix multiplication introduces noise? Could the error about the Cholesky factor be too strict a condition? Thank you, Alex
K(element_dofs,element_dofs) = K(element_dofs,element_dofs) + elem_K;

as opposed to assembling it again and again in every iteration. It boils down to whether dereferencing a created sparse matrix is faster in comparison to creating one? -Meenakshi
NG = 20000; N = 5e5; Nt = 10;
for loop = 1 : Nt
    col1 = floor(NG*rand(N, 1)) + 1;
    col2 = rand(N, 1);
    A = sparse(col1, 1, col2, NG, 1);
end

The A vector is then used elsewhere (not shown). Thanks Terence
A = zeros(NG,1);
for k = 1:N
    A(col1(k)) = A(col1(k)) + col2(k);
end

While this is not quite as elegant as
A = sparse(col1, 1, col2, NG, 1);

it is linear in the number of elements in col1 and avoids a sort. On the other hand, if many entries of A wind up zero, then the code above will most likely use more memory. Take care. Pat.