Life in the fast lane: Making MATLAB even faster on Apple Silicon with Apple Accelerate

Posted by Mike Croucher, December 13, 2023


Up to 3.7x faster Matrix-Matrix multiplication, 2x faster LU factorisation, 1.7x faster Cholesky decomposition and the potential for many more speed-ups across different areas of linear algebra in MATLAB.  What if I told you that you could have all of this right now on your Apple Silicon machine?  Interested?  Read on. 

It’s all about the BLAS

BLAS (Basic Linear Algebra Subprograms) is probably the most important software library you’ve never heard of. It takes care of things like multiplying two matrices together or multiplying vectors by a scalar. Operations so simple you may not even expect that something like MATLAB would use a library to perform them. 

The reason we use the BLAS library is speed! It’s relatively easy to implement these operations if you don’t care about speed but it is certainly not easy to implement them in such a way that they make maximum use of modern hardware. The difference in speed between a naively written matrix-matrix multiplication routine and a highly optimised one can be thousands of times!  
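To get a feel for that gap, here is a rough sketch that times a textbook triple-loop multiplication against MATLAB's built-in * operator, which calls into BLAS. The matrix size and variable names are illustrative choices of mine, and the ratio you see will depend on your machine and MATLAB version.

```matlab
% Compare a naive triple-loop matrix product with the BLAS-backed * operator.
n = 200;                 % kept small because the naive loop is slow
A = rand(n);
B = rand(n);

% Naive O(n^3) product with no blocking or vectorisation.
% A single tic/toc is only a rough measurement.
tic
C = zeros(n);
for i = 1:n
    for j = 1:n
        s = 0;
        for k = 1:n
            s = s + A(i, k) * B(k, j);
        end
        C(i, j) = s;
    end
end
tNaive = toc;

% Built-in product, timed with timeit for a more robust estimate.
tBlas = timeit(@() A*B);

% Both approaches should agree to within rounding error.
maxDiff = max(abs(C - A*B), [], 'all');

fprintf('naive: %.4f s, built-in: %.6f s, ratio: %.0fx\n', ...
        tNaive, tBlas, tNaive / tBlas);
```

Increasing n widens the gap, since the optimised routine makes far better use of caches, vector units and multiple cores.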

Linear algebra is at the heart of much of modern technical computing, so chip manufacturers such as Intel, AMD and Apple want to ensure that when you do linear algebra on their hardware it’s as fast as it can possibly be. As such, they all have their own implementations of the BLAS library, exquisitely tuned to show off the characteristics of their CPUs.

None of this is news to regular readers of The MATLAB blog of course. I have a minor obsession with the BLAS library and its big sibling, the LAPACK library, both of which have been covered multiple times in articles such as Trying out AMD’s AOCL and early posts exploring the beta versions of MATLAB on Apple Silicon (here and here). In these articles you can see how MATLAB’s support of BLAS has evolved recently, particularly for Apple Silicon.

You can now use Apple Accelerate as the BLAS for MATLAB on Apple Silicon

MATLAB R2023b, the first general release of MATLAB for Apple Silicon, uses OpenBLAS and, as I reported back in June, performance is pretty good. From the beginning, however, users asked us ‘Why aren’t you using Apple Accelerate, which contains Apple’s implementation of BLAS?’ We had our reasons (see the relevant section in this article) but the situation has since changed and MathWorks recently released an update allowing you to make use of it. It was done so quietly, however, that you almost certainly missed it! So let’s fix that.

You need to be running at least R2023b Update 4. Anything earlier than R2023b isn’t even running natively on Apple Silicon so start there. Once you have R2023b installed, check which update you are running with the version command:

version 

ans = 

'23.2.0.2428915 (R2023b) Update 4' 

If you are running an earlier update, fix that by clicking on Help->Check for Updates.
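If you would rather check this programmatically than read the version string by eye, something like the following sketch should work. It uses matlabRelease (available since R2020b) and computer('arch'); the variable names are my own, and I am assuming that 'maca64' is what a native Apple Silicon MATLAB reports.

```matlab
% Check that this is a native Apple Silicon MATLAB at R2023b Update 4 or later.
r = matlabRelease;   % object with Release, Stage, Update and Date properties

% Release is a string such as "R2023b", so string comparison orders releases.
isNewEnough = r.Release > "R2023b" || ...
              (r.Release == "R2023b" && r.Update >= 4);

% Assumption: native Apple Silicon builds report the architecture 'maca64'.
isAppleSilicon = strcmp(computer('arch'), 'maca64');

if isNewEnough && isAppleSilicon
    disp("Release and architecture look suitable for trying Apple Accelerate.");
elseif ~isAppleSilicon
    disp("This is not a native Apple Silicon MATLAB.");
else
    disp("Update to at least R2023b Update 4 first.");
end
```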

Once that’s done, you can switch from using OpenBLAS to Apple Accelerate by following the instructions at How can I use the BLAS implementations included in Apple Accelerate Framework with MATLAB R2023b Update 4
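Whichever library you end up with, you can ask MATLAB which BLAS and LAPACK it has loaded using the -blas and -lapack options of the version command. This is only a quick check that a switch has taken effect; I make no claim here about the exact text each library reports, so compare the output before and after the change.

```matlab
% Report the BLAS and LAPACK libraries this MATLAB session is using.
blasInfo   = version('-blas');
lapackInfo = version('-lapack');

fprintf('BLAS:   %s\n', blasInfo);
fprintf('LAPACK: %s\n', lapackInfo);
```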

Performance discussion

At the risk of stating the obvious, if what you’re interested in isn’t related to linear algebra then you are not going to see any performance differences with this change. This is MATLAB, though, so there’s a lot of linear algebra.

If you want to repeat these benchmarks for yourself, the script I used is on GitHub.

Matrix-Matrix multiplication

Matrix-Matrix multiplication is where I saw the most speed-up. This is a BLAS operation that has clearly seen a lot of work by Apple.  

Matrix Size	OpenBLAS time (s)	Apple Accelerate time (s)	x Speed-up
1,000	0.0172	0.0046	3.74
5,000	1.1583	0.4171	2.78
10,000	6.8977	3.3186	2.08
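A timing loop along the following lines is enough to produce numbers like those in the table. This is a minimal sketch rather than the benchmark script itself: the sizes are deliberately smaller than the 1,000 to 10,000 used above so that it runs quickly, and the variable names are mine.

```matlab
% Time dense matrix-matrix multiplication for a range of sizes.
sizes = [500 1000 2000];          % the table above used 1000, 5000 and 10000
times = zeros(size(sizes));

for k = 1:numel(sizes)
    n = sizes(k);
    A = rand(n);
    B = rand(n);
    % timeit runs the multiplication several times and returns a typical time.
    times(k) = timeit(@() A*B);
    fprintf('n = %5d: %.4f s\n', n, times(k));
end
```

Run it once with OpenBLAS and once with Apple Accelerate, then divide the first set of times by the second to get the speed-up column.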

LU factorization

LU factorization is a LAPACK operation and we haven’t changed LAPACK libraries here. However, LAPACK often makes use of BLAS so if you accelerate BLAS, you can get faster LAPACK for free.

Matrix Size	OpenBLAS time (s)	Apple Accelerate time (s)	x Speed-up
1,000	0.0124	0.0115	1.08
5,000	0.4345	0.2556	1.70
10,000	3.5928	1.6821	2.14
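For LU, a hedged sketch of the same idea looks like this. It times the single-output form of lu on one random matrix, then uses the three-output form to confirm that the factors reproduce the input. The size and names are illustrative.

```matlab
% Time LU factorization and check the result on a random matrix.
n = 1000;
A = rand(n);

% With one output, lu returns L and U packed into a single matrix.
tLU = timeit(@() lu(A));
fprintf('LU of %d x %d matrix: %.4f s\n', n, n, tLU);

% Three-output form: P*A = L*U, with P a permutation matrix.
[L, U, P] = lu(A);
relResidual = norm(P*A - L*U, 'fro') / norm(A, 'fro');
fprintf('relative residual: %.2e\n', relResidual);
```

Checking the residual is a cheap way to make sure that a faster library is still giving you the right answer.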

I also tested Cholesky decomposition and got speed-ups ranging from 1.28x to 1.77x.
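Cholesky needs a symmetric positive definite input, so a plain rand(n) matrix will not do. One common way to build a suitable test matrix, which I am assuming here and which may differ from what the benchmark script does, is to form M*M' and add a multiple of the identity.

```matlab
% Time Cholesky decomposition on a symmetric positive definite matrix.
n = 1000;
M = rand(n);
A = M*M' + n*eye(n);     % assumed construction: symmetric positive definite

tChol = timeit(@() chol(A));
fprintf('Cholesky of %d x %d matrix: %.4f s\n', n, n, tChol);

% chol returns upper triangular R with R'*R = A.
R = chol(A);
relResidual = norm(R'*R - A, 'fro') / norm(A, 'fro');
fprintf('relative residual: %.2e\n', relResidual);
```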

Eigenvalues

To test eigenvalue computation I used the single-output version of eig, e = eig(A). As with the other tests, I tried this on a range of random matrices. Things started off well, with nice speed-ups for small matrices, but larger matrices saw quite substantial slowdowns on my M2 machine when compared to the OpenBLAS original.

I discussed this with a colleague who has an M1 and he never saw any slowdowns. For his machine, it was always faster to use Apple Accelerate for eig. As such, we suspect that this is an M2-specific issue and will investigate more closely to find out what’s going on here.

Matrix Size	OpenBLAS time (s)	Apple Accelerate time (s)	x Speed-up
1,000	0.4245	0.2407	1.76
5,000	22.8654	24.74	0.92
10,000	145.4076	201.57	0.72
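The speed-up column in these tables is simply the OpenBLAS time divided by the Apple Accelerate time, so anything below 1 is a slowdown. Recomputing it from the eig timings above makes that explicit; the variable names are mine.

```matlab
% Recompute the speed-up column for eig from the timings in the table.
sizes       = [1000 5000 10000];
tOpenBLAS   = [0.4245 22.8654 145.4076];   % seconds
tAccelerate = [0.2407 24.74 201.57];       % seconds

speedup = tOpenBLAS ./ tAccelerate;        % > 1 means Accelerate is faster

for k = 1:numel(sizes)
    if speedup(k) >= 1
        verdict = 'faster';
    else
        verdict = 'slower';
    end
    fprintf('n = %5d: %.2fx (%s with Accelerate)\n', ...
            sizes(k), speedup(k), verdict);
end
```

The same division reproduces the speed-up columns in the matrix-matrix multiplication and LU tables.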

Details of test machine

I used an M2-based MacBook Pro that was plugged in. The output from the cpuinfo command (available on File Exchange) is as follows.

cpuinfo 

ans =  

  struct with fields: 

        CPUName: 'Apple M2 Pro'
          Clock: 'N/A'
          Cache: 65536
    TotalMemory: 1.7180e+10
        NumCPUs: 1
     TotalCores: 10
         OSType: 'macOS'
      OSVersion: '13.3.1'

Other than the issue with eig, which I hope will get fixed at some point, switching to Apple Accelerate seems to be extremely beneficial.  If you have an Apple Silicon Mac, give it a try and tell me what you come up with.