Linear Algebra in MATLAB: Trying out AMD’s AOCL
In R2022a, MathWorks started shipping AMD’s AOCL alongside Intel’s MKL in MATLAB. This article explains what these are and why you might care about them.
BLAS and LAPACK
A lot of modern technical computing is made possible by the BLAS and LAPACK libraries. You may not have heard of them but I'm almost certain you've used them. Have you ever multiplied two dense matrices together using MATLAB? If so, you are a BLAS user. Computed the eigenvalues or SVD of a dense matrix? Yes? You're a LAPACK user. Not a user of MATLAB but you've done these operations in Python or R? You're probably a user of these libraries too.
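For example, each of the following everyday MATLAB operations hands its heavy lifting to one of these libraries behind the scenes:

A = rand(1000);       % a dense 1000 x 1000 random matrix
B = rand(1000);

C = A*B;              % matrix-matrix multiplication: BLAS
e = eig(A);           % eigenvalues: LAPACK
[U,S,V] = svd(A);     % singular value decomposition: LAPACK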
BLAS and LAPACK are at the heart of many matrix-based calculations and so, as you might imagine, they are taken very seriously by the company that develops the MATrix LABoratory! When MATLAB first started using LAPACK back in 2000, for example, Cleve Moler himself wrote about many of the details and the speed improvements that followed.
Something many of us at MathWorks find interesting about those early results is that performance was then measured in Megaflops (10^6 floating point operations per second), thousands of times slower than the Gigaflops (10^9 flops) we expect from even the most modest laptop today. Earlier this year we entered a new era of computing when the Frontier supercomputer demonstrated that it could operate at 1 Exaflop (10^18 flops).
Two libraries, many implementations
One thing to know about BLAS and LAPACK is that there are many implementations of them. The so-called reference BLAS and LAPACK define the user interfaces and give easy-to-read, unoptimized implementations of each of the operations.
Different groups produce optimized implementations of these libraries using various strategies. I mentioned one example, OpenBLAS, in my post about the R2022a Apple Silicon beta version of MATLAB. Another example is Intel’s Math Kernel Library (MKL) which, as the name suggests, provides highly optimized versions of BLAS and LAPACK for Intel's own hardware.
Intel MKL has been MATLAB's provider of BLAS and LAPACK for a long time now. In MATLAB R2022a, for example, we have
>> version -lapack
ans =
'Intel(R) oneAPI Math Kernel Library Version 2021.3-Product Build 20210611 for Intel(R) 64 architecture applications (CNR branch AVX512_E1) supporting Linear Algebra PACKage (LAPACK 3.9.0)'
>> version -blas
ans =
'Intel(R) oneAPI Math Kernel Library Version 2021.3-Product Build 20210611 for Intel(R) 64 architecture applications (CNR branch AVX512_E1)'
MKL works just fine on AMD processors as well, but some of our users have been asking for official support for AMD's own accelerated implementations of these libraries. Called the AMD Optimizing CPU Libraries, or AOCL for short, these are developed by AMD and targeted at their own hardware although, as with Intel MKL, they work on both AMD and Intel processors.
AOCL is optionally available in MATLAB from R2022a
As of R2022a, we have started shipping AOCL with MATLAB, but it is not activated by default. Changing the default version of anything is not something MathWorks does lightly, so R2022a continues to use MKL by default; however, users of both Intel and AMD hardware (on Windows and Linux) can switch to the version of AOCL that has passed MathWorks qualification testing.
Instructions for making the switch are given in this MATLAB Answers post.
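Once you've followed those instructions and restarted MATLAB, you can confirm that the switch took effect from within MATLAB itself; the version strings should then report the AMD libraries rather than MKL (exactly what the AOCL version string looks like may vary by release):

>> version -blas
>> version -lapack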
The reason you might do this, of course, is speed. AOCL may be faster than MKL for some operations, and we are interested in hearing from you if you find that to be so. Of course, we are also very interested in learning of any problems you encounter in trying this out.
What performance differences can you expect to see?
You should only expect to see performance differences in functions that make use of linear algebra. Any differences will depend on factors such as the type of operation, the size and structure of the matrices, and exactly which CPU you are using.
It is not necessarily the case that one library always outperforms the other on any given piece of hardware. For example, using the script laBench.m and running on an Azure D16ads_v5 instance that exposes 8 cores of an AMD EPYC 7763, I found the following timings for 10,000 x 10,000 matrices:
Intel MKL results (best of 3)

- Matrix Multiply time is 5.60s
- Cholesky time is 1.07s

AOCL results (best of 3)

- Matrix Multiply time is 5.31s
- Cholesky time is 1.22s
So matrix-matrix multiplication is faster on this hardware using AOCL, but Cholesky decomposition is slower for this matrix size.
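If you'd like to run a similar experiment yourself, here is a minimal sketch of the kind of timing script I used. The real laBench.m may differ in its details; the matrix size and best-of-3 scheme below mirror the results quoted above, but everything else is an illustrative assumption.

% Sketch of a simple linear algebra benchmark (the real laBench.m may differ)
n = 10000;                    % matrix size used in the timings above
A = rand(n);
S = A + A' + 2*n*eye(n);      % diagonally dominant, hence symmetric positive definite

multTimes = zeros(1,3);
cholTimes = zeros(1,3);
for k = 1:3                   % best of 3 runs
    tic; C = A*A;     multTimes(k) = toc;   % matrix-matrix multiply (BLAS)
    tic; R = chol(S); cholTimes(k) = toc;   % Cholesky factorization (LAPACK)
end

fprintf('- Matrix Multiply time is %.2fs\n', min(multTimes));
fprintf('- Cholesky time is %.2fs\n', min(cholTimes));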
Over to you

I was very excited by this recent update and hope that you are too. If you give it a try, let me know how you get on in the comments section or via Twitter.