{"id":710,"date":"2013-06-24T12:00:35","date_gmt":"2013-06-24T17:00:35","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=710"},"modified":"2022-08-24T16:45:20","modified_gmt":"2022-08-24T20:45:20","slug":"the-linpack-benchmark","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2013\/06\/24\/the-linpack-benchmark\/","title":{"rendered":"The LINPACK Benchmark"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>By reaching 33.86 petaflops on the LINPACK Benchmark, China's Tianhe-2 supercomputer has just become the world's fastest computer. The technical computing community thinks of LINPACK not as a matrix software library, but as a benchmark.  In that role LINPACK has some attractive aspects, but also some undesirable features.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#8350f31b-5101-4539-a01a-e9a179d79366\">Tianhe-2<\/a><\/li><li><a href=\"#28e77db7-cd26-4e60-a17b-fcb44e75b378\">Top500<\/a><\/li><li><a href=\"#1c798af1-76ec-4d56-9c6b-952c87ffa426\">Benchmark Origins<\/a><\/li><li><a href=\"#88b05154-bd5a-4b1d-82ee-0b44ca566699\">Evolution<\/a><\/li><li><a href=\"#74e79ddb-862a-4a7a-9301-6176ff47dc8a\">36 years<\/a><\/li><li><a href=\"#3508a65b-f56f-4224-84cb-3a7cd1604b6c\">Green 500<\/a><\/li><li><a href=\"#3a9943f8-956d-40fb-b43a-cb2c61af9abc\">Geek Talk<\/a><\/li><li><a href=\"#4652ebc7-5056-4df6-ae9b-c1a2af5c0ba5\">Criticism<\/a><\/li><li><a href=\"#6de8f145-24a2-44a3-a9ec-d19d73fe79f1\">Legacy<\/a><\/li><li><a href=\"#4e72f394-cb62-432c-8e52-3e9c50cb6f50\">Home Runs<\/a><\/li><li><a href=\"#f63eb907-9aa0-4d5d-986e-e0ce143c5182\">Exaflops<\/a><\/li><li><a href=\"#385d1d83-197d-4bcf-b8e6-7cf201decc93\">A New Benchmark<\/a><\/li><\/ul><\/div><h4>Tianhe-2<a name=\"8350f31b-5101-4539-a01a-e9a179d79366\"><\/a><\/h4><p><a href=\"\">An announcement<\/a> made last week at the International Supercomputing Conference in Leipzig, Germany, declared the Tianhe-2 to be the world's fastest 
computer. The Tianhe-2, also known as the MilkyWay-2, is being built in Guangzhou by China's National University of Defense Technology. Tianhe-2 has 16,000 nodes, each with two Intel Xeon Ivy Bridge processors and three Xeon Phi processors, for a combined total of 3,120,000 computing cores.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"http:\/\/blogs.mathworks.com\/cleve\/files\/tianhe-2_small.jpg\" alt=\"\"> <\/p><h4>Top500<a name=\"28e77db7-cd26-4e60-a17b-fcb44e75b378\"><\/a><\/h4><p>The ranking of the world's fastest computers is based on the LINPACK Benchmark. Since 1993, LINPACK benchmark results have been collected by the <a href=\"http:\/\/www.top500.org\">Top500 project<\/a>.  They announce their results twice a year at international supercomputing conferences. <a href=\"https:\/\/www.isc-hpc.com\/all-years.html\">Last week's conference in Leipzig<\/a> produced this <a href=\"\">Top 500 List<\/a>. The next <a href=\"http:\/\/sc13.supercomputing.org\">Supercomputing Conference<\/a> will be in November in Denver.<\/p><p>Tianhe-2's top speed of 33.86 petaflops on the latest Top500 list is nearly twice as fast as number two Titan from Oak Ridge at 17.59 petaflops and number three Sequoia from Lawrence Livermore at 17.17 petaflops.<\/p><h4>Benchmark Origins<a name=\"1c798af1-76ec-4d56-9c6b-952c87ffa426\"><\/a><\/h4><p>The LINPACK benchmark is an accidental offspring of the development of the LINPACK software package in the 1970's.  During the development we asked two dozen universities and laboratories to test the software on a variety of main frame machines that were then available in central computer centers.  We also asked them to measure the time required for two subroutines in the package, DGEFA and DGESL, to solve a 100-by-100 system of simultaneous linear equations. 
With the LINPACK naming conventions, DGEFA stands for Double precision GEneral matrix FActor and DGESL stands for Double precision GEneral matrix SoLve.<\/p><p>Appendix B of the <a href=\"http:\/\/epubs.siam.org\/doi\/book\/10.1137\/1.978161197181B1\">LINPACK Users' Guide<\/a> has the timing results. The hand-written notes shown here are Jack Dongarra's calculation of the <i>megaflop<\/i> rate, millions of floating point operations per second. With a matrix of order $n$, the megaflop rate for a factorization by Gaussian elimination plus two triangular solves is<\/p><p>$megaflops = (\\frac{2}{3}n^3 + 2n^2)\/(time)\/10^6$<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"http:\/\/blogs.mathworks.com\/cleve\/files\/linpackbm.jpeg\" alt=\"\"> <\/p><p>In 1977, we chose $n$ = 100 because this was large enough to stress the campus main frames of the day.  Notice that almost half of the machines required a second or more to solve the equations.  In fact, the DEC-10 at Yale did not have enough memory to handle a 100-by-100 system. We had to use $n$ = 75 and extrapolate the execution time.<\/p><p>The fastest machine in the world in 1977, according to this LINPACK benchmark, was the <a href=\"https:\/\/www.flickr.com\/photos\/jpstanley\/13812345193\">newly installed Cray-1<\/a> at NCAR, the National Center for Atmospheric Research in Boulder. Its performance was 14 megaflops.<\/p><h4>Evolution<a name=\"88b05154-bd5a-4b1d-82ee-0b44ca566699\"><\/a><\/h4><p>Jack's notes were the beginning of LINPACK as a benchmark. He continued to collect performance data.  From time to time he would publish lists, initially of all the machines for which he had reported results, then later of just the fastest. 
Importantly, Jack has always required an <i>a posteriori<\/i> test on the relative residual,<\/p><pre>    r = norm(A*x - b)\/(norm(A)*norm(x))<\/pre><p>More than once, a large residual has revealed some underlying hardware or system fault.<\/p><p>For a number of years, Jack insisted that the original Fortran DGEFA and DGESL be used with n = 100.  Only the underlying BLAS routines could be customized.  As processors became faster and memories larger, he relaxed these requirements.<\/p><p>The LINPACK software package has been replaced by <a href=\"http:\/\/www.netlib.org\/lapack\">LAPACK<\/a>, which features block algorithms to take advantage of cache memories. Matrix computation on distributed memory parallel computers with message-passing interprocessor communication is now handled by <a href=\"http:\/\/www.netlib.org\/scalapack\">ScaLAPACK<\/a>.<\/p><p>The benchmark itself has become <a href=\"http:\/\/www.netlib.org\/benchmark\/hpl\/\">HPL<\/a>, which is described as \"A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers\".  The web page at the <a href=\"http:\/\/icl.cs.utk.edu\/\">Innovative Computing Laboratory<\/a> at the University of Tennessee, which Dongarra directs, summarizes the algorithm employed by HPL with these keywords:<\/p><p>\r\n<p style=\"margin-left:3ex;\">\r\nTwo-dimensional block-cyclic data distribution - Right-looking variant\r\nof the LU factorization with row partial pivoting featuring multiple\r\nlook-ahead depths - Recursive panel factorization with pivot search and\r\ncolumn broadcast combined - Various virtual panel broadcast topologies\r\n- bandwidth reducing swap-broadcast algorithm - backward substitution\r\nwith look-ahead of depth 1.\r\n<\/p>\r\n<\/p><h4>36 years<a name=\"74e79ddb-862a-4a7a-9301-6176ff47dc8a\"><\/a><\/h4><p>We went from 14 <b>megaflops<\/b> on the Cray-1 at NCAR in 1977 to almost 34 <b>petaflops<\/b> on the Tianhe-2 in China in 2013.  
That's a speed-up by a factor of $2.4 \\cdot 10^9$ in 36 years.  This is a doubling of speed about every 14 months.  That is amazing!<\/p><p>Moore's Law that transistor counts double roughly every 24 months would account for a speed-up over 36 years of only $2^{(12\/24)36}=2.6 \\cdot 10^5$. Moore's Law that transistor counts double and clock speeds get faster roughly every eighteen months would account for a speed-up of something like $2^{(12\/18)36}=1.7 \\cdot 10^7$.  Neither of these comes close to explaining what we've seen with LINPACK.  In addition to faster hardware, the speed-up in LINPACK is due to all the algorithmic innovation represented by those terms in the description of HPL.<\/p><h4>Green 500<a name=\"3508a65b-f56f-4224-84cb-3a7cd1604b6c\"><\/a><\/h4><p>Many years ago we introduced the Intel Hypercube at a small conference on parallel computing in Knoxville.  The machine required 220 volts, but the hotel conference room we were using didn't have 220.  So we rented a gas generator, put it in the parking lot, and ran a long cable past the swimming pool into the conference room.  When I gave my talk about \"Parallel LINPACK on the Hypercube\", some wise guy asked \"How many megaflops per gallon are you getting?\"  It was intended as a joke at the time, but it is really an important question today.  I've made this cartoon.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"http:\/\/blogs.mathworks.com\/cleve\/files\/generator_small.jpg\" alt=\"\"> <\/p><p>Energy consumption of big machines is a very important issue. When I took a tour of the Oak Ridge Leadership Computing Facility a few years ago, Buddy Bland, who is Project Director, told me he could tell when he comes to the Lab in the morning if they are running the LINPACK benchmark by looking at how much steam is coming out of the cooling towers.  
LINPACK generates lots of heat.<\/p><p>The <a href=\"http:\/\/www.green500.org\">Green 500<\/a> is looking for energy-efficient supercomputers.  The project also uses LINPACK, but with a different metric, gigaflops per watt. On the most recent list, issued last November, the world's most energy-efficient supercomputer is a machine called <a title=\"http:\/\/www.nics.tennessee.edu\/beacon (link no longer works)\">Beacon<\/a>, also in Tennessee, at the University of Tennessee National Institute for Computational Sciences. A new Green 500 list is scheduled to be released June 28th.<\/p><h4>Geek Talk<a name=\"3a9943f8-956d-40fb-b43a-cb2c61af9abc\"><\/a><\/h4><p>I've only heard the LINPACK Benchmark mentioned once in the mainstream public media, and that was not in an entirely positive light.  I was listening to Monday Night Football on the radio.  (That tells you how long ago it was.)  It was a commercial for personal computers.  A guy goes into a computer store.  The salesman confuses him with incomprehensible geek talk about \"Fortran Linpack megaflops\".  So he leaves and goes to Sears where he gets straight talk from the salesman and buys a Gateway.<\/p><p>A few years ago a TV commercial for IBM during the Super Bowl subtly boasted about \"petaflops\".  That was also about LINPACK, but the commercial didn't say so.<\/p><h4>Criticism<a name=\"4652ebc7-5056-4df6-ae9b-c1a2af5c0ba5\"><\/a><\/h4><p>An <a href=\"http:\/\/www.hpcwire.com\/hpcwire\/2013-03-27\/flops_fall_flat_for_intelligence_agency.html\">excellent article<\/a> by Nicole Hemsoth in the March 27 issue of <i>HPCWire<\/i> set me thinking again about criticism of the LINPACK benchmark and led to this blog post.  
The headline read:<\/p><p><b>FLOPS Fall Flat for Intelligence<\/b><\/p><p>The article quotes a Request For Information (RFI) from the Intelligence Advanced Research Projects Activity about possible future agency programs.<\/p><p>\r\n<p style=\"margin-left:3ex;\">\r\nIn this RFI we seek information about novel technologies that have the\r\npotential to enable new levels of computational performance with\r\ndramatically lower power, space and cooling requirements than the HPC\r\nsystems of today. Importantly, we also seek to broaden the definition of\r\nhigh performance computing beyond today's commonplace floating point\r\nbenchmarks, which reflect HPC's origins in the modeling and analysis of\r\nphysical systems. While these benchmarks have been invaluable in providing\r\nthe metrics that have driven HPC research and development, they have also\r\nconstrained the technology and architecture options for HPC system\r\ndesigners. The HPC benchmarking community has already started to move\r\nbeyond the traditional floating point benchmarks with new benchmarks\r\nfocused on data intensive analysis of large graphs and on power efficiency.\r\n<\/p>\r\n<\/p><p>I certainly agree with this point of view.  Thirty-five years ago floating point multiplication was a relatively expensive operation and the LINPACK benchmark was well correlated with execution times for more complicated technical computing workloads. But today the time for various kinds of memory access usually dominates the time for arithmetic operations.  
Most large matrices are sparse and involve different data structures and storage access patterns.<\/p><p>Even accepting this criticism as valid, I see nothing in the near future that can challenge the LINPACK Benchmark for its role in determining the World's Fastest Computer.<\/p><h4>Legacy<a name=\"6de8f145-24a2-44a3-a9ec-d19d73fe79f1\"><\/a><\/h4><p>We have now collected a huge amount of data about the performance of a wide range of computers over the last 35 years.  It is true that large dense systems of simultaneous linear equations are no longer representative of the range of problems encountered in high performance computing.  Nevertheless, tracking the performance on this one problem over the years gives a valuable view into the history of computing generally.<\/p><p>The <a href=\"http:\/\/www.top500.org\/statistics\">Top 500 Web site<\/a> and the Top 500 presentations at the supercomputer conferences provide fascinating views of this data.<\/p><h4>Home Runs<a name=\"4e72f394-cb62-432c-8e52-3e9c50cb6f50\"><\/a><\/h4><p>The LINPACK Benchmark in technical computing is a little like the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Home_Run_Derby\">Home Run Derby<\/a> in baseball. Home runs don't always decide the result of a baseball game, or determine which team is the best over an entire season.  But it sure is interesting to keep track of home run statistics over the years.<\/p><h4>Exaflops<a name=\"f63eb907-9aa0-4d5d-986e-e0ce143c5182\"><\/a><\/h4><p>If you go to supercomputer meetings like I do, you hear a lot of talk about getting to \"exaflops\".  That's $10^{18}$ floating point operations per second. A thousand petaflops.  \"Only\" 30 times faster than Tianhe-2.  Current government research programs and, I assume, industrial research programs are aimed at developing machines capable of exaflop computation in just a few years from now.<\/p><p>But exaflop computation on what?  Just the LINPACK Benchmark?  
I certainly hope that the first machine to reach an exaflop on the Top500 is capable of some actually useful work as well.<\/p><h4>A New Benchmark<a name=\"385d1d83-197d-4bcf-b8e6-7cf201decc93\"><\/a><\/h4><p>In <a href=\"http:\/\/www.hpcwire.com\/hpcwire\/2013-06-18\/alternatives_emerge_as_linpack_loses_ground.html\">a recent interview<\/a> in <i>HPC Wire<\/i>, done at the ISC in Germany, Jack Dongarra echoes some of the same opinions about the LINPACK Benchmark that I have expressed here. He points to a <a href=\"http:\/\/www.netlib.org\/utk\/people\/JackDongarra\/PAPERS\/HPCG-Benchmark-utk.pdf\">recent technical report<\/a> that he has written with Mike Heroux titled \"Toward a New Metric for Ranking High Performance Computing Systems\". They describe work in progress involving a large scale application of the preconditioned conjugate gradient algorithm intended to complement LINPACK in benchmarking supercomputers. They're not trying to replace LINPACK, just trying to include something else. It will be interesting to see if they make any headway.  
I wish them luck.<\/p><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\">Published with MATLAB&reg; R2018a<br><\/p><\/div>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"http:\/\/blogs.mathworks.com\/cleve\/files\/tianhe-2_small.jpg\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>By reaching 33.86 petaflops on the LINPACK Benchmark, China's Tianhe-2 supercomputer has just become the world's fastest computer. The technical computing community thinks of LINPACK not as a matrix software library, but as a benchmark.  In that role LINPACK has some attractive aspects, but also some undesirable features.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2013\/06\/24\/the-linpack-benchmark\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[4,14,19],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/710"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=710"}],"version-history":[{"count":12,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/710\/revisions"}],"predecessor-version":[{"id":9026,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/710\/revisions\/9026"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=710"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=710"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=710"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}