{"id":396,"date":"2012-11-26T12:14:59","date_gmt":"2012-11-26T17:14:59","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=396"},"modified":"2021-01-13T09:38:03","modified_gmt":"2021-01-13T14:38:03","slug":"magic-squares-meet-supercomputing","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2012\/11\/26\/magic-squares-meet-supercomputing\/","title":{"rendered":"Magic Squares Meet Supercomputing"},"content":{"rendered":"<!DOCTYPE html\r\n  PUBLIC \"-\/\/W3C\/\/DTD HTML 4.01 Transitional\/\/EN\">\r\n<style type=\"text\/css\">\r\n\r\nh1 { font-size:18pt; }\r\nh2.titlebg { font-size:13pt; }\r\nh3 { color:#4A4F55; padding:0px; margin:5px 0px 5px; font-family:Arial, Helvetica, sans-serif; font-size:11pt; font-weight:bold; line-height:140%; border-bottom:1px solid #d6d4d4; display:block; }\r\nh4 { color:#4A4F55; padding:0px; margin:0px 0px 5px; font-family:Arial, Helvetica, sans-serif; font-size:10pt; font-weight:bold; line-height:140%; border-bottom:1px solid #d6d4d4; display:block; }\r\n   \r\np { padding:0px; margin:0px 0px 20px; }\r\nimg { padding:0px; margin:0px 0px 20px; border:none; }\r\np img, pre img, tt img, li img { margin-bottom:0px; } \r\n\r\nul { padding:0px; margin:0px 0px 20px 23px; list-style:square; }\r\nul li { padding:0px; margin:0px 0px 7px 0px; background:none; }\r\nul li ul { padding:5px 0px 0px; margin:0px 0px 7px 23px; }\r\nul li ol li { list-style:decimal; }\r\nol { padding:0px; margin:0px 0px 20px 0px; list-style:decimal; }\r\nol li { padding:0px; margin:0px 0px 7px 23px; list-style-type:decimal; }\r\nol li ol { padding:5px 0px 0px; margin:0px 0px 7px 0px; }\r\nol li ol li { list-style-type:lower-alpha; }\r\nol li ul { padding-top:7px; }\r\nol li ul li { list-style:square; }\r\n\r\npre, tt, code { font-size:12px; }\r\npre { margin:0px 0px 20px; }\r\npre.error { color:red; }\r\npre.codeinput { padding:10px; border:1px solid #d3d3d3; background:#f7f7f7; }\r\npre.codeoutput { padding:10px 11px; margin:0px 0px 20px; color:#4c4c4c; }\r\n\r\n@media print { pre.codeinput, pre.codeoutput { word-wrap:break-word; width:100%; } }\r\n\r\nspan.keyword { color:#0000FF }\r\nspan.comment { color:#228B22 }\r\nspan.string { color:#A020F0 }\r\nspan.untermstring { color:#B20000 }\r\nspan.syscmd { color:#B28C00 }\r\n\r\n.footer { width:auto; padding:10px 0px; margin:25px 0px 0px; border-top:1px dotted #878787; font-size:0.8em; line-height:140%; font-style:italic; color:#878787; text-align:left; float:none; }\r\n.footer p { margin:0px; }\r\n\r\n  <\/style><div class=\"content\"><!--introduction--><p>I have just returned from the huge <a href=\"http:\/\/sc12.supercomputing.org\">Supercomputing 2012<\/a> conference in Salt Lake City. I can report on interesting reactions to some questions I posed in last week's blog and on some impressive speedup results when we ran the code in last week's blog on a parallel cluster in the Cloud.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#80518664-c875-4c12-b0d4-f252bea8c7e1\">Linear Algebra Links<\/a><\/li><li><a href=\"#19d5f2c0-80fa-44aa-94da-77151501627c\">For Loops Nested Seven Deep<\/a><\/li><li><a href=\"#34e8be78-dbbe-4957-959c-83eb78784a50\">Parallel, Please<\/a><\/li><li><a href=\"#d0a075f5-6d99-4f2f-b496-5fce477e4533\">First Try<\/a><\/li><li><a href=\"#f6cd3c84-77a1-4ba2-bd07-e7e49988ba4b\">DCS on the EC2<\/a><\/li><li><a href=\"#1ecd581c-d033-42c8-b1ee-741f81209e3f\">My Modification<\/a><\/li><li><a href=\"#f9f8c168-304e-4e2d-bc8c-f31cb9a80cea\">Jos's Modification<\/a><\/li><li><a href=\"#ceab3618-b87d-46c2-84f1-309c6303bf7b\">Edric's Modification<\/a><\/li><li><a href=\"#015bafbb-cd39-4d83-9650-1be3e3ad2f46\">Speedup<\/a><\/li><li><a href=\"#ea211e2b-6c46-43f1-90dc-ce14b6ca431b\">More Speedup<\/a><\/li><li><a href=\"#a02b7679-62ee-46a9-b971-10085d46f9f2\">enumerate4_p<\/a><\/li><\/ul><\/div><h4>Linear Algebra Links<a name=\"80518664-c875-4c12-b0d4-f252bea8c7e1\"><\/a><\/h4><p>Last week's blog generated several interesting comments and emails. Included were web links to a wealth of information about the linear algebra of magic squares.<\/p><p>Francis Gaspalou sent me one email.  Francis is a French engineer who has worked on magic squares for many years as a hobby.  His web site is <a href=\"http:\/\/www.gaspalou.fr\/magic-squares\">\/http:\/\/www.gaspalou.fr\/magic-squares<\/a>.<\/p><p>The paper \"Magic Square Spectra\" is full of information about the matrices generated by the MATLAB <tt>magic<\/tt> function, as well as other magic squares. In particular, my question from last week about the rank of magic squares of order five is answered in the paper. The authors are Peter Loly, Ian Cameron, Water Trump, and Daniel Schindel. The paper was published in 2009 in <i>Linear Algebra and its Applications<\/i>, volume 430, issue 10. The official link to the paper on the Elsevier web site is <a href=\"http:\/\/dx.doi.org\/10.1016\/j.laa.2008.10.032\">DOI link<\/a>. It is also possible to do a Google search on the paper title to find a pdf copy of the paper on an author's web site.<\/p><h4>For Loops Nested Seven Deep<a name=\"19d5f2c0-80fa-44aa-94da-77151501627c\"><\/a><\/h4><p>Last week I described the investigation of the rank of magic squares of order four, including the matrix <tt>magic(4)<\/tt> generated by MATLAB, and its permutation, which appears in D&uuml;rer's <i>Melancholia<\/i>. I discussed a program with <tt>for<\/tt>-loops nested seven deep over the integers from 1 to 16 that generates 57,657,600 candidate 4-by-4 matrices and finds the 7040 that are actually magic squares. The matrices are generated using a parameterization from <a href=\"\">a paper by Bill Press<\/a>.<\/p><p>The indentation of the <tt>for<\/tt>-loops and the complexity of the parameterization cause the properly formatted program, <a href=\"https:\/\/blogs.mathworks.com\/images\/cleve\/enumerate4.m\">enumerate4.m<\/a>, to be too wide to fit on a blog page, so I have it in a separate file. But here is the essence of the program.<\/p><pre class=\"codeinput\"><span class=\"comment\">%   % ENUM.  Essence of enumerate4.m<\/span>\r\n<span class=\"comment\">%   % Enumerate magic squares of order 4.<\/span>\r\n<span class=\"comment\">%   % Count number of each rank.<\/span>\r\n<span class=\"comment\">%<\/span>\r\n<span class=\"comment\">%   R = zeros(1,4)<\/span>\r\n<span class=\"comment\">%   for a = 1:16<\/span>\r\n<span class=\"comment\">%     for b = 1:16<\/span>\r\n<span class=\"comment\">%       for c = 1:16<\/span>\r\n<span class=\"comment\">%         for e = 1:16<\/span>\r\n<span class=\"comment\">%           for f = 1:16<\/span>\r\n<span class=\"comment\">%             for g = 1:16<\/span>\r\n<span class=\"comment\">%               for i = 1:16<\/span>\r\n<span class=\"comment\">%                 A = [xxxx]<\/span>\r\n<span class=\"comment\">%                 if ismagic(A)<\/span>\r\n<span class=\"comment\">%                   k = rank(A)<\/span>\r\n<span class=\"comment\">%                   R(k) = R(k)+1<\/span>\r\n<span class=\"comment\">%                 end<\/span>\r\n<span class=\"comment\">%               end<\/span>\r\n<span class=\"comment\">%             end<\/span>\r\n<span class=\"comment\">%           end<\/span>\r\n<span class=\"comment\">%         end<\/span>\r\n<span class=\"comment\">%       end<\/span>\r\n<span class=\"comment\">%     end<\/span>\r\n<span class=\"comment\">%   end<\/span>\r\n<\/pre><h4>Parallel, Please<a name=\"34e8be78-dbbe-4957-959c-83eb78784a50\"><\/a><\/h4><p>This program cries out to be run in parallel.  With the loop structure of this simplified program, the inner block would be executed $16^7$, or roughly 268 million times.  Actually, <tt>if<\/tt>-statements that I have left out of the simplified program reduce that to 57 million. But they're all independent.  They could be run in any order, in principle on 57 million processors.<\/p><p>But, wait.  Look at that vector, <tt>R<\/tt>.  It's the output.  It's what we're trying to compute.  It's a row vector of length 4 that, when the program terminates, will contain counts of each of the observed ranks. All of the processors that we have working for us will have to cooperate during the computation, or queue up at the end, to accumulate the totals observed for each rank. Spawning the parallel tasks is easy. Coordinating their cooperation in accumulating the count is hard.<\/p><p>Think of little kids at a birthday party.  Getting each to sing a little song individually is easy.  But having them all sing, in unison, on key, is a challenge.<\/p><p>The 57 million matrices yield only 7040 magic squares, which in turn produce only four rank counts.  This is a dramatic example of an important general process known as <i>reduction<\/i>.  The accumulation of the counts in <tt>R<\/tt> is a <i>sum reduction<\/i>.<\/p><h4>First Try<a name=\"d0a075f5-6d99-4f2f-b496-5fce477e4533\"><\/a><\/h4><p>When I wrote last week's blog, I was already thinking about running <tt>enumerate4<\/tt> in parallel.  As I was writing the blog, I tried simply changing<\/p><pre class=\"language-matlab\"><span class=\"keyword\">for<\/span> a = 1:16\r\n<\/pre><p>to<\/p><pre class=\"language-matlab\"><span class=\"keyword\">parfor<\/span> a = 1:16\r\n<\/pre><p>When I tried to run this on my laptop with the Parallel Computing Toolbox, I got an error message (in red, which I can't reproduce in this blog)<\/p><pre>Error using enum (line 11)\r\nError: The variable R in a parfor cannot be classified.\r\nSee Parallel for Loops in MATLAB, \"Overview\".<\/pre><p>This didn't surprise me.  The message is pointing to the discussion in the documentation of various kinds of reductions.  Our parser doesn't have enough information about what the program might do to the array <tt>R<\/tt>.  I need to give the parser some help.<\/p><h4>DCS on the EC2<a name=\"f6cd3c84-77a1-4ba2-bd07-e7e49988ba4b\"><\/a><\/h4><p>In fact, when last week's blog was being published on Monday afternoon, I was already only my way to <a href=\"http:\/\/sc12.supercomputing.org\">Supercomputing 2012<\/a> in Salt Lake City. MathWorks has had a booth at the SC show for several years and we've always set up our own small parallel cluster.  But not this year. Instead we used the MATLAB Distributed Computing Server on the Amazon Elastic Compute Cloud.  That's DCS on the EC2.<\/p><p>When I got to Salt Lake, Jos Martin, Edric Ellis, and others from our Parallel Computing team in the UK were setting up the MathWorks booth. We logged on to Amazon over SCinet, the high performance network set up for the SC show, and easily established a pool of 32 workers.<\/p><h4>My Modification<a name=\"1ecd581c-d033-42c8-b1ee-741f81209e3f\"><\/a><\/h4><p>I need to clarify the statement<\/p><pre>    R(k) = R(k) + 1<\/pre><p>in the inner loop. Here is the most obvious solution.  Replace the initialization<\/p><pre>  R = zeros(1,4)<\/pre><p>with<\/p><pre>  R3 = 0;\r\n  R4 = 0;<\/pre><p>And then replace the statement in the inner loop with<\/p><pre>     if k == 3\r\n        R3 = R3 + 1;\r\n     else\r\n        R4 = R4 + 1;\r\n     end<\/pre><p>This is simply giving names to only two components of <tt>R<\/tt> that are actually involved in the computation. It is like dividing the kids at the birthday party into two groups, each of which can sing in unison.<\/p><h4>Jos's Modification<a name=\"f9f8c168-304e-4e2d-bc8c-f31cb9a80cea\"><\/a><\/h4><p>Jos Martin suggested expanding the single statement in the inner loop into three statements.<\/p><pre>    t = zeros(4,1);\r\n    t(k) = 1;\r\n    R = R + t;<\/pre><p>Of course, this is the same as the original program. But somehow the nature of the reduction is now clearer to the parser.<\/p><h4>Edric's Modification<a name=\"ceab3618-b87d-46c2-84f1-309c6303bf7b\"><\/a><\/h4><p>Edric Ellis suggested a modification that never would have occurred to me.  Initialize <tt>R2<\/tt> to be an empty array.<\/p><pre>  R2 = [];<\/pre><p>Then change the statement in the inner loop to<\/p><pre>     R2 = [R2 k]<\/pre><p>This certainly changes the length of <tt>R2<\/tt>.  The final result is a vector of length 7040.  The rank counts can be extracted with<\/p><pre class=\"language-matlab\">sum(R2==3)\r\nsum(R2==4)\r\n<\/pre><p>I have no idea why the parser is able to accept this code. Each worker must grow its own piece of <tt>R2<\/tt>, an array of unknown length.  Then these pieces must be merged in the correct order to produce the final result.  It seems to be bizarre approach. But Edric tells me that he uses something like this quite often.<\/p><h4>Speedup<a name=\"015bafbb-cd39-4d83-9650-1be3e3ad2f46\"><\/a><\/h4><p>We ran all three of these modified programs on the cluster with 32 workers that we had opened on the Amazon cloud. I didn't record the execution times, but I recall that the speedups were all around a factor between 15 and 16. Why not 32?  Because the <tt>parfor<\/tt> loop only goes up to 16.<\/p><h4>More Speedup<a name=\"ea211e2b-6c46-43f1-90dc-ce14b6ca431b\"><\/a><\/h4><p>The next modification is to merge the two outer loops. Replace the two loops<\/p><pre class=\"language-matlab\"><span class=\"keyword\">parfor<\/span> a = 1:16\r\n   <span class=\"keyword\">for<\/span> b = 1:16\r\n<\/pre><p>with a single, longer loop.<\/p><pre class=\"language-matlab\"><span class=\"keyword\">parfor<\/span> ab = 0:255\r\n   a = floor(ab\/16) + 1;\r\n   b = rem(ab, 16) + 1;\r\n<\/pre><p>Now we could get a speedup by a factor close to 32 out of the 32 workers with each of our three modifications.  We had reduced the time of almost 7 minutes that I had seen on my laptop at home to something under 15 seconds on the cluster available on EC2 at the convention.<\/p><h4>enumerate4_p<a name=\"a02b7679-62ee-46a9-b971-10085d46f9f2\"><\/a><\/h4><p>All of these modifications have been collected together in one working program, <a href=\"https:\/\/blogs.mathworks.com\/images\/cleve\/enumerate4_p.m\">enumerate4_p.m<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_22592634b4ee4ba3bbda59ed028fafab() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='22592634b4ee4ba3bbda59ed028fafab ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 22592634b4ee4ba3bbda59ed028fafab';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2012 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_22592634b4ee4ba3bbda59ed028fafab()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2012b<br><\/p><p class=\"footer\"><br>\r\n      Published with MATLAB&reg; R2012b<br><\/p><\/div><!--\r\n22592634b4ee4ba3bbda59ed028fafab ##### SOURCE BEGIN #####\r\n%% Magic Squares Meet Supercomputing\r\n% I have just returned from the huge\r\n% <http:\/\/sc12.supercomputing.org Supercomputing 2012>\r\n% conference in Salt Lake City.\r\n% I can report on interesting reactions to some questions I posed in\r\n% last week's blog and on some impressive speedup results when we ran the code\r\n% in last week's blog on a parallel cluster in the Cloud.\r\n\r\n%% Linear Algebra Links\r\n% Last week's blog generated several interesting comments and emails.\r\n% Included were web links to a wealth of information about the linear algebra\r\n% of magic squares.\r\n%\r\n% Francis Gaspalou sent me one email.  Francis is a French engineer who has\r\n% worked on magic squares for many years as a hobby.  His web site is\r\n% <http:\/\/www.gaspalou.fr\/magic-squares \/http:\/\/www.gaspalou.fr\/magic-squares>.\r\n%\r\n% The paper \"Magic Square Spectra\" is full of information about the matrices\r\n% generated by the MATLAB |magic| function, as well as other magic squares.\r\n% In particular, my question from last week about the rank of magic squares\r\n% of order five is answered in the paper.\r\n% The authors are Peter Loly, Ian Cameron, Water Trump, and Daniel Schindel.\r\n% The paper was published in 2009 in _Linear Algebra and its Applications_,\r\n% volume 430, issue 10.\r\n% The official link to the paper on the Elsevier web site is\r\n% <http:\/\/dx.doi.org\/10.1016\/j.laa.2008.10.032 DOI link>.\r\n% It is also possible to do a Google search on the paper title to find a\r\n% pdf copy of the paper on an author's web site.\r\n\r\n%% For Loops Nested Seven Deep\r\n% Last week I described the investigation of the rank of magic squares\r\n% of order four, including the matrix |magic(4)| generated by MATLAB,\r\n% and its permutation, which appears in D\u251c\u255drer's _Melancholia_.\r\n% I discussed a program with |for|-loops nested seven deep over the integers\r\n% from 1 to 16 that generates 57,657,600 candidate 4-by-4 matrices\r\n% and finds the 7040 that are actually magic squares.\r\n% The matrices are generated using a parameterization from\r\n% < a paper by Bill Press>.\r\n%\r\n% The indentation of the |for|-loops and the complexity of the parameterization\r\n% cause the properly formatted program,\r\n% <https:\/\/blogs.mathworks.com\/images\/cleve\/enumerate4.m enumerate4.m>,\r\n% to be too wide to fit on a blog page, so I have it in a separate file.\r\n% But here is the essence of the program.\r\n\r\n%   % ENUM.  Essence of enumerate4.m\r\n%   % Enumerate magic squares of order 4.\r\n%   % Count number of each rank.\r\n%   \r\n%   R = zeros(1,4)\r\n%   for a = 1:16\r\n%     for b = 1:16\r\n%       for c = 1:16\r\n%         for e = 1:16\r\n%           for f = 1:16\r\n%             for g = 1:16\r\n%               for i = 1:16\r\n%                 A = [xxxx]\r\n%                 if ismagic(A)\r\n%                   k = rank(A)\r\n%                   R(k) = R(k)+1\r\n%                 end\r\n%               end\r\n%             end\r\n%           end\r\n%         end\r\n%       end\r\n%     end\r\n%   end\r\n\r\n%% Parallel, Please\r\n% This program cries out to be run in parallel.  With the loop structure of\r\n% this simplified program, the inner block would be executed $16^7$,\r\n% or roughly 268 million times.  Actually, |if|-statements that\r\n% I have left out of the simplified program reduce that to 57 million.\r\n% But they're all independent.  They could be run in any order, in principle\r\n% on 57 million processors.\r\n%\r\n% But, wait.  Look at that vector, |R|.  It's the output.  It's what we're\r\n% trying to compute.  It's a row vector of length 4 that, when the program\r\n% terminates, will contain counts of each of the observed ranks.\r\n% All of the processors that we have working for us will have\r\n% to cooperate during the computation, or queue up at the end, to\r\n% accumulate the totals observed for each rank.\r\n% Spawning the parallel tasks is easy.\r\n% Coordinating their cooperation in accumulating the count is hard.\r\n%\r\n% Think of little kids at a birthday party.  Getting each to sing a little\r\n% song individually is easy.  But having them all sing, in unison, on key,\r\n% is a challenge.\r\n%\r\n% The 57 million matrices yield only 7040 magic squares, which in turn\r\n% produce only four rank counts.  This is a dramatic example of an important\r\n% general process known as _reduction_.  The accumulation of the counts in\r\n% |R| is a _sum reduction_.\r\n\r\n%% First Try\r\n% When I wrote last week's blog, I was already thinking about running\r\n% |enumerate4| in parallel.  As I was writing the blog, I tried\r\n% simply changing \r\n%\r\n%   for a = 1:16\r\n%\r\n% to\r\n% \r\n%   parfor a = 1:16\r\n%\r\n% When I tried to run this on my laptop with the Parallel Computing Toolbox,\r\n% I got an error message (in red, which I can't reproduce in this blog)\r\n%\r\n%  Error using enum (line 11)\r\n%  Error: The variable R in a parfor cannot be classified.\r\n%  See Parallel for Loops in MATLAB, \"Overview\".\r\n%\r\n% This didn't surprise me.  The message is pointing to the discussion\r\n% in the documentation of various kinds of reductions.  Our parser\r\n% doesn't have enough information about what the program might do\r\n% to the array |R|.  I need to give the parser some help.\r\n\r\n%% DCS on the EC2\r\n% In fact, when last week's blog was being published on Monday afternoon,\r\n% I was already only my way to \r\n% <http:\/\/sc12.supercomputing.org Supercomputing 2012> in Salt Lake City.\r\n% MathWorks has had a booth at the SC show for several years and\r\n% we've always set up our own small parallel cluster.  But not this year.\r\n% Instead we used the MATLAB Distributed Computing Server\r\n% on the Amazon Elastic Compute Cloud.  That's\r\n% <https:\/\/www.mathworks.com\/products\/distriben\/requirements.html\r\n% DCS on the EC2>.\r\n%\r\n% When I got to Salt Lake, Jos Martin, Edric Ellis, and others from our\r\n% Parallel Computing team in the UK were setting up the MathWorks booth.\r\n% We logged on to Amazon over SCinet, the high performance network set up\r\n% for the SC show, and easily established a pool of 32 workers.\r\n\r\n%% My Modification\r\n% I need to clarify the statement\r\n%\r\n%      R(k) = R(k) + 1\r\n%\r\n% in the inner loop.  \r\n% Here is the most obvious solution.  Replace the initialization\r\n%\r\n%    R = zeros(1,4)\r\n%\r\n% with\r\n%\r\n%    R3 = 0;\r\n%    R4 = 0;\r\n%\r\n% And then replace the statement in the inner loop with\r\n%\r\n%       if k == 3\r\n%          R3 = R3 + 1;\r\n%       else\r\n%          R4 = R4 + 1;\r\n%       end\r\n%\r\n% This is simply giving names to only two components of |R| that are actually\r\n% involved in the computation.  \r\n% It is like dividing the kids at the birthday party into two groups,\r\n% each of which can sing in unison.\r\n\r\n%% Jos's Modification\r\n% Jos Martin suggested expanding the single statement in the inner loop\r\n% into three statements.\r\n%\r\n%      t = zeros(4,1);\r\n%      t(k) = 1;\r\n%      R = R + t;\r\n%\r\n% Of course, this is the same as the original program.\r\n% But somehow the nature of the reduction is now clearer to the parser.\r\n\r\n%% Edric's Modification\r\n% Edric Ellis suggested a modification that never would have\r\n% occurred to me.  Initialize |R2| to be an empty array.\r\n%\r\n%    R2 = [];\r\n%\r\n% Then change the statement in the inner loop to\r\n%\r\n%       R2 = [R2 k]\r\n%\r\n% This certainly changes the length of |R2|.  The final result is\r\n% a vector of length 7040.  The rank counts can be extracted with\r\n%\r\n%   sum(R2==3)\r\n%   sum(R2==4)\r\n%\r\n% I have no idea why the parser is able to accept this code.\r\n% Each worker must grow its own piece of |R2|, an array of unknown\r\n% length.  Then these pieces must be merged in the correct order\r\n% to produce the final result.  It seems to be bizarre approach.\r\n% But Edric tells me that he uses something like this quite often.\r\n\r\n%% Speedup\r\n% We ran all three of these modified programs on the cluster with 32\r\n% workers that we had opened on the Amazon cloud.\r\n% I didn't record the execution times, but I recall that the speedups\r\n% were all around a factor between 15 and 16.\r\n% Why not 32?  Because the |parfor| loop only goes up to 16.\r\n\r\n%% More Speedup\r\n% The next modification is to merge the two outer loops.\r\n% Replace the two loops\r\n%\r\n%   parfor a = 1:16\r\n%      for b = 1:16\r\n%\r\n% with a single, longer loop.\r\n%\r\n%   parfor ab = 0:255\r\n%      a = floor(ab\/16) + 1;\r\n%      b = rem(ab, 16) + 1;\r\n%\r\n% Now we could get a speedup by a factor close to 32 out of the 32 workers\r\n% with each of our three modifications.  We had reduced the time of almost\r\n% 7 minutes that I had seen on my laptop at home to something \r\n% under 15 seconds on the cluster available on EC2 at the convention.\r\n\r\n%% enumerate4_p\r\n% All of these modifications have been collected together in \r\n% one working program, \r\n% <https:\/\/blogs.mathworks.com\/images\/cleve\/enumerate4_p.m enumerate4_p.m>.\r\n\r\n##### SOURCE END ##### 22592634b4ee4ba3bbda59ed028fafab\r\n-->","protected":false},"excerpt":{"rendered":"<!--introduction--><p>I have just returned from the huge <a href=\"http:\/\/sc12.supercomputing.org\">Supercomputing 2012<\/a> conference in Salt Lake City. I can report on interesting reactions to some questions I posed in last week's blog and on some impressive speedup results when we ran the code in last week's blog on a parallel cluster in the Cloud.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2012\/11\/26\/magic-squares-meet-supercomputing\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/396"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=396"}],"version-history":[{"count":15,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/396\/revisions"}],"predecessor-version":[{"id":6585,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/396\/revisions\/6585"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=396"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=396"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=396"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}