The Intel Hypercube, part 2, reposted

著者 Cleve Moler, November 12, 2013

10 ビュー (過去 30 日間) | 0 いいね | 0 コメント

The first posting earlier today was truncated. Here is the complete post. Even though I am one of the founders of the MathWorks, I only acted as an advisor to the company for its first five years. During that time, from 1985 to 1989, I was trying my luck with two Silicon Valley computer startup companies. Both enterprises failed as businesses, but the experience taught me a great deal about the computer industry, and influenced how I viewed the eventual development of MATLAB. This post continues the discussion of the Intel Personal Supercomputer

Matrix Computation on the iPSC
MATLAB and the iPSC
Knoxville Conference
Another Look at Amdahl's Law
Embarrassingly Parallel
Mandelbrot Set
Legacy
References

Matrix Computation on the iPSC

As soon as I arrived at my new job in Oregon, I began to write programs for matrix computation on the iPSC, even though it would be a few months before we had a working machine. The programming language was Fortran, with calls to the operating system for message passing between the nodes of the Cube.

Computations were initiated and controlled from the front end Cube Manager. The matrices did not come from the Manager. They were test matrices, generated on the Cube, and distributed by columns. The actual numerical values rarely mattered. We were primarily interested in timing experiments for various operations.

We soon had communication routines like "gsend", global send, that sent a vector from one node to all the other nodes, and "gsum", global sum, that formed the sum of vectors, one from each node. These routines made use of the hypercube connectivity, but otherwise we could regard the nodes as fully interconnected.

As I described in my blog two weeks ago, Gaussian elimination would proceed in the following way. At the k -th step of the elimination, the node that held the k -th column would search it for the largest element. This is the k -th pivot. After dividing all the other elements in the column by the pivot to produce the multipliers, it would broadcast a message containing these multipliers to all the other nodes. Then, in the step that requires most of the arithmetic operations, all the nodes would apply the multipliers to their columns.

MATLAB and the iPSC

In 1985 and 1986 MATLAB was brand new. It had nothing in the way of parallel computing, and wouldn't for many years. So MATLAB could not run on the iPSC. The hardware on one node of the iPSC was the same as an IBM PC/AT, so in principle it could support MATLAB, but the bare bones operating system on the node could not, and there was no disc attached directly to a node.

The Cube Manager could not run MATLAB either. The Manager had an Intel 80286 CPU, but it ran Xenix, not Microsoft DOS, and MathWorks did not produce a release of MATLAB for this operating system.

Fortunately, our group had several Sun workstations. Acquiring them had created a bit of a stir within Intel because the Suns were based on Motorola chips and, at the time, Motorola was the competition in the microprocessor business. But we were independent of Mother Intel, so the Motorola ban did not apply to us. I could run MATLAB on the Sun on my desk, and using Sun's version of Unix, do a remote login to the iPSC Cube Manager. We did not yet have Mex files, so MATLAB could not communicate directly with the Cube. I ran timing experiments on the Cube, sent the results back to a file on the Manager, read the file into MATLAB, then modeled and plotted the data.

Knoxville Conference

In August, 1985, Mike Heath, who was then at Oak Ridge National Lab, organized a conference in Oak Ridge, Tennessee, on Hypercube Multiprocessors. Oak Ridge was going to be the next customer, after Yale, to receive an iPSC. They were also getting an nCUBE machine. The proceedings of the conference were published in 1986 by SIAM.

The Knoxville conference was the first public display of the iPSC. I gave a talk and published a paper in the proceedings entitled Matrix Computation on Distributed Memory Multiprocessors. I told the "megaflops per gallon" story about that talk in my blog about the LINPACK Benchmark last June.

The plots included in that paper are perhaps the first MATLAB plots published in any scientific paper. The entire paper was printed on a dot matrix printer and camera ready copy sent to SIAM. The plots are very crude by today's standards. Their resolution is less than a hundred dots per inch.

Another Look at Amdahl's Law

In the early days of parallel computing, Amdahl's law was widely cited as a fundamental limitation. It stated that if an experiment involving a fixed task was carried out on a parallel machine, then eventually the portion of the task that could not be parallelized would dominate the computation and increasing the number of processors would not significantly decrease the execution time.

But as I began to use the iPSC, it seemed sensible to vary the task by increasing the problem size as I increased the number of processors. In my Knoxville paper I wrote:

To fully utilize the system, we must consider problems involving many matrices, ..., or matrices of larger order. ... The performance is strongly dependent on the two parameters, n and p. For a given p, there is a maximum value of n determined by the amount of memory available.

$$ n_{max} \approx \sqrt{pM} $$

After the Knoxville paper, I wrote a technical report at Intel Scientific Computers that appeared twice, under two slightly different titles, Another Look at Amdahl's Law, and A Closer Look at Amdahl's Law. I lost my own copies of the report in a flood of my home office several years ago and none of the friends that I've asked can find their copies. If anybody reading this blog happens to have a copy, I would appreciate hearing from them.

I was able to find an overhead transparency made around that time that has the following MATLAB plot. The graph does not have good annotation. It should follow earlier graphs in a series. It is similar to graphs in the Knoxville/SIAM paper, although it involves a larger machine with more processors

Each dashed line shows how the megaflop rate, f, for the LU decomposition of a matrix of a fixed size, varies as the number of processors, p, is increased. The first line is for a matrix of order n = 100. This line is almost level at p = 16. This is the order for the original LINPACK benchmark and is a good example of Amdahl's Law. For a 100-by-100 matrix, increasing the number of processors beyond 16 does not provide any significant parallel speedup.

The line with the circles filled in to make black dots is n = 1000. We are still seeing significant speedup when we reach the maximum number of processors, p = 128. The top circle, with no dashed line connected to it, is a matrix of order n = 1959. With room for only 30K double precision words per node, this is the largest matrix that can be distributed across the full machine.

The upper envelope of these curves begins to approach the $45^\circ$ line that represents perfect speedup. It previews the point that I would later make in the now lost technical report. If you increase the problem size to make use of the increased memory available with an increased number of processors, you can increase parallel efficiency and at least partially mitigate the losses modeled by Amdahl's Law. This phenomenon has come to be known as weak speedup or weak scaling.

Embarrassingly Parallel

The term "embarrassingly parallel" is widely used today in parallel computing. I claim to be responsible for inventing that phrase. I intended to imply that it was embarrassing because it was so easy -- there was no parallel programming involved. Quoting a Web page by Ian Foster, Wikipedia defines an "embarrassingly parallel" problem as "one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks."

Wikipedia says that the etymology of the phrase "embarrassingly parallel" is not known, but they do reference my Knoxville Hypercube paper. Here is what I wrote.

One important way in which LINPACK and EISPACK will be used on such machines is in applications with many tasks involving matrices small enough to be stored on a single processor. The conventional subroutines can be used on the individual processors with no modification. We call such applications "embarrassingly parallel" to emphasize the fact that, while there is a high degree of parallelism and it is possible to make efficient use of many processors, the granularity is large enough that no cooperation between the processors is required within the matrix computations.

Mandelbrot Set

The cover of the August, 1985, issue of Scientific American featured an image that had a lasting impact on the world of computer graphics, parallel computing, and me -- the Mandelbrot set. An article by A. K. Dewdney entitled simply Computer Recreations was subtitled "A computer microscope zooms in for a look at the most complex object in mathematics". (By the way, Dewdney would later replace the legendary Martin Gardner and Douglas Hofstadter in authoring Scientific American's mathematics column.)

It was easy to write a parallel program for the iPSC to compute the Mandelbrot set. The code fit on one overhead transparency. And it was fast! The trouble was we couldn't display the results. It would take forever to send the image to the front end. Even then we didn't have any high resolution display. It would be a couple of years -- and access to a serious graphics supercomputer -- before I could do the Mandelbrot set justice.

Legacy

I left Intel Scientific Computers in Oregon to join another startup in Silicon Valley itself in 1987. I'll tell you about that adventure in my next post.

The iPSC/1 that I have been describing was not a commercial success. It wasn't really a supercomputer. Look at the graph above. Three megaflops on the LINPACK benchmark with a half million dollar machine in 1985 just didn't cut it. And a half megabyte of memory per node was not nearly enough. We knew that, and by the time I left, iSC had introduced models with more memory, and with vector processors, at the expense of fewer nodes.

There was very little software on the iPSC/1. The operating system on the Cube was minimal. The development environment on the Manager was minimal. And while I was there, I was in charge of the group that would have developed applications, but I'm not sure it occurred to us to actually do that. We just did demos and math libraries.

I think I can say that the iPSC/1 was an important scientific milestone. It played a significant role in research at a number of universities and laboratories. Intel Scientific Computers went on to build other machines, based on other Intel chips, that were commercially successful, but they didn't involve me, and didn't involve MATLAB.

I learned a lot. At the time, and to some extent even today, many people thought of MATLAB as just "Matrix Laboratory". So, to get a "Parallel MATLAB", one just had to run on top of a parallelized matrix library. I soon realized that was the wrong level of granularity. On the iPSC, that would mean running MATLAB on the Cube Manager, generating a matrix there, sending it over the ethernet link to the Cube, doing the parallel computation on the Cube, and eventually sending the result back to the front end.

My experience with the iPSC in 1985 taught me that sending messages from MATLAB on a front end machine to a parallel matrix library on a back end machine is a bad idea. I said so in my "Why There Isn't A Parallel MATLAB" Cleve's Corner in 1995. But other people were still trying it with their versions of Parallel MATLAB in 2005.

Today, we have parallelism at several levels with MATLAB. There is automatic fine-grained multithreading in the vector and matrix libraries that most users never think about. There may also be many threads in a GPU on machines that happen to have one. But the real legacy of the iPSC is the clusters and clouds and MATLAB Distributed Computing Server. There, we are finding that, by far, the most popular feature is the parfor construct to generate embarrassingly parallel jobs.

References

Cleve Moler, Matrix Computation on Distributed Memory Multiprocessors, in Hypercube Multiprocessors, Proceedings of the First Conference on Hypercube Multiprocessors, edited by Michael T. Heath, Oak Ridge National Laboratory, ISBN 0898712092 / 9780898712094 / 0-89871-209-2, SIAM, 181-195, 1986. <http://www.amazon.com/Hypercube-Multiprocessors-1986-Michael-Heath/dp/0898712092>

Cleve Moler, Another Look at Amdahl's Law, Technical Report TN-02-0587-0288, Intel Scientific Computers, 1987.

Cleve Moler, A Closer Look at Amdahl's Law, Technical Report TN-02-0687, Intel Scientific Computers, 1987.

A. K. Dewdney, Computer Recreations, Scientific American, 1985, <http://www.scientificamerican.com/media/inline/blog/File/Dewdney_Mandelbrot.pdf>

Published with MATLAB® R2013b