Once again we're going to hear from guest blogger Peter Perkins, who is a statistical software developer here at The MathWorks.
Last time, I showed a few ways to reproduce the values coming from MATLAB's random number generators. But sometimes what's needed is the "opposite of reproducibility" - you want to make sure that different runs of a simulation use different random inputs, because you want to be able to assume that the different runs are statistically independent. That's today's subject.
Remember seeds from my last post? I talked about reproducibility. If you create a random number stream with the same seed, you'll get the same sequence of values every time. But just as important, if you create different streams using different seeds, you'll get different sequences.
When MATLAB starts, it creates the default stream with commands equivalent to
stream0 = RandStream('mt19937ar','Seed',0)
stream0 = mt19937ar random stream Seed: 0 RandnAlg: Ziggurat
We saw that last week. To get a different sequence of random values, I can create a new stream with a different seed, such as
stream1 = RandStream('mt19937ar','Seed',1)
stream1 = mt19937ar random stream Seed: 1 RandnAlg: Ziggurat
Next I'll make it the default stream.
By setting stream1 as the default stream, I've made it so that rand, randn, and randi will use that stream from now on, and will draw values from a different sequence than they did when using stream0. Just to confirm that I've changed the default stream,
stream1 == RandStream.getDefaultStream()
ans = 1
In practice, I might have created the new stream on a second processor, to run in parallel with the first, or created it in a second MATLAB session on the same processor. It doesn't matter, either way works the same.
One common use of seeds is to get different random values each time you start up MATLAB without having to think about it every time. To do that, you can base the seed on the system clock.
mystream = RandStream('mt19937ar','Seed',sum(100*clock))
mystream = mt19937ar random stream Seed: 210226 RandnAlg: Ziggurat
Bear in mind that you won't be able to reproduce your results later on unless you hang onto the value of sum(100*clock). Doh!
You might wonder what value you should use for a seed, and if you will somehow add "extra randomness" into MATLAB by using the system clock. The answer is, you won't. In fact, taking this strategy to the extreme and reseeding a generator before each call will very likely do bad things to the statistical properties of the sequence of values you end up with. There's nothing wrong with choosing widely-spaced seeds, but the's no real advantage.
Remember, seeding a stream is not something you want to do a lot of. It's most useful if you use it as an initialization step, perhaps at MATLAB startup, perhaps before running a simulation.
A reasonable question to ask at this point is, "If you can't tell me how far apart or what in order the different seeds are in the sequence of random values, how do I know that two simulation runs won't end up reusing the same random values?" The answer is, "You don't." Practically speaking, the state space of the Mersenne Twister used in the example above is so ridiculously much larger (2^19937 elements) than the number of possible seeds (2^32) that the chances of overlap in different simulation runs are pretty remote unless you use a large number of different seeds. But there is a more subtle potential problem - not enough is known about the statistical (pseudo)independence of sequences of values corresponding to different "seeds" of the Mersenne Twister generator.
There is a better way available in MATLAB R2008b: two new generator algorithms support multiple independent streams. That is, you can create separate streams that are guaranteed to not overlap, and for which tests that demonstrate (pseudo)independence of the values not only within each stream, but between streams, have been carried out. Notably, the Mersenne Twister ('mt19937ar') does not support multiple streams.
The two generator types that support multiple independent streams are the Combined Multiple Recursive ('mrg32k3a') and the Multiplicative Lagged Fibonacci ('mlfg6331_64') generators (the doc has references.) I'll create one of the former, and make it the default stream.
cmrg1 = RandStream.create('mrg32k3a','NumStreams',2,'StreamIndices',1); RandStream.setDefaultStream(cmrg1)
The first parameter, NumStreams, says how many independent streams I plan on using, in total. StreamIndices says which one (or which ones) I actually want to create right now. Note that I've used the RandStream.create factory method, which allows for creating multiple streams as a group, and that the display for this stream
cmrg1 = mrg32k3a random stream (current default) StreamIndex: 1 NumStreams: 2 Seed: 0 RandnAlg: Ziggurat
indicates that it is part of a larger group (even though I have not yet created the second stream).
Let's say I've made one run of a simulation using the above as the default stream, represented by these 100,000 random values.
x1 = rand(100000,1);
Now I want to make a second, independent simulation run. If the runs are short enough, I might just run one after another in the same MATLAB session, without modifying the state of the default stream between runs. Since successive values coming from a stream are statistically independent, there's nothing at all wrong with that. On the other hand, if I want to shut down MATLAB between runs, or I want to run on two machines in parallel, that isn't going to work.
What I can do, though, is to create the second of my two independent streams, and make it the default stream.
cmrg2 = RandStream.create('mrg32k3a','NumStreams',2,'StreamIndices',2) RandStream.setDefaultStream(cmrg2) x2 = rand(100000,1);
cmrg2 = mrg32k3a random stream StreamIndex: 2 NumStreams: 2 Seed: 0 RandnAlg: Ziggurat
Let's look at the correlation between the two streams.
ans = 1 0.0015578 0.0015578 1
That isn't by any means proof of statistical independence, but it is reassuring. In fact, much more thorough tests have been used to demonstrate between-stream independence of these generators.
As long as the streams are created correctly, it doesn't matter where or when I make the second simulation run, the results will still be independent from the first run. 'mrg32k3a' and 'mlfg6331_64' both support a huge number of parallel independent streams, and each stream is itself very, very long, so you can use multiple independent streams for very large simulations.
Remember substreams from my last post? I promised to explain why they're different than seeds, so here it is. Unlike seeds, which are located who-knows-where, who-knows-how-far-apart along the sequence of random numbers, the spacing between substreams is known, so any chance of overlap can be eliminated. And like independent parallel streams, research has been done to demonstrate statistical independence across substreams. In short, they are a more controlled way to do many of the same things that seeds have traditionally been used for, and a more lightweight solution than parallel streams.
Substreams provide a quick and easy way to ensure that you get different results from the same code at different times. For example, if I run this loop (from the last post)
defaultStream = RandStream.getDefaultStream(); for i = 1:5 defaultStream.Substream = i; z = rand(1,i) end
z = 0.32457 z = 0.25369 0.3805 z = 0.22759 0.34154 0.61441 z = 0.0054898 0.12139 0.70173 0.15851 z = 0.51784 0.37283 0.54639 0.14839 0.89321
and then run off to the Tahiti, and then come back (maybe), I can run this loop
for i = 6:10 defaultStream.Substream = i; z = rand(1,11-i) end
z = 0.48889 0.46817 0.62735 0.14538 0.84446 z = 0.66568 0.11152 0.035575 0.013499 z = 0.60589 0.28911 0.74664 z = 0.24981 0.78979 z = 0.25271
and be certain that I've generated random values independently of the first set of 5 iterations, and in fact have gotten exactly the same results as if I had never gone to Tahiti (though not gotten as good of a tan):
for i = 1:10 defaultStream.Substream = i; z = rand(1,min(i,11-i)); end
As always, I might run the second loop at a different time than the first, but on the same machine, or run it at the same time on a different machine. As long as the stream is set up correctly, it makes no difference.
It's important to remember that replacing the default stream, or changing its state, either by resetting it, by setting its state directly, or by positioning it to a new substream, affect all subsequent calls to rand, randn, and randi, and are things you should be careful about doing. If code you write is used by someone else who has no idea that you've done one of those things, your code may unintentionally invalidate the results of their simulation by affecting the statistical properties of the random numbers they use, or make it difficult for them to reproduce their results when they need to.
In general, I recommend leaving control of the default stream up to the end user as much as possible. If you are your own end user, it might be a good idea to execute "stream control" statements only at MATLAB startup, or only right before running a simulation.
In code that will be called by others, it's safe to generate values using the default stream, with no effect on the statistical properties of other random numbers generated in other parts of the code. But if possible, don't otherwise change the default stream's state or configuration by doing things like resetting or replacing the default stream. It's safest if those kinds of operations are done only by, or at the specific request of, the end user, so that it's clear when they are happening.
When that's not possible, if you need to somehow modify the default stream's state, consider "cleaning up after yourself" by saving and restoring the state before and after your code does whatever it needs to do. For example,
defaultStream = RandStream.getDefaultStream(); savedState = defaultStream.State; % some use of the default stream and its state defaultStream.State = savedState;
makes it as if your code did not modify the default stream at all.
Alternatively, consider using a "private stream", one that you create and draw values from without affecting the default stream. For example.
mystream = RandStream('mt19937ar','Seed',sum(100*clock)); x = rand(mystream,100,1);
generates values completely separately from the default stream. That call to rand (the method not the function) uses your private stream, not the default stream. Another possibility is to make the private stream the default, but only temporarily. For example,
mystream = RandStream('mt19937ar','Seed',sum(100*clock)); savedStream = RandStream.getDefaultStream(); RandStream.setDefaultStream(mystream); % some use of the (new) default stream and its state RandStream.setDefaultStream(savedStream);
allows your code to use your stream as the default, but also puts things back just as they were originally.
Do the examples I've covered leave out any situations involving reproducibility of random numbers, or "the opposite or reproducibility", that you've run into? Let me know here.
Get the MATLAB code
Published with MATLAB® 7.7
Comments are closed.
22 CommentsOldest to Newest
Great blog entry, but I am still somewhat confused about the difference between multiple streams and substreams. Should I think of multiple streams as entirely independent and substreams as less so? Is there a speed advantage of using one vs the other?
Here’s an example of something I would like to do. Say I have a function whose output is the sum of 2 randn calls (ignore why for now) and designate them A and B. For one simulation run A may be active and B uncalled (output=A), for another B may be active (output=B), for another A&B may be active (output=A+B). An important point is that if A is not active then its randn is *never called*. Here’s some example code:
function out = gen_randn_sum(useA, useB)
stream0 = RandStream(‘mt19937ar’,’Seed’,0);
n_rows = 10;
outA = zeros(n_rows, 1);
outB = zeros(n_rows, 1);
outA = randn(stream0, n_rows, 1);
outB = randn(stream0, n_rows, 1);
out = outA + outB;
In this case gen_randn_sum(1, 0) + gen_randn_sum(0, 1) is not equal to gen_randn_sum(1, 1). But I do want them to be equal as I do not want the A-only case to generate different values when B is also used. So A & B need to be independent. But should I use multiple streams with different seeds or one stream/one seed with multiple substreams?
To bound the problem imagine that A & B can be A-J (ie 10 randn sources), n_rows is 1024, and the function can be called 1M times (with the seeding done outside the function only once at the beginning of the 1M calls). Speed is a major factor as the function is called so many times.
Kieran, as you’ve written your code, gen_randn_sum isn’t a random number generator at all — given the same input, it will always return the same output regardless of how many time it’s called. Is that really what you meant?
One interpretation of what you might have meant is that you want gen_randn_sum to “use up” the same random values regardless of what actual (useA, useB)’s are passed in. To do that, you don’t need multiple streams at all, just “burn off” the random values you don’t actually use. You could use parallel streams, but there’s no need to.
As far as parallel streams vs. substreams, they are very similar ideas, and for the mrg32k3a generator, they are exactly the same thing: this generator uses what’s called sequence splitting — imagine a very big circle broken up into hours (except 2^63 of them) and minutes (except 2^70-something per hour). Those are the streams and substreams. The mlfg6331 generator uses what’s called parameterization for parallel streams (imagine a bunch of “parallel” circles going the long way around a doughnut), and sequence splitting for substreams.
The operational difference is that you can create several parallel streams, and draw from each one independently at the same time with no bookkeeping. It’d be hard to do that with substreams. The latter are more aimed at using values from one substream, then the next, and so on.
Thanks Peter for your quick reply. I know that the code as written will produce the same output each time. Later I mentioned seeding outside the function.
The problem with “using up” random numbers is the following. Imagine instead of only the 1 function that I provided I instead have 2 (one produces A+B (or A, B) and the other A+B+C (or A, B or C)). Again for reproducibility I need to ensure that I can repeat either case with only A output. By using up the values I have ensured that the 2 cases will never be able to produce the same output. This may sound esoteric, but it is a real situation in my sims (imagine a functional block similar to a Simulink block with A/B/C being noise sources). I may have one version with 2 noise sources and another more advanced version with the same 2 noise sources + another one. If in the latter version I turn off the last noise source (the equivalent of C) the output should be *exactly* the same as the former version.
I’ll use multiple streams with one seed I think. Better than independent streams with different seeds.
You need R2008b for these features.
I am using MATLAB v7.4. None of this stuff seems to work for me. Am I missing a toolbox? Wrong version? How would one do similar types of things in a previous version?
Are there any possibilities to run multiple “streams” in R2008a (probably, some third-party scripts)?
I’m going to provide some statistical calculations on different cluster nodes and so need to obtain independent random numbers on each node.
As I understand, running
before simulation on the node is better than doing nothing, but won’t give me statisticial independene?
P.S. Thank you for such interesting and extremely useful blog.
P.P.S. BTW, why doesn’t ‘mt19937ar’ in MATLAB support multiple streams? In Intel MKL, for instance, Mersenne Twister does. And also Matsumoto has shown this possibility in the article “Dynamic Creation of Pseudorandom Number Generators”
Vladimir, if you really think you need parallel streams, and are not running R2008b, you could look at a submission I made to MATLAB Central File Exchange a few years ago:
It uses the Mersenne Twister with different seeds. And that, I believe, answers your P.S. too: the “standard” MT mt19937ar is not designed for parallel streams, but because the state space is so large and the seeding algorithm considered good enough, lots of people use it as a de-factor parallel generator by choosing different seeds. I don’t know how much formal testing this method has received. I believe the paper you’re referring to is a slightly different MT than the mt19937ar, and my recollection is that the implementation is still a work in progress. By the way, there’s nothing preventing you from using the FEX code as a starting point and implementing your own MT dynamic creator.
But before you spend a lot of time on any of this, I’d like to point out that you may not need multiple streams at all. If all your code does is to call rand, i.e., you use the MT as a serial generator, you are getting values that are as “independent” as anything you’ll get with parallel streams, regardless of the fact that you use the values in different places. Parallel streams are often a convenience for various reasons, and make repeatability easier in many cases, but think hard whether you really need them. Also, using sum(100*clock) to seed the MT is a good trick for not having to think up new seeds every time you need one, but I don’t know of any evidence that this (apparent) “added randomness” provides anything that using 1, 2, … wouldn’t already give you.
I am thinking of saving the state of the random generator at the end of a MATLAB session, and loading the state in the next MATLAB session. Can I do this by adding code to startup.m and finish.m? Further, what will be the behaviour, if I have concurrent MATLAB sessions?
Sure. You can either save/read the default stream’s state to/from a mat file, if you’ll never change the generator type, or save/read the default stream itself, if you want to be more general. That part is straight-forward.
But if you have concurrent sessions, you’re obviously going to have figure out how you want them to interact or not, and figure out some way for them to not step on each other.
Clearly this is a complex topic & these pair of blogs show that The MathWorks are masters of the topic. I’m happy to see that you have provided the capability that various users need.
However, I am stumbling over your choice to provide the capability via this “new kind of object.” From my view, you’ve made the end user’s job more complex by requiring that he master this new object, with its obtuse syntax, rather than just providing clear options & syntax in the appropriate functions. As I read the first page of documentation on this class, I was struck that the syntax in the 1st example required 27 characters! That doesn’t look very user friendly to me.
Next the documentation pretty quickly is talking about handles, which as I’ve said before, is confusing to most lay users out here.
I accept that you developers think these objects & classes are powerful & attractive ways to provide capability. But, most of us out here are confused by the topic & would rather have a function with clear syntax.
Thank you for the article!
If I’m switching back and forth between multiple streams, is the state of each stream saved when I switch to a new one? If I switch back to the first stream, will it keep operating as if I hadn’t switched at all?
I’m doing a project in which I have multiple runs of a simulation going simultaneously. I iterate each run for 500 time steps, check the state of the system, and then repeat.
Using a separate stream for each run has improved the performance considerably, but I’m still seeing some anomalies and wondering how to correct them.
Ben, if by “switch”, you mean designating different streams as the default stream at different times, then yes, their states are preserved. Obviously, when you generate values form a stream, its state will advance, but the mere act of marking a stream as the default has no effect on its state.
By the way, although there’s nothing wrong with what (it sounds like) you’re doing, it’s not strictly necessary to mark a stream as the default in order to generate values from it — you can use its RAND/RANDN/RANDI _methods_ instead. If you do that, and never mark a stream as the default, there’s no chance that you (or code that you call) will unknowingly draw values from it and advance its state.
Dear Loren and Peter,
Thank you very much for your insightful posts. They are very helpful.
I have one question, is there any maximum number of substreams we can use?
I am currently writing a program to compare the performance of several procedures. The purpose of each procedure is to select the best among 20 independent alternatives. In order to make a fair comparison between the procedures, I need to use the same random number seed for each design when testing their performance in each simulation trial.
I need to independently repeat the simulation 10,000 times. As far as I can think of, the best way to have the statistical independence is to have 20 x 10,000 = 200,000 substreams. However, I am not sure whether there could be 200,000 independent substreams (or more).
Thank you for your kind attention.
Nugroho, I’m not sure I see the need for you to use substreams at all. You say you, “need to use the same random number seed for each design when testing their performance in each simulation trial”. Forget about seeds for the moment, and think about things at a higher level. I think what you mean by the above is that you want to use the same random numbers for each of the 20 simulations, so that you know that each of the 20 procedures was given the exact same data to work on. If that’s the case, then I see no reason why you can’t simply reset the RNG before each of 20 runs.
If for some reason the different procedures use up different numbers of random numbers, then it might make sense to use substreams, one for each iteration, but even then it would seem like you’d still want to reset the RNG to the same point before each of 20 runs.
I can’t tell for sure what you need. But I can say that it’s possible to “overuse” some of these parallel RNG features.
In any case, the number of substreams for ‘mrg32k3a’ or ‘mlfg6331_64’, as described in the documentation in the User Guide, is much larger than 20,000. It’s something like 2^51.
Hope this helps.
How do we ensure stream independence when using parfor? For instance, if I do this:
% Assume a matlabpool open stream = RandStream.create('mrg32k3a','seed',1234); parfor ii = 1:10 par(ii) = rand(stream); end stream = RandStream.create('mrg32k3a','seed',1234); for ii = 1:10 nopar(ii) = rand(stream); end
Not only do I get different answers in “par” and “nopar”, but the values in “par” are repeated! How do I avoid this issue while still maintaining reproducibility of my results?
Mike, parfor can be tricky, and there are a couple things going on here:
1) parfor and for do not iterate the same “order”, and because your loop bodies have side effects (meaning that you’re calling rand which maintains an internal state), you won’t get the same results. The loop body in the for case gets run with i going sequentially from 1 to 10. The loop body in the parfor case gets executed in “some” order. There’s no wo way to guarantee, for example, that the 4th and 5th iterations are run on the same worker in that order. If you run the same parfor twice, or run it with a different number of workers, things change. That’s the nature of parfor. It’s also a good reason to use substreams if you want reproduceability.
2) You’re using _copies_ of the same stream on your workers, and they work separately from each other, so the first time rand is called on each worker, you’ll get the same value. In fact, that’s the whole reason for using multiple parallel streams when you want calculations on different workers to be statistically independent.
To answer your specific question: you can get the exact same calculations in a parfor that you’d get in a for by using substreams. Something like
stream = RandStream('mrg32k3a'); parfor ii = 1:10 set(stream,'Substream',ii); par(ii) = rand(stream); end
This uses (copies of) the same stream on all workers, but ties the values returned by rand to the iteration number, so regardless of whether you run this loop on one worker, or many, or change it to a for loop in serial mode, it will return the same vector. Ordinarily, you’d probably set the substream index by assigning to the property
stream.Substream = ii;
but parfor won’t let you do that.
If you want to run this same loop again but get different (but still repeatable) results, you might:
* use a different seed to create the stream, or
* add some offset to the substream index.
Using substreams gets you reproduceability. If you don’t care about that, you most likely want to create a set of independent parallel streams using RandStream.create, and passing in labindex as the StreamIndices parameter.
I am sorry for the late reply.
Thank you very much for making me think about the things at the higher level so that I do not overuse the substreams feature.
Thank you too for giving the idea on the number of available substreams.
Have a great month.
I havent read all the piecess of the article so please exuse me if i’m asking something obvious brought up already…
My Matlab 7.1.02xxx does generate uncorrelated randn results after using
however when i use this inside a function only subsequent randn() functions inside that function generate different random numbers, as soon as another main file iteration enters the same function again these sequences repeat as if the generator state was reset and not really “listening” to changing sum(100*clock) values… Any idea if thats a know bug or some fault on my side?
This is the second time i struggle against this loosing many, many hours…
Thanks for the article
Boh just forget i wrote above. That was my bad. Just lost some 5 hours on this. A couple beers too many and i havent noticed matlab had been writing my rand data to a mat file in a different directory. Meantime another part of the programme kept reading an old file of the same name placed where i was expecting new rand data to be written in at each iteration, but of cours never beeing updated with the new data at all. Anyways.
Thnanks for the info on rand!
I wish everyone all the best including seldom “matrix dimensions must agree” error.
Keep wearing them F5 buttons off!
Consider the following Matlab code.
stream0 = RandStream('mt19937ar','Seed',0); RandStream.setDefaultStream(stream0); a=rand(4e6,1); stream0 = RandStream('mt19937ar','Seed',0); RandStream.setDefaultStream(stream0); b=randn(1e6,1); c=rand(1); find(a==c) stream1 = RandStream('mrg32k3a','Seed',0); RandStream.setDefaultStream(stream1); a=rand(4e6,1); stream1 = RandStream('mrg32k3a','Seed',0); RandStream.setDefaultStream(stream1); b=randn(1e6,1); c=rand(1); find(a==c)
The output is:
Why does the mt19937ar algorithm uses only 1.0218 rand values on average?
The mrg32k3a uses 2.028 on average as expected.
I have a question about the difference in generating normal
(pseudo) random numbers with the mt19937ar algorithm and the mrg32k3a algorithm.
For some reason the mt19937ar algorithm uses 1.0218 uniform (pseudo) random numbers and the mrg32k3a algorithm uses 2.0282.
Both algorithms use the Ziggurat algorithm to tranform the uniform random numbers into normal random numbers.
Can you explain the difference?
stream0 = RandStream('mt19937ar','Seed',0); RandStream.setDefaultStream(stream0); a=rand(3e6,1); stream0 = RandStream('mt19937ar','Seed',0); RandStream.setDefaultStream(stream0); b=randn(1e6,1); c=rand; find(a==c) stream1 = RandStream('mrg32k3a','Seed',0); RandStream.setDefaultStream(stream1); a=rand(3e6,1); stream1 = RandStream('mrg32k3a','Seed',0); RandStream.setDefaultStream(stream1); b=randn(1e6,1); c=rand; find(a==c)
Jan, this requires some details, but you asked.
The Ziggurat algorithm, most of the time, requires a random integer to determine the level, and a random uniform to determine the position within that level. The original Ziggurat algorithm as published by Marsaglia used and reused 32bits of randomness for both purposes, and that’s what SHR3CONG still does (largely to be backwards compatible). Very fast, but if you look _very_ closely, you can see a “griddiness” that’s a consequence of reusing bits. Almost certainly not a practical problem, though. The MT algorithm as implemented in MATLAB’s RandStream uses 64bits of randomness for those two values, and doesn’t reuse any bits. But the original author of MRG32K3A didn’t do that, L’Ecuyer recommended using two full U(0,1)’s, so that’s what MATLAB does, and it uses up 128 bits of randomness.
Of course, something like 1.5% of the time, the RandStream Ziggurat algorithm requires more randomness, so that’s why you don’t see _exactly_ 1 (MT) or 2(MRG32K3A) d.p. uniform per normal.
If you switch to ‘Inversion’, you’ll find that _exactly_ one uniform gets used up for each normal in all cases, but it’s slower to do that computation.