Once again we're going to hear from guest blogger Peter Perkins, who is a statistical software developer here at The MathWorks.
Last time, I showed a few ways to reproduce the values coming from MATLAB's random number generators. But sometimes what's needed is the "opposite of reproducibility" - you want to make sure that different runs of a simulation use different random inputs, because you want to be able to assume that the different runs are statistically independent. That's today's subject.
Remember seeds from my last post? I talked about reproducibility. If you create a random number stream with the same seed, you'll get the same sequence of values every time. But just as important, if you create different streams using different seeds, you'll get different sequences.
When MATLAB starts, it creates the default stream with commands equivalent to
stream0 = RandStream('mt19937ar','Seed',0)
stream0 = mt19937ar random stream Seed: 0 RandnAlg: Ziggurat
We saw that last week. To get a different sequence of random values, I can create a new stream with a different seed, such as
stream1 = RandStream('mt19937ar','Seed',1)
stream1 = mt19937ar random stream Seed: 1 RandnAlg: Ziggurat
Next I'll make it the default stream.
By setting stream1 as the default stream, I've made it so that rand, randn, and randi will use that stream from now on, and will draw values from a different sequence than they did when using stream0. Just to confirm that I've changed the default stream,
stream1 == RandStream.getDefaultStream()
ans = 1
In practice, I might have created the new stream on a second processor, to run in parallel with the first, or created it in a second MATLAB session on the same processor. It doesn't matter, either way works the same.
One common use of seeds is to get different random values each time you start up MATLAB without having to think about it every time. To do that, you can base the seed on the system clock.
mystream = RandStream('mt19937ar','Seed',sum(100*clock))
mystream = mt19937ar random stream Seed: 210226 RandnAlg: Ziggurat
Bear in mind that you won't be able to reproduce your results later on unless you hang onto the value of sum(100*clock). Doh!
You might wonder what value you should use for a seed, and if you will somehow add "extra randomness" into MATLAB by using the system clock. The answer is, you won't. In fact, taking this strategy to the extreme and reseeding a generator before each call will very likely do bad things to the statistical properties of the sequence of values you end up with. There's nothing wrong with choosing widely-spaced seeds, but the's no real advantage.
Remember, seeding a stream is not something you want to do a lot of. It's most useful if you use it as an initialization step, perhaps at MATLAB startup, perhaps before running a simulation.
A reasonable question to ask at this point is, "If you can't tell me how far apart or what in order the different seeds are in the sequence of random values, how do I know that two simulation runs won't end up reusing the same random values?" The answer is, "You don't." Practically speaking, the state space of the Mersenne Twister used in the example above is so ridiculously much larger (2^19937 elements) than the number of possible seeds (2^32) that the chances of overlap in different simulation runs are pretty remote unless you use a large number of different seeds. But there is a more subtle potential problem - not enough is known about the statistical (pseudo)independence of sequences of values corresponding to different "seeds" of the Mersenne Twister generator.
There is a better way available in MATLAB R2008b: two new generator algorithms support multiple independent streams. That is, you can create separate streams that are guaranteed to not overlap, and for which tests that demonstrate (pseudo)independence of the values not only within each stream, but between streams, have been carried out. Notably, the Mersenne Twister ('mt19937ar') does not support multiple streams.
The two generator types that support multiple independent streams are the Combined Multiple Recursive ('mrg32k3a') and the Multiplicative Lagged Fibonacci ('mlfg6331_64') generators (the doc has references.) I'll create one of the former, and make it the default stream.
cmrg1 = RandStream.create('mrg32k3a','NumStreams',2,'StreamIndices',1); RandStream.setDefaultStream(cmrg1)
The first parameter, NumStreams, says how many independent streams I plan on using, in total. StreamIndices says which one (or which ones) I actually want to create right now. Note that I've used the RandStream.create factory method, which allows for creating multiple streams as a group, and that the display for this stream
cmrg1 = mrg32k3a random stream (current default) StreamIndex: 1 NumStreams: 2 Seed: 0 RandnAlg: Ziggurat
indicates that it is part of a larger group (even though I have not yet created the second stream).
Let's say I've made one run of a simulation using the above as the default stream, represented by these 100,000 random values.
x1 = rand(100000,1);
Now I want to make a second, independent simulation run. If the runs are short enough, I might just run one after another in the same MATLAB session, without modifying the state of the default stream between runs. Since successive values coming from a stream are statistically independent, there's nothing at all wrong with that. On the other hand, if I want to shut down MATLAB between runs, or I want to run on two machines in parallel, that isn't going to work.
What I can do, though, is to create the second of my two independent streams, and make it the default stream.
cmrg2 = RandStream.create('mrg32k3a','NumStreams',2,'StreamIndices',2) RandStream.setDefaultStream(cmrg2) x2 = rand(100000,1);
cmrg2 = mrg32k3a random stream StreamIndex: 2 NumStreams: 2 Seed: 0 RandnAlg: Ziggurat
Let's look at the correlation between the two streams.
ans = 1 0.0015578 0.0015578 1
That isn't by any means proof of statistical independence, but it is reassuring. In fact, much more thorough tests have been used to demonstrate between-stream independence of these generators.
As long as the streams are created correctly, it doesn't matter where or when I make the second simulation run, the results will still be independent from the first run. 'mrg32k3a' and 'mlfg6331_64' both support a huge number of parallel independent streams, and each stream is itself very, very long, so you can use multiple independent streams for very large simulations.
Remember substreams from my last post? I promised to explain why they're different than seeds, so here it is. Unlike seeds, which are located who-knows-where, who-knows-how-far-apart along the sequence of random numbers, the spacing between substreams is known, so any chance of overlap can be eliminated. And like independent parallel streams, research has been done to demonstrate statistical independence across substreams. In short, they are a more controlled way to do many of the same things that seeds have traditionally been used for, and a more lightweight solution than parallel streams.
Substreams provide a quick and easy way to ensure that you get different results from the same code at different times. For example, if I run this loop (from the last post)
defaultStream = RandStream.getDefaultStream(); for i = 1:5 defaultStream.Substream = i; z = rand(1,i) end
z = 0.32457 z = 0.25369 0.3805 z = 0.22759 0.34154 0.61441 z = 0.0054898 0.12139 0.70173 0.15851 z = 0.51784 0.37283 0.54639 0.14839 0.89321
and then run off to the Tahiti, and then come back (maybe), I can run this loop
for i = 6:10 defaultStream.Substream = i; z = rand(1,11-i) end
z = 0.48889 0.46817 0.62735 0.14538 0.84446 z = 0.66568 0.11152 0.035575 0.013499 z = 0.60589 0.28911 0.74664 z = 0.24981 0.78979 z = 0.25271
and be certain that I've generated random values independently of the first set of 5 iterations, and in fact have gotten exactly the same results as if I had never gone to Tahiti (though not gotten as good of a tan):
for i = 1:10 defaultStream.Substream = i; z = rand(1,min(i,11-i)); end
As always, I might run the second loop at a different time than the first, but on the same machine, or run it at the same time on a different machine. As long as the stream is set up correctly, it makes no difference.
It's important to remember that replacing the default stream, or changing its state, either by resetting it, by setting its state directly, or by positioning it to a new substream, affect all subsequent calls to rand, randn, and randi, and are things you should be careful about doing. If code you write is used by someone else who has no idea that you've done one of those things, your code may unintentionally invalidate the results of their simulation by affecting the statistical properties of the random numbers they use, or make it difficult for them to reproduce their results when they need to.
In general, I recommend leaving control of the default stream up to the end user as much as possible. If you are your own end user, it might be a good idea to execute "stream control" statements only at MATLAB startup, or only right before running a simulation.
In code that will be called by others, it's safe to generate values using the default stream, with no effect on the statistical properties of other random numbers generated in other parts of the code. But if possible, don't otherwise change the default stream's state or configuration by doing things like resetting or replacing the default stream. It's safest if those kinds of operations are done only by, or at the specific request of, the end user, so that it's clear when they are happening.
When that's not possible, if you need to somehow modify the default stream's state, consider "cleaning up after yourself" by saving and restoring the state before and after your code does whatever it needs to do. For example,
defaultStream = RandStream.getDefaultStream(); savedState = defaultStream.State; % some use of the default stream and its state defaultStream.State = savedState;
makes it as if your code did not modify the default stream at all.
Alternatively, consider using a "private stream", one that you create and draw values from without affecting the default stream. For example.
mystream = RandStream('mt19937ar','Seed',sum(100*clock)); x = rand(mystream,100,1);
generates values completely separately from the default stream. That call to rand (the method not the function) uses your private stream, not the default stream. Another possibility is to make the private stream the default, but only temporarily. For example,
mystream = RandStream('mt19937ar','Seed',sum(100*clock)); savedStream = RandStream.getDefaultStream(); RandStream.setDefaultStream(mystream); % some use of the (new) default stream and its state RandStream.setDefaultStream(savedStream);
allows your code to use your stream as the default, but also puts things back just as they were originally.
Do the examples I've covered leave out any situations involving reproducibility of random numbers, or "the opposite or reproducibility", that you've run into? Let me know here.
Get the MATLAB code
Published with MATLAB® 7.7
Comments are closed.
22 CommentsOldest to Newest
% Assume a matlabpool open stream = RandStream.create('mrg32k3a','seed',1234); parfor ii = 1:10 par(ii) = rand(stream); end stream = RandStream.create('mrg32k3a','seed',1234); for ii = 1:10 nopar(ii) = rand(stream); endNot only do I get different answers in "par" and "nopar", but the values in "par" are repeated! How do I avoid this issue while still maintaining reproducibility of my results?
stream = RandStream('mrg32k3a'); parfor ii = 1:10 set(stream,'Substream',ii); par(ii) = rand(stream); endThis uses (copies of) the same stream on all workers, but ties the values returned by rand to the iteration number, so regardless of whether you run this loop on one worker, or many, or change it to a for loop in serial mode, it will return the same vector. Ordinarily, you'd probably set the substream index by assigning to the property
stream.Substream = ii;but parfor won't let you do that. If you want to run this same loop again but get different (but still repeatable) results, you might: * use a different seed to create the stream, or * add some offset to the substream index. Using substreams gets you reproduceability. If you don't care about that, you most likely want to create a set of independent parallel streams using RandStream.create, and passing in labindex as the StreamIndices parameter.
stream0 = RandStream('mt19937ar','Seed',0); RandStream.setDefaultStream(stream0); a=rand(4e6,1); stream0 = RandStream('mt19937ar','Seed',0); RandStream.setDefaultStream(stream0); b=randn(1e6,1); c=rand(1); find(a==c) stream1 = RandStream('mrg32k3a','Seed',0); RandStream.setDefaultStream(stream1); a=rand(4e6,1); stream1 = RandStream('mrg32k3a','Seed',0); RandStream.setDefaultStream(stream1); b=randn(1e6,1); c=rand(1); find(a==c)The output is: ans = 1021808 ans = 2028211 Why does the mt19937ar algorithm uses only 1.0218 rand values on average? The mrg32k3a uses 2.028 on average as expected.
stream0 = RandStream('mt19937ar','Seed',0); RandStream.setDefaultStream(stream0); a=rand(3e6,1); stream0 = RandStream('mt19937ar','Seed',0); RandStream.setDefaultStream(stream0); b=randn(1e6,1); c=rand; find(a==c) stream1 = RandStream('mrg32k3a','Seed',0); RandStream.setDefaultStream(stream1); a=rand(3e6,1); stream1 = RandStream('mrg32k3a','Seed',0); RandStream.setDefaultStream(stream1); b=randn(1e6,1); c=rand; find(a==c)ans = 1021808 ans = 2028211