Practical Advice for People on the Leading Edge

[6 3 7 8 5 1 2 4 9 10] – or “A Story of Surprise About Randomness”

Back in late March, Tom Rhys Marshall discovered something about MATLAB’s handling of random numbers that he found both surprising and concerning!
The randperm function generates a random permutation of integers, so randperm(10) will return all integers from 1 to 10 in some random order. In a fresh MATLAB session, that 'random' order is always the same:
randperm(10)
ans = 1×10
6 3 7 8 5 1 2 4 9 10
As noted by Tom and others, the world is split into two types of people: those who think this is fine and commonly known, and those who had no idea this was true and are very surprised and worried.
What if I told you... random numbers on a computer aren't random!
For legal reasons, I'm told I can't use well-known memes on this blog, but you probably know the one I have in mind, right? Anyway, the thing about random numbers in software like MATLAB, Python or R is that they are not random. They are completely deterministic. By design, however, the numbers returned by these 'pseudorandom number' algorithms have statistical properties that closely mimic those of genuinely random numbers so, provided we are careful, we can go ahead with our Monte Carlo simulations and so on as if everything really was random.
Let me say that again. All 'random' number generators used in most modern programming languages and simulation platforms are completely deterministic. If you are going to do anything serious using random numbers then this is a lesson you have to learn sooner or later.
For quite a while now, the default algorithm used by MATLAB's random number generators has been one called Mersenne Twister. MathWorks' implementation of this, in common with many other implementations, will give you a different set of random numbers according to the 'seed' you set. The seed is a non-negative integer and is set using the rng() function. When you start MATLAB, the seed is set to 0 by default, so setting it to 0 yourself reproduces the start-up sequence:
rng(0)
randperm(10)
ans = 1×10
6 3 7 8 5 1 2 4 9 10
A different seed gives a different result:
ans = 1×10
3 6 5 7 4 8 9 1 10 2
Some people seem to believe that you have to do something fancy to the seed to get numbers that are somehow 'more random', but this isn't the case. You can use any non-negative integer and you'll be fine for most common, simple uses of random numbers. (It gets more complicated when dealing with parallel simulations! See the relevant references at the end of the post.)
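If you'd like to check the seed behaviour for yourself, here is a minimal sketch. It assumes the default Mersenne Twister generator is active, and the variable names (first, second, third) are just my own labels.

```matlab
rng(0);                  % the start-up seed described above
first = randperm(10);

rng(0);                  % reset to the same seed
second = randperm(10);   % replays the same permutation as 'first'

rng(1);                  % any other non-negative integer seed
third = randperm(10);    % expected to differ from 'first'

disp(isequal(first, second))   % true: same seed, same sequence
disp(isequal(first, third))    % compare the two seeds for yourself
```

Resetting the seed rewinds the generator, which is exactly why a fresh session keeps handing out the same permutation.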
A design choice: What should the seed be at start-up?
If you follow the replies to Tom's Twitter thread, you'll see some people pointing out that some other systems do not behave as MATLAB does. The default generator in Python, for example, will (probably) give you a different set of random numbers every time you start it. This isn't because these systems are somehow more random than MATLAB; it's because they've chosen a different way of setting the seed.
A common method of setting the seed at startup is to make use of the system time. The idea is that you use the current time to generate an integer and you use this as the seed. Voila! Every time you start up your system, you get a different seed and hence a different set of random numbers.
You can do this in MATLAB too, if you want, by using rng("shuffle"):
rng("shuffle") % Use the system clock to set the seed
randperm(10)
ans = 1×10
7 4 5 6 1 3 9 2 10 8
Now it could be argued (and many did on that Twitter thread!) that this is somehow better than MATLAB's way of always using the same seed at start-up. Arguments for thinking this is a good idea include:
  • It's more random. (The numbers are certainly different every time. This may or may not be useful.)
  • One can run two sessions and get "independent" results. (Maybe! In practice it's more subtle, and there is a better way to do this. There is no mathematical guarantee that two sequences produced in this way will be statistically independent, although you'll probably get away with it.)
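One documented route to streams that are designed to be independent is RandStream.create with a generator that supports multiple streams. This is my assumption about the kind of 'better way' meant above, and the generator name ('mrg32k3a') and seed are illustrative choices rather than recommendations. A minimal sketch:

```matlab
% Create two streams from one generator that supports multiple streams.
% Generator name and seed are illustrative assumptions.
[s1, s2] = RandStream.create('mrg32k3a', 'NumStreams', 2, 'Seed', 0);

x1 = rand(s1, 1, 5);   % draws for session (or worker) 1
x2 = rand(s2, 1, 5);   % draws for session (or worker) 2

disp(x1)
disp(x2)
```

Each session gets its own stream object, and both remain reproducible because the seed and stream index are known.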
At some point in the past, however, MathWorks made the design decision to place reproducibility above all other concerns, and this is something that many of our users have come to rely on. Using the system clock to seed the random number generator (without recording what seed was actually used) is anathema to reproducibility.
If you disagree with this design decision, then you can go ahead and use rng("shuffle"); we won't mind!
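If you do go the shuffle route, you can still keep things reproducible by recording the settings that were actually used. A minimal sketch, assuming you are happy to save the struct returned by rng somewhere alongside your results (the variable names are mine):

```matlab
rng("shuffle");            % seed from the system clock
s = rng;                   % capture the current settings, including s.Seed
fprintf("Seed used: %d\n", s.Seed);

a = randperm(10);          % some 'random' work

rng(s);                    % later: restore the recorded settings
b = randperm(10);          % replays exactly the same permutation

disp(isequal(a, b))
```

Note that calling rng with no inputs after the shuffle is what returns the new settings; the seed itself is in the Seed field.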
Bug report: My 'random' numbers are different
Back in 2008, MathWorks changed the default random number algorithm used by MATLAB and all hell broke loose. Many MATLAB users filed bug reports because their tests failed. Some users, it turns out, validate algorithms on exactly reproducible random numbers and get very upset when these change. Of course, it has always been possible to choose the algorithm and seed explicitly, but many people use the default settings. Arguably bad practice, to be sure, but extremely common!
MathWorks had to suffer that change of default, though, because there were fundamental issues with the old algorithm, issues that Mersenne Twister was designed to rectify. Fast forward to 2022 and Mersenne Twister is still a fine choice for a lot of work, and MathWorks has added several additional, optional algorithms for those cases when users want or need something different.
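If your tests depend on exact random values, one way to avoid relying on the defaults is to name both the seed and the algorithm. A minimal sketch; the seed value and the choice of "combRecursive" as the alternative generator are illustrative assumptions on my part:

```matlab
rng(0, "twister");          % pin both the seed and the algorithm
x = rand(1, 3);

rng(0, "twister");
y = rand(1, 3);             % identical to x

s = rng;                    % s.Type and s.Seed record what is in use

rng(0, "combRecursive");    % one of the optional generators
z = rand(1, 3);             % same seed, different algorithm

rng("default");             % put the start-up settings back
disp(isequal(x, y))
```

With the generator named in the code, a future change of default should not alter these particular results.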
Changing the default behaviour is not something MathWorks takes lightly! We've got millions of users who depend on things the way they are.
My story: Why using the system clock as seed is not always a good idea
Way back in the day, before I learned about the ways of random numbers on computers, I was a member of a bunch of geeks at the University of Manchester who turned all of the University's desktop machines into an ad-hoc supercomputer using a technology called Condor. By the standards of the day, it was quite a large resource with around 5000 CPU cores available at peak times. We were hungry for users.
A researcher reached out to us with the perfect application. It was a Monte Carlo simulation programmed in something that wasn't MATLAB, and he got decades' worth of CPU time completed over a weekend. He was very happy until he started sifting through the results.
Every time he ran the simulation on his desktop he got a different result, as expected. On our system, however, he'd get groups of identical results. Sometimes just a few identical results, other times hundreds, with no seeming pattern. We were baffled because we didn't understand exactly how random numbers worked back then.
All of the computers we were using had their system clocks synced using the internet. Many (but not all) jobs started up at the exact same time so they had the exact same seed and hence the same stream of random numbers.
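The failure is easy to reproduce in miniature. The sketch below is not the researcher's code: the start times are made up, and the mean of some uniform draws stands in for a real Monte Carlo estimate.

```matlab
% Hypothetical start times (whole seconds) for six jobs; several coincide.
startTimes = [1000 1000 1000 1001 1001 1002];
numJobs = numel(startTimes);
results = zeros(numJobs, 1);

for k = 1:numJobs
    rng(startTimes(k));                 % clock-derived seed, as in the story
    results(k) = mean(rand(1, 1000));   % stand-in for a Monte Carlo estimate
end
numDistinct = numel(unique(results));   % jobs sharing a start time agree exactly

% Giving each job its own known seed (here, its index) removes the duplicates.
fixedResults = zeros(numJobs, 1);
for k = 1:numJobs
    rng(k);
    fixedResults(k) = mean(rand(1, 1000));
end
numDistinctFixed = numel(unique(fixedResults));

fprintf("Clock seeds: %d distinct results from %d jobs\n", numDistinct, numJobs);
fprintf("Per-job seeds: %d distinct results from %d jobs\n", numDistinctFixed, numJobs);
```

Distinct seeds only get rid of the duplicates. They don't by themselves guarantee statistically independent streams, which is the subtlety with parallel simulations mentioned earlier.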
Earlier in this post, I said "All 'random' number generators used in most modern programming languages and simulation platforms are completely deterministic" and every user of random number generators needs to learn this sometime.
This was my time! When was yours?
Learn more
Random numbers are a popular topic and there have been several blog posts over the years, not to mention the official MATLAB documentation. Here are a few I recommend if you want to dig deeper into all this.
Thanks to Peter Perkins and Michelle Hirsch for proof-reading and suggestions. Any remaining mistakes are mine!