{"id":1270,"date":"2015-11-23T08:54:31","date_gmt":"2015-11-23T13:54:31","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1270"},"modified":"2015-11-06T08:56:12","modified_gmt":"2015-11-06T13:56:12","slug":"swing-low-sweet-probability-guessing-the-results-of-every-match-in-the-2015-rugby-world-cup","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2015\/11\/23\/swing-low-sweet-probability-guessing-the-results-of-every-match-in-the-2015-rugby-world-cup\/","title":{"rendered":"Swing Low, Sweet Probability: Guessing the results of every match in the 2015 Rugby World Cup"},"content":{"rendered":"\r\n<div class=\"content\"><!--introduction--><p>Today's guest blogger is Matt Tearle, who works on our MATLAB training materials here at MathWorks. Originally from New Zealand, Matt was delighted with the <a href=\"http:\/\/www.bbc.com\/sport\/0\/rugby-union\/34671255\">All Blacks' recent victory at the 2015 Rugby World Cup<\/a>. What better way to celebrate than to analyze the results with MATLAB?<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#4d01b1be-5bc3-4151-8c3e-9aa437e5578c\">A short-lived competition<\/a><\/li><li><a href=\"#1053f586-5ebd-4626-a07f-eef838d66d79\">Win or lose: a simplistic analysis<\/a><\/li><li><a href=\"#468b5795-12ba-493c-947f-dd092a0938d0\">Win, lose, or draw: a more realistic approach<\/a><\/li><li><a href=\"#8f76b185-87a7-45e3-9525-269eacd2f1c2\">All things being unequal: building a strategy<\/a><\/li><li><a href=\"#bda0c4a1-8f04-4726-a981-5d9817ac2972\">Playing the percentages: a boring but effective strategy<\/a><\/li><li><a href=\"#3742a8a5-7dcb-439a-bbca-ff610aa9b247\">Full time: who won?<\/a><\/li><li><a href=\"#26b993c3-da2b-441f-bcf6-cd86fe2fbded\">Can you do better?<\/a><\/li><\/ul><\/div><h4>A short-lived competition<a name=\"4d01b1be-5bc3-4151-8c3e-9aa437e5578c\"><\/a><\/h4><p>The New Zealand <a href=\"https:\/\/en.wikipedia.org\/wiki\/New_Zealand_Racing_Board\">TAB<\/a> 
(betting agency) offered a $1 million prize to anyone who could correctly predict the result of all 48 matches in the 2015 Rugby World Cup. Nearly 48,000 people entered the free competition. However, only 79 of those entrants correctly picked Japan's <a href=\"http:\/\/www.skysports.com\/rugby-union\/south-africa-vs-japan\/69519\">shocking 34-32 upset<\/a> of South Africa. After only six games, <i>every<\/i> contestant was out of the running!<\/p><p>Would random guessing have been the best strategy? Even if things had gone according to form, how safe was the prize money? What are the chances of randomly picking 48 match results?<\/p><h4>Win or lose: a simplistic analysis<a name=\"1053f586-5ebd-4626-a07f-eef838d66d79\"><\/a><\/h4><p>If the competition was simply to pick a winner from two teams for each match, then guessing would be equivalent to calling 48 coin-tosses. The probability of <i>k<\/i> successes from <i>n<\/i> trials each with probability <i>p<\/i> is given by the binomial distribution:<\/p><p>$$B(k) = \\left( \\begin{array}{c}  n \\\\ k \\end{array} \\right) p^k (1-p)^{n-k}$$<\/p><p>This can be calculated manually:<\/p><pre class=\"codeinput\">b48 = nchoosek(48,48) * 0.5^48 * (1 - 0.5)^0\r\n<\/pre><pre class=\"codeoutput\">b48 =\r\n   3.5527e-15\r\n<\/pre><p>or with the <a href=\"https:\/\/www.mathworks.com\/help\/stats\/binopdf.html\"><tt>binopdf<\/tt><\/a> function in Statistics and Machine Learning Toolbox:<\/p><pre class=\"codeinput\">b48 = binopdf(48,48,0.5)\r\n<\/pre><pre class=\"codeoutput\">b48 =\r\n   3.5527e-15\r\n<\/pre><p>Pretty unlikely! If 48,000 participants all guessed randomly, the chance of having a winner is<\/p><pre class=\"codeinput\">anywin = 1 - binopdf(0,48000,b48)\r\n<\/pre><pre class=\"codeoutput\">anywin =\r\n   1.7053e-10\r\n<\/pre><p>That is, less than a 1-in-a-billion chance of the TAB having to pay out $1 million. 
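<\/p><p>The reasoning behind that calculation can be written out with two symbols introduced here for convenience: <i>N<\/i> for the number of entrants and <i>q<\/i> for one entrant's chance of a perfect entry, assuming the entries are independent. The chance of at least one winner is one minus the chance of no winners, and because <i>q<\/i> is tiny this is very nearly <i>Nq<\/i>.<\/p><p>$$1 - (1-q)^N \\approx Nq = 48000 \\times 3.5527 \\times 10^{-15} \\approx 1.7 \\times 10^{-10}$$<\/p><p>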
The house always wins!<\/p><p>Let's consider the probabilities of a range of successes:<\/p><pre class=\"codeinput\">k = 0:48;\r\nb = binopdf(k,48,0.5);\r\nbar(k,b)\r\nxlim([-1 49])\r\nxlabel(<span class=\"string\">'Number of correctly guessed results'<\/span>)\r\nylabel(<span class=\"string\">'Probability'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/guessingRWC2015_01.png\" alt=\"\"> <p>Not surprisingly, the most likely outcome is guessing half of the results. It can also be informative to visualize the chance of achieving <i>at least<\/i> a given number of correct results. The cumulative binomial distribution gives the probability of <i>k<\/i> successes <i>or fewer<\/i>. To obtain the probability of at least <i>k<\/i> successes, we need to manually accumulate:<\/p><pre class=\"codeinput\">bar(k,cumsum(b,<span class=\"string\">'reverse'<\/span>))\r\nxlim([-1 49])\r\nylim([0 1])\r\ngrid <span class=\"string\">on<\/span>\r\nxlabel(<span class=\"string\">'Minimum number of correctly guessed results'<\/span>)\r\nylabel(<span class=\"string\">'Probability'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/guessingRWC2015_02.png\" alt=\"\"> <p>Getting anything above 35\/48 is highly unlikely.<\/p><h4>Win, lose, or draw: a more realistic approach<a name=\"468b5795-12ba-493c-947f-dd092a0938d0\"><\/a><\/h4><p>But to make things worse, the game is not that simple. Firstly, in the group stages, a draw (tie) is a possible result. Furthermore, not only were entrants required to guess the winning team, but also the margin of victory, from the two possibilities of \"1-12 points\" or \"13 points or more\". 
In total, that's five possible choices for each match (Team A by 13+, Team A by 1-12, draw, Team B by 1-12, Team B by 13+).<\/p><p>Although that appears to complicate things considerably, determining the probability of guessing a given number of results correctly is still a binomial problem: a successful trial is simply a correct prediction. If all five outcomes are equally likely, each trial now has a success probability of 1\/5 instead of 1\/2. (The last 8 knockout-stage matches complicate the analysis a bit because they are not independent, and a draw is not allowed. Let's keep things simple and ignore those details.)<\/p><pre class=\"codeinput\">b = binopdf(k,48,0.2);\r\nsubplot(1,2,1)\r\nbar(k,b)\r\nxlim([-1 49])\r\nxlabel({<span class=\"string\">'Number of'<\/span>,<span class=\"string\">'correctly guessed results'<\/span>})\r\nylabel(<span class=\"string\">'Probability'<\/span>)\r\n<span class=\"comment\">% cumulative probability<\/span>\r\nsubplot(1,2,2)\r\nbar(k,cumsum(b,<span class=\"string\">'reverse'<\/span>))\r\nxlim([-1 49])\r\nylim([0 1])\r\ngrid <span class=\"string\">on<\/span>\r\nxlabel({<span class=\"string\">'Minimum number of'<\/span>,<span class=\"string\">'correctly guessed results'<\/span>})\r\nylabel(<span class=\"string\">'Probability'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/guessingRWC2015_03.png\" alt=\"\"> <p>The prize money is looking even safer now! Even 20 correct predictions is very unlikely. The probability of winning by guesswork is absurdly low:<\/p><pre class=\"codeinput\">b(end)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n   2.8147e-34\r\n<\/pre><h4>All things being unequal: building a strategy<a name=\"8f76b185-87a7-45e3-9525-269eacd2f1c2\"><\/a><\/h4><p>But there is some hope: the five results are <i>not<\/i> equally likely. A draw is a rare event. 
World Cup pool matches often have mismatches (like Japan vs South Afr-- OK, bad example!), which result in huge margins of victory. So maybe a good strategy would be to guess results with the same distribution as typical results.<\/p><p>Sounds like a good plan, but first we would need some actual data. Conveniently, <a href=\"http:\/\/www.lassen.co.nz\/pickandgo.php\">the internet exists<\/a>. Using <i>Pick and Go<\/i>'s handy interface, we can look up <a href=\"http:\/\/lassen.co.nz\/pickandgo.php?fyear=&amp;tyear=&amp;teama=ALL&amp;tourn=WC#hrh\">the results of all World Cup matches<\/a> from the first Cup in 1987 to the end of RWC 2015. Those results are stored in the spreadsheet <tt>WCresults.xlsx<\/tt>.<\/p><pre class=\"codeinput\">wcdata = readtable(<span class=\"string\">'WCresults.xlsx'<\/span>);\r\nwcdata = wcdata(:,{<span class=\"string\">'Date'<\/span>,<span class=\"string\">'Score'<\/span>});\r\n<\/pre><pre class=\"codeoutput\">Warning: Variable names were modified to make them valid MATLAB identifiers. \r\n<\/pre><p>The score is recorded as a string (<i>\"x-y\"<\/i>, where <i>x<\/i> and <i>y<\/i> are the scores for each team). We need to turn that into a winning margin. One approach would be to use regular expressions to extract <i>x<\/i> and <i>y<\/i>, use <tt>str2double<\/tt> to convert them to numbers, then subtract. But the margin is simply <i>x<\/i> - <i>y<\/i>, which is exactly what the string already says. If only there were a way to interpret the string as a calculation directly... But, wait, there is! The <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/str2num.html\"><tt>str2num<\/tt><\/a> function actually uses <tt>eval<\/tt> to interpret the string as a numeric expression (not just a number). 
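<\/p><p>On a single score string, the two routes can be sketched as follows. The variable names and the sample score are illustrative only; they are not part of the data set.<\/p><pre class=\"codeinput\">score = <span class=\"string\">'34-32'<\/span>;   <span class=\"comment\">% illustrative score string<\/span>\r\n<span class=\"comment\">% Route 1: extract the two numbers, convert them, then subtract<\/span>\r\ntok = regexp(score,<span class=\"string\">'(\\d+)-(\\d+)'<\/span>,<span class=\"string\">'tokens'<\/span>,<span class=\"string\">'once'<\/span>);\r\nxy = str2double(tok);\r\nmargin1 = xy(1) - xy(2);\r\n<span class=\"comment\">% Route 2: let str2num evaluate the string as an expression<\/span>\r\nmargin2 = str2num(score);\r\n<\/pre><p>Both routes give a margin of 2 for this string. 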
However, <tt>str2num<\/tt> works on individual strings, not cell arrays of strings, so we'll need to use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/cellfun.html\"><tt>cellfun<\/tt><\/a> as well.<\/p><pre class=\"codeinput\">wcdata.Margin = cellfun(@str2num,wcdata.Score);\r\n<\/pre><p>This adds a new variable <tt>Margin<\/tt> to our table. Now we need to bin <tt>Margin<\/tt> into the five categories required by the competition. This is easily done with the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/discretize.html\"><tt>discretize<\/tt><\/a> function introduced in R2015a. It can even return the result as a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/categorical-arrays.html\"><tt>categorical<\/tt><\/a> variable. Each bin includes its left edge but not its right, so the second edge is -12: that way an away win by exactly 13 (a margin of -13) lands in the \"Away 13+\" bin.<\/p><pre class=\"codeinput\">wincats = {<span class=\"string\">'Away 13+'<\/span>,<span class=\"string\">'Away 1-12'<\/span>,<span class=\"string\">'Draw'<\/span>,<span class=\"string\">'Home 1-12'<\/span>,<span class=\"string\">'Home 13+'<\/span>};\r\nwcdata.Result = discretize(wcdata.Margin,[-Inf,-12,0,1,13,Inf],<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'Categorical'<\/span>,wincats);\r\n<\/pre><p>We now need to split the data into historical results prior to 2015 and the 2015 results that we'll use to test our strategy. 
We can convert the dates to a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/represent-date-and-times-in-MATLAB.html\"><tt>datetime<\/tt><\/a> variable, which makes date comparisons easy.<\/p><pre class=\"codeinput\">wcdata.Date = datetime(wcdata.Date,<span class=\"string\">'InputFormat'<\/span>,<span class=\"string\">'eee, dd MM yyyy'<\/span>);\r\npre2015 = wcdata.Date &lt; datetime(2015,1,1);\r\nprevious = wcdata(pre2015,:);\r\ncurrent = wcdata(~pre2015,:);\r\n<\/pre><p>Let's see how the results have been distributed in the past:<\/p><pre class=\"codeinput\">subplot(1,1,1)\r\nhistogram(previous.Result,<span class=\"string\">'Normalization'<\/span>,<span class=\"string\">'pdf'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/guessingRWC2015_04.png\" alt=\"\"> <p>Sure enough, there are very few draws and many blowouts. Interestingly, though, the distribution is not symmetric. Apart from the host teams, there is no \"home\" or \"away\" team in a World Cup, so why is this so lopsided? The teams for each match are listed in a particular order (the first being designated as the \"home\" team). Clearly this order is not entirely random, as there's a strong bias toward large home wins. Whatever the rationale, as long as the same ordering convention was used for the 2015 World Cup as for the previous ones, it doesn't matter.<\/p><p>Assuming a strategy of guessing results according to this distribution, what is the probability of guessing correctly? If we can determine the probability for a single match, then the rest is just another binomial distribution. Although following the historical distribution makes intuitive sense, it might help to consider what will happen if we guess with any given distribution.<\/p><p>For simplicity, let's consider a coin-toss with a weighted coin that lands heads 75% of the time. Imagine we choose to guess randomly, guessing heads 2\/3 of the time. 
Then we will guess correctly 3\/4 of those 2\/3 times, and incorrectly 1\/4 of those 2\/3 times. We will also guess correctly 1\/4 of the 1\/3 times we guess tails, and incorrectly 3\/4 of the 1\/3 times. Overall,<\/p><pre class=\"codeinput\">dist_historic = [3\/4 1\/4];\r\ndist_guess = [2\/3 1\/3];\r\nformat <span class=\"string\">rat<\/span>\r\nallpossibilities = (dist_historic')*dist_guess\r\n<\/pre><pre class=\"codeoutput\">allpossibilities =\r\n       1\/2            1\/4     \r\n       1\/6            1\/12    \r\n<\/pre><p>In total, we're right 1\/2 (guessed heads, was heads) + 1\/12 (guessed tails, was tails) = 7\/12 of the time, and incorrect 1\/6 (guessed heads, was tails) + 1\/4 (guessed tails, was heads) = 5\/12 of the time. Note that the total correct proportion is<\/p><pre class=\"codeinput\">totalright = sum(diag(allpossibilities))\r\n<\/pre><pre class=\"codeoutput\">totalright =\r\n       7\/12    \r\n<\/pre><p>Or, equivalently,<\/p><pre class=\"codeinput\">totalright = dist_historic*(dist_guess')\r\n<\/pre><pre class=\"codeoutput\">totalright =\r\n       7\/12    \r\n<\/pre><p>Extending to multiple possibilities, the full set of outcomes is the outer product. 
The probability of success is the sum of the diagonal elements, which is equivalent to the inner product.<\/p><p>So now let's get the actual historical distribution values.<\/p><pre class=\"codeinput\">format <span class=\"string\">short<\/span>\r\ndist_historic = histcounts(previous.Result,<span class=\"string\">'Normalization'<\/span>,<span class=\"string\">'pdf'<\/span>)\r\n<\/pre><pre class=\"codeoutput\">dist_historic =\r\n    0.0605    0.1601    0.0107    0.1708    0.5979\r\n<\/pre><p>If we guess with this same distribution, then our probability of success in each match prediction is<\/p><pre class=\"codeinput\">dist_guess = dist_historic;\r\np = dist_historic*(dist_guess')\r\n<\/pre><pre class=\"codeoutput\">p =\r\n    0.4160\r\n<\/pre><p>Or, equivalently,<\/p><pre class=\"codeinput\">p = sum(dist_historic.^2)\r\n<\/pre><pre class=\"codeoutput\">p =\r\n    0.4160\r\n<\/pre><p>So it's just slightly worse than coin-tossing. But wait, the last element of <tt>dist_historic<\/tt> is 0.6, so we should be able to get a higher value of <i>p<\/i> just by putting a lot of weight on that:<\/p><pre class=\"codeinput\">dist_guess = [0 0 0 0.2 0.8];\r\np = dist_historic*(dist_guess')\r\ndist_guess = [0 0 0 0 1];\r\np = dist_historic*(dist_guess')\r\n<\/pre><pre class=\"codeoutput\">p =\r\n    0.5125\r\np =\r\n    0.5979\r\n<\/pre><p>Either of these is slightly <i>better<\/i> than a coin toss.<\/p><pre class=\"codeinput\">b = binopdf(k,48,p);\r\nsubplot(1,2,1)\r\nbar(k,b)\r\nxlim([-1 49])\r\nxlabel({<span class=\"string\">'Number of'<\/span>,<span class=\"string\">'correctly guessed results'<\/span>})\r\nylabel(<span class=\"string\">'Probability'<\/span>)\r\n<span class=\"comment\">% cumulative probability<\/span>\r\nsubplot(1,2,2)\r\nbar(k,cumsum(b,<span class=\"string\">'reverse'<\/span>))\r\nxlim([-1 49])\r\nylim([0 1])\r\ngrid <span class=\"string\">on<\/span>\r\nxlabel({<span class=\"string\">'Minimum number of'<\/span>,<span class=\"string\">'correctly guessed 
results'<\/span>})\r\nylabel(<span class=\"string\">'Probability'<\/span>)\r\n<span class=\"comment\">% probability of winning<\/span>\r\nb(end)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n   1.8921e-11\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/guessingRWC2015_05.png\" alt=\"\"> <h4>Playing the percentages: a boring but effective strategy<a name=\"bda0c4a1-8f04-4726-a981-5d9817ac2972\"><\/a><\/h4><p>So what is the <i>optimal<\/i> guessing strategy? We need to determine <tt>dist_guess<\/tt> such that <i>p<\/i> is maximized. But, being a distribution, the elements of <tt>dist_guess<\/tt> need to add to 1 (and be between 0 and 1). This is a constrained optimization problem. The objective function is <i>p<\/i>, which is linear in <tt>dist_guess<\/tt>. Hence, using <a href=\"https:\/\/www.mathworks.com\/help\/optim\/ug\/linprog.html\"><tt>linprog<\/tt><\/a> from Optimization Toolbox,<\/p><pre class=\"codeinput\">dist_guess = linprog(-dist_historic',[],[],<span class=\"keyword\">...<\/span>\r\n    ones(1,5),1,zeros(5,1),ones(5,1))\r\n<\/pre><pre class=\"codeoutput\">Optimization terminated.\r\ndist_guess =\r\n    0.0000\r\n    0.0000\r\n    0.0000\r\n    0.0000\r\n    1.0000\r\n<\/pre><p>Being a linear problem, the solution lies at one of the vertices of the convex feasible region. Hence, the best strategy is simply to guess the most likely outcome all the time. How well does this do? The theoretical result is above: <i>p<\/i> = 0.6 (meaning 29\/48 correct on average).<\/p><p>How would it have done in 2015 in practice? 
Given that we're guessing a 13+ Home Team win for every match, we just need to know how many such results occurred in 2015.<\/p><pre class=\"codeinput\">numcorrect = sum(current.Result == wincats{5})\r\nfraccorrect = numcorrect\/48\r\n<\/pre><pre class=\"codeoutput\">numcorrect =\r\n    26\r\nfraccorrect =\r\n    0.5417\r\n<\/pre><p>For comparison, it should be noted that, using at least some knowledge of rugby, I personally predicted ... 27 results correctly! Yes, I could have done about as well by simply predicting a 13+ Home Team win for every match. (Unless the TAB makes their data public, I don't know how that compares to others.)<\/p><h4>Full time: who won?<a name=\"3742a8a5-7dcb-439a-bbca-ff610aa9b247\"><\/a><\/h4><p>Probability can be counterintuitive, at times. Surely just guessing the same thing every time can't be the best way to win? Sure, you'll get the right answer most of the time, but you're guaranteed to be wrong some of the time, too, right? Except that there are no guarantees with probability. 
If the results of the matches are themselves random variables from a given distribution (<tt>dist_historic<\/tt>), then the probability of getting 48 correct predictions by always guessing the same outcome is the same as the probability of 48 randomly selected games having that outcome.<\/p><pre class=\"codeinput\">rng(2015)\r\nnexp = 1e5;\r\nedges = cumsum([0 dist_historic]);\r\nsimresults = rand(48,nexp);\r\nsimresults = discretize(simresults,edges,<span class=\"string\">'Categorical'<\/span>,wincats);\r\nnum13plusHome = sum(simresults == wincats{5});\r\nsubplot(1,1,1)\r\nhistogram(num13plusHome,<span class=\"string\">'BinMethod'<\/span>,<span class=\"string\">'integers'<\/span>,<span class=\"string\">'Normalization'<\/span>,<span class=\"string\">'pdf'<\/span>)\r\nxlabel(<span class=\"string\">'Number of correctly guessed results'<\/span>)\r\nylabel(<span class=\"string\">'Probability'<\/span>)\r\nb = binopdf(k,48,p);\r\nhold <span class=\"string\">on<\/span>\r\nplot(k,b)\r\nhold <span class=\"string\">off<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/guessingRWC2015_06.png\" alt=\"\"> <p>The probability of winning the TAB's money is, therefore,<\/p><pre class=\"codeinput\">bestchance = b(end)\r\n<\/pre><pre class=\"codeoutput\">bestchance =\r\n   1.8921e-11\r\n<\/pre><p>Better than $3 \\times 10^{-34}$, but still bad -- 1-in-50-billion bad. 
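<\/p><p>As a rough check on that figure, with <i>p<\/i> the per-match success probability found above (rounded to four digits here):<\/p><p>$$p^{48} \\approx 0.5979^{48} \\approx 1.9 \\times 10^{-11} \\approx \\frac{1}{5.3 \\times 10^{10}}$$<\/p><p>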
If we take that as a representative figure for the chances of a single contestant correctly predicting all 48 results (regardless of their strategy), the chance of anyone winning is<\/p><pre class=\"codeinput\">anywin = 1 - binopdf(0,48000,bestchance)\r\n<\/pre><pre class=\"codeoutput\">anywin =\r\n   9.0820e-07\r\n<\/pre><p>If the TAB ran this competition numerous times, with 48,000 entrants each time, their expected (average) payout each time would be<\/p><pre class=\"codeinput\">avgpayout = anywin*1e6\r\n<\/pre><pre class=\"codeoutput\">avgpayout =\r\n    0.9082\r\n<\/pre><p>$1 is not bad for advertising that reaches 48,000 people!<\/p><h4>Can you do better?<a name=\"26b993c3-da2b-441f-bcf6-cd86fe2fbded\"><\/a><\/h4><p>Surely I should be able to do better than 27\/48 correct predictions. But how? Consult <a href=\"http:\/\/www.nzherald.co.nz\/nz\/news\/article.cfm?c_id=1&amp;objectid=11538484\">Richie the Macaw<\/a> (rugby's answer to Paul the Octopus)? Or maybe use MATLAB to build a better strategy. Statistics? Machine Learning? Some kind of ranking system? 
If you can think of a strategy to out-guess me (and I'm not sure I'm setting a particularly high standard there), let me know <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1270#respond\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_581928a8a3a6493d8522df04bd60dc9f() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='581928a8a3a6493d8522df04bd60dc9f ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 581928a8a3a6493d8522df04bd60dc9f';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2015 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_581928a8a3a6493d8522df04bd60dc9f()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2015b<br><\/p><\/div><!--\r\n581928a8a3a6493d8522df04bd60dc9f ##### SOURCE BEGIN #####\r\n%% Swing Low, Sweet Probability: Guessing the results of every match in the 2015 Rugby World Cup\r\n% Today's guest blogger is Matt Tearle, who works on our MATLAB training\r\n% materials here at MathWorks. Originally from New Zealand, Matt was\r\n% delighted with the <http:\/\/www.bbc.com\/sport\/0\/rugby-union\/34671255 All\r\n% Blacks' recent victory at the 2015 Rugby World Cup>. What better way to\r\n% celebrate than to analyze the results with MATLAB?\r\n\r\n%% A short-lived competition\r\n% The New Zealand <https:\/\/en.wikipedia.org\/wiki\/New_Zealand_Racing_Board\r\n% TAB> (betting agency) offered a $1 million prize to anyone who could\r\n% correctly predict the result of all 48 matches in the 2015 Rugby World\r\n% Cup. Nearly 48,000 people entered the free competition. 
However, only 79\r\n% of those entrants correctly picked Japan's\r\n% <http:\/\/www.skysports.com\/rugby-union\/south-africa-vs-japan\/69519\r\n% shocking 34-32 upset> of South Africa. After only six games, _every_\r\n% contestant was out of the running!\r\n% \r\n% Would random guessing have been the best strategy? Even if things had\r\n% gone according to form, how safe was the prize money? What are the\r\n% chances of randomly picking 48 match results?\r\n% \r\n%% Win or lose: a simplistic analysis\r\n% If the competition was simply to pick a winner from two teams for each\r\n% match, then guessing would be equivalent to calling 48 coin-tosses. The\r\n% probability of _k_ successes from _n_ trials each with probability _p_ is\r\n% given by the binomial distribution:\r\n% \r\n% $$B(k) = \\left( \\begin{array}{c}  n \\\\ k \\end{array} \\right) p^k (1-p)^{n-k}$$\r\n% \r\n% This can be calculated manually:\r\n\r\nb48 = nchoosek(48,48) * 0.5^48 * (1 - 0.5)^0\r\n%% \r\n% or with the <https:\/\/www.mathworks.com\/help\/stats\/binopdf.html |binopdf|>\r\n% function in Statistics and Machine Learning Toolbox:\r\n\r\nb48 = binopdf(48,48,0.5)\r\n%% \r\n% Pretty unlikely! If 48,000 participants all guessed randomly, the chance \r\n% of having a winner is\r\n\r\nanywin = 1 - binopdf(0,48000,b48)\r\n%% \r\n% That is, less than a 1-in-a-billion chance of the TAB having to pay out \r\n% $1 million. The house always wins!\r\n% \r\n% Let's consider the probabilities of a range of successes:\r\n\r\nk = 0:48;\r\nb = binopdf(k,48,0.5);\r\nbar(k,b)\r\nxlim([-1 49])\r\nxlabel('Number of correctly guessed results')\r\nylabel('Probability')\r\n%% \r\n% Not surprisingly, the most likely outcome is guessing half of the\r\n% results. It can also be informative to visualize the chance of achieving\r\n% _at least_ a given number of correct results. The cumulative binomial\r\n% distribution gives the probability of _k_ successes _or fewer_. 
To obtain\r\n% the probability of at least _k_ successes, we need to manually\r\n% accumulate:\r\n\r\nbar(k,cumsum(b,'reverse'))\r\nxlim([-1 49])\r\nylim([0 1])\r\ngrid on\r\nxlabel('Minimum number of correctly guessed results')\r\nylabel('Probability')\r\n%% \r\n% Getting anything above 35\/48 is highly unlikely.\r\n% \r\n%% Win, lose, or draw: a more realistic approach\r\n% But to make things worse, the game is not that simple. Firstly, in the\r\n% group stages, a draw (tie) is a possible result. Furthermore, not only\r\n% were entrants required to guess the winning team, but also the margin of\r\n% victory, from the two possibilities of \"1-12 points\" or \"13 points or\r\n% more\". In total, that's five possible choices for each match (Team A by\r\n% 13+, Team A by 1-12, draw, Team B by 1-12, Team B by 13+).\r\n% \r\n% Although that appears to complicate things considerably, determining the\r\n% probability of guessing a given number of results correctly is still a\r\n% binomial problem: a successful trial is simply a correct prediction. If\r\n% everything is equal, each successful trial now has a probability of 1\/5\r\n% instead of 1\/2. (The last 8 knockout-stage matches complicate the\r\n% analysis a bit because they are not independent, and a draw is not\r\n% allowed. Let's keep things simple and ignore those details.)\r\n\r\nb = binopdf(k,48,0.2);\r\nsubplot(1,2,1)\r\nbar(k,b)\r\nxlim([-1 49])\r\nxlabel({'Number of','correctly guessed results'})\r\nylabel('Probability')\r\n% cumulative probability\r\nsubplot(1,2,2)\r\nbar(k,cumsum(b,'reverse'))\r\nxlim([-1 49])\r\nylim([0 1])\r\ngrid on\r\nxlabel({'Minimum number of','correctly guessed results'})\r\nylabel('Probability')\r\n%% \r\n% The prize money is looking even safer now! Even 20 correct predictions is\r\n% very unlikely. 
The probability of winning by guesswork is absurdly low:\r\n\r\nb(end)\r\n%% \r\n% \r\n%% All things being unequal: building a strategy\r\n% But there is some hope: the five results are _not_ equally likely. A draw\r\n% is a rare event. World Cup pool matches often have mismatches (like Japan\r\n% vs South AfrREPLACE_WITH_DASH_DASH OK, bad example!), which result in huge margins of\r\n% victory. So maybe a good strategy would be to guess results with the same\r\n% distribution as typical results.\r\n% \r\n% Sounds like a good plan, but first we would need some actual data.\r\n% Conveniently, <http:\/\/www.lassen.co.nz\/pickandgo.php the internet\r\n% exists>. Using _Pick and Go_'s handy interface, we can look up\r\n% <http:\/\/lassen.co.nz\/pickandgo.php?fyear=&tyear=&teama=ALL&tourn=WC#hrh\r\n% the results of all world cup matches> from the first Cup in 1987 to the\r\n% end of RWC 2015. The result is stored in the spreadsheet\r\n% |WCresults.xlsx|.\r\n\r\nwcdata = readtable('WCresults.xlsx');\r\nwcdata = wcdata(:,{'Date','Score'});\r\n%% \r\n% The score is recorded as a string (_\"x-y\"_, where _x_ and _y_ are the\r\n% scores for each team). We need to turn that into a result. One approach\r\n% would be to use regular expressions to extract _x_ and _y_, use\r\n% |str2double| to convert them to numbers, then calculate the result. But\r\n% that result will be calculated as _x_ - _y_. If only there was a way to\r\n% interpret the string as a calculation directly... But, wait, there is!\r\n% The <https:\/\/www.mathworks.com\/help\/matlab\/ref\/str2num.html |str2num|>\r\n% function actually uses |eval| to interpret the string as a numeric\r\n% expression (not just a number). 
However, |str2num| works on individual\r\n% strings, not cell arrays of strings, so we'll need to use\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/cellfun.html |cellfun|> as\r\n% well.\r\n\r\nwcdata.Margin = cellfun(@str2num,wcdata.Score);\r\n%% \r\n% This adds a new variable |Margin| to our table. Now we need to bin\r\n% |Margin| into the five categories required by the competition. This is\r\n% easily done with the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/discretize.html |discretize|>\r\n% function introduced in R2015a. It can even return the result as a\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/categorical-arrays.html\r\n% |categorical|> variable.\r\n\r\nwincats = {'Away 13+','Away 1-12','Draw','Home 1-12','Home 13+'};\r\n% Each bin includes its left edge, so [-Inf,-12) holds away wins by 13+\r\nwcdata.Result = discretize(wcdata.Margin,[-Inf,-12,0,1,13,Inf],...\r\n    'Categorical',wincats);\r\n%% \r\n% We now need to split the data into historical results prior to 2015 and\r\n% the 2015 results that we'll use to test our strategy. We can convert the\r\n% dates to a\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/represent-date-and-times-in-MATLAB.html\r\n% |datetime|> variable, which then makes logic easy.\r\n\r\nwcdata.Date = datetime(wcdata.Date,'InputFormat','eee, dd MM yyyy');\r\npre2015 = wcdata.Date < datetime(2015,1,1);\r\nprevious = wcdata(pre2015,:);\r\ncurrent = wcdata(~pre2015,:);\r\n%% \r\n% Let's see how the results have been distributed in the past:\r\nsubplot(1,1,1)\r\nhistogram(previous.Result,'Normalization','pdf')\r\n%% \r\n% Sure enough, there are very few draws and many blowouts. Interestingly,\r\n% though, the distribution is not symmetric. Apart from the host teams,\r\n% there is no \"home\" or \"away\" team in a World Cup, so how is this so\r\n% lopsided? The teams for each match are listed in a particular order (the\r\n% first being designated as the \"home\" team). Clearly this order is not\r\n% entirely random, as there's a strong bias to large home wins. 
Whatever\r\n% the rationale, as long as it is used the same for the 2015 World Cup as\r\n% for the previous ones, it doesn't matter.\r\n% \r\n% Assuming a strategy of guessing results according to this distribution,\r\n% what is the probability of guessing correctly? If we can determine the\r\n% probability for a single match, then the rest is just another binomial\r\n% distribution. Although following the historical distribution makes\r\n% intuitive sense, it might help to consider what will happen if we guess\r\n% with any given distribution.\r\n% \r\n% For simplicity, let's consider a coin-toss with a weighted coin that\r\n% lands heads 75% of the time. Imagine we choose to guess randomly,\r\n% guessing heads 2\/3 of the time. Then we will guess correctly 3\/4 of those\r\n% 2\/3 times, and incorrectly 1\/4 of those 2\/3 times. We will also guess\r\n% correctly 1\/4 of the 1\/3 times we guess tails, and incorrectly 3\/4 of the\r\n% 1\/3 times. Overall,\r\n\r\ndist_historic = [3\/4 1\/4];\r\ndist_guess = [2\/3 1\/3];\r\nformat rat\r\nallpossibilities = (dist_historic')*dist_guess\r\n%% \r\n% In total, we're right 1\/2 (guessed heads, was heads) + 1\/12 (guessed\r\n% tails, was tails) = 7\/12 of the time, and incorrect 1\/6 (guessed heads,\r\n% was tails) + 1\/4 (guessed tails, was heads) = 5\/12 of the time. Note that\r\n% the total correct proportion is\r\n\r\ntotalright = sum(diag(allpossibilities))\r\n%% \r\n% Or, equivalently,\r\n\r\ntotalright = dist_historic*(dist_guess')\r\n%% \r\n% Extending to multiple possibilities, the full set of outcomes is the\r\n% outer product. 
The probability of success is the sum of the diagonal\r\n% elements, which is equivalent to the inner product.\r\n% \r\n% So now let's get the actual historical distribution values.\r\n\r\nformat short\r\ndist_historic = histcounts(previous.Result,'Normalization','pdf')\r\n%% \r\n% If we guess with this same distribution, then our probability of success \r\n% in each match prediction is\r\n\r\ndist_guess = dist_historic;\r\np = dist_historic*(dist_guess')\r\n%% \r\n% Or, equivalently,\r\n\r\np = sum(dist_historic.^2)\r\n%% \r\n% So it's just slightly worse than coin-tossing. But wait, the last element\r\n% of |dist_historic| is 0.6, so we should be able to get a higher value of\r\n% _p_ just by putting a lot of weight on that:\r\n\r\ndist_guess = [0 0 0 0.2 0.8];\r\np = dist_historic*(dist_guess')\r\ndist_guess = [0 0 0 0 1];\r\np = dist_historic*(dist_guess')\r\n%% \r\n% Either of these is slightly _better_ than a coin toss.\r\n\r\nb = binopdf(k,48,p);\r\nsubplot(1,2,1)\r\nbar(k,b)\r\nxlim([-1 49])\r\nxlabel({'Number of','correctly guessed results'})\r\nylabel('Probability')\r\n% cumulative probability\r\nsubplot(1,2,2)\r\nbar(k,cumsum(b,'reverse'))\r\nxlim([-1 49])\r\nylim([0 1])\r\ngrid on\r\nxlabel({'Minimum number of','correctly guessed results'})\r\nylabel('Probability')\r\n% probability of winning\r\nb(end)\r\n%% Playing the percentages: a boring but effective strategy\r\n% So what is the _optimal_ guessing strategy? We need to determine\r\n% |dist_guess| such that _p_ is maximized. But, being a distribution, the\r\n% elements of |dist_guess| need to add to 1 (and be between 0 and 1). This\r\n% is a constrained optimization problem. The objective function is _p_,\r\n% which is linear in |dist_guess|. 
Hence, using\r\n% <https:\/\/www.mathworks.com\/help\/optim\/ug\/linprog.html |linprog|> from\r\n% Optimization Toolbox,\r\n\r\ndist_guess = linprog(-dist_historic',[],[],...\r\n    ones(1,5),1,zeros(5,1),ones(5,1))\r\n%% \r\n% Being a linear problem, the solution lies at one of the vertices of the\r\n% convex feasible region. Hence, the best strategy is simply to guess the\r\n% most likely outcome all the time. How well does this do? The theoretical\r\n% result is above: _p_ = 0.6 (meaning 29\/48 correct on average).\r\n% \r\n% How would it have done in 2015 in practice? Given that we're guessing a\r\n% 13+ Home Team win for every match, we just need to know how many such\r\n% results occurred in 2015.\r\n\r\nnumcorrect = sum(current.Result == wincats{5})\r\nfraccorrect = numcorrect\/48\r\n%% \r\n% For comparison, it should be noted that, using at least some knowledge of\r\n% rugby, I personally predicted ... 27 results correctly! Yes, I could have\r\n% done about as well by simply predicting a 13+ Home Team win for every\r\n% match. (Unless the TAB makes their data public, I don't know how that\r\n% compares to others.)\r\n% \r\n%% Full time: who won?\r\n% Probability can be counterintuitive, at times. Surely just guessing the\r\n% same thing every time can't be the best way to win? Sure, you'll get the\r\n% right answer most of the time, but you're guaranteed to be wrong some of\r\n% the time, too, right? Except that there are no guarantees with\r\n% probability. 
If the results of the matches are themselves random\r\n% variables from a given distribution (|dist_historic|), then the\r\n% probability of getting 48 correct predictions by always guessing the same\r\n% outcome is the same as the probability of 48 randomly selected games\r\n% having that outcome.\r\n\r\nrng(2015)\r\nnexp = 1e5;\r\nedges = cumsum([0 dist_historic]);\r\nsimresults = rand(48,nexp);\r\nsimresults = discretize(simresults,edges,'Categorical',wincats);\r\nnum13plusHome = sum(simresults == wincats{5});\r\nsubplot(1,1,1)\r\nhistogram(num13plusHome,'BinMethod','integers','Normalization','pdf')\r\nxlabel('Number of correctly guessed results')\r\nylabel('Probability')\r\nb = binopdf(k,48,p);\r\nhold on\r\nplot(k,b)\r\nhold off\r\n%% \r\n% The probability of winning the TAB's money is, therefore,\r\n\r\nbestchance = b(end)\r\n%% \r\n% Better than $3 \\times 10^{-34}$, but still bad: 1-in-50-billion bad. If\r\n% we take that as a representative figure for the chances of a single\r\n% contestant correctly predicting all 48 results (regardless of their\r\n% strategy), the chance of anyone winning is\r\n\r\nanywin = 1 - binopdf(0,48000,bestchance)\r\n%% \r\n% If the TAB ran this competition numerous times, with 48,000 entrants each\r\n% time, their expected (average) payout each time would be\r\n\r\navgpayout = anywin*1e6\r\n%% \r\n% $1 is not bad for advertising that reaches 48,000 people!\r\n\r\n%% Can you do better?\r\n% Surely I should be able to do better than 27\/48 correct predictions. But\r\n% how? Consult\r\n% <http:\/\/www.nzherald.co.nz\/nz\/news\/article.cfm?c_id=1&objectid=11538484\r\n% Richie the Macaw> (rugby's answer to Paul the Octopus)? Or maybe use\r\n% MATLAB to build a better strategy. Statistics? Machine Learning? Some\r\n% kind of ranking system? 
If you can think of a strategy to out-guess me\r\n% (and I'm not sure I'm setting a particularly high standard there), let me\r\n% know <https:\/\/blogs.mathworks.com\/loren\/?p=1270#respond here>.\r\n\r\n##### SOURCE END ##### 581928a8a3a6493d8522df04bd60dc9f\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/guessingRWC2015_06.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>Today's guest blogger is Matt Tearle, who works on our MATLAB training materials here at MathWorks. Originally from New Zealand, Matt was delighted with the <a href=\"http:\/\/www.bbc.com\/sport\/0\/rugby-union\/34671255\">All Blacks' recent victory at the 2015 Rugby World Cup<\/a>. What better way to celebrate than to analyze the results with MATLAB?... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/11\/23\/swing-low-sweet-probability-guessing-the-results-of-every-match-in-the-2015-rugby-world-cup\/\">read more 
>><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,33,48,1],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1270"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1270"}],"version-history":[{"count":2,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1270\/revisions"}],"predecessor-version":[{"id":1272,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1270\/revisions\/1272"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}