Bayesian Brain Teaser
Winter in Boston can get quite cold. When we get a lot of snow, we need to take a break after shoveling, and solving puzzles is a nice way to spend time indoors. Today's guest blogger, Toshi Takeuchi, gives you an interesting brain teaser, written during one of the many 2015 snowstorms in Boston.
Nate Silver and Bayesian Reasoning
In The Signal and the Noise, the famed data scientist Nate Silver discusses the precarious business of making predictions. The key message of the book is that we should adopt the Bayesian way of thinking - deal with uncertainty probabilistically and be willing to revise our prior beliefs as we gather more evidence.
How does it actually work? Let us use a fun brain teaser to work through an example.
The Monty Hall Problem - Simulation
The Monty Hall Problem is loosely based on a game show, "Let's Make a Deal," in which the host, Monty Hall, gives you a choice of three doors. He tells you that there is a car behind one of them, and you get to keep it if you pick the right door. If you pick a wrong door, you see a goat and get nothing. After you make your choice, Monty opens one of the two doors you didn't pick, revealing a goat. Then he asks whether you would like to stick with your original choice or switch to the other unopened door. At this point the options have been reduced to two doors, and only one of them hides the car, so it appears to be a 50/50 proposition.
Let's test this intuition by running simulations. We will play this game over many trials.
trials = 1000;
First, Monty randomly picks a door from 1, 2, or 3 to hide the car.
car_door = randi(3,trials,1);
You make a guess and pick a door you think hides the car.
choice = randi(3,trials,1);
Let's preview the first five trials.
T = table(car_door,choice);
disp(T(1:5,:))
    car_door    choice
    ________    ______
       3          2
       3          2
       1          3
       3          1
       2          2
At this point, Monty opens one of the unchosen doors and shows a goat. If the car door and chosen door are different, Monty has only one choice.
1 + 2 = 3 -> 3
1 + 3 = 4 -> 2
2 + 3 = 5 -> 1
Otherwise he can randomly choose either of the unopened doors to open.
1 + 1 = 2 -> 2 or 3
2 + 2 = 4 -> 1 or 3
3 + 3 = 6 -> 1 or 2
This solution is not so elegant, but here it goes.
goat_door = zeros(trials,1);
% If your choice differs from the car door, Monty has only one door he can open
goat_door(car_door + choice == 3) = 3;
goat_door(car_door + choice == 4 & car_door ~= choice) = 2;
goat_door(car_door + choice == 5) = 1;
% If your choice matches the car door, Monty picks one of the other two doors
% (note: a single random draw is shared by all trials with the same choice)
goat_door(goat_door == 0 & choice == 1) = randsample([2,3],1);
goat_door(goat_door == 0 & choice == 2) = randsample([1,3],1);
goat_door(goat_door == 0 & choice == 3) = randsample([1,2],1);
Let's update the preview with the goat door.
T = table(car_door,choice,goat_door);
disp(T(1:5,:))
    car_door    choice    goat_door
    ________    ______    _________
       3          2           1
       3          2           1
       1          3           2
       3          1           2
       2          2           3
At this point, you can stick with your original choice or switch. My intuition was that it would be a 50/50 proposition. Here is the simulated win rate if you choose to stay.
simulation = zeros(3,1);
simulation(1) = sum((car_door - choice) == 0)/trials*100;
fprintf('Win Rate if stick to your original choice: %.2f%%\n',simulation(1))
Win Rate if stick to your original choice: 33.70%
As you can see, staying with your original choice gives you a much lower win rate than I expected. Let's see what happens if you switch.
switch_choice = zeros(trials,1);
% The switched choice is the door that is neither your original choice nor the goat door
switch_choice(goat_door + choice == 3) = 3;
switch_choice(goat_door + choice == 4) = 2;
switch_choice(goat_door + choice == 5) = 1;
Let's update the preview with the switch and compute the win rate.
T = table(car_door,choice,goat_door,switch_choice);
disp(T(1:5,:))
simulation(3) = sum((car_door - switch_choice) == 0)/trials*100;
fprintf('Win Rate if switch your choice: %.2f%%\n',simulation(3))
    car_door    choice    goat_door    switch_choice
    ________    ______    _________    _____________
       3          2           1              3
       3          2           1              3
       1          3           2              1
       3          1           2              3
       2          2           3              1

Win Rate if switch your choice: 66.30%
Why Switching Is Better - Bayesian Analysis
Counterintuitively, the simulation shows that it is actually better to switch your choice. Now let's turn to Bayesian inference to figure out why, using Bayes' Rule.
$$P(H|E) = \frac{P(E|H)\cdot P(H)}{P(E)}$$
In this analysis let's assume that:
- You initially picked Door 1. This is your hypothesis, H
- Monty opened Door 2 to reveal a goat. This is your evidence, E
$$P(H)$$
The car is hidden randomly behind any of the three doors and you picked Door 1. The probability of Door 1 being the car door is 1/3. This is the prior, or P(H).
prior = ones(3,1)*1/3
prior =

    0.33333
    0.33333
    0.33333
$$P(E|H)$$
Now Monty has to choose Door 2 or 3 as the goat door, and he picked Door 2. It's time to think about the conditional probability of the evidence given hypothesis, the likelihood, or P(E|H).
You observed that Monty opened Door 2. His choice is not random - he actually knows the car door, and he needs to avoid it. What does his choice imply?
- If the car is behind Door 1, then Monty can choose either Door 2 or 3. So the probability that he chose Door 2 was 1/2.
- If the car were behind Door 2, then Monty couldn't have chosen it. So the probability was 0. Getting confused? I will revisit this case below.
- If the car is behind Door 3, then Monty had to choose Door 2. So the probability that he chose Door 2 was 1.
Let's go back to case #2. We already know Monty opened Door 2, so why are we talking about the probability of him opening it? Shouldn't that probability be 1, because he actually opened it? Well, in this case the probability is conditioned on the hypothesis under consideration being true. So for each case you need to figure out what Monty could have done under that assumption. We know Door 2 is not the car door, but we still want to know what Monty might have done if it were.
likelihood = [1/2;0;1]
likelihood =

    0.5
    0
    1
$$P(E|H)\cdot P(H)$$
Now we are ready to update the probabilities of hypotheses based on the new evidence using the equation. Let's calculate the joint probabilities of the prior and likelihood, the numerator of the equation.
joint_prob = likelihood .* prior
joint_prob =

    0.16667
    0
    0.33333
$$P(E)$$
The joint probabilities are not true probabilities because they don't sum up to 1. To get the updated probabilities of the hypotheses after the evidence, the posterior, we just need to normalize them: add up all the joint probabilities to get the total, and then divide each joint probability by that total. This total is called the normalizing constant, the probability of the evidence under all hypotheses, P(E).
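With the numbers above, the normalizing constant works out to

$$P(E) = \sum_i P(E|H_i)\cdot P(H_i) = \frac{1}{2}\cdot\frac{1}{3} + 0\cdot\frac{1}{3} + 1\cdot\frac{1}{3} = \frac{1}{2}$$

and dividing each joint probability by 1/2 gives the posterior below.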
$$P(H|E)$$
The posterior, P(H|E), should be pretty close to the simulation result.
posterior = joint_prob/sum(joint_prob);
compare = table(posterior,simulation,'RowNames',{'Stay','N/A','Switch'});
disp(compare([1 3],:))
              posterior    simulation
              _________    __________
    Stay      0.33333      33.7
    Switch    0.66667      66.3
Nate Silver's Method
In his book, Nate Silver uses a different formulation to calculate the posterior, but it is actually the same thing. Let's use the example from the book...
"Suppose you are living with a partner and come home from a business trip to discover a strange pair of underwear in your dresser drawer. You will probably ask yourself: what is the probability that your partner is cheating on you?"
- x is the prior, the probability of your spouse cheating before you found the underwear, which he puts at 4%
- y is the conditional probability or likelihood that you see the underwear because your spouse is cheating, 50%
- z is the conditional probability or likelihood that you see the underwear even though your spouse is not cheating, 5%.
- As the formula and code below show, the resulting posterior of your spouse cheating is only 29%, so don't jump to a premature conclusion!
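In equation form, with these definitions of x, y, and z, Bayes' Rule becomes

$$P(H|E) = \frac{xy}{xy + z(1-x)}$$

which is exactly what the code below computes.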
x = 0.04;   % prior: probability of cheating
y = 0.5;    % likelihood of finding the underwear if cheating
z = 0.05;   % likelihood of finding the underwear if not cheating
posterior = x*y/(x*y + z*(1-x))
posterior = 0.29412
The method we use, however, computes the posteriors for both hypotheses - cheating and not cheating.
prior = [x; 1-x];                 % cheating, not cheating
likelihood = [y; z];
joint_prob = likelihood .* prior;
posterior = joint_prob ./ sum(joint_prob)
posterior =

    0.29412
    0.70588
Bayesian Reasoning in Machine Learning
Bayes' Theorem is used in powerful machine learning algorithms. One fairly straightforward and useful example is Naive Bayes classification. It is often used for text classification tasks such as spam filtering or sentiment analysis.
Let's do a quick walk-through using a toy example of sentiment analysis. We want to infer the sentiment of a statement, positive or negative, based on the words it contains. A Naive Bayes classifier is available as the fitcnb function in Statistics Toolbox as of MATLAB R2014b.
Let's load the training examples.
training_examples = {{'i','love','this','food','very','much'}; ...
    {'that','left','a','bad','taste'}; ...
    {'the','movie','was','boring'}; ...
    {'i','had','a','very','good','time'}; ...
    {'she','has','such','a','nice','sweet','dog'}; ...
    {'the','ending','was','very','sad'}};
Those examples are pre-labeled as positive or negative.
sentiment_class = {'positive';'negative';'negative'; ...
    'positive';'positive';'negative'};
Here is the unlabeled example we will use to test the classifier.
test_example = {'she','was','a','very','sweet','girl'};
Turn the training data into a numeric matrix. This is a bag-of-words model that uses word counts (here, simply presence or absence) as predictors.
tokens = unique([training_examples{:}]);             % vocabulary of unique words
X = zeros(length(training_examples),length(tokens));
for i = 1:length(training_examples)
    X(i,:) = ismember(tokens,training_examples{i});  % 1 if the word appears in example i
end
Train a Naive Bayes model with a multinomial distribution, which is the best choice for the bag-of-words data we have. Even though we know that words have grammatical relationships and are not completely independent, we make the simplifying assumption that our features are independent so we can apply Bayes' Theorem - thus "naive". It works just fine nonetheless.
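Under that independence assumption, the posterior probability of a class given the words in a statement factorizes into the prior times a product of per-word likelihoods, which is the same form of Bayes' Rule we used above:

$$P(\mathrm{class} \mid w_1, \ldots, w_n) \propto P(\mathrm{class}) \cdot \prod_{i=1}^{n} P(w_i \mid \mathrm{class})$$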
Mdl = fitcnb(X,sentiment_class,'Distribution','mn','PredictorNames',tokens);
Now check the trained model with the test example.
[label,post,~] = predict(Mdl,double(ismember(Mdl.PredictorNames,test_example)));
fprintf('Test example: "%s"\nSentiment: %s\nPosterior: %.2f\n', ...
    strjoin(test_example,' '),label{1},post(strcmp(Mdl.ClassNames,label)))
Test example: "she was a very sweet girl" Sentiment: positive Posterior: 0.62
In this toy example, the trained model was able to classify it correctly.
Let's examine the details in the trained model Mdl to see how Bayes' Theorem is used internally.
Here is the prior. It is 1/2 for each class since we have 3 positive and 3 negative examples. Note that the numbers are listed in the order negative, then positive.
disp(table(Mdl.Prior(1),Mdl.Prior(2),'VariableNames',Mdl.ClassNames))
    negative    positive
    ________    ________
      0.5         0.5
What's the actual frequency of positive and negative statements? I have no idea. Often we don't know the prior, and in such cases we simply assign equal probabilities to each hypothesis. This is called a uniform prior.
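If you did have an estimate of the class frequencies, you could supply it yourself rather than relying on the frequencies of the training labels; fitcnb accepts prior probabilities through its 'Prior' option. Here is a quick sketch, with a made-up 40/60 split purely for illustration (the vector is ordered as in Mdl.ClassNames, negative then positive).

Mdl_informed = fitcnb(X,sentiment_class,'Distribution','mn', ...
    'PredictorNames',tokens,'Prior',[0.4 0.6]);   % hypothetical prior, not estimated from data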
Here is the likelihood for the words 'good' and 'bad'. You can see that 'bad' is twice as likely as 'good' in a negative statement, and 'good' is twice as likely as 'bad' in a positive statement.
disp(table(cell2mat(Mdl.DistributionParameters(:,ismember(Mdl.PredictorNames,'good'))), ...
    cell2mat(Mdl.DistributionParameters(:,ismember(Mdl.PredictorNames,'bad'))), ...
    'VariableNames',{'good','bad'},'RowNames',Mdl.ClassNames))
                  good        bad
                ________    ________
    negative    0.025       0.05
    positive    0.044444    0.022222
Since we are only interested in the relative scores of the negative and positive classes, we can ignore the normalizing constant, the denominator of the equation, because it is the same for either class. Therefore all you need is the prior and the conditional probabilities of the tokens to calculate the relative posterior scores of the two hypotheses, positive or negative.
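To make this concrete, here is a minimal sketch of recomputing those relative scores by hand from the stored prior and word probabilities. This is just the arithmetic implied by Bayes' Theorem, not the internal implementation of predict; the normalization in the last line is only there so the two scores can be read as probabilities, and the result should agree with the posterior reported above.

x_test = double(ismember(Mdl.PredictorNames,test_example))';  % bag-of-words vector for the test example
cond_prob = cell2mat(Mdl.DistributionParameters);             % 2-by-numTokens word probabilities
log_score = log(Mdl.Prior(:)) + log(cond_prob)*x_test;        % log prior + sum of log likelihoods per class
rel_posterior = exp(log_score)/sum(exp(log_score))            % negative vs positive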
"So what happens if you have 'not' in the statement?" you may ask. A very good point. This simplistic approach doesn't work so well with negation, or irony, for that matter. However, if you train your model on a large number of training examples, it actually works surprisingly well on a majority of real world cases.
To deal with negation like 'not good' or 'not bad' in a more sophisticated way, you need to use word pairs as features instead of single words. For the extra work you need to put in, however, the performance gain tends to be incremental.
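For instance, here is one rough way to build word-pair (bigram) features from the same training examples, so that a token like 'not_good' becomes a predictor of its own. This is only a sketch of the feature construction under that idea; you would still retrain the classifier on the new matrix.

make_bigrams = @(w) strcat(w(1:end-1),'_',w(2:end));   % e.g. {'not','good'} -> {'not_good'}
bigrams = cellfun(make_bigrams,training_examples,'UniformOutput',false);
bigram_tokens = unique([bigrams{:}]);
X2 = zeros(length(bigrams),length(bigram_tokens));
for i = 1:length(bigrams)
    X2(i,:) = ismember(bigram_tokens,bigrams{i});
end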
Likewise, it is easy to deal with irony - just come up with an appropriate way to capture the ironic features! (That was irony.)
It's Your Turn
I hope you now see how powerful and useful Bayesian reasoning is and are ready to apply it to your own problems. Do you have an interesting example of Bayesian reasoning? Please share it here.