## Loren on the Art of MATLABTurn ideas into MATLAB

Note

Loren on the Art of MATLAB has been archived and will not be updated.

# Speed Dating Experiment

Valentine's day is fast approaching and those who are in a relationship might start thinking about plans. For those who are not as lucky, read on! Today's guest blogger, Today's guest blogger, Toshi Takeuchi, explores how you can be successful at speed dating events through data.

### Contents

#### Speed Dating Dataset

I recently came across an interesting Kaggle dataset Speed Dating Experiment - What attributes influence the selection of a romantic partner?. I never experienced speed dating, so I got curious.

The data comes from a series of heterosexual speed dating experiements at Columbia University from 2002-2004. In these experiments, you each met all of you opposite-sex participants for four minutes. The number of the first dates varied by the event - on average there were 15, but it could be as few as 5 or as many as 22. Then you were asked if you would like to meet any of them again. You also provided ratings on six attributes about your dates:

• Attractiveness
• Sincerity
• Intelligence
• Fun
• Ambition
• Shared Interests

The dataset also includes participants' preferences on those attributes at various point in the process, along with other demographic information.

opts = detectImportOptions('Speed Dating Data.csv');                % set import options
opts.VariableTypes([9,38,50:end]) = {'double'};                     % treat as double
opts.VariableTypes([35,49]) = {'categorical'};                      % treat as categorical
csv = readtable('Speed Dating Data.csv', opts);                     % import data
csv.tuition = str2double(csv.tuition);                              % convert to double
csv.zipcode = str2double(csv.zipcode);                              % convert to double
csv.income = str2double(csv.income);                                % convert to double


#### What Participants Looked For In the Opposite Sex

The participants filled out survey questions when they signed up, including what they were looking for in the opposite sex and what they thought others of their own gender looked for. If you take the mean ratings and then subtract the self-rating from the peer rating, you see that participants thought others were more into looks while they were also into sincerity and intelligence. Sounds a bit biased. We will see how they actually made decisions.

vars = csv.Properties.VariableNames;                                % get var names
[G, res] = findgroups(csv(:,{'iid','gender'}));                     % group by id and gender
[~,idx,~] = unique(G);                                              % get unique indices
pref = table2array(csv(idx,contains(vars,'1_1')));                  % subset pref as array
pref(isnan(pref)) = 0;                                              % replace NaN with 0
pref1_1 = pref ./ sum(pref,2) * 100;                                % convert to 100-pt alloc
pref = table2array(csv(idx,contains(vars,'4_1')));                  % subset pref as array
pref(isnan(pref)) = 0;                                              % replace NaN with 0
pref4_1 = pref ./ sum(pref,2) * 100;                                % convert to 100-pt alloc
labels = {'attr','sinc','intel','fun','amb','shar'};                % attributes
figure                                                              % new figure
b = bar([mean(pref4_1(res.gender == 1,:),'omitnan') - ...           % bar plot
mean(pref1_1(res.gender == 1,:),'omitnan'); ...
mean(pref4_1(res.gender == 0,:),'omitnan') - ...
mean(pref1_1(res.gender == 0,:),'omitnan')]');
b(1).FaceColor = [0 .45 .74]; b(2).FaceColor = [.85 .33 .1];        % change face color
b(1).FaceAlpha = 0.6; b(2).FaceAlpha = 0.6;                         % change face alpha
title('What Peers Look For More Than You Do in the Opposite Sex?')  % add title
xticklabels(labels)                                                 % label bars
ylabel('Differences in Mean Ratings - Peers vs Self')               % y axis label


#### Were They Able to Find Matches?

Let's first find out how successful those speed dating events were. If both you and your partner request another date after the first one, then you have a match. What percentage of initial dates resulted in matches?

• Most people found matches - the median match rate was around 13%
• A few people were extremely successful, getting more than an 80% match rate
• About 17-19% of men and women found no matches, unfortunately
• All in all, it looks like these speed dating events delivered the promised results
res.rounds = splitapply(@numel, G, G);                              % # initial dates
res.matched = splitapply(@sum, csv.match, G);                       % # matches
res.matched_r = res.matched ./ res.rounds;                          % match rate
edges = [0 0.4 1:9]./ 10;                                           % bin edges
figure                                                              % new figure
histogram(res.matched_r(res.gender == 1), edges,...                 % histogram of
'Normalization','probability')                                  % male match rate
hold on                                                             % overlay another plot
histogram(res.matched_r(res.gender == 0), edges,...                 % histogram of
'Normalization','probability')                                  % female match rate
hold off                                                            % stop overlay
title('What Percentage of the First Dates Resulted in Matches?')    % add title
xlabel(sprintf('%% Matches (Median - Men %.1f%%, Women %.1f%%)', ...% x-axis label
median(res.matched_r(res.gender == 1))*100, ...                 % median men
median(res.matched_r(res.gender == 0))*100))                    % median women
xticklabels(string(0:10:90))                                        % use percentage
xlim([-0.05 0.95])                                                  % x-axis range
ylabel('% Participants')                                            % y-axis label
yticklabels(string(0:5:30))                                         % use percentage
text(-0.04,0.21,'0 matches')                                        % annotate


#### Do You Get More Matches If You Make More Requests?

In order to make a match, you need to request another date and get that request accepted. This means some people who got very high match rate must have requested a second date with almost everyone they met and they got their favor returned. Does that mean people who made fewer matches were more picky and didn't request another date as often as those who were more successful? Let's plot the request rate vs. match rate - if they correlate, then we should see a diagonal line!

• You can see some correlation below a 50% request rate - particularly for women. The more requests they make, the more matches they seem to get, to a point
• There is a clear gender gap in request rate - women tend to make fewer requests - the median for men is 44% vs. women for 37%
• If you request everyone, your mileage varies - you may still get no matches. In the end, you only get matches if your requests are accepted
res.requests = splitapply(@sum, csv.dec, G);                        % # requests
res.request_r = res.requests ./ res.rounds;                         % request rate
figure                                                              % new figure
scatter(res.request_r(res.gender == 1), ...                         % scatter plot male
res.matched_r(res.gender == 1),'filled','MarkerFaceAlpha', 0.6)
hold on                                                             % overlay another plot
scatter(res.request_r(res.gender == 0), ...                         % scatter plot female
res.matched_r(res.gender == 0),'filled','MarkerFaceAlpha', 0.6)
r = refline(1,0); r.Color = 'r';r.LineStyle = ':';                  % reference line
hold off                                                            % stop overlay
title('Do You Get More Matches If You Ask More?')                   % add title
xlabel('% Second Date Requests')                                    % x-axis label
xticklabels(string(0:10:100))                                       % use percentage
ylabel('% Matches')                                                 % y-axis label
yticklabels(string(0:50:100))                                       % use percentage
histogram(res.request_r(res.gender == 1),...                        % histogram of
'Normalization','probability')                                  % male match rate
hold on                                                             % overlay another plot
histogram(res.request_r(res.gender == 0),...                        % histogram of
'Normalization','probability')                                  % female match rate
hold off                                                            % stop overlay
title('Do Women Make Fewer Requests Than Men?')                     % add title
xlabel(sprintf('%% Second Date Requests (Median - Men %.1f%%, Women %.1f%%)', ...
median(res.request_r(res.gender == 1))*100, ...                 % median men
median(res.request_r(res.gender == 0))*100))                    % median women
xticklabels(string(0:10:100))                                       % use percentage
ylabel('% Participants')                                            % y-axis label
yticklabels(string(0:5:20))                                         % use percentage


#### Decision to Say Yes for Second Date

Most people are more selective about who they go out with for second date. The two most obvious factors in such a decision are how much they liked who they met and how likely they think they will get a "yes" from them. There is no point in asking a person for a second date no matter how much you like him or her when you feel there's no chance of getting a yes. You can see a fairly good correlation between Yes and No using those two factors.

Please note that I normalized the ratings for "like" and "probability" (of getting yes) because the actual rating scale differs from person to person. You subtract the mean from the data to center the data around zero and then divide by the standard deviation to normalize the value range.

func = @(x) {(x - mean(x))./std(x)};                                % normalize func
f = csv.like;                                                       % get like
f(isnan(f)) = 0;                                                    % replace NaN with 0
f = splitapply(func, f, G);                                         % normalize by group
like = vertcat(f{:});                                               % add normalized like
f = csv.prob;                                                       % get prob
f(isnan(f)) = 0;                                                    % replace NaN with 0
f = splitapply(func, f, G);                                         % normalize by group
prob = vertcat(f{:});                                               % add normalized prob
figure                                                              % new figure
gscatter(like(csv.gender == 1), ...                                 % scatter plot male
prob(csv.gender == 1),csv.dec(csv.gender == 1),'rb','o')
title('Second Date Decision - Men')                                 % add title
xlabel('Like, normalized')                                          % x-axis label
ylabel('Prob of Get "Yes", normalized')                             % y-axis label
ylim([-5 5])                                                        % y-axis range
gscatter(like(csv.gender == 0), ...                                 % scatter plot female
prob(csv.gender == 0),csv.dec(csv.gender == 0),'rb','o')
title('Second Date Decision - Women')                               % add title
xlabel('Like, normalized')                                          % x-axis label
ylabel('Prob of Get "Yes", normalized')                             % y-axis label
ylim([-5 5])                                                        % y-axis range


#### Factors Influencing the Decision

If you can correctly guess the probability of getting a yes, that should help a lot. Can we make such a prediction using observable and discoverable factors only?

Features were generated using normalization and other techniques - see feature_eng.m for more details. To determine which features matter more, the resulting feature set was then split into training and holdout sets and the training set was used to generate a Random Forest model bt_all.mat using the Classification Learner app. What's nice about a Random Forest is that it can show you the predictor importance estimates based on how error increases if you randomly change the value of particular predictors. If they don't matter, it shouldn't increase the error rate much, right?

Based on those scores, the most important features are:

• attrgap - difference between partners' attractiveness rated by participants and their own self rating
• attr - rating of attractiveness participants gave to their partners
• shar - rating of shared interests participants gave to their partners
• fun - rating of fun participants gave to their partners
• field_cd - participants' field of study

The least important features are:

• agegap - difference between the ages of participants and their partners
• order - in which part of the event they first met - earlier or later
• samerace - whether participants and their partners were the same race
• race_o - the race of the partners
feature_eng                                                         % engineer features
imp = oobPermutedPredictorImportance(bt_all.ClassificationEnsemble);% get predictor importance
vars = bt_all.ClassificationEnsemble.PredictorNames;                % predictor names
figure                                                              % new figure
[~,rank] = sort(imp,'descend');                                     % get ranking
bar(imp(rank(1:10)));                                               % plot top 10
title('Out-of-Bag Permuted Predictor Importance Estimates');        % add title
ylabel('Estimates');                                                % y-axis label
xlabel('Top 10 Predictors');                                        % x-axis label
xticks(1:20);                                                       % set x-axis ticks
xticklabels(vars(rank(1:10)))                                       % label bars
xtickangle(45)                                                      % rotate labels
ax = gca;                                                           % get current axes
ax.TickLabelInterpreter = 'none';                                   % turn off latex
[~,rank] = sort(imp);                                               % get ranking
bar(imp(rank(1:10)));                                               % plot bottom 10
title('Out-of-Bag Permuted Predictor Importance Estimates');        % add title
ylabel('Estimates');                                                % y-axis label
xlabel('Bottom 10 Predictors');                                     % x-axis label
xticks(1:10);                                                       % set x-axis ticks
xticklabels(vars(rank(1:10)))                                       % label bars
xtickangle(45)                                                      % rotate labels
ax = gca;                                                           % get current axes
ax.TickLabelInterpreter = 'none';                                   % turn off latex


#### Validatng the Model with the Holdout Set

To be confident about the predictor values, let's check its predictive perfomance. The model was retained without the bottom 2 predictors - see bt_45.mat and the resulting model can predict the decision of participants to 79.6% accuracy. This looks a lot better than human participants did.

load bt_45                                                          % load trained model
Y = holdout.dec;                                                    % ground truth
X = holdout(:,1:end-1);                                             % predictors
Ypred = bt_45.predictFcn(X);                                        % prediction
c = confusionmat(Y,Ypred);                                          % get confusion matrix
disp(array2table(c, ...                                             % show the matrix as table
'VariableNames',{'Predicted_No','Predicted_Yes'}, ...
'RowNames',{'Actual_No','Actual_Yes'}));
accuracy = sum(c(logical(eye(2))))/sum(sum(c))                      % classification accuracy

                  Predicted_No    Predicted_Yes
____________    _____________
Actual_No     420              66
Actual_Yes    105             246
accuracy =
0.7957


#### Relative Attractiveness

We saw that relative attractiveness was a major factor in the decision to say yes. attrgap scores indicate how much more attractive the partners were relative to the participants. As you can see, people tend to say yes when their partners are more attractive than themselves regardless of gender.

• This is a dilemma, because if you say yes to people who are more attractive than you are, they are more likely to say no because you are less attractive.
• But if you have some redeeming quality, such as having more shared interests or being fun, then you may be able to get yes from more attractive partners
• This applies to both genders. Is it just me, or does it look like men may be more willing to say yes to less attractive partners while women tends to more more receptive to fun partners? Loren isn't sure - she thinks it's just me.
figure                                                              % new figure
gscatter(train.fun(train.gender == '1'), ...                        % scatter male
train.attrgap(train.gender == '1'),train.dec(train.gender == '1'),'rb','o')
title('Second Date Decision - Men')                                 % add title
xlabel('Partner''s Fun Rating')                                     % x-axis label
ylabel('Partner''s Relative Attractivenss')                         % y-axis label
ylim([-4 4])                                                        % y-axis range
gscatter(train.fun(train.gender == '0'), ...                        % scatter female
train.attrgap(train.gender == '0'),train.dec(train.gender == '0'),'rb','o')
title('Second Date Decision - Women')                               % add title
xlabel('Partner''s Fun Rating')                                     % x-axis label
ylabel('Partner''s Relative Attractivenss')                         % y-axis label
ylim([-4 4])                                                        % y-axis range


#### Are We Good at Assessing Our Own Attractiveness?

If relative attractiveness is one of the key factors in our decision to say "yes", how good are we at assessing our own attractiveness? Let's compare the self-rating for attractivess to the average ratings participants received. If you subtract the average ratings received from the self rating, you can see how much people overestimate their attractiveness.

• We are generally overestimating our own attractiveness - the median is almost as as much 1 point higher out of 1-10 scale
• Men tend to overestimate more than women
• If you overestimate, then you are more likely to be overconfident about the probability you will get "yes" answers
res.attr_mu = splitapply(@(x) mean(x,'omitnan'), csv.attr_o, G);    % mean attr ratings
[~,idx,~] = unique(G);                                              % get unique indices
res.attr3_1 = csv.attr3_1(idx);                                     % remove duplicates
res.atgap = res.attr3_1 - res.attr_mu;                              % add the rating gaps
figure                                                              % new figure
histogram(res.atgap(res.gender == 1),'Normalization','probability') % histpgram male
hold on                                                             % overlay another plot
histogram(res.atgap(res.gender == 0),'Normalization','probability') % histpgram female
hold off                                                            % stop overlay
title('How Much People Overestimate Their Attractiveness')         % add title
xlabel(['Rating Differences ' ...                                   % x-axis label
sprintf('(Median - Men %.2f, Women %.2f)', ...
median(res.atgap(res.gender == 1),'omitnan'), ...
median(res.atgap(res.gender == 0),'omitnan'))])
ylabel('% Participants')                                            % y-axis label


#### Attractiveness is in the Eye of the Beholder

One possible reason we are not so good at judging our own attractiveness is that for majority of people it it in the eye of the beholder. If you plot to standard deviations of ratings people received, the spread is pretty wide, especially for men.

figure
res.attr_sigma = splitapply(@(x) std(x,'omitnan'), csv.attr_o, G);  % sigma of attr ratings
histogram(res.attr_sigma(res.gender == 1), ...                      % histpgram male
'Normalization','probability')
hold on                                                             % overlay another plot
histogram(res.attr_sigma(res.gender == 0), ...                      % histpgram female
'Normalization','probability')
hold off                                                            % stop overlay
xlabel('Standard Deviations of Attractiveness Ratings Received')    % x-axis label
ylabel('% Participants')                                            % y-axis label
yticklabels(string(0:5:30))                                         % use percentage


#### Modesty is the Key to Success

Given that people are not always good at assessing their own attractiveness, how does it affect the ultimate goal - getting matches? Let's focus on average looking people (people who rate themselves 6 or 7) only to level the field and see how the self assessment affects the outcome.

• People who estimated their attractiveness realistically, with the gap from the mean received rating less than 0.9, did reasonably well
• People who overestimated performed the worst
• People who underestimated performed the best
is_realistic = abs(res.atgap) < 0.9;                                % lealitics estimate
is_over = res.atgap >= 0.9;                                         % overestimate
is_under = res.atgap <= -0.9;                                       % underestimate
is_avg = ismember(res.attr3_1,6:7);                                 % avg looking group
figure                                                              % new figure
scatter(res.attr_mu(is_avg & is_over), ...                          % plot overestimate
res.matched_r(is_avg & is_over))
hold on                                                             % overlay another plot
scatter(res.attr_mu(is_avg & is_under), ...                         % plot underestimate
res.matched_r(is_avg & is_under))
scatter(res.attr_mu(is_avg & is_realistic), ...                     % plot lealitics
res.matched_r(is_avg & is_realistic))
hold off                                                            % stop overlay
title('Match Rates among Self-Rated 6-7 in Attractiveness')         % add title
xlabel('Mean Attractiveness Ratings Recevided')                     % x-axis label
ylabel('% Matches')                                                 % y-axis label
'Realistic Estimators','Location','NorthWest')


#### Summary

It looks like you can get matches in speed dating as long as you set your expectations appropriately. Here are some of my suggestions.

• Relative attractiveness is more important than people admit, because you are not going to learn a lot about your partners in four minutes.
• But who people find attractive varies a lot.
• You can still do well if you have more shared interests or more fun - so use your four minutes wisely.
• Be modest about your own looks and look for those are also modest about their looks - you will more likely get a match.

We should also remember that the data comes from those who went to Columbia at that time - as indicated by such variables as the field of study - and therefore the findings may not generalize to other situations.

Of course I also totally lack practical experience in speed dating - if you do, please let us know what you think of this analysis compared to your own experience below.

Published with MATLAB® R2016b