Predicting When People Quit Their Jobs

著者 Loren Shure, January 5, 2017

39 ビュー (過去 30 日間) | 0 いいね | 8 コメント

2017 is upon us and that means some of you may be going into your annual review or thinking about your career after graduation. Today's guest blogger, Toshi Takeuchi used machine learning on a job-related dataset for predictive analytics. Let's see what he learned.

Dataset
Big Picture - How Bad Is Turnover at This Company?
Defining Who Are the "Best"
Defining Who Are the "Most Experienced"
Job Satisfaction Among High Risk Group
Was It for Money?
Making Insights Actionable with Predictive Analytics
Evaluating the Predictive Performance
Using the Model to Take Action
Explaining the Model
Operationalizing Action
Summary

Companies spend money and time recruiting talent and they lose all that investment when people leave. Therefore companies can save money if they can intervene before their employees leave. Perhaps this is a sign of a robust economy, that one of the datasets popular on Kaggle deals with this issue: Human Resources Analytics - Why are our best and most experienced employees leaving prematurely? Note that this dataset is no longer available on Kaggle.

This is an example of predictive analytics where you try to predict future events based on the historical data using machine learning algorithms. When people talk about predictive analytics, you hear often that the key is to turn insights into action. What does that really mean? Let's also examine this question through exploration of this dataset.

Dataset

The Kaggle page says "our example concerns a big company that wants to understand why some of their best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next."

The fields in the dataset include:

Employee satisfaction level, scaling 0 to 1
Last evaluation, scaling 0 to 1
Number of projects
Average monthly hours
Time spent at the company in years
Whether they have had a work accident
Whether they have had a promotion in the last 5 years
Sales (which actually means job function)
Salary - low, medium or high
Whether the employee has left

Let's load it into MATLAB. The new detectImportOptions makes it easy to set up import options based on the file content.

opts = detectImportOptions('HR_comma_sep.csv');     % set import options
opts.VariableTypes(9:10) = {'categorical'};         % turn text to categorical
csv = readtable('HR_comma_sep.csv', opts);          % import data
fprintf('Number of rows: %d\n',height(csv))         % show number of rows

Number of rows: 14999

We will then hold out 10% of the dataset for model evaluation (holdout), and use the remaining 90% (train) to explore the data and train predictive models.

rng(1)                                              % for reproducibility
c = cvpartition(csv.left,'HoldOut',0.1);            % partition data
train = csv(training(c),:);                         % for training
holdout = csv(test(c),:);                           % for model evaluation

Big Picture - How Bad Is Turnover at This Company?

The first thing to understand is how bad a problem this company has. Assuming each row represents an employee, this company employs close to 14999 people over some period and about 24% of them left this company in that same period (not stated). Is this bad? Turnover is usually calculated on an annual basis and we don't know what period this dataset covers. Also the turnover rate differs from industry to industry. That said, this seems pretty high for a company of this size with an internal R&D team.

When you break it down by job function, the turnover seems to be correlated to job satisfaction. For example, HR and accounting have low median satisfaction levels and high turnover ratios where as R&D and Management have higher satisfaction levels and lower turnover ratios.

Please note the use of new functions in R2016b: xticklabels to set x-axis tick labels and xtickangle to rotate x-axis tick labels.

[g, job] = findgroups(train.sales);                 % group by sales
job = cellstr(job);                                 % convert to cell
job = replace(job,'and','&');                       % clean up
job = replace(job,'_',' ');                         % more clean up
job = regexprep(job,'(^| )\s*.','${upper($0)}');    % capitalize first letter
job = replace(job,'Hr','HR');                       % more capitalize
func = @(x) sum(x)/numel(x);                        % anonymous function
turnover = splitapply(func, train.left, g);         % get group stats
figure                                              % new figure
yyaxis left                                         % left y-axis
bar(turnover*100)                                   % turnover percent
xticklabels(job)                                    % label bars
xtickangle(45)                                      % rotate labels
ylabel('Employee Turnover %')                       % left y-axis label
title({'Turnover & Satisfaction by Job Function'; ...
    sprintf('Overall Turnover %.1f%%', sum(train.left)/height(train)*100)})
hold on                                             % overlay another plot
satisfaction = splitapply(@median, ...              % get group median
    train.satisfaction_level, g);
yyaxis right                                        % right y-axis
plot(1:length(job),satisfaction)                    % plot median line
ylabel('Median Employee Satisfaction')              % right y-axis label
ylim([.5 .67])                                      % scale left y-axis
hold off                                            % stop overlay

Defining Who Are the "Best"

We are asked to analyze why the best and most experienced employees are leaving. How do we identify who are the "best"? For the purpose of this analysis, I will use the performance evaluation score to determine who are high performers. As the following histogram shows, employees with lower scores as well as higher scores tend to leave, and people with average scores are less likely to leave. The median score is 0.72. Let's say anyone with 0.8 or higher scores are high performers.

figure                                              % new figure
histogram(train.last_evaluation(train.left == 0))   % histogram of those stayed
hold on                                             % overlay another plot
histogram(train.last_evaluation(train.left == 1))   % histogram of those left
hold off                                            % stop overlay
xlabel(...                                          % x-axis label
    sprintf('Last Evaluation - Median = %.2f',median(train.last_evaluation)))
ylabel('# Employees')                               % y-axis label
legend('Stayed','Left')                             % add legend
title('Distribution of Last Evaluation')            % add title

Defining Who Are the "Most Experienced"

Among high performers, the company is particularly interested in "most experienced" people - let's use Time Spent at Company to measure the experience level. The plot shows that high performers with 4 to 6 years of experience are at higher risk of turnover.

hp = train(train.last_evaluation >= 0.8,:);         % subset high performers

figure                                              % new figure
histogram(hp.time_spend_company(hp.left == 0))      % histogram of those stayed
hold on                                             % overlay another plot
histogram(hp.time_spend_company(hp.left == 1))      % histogram of those left
hold off                                            % stop overlay
xlabel(...                                          % x-axis label
    sprintf('Time Spent @ Company - Median = %.2f',median(hp.time_spend_company)))
ylabel('# Employees')                               % y-axis label
legend('Stayed','Left')                             % add legend
title('Time Spent @ Company Among High Performers') % add title

Job Satisfaction Among High Risk Group

Let's isolate the at-risk group and see how their job satisfaction stacks up. It is interesting to see that not only people with very low satisfaction levels (no surprise) but also the highly satisfied people left the company. It seems like people with a satisfaction level of 0.7 or higher are at an elevated risk.

at_risk = hp(hp.time_spend_company >= 4 &...        % subset high performers
    hp.time_spend_company <= 6,:);                  % with 4-6 years experience

figure                                              % new figure
histogram(at_risk.satisfaction_level(...            % histogram of those stayed
    at_risk.left == 0))
hold on                                             % overlay another plot
histogram(at_risk.satisfaction_level(...            % histogram of those left
    at_risk.left == 1))
hold off                                            % stop overlay
xlabel(...                                          % x-axis label
    sprintf('Satisfaction Level - Median = %.2f',median(at_risk.satisfaction_level)))
ylabel('# Employees')                               % y-axis label
legend('Stayed','Left')                             % add legend
title('Satisfaction Level of the Best and Most Experienced')

Was It for Money?

Let's isolate high performing seasoned employees and check their salaries. It is clear that people who get a higher salary are staying while people with medium or low salary leave. No big surprise here, either.

at_risk_sat = at_risk(...                           % subset at_risk
    at_risk.satisfaction_level >= .7,:);
figure                                              % new figure
histogram(at_risk_sat.salary(at_risk_sat.left == 0))% histogram of those stayed
hold on                                             % overlay another plot
histogram(at_risk_sat.salary(at_risk_sat.left == 1))% histogram of those left
hold off                                            % stop overlay
xlabel('Salary')                                    % x-axis label
ylabel('# Employees')                               % y-axis label
legend('Stayed','Left')                             % add legend
title('Salary of the Best and Most Experienced with High Satisfaction')

Making Insights Actionable with Predictive Analytics

At this point, you may say, "I knew all this stuff already. I didn't get any new insight from this analysis." It's true, but why are you complaining about that? Predictability is a good thing for prediction!

What we have seen so far all happened in the past, which you cannot undo. However, if you can predict the future, you can do something about it. Making data actionable - that's the true value of predictive analytics.

In order to make this insight actionable, we need to quantify the turnover risk as scores so that we can identify at-risk employees for intervention. Since we are trying to classify people into those who are likely to stay vs. leave, what we need to do is build a classification model to predict such binary outcome.

I have used the Classification Learner app to run multiple classifiers on our training data train to see which one provides the highest prediction accuracy.

The winner is the Bagged Trees classifier (also known as "Random Forest") with 99.0% accuracy. To evaluate a classifier, we typically use the ROC curve plot, which lets you see the trade-off between the true positive rate vs. false positive rate. If you have a high AUC score like 0.99, the classifer is very good at identifying the true class without causing too many false positives.

You can also export the trained predictive model from the Classification Learner. I saved the exported model as btree.mat that comes with some instructions.

load btree                                          % load trained model
how_to = btree.HowToPredict;                        % get how to use
disp([how_to(1:80) ' ...'])                         % snow snippet

To make predictions on a new table, T, use: 
  yfit = c.predictFcn(T) 
replacing ...

Evaluating the Predictive Performance

Let's pick 10 samples from the holdout data partition and see how the predictive model scores them. Here are the samples - 5 people who left and 5 who stayed. They are all high performers with 4 or more years of experience.

samples = holdout([111; 484; 652; 715; 737; 1135; 1293; 1443; 1480; 1485],:);
samples(:,[1,2,5,8,9,10,7])

ans = 
    satisfaction_level    last_evaluation    time_spend_company    promotion_last_5years      sales       salary    left
    __________________    _______________    __________________    _____________________    __________    ______    ____
     0.9                  0.92               4                     0                        sales         low       1   
    0.31                  0.92               6                     0                        support       medium    0   
    0.86                  0.87               4                     0                        sales         low       0   
    0.62                  0.95               4                     0                        RandD         low       0   
    0.23                  0.96               6                     0                        marketing     medium    0   
    0.39                  0.89               5                     0                        support       low       0   
    0.09                  0.92               4                     0                        sales         medium    1   
    0.73                  0.87               5                     0                        IT            low       1   
    0.75                  0.97               6                     0                        technical     medium    1   
    0.84                  0.83               5                     0                        accounting    low       1

Now let's try the model to predict whether they left the company.

actual = samples.left;                              % actual outocme
predictors = samples(:,[1:6,8:10]);                 % predictors only
[predicted,score]= btree.predictFcn(predictors);    % get prediction
c = confusionmat(actual,predicted);                 % get confusion matrix
disp(array2table(c, ...                             % show the matrix as table
    'VariableNames',{'Predicted_Stay','Predicted_Leave'}, ...
    'RowNames',{'Actual_Stayed','Actual_Left'}));

                     Predicted_Stay    Predicted_Leave
                     ______________    _______________
    Actual_Stayed    5                 0              
    Actual_Left      0                 5

The model was able to predict the outcome of those 10 samples accurately. In addition, it also returns the probability of each class as score. You can also use the entire holdout data to check the model performance, but I will skip that step here.

Using the Model to Take Action

We can use the probability of leaving as the risk score. We can select high performers based on the last evaluation and time spent at the company, and then score their risk level and intervene in high risk cases. In this example, the sales employee with the 0.92 evaluation score is 100% at risk of leaving. Since this employee has not been promoted in the last 5 years, perhaps it is time to do so.

samples.risk = score(:,2);                        % probability of leaving
[~,ranking] = sort(samples.risk,'descend');       % sort by risk score
samples = samples(ranking,:);                     % sort table by ranking
samples(samples.risk > .7,[1,2,5,8,9,10,11])      % intervention targets

ans = 
    satisfaction_level    last_evaluation    time_spend_company    promotion_last_5years      sales       salary     risk  
    __________________    _______________    __________________    _____________________    __________    ______    _______
    0.09                  0.92               4                     0                        sales         medium          1
    0.73                  0.87               5                     0                        IT            low             1
    0.84                  0.83               5                     0                        accounting    low        0.9984
    0.75                  0.97               6                     0                        technical     medium    0.96154
     0.9                  0.92               4                     0                        sales         low       0.70144

Explaining the Model

Machine learning algorithms used in predictive analytics are often a black-box solution, so HR managers may need to provide an easy-to-understand explanation about how it works in order to get the buy-in from their superiors. We can use the predictor importance score to show which attributes are used to compute the prediction.

In this example, the model is using the mixture of attributes at different weights to compute it, with emphasis on the satisfaction level, followed by the number of projects and time spent at the company. We also see that people who got a promotion in the last 5 years are less likely to leave. Those attributes are very obvious and therefore we feel more confident about the prediction with this model.

imp = oobPermutedPredictorImportance(...            % get predictor importance
    btree.ClassificationEnsemble);
vals = btree.ClassificationEnsemble.PredictorNames; % predictor names
figure                                              % new figure
bar(imp);                                           % plot importnace
title('Out-of-Bag Permuted Predictor Importance Estimates');
ylabel('Estimates');                                % y-axis label
xlabel('Predictors');                               % x-axis label
xticklabels(vals)                                   % label bars
xtickangle(45)                                      % rotate labels
ax = gca;                                           % get current axes
ax.TickLabelInterpreter = 'none';                   % turn off latex

Operationalizing Action

The predictor importance score also give us a hint about when we should intervene. Given that satisfaction level, last evaluation, and time spent at the company are all important predictors, it is probably a good idea to update the predictive scores at the time of each annual evaluation and then decide who may need intervention.

It is feasible to implement a system to schedule such analysis when a performance review is conducted and automatically generate a report of high risk employees using the model derived from MATLAB. Here is a simple live demo of such a system running the following code using MATLAB Production Server. It uses the trained predictive model we just generated.

The example code just loads the data from MAT-file, but you would instead access data from an SQL database in a real world use scenario. Please also note the use of a new function jsonencode introduced in R2016b that converts structure arrays into JSON formatted text. Naturally, we also have jsondecode that converts JSON formatted text into appropriate MATLAB data types.

dbtype scoreRisk.m

1     function risk = scoreRisk(employee_ids)
2     %SCORERISK scores the turnover risk of selected employees
3     
4     load data                                  % load holdout data
5     load btree                                 % load trained model
6     X_sub = X(employee_ids,:);                 % select employees
7     [predicted,score]= btree.predictFcn(X_sub);   % get prediction
8     risk = struct('Ypred',predicted,'score',score(:,2));% create struct
9     risk = jsonencode(risk);                   % JSON encode it
10    
11    end

Depending on your needs, you can build it in a couple of ways:

Excel add-in with MATLAB Compiler
Server-based solution with MATLAB Production Server

Often, you would need to retrain the predictive model as human behavior changes over time. If you use code generated from MATLAB, it is very easy to retrain using the new dataset and redeploy the new model for production use, as compared to cases where you reimplement the model in some other languages.

Summary

In the beginning of this analysis, we explored the dataset which represents the past. You cannot undo the past, but you can do something about the future if you can find quantifiable patterns in the data for prediction.

People often think the novelty of insights is important. What really matters is what action you can take on them, whether novel or well-known. If you do nothing, no fancy analytics will deliver any value. Harvard Business Review recently publushed an article Why You’re Not Getting Value from Your Data Science but failed to mention this issue.

In my opionion, analysis paralysis is one of the biggest reason companies are not getting the value because they are failing to take action and are stuck at the data analysis phase of the data science process.

Has this example helped you understand how you can use predictive analytics to solve practical problems? Can you think of way to apply this in your area of work? Let us know what you think here.

Published with MATLAB® R2016b