Student Lounge

Sharing technical and real-life examples of how students can use MATLAB and Simulink in their everyday projects #studentsuccess

Predicting Timely Diagnosis of Metastatic Breast Cancer for the WiDS Datathon 2024

In today’s blog, Grace Woolson will show how you can use MATLAB and machine learning to make meaningful deductions from healthcare data for patients who have been diagnosed with metastatic breast cancer. Over to you Grace!

Introduction

In this blog, I will show how you can use MATLAB for the WiDS Datathon 2024 using the dataset for the WiDS Datathon #1, which runs from January 9th 2024 – March 1st 2024. This challenge tasks participants with creating a model that can predict whether or not a patient with metastatic breast cancer will receive a diagnosis within 90 days based on patient and environmental data. This can help identify relationships between demographics or environmental hazards and the likelihood of getting timely treatment. Please note that this blog is based on a subset of the data, so there may be slight differences between this dataset and the one provided by WiDS.
MathWorks is happy to support participants of the Women in Data Science Datathon 2024 by providing complimentary MATLAB licenses, tutorials, workshops, and additional resources. To request complimentary licenses for you and your teammates, go to this MathWorks site, click the “Request Software” button, and fill out the software request form.
This tutorial will walk through the following steps of the model-making process:
  1. Importing a Tabular Dataset
  2. Preprocessing the Data
  3. Exploring and Analyzing Tabular Data
  4. Choosing and Creating Features
  5. Training a Machine Learning Model
  6. Evaluating a Machine Learning Model
  7. Making New Predictions and Exporting Submissions

Import Data

First, make sure the ‘Current Folder’ is the folder where you saved the data. If you have not already done so, you can download the data from Kaggle after you register for the datathon. The data is provided as a .CSV file, so we can use the readtable function to import the whole file as a table.
dataFolder = fullfile(pwd);
trainDataFilename = 'Training.csv';
allTrainData = readtable(fullfile(dataFolder, trainDataFilename))
allTrainData = 12906×83 table
patient_id patient_race payer_type patient_state patient_zip3 patient_age patient_gender bmi breast_cancer_diagnosis_code breast_cancer_diagnosis_desc metastatic_cancer_diagnosis_code metastatic_first_novel_treatment metastatic_first_novel_treatment_type Region Division population density age_median age_under_10 age_10_to_19 age_20s age_30s age_40s age_50s age_60s age_70s age_over_80 male female married
1 475714 ‘MEDICAID’ ‘CA’ 924 84 ‘F’ NaN ‘C50919’ ‘Malignant neoplasm of unsp site of unspecified female breast’ ‘C7989’ ‘West’ ‘Pacific’ 3.1438e+04 1.1896e+03 30.6429 16.0143 15.5429 17.6143 14.0143 11.6143 11.5571 7.5714 4 2.1000 49.8571 50.1429 36.5714
2 349367 ‘White’ ‘COMMERCIAL’ ‘CA’ 928 62 ‘F’ 28.4900 ‘C50411’ ‘Malig neoplm of upper-outer quadrant of right female breast’ ‘C773’ ‘West’ ‘Pacific’ 3.9122e+04 2.2959e+03 38.2000 11.8788 13.3545 14.2303 13.4182 13.3333 14.0606 10.2485 5.9515 3.5030 49.8939 50.1061 50.2455
3 138632 ‘White’ ‘COMMERCIAL’ ‘TX’ 760 43 ‘F’ 38.09 ‘C50112’ ‘Malignant neoplasm of central portion of left female breast’ ‘C773’ ‘South’ ‘West South Central’ 2.1997e+04 626.2367 37.9067 13.0283 14.4633 12.5317 13.5450 12.8600 12.7700 11.4267 6.5650 2.8117 50.1233 49.8767 55.7533
4 617843 ‘White’ ‘COMMERCIAL’ ‘CA’ 926 45 ‘F’ NaN ‘C50212’ ‘Malig neoplasm of upper-inner quadrant of left female breast’ ‘C773’ ‘West’ ‘Pacific’ 3.2795e+04 1.8962e+03 42.8714 10.0714 12.1357 12.5381 12.4643 12.6500 14.8476 12.2810 8.2167 4.7595 49.0667 50.9333 52.6048
5 817482 ‘COMMERCIAL’ ‘ID’ 836 55 ‘F’ NaN ‘1749’ ‘Malignant neoplasm of breast (female), unspecified’ ‘C773’ ‘West’ ‘Mountain’ 1.0886e+04 116.8860 43.4735 10.8240 13.9760 9.4920 10.3640 12.6000 14.9920 14.8360 9.4620 3.4660 52.3120 47.6880 57.8820
6 111545 ‘White’ ‘MEDICARE ADVANTAGE’ ‘NY’ 141 66 ‘F’ NaN ‘1749’ ‘Malignant neoplasm of breast (female), unspecified’ ‘C7981’ ‘Northeast’ ‘Middle Atlantic’ 5.6438e+03 219.3629 45.1800 8.5114 14.8571 11.0886 9.7543 13.6143 13.3743 15.6857 9.4457 3.6457 50.9114 49.0914 51.3229
7 914071 ‘COMMERCIAL’ ‘CA’ 900 51 ‘F’ 29.0500 ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C779’ ‘West’ ‘Pacific’ 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785
8 479368 ‘White’ ‘COMMERCIAL’ ‘IL’ 619 60 ‘F’ NaN ‘C50512’ ‘Malig neoplasm of lower-outer quadrant of left female breast’ ‘C773’ ‘Midwest’ ‘East North Central’ 3.4041e+03 25.7333 42.7900 11.9833 13.2567 9.5733 12.4000 11.8133 13.5767 14.0433 8.5267 4.8533 49.2833 50.7167 55.8867
9 994014 ‘White’ ‘MEDICARE ADVANTAGE’ 973 82 ‘F’ NaN ‘1744’ ‘Malignant neoplasm of upper-outer quadrant of female breast’ ‘C7800’ 1.0111e+04 240.5785 44.9159 9.0646 12.1000 11.1385 11.5123 10.5354 12.7292 18.5462 10.7431 3.6400 52.5846 47.4154 52.9938
10 155485 ‘COMMERCIAL’ ‘IL’ 617 64 ‘F’ NaN ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C773’ ‘Midwest’ ‘East North Central’ 4.4353e+03 68.0019 41.3000 12.8358 13.6811 10.5245 11.9377 11.6585 13.5774 13.7434 7.6868 4.3415 49.3962 50.6038 57.8962
11 875977 ‘MEDICARE ADVANTAGE’ ‘MI’ 488 67 ‘F’ NaN ‘C50412’ ‘Malig neoplasm of upper-outer quadrant of left female breast’ ‘C799’ ‘Midwest’ ‘East North Central’ 8101 246.2810 40.2782 11.0456 14.7684 13.3848 11.4671 11.2203 14.8975 12.5899 7.1494 3.4709 51.3228 48.6772 49.0658
12 343914 ‘MEDICARE ADVANTAGE’ ‘CA’ 900 66 ‘F’ NaN ‘1749’ ‘Malignant neoplasm of breast (female), unspecified’ ‘C7800’ ‘West’ ‘Pacific’ 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785
13 266700 ‘White’ ‘COMMERCIAL’ ‘MI’ 480 58 ‘F’ NaN ‘C50812’ ‘Malignant neoplasm of ovrlp sites of left female breast’ ‘C781’ ‘Midwest’ ‘East North Central’ 1.6938e+04 894.1681 42.9348 10.5116 11.8130 11.9217 12.4043 12.4304 15.2710 13.8826 7.8522 3.8971 50.0217 49.9783 50.9565
14 437659 ‘IL’ 606 82 ‘F’ undefined ‘1749’ ‘Malignant neoplasm of breast (female), unspecified’ ‘C779’ ‘Midwest’ ‘East North Central’ 4.8671e+04 6.4314e+03 35.7554 10.4286 10.6518 18.3107 18.9036 11.9696 11.7268 9.6839 5.4071 2.8911 48.6964 51.3036 35.9304
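If you already know which columns hold text categories, one option is to request that type at import time instead of converting later. The sketch below is an illustration only: it writes a tiny hypothetical CSV (the file name `wids_import_demo.csv` and its three rows are made up for the demo) and reads it back with `detectImportOptions` and `setvartype`.

```matlab
% Hypothetical three-row file standing in for the datathon CSV.
demoFile = fullfile(tempdir, "wids_import_demo.csv");
demoTable = table([101; 102; 103], ["MEDICAID"; "COMMERCIAL"; "COMMERCIAL"], ...
    [84; 62; 43], VariableNames=["patient_id", "payer_type", "patient_age"]);
writetable(demoTable, demoFile);

% Detect the import settings, then ask for categorical on the text column.
opts = detectImportOptions(demoFile);
opts = setvartype(opts, "payer_type", "categorical");
demoData = readtable(demoFile, opts);

% Confirm the column type, then clean up the temporary file.
disp(class(demoData.payer_type))
delete(demoFile);
```

The same pattern should carry over to the real file by pointing `detectImportOptions` at it and listing the text columns you care about.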
I want to see some high-level statistics about the data, so I’ll use the summary function to get an idea of what kind of information we have.
summary(allTrainData)

Variables:

patient_id: 12906×1 double

Values:

Min 1.0006e+05
Median 5.4352e+05
Max 9.999e+05

patient_race: 12906×1 cell array of character vectors

payer_type: 12906×1 cell array of character vectors

patient_state: 12906×1 cell array of character vectors

patient_zip3: 12906×1 double

Values:

Min 101
Median 554
Max 999

patient_age: 12906×1 double

Values:

Min 18
Median 59
Max 91

patient_gender: 12906×1 cell array of character vectors

bmi: 12906×1 double

Values:

Min 14
Median 28.19
Max 85
NumMissing 8965

breast_cancer_diagnosis_code: 12906×1 cell array of character vectors

breast_cancer_diagnosis_desc: 12906×1 cell array of character vectors

metastatic_cancer_diagnosis_code: 12906×1 cell array of character vectors

metastatic_first_novel_treatment: 12906×1 cell array of character vectors

metastatic_first_novel_treatment_type: 12906×1 cell array of character vectors

Region: 12906×1 cell array of character vectors

Division: 12906×1 cell array of character vectors

population: 12906×1 double

Values:

Min 635.55
Median 19154
Max 71374
NumMissing 1

density: 12906×1 double

Values:

Min 0.91667
Median 700.34
Max 21172
NumMissing 1

age_median: 12906×1 double

Values:

Min 20.6
Median 40.639
Max 54.57
NumMissing 1

age_under_10: 12906×1 double

Values:

Min 0
Median 11.039
Max 17.675
NumMissing 1

age_10_to_19: 12906×1 double

Values:

Min 6.3143
Median 12.924
Max 35.3
NumMissing 1

age_20s: 12906×1 double

Values:

Min 5.925
Median 12.538
Max 62.1
NumMissing 1

age_30s: 12906×1 double

Values:

Min 1.5
Median 12.443
Max 25.471
NumMissing 1

age_40s: 12906×1 double

Values:

Min 0.8
Median 12.124
Max 17.82
NumMissing 1

age_50s: 12906×1 double

Values:

Min 0
Median 13.568
Max 21.661
NumMissing 1

age_60s: 12906×1 double

Values:

Min 0.2
Median 12.533
Max 29.855
NumMissing 1

age_70s: 12906×1 double

Values:

Min 0
Median 7.3169
Max 19
NumMissing 1

age_over_80: 12906×1 double

Values:

Min 0
Median 3.8
Max 18.825
NumMissing 1

male: 12906×1 double

Values:

Min 39.725
Median 49.976
Max 61.6
NumMissing 1

female: 12906×1 double

Values:

Min 38.4
Median 50.024
Max 60.275
NumMissing 1

married: 12906×1 double

Values:

Min 0.9
Median 49.434
Max 66.903
NumMissing 1

divorced: 12906×1 double

Values:

Min 0.2
Median 12.653
Max 19.831
NumMissing 1

never_married: 12906×1 double

Values:

Min 13.44
Median 32.004
Max 98.9
NumMissing 1

widowed: 12906×1 double

Values:

Min 0
Median 5.5208
Max 23.055
NumMissing 1

family_size: 12906×1 double

Values:

Min 2.5504
Median 3.1665
Max 4.1723
NumMissing 4

family_dual_income: 12906×1 double

Values:

Min 19.312
Median 52.592
Max 70.925
NumMissing 4

income_household_median: 12906×1 double

Values:

Min 29222
Median 69803
Max 1.6412e+05
NumMissing 4

income_household_under_5: 12906×1 double

Values:

Min 0.75
Median 2.8382
Max 19.62
NumMissing 4

income_household_5_to_10: 12906×1 double

Values:

Min 0.36154
Median 2.1604
Max 11.872
NumMissing 4

income_household_10_to_15: 12906×1 double

Values:

Min 1.0154
Median 3.7171
Max 14.278
NumMissing 4

income_household_15_to_20: 12906×1 double

Values:

Min 1.0278
Median 3.7712
Max 12.918
NumMissing 4

income_household_20_to_25: 12906×1 double

Values:

Min 1.1
Median 4.0421
Max 14.35
NumMissing 4

income_household_25_to_35: 12906×1 double

Values:

Min 2.65
Median 8.4353
Max 18.34
NumMissing 4

income_household_35_to_50: 12906×1 double

Values:

Min 1.7
Median 11.793
Max 24.075
NumMissing 4

income_household_50_to_75: 12906×1 double

Values:

Min 4.95
Median 17.076
Max 27.13
NumMissing 4

income_household_75_to_100: 12906×1 double

Values:

Min 4.7333
Median 12.677
Max 24.8
NumMissing 4

income_household_100_to_150: 12906×1 double

Values:

Min 4.2889
Median 16.016
Max 31.325
NumMissing 4

income_household_150_over: 12906×1 double

Values:

Min 0.84
Median 14.703
Max 52.824
NumMissing 4

income_household_six_figure: 12906×1 double

Values:

Min 5.6926
Median 30.575
Max 69.032
NumMissing 4

income_individual_median: 12906×1 double

Values:

Min 4316
Median 35253
Max 88910
NumMissing 1

home_ownership: 12906×1 double

Values:

Min 15.85
Median 69.669
Max 90.367
NumMissing 4

housing_units: 12906×1 double

Values:

Min 0
Median 6994.4
Max 25923
NumMissing 1

home_value: 12906×1 double

Values:

Min 60629
Median 2.4784e+05
Max 1.8531e+06
NumMissing 4

rent_median: 12906×1 double

Values:

Min 448.4
Median 1168
Max 2965.2
NumMissing 4

rent_burden: 12906×1 double

Values:

Min 17.416
Median 30.986
Max 78.94
NumMissing 4

education_less_highschool: 12906×1 double

Values:

Min 0
Median 10.843
Max 34.325
NumMissing 1

education_highschool: 12906×1 double

Values:

Min 0
Median 27.406
Max 53.96
NumMissing 1

education_some_college: 12906×1 double

Values:

Min 7.2
Median 29.286
Max 50.133
NumMissing 1

education_bachelors: 12906×1 double

Values:

Min 2.4657
Median 19.047
Max 41.7
NumMissing 1

education_graduate: 12906×1 double

Values:

Min 2.0941
Median 10.796
Max 51.84
NumMissing 1

education_college_or_above: 12906×1 double

Values:

Min 7.0488
Median 30.141
Max 77.817
NumMissing 1

education_stem_degree: 12906×1 double

Values:

Min 23.915
Median 43.066
Max 73
NumMissing 1

labor_force_participation: 12906×1 double

Values:

Min 30.7
Median 62.778
Max 78.67
NumMissing 1

unemployment_rate: 12906×1 double

Values:

Min 0.82308
Median 5.4741
Max 18.8
NumMissing 1

self_employed: 12906×1 double

Values:

Min 2.263
Median 12.748
Max 25.538
NumMissing 4

farmer: 12906×1 double

Values:

Min 0
Median 0.45493
Max 26.729
NumMissing 4

race_white: 12906×1 double

Values:

Min 14.496
Median 70.878
Max 98.444
NumMissing 1

race_black: 12906×1 double

Values:

Min 0.060976
Median 6.4103
Max 69.66
NumMissing 1

race_asian: 12906×1 double

Values:

Min 0
Median 2.9667
Max 49.85
NumMissing 1

race_native: 12906×1 double

Values:

Min 0
Median 0.43095
Max 76.935
NumMissing 1

race_pacific: 12906×1 double

Values:

Min 0
Median 0.054054
Max 14.758
NumMissing 1

race_other: 12906×1 double

Values:

Min 0.0025641
Median 3.5136
Max 33.189
NumMissing 1

race_multiple: 12906×1 double

Values:

Min 0.43333
Median 5.802
Max 26.43
NumMissing 1

hispanic: 12906×1 double

Values:

Min 0.19444
Median 11.983
Max 91.005
NumMissing 1

disabled: 12906×1 double

Values:

Min 4.6
Median 12.884
Max 35.156
NumMissing 1

poverty: 12906×1 double

Values:

Min 3.4333
Median 12.178
Max 38.348
NumMissing 4

limited_english: 12906×1 double

Values:

Min 0
Median 2.7472
Max 26.755
NumMissing 4

commute_time: 12906×1 double

Values:

Min 12.461
Median 27.788
Max 48.02
NumMissing 1

health_uninsured: 12906×1 double

Values:

Min 2.44
Median 7.4657
Max 27.566
NumMissing 1

veteran: 12906×1 double

Values:

Min 1.2
Median 6.8471
Max 25.2
NumMissing 1

Ozone: 12906×1 double

Values:

Min 30.939
Median 39.108
Max 52.237
NumMissing 29

PM25: 12906×1 double

Values:

Min 2.636
Median 7.6866
Max 11.169
NumMissing 29

N02: 12906×1 double

Values:

Min 2.7604
Median 15.589
Max 31.505
NumMissing 29

DiagPeriodL90D: 12906×1 double

Values:

Min 0
Median 1
Max 1

Take some time to scroll through this summary and see what information or patterns you can learn! Here are some things I notice:
  1. There are a lot of rows or variables that just say “cell array of character vectors”, which doesn’t tell us much about the data.
  2. There are a few variables that have a high ‘NumMissing’ value.
  3. The numeric variables can have dramatically different minimums and maximums.
We can use these observations to make decisions about how we want to explore and preprocess the dataset.
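As one possible response to the third observation, numeric variables can be rescaled to a common range. This is a minimal sketch on a toy table, not a step the rest of this walkthrough depends on; the two columns reuse values from the summary above purely as stand-ins, and whether rescaling helps depends on the model you choose.

```matlab
% Toy stand-ins for two numeric columns with very different scales.
demo = table([635.55; 19154; 71374], [0.82; 5.47; 18.8], ...
    VariableNames=["population", "unemployment_rate"]);

% Rescale every variable in the table to the interval [0, 1].
scaled = normalize(demo, "range");

disp(scaled)
```

After rescaling, both columns span the same interval, so neither dominates purely because of its units.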

Process and Clean the Data

1. Convert text data to categorical

Text data can be hard for machine learning algorithms to understand, so let’s go through and change each “cell array of character vectors” to a categorical. This will help the algorithm sort the text into different categories instead of understanding it as a series of individual letters.
% Find every variable stored as a cell array of character vectors
varTypes = varfun(@class, allTrainData, OutputFormat="cell");
catIdx = strcmp(varTypes, "cell");
varNames = allTrainData.Properties.VariableNames;
catVarNames = varNames(catIdx);
% Convert each of those variables to categorical
for catNameIdx = 1:length(catVarNames)
    allTrainData.(catVarNames{catNameIdx}) = categorical(allTrainData.(catVarNames{catNameIdx}));
end
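The loop above can also be written as a single call to `convertvars`, which accepts a function handle to select variables. Here is a minimal sketch on a toy table (the three columns are made up to mirror the dataset's names):

```matlab
% Toy table with two text columns stored as cell arrays of character vectors.
demo = table({'CA'; 'TX'; 'CA'}, {'F'; 'F'; 'F'}, [84; 62; 43], ...
    VariableNames=["patient_state", "patient_gender", "patient_age"]);

% Convert every cellstr variable to categorical in one call.
demo = convertvars(demo, @iscellstr, "categorical");

% List the categories found in one of the converted columns.
disp(categories(demo.patient_state))
```

Numeric variables are left untouched because `@iscellstr` returns false for them.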

2. Handle Missing Data

Now I want to handle all that missing data I noticed earlier. I’ll go through each variable and specifically look at variables that are missing data for over half of the rows or observations.
dataSum = summary(allTrainData);
% Print the name and missing count of every variable missing in over half the rows
for nameIdx = 1:length(varNames)
    varName = varNames{nameIdx};
    varNumMissing = dataSum.(varName).NumMissing;
    if varNumMissing > (height(allTrainData) / 2)
        disp(varName);
        disp(varNumMissing);
    end
end
bmi
8965
metastatic_first_novel_treatment
12882
metastatic_first_novel_treatment_type
12882
Let’s remove those variables entirely, since they might not be too helpful for our algorithm.
allTrainData = removevars(allTrainData, ["bmi", "metastatic_first_novel_treatment", "metastatic_first_novel_treatment_type"])
allTrainData = 12906×80 table
patient_id patient_race payer_type patient_state patient_zip3 patient_age patient_gender breast_cancer_diagnosis_code breast_cancer_diagnosis_desc metastatic_cancer_diagnosis_code Region Division population density age_median age_under_10 age_10_to_19 age_20s age_30s age_40s age_50s age_60s age_70s age_over_80 male female married divorced never_married widowed
1 475714 <undefined> MEDICAID CA 924 84 F C50919 Malignant neoplasm of unsp site of unspecified female breast C7989 West Pacific 3.1438e+04 1.1896e+03 30.6429 16.0143 15.5429 17.6143 14.0143 11.6143 11.5571 7.5714 4 2.1000 49.8571 50.1429 36.5714 11.8857 47.1143 4.4429
2 349367 White COMMERCIAL CA 928 62 F C50411 Malig neoplm of upper-outer quadrant of right female breast C773 West Pacific 3.9122e+04 2.2959e+03 38.2000 11.8788 13.3545 14.2303 13.4182 13.3333 14.0606 10.2485 5.9515 3.5030 49.8939 50.1061 50.2455 9.8273 35.2909 4.6515
3 138632 White COMMERCIAL TX 760 43 F C50112 Malignant neoplasm of central portion of left female breast C773 South West South Central 2.1997e+04 626.2367 37.9067 13.0283 14.4633 12.5317 13.5450 12.8600 12.7700 11.4267 6.5650 2.8117 50.1233 49.8767 55.7533 12.3300 27.1950 4.7100
4 617843 White COMMERCIAL CA 926 45 F C50212 Malig neoplasm of upper-inner quadrant of left female breast C773 West Pacific 3.2795e+04 1.8962e+03 42.8714 10.0714 12.1357 12.5381 12.4643 12.6500 14.8476 12.2810 8.2167 4.7595 49.0667 50.9333 52.6048 11.6238 31.1429 4.6238
5 817482 <undefined> COMMERCIAL ID 836 55 F 1749 Malignant neoplasm of breast (female), unspecified C773 West Mountain 1.0886e+04 116.8860 43.4735 10.8240 13.9760 9.4920 10.3640 12.6000 14.9920 14.8360 9.4620 3.4660 52.3120 47.6880 57.8820 14.9640 21.7600 5.4060
6 111545 White MEDICARE ADVANTAGE NY 141 66 F 1749 Malignant neoplasm of breast (female), unspecified C7981 Northeast Middle Atlantic 5.6438e+03 219.3629 45.1800 8.5114 14.8571 11.0886 9.7543 13.6143 13.3743 15.6857 9.4457 3.6457 50.9114 49.0914 51.3229 11.7600 30.8314 6.0914
7 914071 <undefined> COMMERCIAL CA 900 51 F C50912 Malignant neoplasm of unspecified site of left female breast C779 West Pacific 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785 11.3015 50.4569 4.7662
8 479368 White COMMERCIAL IL 619 60 F C50512 Malig neoplasm of lower-outer quadrant of left female breast C773 Midwest East North Central 3.4041e+03 25.7333 42.7900 11.9833 13.2567 9.5733 12.4000 11.8133 13.5767 14.0433 8.5267 4.8533 49.2833 50.7167 55.8867 12.6400 24.5267 6.9433
9 994014 White MEDICARE ADVANTAGE <undefined> 973 82 F 1744 Malignant neoplasm of upper-outer quadrant of female breast C7800 <undefined> <undefined> 1.0111e+04 240.5785 44.9159 9.0646 12.1000 11.1385 11.5123 10.5354 12.7292 18.5462 10.7431 3.6400 52.5846 47.4154 52.9938 13.9323 27.5262 5.5615
10 155485 <undefined> COMMERCIAL IL 617 64 F C50912 Malignant neoplasm of unspecified site of left female breast C773 Midwest East North Central 4.4353e+03 68.0019 41.3000 12.8358 13.6811 10.5245 11.9377 11.6585 13.5774 13.7434 7.6868 4.3415 49.3962 50.6038 57.8962 10.8981 24.9547 6.2472
11 875977 <undefined> MEDICARE ADVANTAGE MI 488 67 F C50412 Malig neoplasm of upper-outer quadrant of left female breast C799 Midwest East North Central 8101 246.2810 40.2782 11.0456 14.7684 13.3848 11.4671 11.2203 14.8975 12.5899 7.1494 3.4709 51.3228 48.6772 49.0658 13.6051 31.8848 5.4392
12 343914 <undefined> MEDICARE ADVANTAGE CA 900 66 F 1749 Malignant neoplasm of breast (female), unspecified C7800 West Pacific 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785 11.3015 50.4569 4.7662
13 266700 White COMMERCIAL MI 480 58 F C50812 Malignant neoplasm of ovrlp sites of left female breast C781 Midwest East North Central 1.6938e+04 894.1681 42.9348 10.5116 11.8130 11.9217 12.4043 12.4304 15.2710 13.8826 7.8522 3.8971 50.0217 49.9783 50.9565 12.3145 30.8333 5.9014
14 437659 <undefined> <undefined> IL 606 82 F 1749 Malignant neoplasm of breast (female), unspecified C779 Midwest East North Central 4.8671e+04 6.4314e+03 35.7554 10.4286 10.6518 18.3107 18.9036 11.9696 11.7268 9.6839 5.4071 2.8911 48.6964 51.3036 35.9304 10.2982 49.0054 4.7643
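The per-variable missing check can also be done without the summary struct, by averaging the logical output of `ismissing` down each column. This is a sketch on a toy table with made-up values, assuming the same "more than half" threshold used earlier:

```matlab
% Toy table: bmi is mostly missing, patient_age is complete.
demo = table([NaN; 28.49; NaN; NaN], [84; 62; 43; 45], ...
    categorical(["CA"; "TX"; missing; "CA"]), ...
    VariableNames=["bmi", "patient_age", "patient_state"]);

% ismissing returns one logical per cell; the column mean is the missing fraction.
missingFraction = mean(ismissing(demo), 1);

% Names of variables that are missing in more than half of the rows.
mostlyMissing = demo.Properties.VariableNames(missingFraction > 0.5);
disp(mostlyMissing)
```

The resulting cell array of names can be passed straight to `removevars`.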
Now I want to look at each row and remove any that are missing too many values. It’s okay to have a couple of missing data points in your dataset, but if you have too many it could cause your machine learning algorithm to be less accurate. I’ll use the Clean Missing Data live task to remove any rows that are missing 2 or more data points.
% Remove missing data
[fullData,missingIndices] = rmmissing(allTrainData,"MinNumMissing",2);
% Display results
figure
% Get locations of missing data
indicesForPlot = ismissing(allTrainData.patient_age);
mask = missingIndices & ~indicesForPlot;
% Plot cleaned data
plot(find(~missingIndices),fullData.patient_age,"SeriesIndex",1,"LineWidth",1.5, ...
    "DisplayName","Cleaned data")
hold on
% Plot data in rows where other variables contain missing entries
plot(find(mask),allTrainData.patient_age(mask),"x","SeriesIndex","none", ...
    "DisplayName","Removed by other variables")
% Plot removed missing entries
x = repelem(find(indicesForPlot),3);
y = repmat([ylim(gca) missing]',nnz(indicesForPlot),1);
plot(x,y,"Color",[145 145 145]/255,"DisplayName","Removed missing entries")
title("Number of removed missing entries: " + nnz(indicesForPlot))
hold off
legend
ylabel("patient_age","Interpreter","none")
clear indicesForPlot mask x y
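To see what the `MinNumMissing` option does in isolation, here is a small sketch on a toy table with made-up values. Only the row with two missing entries is dropped; rows with a single missing entry are kept.

```matlab
% Toy table: row 2 has two missing values, rows 3 and 4 have one each.
demo = table([1; NaN; 3; NaN], [10; NaN; NaN; 40], [5; 6; 7; 8], ...
    VariableNames=["a", "b", "c"]);

% Remove only the rows that contain at least 2 missing entries.
[cleaned, removedRows] = rmmissing(demo, "MinNumMissing", 2);

% removedRows is a logical column marking which original rows were dropped.
disp(removedRows')
disp(height(cleaned))
```

The second output is handy for checking how many observations a given threshold would cost before committing to it.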

Explore the Data

Now that the data is cleaned up, spend some time exploring it. The goal is to understand how different variables interact with each other, see whether you can draw any meaningful conclusions, and figure out which variables may be more or less important for predicting time to diagnosis.

Univariate Analysis

First, I want to separate the data into two datasets: one full of patients who were diagnosed in 90 days or less (the 1 or “True” values), and one full of patients who were not (the 0 or “False” values). This will allow me to explore the data patterns in each of these datasets and look for any meaningful differences.
allTrueIdx = fullData.DiagPeriodL90D == 1;
allTrueData = fullData(allTrueIdx, :);
allTrueData = 7559×80 table
patient_id patient_race payer_type patient_state patient_zip3 patient_age patient_gender breast_cancer_diagnosis_code breast_cancer_diagnosis_desc metastatic_cancer_diagnosis_code Region Division population density age_median age_under_10 age_10_to_19 age_20s age_30s age_40s age_50s age_60s age_70s age_over_80 male female married divorced never_married widowed
1 475714 <undefined> MEDICAID CA 924 84 F C50919 Malignant neoplasm of unsp site of unspecified female breast C7989 West Pacific 3.1438e+04 1.1896e+03 30.6429 16.0143 15.5429 17.6143 14.0143 11.6143 11.5571 7.5714 4 2.1000 49.8571 50.1429 36.5714 11.8857 47.1143 4.4429
2 349367 White COMMERCIAL CA 928 62 F C50411 Malig neoplm of upper-outer quadrant of right female breast C773 West Pacific 3.9122e+04 2.2959e+03 38.2000 11.8788 13.3545 14.2303 13.4182 13.3333 14.0606 10.2485 5.9515 3.5030 49.8939 50.1061 50.2455 9.8273 35.2909 4.6515
3 138632 White COMMERCIAL TX 760 43 F C50112 Malignant neoplasm of central portion of left female breast C773 South West South Central 2.1997e+04 626.2367 37.9067 13.0283 14.4633 12.5317 13.5450 12.8600 12.7700 11.4267 6.5650 2.8117 50.1233 49.8767 55.7533 12.3300 27.1950 4.7100
4 914071 <undefined> COMMERCIAL CA 900 51 F C50912 Malignant neoplasm of unspecified site of left female breast C779 West Pacific 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785 11.3015 50.4569 4.7662
5 479368 White COMMERCIAL IL 619 60 F C50512 Malig neoplasm of lower-outer quadrant of left female breast C773 Midwest East North Central 3.4041e+03 25.7333 42.7900 11.9833 13.2567 9.5733 12.4000 11.8133 13.5767 14.0433 8.5267 4.8533 49.2833 50.7167 55.8867 12.6400 24.5267 6.9433
6 155485 <undefined> COMMERCIAL IL 617 64 F C50912 Malignant neoplasm of unspecified site of left female breast C773 Midwest East North Central 4.4353e+03 68.0019 41.3000 12.8358 13.6811 10.5245 11.9377 11.6585 13.5774 13.7434 7.6868 4.3415 49.3962 50.6038 57.8962 10.8981 24.9547 6.2472
7 266700 White COMMERCIAL MI 480 58 F C50812 Malignant neoplasm of ovrlp sites of left female breast C781 Midwest East North Central 1.6938e+04 894.1681 42.9348 10.5116 11.8130 11.9217 12.4043 12.4304 15.2710 13.8826 7.8522 3.8971 50.0217 49.9783 50.9565 12.3145 30.8333 5.9014
8 880521 Other COMMERCIAL CA 945 58 F C50911 Malignant neoplasm of unsp site of right female breast C773 West Pacific 3.0154e+04 976.2892 42.1358 10.7531 12.7148 11.7259 13.1012 12.8173 13.3012 12.7716 8.4136 4.4086 49.7272 50.2728 53.0765 10.9123 30.5346 5.4667
9 971531 Hispanic MEDICARE ADVANTAGE IL 606 83 F C50911 Malignant neoplasm of unsp site of right female breast C773 Midwest East North Central 4.8671e+04 6.4314e+03 35.7554 10.4286 10.6518 18.3107 18.9036 11.9696 11.7268 9.6839 5.4071 2.8911 48.6964 51.3036 35.9304 10.2982 49.0054 4.7643
10 529840 White COMMERCIAL MT 590 60 F C50411 Malig neoplm of upper-outer quadrant of right female breast C773 West Mountain 1.2208e+03 2.1597 46.7408 11.1521 11.1000 7.9183 10.3338 10.7577 15.4211 18.9042 9.4479 5
11 198037 White MEDICAID KY 402 45 F C50312 Malig neoplasm of lower-inner quadrant of left female breast C773 South East South Central 2.2669e+04 1.1427e+03 37.4937 10.9688 13.6031 15.2281 14.9219 11.7219 12.1375 11.5188 6.3156 3.5625 48.8344 51.1656 39.7906 15.0312 39.2875 5.8906
12 791301 <undefined> MEDICARE ADVANTAGE CA 958 58 F C50112 Malignant neoplasm of central portion of left female breast C773 West Pacific 3.0687e+04 1.9179e+03 36.5517 11.6207 11.4655 16.1345 15.9655 12.5276 12.4793 11.0655 5.6034 3.1586 49.5138 50.4862 41.5345 13.7034 40.1793 4.5793
13 618259 White COMMERCIAL OH 430 55 F C50311 Malig neoplm of lower-inner quadrant of right female breast C773 Midwest East North Central 1.4386e+04 263.5774 40.6393 11.8852 14.2492 10.8426 11.5590 12.6984 13.8869 12.8131 8.2557 3.8148 49.7082 50.2918 53.7148 13.5279 25.8443 6.9180
14 393934 White <undefined> CO 801 70 F C50911 Malignant neoplasm of unsp site of right female breast C7951 West Mountain 2.1243e+04 564.7743 42.4114 10.4086 14.4486 10.6314 11.8086 13.7086 15.9543 13.2914 6.9514 2.8057 50.7971 49.2029 60.1086 10.8800 25.5914 3.4286
allFalseIdx = fullData.DiagPeriodL90D == 0;
allFalseData = fullData(allFalseIdx, :);
allFalseData = 4598×80 table
patient_id patient_race payer_type patient_state patient_zip3 patient_age patient_gender breast_cancer_diagnosis_code breast_cancer_diagnosis_desc metastatic_cancer_diagnosis_code Region Division population density age_median age_under_10 age_10_to_19 age_20s age_30s age_40s age_50s age_60s age_70s age_over_80 male female married divorced never_married widowed
1 617843 White COMMERCIAL CA 926 45 F C50212 Malig neoplasm of upper-inner quadrant of left female breast C773 West Pacific 3.2795e+04 1.8962e+03 42.8714 10.0714 12.1357 12.5381 12.4643 12.6500 14.8476 12.2810 8.2167 4.7595 49.0667 50.9333 52.6048 11.6238 31.1429 4.6238
2 817482 <undefined> COMMERCIAL ID 836 55 F 1749 Malignant neoplasm of breast (female), unspecified C773 West Mountain 1.0886e+04 116.8860 43.4735 10.8240 13.9760 9.4920 10.3640 12.6000 14.9920 14.8360 9.4620 3.4660 52.3120 47.6880 57.8820 14.9640 21.7600 5.4060
3 111545 White MEDICARE ADVANTAGE NY 141 66 F 1749 Malignant neoplasm of breast (female), unspecified C7981 Northeast Middle Atlantic 5.6438e+03 219.3629 45.1800 8.5114 14.8571 11.0886 9.7543 13.6143 13.3743 15.6857 9.4457 3.6457 50.9114 49.0914 51.3229 11.7600 30.8314 6.0914
4 875977 <undefined> MEDICARE ADVANTAGE MI 488 67 F C50412 Malig neoplasm of upper-outer quadrant of left female breast C799 Midwest East North Central 8101 246.2810 40.2782 11.0456 14.7684 13.3848 11.4671 11.2203 14.8975 12.5899 7.1494 3.4709 51.3228 48.6772 49.0658 13.6051 31.8848 5.4392
5 343914 <undefined> MEDICARE ADVANTAGE CA 900 66 F 1749 Malignant neoplasm of breast (female), unspecified C7800 West Pacific 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785 11.3015 50.4569 4.7662
6 615208 Other COMMERCIAL OR 975 62 F C50411 Malig neoplm of upper-outer quadrant of right female breast C786 West Pacific 1.2836e+04 87.3667 48.9208 9.3458 9.4500 8.7833 11.9542 10.3458 12.6000 17.8833 13.8708 5.7542 50.4708 49.5292 53.7167 15.8583 23.1333 7.2708
7 279917 White MEDICARE ADVANTAGE NY 142 75 F C50912 Malignant neoplasm of unspecified site of left female breast C7801 Northeast Middle Atlantic 2.0195e+04 2.1920e+03 36.4690 10.2207 15.4345 17.8241 13.2483 10.2897 11.7345 11.5276 5.8966 3.8345 48.4414 51.5586 31.7517 12.9966 49.4724 5.7759
8 366792 Asian COMMERCIAL MI 482 46 F C50412 Malig neoplasm of upper-outer quadrant of left female breast C773 Midwest East North Central 2.2081e+04 1.6665e+03 36.5861 12.7778 12.8556 15.8083 13.2028 11.8889 12.4556 11.6000 5.9194 3.4833 48.5472 51.4528 26.2417 14.7028 52.5722 6.4611
9 643360 <undefined> COMMERCIAL NY 120 52 F 1744 Malignant neoplasm of upper-outer quadrant of female breast C7800 Northeast Middle Atlantic 5.1122e+03 103.9061 46.2954 9.0636 10.6182 11.3000 10.9576 11.0045 16.7348 15.3530 10.3818 4.5909 52.1348 47.8652 49.7773 13.7470 29.3818 7.0924
10 487817 <undefined> COMMERCIAL TX 773 57 F 1749 Malignant neoplasm of breast (female), unspecified C773 South West South Central 2.4751e+04 352.2268 41.3712 11.9302 12.9868 10.9962 11.1623 13.1075 13.0226 13.0660 9.5774 4.1585 49.3547 50.6453 52.9943 13.3415 25.0943 8.5792
11 345047 <undefined> COMMERCIAL TX 751 78 F C50912 Malignant neoplasm of unspecified site of left female breast C773 South West South Central 1.6981e+04 271.9135 38.5392 13.2529 15.1843 11.8118 11.7980 12.7176 14.0510 11.8373 6.3667 2.9627 49.8667 50.1333 52.8333 13.5725 28.0196 5.5725
12 907418 White MEDICARE ADVANTAGE IN 460 50 F 1749 Malignant neoplasm of breast (female), unspecified C7951 Midwest East North Central 1.3549e+04 256.8795 40.2864 13.3023 13.2045 10.9000 14.2909 12.6364 13.2909 11.8773 6.3886 4.0886 50.8545 49.1455 54.1114 12.4795 27.4068 5.9955
13 908851 White MEDICARE ADVANTAGE FL 339 82 F 1749 Malignant neoplasm of breast (female), unspecified C7800 South South Atlantic 1.8007e+04 479.7347 50.9592 7.9449 9.9816 10.4449 9.7082 9.3714 13.1020 16.6653 16.0143 6.7714 49.0306 50.9694 52.1673 13.8000 25.2510 8.7939
14 785337 Black COMMERCIAL VA 234 54 F 1741 Malignant neoplasm of central portion of female breast C7951 South South Atlantic 1.3242e+04 299.2533 44.9310 9.1227 10.6818 13.8364 11.3295 9.5705 14.7477 14.3523 12.4568 3.8909 50.5205 49.4795 50.1318 13.8023 29.3364 6.7386
Now we can use the Create Plot live task to plot histograms of the different variables in each dataset. In the plot below, blue bars represent data from the folks who were diagnosed in a timely manner, and the red bars represent data from the folks who were not.
figure
% Create histogram of selected data
histogram(allTrueData.health_uninsured,"NumBins",40,"DisplayName","health_uninsured");
hold on
% Create histogram of selected data
histogram(allFalseData.health_uninsured,"NumBins",40,"DisplayName","health_uninsured");
hold off
legend
Take some time to explore these visualizations on your own, as I can only show one at a time in this blog. It is worth noting that we have fewer False observations than True observations, so the red bars will almost always be lower than the blue bars. If there are red bars that are higher or if the shapes are different, that may indicate a relationship between a variable and time to diagnosis.
I didn’t see many significant differences in shape, though I did notice that for the ‘health_uninsured’ histograms the red bars are fairly high at the higher values, indicating that there may be a relationship between populations with high uninsured rates and time to diagnosis.
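Because the two groups have different sizes, one way to compare shapes more directly is to normalize each histogram so the bar heights sum to 1. The sketch below uses synthetic random values (not the datathon data) and descriptive legend labels that are my own choice:

```matlab
rng(0);  % fixed seed so the sketch is repeatable

% Synthetic stand-ins: a larger "True" group and a smaller "False" group.
trueValues = 8 + 3 * randn(7000, 1);
falseValues = 9 + 3 * randn(4000, 1);

figure
% "probability" scales each histogram so its bar heights sum to 1.
hTrue = histogram(trueValues, "NumBins", 40, "Normalization", "probability", ...
    "DisplayName", "Diagnosed within 90 days");
hold on
hFalse = histogram(falseValues, "NumBins", 40, "Normalization", "probability", ...
    "DisplayName", "Not diagnosed within 90 days");
hold off
legend
```

With this normalization, a taller red bar means a larger share of that group falls in the bin, regardless of how many patients each group contains.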

Bivariate and Multivariate Analysis

You can break the data down further and plot two (or more!) variables against each other to see if you can find any patterns. In the plot below, for example, we can see the percentage of the population that is uninsured and the state the patient is in, broken down by whether or not the patient was diagnosed within 90 days. Again, blue values indicate that the patient was, and red values indicate that the patient was not.
figure
% Create scatter of selected data
scatter(allTrueData,"patient_state","health_uninsured","DisplayName","health_uninsured");
hold on
% Create scatter of selected data
scatter(allFalseData,"patient_state","health_uninsured","DisplayName","health_uninsured");
hold off
legend
We can see that in some states, such as GA, OK, or TX, the red values come from populations that are typically higher in terms of being uninsured. This could indicate that in some states, coming from a zip code with a high population of uninsured folks (or being uninsured yourself) means you are more likely to experience delays in your diagnosis.
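If you want to back up a visual impression like this with numbers, one option is to summarize the uninsured rate per state and outcome. The sketch below uses the groupsummary function on a tiny made-up table. The variable names mirror the dataset, but the values are invented for illustration.

```matlab
% Toy table with hypothetical values (not the WiDS data)
toy = table( ...
    categorical(["GA";"GA";"GA";"GA";"TX";"TX";"TX";"TX"]), ...
    [1; 1; 0; 0; 1; 1; 0; 0], ...
    [10; 12; 18; 20; 14; 16; 22; 24], ...
    'VariableNames', {'patient_state', 'DiagPeriodL90D', 'health_uninsured'});

% Median uninsured rate for each state/outcome combination
stateSummary = groupsummary(toy, ["patient_state", "DiagPeriodL90D"], ...
    "median", "health_uninsured")
```

In this toy example the median for the 0 group is higher than for the 1 group in both states, which is the kind of gap you would look for in the real data.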

Statistical Analysis

You can also draw meaningful conclusions by calculating various statistics from your data. For example, I want to calculate the skewness, or level of asymmetry, of each of my variables. A negative value indicates the data is left skewed when plotted, a positive value indicates the data is right skewed, and a value of 0 means the distribution is symmetric.
statsTrue = varfun(@skewness, allTrueData, "InputVariables", @isnumeric);
statsFalse = varfun(@skewness, allFalseData, "InputVariables", @isnumeric);
Now I want to see if any of the variables have a significant difference in their skewness, as differences in the data distributions between patients who were diagnosed in a timely manner vs patients who were not could indicate an underlying relationship between those variables and time to diagnosis.
statsDiffs = abs(statsTrue{:, :} - statsFalse{:, :});
statsTrue.Properties.VariableNames(statsDiffs > 0.2)
ans = 1×4 cell
'skewness_density'    'skewness_age_over_80'    'skewness_rent_burden'    'skewness_race_native'
If we investigate the four variables that are returned, we can see that population density, the percentage of folks above 80 in your zip code, the median rent burden of your zip code, and the percentage of residents who reported their race as American Indian or Alaska Native in your zip code may have a relationship with time to diagnosis.
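To make the skewness values a little less of a black box, here is a small sketch that computes skewness by hand as the third central moment divided by the 1.5 power of the second central moment. To my understanding this matches the default (not bias-corrected) definition used by the skewness function, but treat that as an assumption and check the documentation. The sample values are made up.

```matlab
% Toy sample with a long right tail (hypothetical values)
x = [1 2 3 4 10];

mu = mean(x);
m2 = mean((x - mu).^2);     % second central moment
m3 = mean((x - mu).^3);     % third central moment
skewRight = m3 / m2^1.5     % positive: right skewed

% A perfectly symmetric sample has zero skewness
xSym = [1 2 3 4 5];
skewSym = mean((xSym - mean(xSym)).^3) / mean((xSym - mean(xSym)).^2)^1.5
```

The single large value of 10 pulls the tail to the right, which is why the first result is positive while the symmetric sample gives 0.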

Feature Engineering

When it comes to machine learning, you don’t have to use all of the data as it is presented to you. Feature Engineering is the process of deciding what data you want to use, creating new data based on the provided data, and transforming the data to be in whatever format or range is suitable for your workflow. You can do this manually, and some of the exploration we just did should influence decisions you make if you want to play around with including or excluding different variables.
For this blog, I’ll use the gencfeatures function to automate this process. I ask for 90 features, which is 10 more than we currently have in our dataset, and the function generates a set of 90 candidate features from our processed dataset. It may keep some variables as-is, but will often standardize numeric variables and create new variables by transforming or combining the provided data.
[T, augTrainData] = gencfeatures(fullData, "DiagPeriodL90D", 90)
Warning: Table variable names were truncated to the length namelengthmax.
T =

FeatureTransformer with properties:

Type: 'classification'
TargetLearner: 'linear'
NumEngineeredFeatures: 89
NumOriginalFeatures: 1
TotalNumFeatures: 90

augTrainData = 12157×91 table
metastatic_cancer_diagnosis_code zsc(woe2(breast_cancer_diagnosis_code)) zsc(woe2(breast_cancer_diagnosis_desc)) zsc(woe2(metastatic_cancer_diagnosis_code)) zsc(woe2(patient_state)) zsc(patient_age./Ozone) zsc(patient_age./commute_time) zsc(kmc51) eb28(education_less_highschool) zsc(income_household_35_to_50./income_household_75_to_100) zsc(kmc12) eb11(patient_age) q28(income_household_under_5) zsc(rent_burden-education_less_highschool) q11(patient_age) zsc(sig(family_dual_income)) zsc(sig(patient_age)) zsc(sin(PM25)) zsc(cos(rent_median)) zsc(sin(patient_zip3)) zsc(health_uninsured./PM25) zsc(cos(population)) zsc(cos(education_bachelors)) zsc(sin(hispanic)) q28(density) eb28(education_highschool) zsc(income_household_75_to_100.*rent_burden) q28(unemployment_rate) q28(patient_zip3) zsc(patient_id.*hispanic)
1 C7989 0.5390 0.5390 -0.4180 0.3566 0.2943 1.2850 -0.9221 28 0.4148 0.8408 11 17 -2.6485 11 0.0164 0.0329 0.4182 -1.3204 0.4400 0.1652 -1.3368 -0.8396 -0.9848 19 16 1.3688 25 25 1.8601
2 C773 0.5403 0.5403 0.6847 0.3566 -0.0953 -0.2533 1.1450 14 -1.0911 -0.6892 7 9 0.2093 7 0.0164 0.0329 0.5817 0.4439 -1.4209 -0.5852 -1.2593 0.1155 0.3359 24 8 0.6588 11 26 0.2666
3 C773 0.4472 0.4472 0.6847 -0.4817 -1.1723 -1.2683 -0.5463 9 -0.1595 0.1389 3 10 -0.1364 2 0.0164 0.0329 0.8441 -0.5097 -0.4515 1.2506 1.0719 0.8394 0.7034 14 14 0.0868 7 19 -0.6458
4 C773 0.4804 0.4804 0.6847 0.3566 -1.1790 -0.8614 2.4460 2 -1.4278 -2.6472 3 19 1.2015 2 0.0164 0.0329 0.5895 0.0984 0.9142 -0.9343 -1.3164 -0.5676 -1.2673 23 1 -0.8003 13 26 0.0137
5 C773 -1.8389 -1.8389 0.6847 -0.1188 -0.4668 -0.1359 -0.1677 10 1.1998 0.5623 5 11 -0.6774 5 0.0164 0.0329 -1.9107 -1.1374 0.3927 2.8054 -1.0779 0.0894 0.9926 5 16 -0.8990 5 21 0.0631
6 C7981 -1.8389 -1.8389 -3.0232 -0.7796 0.3970 0.6970 -0.5463 5 -0.4166 0.1389 8 1 -0.1713 9 0.0164 0.0329 -0.8052 -1.4920 0.4399 -0.6668 0.1846 -1.5704 0.8916 8 23 -0.1033 3 3 -0.8538
7 C779 0.5150 0.5150 -1.0339 0.3566 -0.7176 -0.8180 1.2575 26 0.1398 -0.9325 4 28 -0.8526 4 0.0164 0.0329 -2.1568 0.2754 1.3443 -0.4261 0.5523 0.3214 1.4403 27 6 -0.0186 26 23 2.7136
8 C773 0.5605 0.5605 0.6847 -0.8787 0.2552 0.4365 -0.9221 8 1.2589 0.8408 6 12 -1.0383 7 0.0164 0.0329 0.6462 0.3778 -0.2288 0.0129 0.3019 1.0857 0.5697 1 26 -1.3912 17 17 -0.7649
9 C773 0.5150 0.5150 0.6847 -0.8787 0.5303 0.7654 -0.5463 1 -0.3771 0.1389 7 15 0.0934 8 0.0164 0.0329 0.6494 1.3120 1.2738 -0.8822 1.1952 1.2474 0.1205 3 23 -0.1783 2 17 -0.8367
10 C799 0.5892 0.5892 -0.7083 0.4285 0.6062 0.4731 -0.1677 5 0.3661 0.5623 8 4 -0.1511 9 0.0164 0.0329 0.8468 0.7382 -1.3160 -0.3069 -0.5051 -1.0839 -1.0734 9 22 -0.5364 11 13 -0.5800
11 C7800 -1.8389 -1.8389 -0.4676 0.3566 0.2790 -0.0623 2.5856 26 0.1398 -3.1658 8 28 -0.8526 9 0.0164 0.0329 -2.1568 0.2754 1.3443 -0.4261 0.5523 0.3214 1.4403 27 6 -0.0186 26 23 0.4736
12 C781 0.5758 0.5758 -1.7502 0.4285 -0.0598 -0.2406 -0.5463 4 -0.4701 0.1389 6 8 0.1148 6 0.0164 0.0329 0.8290 -1.2051 0.8002 -0.9475 0.0945 1.1447 0.5680 16 14 -0.0036 16 12 -0.8148
13 C773 0.5668 0.5668 0.6847 0.3566 0.2459 -0.8014 1.2575 6 -1.3361 -0.9325 6 5 0.6748 6 0.0164 0.0329 -0.6779 0.9431 0.7497 -0.9270 0.9452 -0.4288 0.1632 17 4 -0.4941 13 27 0.7836
14 C786 0.5403 0.5403 -0.0208 0.4096 0.4423 0.4481 -1.0650 7 1.5972 0.0484 7 22 1.3062 7 0.0164 0.0329 -1.4776 0.6570 1.1964 0.8045 1.4144 0.4459 1.3659 4 15 0.5212 22 28 -0.4435
To better understand the generated features, you can use the describe function of the returned FeatureTransformer object, ‘T’.
describe(T)
Type IsOriginal InputVariables Transformations
___________ __________ _____________________________________________________ ________________________________________________________________________
metastatic_cancer_diagnosis_code Categorical true metastatic_cancer_diagnosis_code
zsc(woe2(breast_cancer_diagnosis_code)) Numeric false breast_cancer_diagnosis_code Weight of Evidence (positive class = 1)
Standardization with z-score (mean = -0.046637, std = 1.5098)
zsc(woe2(breast_cancer_diagnosis_desc)) Numeric false breast_cancer_diagnosis_desc Weight of Evidence (positive class = 1)
Standardization with z-score (mean = -0.046637, std = 1.5098)
zsc(woe2(metastatic_cancer_diagnosis_code)) Numeric false metastatic_cancer_diagnosis_code Weight of Evidence (positive class = 1)
Standardization with z-score (mean = 0.0067098, std = 0.28786)
zsc(woe2(patient_state)) Numeric false patient_state Weight of Evidence (positive class = 1)
Standardization with z-score (mean = 0.0060064, std = 0.23323)
zsc(patient_age./Ozone) Numeric false patient_age, Ozone patient_age ./ Ozone
Standardization with z-score (mean = 1.5005, std = 0.36544)
zsc(patient_age./commute_time) Numeric false patient_age, commute_time patient_age ./ commute_time
Standardization with z-score (mean = 2.1895, std = 0.64638)
zsc(kmc51) Numeric false all valid numeric variables Centroid encoding (component #51) (kmeans clustering with k = 10)
Standardization with z-score (mean = 5.9447, std = 0.1673)
eb28(education_less_highschool) Categorical false education_less_highschool Equal-width binning (number of bins = 28)
zsc(income_household_35_to_50./income_household_75_to_100) Numeric false income_household_35_to_50, income_household_75_to_100 income_household_35_to_50 ./ income_household_75_to_100
Standardization with z-score (mean = 0.93234, std = 0.2685)
zsc(kmc12) Numeric false all valid numeric variables Centroid encoding (component #12) (kmeans clustering with k = 10)
Standardization with z-score (mean = 13.4409, std = 0.15797)
eb11(patient_age) Categorical false patient_age Equal-width binning (number of bins = 11)
q28(income_household_under_5) Categorical false income_household_under_5 Equiprobable binning (number of bins = 28)
zsc(rent_burden-education_less_highschool) Numeric false rent_burden, education_less_highschool rent_burden - education_less_highschool
Standardization with z-score (mean = 19.3265, std = 5.7168)
q11(patient_age) Categorical false patient_age Equiprobable binning (number of bins = 11)
zsc(sig(family_dual_income)) Numeric false family_dual_income sigmoid( )
Standardization with z-score (mean = 1, std = 4.2283e-11)
zsc(sig(patient_age)) Numeric false patient_age sigmoid( )
Standardization with z-score (mean = 1, std = 4.0863e-10)
zsc(sin(PM25)) Numeric false PM25 sin( )
Standardization with z-score (mean = 0.42558, std = 0.65419)
zsc(cos(rent_median)) Numeric false rent_median cos( )
Standardization with z-score (mean = 0.046444, std = 0.68827)
zsc(sin(patient_zip3)) Numeric false patient_zip3 sin( )
Standardization with z-score (mean = 0.054487, std = 0.70171)
zsc(health_uninsured./PM25) Numeric false health_uninsured, PM25 health_uninsured ./ PM25
Standardization with z-score (mean = 1.1917, std = 0.6234)
zsc(cos(population)) Numeric false population cos( )
Standardization with z-score (mean = -0.03209, std = 0.71354)
zsc(cos(education_bachelors)) Numeric false education_bachelors cos( )
Standardization with z-score (mean = 0.096871, std = 0.68966)
zsc(sin(hispanic)) Numeric false hispanic sin( )
Standardization with z-score (mean = 0.017785, std = 0.6817)
q28(density) Categorical false density Equiprobable binning (number of bins = 28)
eb28(education_highschool) Categorical false education_highschool Equal-width binning (number of bins = 28)
zsc(income_household_75_to_100.*rent_burden) Numeric false income_household_75_to_100, rent_burden income_household_75_to_100 .* rent_burden
Standardization with z-score (mean = 392.7502, std = 61.6458)
q28(unemployment_rate) Categorical false unemployment_rate Equiprobable binning (number of bins = 28)
q28(patient_zip3) Categorical false patient_zip3 Equiprobable binning (number of bins = 28)
zsc(patient_id.*hispanic) Numeric false patient_id, hispanic patient_id .* hispanic
Standardization with z-score (mean = 10169065.2502, std = 11587944.1233)
zsc(home_value.*race_other) Numeric false home_value, race_other home_value .* race_other
Standardization with z-score (mean = 2725364.3718, std = 4298818.8992)
zsc(patient_age.*income_household_20_to_25) Numeric false patient_age, income_household_20_to_25 patient_age .* income_household_20_to_25
Standardization with z-score (mean = 241.7171, std = 97.8001)
q25(farmer) Categorical false farmer Equiprobable binning (number of bins = 25)
q27(race_native) Categorical false race_native Equiprobable binning (number of bins = 27)
eb28(age_median) Categorical false age_median Equal-width binning (number of bins = 28)
q28(never_married) Categorical false never_married Equiprobable binning (number of bins = 28)
zsc(cos(patient_age)) Numeric false patient_age cos( )
Standardization with z-score (mean = 0.021113, std = 0.71469)
zsc(sin(race_black)) Numeric false race_black sin( )
Standardization with z-score (mean = 0.16517, std = 0.70668)
zsc(tanh(age_50s)) Numeric false age_50s tanh( )
Standardization with z-score (mean = 1, std = 8.9224e-09)
zsc(male+female) Numeric false male, female male + female
Standardization with z-score (mean = 100.0001, std = 0.000436)
q28(female) Categorical false female Equiprobable binning (number of bins = 28)
eb28(male) Categorical false male Equal-width binning (number of bins = 28)
zsc(sin(age_median)) Numeric false age_median sin( )
Standardization with z-score (mean = -0.1365, std = 0.71613)
q28(home_ownership) Categorical false home_ownership Equiprobable binning (number of bins = 28)
zsc(age_over_80./income_household_20_to_25) Numeric false age_over_80, income_household_20_to_25 age_over_80 ./ income_household_20_to_25
Standardization with z-score (mean = 1.0866, std = 0.51568)
zsc(cos(education_highschool)) Numeric false education_highschool cos( )
Standardization with z-score (mean = -0.019221, std = 0.71994)
zsc(cos(race_black)) Numeric false race_black cos( )
Standardization with z-score (mean = -0.020693, std = 0.68773)
q28(self_employed) Categorical false self_employed Equiprobable binning (number of bins = 28)
zsc(cos(age_median)) Numeric false age_median cos( )
Standardization with z-score (mean = -0.029038, std = 0.68394)
q50(patient_id) Categorical false patient_id Equiprobable binning (number of bins = 50)
zsc(sin(race_asian)) Numeric false race_asian sin( )
Standardization with z-score (mean = 0.28421, std = 0.64235)
q28(education_stem_degree) Categorical false education_stem_degree Equiprobable binning (number of bins = 28)
zsc(cos(age_20s)) Numeric false age_20s cos( )
Standardization with z-score (mean = 0.10518, std = 0.69162)
eb23(N02) Categorical false N02 Equal-width binning (number of bins = 23)
q28(rent_burden) Categorical false rent_burden Equiprobable binning (number of bins = 28)
zsc(race_asian.*veteran) Numeric false race_asian, veteran race_asian .* veteran
Standardization with z-score (mean = 28.4889, std = 30.7)
zsc(sin(income_household_35_to_50)) Numeric false income_household_35_to_50 sin( )
Standardization with z-score (mean = 0.03083, std = 0.68752)
zsc(cos(patient_zip3)) Numeric false patient_zip3 cos( )
Standardization with z-score (mean = -0.06867, std = 0.7071)
eb28(rent_burden) Categorical false rent_burden Equal-width binning (number of bins = 28)
zsc(sig(rent_burden)) Numeric false rent_burden sigmoid( )
Standardization with z-score (mean = 1, std = 3.571e-10)
q28(age_over_80) Categorical false age_over_80 Equiprobable binning (number of bins = 28)
q28(family_dual_income) Categorical false family_dual_income Equiprobable binning (number of bins = 28)
q28(family_size) Categorical false family_size Equiprobable binning (number of bins = 28)
zsc(age_over_80./income_household_5_to_10) Numeric false age_over_80, income_household_5_to_10 age_over_80 ./ income_household_5_to_10
Standardization with z-score (mean = 2.0422, std = 1.3415)
eb28(age_10_to_19) Categorical false age_10_to_19 Equal-width binning (number of bins = 28)
q28(income_individual_median) Categorical false income_individual_median Equiprobable binning (number of bins = 28)
zsc(age_over_80./unemployment_rate) Numeric false age_over_80, unemployment_rate age_over_80 ./ unemployment_rate
Standardization with z-score (mean = 0.74942, std = 0.37691)
zsc(cos(income_household_50_to_75)) Numeric false income_household_50_to_75 cos( )
Standardization with z-score (mean = -0.012865, std = 0.69717)
eb25(race_pacific) Categorical false race_pacific Equal-width binning (number of bins = 25)
zsc(sin(patient_id)) Numeric false patient_id sin( )
Standardization with z-score (mean = -0.0018454, std = 0.70739)
zsc(race_native./race_multiple) Numeric false race_native, race_multiple race_native ./ race_multiple
Standardization with z-score (mean = 0.14079, std = 0.41944)
eb28(income_household_25_to_35) Categorical false income_household_25_to_35 Equal-width binning (number of bins = 28)
zsc(age_50s-income_household_75_to_100) Numeric false age_50s, income_household_75_to_100 age_50s - income_household_75_to_100
Standardization with z-score (mean = 0.77657, std = 2.1264)
zsc(cos(age_60s)) Numeric false age_60s cos( )
Standardization with z-score (mean = 0.05337, std = 0.75178)
q28(income_household_35_to_50) Categorical false income_household_35_to_50 Equiprobable binning (number of bins = 28)
eb21(race_black) Categorical false race_black Equal-width binning (number of bins = 21)
zsc(sin(income_individual_median)) Numeric false income_individual_median sin( )
Standardization with z-score (mean = 0.045145, std = 0.69873)
q28(age_50s) Categorical false age_50s Equiprobable binning (number of bins = 28)
q28(race_white) Categorical false race_white Equiprobable binning (number of bins = 28)
q28(age_under_10) Categorical false age_under_10 Equiprobable binning (number of bins = 28)
q28(disabled) Categorical false disabled Equiprobable binning (number of bins = 28)
zsc(patient_age./income_household_100_to_150) Numeric false patient_age, income_household_100_to_150 patient_age ./ income_household_100_to_150
Standardization with z-score (mean = 3.9266, std = 1.314)
q28(income_household_75_to_100) Categorical false income_household_75_to_100 Equiprobable binning (number of bins = 28)
zsc(sin(N02)) Numeric false N02 sin( )
Standardization with z-score (mean = 0.039533, std = 0.70149)
eb28(family_size) Categorical false family_size Equal-width binning (number of bins = 28)
q28(limited_english) Categorical false limited_english Equiprobable binning (number of bins = 28)
q28(income_household_100_to_150) Categorical false income_household_100_to_150 Equiprobable binning (number of bins = 28)
zsc(farmer.*race_black) Numeric false farmer, race_black farmer .* race_black
Standardization with z-score (mean = 10.7649, std = 26.8957)
zsc(home_value.*race_pacific) Numeric false home_value, race_pacific home_value .* race_pacific
Standardization with z-score (mean = 59826.8413, std = 128896.4218)
zsc(education_graduate.*health_uninsured) Numeric false education_graduate, health_uninsured education_graduate .* health_uninsured
Standardization with z-score (mean = 97.7642, std = 54.0304)
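To make the notation in this output more concrete, here is a rough sketch of what a feature like ‘zsc(patient_age./Ozone)’ describes: an element-wise ratio of two variables, followed by z-score standardization. The numbers are made up, and the use of the sample standard deviation is my assumption about how the standardization is done, so treat this as an illustration and not as the exact internals of gencfeatures.

```matlab
% Toy table with hypothetical values (not the WiDS data)
toy = table([54; 61; 79; 47], [38; 40; 42; 36], ...
    'VariableNames', {'patient_age', 'Ozone'});

% Step 1: combine two variables with an element-wise ratio
ratio = toy.patient_age ./ toy.Ozone;

% Step 2: standardize so the new feature has mean 0 and standard deviation 1
toy.zsc_age_over_ozone = (ratio - mean(ratio)) ./ std(ratio)
```

You can use the same pattern to hand-craft features that your data exploration suggested might matter.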

Split the Data

The last step before you can train a machine learning model is to split your data into a training and testing set. We’ll use the training data to fit the model, and the testing set to evaluate how well the model performs on new data before we use it to make a submission. Here I split the data into 80% training and 20% testing.
numRows = height(augTrainData);
[trainInd, ~, testInd] = dividerand(numRows, .8, 0, .2);
trainingData = augTrainData(trainInd, :);
testingData = augTrainData(testInd, :);
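Since we have fewer False observations than True ones, you may want a split that keeps the class proportions similar in both sets. One way to sketch this is with cvpartition, which stratifies by the grouping variable when you pass the labels. The labels below are synthetic stand-ins for ‘DiagPeriodL90D’.

```matlab
% Synthetic, imbalanced labels (hypothetical stand-in for DiagPeriodL90D)
rng(1);
labels = [ones(80, 1); zeros(20, 1)];

% Hold out 20% of the rows for testing, stratified by label
c = cvpartition(labels, "Holdout", 0.2);
trainIdx = training(c);   % logical index of training rows
testIdx  = test(c);       % logical index of testing rows

% Compare the share of 1s in each set
fprintf("Share of 1s: train = %.2f, test = %.2f\n", ...
    mean(labels(trainIdx)), mean(labels(testIdx)));
```

With a real table, you would index rows the same way, for example augTrainData(trainIdx, :).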

Train a Machine Learning Model

In this example, I’ll create a binary decision tree using the fitctree function and set the ‘OptimizeHyperparameters’ argument to ‘auto’, which attempts to minimize the cross-validation loss of the tree by searching for the best value of the ‘MinLeafSize’ parameter. It prints and visualizes the result of each trial, as can be seen below.
classificationTree = fitctree(trainingData, "DiagPeriodL90D", ...
    OptimizeHyperparameters="auto");
|======================================================================================|
| Iter | Eval | Objective | Objective | BestSoFar | BestSoFar | MinLeafSize |
| | result | | runtime | (observed) | (estim.) | |
|======================================================================================|
| 1 | Best | 0.18764 | 1.4699 | 0.18764 | 0.18764 | 1676 |
| 2 | Accept | 0.18764 | 0.87349 | 0.18764 | 0.18764 | 162 |
| 3 | Accept | 0.20923 | 1.005 | 0.18764 | 0.19426 | 36 |
| 4 | Accept | 0.29395 | 1.6132 | 0.18764 | 0.18764 | 3 |
| 5 | Accept | 0.18764 | 0.6073 | 0.18764 | 0.1876 | 491 |
| 6 | Accept | 0.38012 | 0.21492 | 0.18764 | 0.24104 | 4858 |
| 7 | Accept | 0.18764 | 0.60759 | 0.18764 | 0.18764 | 330 |
| 8 | Accept | 0.18764 | 0.36986 | 0.18764 | 0.18763 | 1033 |
| 9 | Accept | 0.19227 | 1.0609 | 0.18764 | 0.18762 | 80 |
| 10 | Accept | 0.24409 | 1.4868 | 0.18764 | 0.18761 | 13 |
| 11 | Accept | 0.18764 | 0.3479 | 0.18764 | 0.18568 | 1363 |
| 12 | Accept | 0.18764 | 0.70426 | 0.18764 | 0.1861 | 231 |
| 13 | Accept | 0.18764 | 0.48941 | 0.18764 | 0.18678 | 698 |
| 14 | Accept | 0.29519 | 2.1238 | 0.18764 | 0.18671 | 1 |
| 15 | Accept | 0.18764 | 0.35153 | 0.18764 | 0.18736 | 1438 |
| 16 | Accept | 0.18764 | 0.86203 | 0.18764 | 0.18735 | 119 |
| 17 | Accept | 0.18764 | 0.41595 | 0.18764 | 0.18734 | 849 |
| 18 | Accept | 0.18764 | 0.31486 | 0.18764 | 0.18737 | 1527 |
| 19 | Accept | 0.18764 | 0.60161 | 0.18764 | 0.18738 | 404 |
| 20 | Accept | 0.18764 | 0.45615 | 0.18764 | 0.18738 | 589 |
|======================================================================================|
| Iter | Eval | Objective | Objective | BestSoFar | BestSoFar | MinLeafSize |
| | result | | runtime | (observed) | (estim.) | |
|======================================================================================|
| 21 | Accept | 0.18764 | 0.30864 | 0.18764 | 0.18745 | 1515 |
| 22 | Accept | 0.18764 | 0.71981 | 0.18764 | 0.18745 | 138 |
| 23 | Accept | 0.18764 | 0.62974 | 0.18764 | 0.18745 | 278 |
| 24 | Accept | 0.18764 | 0.27013 | 0.18764 | 0.18749 | 1511 |
| 25 | Accept | 0.18764 | 0.62894 | 0.18764 | 0.18749 | 196 |
| 26 | Accept | 0.18764 | 0.40254 | 0.18764 | 0.18749 | 811 |
| 27 | Accept | 0.30239 | 0.19617 | 0.18764 | 0.18741 | 2944 |
| 28 | Accept | 0.18764 | 0.27176 | 0.18764 | 0.18741 | 1170 |
| 29 | Accept | 0.18764 | 0.37273 | 0.18764 | 0.18747 | 1576 |
| 30 | Accept | 0.18764 | 0.45381 | 0.18764 | 0.18747 | 945 |
__________________________________________________________
Optimization completed.
MaxObjectiveEvaluations of 30 reached.
Total function evaluations: 30
Total elapsed time: 50.9097 seconds
Total objective function evaluation time: 20.2308
Best observed feasible point:
MinLeafSize
___________
1676
Observed objective function value = 0.18764
Estimated objective function value = 0.18815
Function evaluation time = 1.4699
Best estimated feasible point (according to models):
MinLeafSize
___________
1527
Estimated objective function value = 0.18747
Estimated function evaluation time = 0.36237
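If you want a feel for what the optimizer is doing, you can sweep ‘MinLeafSize’ by hand and compare the cross-validated loss for each value. This is a simplified sketch on synthetic data with a handful of leaf sizes I picked arbitrarily. It is a plain grid sweep, not the Bayesian optimization that ‘OptimizeHyperparameters’ runs.

```matlab
% Synthetic binary classification data (hypothetical, not the WiDS data)
rng(0);
X = randn(400, 3);
y = double(X(:, 1) + 0.5*randn(400, 1) > 0);

leafSizes = [1 10 50 150];          % arbitrary candidate values
cvLoss = zeros(size(leafSizes));

for k = 1:numel(leafSizes)
    % "KFold", 5 returns a cross-validated model instead of a single tree
    cvTree = fitctree(X, y, "MinLeafSize", leafSizes(k), "KFold", 5);
    cvLoss(k) = kfoldLoss(cvTree);  % misclassification rate across folds
end

sweepResults = table(leafSizes', cvLoss', ...
    'VariableNames', {'MinLeafSize', 'CVLoss'})
```

Very small leaves tend to overfit and very large leaves tend to underfit, so the lowest loss often lands somewhere in between.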
I used a binary tree as my starting point, but it’s important to test out different types of algorithms to see what works best with your data! Check out the Classification Learner app documentation and this short video to learn how to train several machine learning models quickly and iteratively!

Test Your Model

There are many ways to evaluate the performance of a machine learning model. In this blog, I’ll show two of them: computing validation accuracy and scoring the model on held-out testing data.

Validation Accuracy

Cross-validation is one method of evaluating a model, and at a high level is done by:
  1. Setting aside a subset of the training data, known as validation data
  2. Using the rest of the training data to fit the model
  3. Testing how well the model performs on the validation data
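You can use the crossval function to do this. With ‘KFold’ set to 5, it repeats the steps above five times, each time holding out a different fifth of the training data: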
% Perform cross-validation
partitionedModel = crossval(classificationTree, "KFold", 5);
Then, extract the misclassification rate, and subtract it from 1 to get the model’s accuracy. The closer to 1 this value is, the more accurate our model is.
% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, LossFun="ClassifError")
validationAccuracy = 0.8124
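The three cross-validation steps listed above can also be written out as an explicit loop, which shows what happens in each fold. This is a sketch on synthetic data, using cvpartition to create the folds. It is meant to illustrate the idea, not to reproduce exactly what crossval does internally.

```matlab
% Synthetic binary classification data (hypothetical, not the WiDS data)
rng(0);
X = randn(300, 2);
y = double(X(:, 1) + 0.3*randn(300, 1) > 0);

c = cvpartition(y, "KFold", 5);
foldAcc = zeros(c.NumTestSets, 1);

for k = 1:c.NumTestSets
    trIdx = training(c, k);     % step 2: rows used to fit the model
    vaIdx = test(c, k);         % step 1: rows set aside for validation

    foldTree = fitctree(X(trIdx, :), y(trIdx));

    % step 3: accuracy on the held-out rows
    foldAcc(k) = mean(predict(foldTree, X(vaIdx, :)) == y(vaIdx));
end

meanAccuracy = mean(foldAcc)
```

Looking at the per-fold values in foldAcc can also tell you how stable the model is: a large spread between folds suggests the accuracy depends heavily on which rows end up in the training set.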

Testing Data

In this section, we’ll use the ‘testingData’ dataset we created earlier. Similar to what we did with the validation data, we can use the loss function to compute the misclassification rate when you use the classification tree on the testing data, and subtract it from 1 to get a measure of accuracy.
testAccuracy = 1 - loss(classificationTree, testingData, "DiagPeriodL90D", ...
    LossFun="classiferror")
testAccuracy = 0.8048
I also want to compare the predictions that the model makes to the actual outputs, so let’s remove the ‘DiagPeriodL90D’ variable from our testing data:
testActual = testingData.DiagPeriodL90D;
testingData = removevars(testingData, "DiagPeriodL90D");
Now, use the model to make predictions on the testing set:
[testPreds, scores, ~, ~] = predict(classificationTree, testingData);
And use the confusionchart function to compare the predicted outputs to the actual outputs, to see how often they match or don’t.
confusionchart(testActual, testPreds)
This shows that the model almost always predicts the 1s correctly (patients who were diagnosed within 90 days), but it has only about a 50/50 chance of predicting the 0s correctly.
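If you would like these per-class rates as numbers, you can compute them from the confusion matrix yourself. Here is a small sketch using confusionmat on made-up labels and predictions. On your own results you would pass ‘testActual’ and ‘testPreds’ instead.

```matlab
% Hypothetical actual labels and predictions (not real model output)
actual = [1 1 1 1 1 1 0 0 0 0]';
preds  = [1 1 1 1 1 0 1 1 0 0]';

% Rows are actual classes, columns are predicted classes, in sorted order [0 1]
C = confusionmat(actual, preds)

tnr = C(1, 1) / sum(C(1, :))   % share of actual 0s predicted as 0
tpr = C(2, 2) / sum(C(2, :))   % share of actual 1s predicted as 1
```

In this toy example the model gets 5 of 6 of the 1s right but only 2 of 4 of the 0s, which is the same kind of imbalance described above. I believe confusionchart can also display these rates directly through its ‘RowSummary’ option, but check the documentation for the exact values it accepts.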
We can also use the test data and predictions to visualize receiver operating characteristic (ROC) metrics. The ROC curve shows the true positive rate (TPR) versus the false positive rate (FPR) for different thresholds of classification scores. The “Model Operating Point” shows the false positive rate and true positive rate of the model.
rocObj = rocmetrics(testActual, scores, classificationTree.ClassNames);
plot(rocObj)
Here we can see that the classifier correctly assigns about 90-95% of the 1 class observations to 1 (TPR), but incorrectly assigns about 40% of the 0 class observations as 1 (FPR). This is similar to what we observed with the confusion chart.
You can also extract the area under the curve (AUC) value, which is a measure of the overall quality of the classifier. The AUC values are in the range 0 to 1, and larger AUC values indicate better classifier performance.
rocObj.AUC
The AUC is pretty high, but shows that there is definitely room for improvement. To learn more about ROC metrics, check out this documentation page that explains it in more detail.
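One way to build intuition for the AUC is that it equals the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a handful of made-up scores. It is a hand calculation for illustration, and not how rocmetrics is implemented.

```matlab
% Hypothetical classification scores for the positive class
posScores = [0.9; 0.8; 0.4];   % observations whose true class is 1
negScores = [0.7; 0.3; 0.2];   % observations whose true class is 0

% Compare every positive score with every negative score (3-by-3 matrix)
diffs = posScores - negScores';

% Count the pairs where the positive wins; ties count as half
auc = (sum(diffs > 0, "all") + 0.5*sum(diffs == 0, "all")) / numel(diffs)
```

Here 8 of the 9 pairs are ordered correctly, so the AUC is 8/9, or about 0.89. An AUC of 0.5 would mean the scores are no better than random at ranking the two classes.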

Create Submission

Once you have a model that performs well on the validation and testing data, it’s time to create a submission for the datathon! As a reminder, you will upload this file to Kaggle to be scored on the leaderboard.
First, import the ‘Test’ dataset:
testDataFilename = "Test.csv";
allTestData = readtable(fullfile(dataFolder, testDataFilename))
allTestData = 3999×83 table
patient_id patient_race payer_type patient_state patient_zip3 patient_age patient_gender bmi breast_cancer_diagnosis_code breast_cancer_diagnosis_desc metastatic_cancer_diagnosis_code metastatic_first_novel_treatment metastatic_first_novel_treatment_type Region Division population density age_median age_under_10 age_10_to_19 age_20s age_30s age_40s age_50s age_60s age_70s age_over_80 male female married
1 573710 ‘White’ ‘MEDICAID’ ‘IN’ 467 54 ‘F’ NaN ‘C50412’ ‘Malig neoplasm of upper-outer quadrant of left female breast’ ‘C773’ NaN NaN ‘Midwest’ ‘East North Central’ 5.4414e+03 85.6210 40.8803 12.7323 14.0887 10.6597 11.6258 11.2081 15.6194 12.3226 8.4097 3.3435 49.1548 50.8452 55.1758
2 593679 ‘COMMERCIAL’ ‘FL’ 337 52 ‘F’ NaN ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C787’ NaN NaN ‘South’ ‘South Atlantic’ 1.9614e+04 1.5551e+03 49.1077 8.0692 8.5872 10.6846 11.3026 10.9718 15.8231 15.9026 11.8282 6.8154 49.6590 50.3410 44.8000
3 184532 ‘Hispanic’ ‘MEDICAID’ ‘CA’ 917 61 ‘F’ NaN ‘C50911’ ‘Malignant neoplasm of unsp site of right female breast’ ‘C773’ NaN NaN ‘West’ ‘Pacific’ 4.3030e+04 2.0486e+03 38.8522 11.3065 12.8978 14.1217 13.5326 13.1609 13.3783 11.4739 6.3804 3.7370 49.0522 50.9478 48.5043
4 184532 ‘Hispanic’ ‘MEDICARE ADVANTAGE’ ‘CA’ 917 61 ‘F’ NaN ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C779’ NaN NaN ‘West’ ‘Pacific’ 4.3030e+04 2.0486e+03 38.8522 11.3065 12.8978 14.1217 13.5326 13.1609 13.3783 11.4739 6.3804 3.7370 49.0522 50.9478 48.5043
5 447383 ‘Black’ ‘CA’ 917 64 ‘F’ 23 ‘C50412’ ‘Malig neoplasm of upper-outer quadrant of left female breast’ ‘C779’ NaN NaN ‘West’ ‘Pacific’ 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785
6 281312 ‘COMMERCIAL’ ‘MI’ 483 64 ‘F’ 24 ‘1748’ ‘Malignant neoplasm of other specified sites of female breast’ ‘C7800’ NaN NaN ‘Midwest’ ‘East North Central’ 2.0151e+04 724.9353 42.0784 11.0392 13.0098 11.6431 11.8882 13.0647 15.1098 12.8686 7.4000 3.9588 49.2922 50.7078 54.0137
7 492714 ‘COMMERCIAL’ ‘TX’ 761 91 ‘F’ NaN ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C773’ NaN NaN ‘South’ ‘West South Central’ 2.9482e+04 1.3355e+03 33.6278 13.1611 15.3444 16.7250 15.2167 12.5361 11.4139 8.8583 4.4167 2.3250 47.6694 52.3306 43.4639
8 378266 ‘White’ ‘MEDICARE ADVANTAGE’ ‘IN’ 473 79 ‘F’ NaN ‘C50212’ ‘Malig neoplasm of upper-inner quadrant of left female breast’ ‘C773’ NaN NaN ‘Midwest’ ‘East North Central’ 5.2774e+03 296.8542 42.0763 11.0220 13.9932 12.1288 10.4949 12.3237 13.4797 13.8864 7.6407 5.0271 48.5627 51.4373 50.6559
9 291550 ‘COMMERCIAL’ ‘AZ’ 852 50 ‘F’ NaN ‘C50919’ ‘Malignant neoplasm of unsp site of unspecified female breast’ ‘C773’ NaN NaN ‘West’ ‘Mountain’ 3.5899e+04 1.1664e+03 41.8273 10.8364 12.3045 12.7114 12.7545 11.8909 13.0341 12.7659 9.1523 4.5614 49.7568 50.2432 51.5750
10 612272 ‘COMMERCIAL’ ‘CA’ 902 47 ‘F’ 24 ‘C50412’ ‘Malig neoplasm of upper-outer quadrant of left female breast’ ‘C7801’ NaN NaN ‘West’ ‘Pacific’ 3.5350e+04 3.5588e+03 38.7486 11.0686 13.8657 13.6371 13.7886 13.7000 13.0686