Predicting Timely Diagnosis of Metastatic Breast Cancer for the WiDS Datathon 2024
In today’s blog, Grace Woolson will show how you can use MATLAB and machine learning to make meaningful deductions from healthcare data for patients who have been diagnosed with metastatic breast cancer. Over to you Grace!
Introduction
In this blog, I will show how you can use MATLAB for the WiDS Datathon 2024, using the dataset for WiDS Datathon #1, which runs from January 9th, 2024 to March 1st, 2024. This challenge tasks participants with creating a model that can predict whether or not a patient with metastatic breast cancer will receive a diagnosis within 90 days, based on patient and environmental data. This can help identify relationships between demographics or environmental hazards and the likelihood of getting timely treatment. Please note that this blog is based on a subset of the data, so there may be slight differences between this dataset and the one provided by WiDS.
MathWorks is happy to support participants of the Women in Data Science Datathon 2024 by providing complimentary MATLAB licenses, tutorials, workshops, and additional resources. To request complimentary licenses for you and your teammates, go to this MathWorks site, click the “Request Software” button, and fill out the software request form.
This tutorial will walk through the following steps of the model-making process:
- Importing a Tabular Dataset
- Preprocessing the Data
- Exploring and Analyzing Tabular Data
- Choosing and Creating Features
- Training a Machine Learning Model
- Evaluating a Machine Learning Model
- Making New Predictions and Exporting Submissions
Import Data
First, make sure the ‘Current Folder’ is the folder where you saved the data. If you have not already done so, you can download the data from Kaggle after you register for the datathon. The data is provided as a .CSV file, so we can use the readtable function to import the whole file as a table.
dataFolder = fullfile(pwd);
trainDataFilename = 'Training.csv';
allTrainData = readtable(fullfile(dataFolder, trainDataFilename))
I want to see some high-level statistics about the data, so I’ll use the summary function to get an idea of what kind of information we have.
summary(allTrainData)
Take some time to scroll through this summary and see what information or patterns you can learn! Here are some things I notice:
- There are a lot of variables that just say “cell array of character vectors”, which doesn’t tell us much about the data.
- There are a few variables that have a high ‘NumMissing’ value.
- The numeric variables can have dramatically different minimums and maximums.
We can use these observations to make decisions about how we want to explore and preprocess the dataset.
Process and Clean the Data
1. Convert text data to categorical
Text data can be hard for machine learning algorithms to understand, so let’s go through and change each “cell array of character vectors” to a categorical. This will help the algorithm sort the text into different categories instead of understanding it as a series of individual letters.
varTypes = varfun(@class, allTrainData, OutputFormat="cell");
catIdx = strcmp(varTypes, "cell");
varNames = allTrainData.Properties.VariableNames;
catVarNames = varNames(catIdx);
for catNameIdx = 1:length(catVarNames)
allTrainData.(catVarNames{catNameIdx}) = categorical(allTrainData.(catVarNames{catNameIdx}));
end
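As a side note, the same conversion can be done in one line with the convertvars function, which accepts the logical index we already built; this is simply an equivalent alternative to the loop above.
% Equivalent one-line conversion (alternative to the loop above)
allTrainData = convertvars(allTrainData, catIdx, "categorical");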
2. Handle Missing Data
Now I want to handle all that missing data I noticed earlier. I’ll go through each variable and specifically look at variables that are missing data for over half of the rows or observations.
dataSum = summary(allTrainData);
for nameIdx = 1:length(varNames)
varName = varNames{nameIdx};
varNumMissing = dataSum.(varName).NumMissing;
if varNumMissing > (height(allTrainData) / 2)
disp(varName);
disp(varNumMissing);
end
end
Let’s remove those variables entirely, since they might not be too helpful for our algorithm.
allTrainData = removevars(allTrainData, ["bmi", "metastatic_first_novel_treatment", "metastatic_first_novel_treatment_type"])
Now I want to look at each row and remove any that are missing too many values. It’s okay to have a couple of missing data points in your dataset, but if you have too many it could cause your machine learning algorithm to be less accurate. I’ll use the Clean Missing Data live task to remove any rows that are missing 2 or more data points.
% Remove missing data
[fullData,missingIndices] = rmmissing(allTrainData,"MinNumMissing",2);
% Display results
figure
% Get locations of missing data
indicesForPlot = ismissing(allTrainData.patient_age);
mask = missingIndices & ~indicesForPlot;
% Plot cleaned data
plot(find(~missingIndices),fullData.patient_age,"SeriesIndex",1,"LineWidth",1.5, ...
    "DisplayName","Cleaned data")
hold on
% Plot data in rows where other variables contain missing entries
plot(find(mask),allTrainData.patient_age(mask),"x","SeriesIndex","none", ...
    "DisplayName","Removed by other variables")
% Plot removed missing entries
x = repelem(find(indicesForPlot),3);
y = repmat([ylim(gca) missing]',nnz(indicesForPlot),1);
plot(x,y,"Color",[145 145 145]/255,"DisplayName","Removed missing entries")
title("Number of removed missing entries: " + nnz(indicesForPlot))
hold off
legend
ylabel("patient_age","Interpreter","none")
clear indicesForPlot mask x y
Explore the Data
Now that the data is cleaned up, spend some time exploring it: look at how different variables may interact with each other, see whether you can draw any meaningful conclusions from the data, and figure out which variables may be more or less important when it comes to predicting time to diagnosis.
Univariate Analysis
First, I want to separate the data into two datasets: one full of patients who were diagnosed in 90 days or less (the 1 or “True” values), and one full of patients who were not (the 0 or “False” values). This will allow me to explore the data patterns in each of these datasets and look for any meaningful differences.
allTrueIdx = fullData.DiagPeriodL90D == 1;
allTrueData = fullData(allTrueIdx, :);
allFalseIdx = fullData.DiagPeriodL90D == 0;
allFalseData = fullData(allFalseIdx, :);
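Before plotting, it can also help to quantify how unbalanced the two classes are. A quick optional check (not part of the original workflow) is to count the rows in each class:
% Count how many patients fall into each DiagPeriodL90D class
groupcounts(fullData, "DiagPeriodL90D")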
Now we can use the Create Plot live task to plot histograms of the different variables in each dataset. In the plot below, blue bars represent data from the folks who were diagnosed in a timely manner, and the red bars represent data from the folks who were not.
figure
% Create histogram of selected data
histogram(allTrueData.health_uninsured,"NumBins",40,"DisplayName","health_uninsured");
hold on
% Create histogram of selected data
histogram(allFalseData.health_uninsured,"NumBins",40,"DisplayName","health_uninsured");
hold off
legend
Take some time to explore these visualizations on your own, as I can only show one at a time in this blog. It is worth noting that we have fewer False observations than True observations, so the red bars will almost always be lower than the blue bars. If there are red bars that are higher, or if the shapes of the distributions are different, that may indicate a relationship between a variable and time to diagnosis.
I didn’t see many significant differences in shape, though I did notice that for the ‘health_uninsured’ histograms the red bars are relatively tall at the higher end of the range, indicating that there may be a correlation between populations with high rates of being uninsured and time to diagnosis.
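Because the two groups have different numbers of rows, comparing raw counts can be misleading. If you want to compare the shapes of the distributions directly, one option (a small variation on the plot above, not part of the original live task output) is to normalize each histogram so the bars show proportions instead of counts:
figure
% Normalized histograms put both groups on the same scale
histogram(allTrueData.health_uninsured,"NumBins",40,"Normalization","probability", ...
    "DisplayName","Diagnosed within 90 days");
hold on
histogram(allFalseData.health_uninsured,"NumBins",40,"Normalization","probability", ...
    "DisplayName","Not diagnosed within 90 days");
hold off
legend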
Bivariate and Multivariate Analysis
You can break the data down further and plot two (or more!) variables against each other to see if you can find any patterns. In the plot below, for example, we can see the percentage of the population that is uninsured and the state the patient is in, broken down by whether or not the patient was diagnosed within 90 days. Again, blue values indicate that the patient was, and red values indicate that the patient was not.
figure
% Create scatter of selected data
scatter(allTrueData,"patient_state","health_uninsured","DisplayName","health_uninsured");
hold on
% Create scatter of selected data
scatter(allFalseData,"patient_state","health_uninsured","DisplayName","health_uninsured");
hold off
legend
We can see that in some states, such as GA, OK, or TX, the red values come from populations that typically have higher rates of being uninsured. This could indicate that in some states, coming from a zip code with a high population of uninsured folks (or being uninsured yourself) means you are more likely to experience delays in your diagnosis.
Statistical Analysis
You can also make meaningful deductions by calculating various statistics from your data. For example, I want to calculate the skewness, or level of asymmetry, of each of my variables. A negative value indicates the data is left skewed when plotted, a positive value indicates the data is right skewed when plotted, and a value of 0 means the data is symmetric.
statsTrue = varfun(@skewness, allTrueData, "InputVariables", @isnumeric);
statsFalse = varfun(@skewness, allFalseData, "InputVariables", @isnumeric);
Now I want to see if any of the variables have a significant difference in their skewness, as differences in the data distributions between patients who were diagnosed in a timely manner vs patients who were not could indicate an underlying relationship between those variables and time to diagnosis.
statsDiffs = abs(statsTrue{:, :} - statsFalse{:, :});
statsTrue.Properties.VariableNames(statsDiffs > 0.2)
If we investigate the four variables that are returned, we can see that population density, the percentage of folks above 80 in your zip code, the median rent burden of your zip code, and the percentage of residents who reported their race as American Indian or Alaska Native in your zip code may have a relationship with time to diagnosis.
Feature Engineering
When it comes to machine learning, you don’t have to use all of the data as it is presented to you. Feature Engineering is the process of deciding what data you want to use, creating new data based on the provided data, and transforming the data to be in whatever format or range is suitable for your workflow. You can do this manually, and some of the exploration we just did should influence decisions you make if you want to play around with including or excluding different variables.
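As a small illustration of what manual feature engineering can look like, the sketch below standardizes one numeric variable and adds a simple derived flag. The new variable names and the 10% threshold are made up for this example and are not used anywhere else in this blog.
% A minimal sketch of manual feature engineering (illustration only)
manualData = fullData;
% Standardize a numeric variable so it has zero mean and unit variance
manualData.patient_age_z = normalize(manualData.patient_age);
% Flag zip codes with a high uninsured rate (threshold chosen arbitrarily)
manualData.high_uninsured = manualData.health_uninsured > 10;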
For this blog, I’ll use the gencfeatures function to automate this process. I want to use 90 features, which is 10 more than we currently have in our dataset; gencfeatures will generate a set of 90 meaningful features based on our processed data. It may keep some data as-is, but will often standardize numeric variables and create new variables by manipulating the provided data.
[T, augTrainData] = gencfeatures(fullData, "DiagPeriodL90D", 90)
To better understand the generated features, you can use the describe function of the returned FeatureTransformer object, ‘T’.
describe(T)
Split the Data
The last step before you can train a machine learning model is to split your data into a training and testing set. We’ll use the training data to fit the model, and the testing set to evaluate how well the model performs on new data before we use it to make a submission. Here I split the data into 80% training and 20% testing.
numRows = height(augTrainData);
[trainInd, ~, testInd] = dividerand(numRows, .8, 0, .2);
trainingData = augTrainData(trainInd, :);
testingData = augTrainData(testInd, :);
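Note that dividerand splits the rows purely at random, so the class balance of ‘DiagPeriodL90D’ in the two sets can drift a little. If you would rather keep the proportions of 0s and 1s the same in both sets, a stratified split with cvpartition is one alternative (shown only as a sketch; the rest of this blog uses the dividerand split above):
% Stratified 80/20 split that preserves the class balance of DiagPeriodL90D
c = cvpartition(augTrainData.DiagPeriodL90D, "HoldOut", 0.2);
trainingDataStrat = augTrainData(training(c), :);
testingDataStrat = augTrainData(test(c), :);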
Train a Machine Learning Model
In this example, I’ll create a binary decision tree using the fitctree function and set ‘Optimize Hyperparameters’ to ‘auto’, which will attempt to minimize the error of our algorithm by choosing the best value for the ‘MinLeafSize’ parameter. It visualizes the results of adjusting this value, as can be seen below.
classificationTree = fitctree(trainingData, "DiagPeriodL90D", ...
    OptimizeHyperparameters='auto');
I used a binary tree as my starting point, but it’s important to test out different types of algorithms to see what works best with your data! Check out the Classification Learner app documentation and this short video to learn how to train several machine learning models quickly and iteratively!
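As one example of trying a different algorithm, the sketch below trains a bagged ensemble of decision trees on the same training data with fitcensemble; it is only meant as a starting point for comparison and is not the model used in the rest of this blog.
% Try a bagged ensemble of decision trees as an alternative model
ensembleModel = fitcensemble(trainingData, "DiagPeriodL90D", Method="Bag");
% Estimate its accuracy with 5-fold cross-validation for comparison
ensembleAccuracy = 1 - kfoldLoss(crossval(ensembleModel, KFold=5))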
Test Your Model
There are many ways to evaluate the performance of a machine learning model, so in this blog I’ll show how to do so by computing validation accuracy and using testing data.
Validation Accuracy
Cross-validation is one method of evaluating a model, and at a high level is done by:
- Setting aside a subset of the training data, known as validation data
- Using the rest of the training data to fit the model
- Testing how well the model performs on the validation data
You can use the crossval function to do this:
% Perform cross-validation
partitionedModel = crossval(classificationTree, 'KFold', 5);
Then, extract the misclassification rate, and subtract it from 1 to get the model’s accuracy. The closer to 1 this value is, the more accurate our model is.
% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, LossFun='classiferror')
Testing Data
In this section, we’ll use the ‘testingData’ dataset we created earlier. Similar to what we did with the validation data, we can use the loss function to compute the misclassification rate when the classification tree is applied to the testing data, and subtract it from 1 to get a measure of accuracy.
testAccuracy = 1 - loss(classificationTree, testingData, "DiagPeriodL90D", ...
    LossFun='classiferror')
I also want to compare the predictions that the model makes to the actual outputs, so let’s remove the ‘DiagPeriodL90D’ variable from our testing data:
testActual = testingData.DiagPeriodL90D;
testingData = removevars(testingData, "DiagPeriodL90D");
Now, use the model to make predictions on the testing set:
[testPreds, scores, ~, ~] = predict(classificationTree, testingData);
And use the confusionchart function to compare the predicted outputs to the actual outputs, to see how often they match or don’t.
confusionchart(testActual, testPreds)
This shows that the model almost always predicts the 1s correctly (patients who were diagnosed within 90 days), but it only has roughly a 50/50 chance of predicting the 0s correctly.
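To put rough numbers on this, you can compute per-class recall directly from the confusion matrix. The short sketch below assumes the classes are ordered 0 then 1, which is how confusionmat sorts them here:
% Rows of the confusion matrix are the true classes (0, then 1)
cm = confusionmat(testActual, testPreds);
recallClass0 = cm(1,1) / sum(cm(1,:))  % fraction of true 0s predicted as 0
recallClass1 = cm(2,2) / sum(cm(2,:))  % fraction of true 1s predicted as 1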
We can also use the test data and predictions to visualize receiver operating characteristic (ROC) metrics. The ROC curve shows the true positive rate (TPR) versus the false positive rate (FPR) for different thresholds of classification scores. The “Model Operating Point” shows the false positive rate and true positive rate of the model.
rocObj = rocmetrics(testActual, scores, classificationTree.ClassNames);
plot(rocObj)
Here we can see that the classifier correctly assigns about 90-95% of the 1 class observations to 1 (TPR), but incorrectly assigns about 40% of the 0 class observations as 1 (FPR). This is similar to what we observed with the confusion chart.
You can also extract the area under the curve (AUC) value, which is a measure of the overall quality of the classifier. The AUC values are in the range 0 to 1, and larger AUC values indicate better classifier performance.
rocObj.AUC
The AUC is pretty high, but shows that there is definitely room for improvement. To learn more about ROC metrics, check out this documentation page that explains it in more detail.
Create Submission
Once you have a model that performs well on the validation and testing data, it’s time to create a submission for the datathon! As a reminder, you will upload this file to Kaggle to be scored on the leaderboard.
First, import the ‘Test’ dataset:
testDataFilename = 'Test.csv';
allTestData = readtable(fullfile(dataFolder, testDataFilename))
Then we need to process this dataset in the same way that we did the training data. In this section, I use code instead of the live tasks for simplicity.
% replace cell arrays with categoricals
varTypes = varfun(@class, allTestData, OutputFormat="cell");
catIdx = strcmp(varTypes, "cell");
varNames = allTestData.Properties.VariableNames;
catVarNames = varNames(catIdx);
for catNameIdx = 1:length(catVarNames)
allTestData.(catVarNames{catNameIdx}) = categorical(allTestData.(catVarNames{catNameIdx}));
end
% remove variables with too many missing data points
fullTestData = removevars(allTestData, ["bmi", "metastatic_first_novel_treatment", "metastatic_first_novel_treatment_type"]);
We also need to use the transform function to apply the same feature transformations that gencfeatures generated for the training data.
augTestData = transform(T, fullTestData);
Now that the data is in the format our machine learning model expects it to be in, use the predict function to make predictions, and create a table to contain the patient IDs and corresponding predictions.
submissionPreds = predict(classificationTree, augTestData);
submissionTable = table(fullTestData.patient_id, submissionPreds, VariableNames=["patient_id", "DiagPeriodL90D"])
Last, export your predictions to a .CSV file, then upload to Kaggle for scoring.
writetable(submissionTable, "Predictions.csv");
And that’s it! Thank you for following along with this tutorial, and best of luck to all participants. If you have any questions about this tutorial or MATLAB, reach out to us at studentcompetitions@mathworks.com or by tagging gracewoolson in the forum. Keep your eye out for our upcoming WiDS Workshop on January 31st, where we will walk through this tutorial and answer any questions you have along the way!