MATLAB Benchmark Code for WiDS Datathon 2020

Introduction

Hello all, I am Neha Goel, Technical Lead for AI/Data Science competitions on the MathWorks Student Competition team. MathWorks is excited to support WiDS Datathon 2020 by providing complimentary MATLAB Licenses, tutorials, and getting started resources to each participant.

To request your complimentary license, go to the MathWorks site, click the “Request Software” button, and fill out the software request form. You will get your license within 72 business hours.

The WiDS Datathon 2020 focuses on patient health through data from MIT’s GOSSIS (Global Open Source Severity of Illness Score) initiative. The Datathon is brought to you by the Global WiDS team, the West Big Data Innovation Hub, and the WiDS Datathon Committee, and it is open until February 24, 2020.

The Datathon task is to train a model that takes as input the patient record data and outputs a prediction of how likely it is that the patient survives. In this blog post I will walk through basic starter code in MATLAB. Additional resources for other training methods are linked at the bottom of the blog post.

Load and Prepare Data

Register for the competition and download the data files from Kaggle. “training.csv” is the training data file and “unlabeled.csv” is the test data.

Tip: Save the CSV files as .xlsx to avoid blank rows at the end of the file.

Step 1: Load Data

Once you download the files, make sure they are on the MATLAB path. Here I use the readtable function to read each file and store it as a table. The TreatAsEmpty option specifies placeholder text to treat as an empty value in numeric columns, so table elements that contain the characters ‘NA’ are set to NaN on import. You can also import data using the MATLAB Import Tool.

TrainSet = readtable('training.xlsx','TreatAsEmpty','NA');
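To see what TreatAsEmpty does without the full dataset, here is a minimal sketch that writes a tiny CSV file and reads it back. The temporary file and the three column values are made up for illustration; only readtable and the TreatAsEmpty option come from the step above.

```matlab
% Sketch: write a tiny CSV with 'NA' entries, then read it back.
% The file name and the values are hypothetical, for illustration only.
csvFile = [tempname '.csv'];
fid = fopen(csvFile,'w');
fprintf(fid,'encounter_id,age,bmi\n');
fprintf(fid,'1,68,22.7\n');
fprintf(fid,'2,NA,27.4\n');
fprintf(fid,'3,25,NA\n');
fclose(fid);

% 'NA' in numeric columns is imported as NaN
T = readtable(csvFile,'TreatAsEmpty','NA');
disp(T)

% Clean up the temporary file
delete(csvFile);
```

The age and bmi columns should stay numeric, with NaN wherever the file contained NA.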

Step 2: Clean Data

The biggest challenge with this dataset is that the data is messy: 186 predictor columns and 91713 observations, with a lot of missing values. Data transformation and modelling will be the key areas to work on to avoid overfitting.

Using the summary function, I analyzed the types of the predictors, the min, max, median values and number of missing values for each predictor column. This helped me derive relevant assumptions to clean the data.

summary(TrainSet);
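If you only want the number of missing values per column, one possible shortcut is to combine ismissing and sum. The sketch below uses a small made-up table; with the real data you would replace T with TrainSet.

```matlab
% Sketch: count missing values per column of a table.
% T is a toy table with hypothetical variable names.
T = table([1;NaN;3;NaN],[10;20;NaN;40],{'a';'b';'';'d'}, ...
    'VariableNames',{'x','y','label'});

% ismissing flags NaN in numeric columns and '' in text columns
numMissing = sum(ismissing(T),1);

% Collect the counts in a small table for easier reading
missingTbl = table(T.Properties.VariableNames',numMissing', ...
    'VariableNames',{'Variable','NumMissing'});
disp(missingTbl)
```

Sorting such a table by NumMissing could help decide which predictors to drop first.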

[Figure: output of summary(TrainSet)]

There are many different approaches to work with the missing values and predictor selection. We will go through one of the approaches in this blog. You can also refer to this document to learn about other methods: Clean Messy and Missing Data.

Note: The data cleaning approach demonstrated here was chosen arbitrarily to cut down the number of predictor columns.

Remove the character columns of the table

The reason is that the algorithm I chose to train the model, fitclinear, accepts only a numeric matrix as predictor input.

TrainSet = removevars(TrainSet, ...
    {'ethnicity','gender','hospital_admit_source','icu_admit_source',...
    'icu_stay_type','icu_type','apache_3j_bodysystem','apache_2_bodysystem'});
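As an alternative to listing the character columns by name, you could detect the numeric columns programmatically. This is only a sketch on a made-up table, not the approach used in this post; varfun with 'OutputFormat','uniform' returns one logical value per column.

```matlab
% Sketch: keep only the numeric columns of a table.
% T and its variable names are hypothetical.
T = table([1;2;3],{'M';'F';'M'},[0.1;0.2;0.3], ...
    'VariableNames',{'age','gender','bmi'});

% One logical per column: true where the column is numeric
isNum = varfun(@isnumeric,T,'OutputFormat','uniform');

% Index the table with the logical mask to drop non-numeric columns
Tnum = T(:,isNum);
disp(Tnum.Properties.VariableNames)
```

This avoids hard-coding column names, at the cost of silently dropping any text column you might have wanted to encode instead.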

Remove minimum values from all the vitals predictors

After analyzing the WiDS Datathon 2020 dictionary.csv file provided with the Kaggle data, I noticed that the even columns from column 42 to 168 correspond to minimum values of predictors in the vital category.

TrainSet = removevars(TrainSet, [42:2:168]);
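Removing columns by position is sensitive to any change in column order. A possible alternative, assuming the minimum-value columns share a suffix such as _min (an assumption you should check against the data dictionary), is to select them by name. The table below is a toy example with hypothetical column names.

```matlab
% Sketch: remove columns whose names end in '_min'.
% The '_min' suffix and all variable names here are assumptions.
T = table([1;2],[60;70],[90;95],[50;55],[80;85], ...
    'VariableNames',{'encounter_id','d1_sysbp_min','d1_sysbp_max', ...
                     'd1_spo2_min','d1_spo2_max'});

% Logical mask over the variable names
isMin = endsWith(T.Properties.VariableNames,'_min');

% Remove the matching columns by name
T = removevars(T,T.Properties.VariableNames(isMin));
disp(T.Properties.VariableNames)
```

Note that a name-based filter would also catch matching columns outside the 42 to 168 range, so the result may differ from the index-based step.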

Remove the observations which have 30 or more missing predictors

The other assumption I made is that observations (patients) with 30 or more missing predictor values can be removed.

TrainSet = rmmissing(TrainSet,1,'MinNumMissing',30);
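To illustrate how the MinNumMissing threshold behaves, here is a small sketch on a toy table with a threshold of 2 instead of 30. The data and variable names are made up.

```matlab
% Sketch: drop rows that have at least 2 missing values.
% Toy data; the threshold 2 stands in for the 30 used above.
A = [1 NaN NaN;    % 2 missing -> removed
     4 5   NaN;    % 1 missing -> kept
     7 8   9;      % 0 missing -> kept
     NaN NaN NaN]; % 3 missing -> removed
T = array2table(A,'VariableNames',{'a','b','c'});

% Second argument 1 means "operate on rows"
Tclean = rmmissing(T,1,'MinNumMissing',2);
disp(Tclean)
```

Rows with fewer missing values than the threshold are kept, so they still contain NaN values that need to be filled in a later step.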

Fill the missing values

The next step is to fill in all the NaN values. One approach is to use the fillmissing function to fill data using linear interpolation. Other approaches include replacing NaN values with mean or median values and removing the outliers using the Curve Fitting app.

TrainSet = fillmissing(TrainSet,'linear');
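For comparison, here is a sketch of the median alternative mentioned above, next to linear interpolation, on a small numeric matrix. The data is made up; on the real data you would apply the same calls to the numeric predictor columns.

```matlab
% Sketch: linear interpolation versus per-column median fill.
% A is toy data with one NaN in each column.
A = [1   10;
     NaN 20;
     3   NaN;
     5   60];

% Linear interpolation between neighbouring rows
linFill = fillmissing(A,'linear');

% Replace each NaN with the median of its column (NaN values ignored)
colMedian = median(A,1,'omitnan');
medFill = fillmissing(A,'constant',colMedian);

disp(linFill)
disp(medFill)
```

Because each row is a different patient, a per-column median does not depend on row order, whereas linear interpolation does. Which one works better for this dataset is something to test rather than assume.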

In this step I move the label hospital_death to the last column of the table because some algorithms in MATLAB, and the Classification Learner app, use the last column as the default response variable.

TrainSet = movevars(TrainSet,'hospital_death','After',114);

Step 3: Create Training Data

Once I have the cleaned training data, I separate the label hospital_death from the training set and create two variables: XTrain (predictor data) and YTrain (class labels).

XTrain = removevars(TrainSet,{'hospital_death'});
YTrain = TrainSet.hospital_death;

Step 4: Create Test Data

Download the unlabeled.csv file from Kaggle. Read the file using the readtable function to store it as a table.

XTest =  readtable('unlabeled.xlsx','TreatAsEmpty','NA');

I used a similar approach to clean the test data as for the training data above. XTest is the test data, with no label column.

Remove the character columns of the table

Because the unlabeled.csv file contains the hospital_death column with NA values, I removed it along with the character-type columns.

XTest = removevars(XTest, ...
    {'hospital_death','ethnicity','gender','hospital_admit_source',...
    'icu_admit_source','icu_stay_type','icu_type','apache_3j_bodysystem','apache_2_bodysystem'});

Remove minimum values from all the vitals predictors

After removing the hospital_death column, the minimum values of the vital category are now offset so they correspond to the odd columns from column 41 to 167.

XTest = removevars(XTest, [41:2:167]);

Fill the missing values

XTest = fillmissing(XTest,'linear');
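Since the training and test tables are cleaned separately, it may be worth checking that they end up with the same predictor columns in the same order. The sketch below shows one way to do that on two toy tables with hypothetical names; with the real data you would compare XTrain and XTest.

```matlab
% Sketch: verify that two tables share the same variables, in order.
% XTrainToy and XTestToy are stand-ins for XTrain and XTest.
XTrainToy = table([1;2],[3;4],'VariableNames',{'age','bmi'});
XTestToy  = table([5;6],[7;8],'VariableNames',{'age','bmi'});

trainVars = XTrainToy.Properties.VariableNames;
testVars  = XTestToy.Properties.VariableNames;

% True only if names and order match exactly
sameVars = isequal(trainVars,testVars);

% Names present in one table but not the other
onlyInTrain = setdiff(trainVars,testVars);
onlyInTest  = setdiff(testVars,trainVars);

fprintf('Same variables: %d\n',sameVars);
```

A mismatch here would otherwise only surface later, when the converted matrices have different column counts or columns that do not line up.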

Train a Model

In MATLAB you can train a model using two different methods.

  1. Using custom MATLAB machine learning algorithm functions
  2. Training the model using the Classification Learner app

Here I walk through the steps for both methods. I would encourage you to try both approaches and to train the model using different algorithms and parameters. It will help with optimization and with comparing the scores of different models.

Step 5 – Option 1: Using custom algorithms

A binary classification problem can be approached using various algorithms such as decision trees, SVM, and logistic regression. Here I train a model using fitclinear, which trains linear binary classification models on high-dimensional predictor data.

Convert the tables to numeric matrices because the fitclinear function takes only a numeric matrix as predictor input.

XTrainMat = table2array(XTrain);
XTestMat = table2array(XTest);

Fit the model

The name-value pair arguments of the function give you options for tuning the model. Here I set the solver to sparsa (Sparse Reconstruction by Separable Approximation), which uses lasso regularization by default. To optimize the model, I do some hyperparameter optimization.

Setting ‘OptimizeHyperparameters’ to ‘auto’ optimizes over {Lambda, Learner}. The ‘AcquisitionFunctionName’ option selects the acquisition function; functions whose names include “plus”, such as ‘expected-improvement-plus’ used here, modify their behavior when they are overexploiting an area.

You can also cross-validate the model through input arguments, using the cross-validation options CrossVal, KFold, CVPartition, etc. Check out the fitclinear documentation to learn about the input arguments.

 Mdl = fitclinear(XTrainMat,YTrain,'ObservationsIn','rows','Solver','sparsa',...
'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',...
       struct('AcquisitionFunctionName','expected-improvement-plus'))
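The cross-validation options mentioned above can be sketched on synthetic data as follows. This is an illustration, not the benchmark model: the data is random, and the choice of a logistic learner with 5 folds is arbitrary. As far as I know, the cross-validation arguments cannot be combined with 'OptimizeHyperparameters' in the same call, so this is a separate call.

```matlab
% Sketch: 5-fold cross-validation with fitclinear on synthetic data.
% X and Y are randomly generated stand-ins for XTrainMat and YTrain.
rng(0);                                   % for reproducibility
X = randn(300,5);
Y = double(X(:,1) + 0.5*X(:,2) + 0.3*randn(300,1) > 0);

% 'KFold' returns a cross-validated (partitioned) model
CVMdl = fitclinear(X,Y,'Learner','logistic','KFold',5);

% Average misclassification rate over the held-out folds
cvErr = kfoldLoss(CVMdl);
fprintf('Cross-validated classification error: %.3f\n',cvErr);
```

The cross-validated error gives a rough estimate of how the model would perform on unseen data, which is useful when comparing learners or regularization settings.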

[Figure: optimization plot and model output]

Predict on the Test Set

Once your model is ready, you can make predictions on the test set using the predict function. It takes as input the fitted model and test data with the same predictors as the training data. The output is the predicted labels and scores.

[labelOpt1,scoresOpt1] = predict(Mdl,XTestMat);

Step 5 – Option 2: Using Classification Learner App

The second method of training the model is to use the Classification Learner app. It lets you interactively train, validate, and tune classification models. Let’s see the steps to work with it.

  • On the Apps tab, in the Machine Learning group, click Classification Learner.
  • Click New Session and select data (TrainSet) from the workspace. Specify the response variable (hospital_death).
  • Select the validation method to avoid overfitting. You can choose either holdout validation or cross-validation, selecting the number of folds.
  • On the Classification Learner tab, in the Model Type section, select the algorithm to be trained, e.g. logistic regression, All SVMs, or All Quick-To-Train.
  • You can also try transforming features by enabling PCA to reduce dimensionality.
  • The model can be improved further by changing parameter settings in the Advanced dialog box.
  • Once all required options are selected, click
  • The history window on the left displays the different models trained and their accuracy.
  • Performance of the model on the validation data can be evaluated with the Confusion Matrix and ROC Curve plots.
  • To make predictions on the test set, I export the model by selecting Export Model on the Classification Learner tab.

[Figure: MATLAB code]

Predict on the Test Set

The exported model is saved as trainedModel in the workspace. You can then predict labels and scores using predictFcn.

The first output contains the predicted labels for the test set. The second output contains the scores of each observation for both the negative and the positive class.

[labelOpt2,scoresOpt2] = trainedModel.predictFcn(XTest)

[Figure: predicted labels and scores]

Evaluate and Submit

After a classification algorithm has trained on data, we examine the performance of the algorithm on our test dataset. To inspect the classifier performance more closely I plotted a Receiver Operating Characteristic (ROC) curve. By definition, a ROC curve shows true positive rate versus false positive rate for different thresholds of the classifier output.

The AUC (Area Under Curve) is the area enclosed by the ROC curve. A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5. Usually, your model will score somewhere in between. The range of possible AUC values is [0, 1].
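One way to build intuition for the AUC is its pairwise interpretation: it equals the fraction of (positive, negative) pairs in which the positive observation receives the higher score, with ties counted as one half. The sketch below computes this by hand on four made-up scores; it is an illustration of the definition, not a replacement for perfcurve.

```matlab
% Sketch: AUC as the fraction of correctly ordered pairs.
% scores and labels are toy values.
scores = [0.9; 0.7; 0.6; 0.2];
labels = [1;   0;   1;   0];

pos = scores(labels == 1);   % scores of the positive class
neg = scores(labels == 0);   % scores of the negative class

% All (positive, negative) score pairs
[P,N] = ndgrid(pos,neg);

% Correctly ordered pairs count 1, ties count 0.5
auc = mean(P(:) > N(:)) + 0.5*mean(P(:) == N(:));
fprintf('AUC = %.2f\n',auc);   % three of the four pairs are ordered correctly
```

Here the pairs (0.9, 0.7), (0.9, 0.2), and (0.6, 0.2) are ordered correctly and (0.6, 0.7) is not, which gives an AUC of 0.75.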

A confusion matrix plot is used to understand how the currently selected classifier performed in each class. To view the confusion matrix after training a model, you can use the MATLAB plotconfusion function.
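If you prefer the raw counts to a plot, the confusionmat function returns the confusion matrix as a numeric array. The sketch below uses made-up true and predicted labels; it is an alternative to plotconfusion, not the method used in this post.

```matlab
% Sketch: confusion matrix counts from true and predicted labels.
% yTrue and yPred are toy label vectors.
yTrue = [0; 0; 1; 1; 1; 0];
yPred = [0; 1; 1; 1; 0; 0];

% Rows are true classes, columns are predicted classes (order: 0, 1)
C = confusionmat(yTrue,yPred);
disp(C)

% Overall accuracy is the share of counts on the diagonal
accuracy = sum(diag(C))/sum(C(:));
fprintf('Accuracy = %.3f\n',accuracy);
```

In this toy case two observations of each class are classified correctly and one of each is misclassified.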

Step 6: Evaluate Model

To evaluate the model, MATLAB has the perfcurve function. It calculates the false positive rate, true positive rate, thresholds, and AUC score. The input arguments to the function include the test labels, the scores, and the positive class label. Because the unlabeled test file has no true labels, for self-evaluation you can create labeled test data (YTest) by holding out a subset of the labeled training data and scoring that subset with the trained model. I used the scores generated by option 2 above, which correspond to the trainedModel created by the Classification Learner app.

[fpr,tpr,thr,auc] = perfcurve(YTest,scoresOpt2(:,2),1);
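Here is a minimal end-to-end sketch of a holdout evaluation on synthetic data: split the labeled data, train on one part, score the other, and pass the held-out labels to perfcurve. The data, the 25% holdout fraction, and the logistic learner are all assumptions made for illustration.

```matlab
% Sketch: holdout evaluation with cvpartition, fitclinear, and perfcurve.
% X and Y are synthetic stand-ins for the cleaned training data.
rng(1);                                   % for reproducibility
X = randn(400,5);
Y = double(X(:,1) - X(:,2) + 0.5*randn(400,1) > 0);

% Stratified split: 75% for training, 25% held out
c = cvpartition(Y,'HoldOut',0.25);
idxTrain = training(c);
idxTest  = test(c);

% Train on the training part only
Mdl = fitclinear(X(idxTrain,:),Y(idxTrain),'Learner','logistic');

% Score the held-out part; column 2 corresponds to class 1
[~,scores] = predict(Mdl,X(idxTest,:));

% Numeric labels, so the positive class is given as the number 1
[fpr,tpr,thr,auc] = perfcurve(Y(idxTest),scores(:,2),1);
fprintf('Holdout AUC = %.3f\n',auc);

% plot(fpr,tpr) would draw the ROC curve
```

The same pattern applies to a model exported from the Classification Learner app, with trainedModel.predictFcn in place of predict.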

I get an AUC of 0.85 and the below ROC Curve.

Note: The AUC calculated by this function might differ from the AUC calculated on the Kaggle leaderboard.

[Figure: ROC curve]

Step 7: Kaggle submission

Create a table of the results based on the IDs and prediction scores. The desired file format for submission is:

encounter_id, hospital_death

You can place all the test results in a MATLAB table, which makes it easy to visualize them and to write them to the desired file format. I stored the scores for the positive class (the second column of the scores matrix).

testResults = table(XTest.encounter_id,scoresOpt2(:,2), ...
    'VariableNames',{'encounter_id','hospital_death'});

Write the results to a CSV file. This is the file you will submit for the challenge.

writetable(testResults,'testResults.csv');
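Before uploading, it can help to run a quick sanity check on the file. The sketch below builds a toy submission table with made-up IDs and scores, writes it to a temporary file, and reads it back to confirm the header and the absence of missing values. With the real data you would run the same checks on testResults.

```matlab
% Sketch: round-trip check of a submission file.
% ids and probs are toy values; the temporary file name is arbitrary.
ids   = [101; 102; 103];
probs = [0.12; 0.80; 0.05];
sub = table(ids,probs,'VariableNames',{'encounter_id','hospital_death'});

outFile = [tempname '.csv'];
writetable(sub,outFile);

% Read the file back and inspect it
check = readtable(outFile);
delete(outFile);

hasExpectedHeader = isequal(check.Properties.VariableNames, ...
    {'encounter_id','hospital_death'});
hasMissing = any(any(ismissing(check)));
inRange = all(check.hospital_death >= 0 & check.hospital_death <= 1);

fprintf('Header OK: %d, missing values: %d, scores in [0,1]: %d\n', ...
    hasExpectedHeader,hasMissing,inRange);
```

Whether scores must lie in [0, 1] depends on the competition rules and on the model that produced them, so treat the range check as optional.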

Thanks for following along with this code! We are excited to find out how you will modify this starter code and make it yours. I strongly recommend looking at our Resources section below for more ideas on how you can improve our benchmark model.

Feel free to reach out to us in the Kaggle forum or email us at studentcompetitions@mathworks.com if you have any further questions.

Resources

  1. Missing Data in MATLAB
  2. Supervised Learning Workflow and Algorithms
  3. Train Classification Models in Classification Learner App
  4. Export Classification Model to Predict New Data
  5. 8 MATLAB Cheat Sheets for Data Science

 
