{"id":4935,"date":"2021-01-08T16:41:14","date_gmt":"2021-01-08T15:41:14","guid":{"rendered":"https:\/\/blogs.mathworks.com\/student-lounge\/?p=4935"},"modified":"2021-04-06T21:38:39","modified_gmt":"2021-04-06T19:38:39","slug":"matlab-benchmark-code-for-wids-datathon-2021","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/student-lounge\/2021\/01\/08\/matlab-benchmark-code-for-wids-datathon-2021\/","title":{"rendered":"MATLAB Benchmark Code for WiDS Datathon 2021"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>Hello all, I am Neha Goel, Technical Lead for AI\/Data Science competitions on the MathWorks Student Competition team. MathWorks is excited to support WiDS Datathon 2021 by providing complimentary MATLAB licenses, tutorials, and getting-started resources to each participant.<\/p>\n<p><strong>To request your complimentary license, go to the <a href=\"https:\/\/www.mathworks.com\/academia\/student-competitions\/wids-datathon.html\">MathWorks site<\/a>, click the \u201cRequest Software\u201d button, and fill out the software request form.<\/strong> <em>You will get your license within 72 business hours.<\/em><\/p>\n<p>The WiDS Datathon 2021 focuses on patient health through data from MIT\u2019s GOSSIS (Global Open Source Severity of Illness Score) initiative. Brought to you by the Global WiDS team, the <a href=\"https:\/\/westbigdatahub.org\/\">West Big Data Innovation Hub<\/a>, and the WiDS Datathon Committee, the datathon is open until March 1st, 2021.\u00a0Learn more about the\u00a0<a href=\"https:\/\/www.widsconference.org\/datathon.html\">WiDS Datathon<\/a>, or\u00a0<a href=\"https:\/\/airtable.com\/shrLE1J7hVxuYAILv\">register to participate<\/a>\u00a0today.<\/p>\n<p>The Datathon task is to train a model that takes patient record data as input and outputs a prediction of how likely it is that the patient has been diagnosed with a certain type of diabetes, which could inform treatment in the ICU. 
In this blog post I will walk through a basic starter-code workflow in MATLAB. Additional resources for other training methods are linked at the bottom of the blog post.<\/p>\n<h1>Load &amp; Prepare Data<\/h1>\n<p>This dataset presents an opportunity to learn about the data modelling and processing challenges that a real-world data problem brings. In this blog, I will talk about some basic methods to handle data challenges. To learn more methods, you can go through this <a href=\"https:\/\/www.mathworks.com\/videos\/series\/data-science-tutorial.html\">Data Science Tutorial video series<\/a>.<\/p>\n<h2>Step 1: Load Data<\/h2>\n<p>Register for the competition and download the data files from Kaggle. &#8220;<em>TrainingWiDS2021.csv<\/em>&#8221; is the training data file and &#8220;<em>UnlabeledWiDS2021.csv<\/em>&#8221; is the test data.<br \/>\nOnce you download the files, make sure they are on the MATLAB path. Here I use the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html?\">readtable<\/a> function to read the files and store them as tables. <em>TreatAsEmpty<\/em> specifies the placeholder text for empty values in numeric columns of the file. Table elements corresponding to the characters <em>&#8216;NA&#8217;<\/em> will be set to <em>NaN<\/em> when imported. You can also import data using the MATLAB <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/import_export\/select-spreadsheet-data-interactively.html\">Import tool<\/a>.<\/p>\n<pre>TrainSet = readtable('TrainingWiDS2021.csv','TreatAsEmpty','NA');<\/pre>\n<h2>Step 2: Clean Data<\/h2>\n<p>The biggest challenge with this dataset is that the data is messy: 180 predictor columns and 130,157 observations, with a lot of missing values. 
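<\/p>\n<p>As a quick check (a sketch, assuming <em>TrainSet<\/em> has been loaded as above), you can count the missing values in each column with the <em>ismissing<\/em> function:<\/p>\n<pre>% Row vector: number of missing entries per table variable\r\nmissingCounts = sum(ismissing(TrainSet));<\/pre>\n<p>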
Data transformation and modelling will be the key areas to work on to avoid overfitting.<br \/>\nUsing the <em>summary<\/em> function, I analyzed each predictor column&#8217;s type, its minimum, maximum, and median values, and its number of missing values. This helped me derive relevant assumptions to clean the data.<\/p>\n<pre>summary(TrainSet);\r\n<\/pre>\n<p><img decoding=\"async\" loading=\"lazy\" width=\"579\" height=\"318\" class=\"size-full wp-image-4941 aligncenter\" src=\"https:\/\/blogs.mathworks.com\/racing-lounge\/files\/2021\/01\/summary_1.png\" alt=\"\" \/><\/p>\n<p>There are many different approaches to working with missing values and predictor selection. We will go through one of the basic approaches in this blog. You can also refer to this document to learn about other methods: <a href=\"https:\/\/www.mathworks.com\/help\/stats\/find-missing-data.html\">Clean Messy and Missing Data<\/a>.<\/p>\n<p><em>Note: The data-cleaning approach demonstrated here was chosen simply to cut down the number of predictor columns.<\/em><\/p>\n<p><strong>Remove the character columns of the table<\/strong><\/p>\n<p>The reason behind this is that the algorithm I chose to train the model is <em>fitclinear<\/em>, which accepts only a numeric matrix as input. 
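<\/p>\n<p>As an alternative sketch (assuming <em>TrainSet<\/em> from above), table subscripting with <em>vartype<\/em> keeps only the numeric variables without naming each column:<\/p>\n<pre>% Keep only the numeric table variables\r\nTrainSetNumeric = TrainSet(:, vartype('numeric'));<\/pre>\n<p>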
Hence I remove the variables of &#8220;categorical&#8221; or non-numeric type.<\/p>\n<pre>TrainSet = removevars(TrainSet, {'ethnicity','gender','hospital_admit_source','icu_admit_source',...\r\n'icu_stay_type','icu_type'});<\/pre>\n<p><strong>Remove minimum values from all the vitals predictors<\/strong><\/p>\n<p>After analyzing the <em>DataDictionaryWiDS2021.csv<\/em> file provided with the Kaggle data, I noticed that the even-numbered columns from 40 to 166 correspond to the minimum values of predictors in the <em>vital<\/em> category.<\/p>\n<pre>TrainSet = removevars(TrainSet,(40:2:166));<\/pre>\n<p><strong>Remove the observations which have 30 or more missing predictors<\/strong><\/p>\n<p>The other assumption I made is that observations (patients) with 30 or more missing predictor values can be removed.<\/p>\n<pre>TrainSet = rmmissing(TrainSet,1,'MinNumMissing',30);\r\n<\/pre>\n<p><strong>Fill the missing values<\/strong><\/p>\n<p>The next step is to fill in all the NaN values. One approach is to use the <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2019b\/matlab\/ref\/fillmissing.html#bvc6et8\">fillmissing<\/a> function to fill data using linear interpolation. 
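<\/p>\n<p>An alternative, sketched here under the assumption that <em>TrainSet<\/em> is the cleaned table from above, replaces the NaNs in each numeric column with that column&#8217;s mean:<\/p>\n<pre>% Fill NaNs in each numeric column with that column's mean\r\nnum = vartype('numeric');\r\ncolMeans = mean(TrainSet{:,num},1,'omitnan');\r\nTrainSet{:,num} = fillmissing(TrainSet{:,num},'constant',colMeans);<\/pre>\n<p>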
Other approaches include replacing NaN values with mean or median values and <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2019b\/curvefit\/removing-outliers.html\">removing the outliers using the Curve Fitting app<\/a>.<\/p>\n<pre>TrainSet = convertvars(TrainSet,@iscategorical,'double');\r\nTrainSet = fillmissing(TrainSet,'linear');<\/pre>\n<p>In this step I move the label predictor <em>diabetes_mellitus<\/em> to the last column of the table, because for some algorithms in MATLAB and in the Classification Learner app the last column is the default response variable.<\/p>\n<pre>TrainSet = movevars(TrainSet,'diabetes_mellitus','After',110);<\/pre>\n<h2>Step 3: Create Training Data<\/h2>\n<p>Once I have the cleaned training data, I separate the label predictor <em>diabetes_mellitus<\/em> from the training set and create two separate variables: <em>XTrain<\/em> (predictor data) and <em>YTrain<\/em> (class labels).<\/p>\n<pre>XTrain = removevars(TrainSet,{'diabetes_mellitus'});\r\nYTrain = TrainSet.diabetes_mellitus;<\/pre>\n<h2>Step 4: Create Test Data<\/h2>\n<p>Download the <em>UnlabeledWiDS2021.csv<\/em> file from Kaggle. Read the file using the <em>readtable<\/em> function to store it as a table. You can also use the Import tool to load the data.<\/p>\n<pre>XTest = readtable('UnlabeledWiDS2021.csv','TreatAsEmpty','NA');<\/pre>\n<p>I used a similar approach for cleaning the test data as for the training data above. 
XTest is the test data with no label predictor.<\/p>\n<p><strong>Remove the character columns of the table<\/strong><\/p>\n<pre>XTest = removevars(XTest, {'ethnicity','gender','hospital_admit_source',...\r\n'icu_admit_source','icu_stay_type','icu_type'});<\/pre>\n<p><strong>Remove minimum values from all the vitals predictors<\/strong><\/p>\n<p>The minimum values of the <em>vital<\/em> category correspond to the even-numbered columns from 40 to 166.<\/p>\n<pre>XTest = removevars(XTest, (40:2:166));<\/pre>\n<p><strong>Fill the missing values<\/strong><\/p>\n<pre>XTest = convertvars(XTest,@iscategorical,'double');\r\nXTest = fillmissing(XTest,'linear');<\/pre>\n<h2>Step 5: Train a Model<\/h2>\n<p>In MATLAB you can train a model using two different methods:<\/p>\n<ol>\n<li>Using custom MATLAB machine learning algorithm functions<\/li>\n<li>Training the model using the Classification Learner app<\/li>\n<\/ol>\n<p>Here I walk through the steps for both methods. I encourage you to try both approaches and train the model using different algorithms and parameters; this will help you optimize and compare different models&#8217; scores.<\/p>\n<h2>Option 1: Using custom algorithms<\/h2>\n<p>A binary classification problem can be approached using various algorithms such as decision trees, SVMs, and logistic regression. Here I train using the <a href=\"https:\/\/www.mathworks.com\/help\/stats\/fitclinear.html?\">fitclinear<\/a> classification model. It trains linear binary classification models on high-dimensional predictor data.<\/p>\n<p>Convert the tables to numeric matrices, because the <em>fitclinear<\/em> function takes only a numeric matrix as an input argument.<\/p>\n<pre>XTrainMat = table2array(XTrain);\r\nXTestMat = table2array(XTest);<\/pre>\n<p><strong>Fit the model<\/strong><\/p>\n<p>The name-value pair input arguments of the function give you options for tuning the model. 
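<\/p>\n<p>Before tuning, a minimal baseline fit with all default options (a sketch assuming <em>XTrainMat<\/em> and <em>YTrain<\/em> from above) can be useful for comparison:<\/p>\n<pre>% Baseline linear classifier with default solver and regularization\r\nMdlBaseline = fitclinear(XTrainMat,YTrain);<\/pre>\n<p>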
Here I use the <em>sparsa<\/em> solver (Sparse Reconstruction by Separable Approximation), which defaults to <em>lasso<\/em> regularization. To optimize the model, I do some hyperparameter optimization.<\/p>\n<p>Setting <em>&#8216;OptimizeHyperparameters&#8217;<\/em> to <em>&#8216;auto&#8217;<\/em> optimizes over <em>{Lambda, Learner}<\/em>, and the acquisition function name lets you modify the optimizer&#8217;s behavior when it is overexploiting an area of the search space.<\/p>\n<p>You can further cross-validate the data within the input arguments using the cross-validation options: <em>CrossVal, KFold, CVPartition<\/em>, etc. Check out the <a href=\"https:\/\/www.mathworks.com\/help\/stats\/fitclinear.html?\"><em>fitclinear<\/em><\/a> documentation to learn more about the input arguments.<\/p>\n<pre>Mdl = fitclinear(XTrainMat,YTrain,'ObservationsIn','rows','Solver','sparsa',...\r\n'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',...\r\nstruct('AcquisitionFunctionName','expected-improvement-plus'));<\/pre>\n<p>&nbsp;<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-4939 size-medium\" src=\"https:\/\/blogs.mathworks.com\/racing-lounge\/files\/2021\/01\/graph_2-300x280.png\" alt=\"\" width=\"300\" height=\"280\" \/>\u00a0<img decoding=\"async\" loading=\"lazy\" class=\"wp-image-4943 size-medium aligncenter\" src=\"https:\/\/blogs.mathworks.com\/racing-lounge\/files\/2021\/01\/undergraph_3-225x300.png\" alt=\"\" width=\"225\" height=\"300\" \/><\/p>\n<p><strong>Predict on the Test Set<\/strong><\/p>\n<p>Once the model is ready, you can make predictions on the test set using the <em>predict<\/em> function. It takes as input the fitted model and test data with the same predictors as the training data. 
The output is the predicted labels and scores.<\/p>\n<pre>[label,scores] = predict(Mdl,XTestMat);<\/pre>\n<h2>Option 2: Using Classification Learner App<\/h2>\n<p>The second method of training the model is to use the <a href=\"https:\/\/www.mathworks.com\/help\/stats\/classification-learner-app.html\">Classification Learner app<\/a>. It lets you interactively train, validate, and tune classification models. Let&#8217;s see the steps to work with it.<\/p>\n<ul>\n<li>On the <strong><em>Apps<\/em><\/strong> tab, in the Machine Learning group, click <strong><em>Classification Learner<\/em><\/strong>.<\/li>\n<li>Click <strong><em>New Session<\/em><\/strong> and select data (<strong><em>TrainSet<\/em><\/strong>) from the workspace. Specify the response variable (<strong><em>diabetes_mellitus<\/em><\/strong>).<\/li>\n<li>Select the validation method to avoid overfitting. You can choose either <strong><em>holdout validation<\/em><\/strong> or <strong><em>cross-validation<\/em><\/strong>, selecting the number of folds.<\/li>\n<li>On the <strong><em>Classification Learner tab<\/em><\/strong>, in the <strong><em>Model Type<\/em><\/strong> section, select the algorithm to be trained, e.g. 
<em>Logistic Regression, All SVMs, All Quick-To-Train<\/em>.<\/li>\n<li>You can also try transforming features by <strong><em>enabling PCA<\/em><\/strong> to reduce dimensionality.<\/li>\n<li>The model can further be improved by changing parameter settings in the <strong><em>Advanced dialog box<\/em><\/strong>.<\/li>\n<li>Once all required options are selected, click <strong><em>Train<\/em><\/strong>.<\/li>\n<li>The history window on the left displays the different models trained and their accuracy.<\/li>\n<li>Performance of the model on the validation data can be evaluated using the <strong><em>Confusion Matrix<\/em><\/strong> and <strong><em>ROC Curve<\/em><\/strong> sections.<\/li>\n<li>To make predictions on the test set, I export the model by selecting <strong><em>Export Model<\/em><\/strong> on the <em>Classification Learner tab<\/em>.<\/li>\n<\/ul>\n<p><img decoding=\"async\" loading=\"lazy\" width=\"800\" height=\"500\" class=\"aligncenter size-full wp-image-4945\" src=\"https:\/\/blogs.mathworks.com\/racing-lounge\/files\/2021\/01\/wids_3.gif\" alt=\"\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>ROC curve of the validation data, exported from the Classification Learner app, with a Medium Gaussian SVM model<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-4937 size-medium\" src=\"https:\/\/blogs.mathworks.com\/racing-lounge\/files\/2021\/01\/AUC_81_latest-291x300.png\" alt=\"\" width=\"291\" height=\"300\" \/><\/p>\n<p><strong>Predict on the Test Set<\/strong><\/p>\n<p>The exported model is saved as <strong><em>trainedModel<\/em><\/strong> in the workspace. You can then predict labels and scores using <strong><em>predictFcn<\/em><\/strong>.<\/p>\n<p>The <strong><em>label<\/em><\/strong> output contains the predicted labels on the test set. 
<strong><em>scores<\/em><\/strong> contains the score of each observation for both the positive and negative classes.<\/p>\n<pre>[label,scores] = trainedModel.predictFcn(XTest);<\/pre>\n<h1>Submit &amp; Evaluate<\/h1>\n<h2>Step 6: Kaggle Submission<\/h2>\n<p>Create a table of the results based on the IDs and prediction scores. The desired file format for submission is:<\/p>\n<p><em>encounter_id, diabetes_mellitus<\/em><\/p>\n<p>You can place all the test results in a MATLAB table, which makes it easy to visualize and to write to the desired file format. I stored the positive-class scores (the second column of <em>scores<\/em>).<\/p>\n<pre>testResults = table(XTest.encounter_id,scores(:,2),'VariableNames',...\r\n{'encounter_id','diabetes_mellitus'});<\/pre>\n<p>Write the results to a CSV file. This is the file you will submit for the challenge.<\/p>\n<pre>writetable(testResults,'testResults.csv');<\/pre>\n<h2>Step 7: Evaluate<\/h2>\n<p>Submissions for the Kaggle leaderboard will be evaluated on the area under the Receiver Operating Characteristic (ROC) curve between the predicted and the observed target (<em>diabetes_mellitus<\/em>). Submit your <em>testResults.csv<\/em> file generated above on Kaggle to view your AUC score for the test dataset.<\/p>\n<p>The AUC (Area Under Curve) is the area enclosed by the ROC curve. A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5. The range of possible AUC values is [0, 1]; usually, your model will score somewhere in between.<\/p>\n<p>A confusion matrix plot is used to understand how the selected classifier performed in each class. To view the confusion matrix after training a model, you can use the MATLAB <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2019b\/deeplearning\/ref\/plotconfusion.html\">plotconfusion<\/a> function.<\/p>\n<p>To evaluate the model, MATLAB provides the <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2019b\/stats\/perfcurve.html\">perfcurve<\/a> function. 
It calculates the false positive rates, true positive rates, thresholds, and the AUC score. The input arguments to the function include the test labels, the scores, and the positive-class label.<\/p>\n<p>For <em><strong>self-evaluation<\/strong><\/em> purposes, you can create the test labels (<em>YTest<\/em>) by partitioning a subset from the training set and using the scores generated from the <em>trainedModel<\/em>.<\/p>\n<p><em>Note: the AUC calculated through this function might differ from the AUC calculated on the Kaggle leaderboard.<\/em><\/p>\n<pre>[fpr,tpr,thr,auc] = perfcurve(YTest,scores(:,2),'1');\r\n\r\n<\/pre>\n<p>Thanks for following along with this code! We are excited to find out how you will modify this starter code and make it yours. I strongly recommend looking at our Resources section below for more ideas on how you can improve our benchmark model.<br \/>\nFeel free to reach out to us in the Kaggle forum or email us at <a href=\"mailto:studentcompetitions@mathworks.com\"><em>studentcompetitions@mathworks.com<\/em><\/a> if you have any further questions.<\/p>\n<h1>Additional Resources<\/h1>\n<ol>\n<li><a href=\"https:\/\/www.mathworks.com\/videos\/series\/data-science-tutorial.html\">Data Science Tutorial<\/a><\/li>\n<li><a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2019b\/matlab\/data_analysis\/missing-data-in-matlab.html\">Missing Data in MATLAB<\/a><\/li>\n<li><a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2019b\/stats\/supervised-learning-machine-learning-workflow-and-algorithms.html\">Supervised Learning Workflow and Algorithms<\/a><\/li>\n<li><a href=\"https:\/\/www.mathworks.com\/help\/stats\/train-classification-models-in-classification-learner-app.html\">Train Classification Models in Classification Learner App<\/a><\/li>\n<li><a href=\"https:\/\/www.mathworks.com\/help\/stats\/export-classification-model-for-use-with-new-data.html\">Export Classification Model to Predict New Data<\/a><\/li>\n<li><a 
href=\"https:\/\/www.mathworks.com\/campaigns\/offers\/data-science-cheat-sheets.html\">8 MATLAB Cheat Sheets for Data Science<\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img src=\"https:\/\/blogs.mathworks.com\/student-lounge\/files\/2021\/01\/wids_3.gif\" class=\"img-responsive attachment-post-thumbnail size-post-thumbnail wp-post-image\" alt=\"\" decoding=\"async\" loading=\"lazy\" \/><\/div>\n<p>Introduction<br \/>\nHello all, I am Neha Goel, Technical Lead for AI\/Data Science competitions on the MathWorks Student Competition team. MathWorks is excited to support WiDS Datathon 2021 by providing&#8230; <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/student-lounge\/2021\/01\/08\/matlab-benchmark-code-for-wids-datathon-2021\/\">read more >><\/a><\/p>\n","protected":false},"author":174,"featured_media":4945,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[365,6],"tags":[285,322,363,361,104,128],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/posts\/4935"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/users\/174"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/comments?post=4935"}],"version-history":[{"count":17,"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/posts\/4935\/revisions"}],"predecessor-version":[{"id":4981,"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/posts\/4935\/revisions\/4981"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/media\/4945"}],"wp:attachment":[{"href"
:"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/media?parent=4935"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/categories?post=4935"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/student-lounge\/wp-json\/wp\/v2\/tags?post=4935"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}