{"id":1195,"date":"2015-06-18T13:44:57","date_gmt":"2015-06-18T18:44:57","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1195"},"modified":"2017-11-14T15:39:02","modified_gmt":"2017-11-14T20:39:02","slug":"getting-started-with-kaggle-data-science-competitions","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2015\/06\/18\/getting-started-with-kaggle-data-science-competitions\/","title":{"rendered":"Getting Started with Kaggle Data Science Competitions"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>Have you been interested in data science competitions, but not sure where to begin? Today's guest blogger, Toshi Takeuchi, would like to give a quick tutorial on how to get started with Kaggle using MATLAB.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/data-scientists.png\" alt=\"\"> <\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#539630b4-502f-404d-bf05-423ebb2248b6\">The Titanic Competition on Kaggle<\/a><\/li><li><a href=\"#0be56091-fd67-45f2-a01c-b9b05a3e5ee7\">Data Import and Preview<\/a><\/li><li><a href=\"#2b3ec17a-1fbb-4e86-94b2-7a6730c55012\">Establishing the Baseline<\/a><\/li><li><a href=\"#e3881827-d03a-4ad2-a3c5-3183bd9d5496\">Back to Examining the Data<\/a><\/li><li><a href=\"#f11bf682-eba3-4661-a077-e61bf923bd2d\">Exploratory Data Analysis and Visualization<\/a><\/li><li><a href=\"#02adbf41-e16b-4b35-92da-0347c88b768b\">Feature Engineering<\/a><\/li><li><a href=\"#23f7834f-eff1-45dd-a0b2-46b113e7e7ef\">Your Secret Weapon - Classification Learner<\/a><\/li><li><a href=\"#9a6b03c8-fd0f-48f0-810d-66293c1caa3a\">Random Forest and Boosted Trees<\/a><\/li><li><a href=\"#1484fcc1-9d1e-4296-b4f6-43d34593eb78\">Model Evaluation<\/a><\/li><li><a href=\"#8d7a216e-b341-4f0c-8a00-138b5ccf5a55\">Create a Submission File<\/a><\/li><li><a href=\"#61580856-a9cc-41ee-b5e5-dfec8e39ebcf\">Conclusion - Let's Give It a 
Try<\/a><\/li><\/ul><\/div><h4>The Titanic Competition on Kaggle<a name=\"539630b4-502f-404d-bf05-423ebb2248b6\"><\/a><\/h4><p>MATLAB is no stranger to competition - the <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/contest\/\">MATLAB Programming Contest<\/a> ran for over a decade. When it comes to data science competitions, Kaggle is currently one of the most popular destinations, and it offers a number of \"Getting Started 101\" projects you can try before you take on a real one. One of those is <a href=\"https:\/\/www.kaggle.com\/c\/titanic\">Titanic: Machine Learning from Disaster<\/a>.<\/p><p>The goal of the competition is to predict the survival outcomes of the ill-fated Titanic passengers. You use the training data to build your predictive model, and you submit the predicted survival outcomes for the test data. Your score is determined by the prediction accuracy.<\/p><p>Don't worry if you don't rank well on this one. There are entries with a <a href=\"https:\/\/www.kaggle.com\/c\/titanic\/leaderboard\">1.00000 score<\/a> on the leaderboard, but they either seriously <a href=\"http:\/\/en.wikipedia.org\/wiki\/Overfitting\">overfit<\/a> their models to the test data, or perhaps even cheated, given that the full dataset is available from other sources. That is not only pointless, but also raises serious questions - what kind of standards of conduct must data scientists meet to produce trustworthy results?<\/p><p>So just think of this as a way to do a practice run on Kaggle before you take on a real challenge.<\/p><p>If you haven't done so, sign up with <a href=\"https:\/\/www.kaggle.com\/\">Kaggle<\/a> - it's free. 
Then navigate to the <a href=\"https:\/\/www.kaggle.com\/c\/titanic\/data\">Titanic data<\/a> page to download the following files:<\/p><div><ul><li><tt>train.csv<\/tt> - the training data<\/li><li><tt>test.csv<\/tt> - the test data<\/li><\/ul><\/div><h4>Data Import and Preview<a name=\"0be56091-fd67-45f2-a01c-b9b05a3e5ee7\"><\/a><\/h4><p>We begin by importing the data into tables in MATLAB. Let's check the imported data. I am assuming that you have downloaded the CSV files into the current folder.<\/p><pre class=\"codeinput\">Train = readtable(<span class=\"string\">'train.csv'<\/span>,<span class=\"string\">'Format'<\/span>,<span class=\"string\">'%f%f%f%q%C%f%f%f%q%f%q%C'<\/span>);\r\nTest = readtable(<span class=\"string\">'test.csv'<\/span>,<span class=\"string\">'Format'<\/span>,<span class=\"string\">'%f%f%q%C%f%f%f%q%f%q%C'<\/span>);\r\ndisp(Train(1:5,[2:3 5:8 10:11]))\r\n<\/pre><pre class=\"codeoutput\">    Survived    Pclass     Sex      Age    SibSp    Parch     Fare     Cabin \r\n    ________    ______    ______    ___    _____    _____    ______    ______\r\n    0           3         male      22     1        0          7.25    ''    \r\n    1           1         female    38     1        0        71.283    'C85' \r\n    1           3         female    26     0        0         7.925    ''    \r\n    1           1         female    35     1        0          53.1    'C123'\r\n    0           3         male      35     0        0          8.05    ''    \r\n<\/pre><p><tt>Train<\/tt> contains the column <tt>Survived<\/tt>, and it is the response variable that denotes the survival outcome of the passengers:<\/p><pre>  1 - Survived\r\n  0 - Didn't survive<\/pre><h4>Establishing the Baseline<a name=\"2b3ec17a-1fbb-4e86-94b2-7a6730c55012\"><\/a><\/h4><p>When you downloaded the data from Kaggle, you probably noticed that additional files were also available - <tt>gendermodel<\/tt>, <tt>genderclassmodel<\/tt>, etc. 
These are simple predictive models that determined the outcome based on the gender or gender and class. When you tabulate the survival outcome by gender, you see that 74.2% of women survived.<\/p><pre class=\"codeinput\">disp(grpstats(Train(:,{<span class=\"string\">'Survived'<\/span>,<span class=\"string\">'Sex'<\/span>}), <span class=\"string\">'Sex'<\/span>))\r\n<\/pre><pre class=\"codeoutput\">               Sex      GroupCount    mean_Survived\r\n              ______    __________    _____________\r\n    female    female    314           0.74204      \r\n    male      male      577           0.18891      \r\n<\/pre><p>If we predict all women to survive and all men not to, then our overall accuracy would be 78.68% because we would be correct for women who actually survived as well as men who didn't. This is the baseline Gender Model. Our predictive model needs to do better than that on the training data. Kaggle's leaderboard shows that the score of this model on the test data is 0.76555.<\/p><pre class=\"codeinput\">gendermdl = grpstats(Train(:,{<span class=\"string\">'Survived'<\/span>,<span class=\"string\">'Sex'<\/span>}), {<span class=\"string\">'Survived'<\/span>,<span class=\"string\">'Sex'<\/span>})\r\nall_female = (gendermdl.GroupCount(<span class=\"string\">'0_male'<\/span>) + gendermdl.GroupCount(<span class=\"string\">'1_female'<\/span>))<span class=\"keyword\">...<\/span>\r\n    \/ sum(gendermdl.GroupCount)\r\n<\/pre><pre class=\"codeoutput\">gendermdl = \r\n                Survived     Sex      GroupCount\r\n                ________    ______    __________\r\n    0_female    0           female     81       \r\n    0_male      0           male      468       \r\n    1_female    1           female    233       \r\n    1_male      1           male      109       \r\nall_female =\r\n      0.78676\r\n<\/pre><h4>Back to Examining the Data<a name=\"e3881827-d03a-4ad2-a3c5-3183bd9d5496\"><\/a><\/h4><p>When we looked at <tt>Train<\/tt>, you probably noticed 
that some values were missing in the variable <tt>Cabin<\/tt>. Let's see if we have other variables with missing data. We also want to check if there are any strange values. For example, it would be strange to see 0 in <tt>Fare<\/tt>. When we make changes to <tt>Train<\/tt>, we also have to apply the same changes to <tt>Test<\/tt>.<\/p><pre class=\"codeinput\">Train.Fare(Train.Fare == 0) = NaN;      <span class=\"comment\">% treat 0 fare as NaN<\/span>\r\nTest.Fare(Test.Fare == 0) = NaN;        <span class=\"comment\">% treat 0 fare as NaN<\/span>\r\nvars = Train.Properties.VariableNames;  <span class=\"comment\">% extract column names<\/span>\r\n\r\nfigure\r\nimagesc(ismissing(Train))\r\nax = gca;\r\nax.XTick = 1:12;\r\nax.XTickLabel = vars;\r\nax.XTickLabelRotation = 90;\r\ntitle(<span class=\"string\">'Missing Values'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/Kaggle_Titanic_01.png\" alt=\"\"> <p>We have 177 passengers with unknown age. There are several ways to deal with <a title=\"https:\/\/www.mathworks.com\/help\/matlab\/data_analysis\/missing-data.html (link no longer works)\">missing values<\/a>. Sometimes you simply remove them, but let's use the average, 29.6991, for simplicity in this case.<\/p><pre class=\"codeinput\">avgAge = nanmean(Train.Age)             <span class=\"comment\">% get average age<\/span>\r\nTrain.Age(isnan(Train.Age)) = avgAge;   <span class=\"comment\">% replace NaN with the average<\/span>\r\nTest.Age(isnan(Test.Age)) = avgAge;     <span class=\"comment\">% replace NaN with the average<\/span>\r\n<\/pre><pre class=\"codeoutput\">avgAge =\r\n       29.699\r\n<\/pre><p>We have 15 passengers associated with unknown fares. 
We know their classes, and it is reasonable to assume that fares varied by passenger class.<\/p><pre class=\"codeinput\">fare = grpstats(Train(:,{<span class=\"string\">'Pclass'<\/span>,<span class=\"string\">'Fare'<\/span>}),<span class=\"string\">'Pclass'<\/span>);   <span class=\"comment\">% get class average<\/span>\r\ndisp(fare)\r\n<span class=\"keyword\">for<\/span> i = 1:height(fare) <span class=\"comment\">% for each |Pclass|<\/span>\r\n    <span class=\"comment\">% apply the class average to missing values<\/span>\r\n    Train.Fare(Train.Pclass == i &amp; isnan(Train.Fare)) = fare.mean_Fare(i);\r\n    Test.Fare(Test.Pclass == i &amp; isnan(Test.Fare)) = fare.mean_Fare(i);\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">         Pclass    GroupCount    mean_Fare\r\n         ______    __________    _________\r\n    1    1         216           86.149   \r\n    2    2         184           21.359   \r\n    3    3         491           13.788   \r\n<\/pre><p>With regard to <tt>Cabin<\/tt>, you notice that some passengers had multiple cabins, all of them in first class. We will treat missing values as 0. 
Some third class cabin numbers are irregular and we need to handle those exceptions.<\/p><pre class=\"codeinput\"><span class=\"comment\">% tokenize the text string by white space<\/span>\r\ntrain_cabins = cellfun(@strsplit, Train.Cabin, <span class=\"string\">'UniformOutput'<\/span>, false);\r\ntest_cabins = cellfun(@strsplit, Test.Cabin, <span class=\"string\">'UniformOutput'<\/span>, false);\r\n\r\n<span class=\"comment\">% count the number of tokens<\/span>\r\nTrain.nCabins = cellfun(@length, train_cabins);\r\nTest.nCabins = cellfun(@length, test_cabins);\r\n\r\n<span class=\"comment\">% deal with exceptions - only the first class people had multiple cabins<\/span>\r\nTrain.nCabins(Train.Pclass ~= 1 &amp; Train.nCabins &gt; 1,:) = 1;\r\nTest.nCabins(Test.Pclass ~= 1 &amp; Test.nCabins &gt; 1,:) = 1;\r\n\r\n<span class=\"comment\">% if |Cabin| is empty, then |nCabins| should be 0<\/span>\r\nTrain.nCabins(cellfun(@isempty, Train.Cabin)) = 0;\r\nTest.nCabins(cellfun(@isempty, Test.Cabin)) = 0;\r\n<\/pre><p>For two passengers, we don't know their port of embarkation. We will use the most frequent value, <tt>S<\/tt> (Southampton), from this variable to fill in the missing values. 
We also want to turn this into a numeric variable for later use.<\/p><pre class=\"codeinput\"><span class=\"comment\">% get most frequent value<\/span>\r\nfreqVal = mode(Train.Embarked);\r\n\r\n<span class=\"comment\">% apply it to missing values<\/span>\r\nTrain.Embarked(isundefined(Train.Embarked)) = freqVal;\r\nTest.Embarked(isundefined(Test.Embarked)) = freqVal;\r\n\r\n<span class=\"comment\">% convert the data type from categorical to double<\/span>\r\nTrain.Embarked = double(Train.Embarked);\r\nTest.Embarked = double(Test.Embarked);\r\n<\/pre><p>Let's also turn <tt>Sex<\/tt> into a numeric variable for later use.<\/p><pre class=\"codeinput\">Train.Sex = double(Train.Sex);\r\nTest.Sex = double(Test.Sex);\r\n<\/pre><p>Let's remove variables that we don't plan to use, because they contain too many missing values or unique values.<\/p><pre class=\"codeinput\">Train(:,{<span class=\"string\">'Name'<\/span>,<span class=\"string\">'Ticket'<\/span>,<span class=\"string\">'Cabin'<\/span>}) = [];\r\nTest(:,{<span class=\"string\">'Name'<\/span>,<span class=\"string\">'Ticket'<\/span>,<span class=\"string\">'Cabin'<\/span>}) = [];\r\n<\/pre><h4>Exploratory Data Analysis and Visualization<a name=\"f11bf682-eba3-4661-a077-e61bf923bd2d\"><\/a><\/h4><p>At this point, we can begin further <a href=\"https:\/\/www.mathworks.com\/help\/stats\/exploratory-data-analysis.html\">exploration of the data<\/a> by visualizing the distribution of variables. This is a time-consuming but very important step. To keep it simple, I will just use one example - <tt>Age<\/tt>. 
The histogram shows that you have a higher survival rate for ages under 5, and a very low survival rate for ages above 65.<\/p><pre class=\"codeinput\">figure\r\nhistogram(Train.Age(Train.Survived == 0))   <span class=\"comment\">% age histogram of non-survivors<\/span>\r\nhold <span class=\"string\">on<\/span>\r\nhistogram(Train.Age(Train.Survived == 1))   <span class=\"comment\">% age histogram of survivors<\/span>\r\nhold <span class=\"string\">off<\/span>\r\nlegend(<span class=\"string\">'Didn''t Survive'<\/span>, <span class=\"string\">'Survived'<\/span>)\r\ntitle(<span class=\"string\">'The Titanic Passenger Age Distribution'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/Kaggle_Titanic_02.png\" alt=\"\"> <h4>Feature Engineering<a name=\"02adbf41-e16b-4b35-92da-0347c88b768b\"><\/a><\/h4><p>How can you take advantage of this visualization? We can create a new variable called <tt>AgeGroup<\/tt> using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/discretize.html\">discretize()<\/a> to group values into separate bins like <tt>child<\/tt>, <tt>teen<\/tt>, etc.<\/p><pre class=\"codeinput\"><span class=\"comment\">% group values into separate bins<\/span>\r\nTrain.AgeGroup = double(discretize(Train.Age, [0:10:20 65 80], <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'categorical'<\/span>,{<span class=\"string\">'child'<\/span>,<span class=\"string\">'teen'<\/span>,<span class=\"string\">'adult'<\/span>,<span class=\"string\">'senior'<\/span>}));\r\nTest.AgeGroup = double(discretize(Test.Age, [0:10:20 65 80], <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'categorical'<\/span>,{<span class=\"string\">'child'<\/span>,<span class=\"string\">'teen'<\/span>,<span class=\"string\">'adult'<\/span>,<span class=\"string\">'senior'<\/span>}));\r\n<\/pre><p>Creating such a new variable by processing existing variables is called <b>feature 
engineering<\/b>, and it is critical to performing well in the competition - it is where your creativity really comes in. We had already created a new variable <tt>nCabins<\/tt> to deal with missing data, but often you do this as part of exploratory data analysis. Let's also look at <tt>Fare<\/tt>.<\/p><pre class=\"codeinput\">figure\r\nhistogram(Train.Fare(Train.Survived == 0),0:10:520) <span class=\"comment\">% fare histogram of non-survivors<\/span>\r\nhold <span class=\"string\">on<\/span>\r\nhistogram(Train.Fare(Train.Survived == 1),0:10:520) <span class=\"comment\">% fare histogram of survivors<\/span>\r\nhold <span class=\"string\">off<\/span>\r\nlegend(<span class=\"string\">'Didn''t Survive'<\/span>, <span class=\"string\">'Survived'<\/span>)\r\ntitle(<span class=\"string\">'The Titanic Passenger Fare Distribution'<\/span>)\r\n\r\n<span class=\"comment\">% group values into separate bins<\/span>\r\nTrain.FareRange = double(discretize(Train.Fare, [0:10:30, 100, 520], <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'categorical'<\/span>,{<span class=\"string\">'&lt;10'<\/span>,<span class=\"string\">'10-20'<\/span>,<span class=\"string\">'20-30'<\/span>,<span class=\"string\">'30-100'<\/span>,<span class=\"string\">'&gt;100'<\/span>}));\r\nTest.FareRange = double(discretize(Test.Fare, [0:10:30, 100, 520], <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'categorical'<\/span>,{<span class=\"string\">'&lt;10'<\/span>,<span class=\"string\">'10-20'<\/span>,<span class=\"string\">'20-30'<\/span>,<span class=\"string\">'30-100'<\/span>,<span class=\"string\">'&gt;100'<\/span>}));\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/Kaggle_Titanic_03.png\" alt=\"\"> <h4>Your Secret Weapon - Classification Learner<a name=\"23f7834f-eff1-45dd-a0b2-46b113e7e7ef\"><\/a><\/h4><p>The <a 
href=\"https:\/\/www.mathworks.com\/help\/stats\/classificationlearner-app.html\">Classification Learner<\/a> app is a new GUI-based MATLAB app that was introduced in R2015a in Statistics and Machine Learning Toolbox. This will be your secret weapon to try out different algorithms very quickly. Let's launch it!<\/p><pre>         classificationLearner<\/pre><div><ol><li>Click on <tt>Import Data<\/tt><\/li><li>Select <tt>Train<\/tt> in <b>Step 1<\/b> in <tt>Set Up Classification<\/tt> dialog box<\/li><li>In <b>Step 2<\/b>, change the \"Import as\" value for <tt>PassengerId<\/tt> to \"Do not import\", and <tt>Survived<\/tt> to \"Response\". All other variables should be already marked as <tt>Predictor<\/tt>.<\/li><li>In <b>Step 3<\/b>, just leave it as is to <tt>Cross Validation<\/tt>.<\/li><\/ol><\/div><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/importData.png\" alt=\"\"> <\/p><h4>Random Forest and Boosted Trees<a name=\"9a6b03c8-fd0f-48f0-810d-66293c1caa3a\"><\/a><\/h4><p>At this point, we are ready to apply some machine learning algorithms on the dataset. One of the popular algorithms on Kaggle is an ensemble method called <b>Random Forest<\/b>, and it is available as <tt>Bagged Trees<\/tt> in the app. Let's try that by selecting it from the <tt>classifier<\/tt> menu and clicking on the <tt>Train<\/tt> button.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/baggedTrees.png\" alt=\"\"> <\/p><p>When finished, you can open the <tt>Confusion Matrix<\/tt> tab. You see that this model achieved 83.7% overall accuracy, which is better than the Gender Model baseline.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/rfresult.png\" alt=\"\"> <\/p><p><tt>Boosted Trees<\/tt> is another family of ensemble methods popular among Kaggle participants. 
You can easily try various options and compare the results in the app. It seems Random Forest is the clear winner here.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/allresults.png\" alt=\"\"> <\/p><p>You can save the trained model into the workspace by clicking on <tt>Export Model<\/tt> in the app. If you save the model as <tt>trainedClassifier<\/tt>, then you can use it on <tt>Test<\/tt> as follows.<\/p><pre>  yfit = predict(trainedClassifier, Test{:,trainedClassifier.PredictorNames})<\/pre><p>You can also generate a Random Forest model programmatically using <a href=\"https:\/\/www.mathworks.com\/help\/stats\/treebagger-class.html\"><tt>TreeBagger<\/tt><\/a>. Let's adjust the formatting of the data to satisfy its requirements and split the training data into subsets for holdout cross validation.<\/p><pre class=\"codeinput\">Y_train = Train.Survived;                   <span class=\"comment\">% slice response variable<\/span>\r\nX_train = Train(:,3:end);                   <span class=\"comment\">% select predictor variables<\/span>\r\nvars = X_train.Properties.VariableNames;    <span class=\"comment\">% get variable names<\/span>\r\nX_train = table2array(X_train);             <span class=\"comment\">% convert to a numeric matrix<\/span>\r\nX_test = table2array(Test(:,2:end));        <span class=\"comment\">% convert to a numeric matrix<\/span>\r\ncategoricalPredictors = {<span class=\"string\">'Pclass'<\/span>, <span class=\"string\">'Sex'<\/span>, <span class=\"string\">'Embarked'<\/span>, <span class=\"string\">'AgeGroup'<\/span>, <span class=\"string\">'FareRange'<\/span>};\r\nrng(1);                                     <span class=\"comment\">% for reproducibility<\/span>\r\nc = cvpartition(Y_train,<span class=\"string\">'holdout'<\/span>, 0.30);   <span class=\"comment\">% 30%-holdout cross validation<\/span>\r\n<\/pre><p>Now we can train a Random Forest model and get the out-of-bag sampling 
accuracy metric, which is similar to the error metric from k-fold cross validation. You can generate random indices from the cvpartition object <tt>c<\/tt> to partition the dataset for training.<\/p><pre class=\"codeinput\"><span class=\"comment\">% generate a Random Forest model from the partitioned data<\/span>\r\nRF = TreeBagger(200, X_train(training(c),:), Y_train(training(c)),<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'PredictorNames'<\/span>, vars, <span class=\"string\">'Method'<\/span>,<span class=\"string\">'classification'<\/span>,<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'CategoricalPredictors'<\/span>, categoricalPredictors, <span class=\"string\">'oobvarimp'<\/span>, <span class=\"string\">'on'<\/span>);\r\n\r\n<span class=\"comment\">% compute the out-of-bag accuracy<\/span>\r\noobAccuracy = 1 - oobError(RF, <span class=\"string\">'mode'<\/span>, <span class=\"string\">'ensemble'<\/span>)\r\n<\/pre><pre class=\"codeoutput\">oobAccuracy =\r\n      0.82212\r\n<\/pre><p>One of the benefits of Random Forest is its feature importance metric, which represents the change in prediction error with or without the presence of a given variable in the out-of-bag sampling process.<\/p><pre class=\"codeinput\">[~,order] = sort(RF.OOBPermutedVarDeltaError);  <span class=\"comment\">% sort the metrics<\/span>\r\nfigure\r\nbarh(RF.OOBPermutedVarDeltaError(order))        <span class=\"comment\">% horizontal bar chart<\/span>\r\ntitle(<span class=\"string\">'Feature Importance Metric'<\/span>)\r\nax = gca; ax.YTickLabel = vars(order);          <span class=\"comment\">% variable names as labels<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/Kaggle_Titanic_04.png\" alt=\"\"> <p>As expected <tt>Sex<\/tt> has the most predictive power, but <tt>nCabins<\/tt>, an engineered feature we came up with, also made a significant contribution. 
This is why feature engineering is important for doing well in the competition! We also used fairly naive ways to fill missing values; you can be much more creative there.<\/p><h4>Model Evaluation<a name=\"1484fcc1-9d1e-4296-b4f6-43d34593eb78\"><\/a><\/h4><p>To get a sense of how well this model actually performs, we want to check it against the holdout data. The accuracy drops significantly against unseen data, and that's what we expect to see when we submit our prediction to Kaggle.<\/p><pre class=\"codeinput\">[Yfit, Yscore] = predict(RF, X_train(test(c),:));       <span class=\"comment\">% use holdout data<\/span>\r\ncfm = confusionmat(Y_train(test(c)), str2double(Yfit)); <span class=\"comment\">% confusion matrix<\/span>\r\ncvAccuracy = sum(cfm(logical(eye(2))))\/length(Yfit)     <span class=\"comment\">% compute accuracy<\/span>\r\n<\/pre><pre class=\"codeoutput\">cvAccuracy =\r\n      0.79401\r\n<\/pre><p>When you tweak your features and modify your parameters, it is useful to use a <a href=\"https:\/\/www.mathworks.com\/help\/stats\/perfcurve.html\"><tt>perfcurve<\/tt> plot<\/a> (performance curve or receiver operating characteristic plot) to compare the performance. 
Here is an example.<\/p><pre class=\"codeinput\">posClass = strcmp(RF.ClassNames,<span class=\"string\">'1'<\/span>);   <span class=\"comment\">% get the index of the positive class<\/span>\r\ncurves = zeros(2,1); labels = cell(2,1);<span class=\"comment\">% pre-allocated variables<\/span>\r\n[rocX, rocY, ~, auc] = perfcurve(Y_train(test(c)),Yscore(:,posClass),<span class=\"string\">'1'<\/span>);\r\nfigure\r\ncurves(1) = plot(rocX, rocY);           <span class=\"comment\">% use the perfcurve output to plot<\/span>\r\nlabels{1} = sprintf(<span class=\"string\">'Random Forest - AUC: %.1f%%'<\/span>, auc*100);\r\ncurves(end) = refline(1,0); set(curves(end),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'r'<\/span>);\r\nlabels{end} = <span class=\"string\">'Reference Line - A random classifier'<\/span>;\r\nxlabel(<span class=\"string\">'False Positive Rate'<\/span>)\r\nylabel(<span class=\"string\">'True Positive Rate'<\/span>)\r\ntitle(<span class=\"string\">'ROC Plot'<\/span>)\r\nlegend(curves, labels, <span class=\"string\">'Location'<\/span>, <span class=\"string\">'SouthEast'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/Kaggle_Titanic_05.png\" alt=\"\"> <h4>Create a Submission File<a name=\"8d7a216e-b341-4f0c-8a00-138b5ccf5a55\"><\/a><\/h4><p>To enter your submission to the Kaggle competition, all you have to do is to <a href=\"https:\/\/www.kaggle.com\/c\/titanic\/submissions\/attach\">upload a CSV file<\/a>. You just need the <tt>PassengerId<\/tt> and <tt>Survived<\/tt> columns for submission, and you populate the <tt>Survived<\/tt> with 1s and 0s. 
We are going to use the Random Forest model we built to populate this variable.<\/p><pre class=\"codeinput\">PassengerId = Test.PassengerId;             <span class=\"comment\">% extract Passenger Ids<\/span>\r\nSurvived = predict(RF, X_test);             <span class=\"comment\">% generate response variable<\/span>\r\nSurvived = str2double(Survived);            <span class=\"comment\">% convert to double<\/span>\r\nsubmission = table(PassengerId,Survived);   <span class=\"comment\">% combine them into a table<\/span>\r\ndisp(submission(1:5,:))                     <span class=\"comment\">% preview the table<\/span>\r\nwritetable(submission,<span class=\"string\">'submission.csv'<\/span>)     <span class=\"comment\">% write to a CSV file<\/span>\r\n<\/pre><pre class=\"codeoutput\">    PassengerId    Survived\r\n    ___________    ________\r\n    892            0       \r\n    893            0       \r\n    894            0       \r\n    895            0       \r\n    896            0       \r\n<\/pre><h4>Conclusion - Let's Give It a Try<a name=\"61580856-a9cc-41ee-b5e5-dfec8e39ebcf\"><\/a><\/h4><p>When you upload the submission CSV file, you should see your score immediately, and it should be around 0.7940, putting you within the top 800. I'm pretty sure you are seeing a lot of room for improvement. For example, I just used averages for filling missing values in <tt>Fare<\/tt>, but perhaps you can do better than that given the importance of the feature. Maybe you can come up with better engineered features from the variables I glossed over.<\/p><p>If you want to learn more about how you can get started with Kaggle using MATLAB, please visit our Kaggle page and check out more tutorials and resources. 
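<\/p><p>As a concrete starting point for the <tt>Fare<\/tt> idea above, here is a sketch - my suggestion, not part of the original analysis - that replaces the class mean with the class median at the imputation step, since the median resists the extreme fares visible in the fare histogram:<\/p><pre>  <span class=\"comment\">% run this at the imputation step, in place of the mean-based fill<\/span>\r\n  fareMed = grpstats(Train(:,{<span class=\"string\">'Pclass'<\/span>,<span class=\"string\">'Fare'<\/span>}), <span class=\"string\">'Pclass'<\/span>, <span class=\"string\">'median'<\/span>);\r\n  <span class=\"keyword\">for<\/span> i = 1:height(fareMed)\r\n      Train.Fare(Train.Pclass == i &amp; isnan(Train.Fare)) = fareMed.median_Fare(i);\r\n      Test.Fare(Test.Pclass == i &amp; isnan(Test.Fare)) = fareMed.median_Fare(i);\r\n  <span class=\"keyword\">end<\/span><\/pre><p>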
Good luck, and let us know your results <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1195#respond\">here<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_7ede323584fe45828cb8919a1c14acc4() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='7ede323584fe45828cb8919a1c14acc4 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 7ede323584fe45828cb8919a1c14acc4';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2015 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_7ede323584fe45828cb8919a1c14acc4()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2015a<br><\/p><\/div><!--\r\n7ede323584fe45828cb8919a1c14acc4 ##### SOURCE BEGIN #####\r\n%% Getting Started with Kaggle Data Science Competitions\r\n% Have you been interested in data science competitions, but not sure where\r\n% to begin? Today's guest blogger, Toshi Takeuchi, would like to give a\r\n% quick tutorial on how to get started with Kaggle using MATLAB.\r\n%\r\n% <<data-scientists.png>>\r\n\r\n%% The Titanic Competition on Kaggle\r\n% MATLAB is no stranger to competition - the\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/contest\/ MATLAB Programming\r\n% Contest> continued for over a decade. When it comes to data\r\n% science competitions, Kaggle is currently one of the most popular\r\n% destinations and it offers a number of \"Getting Started 101\" projects you\r\n% can try before you take on a real one. 
One of those is\r\n% <https:\/\/www.kaggle.com\/c\/titanic Titanic: Machine Learning from\r\n% Disaster>.\r\n%\r\n% The goal of the competition is to predict the survival outcomes for the\r\n% ill-fated Titanic passengers. You use the training data to build your\r\n% predictive model and you submit the predicted survival outcomes for the\r\n% test data. Your score is determined by the prediction accuracy. \r\n%\r\n% Don't worry if you don't rank well on this one. There are entries with a\r\n% <https:\/\/www.kaggle.com\/c\/titanic\/leaderboard 1.00000 score> in the\r\n% leaderboard, but they either seriously\r\n% <http:\/\/en.wikipedia.org\/wiki\/Overfitting overfit> their models to the\r\n% test data, or perhaps even cheated, given that the full dataset is\r\n% available from the\r\n% <http:\/\/lib.stat.cmu.edu\/S\/Harrell\/data\/ascii\/titanic.txt other sources>.\r\n% That is not only pointless, but also raises serious questions - what kind\r\n% of standards of conduct must data scientists meet to produce trustworthy\r\n% results?\r\n% \r\n% So just think of this as a way to do a practice run on Kaggle before you\r\n% take on a real challenge.\r\n%  \r\n% If you haven't done so, sign up with <https:\/\/www.kaggle.com\/ Kaggle> -\r\n% it's free. Then navigate to the <https:\/\/www.kaggle.com\/c\/titanic\/data\r\n% Titanic data> page to download the following files:\r\n%\r\n% * |train.csv| - the training data\r\n% * |test.csv| - the test data\r\n\r\n%% Data Import and Preview\r\n% We begin by importing the data into tables in MATLAB. Let's check the\r\n% imported data. 
I am assuming that you have downloaded the CSV files into\r\n% the current folder.\r\nTrain = readtable('train.csv','Format','%f%f%f%q%C%f%f%f%q%f%q%C');\r\nTest = readtable('test.csv','Format','%f%f%q%C%f%f%f%q%f%q%C');\r\ndisp(Train(1:5,[2:3 5:8 10:11]))\r\n\r\n%%\r\n% |Train| contains the column |Survived|, and it is the response variable\r\n% that denotes the survival outcome of the passengers:\r\n%\r\n%    1 - Survived\r\n%    0 - Didn't survive\r\n\r\n%% Establishing the Baseline\r\n% When you downloaded the data from Kaggle, you probably noticed that\r\n% additional files were also available - |gendermodel|, |genderclassmodel|,\r\n% etc. These are simple predictive models that determined the outcome based\r\n% on the gender or gender and class. When you tabulate the survival outcome\r\n% by gender, you see that 74.2% of women survived. \r\n\r\ndisp(grpstats(Train(:,{'Survived','Sex'}), 'Sex'))\r\n\r\n%%\r\n% If we predict all women to survive and all men not to, then our overall\r\n% accuracy would be 78.68% because we would be correct for women who\r\n% actually survived as well as men who didn't. This is the baseline Gender\r\n% Model. Our predictive model needs to do better than that on the training\r\n% data. Kaggle's leaderboard shows that the score of this model on the test\r\n% data is 0.76555.\r\n\r\ngendermdl = grpstats(Train(:,{'Survived','Sex'}), {'Survived','Sex'})\r\nall_female = (gendermdl.GroupCount('0_male') + gendermdl.GroupCount('1_female'))... \r\n    \/ sum(gendermdl.GroupCount) \r\n\r\n%% Back to Examining the Data\r\n% When we looked at |Train|, you probably noticed that some values were\r\n% missing in the variable |Cabin|. Let's see if we have other variables\r\n% with missing data. We also want to check if there are any strange values.\r\n% For example, it would be strange to see 0 in |Fare|. 
When we make changes\r\n% to |Train|, we also have to apply the same changes to |Test|.\r\n\r\nTrain.Fare(Train.Fare == 0) = NaN;      % treat 0 fare as NaN\r\nTest.Fare(Test.Fare == 0) = NaN;        % treat 0 fare as NaN\r\nvars = Train.Properties.VariableNames;  % extract column names\r\n\r\nfigure\r\nimagesc(ismissing(Train))\r\nax = gca; \r\nax.XTick = 1:12; \r\nax.XTickLabel = vars; \r\nax.XTickLabelRotation = 90;\r\ntitle('Missing Values')\r\n\r\n%%\r\n% We have 177 passengers with unknown ages. There are several ways to deal\r\n% with\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/data_analysis\/missing-data.html\r\n% missing values>. Sometimes you simply remove them, but let's use the\r\n% average, 29.6991, for simplicity in this case.\r\n\r\navgAge = nanmean(Train.Age)             % get average age\r\nTrain.Age(isnan(Train.Age)) = avgAge;   % replace NaN with the average\r\nTest.Age(isnan(Test.Age)) = avgAge;     % replace NaN with the average\r\n\r\n%%\r\n% We have 15 passengers associated with unknown fares. We know their\r\n% classes, and it is reasonable to assume that fares varied by passenger\r\n% class.\r\nfare = grpstats(Train(:,{'Pclass','Fare'}),'Pclass');   % get class average\r\ndisp(fare)\r\nfor i = 1:height(fare) % for each |Pclass|\r\n    % apply the class average to missing values\r\n    Train.Fare(Train.Pclass == i & isnan(Train.Fare)) = fare.mean_Fare(i);\r\n    Test.Fare(Test.Pclass == i & isnan(Test.Fare)) = fare.mean_Fare(i);\r\nend\r\n\r\n%%\r\n% With regard to |Cabin|, you notice that some passengers had multiple\r\n% cabins, and they are all in first class. We will treat missing values\r\n% as 0. 
Some third-class cabin numbers are irregular and we need to handle\r\n% those exceptions.\r\n\r\n% tokenize the text string by white space\r\ntrain_cabins = cellfun(@strsplit, Train.Cabin, 'UniformOutput', false);\r\ntest_cabins = cellfun(@strsplit, Test.Cabin, 'UniformOutput', false);\r\n\r\n% count the number of tokens\r\nTrain.nCabins = cellfun(@length, train_cabins);\r\nTest.nCabins = cellfun(@length, test_cabins);\r\n\r\n% deal with exceptions - only first-class passengers had multiple cabins\r\nTrain.nCabins(Train.Pclass ~= 1 & Train.nCabins > 1,:) = 1;\r\nTest.nCabins(Test.Pclass ~= 1 & Test.nCabins > 1,:) = 1;\r\n\r\n% if |Cabin| is empty, then |nCabins| should be 0\r\nTrain.nCabins(cellfun(@isempty, Train.Cabin)) = 0;\r\nTest.nCabins(cellfun(@isempty, Test.Cabin)) = 0;\r\n\r\n%% \r\n% For two passengers, we don't know their port of embarkation. We will use\r\n% the most frequent value, |S| (Southampton), from this variable to fill in\r\n% the missing values. We also want to turn this into a numeric variable for\r\n% later use.\r\n\r\n% get most frequent value\r\nfreqVal = mode(Train.Embarked);\r\n\r\n% apply it to the missing values\r\nTrain.Embarked(isundefined(Train.Embarked)) = freqVal;  \r\nTest.Embarked(isundefined(Test.Embarked)) = freqVal;\r\n\r\n% convert the data type from categorical to double\r\nTrain.Embarked = double(Train.Embarked);\r\nTest.Embarked = double(Test.Embarked);\r\n\r\n%%\r\n% Let's also turn |Sex| into a numeric variable for later use. \r\nTrain.Sex = double(Train.Sex);\r\nTest.Sex = double(Test.Sex);\r\n\r\n%%\r\n% Let's remove variables that we don't plan to use, because they contain\r\n% too many missing values or unique values.  
\r\nTrain(:,{'Name','Ticket','Cabin'}) = []; \r\nTest(:,{'Name','Ticket','Cabin'}) = [];\r\n\r\n%% Exploratory Data Analysis and Visualization\r\n% At this point, we can begin further\r\n% <https:\/\/www.mathworks.com\/help\/stats\/exploratory-data-analysis.html\r\n% exploration of the data> by visualizing the distribution of variables.\r\n% This is a time-consuming but very important step. To keep it simple, I\r\n% will just use one example - |Age|. The histogram shows that you have a\r\n% higher survival rate for ages under 5, and a very low survival rate for\r\n% ages above 65.\r\n\r\nfigure\r\nhistogram(Train.Age(Train.Survived == 0))   % age histogram of non-survivors\r\nhold on\r\nhistogram(Train.Age(Train.Survived == 1))   % age histogram of survivors\r\nhold off\r\nlegend('Didn''t Survive', 'Survived')\r\ntitle('The Titanic Passenger Age Distribution')\r\n\r\n%% Feature Engineering\r\n% How can you take advantage of this visualization? We can create a new\r\n% variable called |AgeGroup| using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/discretize.html discretize()>\r\n% to group values into separate bins like |child|, |teen|, etc.\r\n\r\n% group values into separate bins\r\nTrain.AgeGroup = double(discretize(Train.Age, [0:10:20 65 80], ...\r\n    'categorical',{'child','teen','adult','senior'}));\r\nTest.AgeGroup = double(discretize(Test.Age, [0:10:20 65 80], ...\r\n    'categorical',{'child','teen','adult','senior'}));\r\n\r\n%%\r\n% Creating such a new variable by processing existing variables is called\r\n% *feature engineering*, and it is a critical step for performing well in\r\n% the competition - it is where your creativity really comes in. We had\r\n% already created a new variable |nCabins| to deal with missing data, but\r\n% often you do this as part of exploratory data analysis. 
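As another\r\n% illustration - a hypothetical sketch, not part of the original analysis -\r\n% you could combine |SibSp| and |Parch| into a family-size feature, since\r\n% survival may well have varied with family size:\r\n%\r\n%    % self + siblings\/spouses + parents\/children aboard\r\n%    Train.FamilySize = Train.SibSp + Train.Parch + 1;\r\n%    Test.FamilySize = Test.SibSp + Test.Parch + 1;\r\n%\r\n% If you try such a feature, remember to add it to the predictors used in\r\n% the modeling steps that follow. 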
Let's also look\r\n% at |Fare|.\r\n\r\nfigure\r\nhistogram(Train.Fare(Train.Survived == 0),0:10:520); % fare histogram of non-survivors\r\nhold on\r\nhistogram(Train.Fare(Train.Survived == 1),0:10:520) % fare histogram of survivors\r\nhold off\r\nlegend('Didn''t Survive', 'Survived')\r\ntitle('The Titanic Passenger Fare Distribution')\r\n\r\n% group values into separate bins\r\nTrain.FareRange = double(discretize(Train.Fare, [0:10:30, 100, 520], ...\r\n    'categorical',{'<10','10-20','20-30','30-100','>100'})); \r\nTest.FareRange = double(discretize(Test.Fare, [0:10:30, 100, 520], ...\r\n    'categorical',{'<10','10-20','20-30','30-100','>100'})); \r\n\r\n%% Your Secret Weapon - Classification Learner\r\n% The <https:\/\/www.mathworks.com\/help\/stats\/classificationlearner-app.html\r\n% Classification Learner> app is a new GUI-based MATLAB app that was\r\n% introduced in R2015a in Statistics and Machine Learning Toolbox. This\r\n% will be your secret weapon to try out different algorithms very quickly.\r\n% Let's launch it!\r\n%\r\n%           classificationLearner\r\n\r\n%%\r\n% # Click on |Import Data| \r\n% # Select |Train| in *Step 1* of the |Set Up Classification| dialog box\r\n% # In *Step 2*, change the \"Import as\" value for |PassengerId| to \"Do not\r\n% import\", and |Survived| to \"Response\". All other variables should\r\n% already be marked as |Predictor|.\r\n% # In *Step 3*, leave the validation setting as is (|Cross Validation|).\r\n%\r\n% <<importData.png>>\r\n\r\n%% Random Forest and Boosted Trees\r\n% At this point, we are ready to apply some machine learning algorithms to\r\n% the dataset. One of the popular algorithms on Kaggle is an ensemble\r\n% method called *Random Forest*, and it is available as |Bagged Trees| in\r\n% the app. Let's try that by selecting it from the |classifier| menu and\r\n% clicking on the |Train| button.\r\n%\r\n% <<baggedTrees.png>>\r\n%\r\n% When finished, you can open the |Confusion Matrix| tab. 
You see that this\r\n% model achieved 83.7% overall accuracy, which is better than the Gender\r\n% Model baseline.\r\n%\r\n% <<rfresult.png>>\r\n%\r\n% |Boosted Trees| is another family of ensemble methods popular among\r\n% Kaggle participants. You can easily try various options and compare the\r\n% results in the app. It seems Random Forest is the clear winner here. \r\n%\r\n% <<allresults.png>>\r\n%\r\n% You can save the trained model into the workspace by clicking on |Export\r\n% Model| in the app. If you save the model as |trainedClassifier|, then you\r\n% can use it on |Test| as follows.\r\n%\r\n%    yfit = predict(trainedClassifier, Test{:,trainedClassifier.PredictorNames})\r\n%\r\n%%\r\n% You can also generate a Random Forest model programmatically using\r\n% <https:\/\/www.mathworks.com\/help\/stats\/treebagger-class.html |TreeBagger|>.\r\n% Let's adjust the formatting of the data to satisfy its requirements and\r\n% split the training data into subsets for holdout cross validation.\r\n\r\nY_train = Train.Survived;                   % slice response variable\r\nX_train = Train(:,3:end);                   % select predictor variables\r\nvars = X_train.Properties.VariableNames;    % get variable names\r\nX_train = table2array(X_train);             % convert to a numeric matrix\r\nX_test = table2array(Test(:,2:end));        % convert to a numeric matrix\r\ncategoricalPredictors = {'Pclass', 'Sex', 'Embarked', 'AgeGroup', 'FareRange'};\r\nrng(1);                                     % for reproducibility\r\nc = cvpartition(Y_train,'holdout', 0.30);   % 30%-holdout cross validation\r\n\r\n%%\r\n% Now we can train a Random Forest model and get the out-of-bag sampling\r\n% accuracy metric, which is similar to the error metric from k-fold cross\r\n% validation. 
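For comparison, a k-fold partition - a hypothetical sketch, since we use\r\n% holdout here - would look like this:\r\n%\r\n%    cv = cvpartition(Y_train, 'KFold', 5);   % 5-fold partition\r\n%    for k = 1:cv.NumTestSets\r\n%        % train on training(cv,k) rows, evaluate on test(cv,k) rows\r\n%    end\r\n%\r\n% 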
You can generate random indices from the cvpartition object\r\n% |c| to partition the dataset for training.\r\n\r\n% generate a Random Forest model from the partitioned data\r\nRF = TreeBagger(200, X_train(training(c),:), Y_train(training(c)),...\r\n    'PredictorNames', vars, 'Method','classification',...\r\n    'CategoricalPredictors', categoricalPredictors, 'oobvarimp', 'on');\r\n\r\n% compute the out-of-bag accuracy\r\noobAccuracy = 1 - oobError(RF, 'mode', 'ensemble')\r\n\r\n%%\r\n% One of the benefits of Random Forest is its feature importance metric,\r\n% which measures how much the prediction error increases when the values\r\n% of a given variable are permuted across the out-of-bag samples.\r\n\r\n[~,order] = sort(RF.OOBPermutedVarDeltaError);  % sort the metrics\r\nfigure\r\nbarh(RF.OOBPermutedVarDeltaError(order))        % horizontal bar chart\r\ntitle('Feature Importance Metric')\r\nax = gca; ax.YTickLabel = vars(order);          % variable names as labels\r\n\r\n%%\r\n% As expected, |Sex| has the most predictive power, but |nCabins|, an\r\n% engineered feature we came up with, also made a significant contribution.\r\n% This is why feature engineering is important for doing well in the\r\n% competition! We also used fairly naive ways to fill missing values; you\r\n% can be much more creative there.\r\n\r\n%% Model Evaluation\r\n% To get a sense of how well this model actually performs, we want to check\r\n% it against the holdout data. The accuracy drops significantly against\r\n% unseen data, and that's what we expect to see when we submit our\r\n% prediction to Kaggle. 
\r\n\r\n[Yfit, Yscore] = predict(RF, X_train(test(c),:));       % use holdout data\r\ncfm = confusionmat(Y_train(test(c)), str2double(Yfit)); % confusion matrix\r\ncvAccuracy = sum(cfm(logical(eye(2))))\/length(Yfit)     % compute accuracy\r\n\r\n%%\r\n% When you tweak your features and modify your parameters, it is useful to\r\n% compare performance with a <https:\/\/www.mathworks.com\/help\/stats\/perfcurve.html |perfcurve|\r\n% plot> (performance curve, or receiver operating characteristic plot).\r\n% Here is an example.\r\n\r\nposClass = strcmp(RF.ClassNames,'1');   % get the index of the positive class\r\ncurves = zeros(2,1); labels = cell(2,1); % preallocate handles and labels\r\n[rocX, rocY, ~, auc] = perfcurve(Y_train(test(c)),Yscore(:,posClass),'1');\r\nfigure\r\ncurves(1) = plot(rocX, rocY);           % use the perfcurve output to plot\r\nlabels{1} = sprintf('Random Forest - AUC: %.1f%%', auc*100);\r\ncurves(end) = refline(1,0); set(curves(end),'Color','r');\r\nlabels{end} = 'Reference Line - A random classifier';\r\nxlabel('False Positive Rate')\r\nylabel('True Positive Rate')\r\ntitle('ROC Plot')\r\nlegend(curves, labels, 'Location', 'SouthEast')\r\n\r\n%% Create a Submission File\r\n% To enter your submission to the Kaggle competition, all you have to do is\r\n% <https:\/\/www.kaggle.com\/c\/titanic\/submissions\/attach upload a CSV\r\n% file>. You just need the |PassengerId| and |Survived| columns for\r\n% submission, and you populate |Survived| with 1s and 0s. 
We are going\r\n% to use the Random Forest model we built to populate this variable.\r\n\r\nPassengerId = Test.PassengerId;             % extract Passenger Ids\r\nSurvived = predict(RF, X_test);             % generate response variable\r\nSurvived = str2double(Survived);            % convert to double\r\nsubmission = table(PassengerId,Survived);   % combine them into a table\r\ndisp(submission(1:5,:))                     % preview the table\r\nwritetable(submission,'submission.csv')     % write to a CSV file\r\n\r\n%% Conclusion - Let's Give It a Try\r\n% When you upload the submission CSV file, you should see your score\r\n% immediately, and that would be around the 0.7940 range, putting you\r\n% within the top 800. I'm pretty sure you are seeing a lot of room for\r\n% improvement. For example, I just used averages for filling missing values\r\n% in |Fare|, but perhaps you can do better than that given the importance of\r\n% the feature. Maybe you can come up with better engineered features from\r\n% the variables I glossed over.\r\n% \r\n% If you want to learn more about how you can get started with Kaggle using\r\n% MATLAB, please visit our\r\n% <https:\/\/www.mathworks.com\/academia\/student-competitions\/kaggle\/ Kaggle>\r\n% page and check out more tutorials and resources. Good luck, and let us\r\n% know your results <https:\/\/blogs.mathworks.com\/loren\/?p=1195#respond\r\n% here>!\r\n\r\n##### SOURCE END ##### 7ede323584fe45828cb8919a1c14acc4\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/Kaggle_Titanic_05.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>Have you been interested in data science competitions, but not sure where to begin? Today's guest blogger, Toshi Takeuchi, would like to give a quick tutorial on how to get started with Kaggle using MATLAB.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/06\/18\/getting-started-with-kaggle-data-science-competitions\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,1],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1195"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1195"}],"version-history":[{"count":4,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1195\/revisions"}],"predecessor-version":[{"id":2465,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1195\/revisions\/2465"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1195"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1195"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}