{"id":776,"date":"2013-09-10T08:26:44","date_gmt":"2013-09-10T13:26:44","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=776"},"modified":"2017-01-06T10:01:13","modified_gmt":"2017-01-06T15:01:13","slug":"introduction-to-the-new-matlab-data-types-in-r2013b","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2013\/09\/10\/introduction-to-the-new-matlab-data-types-in-r2013b\/","title":{"rendered":"Introduction to the New MATLAB Data Types in R2013b"},"content":{"rendered":"\r\n<div class=\"content\"><!--introduction--><p>Today I&#8217;d like to introduce a fairly frequent guest blogger <a href=\"mailto:sarah.zaranek@mathworks.com\">Sarah Wait Zaranek<\/a> who works for the MATLAB Marketing team here at The MathWorks. She and I will be writing about the new capabilities for MATLAB in R2013b. In particular, there are two new data types in MATLAB in R2013b &#8211; <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/table.html\">table<\/a> and <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/categorical.html\">categorical<\/a> arrays.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#6f8f0253-8f83-411a-97c0-98c7bb7298cd\">What are Tables and Categorical Arrays?<\/a><\/li><li><a href=\"#fe5c445a-1e68-4a44-ab40-d9cc90438539\">Importing Data into a Table<\/a><\/li><li><a href=\"#03c35868-e2bb-4478-8a1d-ef28c65a5c2b\">Looking at Variable Names (Column Names)<\/a><\/li><li><a href=\"#6b93ec3a-4b7c-4bbc-b614-19db1522b9a1\">Accessing the Data in Your Table<\/a><\/li><li><a href=\"#ad0abf48-6ea0-4a55-aa1a-16e691f51c16\">Converting Data to Categorical Arrays<\/a><\/li><li><a href=\"#d2fb5f8a-77bb-48e9-a5ff-96ee9197fe54\">Creating a New Table<\/a><\/li><li><a href=\"#a3ed4439-bc11-4c11-a719-5d219c66a359\">Adding\/Removing Variables<\/a><\/li><li><a href=\"#ef9ff432-e43b-4862-9369-62f0ecaee699\">Removing Missing Data<\/a><\/li><li><a href=\"#decddc77-f98a-42bd-a02e-b43b153640d8\">Summarizing a 
Table<\/a><\/li><li><a href=\"#aea474c5-aeb1-4a99-bd2c-84d045b731db\">Sorting Data<\/a><\/li><li><a href=\"#05d8a9c4-c899-486e-9919-6368af425b89\">Applying Functions to Table Variables<\/a><\/li><li><a href=\"#bee50f00-c409-4921-b4d6-e1684c1fa2f2\">Joining (Merging) Tables<\/a><\/li><li><a href=\"#796cfe08-5654-416b-9ea6-4101724f7960\">Plotting Data From the Final Table<\/a><\/li><li><a href=\"#ae528e3a-39f9-44c3-9cd2-d9154f6983f9\">Your Thoughts?<\/a><\/li><\/ul><\/div><h4>What are Tables and Categorical Arrays?<a name=\"6f8f0253-8f83-411a-97c0-98c7bb7298cd\"><\/a><\/h4><p>Table is a new data type suitable for holding heterogeneous data and metadata. Specifically, tables are useful for mixed-type tabular data that are often stored as columns in a text file or in a spreadsheet. Tables consist of rows and column-oriented variables. Categorical arrays are useful for holding categorical data, that is, data whose values come from a finite list of discrete categories.<\/p><p>One of the best ways to learn more about tables and categorical arrays is to see them in action.  So, in this post, we will use tables and categoricals to examine some airplane flight delay data.  The flight data is freely available from the Bureau of Transportation Statistics (BTS). You can download it yourself <a href=\"http:\/\/www.transtats.bts.gov\/DL_SelectFields.asp?Table_ID=236\">here<\/a>. 
The weather data is from the National Climatic Data Center (NCDC) and is available <a title=\"http:\/\/cdo.ncdc.noaa.gov\/qclcd\/ (link no longer works)\">here<\/a>.<\/p><h4>Importing Data into a Table<a name=\"fe5c445a-1e68-4a44-ab40-d9cc90438539\"><\/a><\/h4><p>You can import your data into a table interactively using the Import Tool, or you can do it programmatically using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\"><tt>readtable<\/tt><\/a>.<\/p><pre class=\"codeinput\">FlightData = readtable(<span class=\"string\">'Jan2010Flights.csv'<\/span>);\r\n\r\nwhos <span class=\"string\">FlightData<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name                Size              Bytes  Class    Attributes\r\n\r\n  FlightData      17816x7             9007742  table              \r\n\r\n<\/pre><p>The entire contents of the file are now contained in a single variable &#8211; a table. Here you are reading in your data from a CSV file.  <tt>readtable<\/tt> also supports reading from .txt and .dat text files and from Excel spreadsheet files. Tables can also be created directly from variables in your workspace.<\/p><h4>Looking at Variable Names (Column Names)<a name=\"03c35868-e2bb-4478-8a1d-ef28c65a5c2b\"><\/a><\/h4><p>You can see all the variable names (column names in our table) by looking at the <tt>VariableNames<\/tt> property.<\/p><pre class=\"codeinput\">FlightData.Properties.VariableNames\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  Columns 1 through 5\r\n    'FL_DATE'    'CARRIER'    'ORIGIN'    'DEST'    'CRS_DEP_TIME'\r\n  Columns 6 through 7\r\n    'DEP_TIME'    'DEP_DELAY'\r\n<\/pre><p>This particular table does not contain any row names, but for a table with row names you can access the row names using the <tt>RowNames<\/tt> property.<\/p><h4>Accessing the Data in Your Table<a name=\"6b93ec3a-4b7c-4bbc-b614-19db1522b9a1\"><\/a><\/h4><p>There are multiple ways to access the data in your table. 
You can use dot indexing to access or modify a single table variable, similar to how you use field names in structures.<\/p><p>For example, using dot indexing you can plot a histogram of the departure delays (in minutes).<\/p><pre class=\"codeinput\">hist(FlightData.DEP_DELAY)\r\ntitle(<span class=\"string\">'Histogram of Flight Delays in Minutes'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/FlightTableIntroFinal_01.png\" alt=\"\"> <p>You can also display the first 5 departure delays.<\/p><pre class=\"codeinput\">FlightData.DEP_DELAY(1:5)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n    49\r\n    -7\r\n    -5\r\n    -8\r\n   -10\r\n<\/pre><p>You can also extract data from one or more variables in the table using curly braces.  Within the curly braces you can use numeric indexing or variable and row names. For example, you can extract the actual departure times and scheduled departure times for the first 5 flights.<\/p><pre class=\"codeinput\">SomeTimes = FlightData{1:5,{<span class=\"string\">'DEP_TIME'<\/span>,<span class=\"string\">'CRS_DEP_TIME'<\/span>}};\r\ndisp(SomeTimes)\r\n<\/pre><pre class=\"codeoutput\">        1149        1100\r\n        1053        1100\r\n        1055        1100\r\n        1052        1100\r\n        1050        1100\r\n<\/pre><p>This is similar to indexing with cell arrays.  However, <b>unlike<\/b> with cells, this concatenates the specified variables into a single array. Therefore, the data types of all the specified variables need to be compatible for concatenation.<\/p><h4>Converting Data to Categorical Arrays<a name=\"ad0abf48-6ea0-4a55-aa1a-16e691f51c16\"><\/a><\/h4><p>You can convert some of the variables in your table to categorical arrays using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/categorical.html\"><tt>categorical<\/tt><\/a>. Categorical arrays are more memory efficient than cell arrays of strings when you have repeated data. 
Categorical arrays store only one copy of each category name, reducing the amount of memory required to store the array. You can use <tt>whos<\/tt> to see the amount of memory you save by converting the data to a categorical array.<\/p><pre class=\"codeinput\">whos <span class=\"string\">FlightData<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name                Size              Bytes  Class    Attributes\r\n\r\n  FlightData      17816x7             9007742  table              \r\n\r\n<\/pre><pre class=\"codeinput\">FlightData.ORIGIN = categorical(FlightData.ORIGIN);\r\nFlightData.DEST = categorical(FlightData.DEST);\r\nFlightData.CARRIER = categorical(FlightData.CARRIER);\r\n\r\nwhos <span class=\"string\">FlightData<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name                Size              Bytes  Class    Attributes\r\n\r\n  FlightData      17816x7             2857264  table              \r\n\r\n<\/pre><p>Using <tt>categories<\/tt>, you can find all the distinct categories in your array.  By default, categorical arrays are not ordinal. If your data contains categories with a definite order, you can set the <tt>'Ordinal'<\/tt> flag to <tt>true<\/tt> when creating your categorical array.  The order then defaults to alphabetical, but you can prescribe your own order instead.<\/p><pre class=\"codeinput\">categories(FlightData.CARRIER)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    '9E'\r\n    'AA'\r\n    'AS'\r\n    'B6'\r\n    'CO'\r\n    'DL'\r\n    'F9'\r\n    'FL'\r\n    'MQ'\r\n    'OH'\r\n    'UA'\r\n    'US'\r\n    'WN'\r\n    'XE'\r\n    'YV'\r\n<\/pre><p>Categorical arrays are also faster and more convenient than cell arrays of strings for indexing and searching. Once your data is converted to categorical arrays, you can compare sets of strings with relational operators, just as you would with numeric values. 
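<\/p><p>As a small illustration of such comparisons, here is a sketch using a made-up carrier list (the variable names <tt>carriers<\/tt>, <tt>isDelta<\/tt>, <tt>numDelta<\/tt> and <tt>others<\/tt> are hypothetical and not part of the flight data). Comparing a categorical array with a string using <tt>==<\/tt> returns a logical array that can be counted or used as an index.<\/p><pre class=\"codeinput\">% Hypothetical example data, not taken from the flight file\r\ncarriers = categorical({<span class=\"string\">'DL'<\/span>;<span class=\"string\">'AA'<\/span>;<span class=\"string\">'DL'<\/span>;<span class=\"string\">'UA'<\/span>});\r\n\r\n% Element-wise comparison against a category name gives a logical array\r\nisDelta = carriers == <span class=\"string\">'DL'<\/span>;\r\n\r\n% Count the matches, then pick out the entries that did not match\r\nnumDelta = nnz(isDelta)\r\nothers = carriers(~isDelta)\r\n<\/pre><p>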
You can use this functionality to create a new table containing only the flights that left from Boston.<\/p><h4>Creating a New Table<a name=\"d2fb5f8a-77bb-48e9-a5ff-96ee9197fe54\"><\/a><\/h4><p>You can create a new table from a section of an existing table using parentheses with numerical indexing, variable names, or row names. Since the flight origin is now a categorical array, you can use logical indexing to find all flights that left from Boston.<\/p><pre class=\"codeinput\">idxBoston = FlightData.ORIGIN == <span class=\"string\">'BOS'<\/span> ;\r\nBostonFlights = FlightData(idxBoston,:);\r\n\r\nheight(FlightData)\r\nheight(BostonFlights)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n       17816\r\nans =\r\n        8904\r\n<\/pre><h4>Adding\/Removing Variables<a name=\"a3ed4439-bc11-4c11-a719-5d219c66a359\"><\/a><\/h4><p>You can also modify your table by adding and removing variables and rows. All variables in a table must have the same number of rows, but they can be of different widths.<\/p><p>Let's add a new variable (DATE) to represent the serial date number for the various flight dates.<\/p><pre class=\"codeinput\">BostonFlights.DATE = datenum(BostonFlights.FL_DATE);\r\n<\/pre><p>The origin now has all <tt>BOS<\/tt> values and you are not going to use destination information right now, so those variables can be removed. <tt>HOUR<\/tt> can also be calculated, as well as a <tt>LATE<\/tt> variable which indicates whether the flight was more than 15 minutes late.<\/p><pre class=\"codeinput\">BostonFlights.ORIGIN = [];\r\nBostonFlights.DEST = [];\r\nBostonFlights.FL_DATE = [];\r\n\r\nBostonFlights.HOUR = floor(BostonFlights.CRS_DEP_TIME.\/100);\r\nBostonFlights{:,<span class=\"string\">'LATE'<\/span>} = BostonFlights.DEP_DELAY &gt; 15;\r\n<\/pre><h4>Removing Missing Data<a name=\"ef9ff432-e43b-4862-9369-62f0ecaee699\"><\/a><\/h4><p>Tables support functions for finding and standardizing missing data.  
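<\/p><p>To see what <tt>ismissing<\/tt> returns, here is a minimal sketch on a tiny made-up table (the table <tt>T<\/tt>, its variables <tt>x<\/tt> and <tt>s<\/tt>, and the names <tt>missingElements<\/tt> and <tt>rowHasMissing<\/tt> are hypothetical). It returns one logical value per table element, flagging <tt>NaN<\/tt> in numeric variables and empty strings in cell arrays of strings, so <tt>any(...,2)<\/tt> marks each row that has at least one missing value.<\/p><pre class=\"codeinput\">% Hypothetical example data, not taken from the flight file\r\nT = table([1;NaN;3],{<span class=\"string\">'a'<\/span>;<span class=\"string\">''<\/span>;<span class=\"string\">'c'<\/span>},<span class=\"string\">'VariableNames'<\/span>,{<span class=\"string\">'x'<\/span>,<span class=\"string\">'s'<\/span>});\r\n\r\n% One logical value per table element; true marks NaN or an empty string\r\nmissingElements = ismissing(T)\r\n\r\n% Rows that contain at least one missing value\r\nrowHasMissing = any(missingElements,2)\r\n\r\n% Keep only the complete rows\r\nT(~rowHasMissing,:)\r\n<\/pre><p>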
In this case,  you can find any missing data using <tt>ismissing<\/tt> and remove it.  You can use <tt>height<\/tt>, which gives you the number of table rows, to see how many flights were removed from the table.  We exploit logical indexing to get only the flights that have no missing data.<\/p><pre class=\"codeinput\">height(BostonFlights)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n        8904\r\n<\/pre><pre class=\"codeinput\">TF = any(ismissing(BostonFlights),2);\r\nBostonFlights = BostonFlights(~TF,:);\r\nheight(BostonFlights)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n        8640\r\n<\/pre><h4>Summarizing a Table<a name=\"decddc77-f98a-42bd-a02e-b43b153640d8\"><\/a><\/h4><p>You can then see descriptive statistics for each variable in this new table by using <a href=\"\"><tt>summary<\/tt><\/a>.<\/p><pre class=\"codeinput\">summary(BostonFlights)\r\n<\/pre><pre class=\"codeoutput\">Variables:\r\n    CARRIER: 8640x1 categorical\r\n        Values:\r\n            9E      85     \r\n            AA     913     \r\n            AS      55     \r\n            B6    1726     \r\n            CO     321     \r\n            DL    1215     \r\n            F9      26     \r\n            FL     536     \r\n            MQ     751     \r\n            OH     341     \r\n            UA     643     \r\n            US    1494     \r\n            WN     384     \r\n            XE      93     \r\n            YV      57     \r\n    CRS_DEP_TIME: 8640x1 double\r\n        Values:\r\n            min        500          \r\n            median    1215          \r\n            max       2359          \r\n    DEP_TIME: 8640x1 double\r\n        Values:\r\n            min          2      \r\n            median    1224      \r\n            max       2400      \r\n    DEP_DELAY: 8640x1 double\r\n        Values:\r\n            min       -25        \r\n            median     -3        \r\n            max       419        \r\n    DATE: 8640x1 double\r\n        Values:\r\n            min       
7.3414e+05\r\n            median    7.3415e+05\r\n            max       7.3417e+05\r\n    HOUR: 8640x1 double\r\n        Values:\r\n            min        5    \r\n            median    12    \r\n            max       23    \r\n    LATE: 8640x1 logical\r\n        Values:\r\n            true     1471  \r\n            false    7169  \r\n<\/pre><h4>Sorting Data<a name=\"aea474c5-aeb1-4a99-bd2c-84d045b731db\"><\/a><\/h4><p>There are additional functions to sort tables, apply functions to table variables, and merge tables together. For example, you can sort your <tt>BostonFlights<\/tt> by departure delay.<\/p><pre class=\"codeinput\">BostonFlights = sortrows(BostonFlights,<span class=\"string\">'DEP_DELAY'<\/span>,<span class=\"string\">'descend'<\/span>);\r\nBostonFlights(1:10,:)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    CARRIER    CRS_DEP_TIME    DEP_TIME    DEP_DELAY       DATE       HOUR\r\n    _______    ____________    ________    _________    __________    ____\r\n    DL         1830             129        419          7.3416e+05    18  \r\n    DL         1850             111        381          7.3414e+05    18  \r\n    CO         1755              10        375          7.3414e+05    17  \r\n    AA         1710            2323        373          7.3416e+05    17  \r\n    DL          630            1240        370          7.3416e+05     6  \r\n    FL         1400            1951        351          7.3414e+05    14  \r\n    FL         1741            2330        349          7.3414e+05    17  \r\n    UA         1906              22        316          7.3414e+05    19  \r\n    AA          840            1355        315          7.3416e+05     8  \r\n    AA          905            1420        315          7.3414e+05     9  \r\n\r\n    LATE \r\n    _____\r\n    true \r\n    true \r\n    true \r\n    true \r\n    true \r\n    true \r\n    true \r\n    true \r\n    true \r\n    true \r\n<\/pre><h4>Applying Functions to Table Variables<a 
name=\"05d8a9c4-c899-486e-9919-6368af425b89\"><\/a><\/h4><p>You can apply functions to table variables with <tt>varfun<\/tt>.<\/p><p><tt>varfun<\/tt> has optional additional calling inputs such as <tt>'InputVariables'<\/tt> and <tt>'GroupingVariables'<\/tt>. <tt>'InputVariables'<\/tt> lets you specify which variables you want to operate on instead of operating on all the variables in your table. <tt>'GroupingVariables'<\/tt> lets you define groups of rows on which to operate. <tt>varfun<\/tt> then applies your function to each group of rows within each of the variables of your table, rather than to each entire variable.<\/p><p>You can use <tt>varfun<\/tt> to calculate the mean delay for all flights and the fraction of late flights for a given hour on a given day.  The default output of <tt>varfun<\/tt> is a table.<\/p><pre class=\"codeinput\">ByHour = varfun(@mean, BostonFlights, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'InputVariables'<\/span>, {<span class=\"string\">'DEP_DELAY'<\/span>, <span class=\"string\">'LATE'<\/span>},<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'GroupingVariables'<\/span>,{<span class=\"string\">'DATE'<\/span>,<span class=\"string\">'HOUR'<\/span>});\r\n\r\ndisp(ByHour(1:5,:))\r\n<\/pre><pre class=\"codeoutput\">                   DATE       HOUR    GroupCount    mean_DEP_DELAY    mean_LATE\r\n                __________    ____    __________    ______________    _________\r\n    734139_5    7.3414e+05    5        5               3.8                  0  \r\n    734139_6    7.3414e+05    6       19            10.579            0.31579  \r\n    734139_7    7.3414e+05    7       18            6.7778            0.11111  \r\n    734139_8    7.3414e+05    8       21            8.8571            0.33333  \r\n    734139_9    7.3414e+05    9       17            5.0588            0.23529  \r\n<\/pre><h4>Joining (Merging) Tables<a 
name=\"bee50f00-c409-4921-b4d6-e1684c1fa2f2\"><\/a><\/h4><p>Weather might have an important role in determining if a flight is delayed.  For a given hour, you might want to know both the delayed flight information and the weather at the airport.   So, you can start by reading in another table containing weather data for Boston Logan Airport. Then, you can merge that table with the existing <tt>ByHour<\/tt> table.<\/p><p>Since there are a lot of variables in this file, you can specify the input format when using <tt>readtable<\/tt>.  This allows you to use <tt>*<\/tt> to skip variables that you aren't interested in loading into the table. For more information about specifying formatting strings, see <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html#inputarg_formatSpec\">here<\/a> in the documentation.  Since this data uses 'M' to represent missing data, you can use <tt>'TreatAsEmpty'<\/tt> to replace any instances of 'M' with the standard missing value indicator (NaN for numeric values).<\/p><pre class=\"codeinput\">FormatStr = [<span class=\"string\">'%*s%s%f'<\/span> repmat(<span class=\"string\">'%*s'<\/span>,1,9) <span class=\"string\">'%f'<\/span> repmat(<span class=\"string\">'%*s'<\/span>,1,7) <span class=\"string\">'%f'<\/span>,<span class=\"keyword\">...<\/span>\r\n             repmat(<span class=\"string\">'%*s'<\/span>,1,3),<span class=\"string\">'%f'<\/span>, repmat(<span class=\"string\">'%*s'<\/span>,1,18)];\r\n\r\nWeatherData = readtable(<span class=\"string\">'BostonWeather.txt'<\/span>,<span class=\"string\">'HeaderLines'<\/span>,6,<span class=\"keyword\">...<\/span>\r\n                        <span class=\"string\">'Format'<\/span>,FormatStr,<span class=\"string\">'TreatAsEmpty'<\/span>,<span class=\"string\">'M'<\/span>);\r\n\r\nWeatherData.Properties.VariableNames\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    'Date'    'Time'    'DryBulbCelsius'    'DewPointCelsius'    'WindSpeed'\r\n<\/pre><p>WeatherData contains the 
date, time, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Dew_point\">dew point<\/a> and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Dry-bulb_temperature\">dry bulb temperature<\/a> in Celsius, and wind speed. Let's convert <tt>Date<\/tt> to a serial date number and round the time measurement down to the hour.<\/p><pre class=\"codeinput\">WeatherData.DATE = datenum(WeatherData.Date,<span class=\"string\">'yyyymmdd'<\/span>);\r\nWeatherData.Date = [];\r\nWeatherData.HOUR = floor(WeatherData.Time\/100);\r\n<\/pre><p>Since there are multiple weather measurements per hour, you can average the data by hour using <tt>varfun<\/tt>.<\/p><pre class=\"codeinput\">ByHourWeather = varfun(@mean, WeatherData, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'InputVariables'<\/span>, {<span class=\"string\">'DryBulbCelsius'<\/span>,<span class=\"string\">'DewPointCelsius'<\/span>,<span class=\"string\">'WindSpeed'<\/span>},<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'GroupingVariables'<\/span>,{<span class=\"string\">'DATE'<\/span>,<span class=\"string\">'HOUR'<\/span>});\r\n<\/pre><p>Now, you can merge the two tables using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/join.html\"><tt>join<\/tt><\/a>, which matches rows using key variables (columns) common to both tables. <tt>join<\/tt> keeps all the variables from the first input table and appends the corresponding variables from the second input table. The table that <tt>join<\/tt> creates will use the key variable values as the row names.  
In this case, that means the row names will represent the date and hour data.<\/p><pre class=\"codeinput\">AllData = join(ByHour,ByHourWeather,<span class=\"string\">'Keys'<\/span>,{<span class=\"string\">'DATE'<\/span>,<span class=\"string\">'HOUR'<\/span>});\r\n<\/pre><h4>Plotting Data From the Final Table<a name=\"796cfe08-5654-416b-9ea6-4101724f7960\"><\/a><\/h4><p>Let's now plot this final data set to get an idea of the effect of weather on the flight delays.<\/p><pre class=\"codeinput\">AllData.TDIFF =  <span class=\"keyword\">...<\/span>\r\n    abs(AllData.mean_DewPointCelsius - AllData.mean_DryBulbCelsius);\r\n\r\nscatter(AllData.TDIFF, AllData.mean_DEP_DELAY,<span class=\"keyword\">...<\/span>\r\n    [], AllData.mean_DEP_DELAY,<span class=\"string\">'filled'<\/span>);\r\nxlabel(<span class=\"string\">'abs(DewPoint-Temperature)'<\/span>)\r\nylabel(<span class=\"string\">'Average Departure Delay'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/FlightTableIntroFinal_02.png\" alt=\"\"> <pre class=\"codeinput\">scatter3( AllData.HOUR,AllData.TDIFF,AllData.mean_DEP_DELAY,<span class=\"keyword\">...<\/span>\r\n    [],AllData.mean_DEP_DELAY,<span class=\"string\">'filled'<\/span>);\r\nxlabel(<span class=\"string\">'Hour of Flight'<\/span>)\r\nylabel(<span class=\"string\">'abs(DewPoint-Temperature)'<\/span>)\r\nzlabel(<span class=\"string\">'Average Departure Delay'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/FlightTableIntroFinal_03.png\" alt=\"\"> <p>Qualitatively, it looks like having a temperature near the dew point (greater chance of precipitation) affects the departure time.  
There are other factors at work as well, but it is nice to know that our intuition (flights later in the day and when it is cold and snowing might have a greater chance to be delayed) seems to work with the data.<\/p><h4>Your Thoughts?<a name=\"ae528e3a-39f9-44c3-9cd2-d9154f6983f9\"><\/a><\/h4><p>Can you see yourself using tables and categorical arrays? Let us know what you think or if you have any questions by leaving a comment <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=776#respond\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_86c049679c9c4207b35ad797a72b6608() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='86c049679c9c4207b35ad797a72b6608 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 86c049679c9c4207b35ad797a72b6608';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2013 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_86c049679c9c4207b35ad797a72b6608()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2013b<br><\/p><\/div><!--\r\n86c049679c9c4207b35ad797a72b6608 ##### SOURCE BEGIN #####\r\n%% Introduction to the New MATLAB Data Types in R2013b\r\n% Today I'd like to introduce a fairly frequent guest blogger\r\n% <mailto:sarah.zaranek@mathworks.com Sarah Wait Zaranek> who works for the\r\n% MATLAB Marketing team here at The MathWorks. She and I will be writing\r\n% about the new capabilities for MATLAB in R2013b. In particular, there are\r\n% two new data types in MATLAB in R2013b \u2013\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/table.html table> and\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/categorical.html categorical>\r\n% arrays.  \r\n%\r\n%% What are Tables and Categorical Arrays?\r\n% \r\n% Table is a new data type suitable for holding heterogeneous data and\r\n% metadata. 
Specifically, tables are useful for mixed-type tabular data\r\n% that are often stored as columns in a text file or in a spreadsheet.\r\n% Tables consist of rows and column-oriented variables. Categorical arrays\r\n% are useful for holding categorical data, that is, data whose values come\r\n% from a finite list of discrete categories.\r\n%\r\n% One of the best ways to learn more about tables and categorical arrays is\r\n% to see them in action.  So, in this post, we will use tables and\r\n% categoricals to examine some airplane flight delay data.  The flight data\r\n% is freely available from the Bureau of Transportation Statistics\r\n% (BTS). You can download it yourself\r\n% <http:\/\/www.transtats.bts.gov\/DL_SelectFields.asp?Table_ID=236 here>. The\r\n% weather data is from the National Climatic Data Center (NCDC) and is available\r\n% <http:\/\/cdo.ncdc.noaa.gov\/qclcd\/ here>.\r\n\r\n%% Importing Data into a Table\r\n%\r\n% You can import your data into a table interactively using the\r\n% Import Tool,\r\n% or you can do it programmatically using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html |readtable|>.\r\n\r\nFlightData = readtable('Jan2010Flights.csv');\r\n\r\nwhos FlightData\r\n\r\n%%\r\n% The entire contents of the file are now contained in a single variable \u2013\r\n% a table. Here you are reading in your data from a CSV file.  |readtable|\r\n% also supports reading from .txt and .dat text files and from Excel\r\n% spreadsheet files. Tables can also be created directly from variables in\r\n% your workspace.\r\n%\r\n\r\n%% Looking at Variable Names (Column Names) \r\n%\r\n% You can see all the variable names (column names in our table) by looking\r\n% at the |VariableNames| property. 
\r\n\r\nFlightData.Properties.VariableNames\r\n\r\n%%\r\n% This particular table does not contain any row names, but for a table\r\n% with row names you can access the row names using the |RowNames|\r\n% property.\r\n\r\n%% Accessing the Data in Your Table\r\n% There are multiple ways to access the data in your table. You can use dot\r\n% indexing to access or modify a single table variable, similar to how you\r\n% use field names in structures.\r\n%\r\n% For example, using dot indexing you can plot a histogram of the departure\r\n% delays (in minutes).\r\n\r\nhist(FlightData.DEP_DELAY)\r\ntitle('Histogram of Flight Delays in Minutes')\r\n\r\n%%\r\n% You can also display the first 5 departure delays.\r\n\r\nFlightData.DEP_DELAY(1:5)\r\n\r\n%%\r\n% You can also extract data from one or more variables in the table using\r\n% curly braces.  Within the curly braces you can use numeric indexing or\r\n% variable and row names. For example, you can extract the actual departure\r\n% times and scheduled departure times for the first 5 flights.\r\n\r\nSomeTimes = FlightData{1:5,{'DEP_TIME','CRS_DEP_TIME'}};\r\ndisp(SomeTimes)\r\n\r\n%%\r\n% This is similar to indexing with cell arrays.  However, *unlike* with\r\n% cells, this concatenates the specified variables into a single array.\r\n% Therefore, the data types of all the specified variables need to be\r\n% compatible for concatenation.\r\n\r\n%% Converting Data to Categorical Arrays\r\n% You can convert some of the variables in your table to categorical arrays\r\n% using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/categorical.html\r\n% |categorical|>. Categorical arrays are more memory efficient than\r\n% cell arrays of strings when you have repeated data. Categorical arrays\r\n% store only one copy of each category name, reducing the amount of memory\r\n% required to store the array. 
You can use |whos| to see the amount of\r\n% memory you save by converting the data to a categorical array.\r\n\r\nwhos FlightData\r\n%%\r\nFlightData.ORIGIN = categorical(FlightData.ORIGIN);\r\nFlightData.DEST = categorical(FlightData.DEST);\r\nFlightData.CARRIER = categorical(FlightData.CARRIER);\r\n\r\nwhos FlightData\r\n\r\n%%\r\n% Using |categories|, you can find all the distinct categories in your\r\n% array.  By default, categorical arrays are not ordinal. If\r\n% your data contains categories with a definite order, you can set the\r\n% |'Ordinal'| flag to |true| when creating your categorical array.  The\r\n% order then defaults to alphabetical, but you can prescribe your own\r\n% order instead.\r\n\r\ncategories(FlightData.CARRIER)\r\n\r\n%%\r\n% Categorical arrays are also faster and more convenient than cell arrays\r\n% of strings for indexing and searching. Once your data is converted to\r\n% categorical arrays, you can compare sets of strings with relational\r\n% operators, just as you would with numeric values. You can use this\r\n% functionality to create a new table containing only the flights that\r\n% left from Boston.\r\n\r\n%% Creating a New Table\r\n% You can create a new table from a section of an existing table using\r\n% parentheses with numerical indexing, variable names, or row names. Since\r\n% the flight origin is now a categorical array, you can use logical\r\n% indexing to find all flights that left from Boston.\r\n\r\nidxBoston = FlightData.ORIGIN == 'BOS' ;\r\nBostonFlights = FlightData(idxBoston,:);\r\n\r\nheight(FlightData)\r\nheight(BostonFlights)\r\n\r\n%% Adding\/Removing Variables \r\n% You can also modify your table by adding and removing variables and rows.\r\n% All variables in a table must have the same number of rows, but they can\r\n% be of different widths. 
\r\n%\r\n% Let's add a new variable (DATE) to represent the serial date number for\r\n% the various flight dates.\r\n\r\nBostonFlights.DATE = datenum(BostonFlights.FL_DATE);\r\n\r\n%%\r\n% The origin now has all |BOS| values and you are not going to use\r\n% destination information right now, so those variables can be removed.\r\n% |HOUR| can also be calculated, as well as a |LATE| variable which\r\n% indicates whether the flight was more than 15 minutes late.\r\n\r\nBostonFlights.ORIGIN = [];\r\nBostonFlights.DEST = [];\r\nBostonFlights.FL_DATE = [];\r\n\r\nBostonFlights.HOUR = floor(BostonFlights.CRS_DEP_TIME.\/100);\r\nBostonFlights{:,'LATE'} = BostonFlights.DEP_DELAY > 15;\r\n\r\n%% Removing Missing Data\r\n% Tables support functions for finding and standardizing missing\r\n% data.  In this case,  you can find any missing data using |ismissing|\r\n% and remove it.  You can use |height|, which gives you the number\r\n% of table rows, to see how many flights were removed from the table.  We\r\n% exploit logical indexing to get only the flights that have no missing\r\n% data.\r\n\r\nheight(BostonFlights)\r\n%%\r\nTF = any(ismissing(BostonFlights),2);\r\nBostonFlights = BostonFlights(~TF,:);\r\nheight(BostonFlights)\r\n\r\n%% Summarizing a Table\r\n% You can then see descriptive statistics for each variable in this new\r\n% table by using\r\n% |summary|.\r\n\r\nsummary(BostonFlights)\r\n\r\n%% Sorting Data\r\n% There are additional functions to sort tables, apply functions to table\r\n% variables, and merge tables together. For example, you can sort your\r\n% |BostonFlights| by departure delay.\r\n\r\nBostonFlights = sortrows(BostonFlights,'DEP_DELAY','descend');\r\nBostonFlights(1:10,:)\r\n\r\n%% Applying Functions to Table Variables\r\n% You can apply functions to table variables with\r\n% |varfun|.\r\n% \r\n% |varfun| has optional additional calling inputs such as\r\n% |'InputVariables'| and |'GroupingVariables'|. 
|'InputVariables'| lets you\r\n% specify which variables you want to operate on instead of\r\n% operating on all the variables in your table. |'GroupingVariables'| lets\r\n% you define groups of rows on which to operate. |varfun| would then apply\r\n% your function to each group of rows within each of the variables of your\r\n% table, rather than to each entire variable.\r\n%\r\n% You can use |varfun| to calculate the mean delay for all flights and the\r\n% fraction of late flights for a given hour on a given day.  The default\r\n% output of |varfun| is a table.\r\n\r\nByHour = varfun(@mean, BostonFlights, ...\r\n    'InputVariables', {'DEP_DELAY', 'LATE'},...\r\n    'GroupingVariables',{'DATE','HOUR'});\r\n\r\ndisp(ByHour(1:5,:))\r\n\r\n%% Joining (Merging) Tables \r\n% Weather might have an important role in determining if a flight is\r\n% delayed.  For a given hour, you might want to know both the delayed\r\n% flight information and the weather at the airport.   So, you can start by\r\n% reading in another table containing weather data for Boston Logan\r\n% Airport. Then, you can merge that table with the existing |ByHour| table.\r\n%\r\n% Since there are a lot of variables in this file, you can specify the\r\n% input format when using |readtable|.  This allows you to use |*| to skip\r\n% variables that you aren't interested in loading into the table. For more\r\n% information about specifying formatting strings, see\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html#inputarg_formatSpec\r\n% here> in the documentation.  
Since this data uses 'M' to represent\r\n% missing data, you can use |'TreatAsEmpty'| to replace any instances of\r\n% 'M' with the standard missing value indicator (NaN for numeric values).\r\n\r\nFormatStr = ['%*s%s%f' repmat('%*s',1,9) '%f' repmat('%*s',1,7) '%f',...\r\n             repmat('%*s',1,3),'%f', repmat('%*s',1,18)];\r\n         \r\nWeatherData = readtable('BostonWeather.txt','HeaderLines',6,...\r\n                        'Format',FormatStr,'TreatAsEmpty','M');  \r\n\r\nWeatherData.Properties.VariableNames\r\n\r\n%%\r\n% WeatherData contains the date, time,\r\n% <http:\/\/en.wikipedia.org\/wiki\/Dew_point dew point> and\r\n% <http:\/\/en.wikipedia.org\/wiki\/Dry-bulb_temperature dry bulb\r\n% temperature> in Celsius, and wind speed. Let's convert |Date| to a serial\r\n% date number and round to the hour for the time measurement. \r\n\r\nWeatherData.DATE = datenum(WeatherData.Date,'yyyymmdd');\r\nWeatherData.Date = [];\r\nWeatherData.HOUR = floor(WeatherData.Time\/100);\r\n\r\n%%\r\n% Since there are multiple weather measurements per hour, you can\r\n% average the data by hour using |varfun|.\r\n\r\nByHourWeather = varfun(@mean, WeatherData, ...\r\n    'InputVariables', {'DryBulbCelsius','DewPointCelsius','WindSpeed'},...\r\n    'GroupingVariables',{'DATE','HOUR'});\r\n\r\n%%\r\n% Now, you can merge the two tables using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/join.html |join|> which matches\r\n% rows using key variables (columns) common to both tables. |join| keeps\r\n% all the variables from the first input table and appends the\r\n% corresponding variables from the second input table. The table that\r\n% |join| creates will use the key variable values as the row names.  
In\r\n% this case, that means the row names will represent the date and hour\r\n% data.\r\n\r\nAllData = join(ByHour,ByHourWeather,'Keys',{'DATE','HOUR'});\r\n\r\n%% Plotting Data From the Final Table\r\n% Let's now plot this final data set to get an idea of the effect of\r\n% weather on the flight delays.\r\n\r\nAllData.TDIFF =  ...\r\n    abs(AllData.mean_DewPointCelsius - AllData.mean_DryBulbCelsius);\r\n\r\nscatter(AllData.TDIFF, AllData.mean_DEP_DELAY,...\r\n    [], AllData.mean_DEP_DELAY,'filled');\r\nxlabel('abs(DewPoint-Temperature)')\r\nylabel('Average Departure Delay')\r\n\r\n%%\r\nscatter3( AllData.HOUR,AllData.TDIFF,AllData.mean_DEP_DELAY,...\r\n    [],AllData.mean_DEP_DELAY,'filled');\r\nxlabel('Hour of Flight')\r\nylabel('abs(DewPoint-Temperature)')\r\nzlabel('Average Departure Delay')\r\n\r\n%%\r\n% Qualitatively it looks like having a temperature near the dew point (greater\r\n% chance of precipitation) affects the departure time.  There are other\r\n% factors at work as well, but it is nice to know that our intuition\r\n% (flights later in the day and when it is cold and snowing might have a\r\n% greater chance to be delayed) seems to agree with the data.\r\n\r\n%% Your Thoughts?\r\n% Can you see yourself using tables and categorical arrays? Let us know\r\n% what you think or if you have any questions by leaving a comment\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=776#respond here>.\r\n##### SOURCE END ##### 86c049679c9c4207b35ad797a72b6608\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/FlightTableIntroFinal_03.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>Today I&#8217;d like to introduce a fairly frequent guest blogger <a href=\"mailto:sarah.zaranek@mathworks.com\">Sarah Wait Zaranek<\/a> who works for the MATLAB Marketing team here at The MathWorks. 
She and I will be writing about the new capabilities for MATLAB in R2013b. In particular, there are two new data types in MATLAB in R2013b &#8211; <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/table.html\">table<\/a> and <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/categorical.html\">categorical<\/a> arrays.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2013\/09\/10\/introduction-to-the-new-matlab-data-types-in-r2013b\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[57,6],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/776"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=776"}],"version-history":[{"count":7,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/776\/revisions"}],"predecessor-version":[{"id":2179,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/776\/revisions\/2179"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=776"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=776"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=776"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}