{"id":3597,"date":"2020-03-16T08:55:01","date_gmt":"2020-03-16T13:55:01","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=3597"},"modified":"2020-03-29T11:55:39","modified_gmt":"2020-03-29T16:55:39","slug":"analyzing-novel-corona-virus-covid-19-dataset","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2020\/03\/16\/analyzing-novel-corona-virus-covid-19-dataset\/","title":{"rendered":"Analyzing Novel Corona Virus COVID-19 Dataset"},"content":{"rendered":"\r\n\r\n<div class=\"content\"><!--introduction--><p>As the threat of novel corona virus COVID-19 spreads through the world, we live in an increasingly anxious time. While healthcare workers fight the virus in the front line, we do our part by practicing social distancing to slow the pandemic. Today's guest blogger, <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521\">Toshi Takeuchi<\/a>, would like to share how he spends his time by analyzing data in MATLAB.<\/p><p><b>Disclaimer: this post is NOT a valid and credible source of information for COVID-19, which is a serious threat, and you should consult with authoritative sources for accurate information, such as WHO or CDC.<\/b><\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#2eb18c50-4433-4bca-8de4-b804bf3cc092\">COVID-19 Data Source<\/a><\/li><li><a href=\"#d5b8cf4e-fbae-468a-b398-49b72f20c983\">Mapping Confirmed Cases Globally<\/a><\/li><li><a href=\"#691a2c81-b7c7-4537-afe5-8ee4fa287609\">Mapping Confirmed Cases in the US<\/a><\/li><li><a href=\"#c7dd6e33-af68-4963-a5f5-85fd61303c4d\">Ranking Country\/Region by Confirmed Cases<\/a><\/li><li><a href=\"#a9d18469-fd0a-4989-a08b-9d63d462ca37\">Growth of Confirmed Cases by Country\/Region<\/a><\/li><li><a href=\"#8f70dc76-594e-43aa-ad61-afa5bdc1e3b8\">Growth of New Cases by Country\/Region<\/a><\/li><li><a href=\"#13386c61-3df6-49d9-8bf5-46bb310e7199\">Closer look at Mainland China<\/a><\/li><li><a 
href=\"#0ab12735-4b6a-45c3-9dd9-85de03ae19ef\">Fitting a Curve<\/a><\/li><li><a href=\"#77b7c02a-434a-4926-9006-da9be29e7e2f\">What about South Korea?<\/a><\/li><li><a href=\"#72456cd3-6146-49ca-92ab-6310179396f2\">Summary<\/a><\/li><\/ul><\/div><h4>COVID-19 Data Source<a name=\"2eb18c50-4433-4bca-8de4-b804bf3cc092\"><\/a><\/h4><p>As we hear the news of novel corona virus COVID-19 day after day and we start practicing social distancing, I needed to find a way to calm my nerves. Am I the only one who finds data analysis in MATLAB a meditative exercise? Then why not analyze COVID-19, I asked myself.<\/p><p>I looked in the File Exchange and found <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/74076-track-the-global-spread-of-wuhan-coronavirus-in-matlab\">this File Exchange submission about COVID-19 by Kevin Chng<\/a>. I also found <a href=\"https:\/\/www.kaggle.com\/sudalairajkumar\/novel-corona-virus-2019-dataset\">Novel Corona Virus 2019 Dataset<\/a> on Kaggle. I decided to use the dataset from Kaggle.<\/p><p>I downloaded the zip file from Kaggle and moved its contents to my current working directory.<\/p><p>Let's check the unzipped files. 
Please note that \"|2019_nCoV_data.csv|\" is obsolete and we shouldn't use it.<\/p><pre class=\"codeinput\">s = dir(<span class=\"string\">\"*.csv\"<\/span>);\r\ns = s(arrayfun(@(x) ~matches(x.name,<span class=\"string\">\"2019_nCoV_data.csv\"<\/span>), s));\r\nfilenames = arrayfun(@(x) string(x.name), s)\r\n<\/pre><pre class=\"codeoutput\">filenames = \r\n  6&times;1 string array\r\n    \"COVID19_line_list_data.csv\"\r\n    \"COVID19_open_line_list.csv\"\r\n    \"covid_19_data.csv\"\r\n    \"time_series_covid_19_confirmed.csv\"\r\n    \"time_series_covid_19_deaths.csv\"\r\n    \"time_series_covid_19_recovered.csv\"\r\n<\/pre><div><ul><li><tt>covid_19_data.csv<\/tt> - this is the main file - daily level data of global cases by province\/state, from Jan 22, 2020<\/li><li><tt>time_series_covid_19_confirmed.csv<\/tt> - time series data of confirmed cases<\/li><li><tt>time_series_covid_19_deaths.csv<\/tt> - time series data of cumulative number of deaths<\/li><li><tt>time_series_covid_19_recovered.csv<\/tt> - time series data of cumulative number of recovered cases<\/li><li><tt>COVID19_line_list_data.csv<\/tt> - individual level information<\/li><li><tt>COVID19_open_line_list.csv<\/tt> - individual level information<\/li><\/ul><\/div><h4>Mapping Confirmed Cases Globally<a name=\"d5b8cf4e-fbae-468a-b398-49b72f20c983\"><\/a><\/h4><p>Let's visualize the number of confirmed cases on a map. We start by loading <tt>time_series_covid_19_confirmed.csv<\/tt> which contains latitude and longitude variables we need for mapping. 
I also decided to keep the variable names as is, rather than letting MATLAB convert them to valid MATLAB identifiers, because some of the column names are dates.<\/p><pre class=\"codeinput\">opts = detectImportOptions(filenames(4), <span class=\"string\">\"TextType\"<\/span>,<span class=\"string\">\"string\"<\/span>);\r\nopts.VariableNamesLine = 1;\r\nopts.DataLines = [2,inf];\r\nopts.PreserveVariableNames = true;\r\ntimes_conf = readtable(filenames(4),opts);\r\n<\/pre><p>The dataset contains a <tt>Province\/State<\/tt> variable, but we want to aggregate the data at the <tt>Country\/Region<\/tt> level. Before we do so, we need to clean up the data a bit. Please note that I have used the () notation because the variable names are not valid MATLAB identifiers.<\/p><pre class=\"codeinput\">times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"China\"<\/span>) = <span class=\"string\">\"Mainland China\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"Czechia\"<\/span>) = <span class=\"string\">\"Czech Republic\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"Iran (Islamic Republic of)\"<\/span>) = <span class=\"string\">\"Iran\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"Republic of Korea\"<\/span>) = <span class=\"string\">\"Korea, South\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"Republic of Moldova\"<\/span>) = <span class=\"string\">\"Moldova\"<\/span>;\r\ntimes_conf.(<span 
class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"Russian Federation\"<\/span>) = <span class=\"string\">\"Russia\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"Taipei and environs\"<\/span>) = <span class=\"string\">\"Taiwan\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"Taiwan*\"<\/span>) = <span class=\"string\">\"Taiwan\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"United Kingdom\"<\/span>) = <span class=\"string\">\"UK\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"Viet Nam\"<\/span>) = <span class=\"string\">\"Vietnam\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Province\/State\"<\/span>) == <span class=\"string\">\"St Martin\"<\/span>) = <span class=\"string\">\"St Martin\"<\/span>;\r\ntimes_conf.(<span class=\"string\">\"Country\/Region\"<\/span>)(times_conf.(<span class=\"string\">\"Province\/State\"<\/span>) == <span class=\"string\">\"Saint Barthelemy\"<\/span>) = <span class=\"string\">\"Saint Barthelemy\"<\/span>;\r\n<\/pre><p>Now we can use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/double.groupsummary.html\"><tt>groupsummary<\/tt><\/a> to aggregate the data by <tt>Country\/Region<\/tt> by summing the confirmed cases and averaging the latitudes and longitudes.<\/p><pre class=\"codeinput\">vars = times_conf.Properties.VariableNames;\r\ntimes_conf_country = groupsummary(times_conf,<span 
class=\"string\">\"Country\/Region\"<\/span>,{<span class=\"string\">'sum'<\/span>,<span class=\"string\">'mean'<\/span>},vars(3:end));\r\n<\/pre><p>The output contains unnecessary columns, such as sums of latitudes and longitudes or means of confirmed cases. Let's remove those variables, and also remove <tt>'sum_'<\/tt> or <tt>'mean_'<\/tt> prefixes from the variables we keep.<\/p><pre class=\"codeinput\">vars = times_conf_country.Properties.VariableNames;\r\nvars = regexprep(vars,<span class=\"string\">\"^(sum_)(?=L(a|o))\"<\/span>,<span class=\"string\">\"remove_\"<\/span>);\r\nvars = regexprep(vars,<span class=\"string\">\"^(mean_)(?=[0-9])\"<\/span>,<span class=\"string\">\"remove_\"<\/span>);\r\nvars = erase(vars,{<span class=\"string\">'sum_'<\/span>,<span class=\"string\">'mean_'<\/span>});\r\ntimes_conf_country.Properties.VariableNames = vars;\r\ntimes_conf_country = removevars(times_conf_country,[{<span class=\"string\">'GroupCount'<\/span>},vars(contains(vars,<span class=\"string\">\"remove_\"<\/span>))]);\r\n<\/pre><p>Because the case count in Mainland China is so disproportionately large, we want to exclude it from our visualization.<\/p><pre class=\"codeinput\">times_conf_exChina = times_conf_country(times_conf_country.(<span class=\"string\">\"Country\/Region\"<\/span>) ~= <span class=\"string\">\"Mainland China\"<\/span>,:);\r\nvars = times_conf_exChina.Properties.VariableNames;\r\n<\/pre><p>Let's use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/geobubble.html\"><tt>geobubble<\/tt><\/a> to visualize the first and the last dates in the dataset. Since the column names of the numeric data are dates, I can simply pick the first date and the last date to show the maps together. 
Please note that <tt>geobubble<\/tt> would show a bubble for zero values, and therefore we need to remove rows with zero values if we don't want to show bubbles for zero cases.<\/p><pre class=\"codeinput\">figure\r\nt = tiledlayout(<span class=\"string\">\"flow\"<\/span>);\r\n<span class=\"keyword\">for<\/span> ii = [4, length(vars)]\r\n    times_conf_exChina.Category = categorical(repmat(<span class=\"string\">\"&lt;100\"<\/span>,height(times_conf_exChina),1));\r\n    times_conf_exChina.Category(table2array(times_conf_exChina(:,ii)) &gt;= 100) = <span class=\"string\">\"&gt;=100\"<\/span>;\r\n    nexttile\r\n    tbl = times_conf_exChina(:,[1:3, ii, end]);\r\n    tbl(tbl.(4) == 0,:) = [];\r\n    gb = geobubble(tbl,<span class=\"string\">\"Lat\"<\/span>,<span class=\"string\">\"Long\"<\/span>,<span class=\"string\">\"SizeVariable\"<\/span>,vars(ii),<span class=\"string\">\"ColorVariable\"<\/span>,<span class=\"string\">\"Category\"<\/span>);\r\n    gb.BubbleColorList = [1,0,1;1,0,0];\r\n    gb.LegendVisible = <span class=\"string\">\"off\"<\/span>;\r\n    gb.Title = <span class=\"string\">\"As of \"<\/span> + vars(ii);\r\n    gb.SizeLimits = [0, max(times_conf_exChina.(vars{length(vars)}))];\r\n    gb.MapCenter = [21.6385   36.1666];\r\n    gb.ZoomLevel = 0.3606;\r\n<span class=\"keyword\">end<\/span>\r\ntitle(t,[<span class=\"string\">\"COVID-19 Confirmed Cases outside Mainland China\"<\/span>; <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">\"Country\/Region with 100+ cases highlighted in red\"<\/span>])\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_01.png\" alt=\"\"> <p>We can see that it initially only affected the countries\/regions surrounding Mainland China, but since then there have been massive outbreaks in South Korea, Italy, and Iran. 
It is also worth noting that we already had confirmed cases in the US as early as January 22, 2020.<\/p><h4>Mapping Confirmed Cases in the US<a name=\"691a2c81-b7c7-4537-afe5-8ee4fa287609\"><\/a><\/h4><p>Since I live in Boston, I'm interested in more local cases. Let's go down to the <tt>Province\/State<\/tt> level in the US.<\/p><pre class=\"codeinput\">times_conf_us = times_conf((times_conf.(<span class=\"string\">\"Country\/Region\"<\/span>) == <span class=\"string\">\"US\"<\/span>),:);\r\ntimes_conf_us(times_conf_us.(<span class=\"string\">\"Province\/State\"<\/span>) == <span class=\"string\">\"Diamond Princess\"<\/span>,:) = [];\r\nvars = times_conf_us.Properties.VariableNames;\r\n\r\nfigure\r\nt = tiledlayout(<span class=\"string\">\"flow\"<\/span>);\r\n<span class=\"keyword\">for<\/span> ii = [5, length(vars)]\r\n    times_conf_us.Category = categorical(repmat(<span class=\"string\">\"&lt;100\"<\/span>,height(times_conf_us),1));\r\n    times_conf_us.Category(table2array(times_conf_us(:,ii)) &gt;= 100) = <span class=\"string\">\"&gt;=100\"<\/span>;\r\n    nexttile\r\n    tbl = times_conf_us(:,[1:4, ii, end]);\r\n    tbl(tbl.(5) == 0,:) = [];\r\n    gb = geobubble(tbl,<span class=\"string\">\"Lat\"<\/span>,<span class=\"string\">\"Long\"<\/span>,<span class=\"string\">\"SizeVariable\"<\/span>,vars(ii),<span class=\"string\">\"ColorVariable\"<\/span>,<span class=\"string\">\"Category\"<\/span>);\r\n    gb.BubbleColorList = [1,0,1;1,0,0];\r\n    gb.LegendVisible = <span class=\"string\">\"off\"<\/span>;\r\n    gb.Title = <span class=\"string\">\"As of \"<\/span> + vars(ii);\r\n    gb.SizeLimits = [0, max(times_conf_us.(vars{length(vars)}))];\r\n    gb.MapCenter = [44.9669 -113.6201];\r\n    gb.ZoomLevel = 1.7678;\r\n<span class=\"keyword\">end<\/span>\r\ntitle(t,[<span class=\"string\">\"COVID-19 Confirmed Cases in the US\"<\/span>; <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">\"Province\/State with 100+ cases highlighted in 
red\"<\/span>])\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_02.png\" alt=\"\"> <p>You can see it started out in Washington where it became a major outbreak, as well as in California, and New York.<\/p><h4>Ranking Country\/Region by Confirmed Cases<a name=\"c7dd6e33-af68-4963-a5f5-85fd61303c4d\"><\/a><\/h4><p>Let's compare the number of confirmed cases by Country\/Region using <tt>covid_19_data.csv<\/tt>. There are inconsistencies in the datetime format, so we will treat it as text initially.<\/p><pre class=\"codeinput\">opts = detectImportOptions(filenames(3), <span class=\"string\">\"TextType\"<\/span>,<span class=\"string\">\"string\"<\/span>,<span class=\"string\">\"DatetimeType\"<\/span>,<span class=\"string\">\"text\"<\/span>);\r\nprovData = readtable(filenames(3),opts);\r\n<\/pre><pre class=\"codeoutput\">Warning: Column headers from the file were modified to make them valid MATLAB\r\nidentifiers before creating variable names for the table. The original column\r\nheaders are saved in the VariableDescriptions property.\r\nSet 'PreserveVariableNames' to true to use the original column headers as table\r\nvariable names. 
\r\n<\/pre><p>Let's clean up the datetime format.<\/p><pre class=\"codeinput\">provData.ObservationDate = regexprep(provData.ObservationDate,<span class=\"string\">\"\\\/20$\"<\/span>,<span class=\"string\">\"\/2020\"<\/span>);\r\nprovData.ObservationDate = datetime(provData.ObservationDate);\r\n<\/pre><p>We also need to standardize the values in Country\/Region.<\/p><pre class=\"codeinput\">provData.Country_Region(provData.Country_Region == <span class=\"string\">\"Iran (Islamic Republic of)\"<\/span>) = <span class=\"string\">\"Iran\"<\/span>;\r\nprovData.Country_Region(provData.Country_Region == <span class=\"string\">\"Republic of Ireland\"<\/span>) = <span class=\"string\">\"Ireland\"<\/span>;\r\nprovData.Country_Region(provData.Country_Region == <span class=\"string\">\"Republic of Korea\"<\/span>) = <span class=\"string\">\"South Korea\"<\/span>;\r\nprovData.Country_Region(provData.Country_Region == <span class=\"string\">\"('St. Martin',)\"<\/span>) = <span class=\"string\">\"St. Martin\"<\/span>;\r\nprovData.Country_Region(provData.Country_Region == <span class=\"string\">\"Holy See\"<\/span>) = <span class=\"string\">\"Vatican City\"<\/span>;\r\nprovData.Country_Region(provData.Country_Region == <span class=\"string\">\"occupied Palestinian territory\"<\/span>) = <span class=\"string\">\"Palestine\"<\/span>;\r\n<\/pre><p>The dataset contains <tt>Province\/State<\/tt> variable. 
Let's aggregate the data at <tt>Country\/Region<\/tt> level.<\/p><pre class=\"codeinput\">countryData = groupsummary(provData,{<span class=\"string\">'ObservationDate'<\/span>,<span class=\"string\">'Country_Region'<\/span>}, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">\"sum\"<\/span>,{<span class=\"string\">'Confirmed'<\/span>,<span class=\"string\">'Deaths'<\/span>,<span class=\"string\">'Recovered'<\/span>});\r\ncountryData.Properties.VariableNames = erase(countryData.Properties.VariableNames,<span class=\"string\">\"sum_\"<\/span>);\r\n<\/pre><p><tt>countryData<\/tt> contains daily cumulative data. We need the most recent numbers only.<\/p><pre class=\"codeinput\">countryLatest = groupsummary(countryData,<span class=\"string\">\"Country_Region\"<\/span>, <span class=\"string\">\"max\"<\/span>, <span class=\"string\">\"Confirmed\"<\/span>);\r\ncountryLatest.Properties.VariableNames = erase(countryLatest.Properties.VariableNames,<span class=\"string\">\"max_\"<\/span>);\r\n<\/pre><p>Let's rank the top 10, and visualize them with a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/matlab.graphics.chart.primitive.histogram.html\"><tt>histogram<\/tt><\/a>.<\/p><pre class=\"codeinput\">[sorted,idx] = sort(countryLatest.Confirmed,<span class=\"string\">'descend'<\/span>);\r\nlabels = countryLatest.Country_Region(idx);\r\nk = 10;\r\ntopK = sorted(1:k);\r\nlabelsK = labels(1:k);\r\nfigure\r\nhistogram(<span class=\"string\">'Categories'<\/span>,categorical(labelsK),<span class=\"string\">\"BinCounts\"<\/span>,topK, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">\"DisplayOrder\"<\/span>,<span class=\"string\">\"ascend\"<\/span>,<span class=\"string\">\"Orientation\"<\/span>,<span class=\"string\">\"horizontal\"<\/span>)\r\nxlabel(<span class=\"string\">\"Confirmed Cases\"<\/span>)\r\ntitle([compose(<span class=\"string\">\"COVID-19 Confirmed Cases by Country\/Region - Top %d\"<\/span>,k); <span 
class=\"keyword\">...<\/span>\r\n    <span class=\"string\">\"As of \"<\/span> + datestr(max(provData.ObservationDate))])\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_03.png\" alt=\"\"> <p>Outside Mainland China, Italy and Iran are now surpassing South Korea.<\/p><h4>Growth of Confirmed Cases by Country\/Region<a name=\"a9d18469-fd0a-4989-a08b-9d63d462ca37\"><\/a><\/h4><p>We can also check how fast the cases are growing in those countries.<\/p><pre class=\"codeinput\">figure\r\nplot(countryData.ObservationDate(countryData.Country_Region == labelsK(2)), <span class=\"keyword\">...<\/span>\r\n    countryData.Confirmed(countryData.Country_Region == labelsK(2)));\r\nhold <span class=\"string\">on<\/span>\r\n<span class=\"keyword\">for<\/span> ii = 3:length(labelsK)\r\n    plot(countryData.ObservationDate(countryData.Country_Region == labelsK(ii)), <span class=\"keyword\">...<\/span>\r\n        countryData.Confirmed(countryData.Country_Region == labelsK(ii)),<span class=\"string\">\"LineWidth\"<\/span>,1);\r\n<span class=\"keyword\">end<\/span>\r\nhold <span class=\"string\">off<\/span>\r\ntitle([<span class=\"string\">\"COVID-19 Confirmed Cases outside Mainland China\"<\/span>;compose(<span class=\"string\">\"Top %d Country\/Region\"<\/span>,k)])\r\nlegend(labelsK(2:end),<span class=\"string\">\"location\"<\/span>,<span class=\"string\">\"northwest\"<\/span>)\r\nxlabel(<span class=\"string\">\"As of \"<\/span> + datestr(max(provData.ObservationDate)))\r\nylabel(<span class=\"string\">\"Cases\"<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_04.png\" alt=\"\"> <p>While South Korea shows signs of a slowdown, the growth is accelerating everywhere else.<\/p><h4>Growth of New Cases by Country\/Region<a name=\"8f70dc76-594e-43aa-ad61-afa5bdc1e3b8\"><\/a><\/h4><p>We can calculate the number of new cases 
by subtracting the cumulative number of confirmed cases between two dates.<\/p><pre class=\"codeinput\">by_country = cell(size(labelsK));\r\nfigure\r\nt = tiledlayout(<span class=\"string\">'flow'<\/span>);\r\n<span class=\"keyword\">for<\/span> ii = 1:length(labelsK)\r\n    country = provData(provData.Country_Region == labelsK(ii),:);\r\n    country = groupsummary(country,{<span class=\"string\">'ObservationDate'<\/span>,<span class=\"string\">'Country_Region'<\/span>}, <span class=\"keyword\">...<\/span>\r\n        <span class=\"string\">\"sum\"<\/span>,{<span class=\"string\">'Confirmed'<\/span>,<span class=\"string\">'Deaths'<\/span>,<span class=\"string\">'Recovered'<\/span>});\r\n    country.Properties.VariableNames = erase(country.Properties.VariableNames,<span class=\"string\">\"sum_\"<\/span>);\r\n    country.New =  [0; country.Confirmed(2:end) - country.Confirmed(1:end-1)];\r\n    country.New(country.New &lt; 0) = 0;\r\n    by_country{ii} = country;\r\n    <span class=\"keyword\">if<\/span> labelsK(ii) ~= <span class=\"string\">\"Others\"<\/span>\r\n        nexttile\r\n        plot(country.ObservationDate,country.New)\r\n        title(labelsK(ii) + compose(<span class=\"string\">\" - %d\"<\/span>, max(country.Confirmed)))\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\ntitle(t, compose(<span class=\"string\">\"COVID-19 New Cases - Top %d Country\/Region\"<\/span>,k))\r\nxlabel(t, <span class=\"string\">\"As of \"<\/span> + datestr(max(provData.ObservationDate)))\r\nylabel(t,<span class=\"string\">\"New Cases\"<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_05.png\" alt=\"\"> <p>You can see that there haven't been many new cases in Mainland China and South Korea. 
It seems like they managed to contain the outbreak.<\/p><h4>Closer look at Mainland China<a name=\"13386c61-3df6-49d9-8bf5-46bb310e7199\"><\/a><\/h4><p>Since the infection is slowing down in Mainland China, let's see how many active cases are still there. You can calculate the active cases by subtracting recovered cases and deaths from confirmed cases.<\/p><pre class=\"codeinput\"><span class=\"keyword\">for<\/span> ii = 1:length(labelsK)\r\n    by_country{ii}.Active = by_country{ii}.Confirmed - by_country{ii}.Deaths - by_country{ii}.Recovered;\r\n<span class=\"keyword\">end<\/span>\r\n\r\nfigure\r\narea(by_country{1}.ObservationDate, <span class=\"keyword\">...<\/span>\r\n    [by_country{1}.Active by_country{1}.Recovered by_country{1}.Deaths])\r\nlegend(<span class=\"string\">\"Active\"<\/span>,<span class=\"string\">\"Recovered\"<\/span>,<span class=\"string\">\"Deaths\"<\/span>,<span class=\"string\">\"location\"<\/span>,<span class=\"string\">\"northwest\"<\/span>)\r\ntitle(<span class=\"string\">\"Breakdown of Confirmed Cases in Mainland China\"<\/span>)\r\nxlabel(<span class=\"string\">\"As of \"<\/span> + datestr(max(provData.ObservationDate)))\r\nylabel(<span class=\"string\">\"Cases\"<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_06.png\" alt=\"\"> <h4>Fitting a Curve<a name=\"0ab12735-4b6a-45c3-9dd9-85de03ae19ef\"><\/a><\/h4><p>The number of active cases is dropping, and the curve looks roughly Gaussian. Can we fit a Gaussian model and predict when the active cases will be zero?<\/p><p><b>Disclaimer: this is a very crude approach and you shouldn't draw any conclusion from this - this is for your reading enjoyment only.<\/b><\/p><p>I used the <a href=\"https:\/\/www.mathworks.com\/products\/curvefitting.html\">Curve Fitting Toolbox<\/a> to fit a Gaussian to the <tt>Active<\/tt> line. 
To evaluate the goodness of fit, <a href=\"https:\/\/www.mathworks.com\/help\/curvefit\/evaluating-goodness-of-fit.html\">check this out<\/a>.<\/p><pre class=\"codeinput\">[x, y] = prepareCurveData((1:length(by_country{1}.Active))',by_country{1}.Active);\r\nft = fittype(<span class=\"string\">\"gauss1\"<\/span>);\r\nopts = fitoptions(<span class=\"string\">\"Method\"<\/span>, <span class=\"string\">\"NonlinearLeastSquares\"<\/span>);\r\nopts.Display = <span class=\"string\">\"Off\"<\/span>;\r\nopts.Lower = [-Inf -Inf 0];\r\nopts.StartPoint = [58046 27 7.66733432245782];\r\n[fobj, gof] = fit(x,y,ft,opts);\r\ngof\r\n<\/pre><pre class=\"codeoutput\">gof = \r\n  struct with fields:\r\n\r\n           sse: 4.4145e+08\r\n       rsquare: 0.9743\r\n           dfe: 47\r\n    adjrsquare: 0.9732\r\n          rmse: 3.0647e+03\r\n<\/pre><p>Let's project the output into the future by adding 20 days.<\/p><pre class=\"codeinput\">extend_days = 20;\r\nxhat = [x; (x(end)+1:x(end)+extend_days)'];\r\nxdates = [by_country{1}.ObservationDate; <span class=\"keyword\">...<\/span>\r\n    (by_country{1}.ObservationDate(end) + days(1): <span class=\"keyword\">...<\/span>\r\n    by_country{1}.ObservationDate(end) + days(extend_days))'];\r\nyhat = fobj(xhat);\r\nci = predint(fobj,xhat);\r\n<\/pre><p>Now we are ready to plot it.<\/p><pre class=\"codeinput\">figure\r\narea(by_country{1}.ObservationDate,by_country{1}.Active)\r\nhold <span class=\"string\">on<\/span>\r\nplot(xdates,yhat,<span class=\"string\">\"lineWidth\"<\/span>,2)\r\nplot(xdates,ci,<span class=\"string\">\"Color\"<\/span>,<span class=\"string\">\"m\"<\/span>,<span class=\"string\">\"LineStyle\"<\/span>,<span class=\"string\">\":\"<\/span>,<span class=\"string\">\"LineWidth\"<\/span>,2)\r\nhold <span class=\"string\">off<\/span>\r\nylim([0 inf])\r\nlegend(<span class=\"string\">\"Actual\"<\/span>,<span class=\"string\">\"Gaussian Fit\"<\/span>,<span class=\"string\">\"Confidence Intervals\"<\/span>,<span 
class=\"string\">\"location\"<\/span>,<span class=\"string\">\"northeast\"<\/span>)\r\ntitle(<span class=\"string\">\"Gaussian model over Active Cases in Mainland China\"<\/span>)\r\nxlabel(<span class=\"string\">\"Actual as of \"<\/span> + datestr(max(provData.ObservationDate)))\r\nylabel(<span class=\"string\">\"Cases\"<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_07.png\" alt=\"\"> <p>Obviously, I am not going to take this at face value, but wouldn't it be nice if Mainland China could reduce the active cases to zero by the beginning of April?<\/p><h4>What about South Korea?<a name=\"77b7c02a-434a-4926-9006-da9be29e7e2f\"><\/a><\/h4><p>Let's plot the number of active cases, recovered cases and deaths for South Korea.<\/p><pre class=\"codeinput\">figure\r\narea(by_country{4}.ObservationDate, <span class=\"keyword\">...<\/span>\r\n    [by_country{4}.Active by_country{4}.Recovered by_country{4}.Deaths])\r\nlegend(<span class=\"string\">\"Active\"<\/span>,<span class=\"string\">\"Recovered\"<\/span>,<span class=\"string\">\"Deaths\"<\/span>,<span class=\"string\">\"location\"<\/span>,<span class=\"string\">\"northwest\"<\/span>)\r\ntitle(<span class=\"string\">\"Breakdown of Confirmed Cases in South Korea\"<\/span>)\r\nxlabel(<span class=\"string\">\"As of \"<\/span> + datestr(max(provData.ObservationDate)))\r\nylabel(<span class=\"string\">\"Cases\"<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_08.png\" alt=\"\"> <p>As you can see in the plot, it is too soon to tell whether they have reached the peak yet. I don't think we can get a good fit using a Gaussian.<\/p><h4>Summary<a name=\"72456cd3-6146-49ca-92ab-6310179396f2\"><\/a><\/h4><p>Do you use MATLAB to help fight COVID-19? Or perhaps you are already self-isolating? 
Share how you use MATLAB while you go through this trying time <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=3597#respond\">here<\/a>.<\/p><p><i>Copyright 2020 The MathWorks, Inc.<\/i><\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_9551bb69c5a8482f8f0f6f95aed6b6a5() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='9551bb69c5a8482f8f0f6f95aed6b6a5 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 9551bb69c5a8482f8f0f6f95aed6b6a5';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2020 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_9551bb69c5a8482f8f0f6f95aed6b6a5()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2020a<br><\/p><\/div><!--\r\n9551bb69c5a8482f8f0f6f95aed6b6a5 ##### SOURCE BEGIN #####\r\n%% Analyzing Novel Corona Virus COVID-19 Dataset\r\n% As the threat of novel corona virus COVID-19 spreads through the world,\r\n% we live in an increasingly anxious time. While healthcare workers fight\r\n% the virus in the front line, we do our part by practicing social\r\n% distancing to slow the pandemic. 
Today's guest blogger,\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521 Toshi\r\n% Takeuchi>, would like to share how he spends his time by analyzing data\r\n% in MATLAB.\r\n% \r\n% *Disclaimer: this post is NOT a valid and credible source of information for \r\n% COVID-19, which is a serious threat, and you should consult with authoritative \r\n% sources for accurate information, such as WHO or CDC.* \r\n%\r\n%% COVID-19 Data Source\r\n% As we hear the news of novel corona virus COVID-19 day after day and we\r\n% start practicing social distancing, I needed to find a way to calm my\r\n% nerves. Am I the only one who finds data analysis in MATLAB a meditative\r\n% exercise? Then why not analyze COVID-19, I asked myself.\r\n% \r\n% I looked in the File Exchange and found\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/74076-track-the-global-spread-of-wuhan-coronavirus-in-matlab\r\n% this File Exchange submission about COVID-19 by Kevin Chng>. I also\r\n% found\r\n% <https:\/\/www.kaggle.com\/sudalairajkumar\/novel-corona-virus-2019-dataset\r\n% Novel Corona Virus 2019 Dataset> on Kaggle. I decided to use the dataset\r\n% from Kaggle.\r\n% \r\n% I downloaded the zip file from Kaggle and moved its contents to my current \r\n% working directory.\r\n% \r\n% Let's check the unzipped files. 
Please note that \"|2019_nCoV_data.csv|\"\r\n% is obsolete and we shouldn't use it.\r\n\r\ns = dir(\"*.csv\");\r\ns = s(arrayfun(@(x) ~matches(x.name,\"2019_nCoV_data.csv\"), s));\r\nfilenames = arrayfun(@(x) string(x.name), s)\r\n\r\n%% \r\n% * |covid_19_data.csv| - this is the main file - daily level data of\r\n% global cases by province\/state, from Jan 22, 2020\r\n% * |time_series_covid_19_confirmed.csv| - time series data of confirmed cases\r\n% * |time_series_covid_19_deaths.csv| - time series data of cumulative number \r\n% of deaths\r\n% * |time_series_covid_19_recovered.csv| - time series data of cumulative number \r\n% of recovered cases\r\n% * |COVID19_line_list_data.csv| - individual level information\r\n% * |COVID19_open_line_list.csv| - individual level information\r\n%\r\n%% Mapping Confirmed Cases Globally\r\n% Let's visualize the number of confirmed cases on a map. We start by loading \r\n% |time_series_covid_19_confirmed.csv| which contains latitude and longitude variables \r\n% we need for mapping. I also decided to keep the variable names as is, rather \r\n% than letting MATLAB convert them to valid MATLAB identifiers, because some of \r\n% the column names are dates. \r\n\r\nopts = detectImportOptions(filenames(4), \"TextType\",\"string\");\r\nopts.VariableNamesLine = 1;\r\nopts.DataLines = [2,inf];\r\nopts.PreserveVariableNames = true;\r\ntimes_conf = readtable(filenames(4),opts);\r\n\r\n%% \r\n% The dataset contains |Province\/State| variable, but we want to aggregate\r\n% the data at the |Country\/Region| level. Before we do so, we need to clean up\r\n% the data a bit. 
Please note that I have used the () notation because the\r\n% variable names are not valid MATLAB identifiers.\r\n\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"China\") = \"Mainland China\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"Czechia\") = \"Czech Republic\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"Iran (Islamic Republic of)\") = \"Iran\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"Republic of Korea\") = \"Korea, South\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"Republic of Moldova\") = \"Moldova\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"Russian Federation\") = \"Russia\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"Taipei and environs\") = \"Taiwan\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"Taiwan*\") = \"Taiwan\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"United Kingdom\") = \"UK\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Country\/Region\") == \"Viet Nam\") = \"Vietnam\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Province\/State\") == \"St Martin\") = \"St Martin\";\r\ntimes_conf.(\"Country\/Region\")(times_conf.(\"Province\/State\") == \"Saint Barthelemy\") = \"Saint Barthelemy\";\r\n\r\n%% \r\n% Now we can use\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/double.groupsummary.html\r\n% |groupsummary|> to aggregate the data by |Country\/Region| by summing the\r\n% confirmed cases and averaging the latitudes and longitudes.\r\n\r\nvars = times_conf.Properties.VariableNames;\r\ntimes_conf_country = groupsummary(times_conf,\"Country\/Region\",{'sum','mean'},vars(3:end));\r\n\r\n%% \r\n% The output contains unnecessary columns, such as sums of latitudes and\r\n% longitudes or means of confirmed cases. 
Let's remove those variables,\r\n% and also remove |'sum_'| or |'mean_'| prefixes from the variables we\r\n% keep.\r\n\r\nvars = times_conf_country.Properties.VariableNames;\r\nvars = regexprep(vars,\"^(sum_)(?=L(a|o))\",\"remove_\");\r\nvars = regexprep(vars,\"^(mean_)(?=[0-9])\",\"remove_\");\r\nvars = erase(vars,{'sum_','mean_'});\r\ntimes_conf_country.Properties.VariableNames = vars;\r\ntimes_conf_country = removevars(times_conf_country,[{'GroupCount'},vars(contains(vars,\"remove_\"))]);\r\n%% \r\n% Because Mainland China is so disproportionately large, we want to exclude\r\n% it from our visualization.\r\n\r\ntimes_conf_exChina = times_conf_country(times_conf_country.(\"Country\/Region\") ~= \"Mainland China\",:);\r\nvars = times_conf_exChina.Properties.VariableNames;\r\n\r\n%% \r\n% Let's use <https:\/\/www.mathworks.com\/help\/matlab\/ref\/geobubble.html\r\n% |geobubble|> to visualize the first and the last dates in the dataset.\r\n% Since the column names of numeric data are dates, I can simply pick the\r\n% first date and the last date to show the maps together. 
Please note that\r\n% |geobubble| would show a bubble for zero values, and therefore we need to\r\n% remove rows with zero values if we don't want to show bubbles for zero\r\n% cases.\r\n\r\nfigure\r\nt = tiledlayout(\"flow\");\r\nfor ii = [4, length(vars)]\r\n    times_conf_exChina.Category = categorical(repmat(\"<100\",height(times_conf_exChina),1));\r\n    times_conf_exChina.Category(table2array(times_conf_exChina(:,ii)) >= 100) = \">=100\";\r\n    nexttile\r\n    tbl = times_conf_exChina(:,[1:3, ii, end]);\r\n    tbl(tbl.(4) == 0,:) = [];\r\n    gb = geobubble(tbl,\"Lat\",\"Long\",\"SizeVariable\",vars(ii),\"ColorVariable\",\"Category\");\r\n    gb.BubbleColorList = [1,0,1;1,0,0];\r\n    gb.LegendVisible = \"off\";\r\n    gb.Title = \"As of \" + vars(ii);\r\n    gb.SizeLimits = [0, max(times_conf_exChina.(vars{length(vars)}))];\r\n    gb.MapCenter = [21.6385   36.1666];\r\n    gb.ZoomLevel = 0.3606;\r\nend\r\ntitle(t,[\"COVID-19 Confirmed Cases outside Mainland China\"; ...\r\n    \"Country\/Region with 100+ cases highlighted in red\"])\r\n%% \r\n% We can see that it initially only affected the countries\/regions\r\n% surrounding Mainland China, but since then there have been massive outbreaks\r\n% in South Korea, Italy, and Iran. It is also worth noting that we already\r\n% had confirmed cases in the US as early as January 22, 2020.\r\n\r\n%% Mapping Confirmed Cases in the US\r\n% Since I live in Boston, I'm interested in more local cases. 
Let's go down\r\n% to the |Province\/State| level in the US.\r\n\r\ntimes_conf_us = times_conf((times_conf.(\"Country\/Region\") == \"US\"),:);\r\ntimes_conf_us(times_conf_us.(\"Province\/State\") == \"Diamond Princess\",:) = [];\r\nvars = times_conf_us.Properties.VariableNames;\r\n\r\nfigure\r\nt = tiledlayout(\"flow\");\r\nfor ii = [5, length(vars)]\r\n    times_conf_us.Category = categorical(repmat(\"<100\",height(times_conf_us),1));\r\n    times_conf_us.Category(table2array(times_conf_us(:,ii)) >= 100) = \">=100\";\r\n    nexttile\r\n    tbl = times_conf_us(:,[1:4, ii, end]);\r\n    tbl(tbl.(5) == 0,:) = [];\r\n    gb = geobubble(tbl,\"Lat\",\"Long\",\"SizeVariable\",vars(ii),\"ColorVariable\",\"Category\");\r\n    gb.BubbleColorList = [1,0,1;1,0,0];\r\n    gb.LegendVisible = \"off\";\r\n    gb.Title = \"As of \" + vars(ii);\r\n    gb.SizeLimits = [0, max(times_conf_us.(vars{length(vars)}))];\r\n    gb.MapCenter = [44.9669 -113.6201];\r\n    gb.ZoomLevel = 1.7678;\r\nend\r\ntitle(t,[\"COVID-19 Confirmed Cases in the US\"; ...\r\n    \"Province\/State with 100+ cases highlighted in red\"])\r\n%% \r\n% You can see it started out in Washington where it became a major outbreak,\r\n% as well as in California, and New York.\r\n%\r\n%% Ranking Country\/Region by Confirmed Cases \r\n% Let's compare the number of confirmed cases by Country\/Region using |covid_19_data.csv|. \r\n% There are inconsistencies in the datetime format, so we will treat it as text initially. \r\n\r\nopts = detectImportOptions(filenames(3), \"TextType\",\"string\",\"DatetimeType\",\"text\");\r\nprovData = readtable(filenames(3),opts);\r\n\r\n%% \r\n% Let's clean up the datetime format. \r\n\r\nprovData.ObservationDate = regexprep(provData.ObservationDate,\"\\\/20$\",\"\/2020\");\r\nprovData.ObservationDate = datetime(provData.ObservationDate);\r\n\r\n%% \r\n% We also need to standardize the values in Country\/Region. 
\r\n\r\nprovData.Country_Region(provData.Country_Region == \"Iran (Islamic Republic of)\") = \"Iran\";\r\nprovData.Country_Region(provData.Country_Region == \"Republic of Ireland\") = \"Ireland\";\r\nprovData.Country_Region(provData.Country_Region == \"Republic of Korea\") = \"South Korea\";\r\nprovData.Country_Region(provData.Country_Region == \"('St. Martin',)\") = \"St. Martin\";\r\nprovData.Country_Region(provData.Country_Region == \"Holy See\") = \"Vatican City\";\r\nprovData.Country_Region(provData.Country_Region == \"occupied Palestinian territory\") = \"Palestine\";\r\n\r\n%% \r\n% The dataset contains |Province\/State| variable. Let's aggregate the data\r\n% at |Country\/Region| level.\r\n\r\ncountryData = groupsummary(provData,{'ObservationDate','Country_Region'}, ...\r\n    \"sum\",{'Confirmed','Deaths','Recovered'});\r\ncountryData.Properties.VariableNames = erase(countryData.Properties.VariableNames,\"sum_\");\r\n\r\n%% \r\n% |countryData| contains daily cumulative data. We need the most recent\r\n% numbers only.\r\n\r\ncountryLatest = groupsummary(countryData,\"Country_Region\", \"max\", \"Confirmed\");\r\ncountryLatest.Properties.VariableNames = erase(countryLatest.Properties.VariableNames,\"max_\");\r\n\r\n%% \r\n% Let's rank the top 10, and visualize them with a\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/matlab.graphics.chart.primitive.histogram.html\r\n% |histogram|>.\r\n\r\n[sorted,idx] = sort(countryLatest.Confirmed,'descend');\r\nlabels = countryLatest.Country_Region(idx); \r\nk = 10;\r\ntopK = sorted(1:k);\r\nlabelsK = labels(1:k);\r\nfigure\r\nhistogram('Categories',categorical(labelsK),\"BinCounts\",topK, ...\r\n    \"DisplayOrder\",\"ascend\",\"Orientation\",\"horizontal\")\r\nxlabel(\"Confirmed Cases\")\r\ntitle([compose(\"COVID-19 Confirmed Cases by Country\/Region - Top %d\",k); ...\r\n    \"As of \" + datestr(max(provData.ObservationDate))])\r\n\r\n%% \r\n% Outside Mainland China, Italy and Iran are now surpassing South 
Korea. \r\n%\r\n%% Growth of Confirmed Cases by Country\/Region\r\n% We can also check how fast the cases are growing in those countries.\r\n\r\nfigure\r\nplot(countryData.ObservationDate(countryData.Country_Region == labelsK(2)), ...\r\n    countryData.Confirmed(countryData.Country_Region == labelsK(2)));\r\nhold on\r\nfor ii = 3:length(labelsK)\r\n    plot(countryData.ObservationDate(countryData.Country_Region == labelsK(ii)), ...\r\n        countryData.Confirmed(countryData.Country_Region == labelsK(ii)),\"LineWidth\",1);\r\nend\r\nhold off\r\ntitle([\"COVID-19 Confirmed Cases outside Mainland China\";compose(\"Top %d Country\/Region\",k)])\r\nlegend(labelsK(2:end),\"location\",\"northwest\")\r\nxlabel(\"As of \" + datestr(max(provData.ObservationDate)))\r\nylabel(\"Cases\")\r\n\r\n%% \r\n% While South Korea shows a sign of slowdown, it's accelerating everywhere\r\n% else.\r\n%\r\n%% Growth of New Cases by Country\/Region\r\n% We can calculate the number of new cases by subtracting the cumulative\r\n% number of confirmed cases between two dates.\r\n\r\nby_country = cell(size(labelsK));\r\nfigure\r\nt = tiledlayout('flow');\r\nfor ii = 1:length(labelsK)\r\n    country = provData(provData.Country_Region == labelsK(ii),:);\r\n    country = groupsummary(country,{'ObservationDate','Country_Region'}, ...\r\n        \"sum\",{'Confirmed','Deaths','Recovered'});\r\n    country.Properties.VariableNames = erase(country.Properties.VariableNames,\"sum_\");\r\n    country.New =  [0; country.Confirmed(2:end) - country.Confirmed(1:end-1)];\r\n    country.New(country.New < 0) = 0;\r\n    by_country{ii} = country;\r\n    if labelsK(ii) ~= \"Others\"\r\n        nexttile\r\n        plot(country.ObservationDate,country.New)\r\n        title(labelsK(ii) + compose(\" - %d\", max(country.Confirmed)))\r\n    end\r\nend\r\ntitle(t, compose(\"COVID-19 New Cases - Top %d Country\/Region\",k))\r\nxlabel(t, \"As of \" + datestr(max(provData.ObservationDate)))\r\nylabel(t,\"New 
Cases\")\r\n\r\n%% \r\n% You can see that there haven't been many new cases in Mainland China and South \r\n% Korea. It seems like they managed to contain the outbreak. \r\n%\r\n%% Closer look at Mainland China\r\n% Since the infection is slowing down in Mainland China, let's see how many\r\n% active cases remain. You can calculate the active cases by\r\n% subtracting recovered cases and deaths from confirmed cases.\r\n\r\nfor ii = 1:length(labelsK)\r\n    by_country{ii}.Active = by_country{ii}.Confirmed - by_country{ii}.Deaths - by_country{ii}.Recovered;\r\nend\r\n\r\nfigure\r\narea(by_country{1}.ObservationDate, ...\r\n    [by_country{1}.Active by_country{1}.Recovered by_country{1}.Deaths])\r\nlegend(\"Active\",\"Recovered\",\"Deaths\",\"location\",\"northwest\")\r\ntitle(\"Breakdown of Confirmed Cases in Mainland China\")\r\nxlabel(\"As of \" + datestr(max(provData.ObservationDate)))\r\nylabel(\"Cases\")\r\n\r\n%% Fitting a Curve\r\n% The number of active cases is dropping, and the curve looks roughly Gaussian. \r\n% Can we fit a Gaussian model and predict when the active cases will be zero? \r\n% \r\n% *Disclaimer: this is a very crude approach and you shouldn't draw any conclusion \r\n% from this - this is for your reading enjoyment only.* \r\n% \r\n% I used the <https:\/\/www.mathworks.com\/products\/curvefitting.html Curve Fitting \r\n% Toolbox> to fit a Gaussian to the |Active| line. To evaluate the goodness of fit, \r\n% <https:\/\/www.mathworks.com\/help\/curvefit\/evaluating-goodness-of-fit.html check \r\n% this out>. \r\n\r\n[x, y] = prepareCurveData((1:length(by_country{1}.Active))',by_country{1}.Active);\r\nft = fittype(\"gauss1\");\r\nopts = fitoptions(\"Method\", \"NonlinearLeastSquares\");\r\nopts.Display = \"Off\";\r\nopts.Lower = [-Inf -Inf 0];\r\nopts.StartPoint = [58046 27 7.66733432245782];\r\n[fobj, gof] = fit(x,y,ft,opts);\r\ngof\r\n\r\n%% \r\n% Let's project the output into the future by adding 20 days. 
\r\n\r\nextend_days = 20;\r\nxhat = [x; (x(end)+1:x(end)+extend_days)'];\r\nxdates = [by_country{1}.ObservationDate; ...\r\n    (by_country{1}.ObservationDate(end) + days(1): ...\r\n    by_country{1}.ObservationDate(end) + days(extend_days))'];\r\nyhat = fobj(xhat);\r\nci = predint(fobj,xhat);\r\n\r\n%% \r\n% Now we are ready to plot it. \r\n\r\nfigure\r\narea(by_country{1}.ObservationDate,by_country{1}.Active)\r\nhold on\r\nplot(xdates,yhat,\"lineWidth\",2)\r\nplot(xdates,ci,\"Color\",\"m\",\"LineStyle\",\":\",\"LineWidth\",2)\r\nhold off\r\nylim([0 inf])\r\nlegend(\"Actual\",\"Gaussian Fit\",\"Confidence Intervals\",\"location\",\"northeast\")\r\ntitle(\"Gaussian model over Active Cases in Mainland China\")\r\nxlabel(\"Actual as of \" + datestr(max(provData.ObservationDate)))\r\nylabel(\"Cases\")\r\n\r\n%% \r\n% Obviously, I am not going to take this at face value, but wouldn't it\r\n% be nice if Mainland China could reduce the active cases to zero by the\r\n% beginning of April?\r\n%\r\n%% What about South Korea?\r\n% Let's plot the number of active cases, recovered cases and deaths for\r\n% South Korea.\r\n\r\nfigure\r\narea(by_country{4}.ObservationDate, ...\r\n    [by_country{4}.Active by_country{4}.Recovered by_country{4}.Deaths])\r\nlegend(\"Active\",\"Recovered\",\"Deaths\",\"location\",\"northwest\")\r\ntitle(\"Breakdown of Confirmed Cases in South Korea\")\r\nxlabel(\"As of \" + datestr(max(provData.ObservationDate)))\r\nylabel(\"Cases\")\r\n\r\n%% \r\n% As you can see in the plot, it is too soon to tell if they have reached the peak\r\n% yet. I don't think we can get a good fit using a Gaussian.\r\n%\r\n%% Summary\r\n% Do you use MATLAB for help fighting COVID-19? Or perhaps you are already\r\n% self-isolating? 
Share how you use MATLAB while you go through this trying\r\n% time <https:\/\/blogs.mathworks.com\/loren\/?p=3597#respond here>.\r\n\r\n%%\r\n% _Copyright 2020 The MathWorks, Inc._\r\n##### SOURCE END ##### 9551bb69c5a8482f8f0f6f95aed6b6a5\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2020\/analyze_covid19_08.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>As the threat of novel corona virus COVID-19 spreads through the world, we live in an increasingly anxious time. While healthcare workers fight the virus in the front line, we do our part by practicing social distancing to slow the pandemic. Today's guest blogger, <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521\">Toshi Takeuchi<\/a>, would like to share how he spends his time by analyzing data in MATLAB.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2020\/03\/16\/analyzing-novel-corona-virus-covid-19-dataset\/\">read more 
>><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,40,61],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/3597"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=3597"}],"version-history":[{"count":4,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/3597\/revisions"}],"predecessor-version":[{"id":3621,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/3597\/revisions\/3621"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=3597"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=3597"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=3597"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}