Analyzing Novel Corona Virus COVID-19 Dataset

Posted by Loren Shure, March 16, 2020

13 views (last 30 days) | 0 Likes | 10 comments

As the threat of novel corona virus COVID-19 spreads through the world, we live in an increasingly anxious time. While healthcare workers fight the virus in the front line, we do our part by practicing social distancing to slow the pandemic. Today's guest blogger, Toshi Takeuchi, would like to share how he spends his time by analyzing data in MATLAB.

Disclaimer: this post is NOT a valid and credible source of information for COVID-19, which is a serious threat, and you should consult with authoritative sources for accurate information, such as WHO or CDC.

COVID-19 Data Source
Mapping Confirmed Cases Globally
Mapping Confirmed Cases in the US
Ranking Country/Region by Confirmed Cases
Growth of Confirmed Cases by Country/Region
Growth of New Cases by Country/Region
Closer look at Mainland China
Fitting a Curve
What about South Korea?
Summary

COVID-19 Data Source

As we hear the news of novel corona virus COVID-19 day after day and we start practicing social distancing, I needed to find a way to calm my nerves. Am I the only one who finds data analysis in MATLAB a meditative exercise? Then why not analyze COVID-19, I asked myself.

I looked in the File Exchange and found Kevin Chng made this FileExchnage submmission about COVID-19. I also found Novel Corona Virus 2019 Dataset on Kaggle. I decided to use the dataset from Kaggle.

I downloaded the zip file from Kaggle and moved its content to my current working directory.

Let's check the unzipped files. Please note that "|2019_nCoV_data.csv|" is obsolete and we shouldn't use it.

s = dir("*.csv");
s = s(arrayfun(@(x) ~matches(x.name,"2019_nCoV_data.csv"), s));
filenames = arrayfun(@(x) string(x.name), s)

filenames = 
  6×1 string array
    "COVID19_line_list_data.csv"
    "COVID19_open_line_list.csv"
    "covid_19_data.csv"
    "time_series_covid_19_confirmed.csv"
    "time_series_covid_19_deaths.csv"
    "time_series_covid_19_recovered.csv"

covid_19_data.csv - this is the main file - daily level data of global cases by province/state, from Jan 22, 2020
time_series_covid_19_confirmed.csv - time series data of confirmed cases
time_series_covid_19_deaths.csv - time series data of cumulative number of deaths
time_series_covid_19_recovered.csv - time series data of cumulative number of recovered cases
COVID19_line_list_data.csv - individual level information
COVID19_open_line_list.csv - individual level information

Mapping Confirmed Cases Globally

Let's visualize the number of confirmed cases on a map. We start by loading time_series_covid_19_confirmed.csv which contains latitude and longitude variables we need for mapping. I also decided to keep the variable names as is, rather than letting MATLAB convert them to valid MATLAB identifiers, because some of the column names are dates.

opts = detectImportOptions(filenames(4), "TextType","string");
opts.VariableNamesLine = 1;
opts.DataLines = [2,inf];
opts.PreserveVariableNames = true;
times_conf = readtable(filenames(4),opts);

The dataset contains Province/State variable, but we want to aggregate the data at the Country/Region level. Before we do so, we need to clean up the data a bit. Please note that I have use the () notation because the variable names are not valid MATLAB identifiers.

times_conf.("Country/Region")(times_conf.("Country/Region") == "China") = "Mainland China";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Czechia") = "Czech Republic";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Iran (Islamic Republic of)") = "Iran";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Republic of Korea") = "Korea, South";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Republic of Moldova") = "Moldova";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Russian Federation") = "Russia";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Taipei and environs") = "Taiwan";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Taiwan*") = "Taiwan";
times_conf.("Country/Region")(times_conf.("Country/Region") == "United Kingdom") = "UK";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Viet Nam") = "Vietnam";
times_conf.("Country/Region")(times_conf.("Province/State") == "St Martin") = "St Martin";
times_conf.("Country/Region")(times_conf.("Province/State") == "Saint Barthelemy") = "Saint Barthelemy";

Now we can use groupsummary to aggregate the data by Country/Region by summing the confirmed cases and averaging the latitudes and longitudes.

vars = times_conf.Properties.VariableNames;
times_conf_country = groupsummary(times_conf,"Country/Region",{'sum','mean'},vars(3:end));

The output contains unnecessary columns, such as sums of latitudes and longitudes or means of confirmed cases. Let's remove those variables, and also remove 'sum_' or 'mean_' prefixes from the variables we keep.

vars = times_conf_country.Properties.VariableNames;
vars = regexprep(vars,"^(sum_)(?=L(a|o))","remove_");
vars = regexprep(vars,"^(mean_)(?=[0-9])","remove_");
vars = erase(vars,{'sum_','mean_'});
times_conf_country.Properties.VariableNames = vars;
times_conf_country = removevars(times_conf_country,[{'GroupCount'},vars(contains(vars,"remove_"))]);

Because Mainland China is so disproportionately large, we want to exclude it from our visualization.

times_conf_exChina = times_conf_country(times_conf_country.("Country/Region") ~= "Mainland China",:);
vars = times_conf_exChina.Properties.VariableNames;

Let's use geobubble to visualize the first and the last dates in the dataset. Since the column names of numerica data are dates, I can simply pick the first date and the last date to show the maps together. Please note that geobubble would show a bubble for zero values, and therefore we need to remove rows with zero values if we don't want to show bubbles for zero cases.

figure
t = tiledlayout("flow");
for ii = [4, length(vars)]
    times_conf_exChina.Category = categorical(repmat("<100",height(times_conf_exChina),1));
    times_conf_exChina.Category(table2array(times_conf_exChina(:,ii)) >= 100) = ">=100";
    nexttile
    tbl = times_conf_exChina(:,[1:3, ii, end]);
    tbl(tbl.(4) == 0,:) = [];
    gb = geobubble(tbl,"Lat","Long","SizeVariable",vars(ii),"ColorVariable","Category");
    gb.BubbleColorList = [1,0,1;1,0,0];
    gb.LegendVisible = "off";
    gb.Title = "As of " + vars(ii);
    gb.SizeLimits = [0, max(times_conf_exChina.(vars{length(vars)}))];
    gb.MapCenter = [21.6385   36.1666];
    gb.ZoomLevel = 0.3606;
end
title(t,["COVID-19 Confirmed Cases outside Mainland China"; ...
    "Country/Region with 100+ cases highlighted in red"])

We can see that it initially only affected the countries/regions surrounding Mainland China, but since there have been massive breakouts in South Korea, Italy, and Iran. It is also worth noting that we already had confirmed cases in the US as early as January 22, 2020.

Mapping Confirmed Cases in the US

Since I live in Boston, I'm interested in more local cases. Let's go down to the Province/State level in the US.

times_conf_us = times_conf((times_conf.("Country/Region") == "US"),:);
times_conf_us(times_conf_us.("Province/State") == "Diamond Princess",:) = [];
vars = times_conf_us.Properties.VariableNames;

figure
t = tiledlayout("flow");
for ii = [5, length(vars)]
    times_conf_us.Category = categorical(repmat("<100",height(times_conf_us),1));
    times_conf_us.Category(table2array(times_conf_us(:,ii)) >= 100) = ">=100";
    nexttile
    tbl = times_conf_us(:,[1:4, ii, end]);
    tbl(tbl.(5) == 0,:) = [];
    gb = geobubble(tbl,"Lat","Long","SizeVariable",vars(ii),"ColorVariable","Category");
    gb.BubbleColorList = [1,0,1;1,0,0];
    gb.LegendVisible = "off";
    gb.Title = "As of " + vars(ii);
    gb.SizeLimits = [0, max(times_conf_us.(vars{length(vars)}))];
    gb.MapCenter = [44.9669 -113.6201];
    gb.ZoomLevel = 1.7678;
end
title(t,["COVID-19 Confirmed Cases in the US"; ...
    "Province/State with 100+ cases highlighted in red"])

You can see it started out in Washington where it became a major outbreak, as well as in California, and New York.

Ranking Country/Region by Confirmed Cases

Let's compare the number of confirmed cases by Country/Region using covid_19_data.csv. There are inconsistencies in the datetime format, so we will treat it as text initially.

opts = detectImportOptions(filenames(3), "TextType","string","DatetimeType","text");
provData = readtable(filenames(3),opts);

Warning: Column headers from the file were modified to make them valid MATLAB
identifiers before creating variable names for the table. The original column
headers are saved in the VariableDescriptions property.
Set 'PreserveVariableNames' to true to use the original column headers as table
variable names.

Let's clean up the datetime format.

provData.ObservationDate = regexprep(provData.ObservationDate,"\/20$","/2020");
provData.ObservationDate = datetime(provData.ObservationDate);

We also need to standardize the values in Country/Region.

provData.Country_Region(provData.Country_Region == "Iran (Islamic Republic of)") = "Iran";
provData.Country_Region(provData.Country_Region == "Republic of Ireland") = "Ireland";
provData.Country_Region(provData.Country_Region == "Republic of Korea") = "South Korea";
provData.Country_Region(provData.Country_Region == "('St. Martin',)") = "St. Martin";
provData.Country_Region(provData.Country_Region == "Holy See") = "Vatican City";
provData.Country_Region(provData.Country_Region == "occupied Palestinian territory") = "Palestine";

The dataset contains Province/State variable. Let's aggregate the data at Country/Region level.

countryData = groupsummary(provData,{'ObservationDate','Country_Region'}, ...
    "sum",{'Confirmed','Deaths','Recovered'});
countryData.Properties.VariableNames = erase(countryData.Properties.VariableNames,"sum_");

countryData contains daily cumulative data. We need the most recent numbers only.

countryLatest = groupsummary(countryData,"Country_Region", "max", "Confirmed");
countryLatest.Properties.VariableNames = erase(countryLatest.Properties.VariableNames,"max_");

Let's rank the top 10, and visualize them with a histogram.

[sorted,idx] = sort(countryLatest.Confirmed,'descend');
labels = countryLatest.Country_Region(idx);
k = 10;
topK = sorted(1:k);
labelsK = labels(1:k);
figure
histogram('Categories',categorical(labelsK),"BinCounts",topK, ...
    "DisplayOrder","ascend","Orientation","horizontal")
xlabel("Confirmed Cases")
title([compose("COVID-19 Confirmed Cases by Country/Region - Top %d",k); ...
    "As of " + datestr(max(provData.ObservationDate))])

Outside Mainland China, Italy are Iran are now surpassing South Korea.

Growth of Confirmed Cases by Country/Region

We can also check how fast the cases are growing in those countries.

figure
plot(countryData.ObservationDate(countryData.Country_Region == labelsK(2)), ...
    countryData.Confirmed(countryData.Country_Region == labelsK(2)));
hold on
for ii = 3:length(labelsK)
    plot(countryData.ObservationDate(countryData.Country_Region == labelsK(ii)), ...
        countryData.Confirmed(countryData.Country_Region == labelsK(ii)),"LineWidth",1);
end
hold off
title(["COVID-19 Confirmed Cases outside Mainland China";compose("Top %d Country/Region",k)])
legend(labelsK(2:end),"location","northwest")
xlabel("As of " + datestr(max(provData.ObservationDate)))
ylabel("Cases")

While South Korea shows a sign of slowdown, it's accelerating everywhere else.

Growth of New Cases by Country/Region

We can calculate the number of new cases by subtracting the cumulative number of confirmed cases between two dates.

by_country = cell(size(labelsK));
figure
t = tiledlayout('flow');
for ii = 1:length(labelsK)
    country = provData(provData.Country_Region == labelsK(ii),:);
    country = groupsummary(country,{'ObservationDate','Country_Region'}, ...
        "sum",{'Confirmed','Deaths','Recovered'});
    country.Properties.VariableNames = erase(country.Properties.VariableNames,"sum_");
    country.New =  [0; country.Confirmed(2:end) - country.Confirmed(1:end-1)];
    country.New(country.New < 0) = 0;
    by_country{ii} = country;
    if labelsK(ii) ~= "Others"
        nexttile
        plot(country.ObservationDate,country.New)
        title(labelsK(ii) + compose(" - %d", max(country.Confirmed)))
    end
end
title(t, compose("COVID-19 New Cases - Top %d Country/Region",k))
xlabel(t, "As of " + datestr(max(provData.ObservationDate)))
ylabel(t,"New Cases")

You can see that there haven't been many new cases in Mainland China and South Korea. It seems like they managed to contain the outbreak.

Closer look at Mainland China

Since the infection is slowing down in Mainland China, let's see how many active cases are still here. You can calculate the active cases by subtracting recovered cases and deaths from confirmed cases.

for ii = 1:length(labelsK)
    by_country{ii}.Active = by_country{ii}.Confirmed - by_country{ii}.Deaths - by_country{ii}.Recovered;
end

figure
area(by_country{1}.ObservationDate, ...
    [by_country{1}.Active by_country{1}.Recovered by_country{1}.Deaths])
legend("Active","Recovered","Deaths","location","northwest")
title("Breakdown of Confirmed Cases in Mainland China")
xlabel("As of " + datestr(max(provData.ObservationDate)))
ylabel("Cases")

Fitting a Curve

The number of active cases is dropping, and the curve looks roughly Gaussian. Can we fit a Gaussian model and predict when the active cases will be zero?

Disclaimer: this is a very crude approach and you shouldn't draw any conclusion from this - this is just for your reading enjoyment only.

I used the Curve Fitting Toolbox to fit a gaussian to the Active line. To evaluate the goodness of fit, check this out.

[x, y] = prepareCurveData((1:length(by_country{1}.Active))',by_country{1}.Active);
ft = fittype("gauss1");
opts = fitoptions("Method", "NonlinearLeastSquares");
opts.Display = "Off";
opts.Lower = [-Inf -Inf 0];
opts.StartPoint = [58046 27 7.66733432245782];
[fobj, gof] = fit(x,y,ft,opts);
gof

gof = 
  struct with fields:

           sse: 4.4145e+08
       rsquare: 0.9743
           dfe: 47
    adjrsquare: 0.9732
          rmse: 3.0647e+03

Let's project the output into the future by adding 20 days.

extend_days = 20;
xhat = [x; (x(end)+1:x(end)+extend_days)'];
xdates = [by_country{1}.ObservationDate; ...
    (by_country{1}.ObservationDate(end) + days(1): ...
    by_country{1}.ObservationDate(end) + days(extend_days))'];
yhat = fobj(xhat);
ci = predint(fobj,xhat);

Now we are ready to plot it.

figure
area(by_country{1}.ObservationDate,by_country{1}.Active)
hold on
plot(xdates,yhat,"lineWidth",2)
plot(xdates,ci,"Color","m","LineStyle",":","LineWidth",2)
hold off
ylim([0 inf])
legend("Actual","Gaussian Fit","Confidence Intevals","location","northeast")
title("Gaussian model over Actives Cases in Mainland China")
xlabel("Actual as of " + datestr(max(provData.ObservationDate)))
ylabel("Cases")

Obviously, I am not going to take this at the face value, but wouldn't it be nice if Mainland China can reduce the active cases to zero by the beginning of April?

What about South Korea?

Let's plot the number of active cases, recovered cases and deaths for South Korea.

figure
area(by_country{4}.ObservationDate, ...
    [by_country{4}.Active by_country{4}.Recovered by_country{4}.Deaths])
legend("Active","Recovered","Deaths","location","northwest")
title("Breakdown of Confirmed Cases in South Korea")
xlabel("As of " + datestr(max(provData.ObservationDate)))
ylabel("Cases")

As you can see in the plot, it is too soon to tell if they reached the peak yet. I don't think we can get any good fit using Gaussian.

Summary

Do you use MATLAB for help fighting COVID-19? Or perhaps you are already self-isolating? Share how you use MATLAB while you go through this trying time here.

Published with MATLAB® R2020a