Sam Schoberg and Liam Horch
The novel coronavirus (COVID-19) is a disease that first emerged in Asia in late 2019. At the time of writing, 69.1M people have been infected across every continent except Antarctica.
While every continent has dealt with infections and community spread, some regions have contained the virus better than others. The goal of this tutorial is to discover how Coronavirus metrics (total confirmed cases, deaths, and total recovered) vary by continent.
This tutorial has two exploratory sections: a map-based approach and a graphical approach. The graphical approach then leads into hypothesis testing and machine learning analysis.
# Necessary libraries and imports to complete this tutorial
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from scipy import stats
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
!pip install countryinfo
from countryinfo import CountryInfo
We're collecting our dataset from the "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University." It provides time series data for each metric (confirmed cases, deaths, recoveries) across all countries from 1/22/20 until the current day. It's important to note that the dataset provides Province/State values for some countries. For example, countries like Canada and Australia are broken down by province or state, while Afghanistan only has country-level metrics. Later, we'll have to aggregate these state- and country-level values up to continents.
The URLs below point to .csv downloads of the time series data on GitHub. We then read each .csv into a pandas dataframe.
# Data urls
base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
confirmed_cases_data_url = base_url + 'time_series_covid19_confirmed_global.csv'
death_cases_data_url = base_url + 'time_series_covid19_deaths_global.csv'
recovery_cases_data_url = base_url + 'time_series_covid19_recovered_global.csv'
# Import datasets as pandas dataframes
raw_data_confirmed = pd.read_csv(confirmed_cases_data_url)
raw_data_deaths = pd.read_csv(death_cases_data_url)
raw_data_recovered = pd.read_csv(recovery_cases_data_url)
raw_data_confirmed.head()
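Note from the preview that some countries occupy multiple rows, one per province or state. Here's a quick way to list which countries are broken out this way (a minimal check using only the dataframe loaded above):
# Countries that appear with province/state-level rows in the global time series
print(raw_data_confirmed[raw_data_confirmed['Province/State'].notna()]['Country/Region'].unique())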
First, we'll explore the data using a heatmap to try to build intuition about which continents and regions have best contained the Coronavirus. Right off the bat, the first problem arises: we have messy data! Not to fear, for I specialize in cleaning up data. We shall simply drop the unneeded Province/State and Country/Region columns and then remove all NaN values from our dataframe. Some may say that using dates as column labels is an odd choice, and while I agree, I find this layout easier to iterate over, so I shall keep it as is.
# Re-read the confirmed cases data to start from a clean copy
raw_data_confirmed = pd.read_csv(confirmed_cases_data_url)
# The heatmap only needs coordinates and counts, so drop the name columns
new_data = raw_data_confirmed.drop(columns=['Province/State', 'Country/Region'])
new_data = new_data.dropna()
new_data.head(10)
We now must prep the data to be represented as a heat map. Folium's HeatMap takes a list of lists, with each inner list following the format [lat, lon, weight], the weight being the number of confirmed cases in our case. We shall proceed to iterate through the entire dataframe, adding these [latitude, longitude, confirmed cases] triples for every date column to one large list. We end up splitting this list in two so we may create a heat map for each of the first and second halves of the outbreak.
# Start with the [lat, lon, weight] triples for the first date in the dataset
all_data = new_data[['Lat', 'Long', '1/22/20']].to_numpy()
for column in new_data.columns:
    # Skip the coordinate columns and the date we already added
    if column in ('Lat', 'Long', '1/22/20'):
        continue
    heat_data = new_data[['Lat', 'Long', column]].to_numpy()
    heat_data = np.nan_to_num(heat_data)
    all_data = np.append(all_data, heat_data, axis=0)
all_data = np.nan_to_num(all_data)

# Split the accumulated triples into the first and second halves of the outbreak
split = math.ceil(len(all_data) / 2)
first_half = all_data[:split]
second_half = all_data[split:]
Here is the heatmap representing the first half of the outbreak. The heat map's parameters were chosen by experimentation; these values represent the data best.
from folium import plugins
m1 = folium.Map()
plugins.HeatMap(first_half, radius=10, min_opacity=0.1, gradient={.6: 'blue', .9: 'lime', .98: 'red'}).add_to(m1)
m1
And here is the heatmap for the second half of the outbreak.
m2 = folium.Map()
plugins.HeatMap(second_half, radius=10, min_opacity=0.1, gradient={.6: 'blue', .9: 'lime', .98: 'red'}).add_to(m2)
m2
The heatmaps display a strong concentration in Asia during the first half of the outbreak, as expected since the virus originated there, with a modest spread across the rest of the world. The second half reveals an almost uniformly severe concentration across all continents. As COVID consumes our measly little heat map, all we can do is pray and watch.
Now we'll use the intuition built from the heatmap to explore trends graphically. We'll start fresh with the raw dataframes.
raw_data_confirmed.head()
First, we'll drop the Lat, Long, and Province/State columns from the raw dataframes. We're only interested in values at the continent level, so these columns aren't useful to us. Yes, countries with multiple Province/State values will now appear as multiple rows with the same Country/Region value; for example, Canada will have one row per province. This is okay because we'll aggregate these rows later.
confirmed_by_country = raw_data_confirmed.drop(columns=['Lat', 'Long', 'Province/State'])
deaths_by_country = raw_data_deaths.drop(columns=['Lat', 'Long', 'Province/State'])
recovered_by_country = raw_data_recovered.drop(columns=['Lat', 'Long', 'Province/State'])
recovered_by_country.head()
Next, we'll transpose the dataframe so the Country/Region names are the columns and the date strings are the rows. Setting the index to Country/Region before transposing ensures the Country/Region will be the column names in the transposed dataframe. We transposed the dataframe because it'll be easier to plot using our imported libraries.
# Transpose so country is columns and date strings are rows
confirmed_by_country = confirmed_by_country.set_index('Country/Region').T
deaths_by_country = deaths_by_country.set_index('Country/Region').T
recovered_by_country = recovered_by_country.set_index('Country/Region').T
recovered_by_country.head()
Now we have to account for a subtle detail. The index values of the rows look like dates, but they're in fact strings. We'll have to convert these strings to datetime objects if we want to plot our metrics over time using a line chart. Since all three dataframes have matching row indices, we only convert the index of confirmed_by_country to a DatetimeIndex and then apply it to all three dataframes.
dt_series = pd.to_datetime(confirmed_by_country.index)
datetime_index = pd.DatetimeIndex(dt_series.values)
confirmed_by_country_dt = confirmed_by_country.set_index(datetime_index)
deaths_by_country_dt = deaths_by_country.set_index(datetime_index)
recovered_by_country_dt = recovered_by_country.set_index(datetime_index)
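As a quick sanity check, we can confirm the conversion took (a minimal sketch using the dataframes above):
# The index should now be a DatetimeIndex running from 1/22/20 to the present
print(type(confirmed_by_country_dt.index))
print(confirmed_by_country_dt.index.min(), confirmed_by_country_dt.index.max())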
Before we can plot the confirmed, death, and recovered metrics for each continent, we need two more pieces of information about each country: its population and its continent.
To obtain this data we're using the Python package CountryInfo. It's important to note that some countries listed in the dataframe do not have entries in the CountryInfo package. We drop these countries from our dataset. We don't believe this will impact our analysis because they are only small nations with few cases.
pop_dict = {}
region_dict = {}
unlisted_countries = []
# Look up each country's population and continent ("region" in CountryInfo)
for country in raw_data_confirmed['Country/Region']:
    try:
        info = CountryInfo(country).info()
        pop_dict[country] = info['population']
        region_dict[country] = info['region']
    except Exception:
        # CountryInfo has no entry for this country; track it so we can drop it
        unlisted_countries.append(country)
print("UNLISTED COUNTRIES: ", unlisted_countries)
confirmed_by_country_dt.drop(columns=unlisted_countries, inplace=True)
deaths_by_country_dt.drop(columns=unlisted_countries, inplace=True)
recovered_by_country_dt.drop(columns=unlisted_countries, inplace=True)
Now that we have each country's population and continent, we calculate the population of each continent and divide it by 100,000, so that counts can be expressed per 100,000 residents.
continent_populations = {continent: 0 for continent in region_dict.values()}
for country, pop in pop_dict.items():
    continent_populations[region_dict[country]] += pop
# Dividing a raw count by these values yields a per-100,000-residents rate
continent_populations_standardized = {continent: pop / 100000 for continent, pop in continent_populations.items()}
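To illustrate the standardization: a continent's denominator is its population divided by 100,000, so dividing any raw count by it gives a per-100,000 rate. A quick sketch printing the denominators we just computed:
# e.g., 1,000,000 cases on a continent with denominator 46,000
# works out to about 21.7 cases per 100,000 residents
for continent, denom in continent_populations_standardized.items():
    print(continent, round(denom))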
Next, we'll map each country to a continent in our dataframes using the data from the CountryInfo package.
# Assign each country column to a region
confirmed_by_region = confirmed_by_country_dt.rename(columns=region_dict)
deaths_by_region = deaths_by_country_dt.rename(columns=region_dict)
recovered_by_region = recovered_by_country_dt.rename(columns=region_dict)
Now we'll consider continents that are listed multiple times. It makes sense that many columns share the same continent label because a continent contains many countries.
recovered_by_region.tail()
We'll sum all columns with the same continent name and standardize their values using the continent populations calculated earlier.
confirmed_by_region = confirmed_by_region.groupby(confirmed_by_region.columns, axis=1).sum()
deaths_by_region = deaths_by_region.groupby(deaths_by_region.columns, axis=1).sum()
recovered_by_region = recovered_by_region.groupby(recovered_by_region.columns, axis=1).sum()
confirmed_by_region_standardized = pd.DataFrame()
deaths_by_region_standardized = pd.DataFrame()
recovered_by_region_standardized = pd.DataFrame()
# Convert each continent's raw counts to rates per 100,000 residents
for cont in confirmed_by_region.columns:
    confirmed_by_region_standardized[cont] = confirmed_by_region[cont] / continent_populations_standardized[cont]
    deaths_by_region_standardized[cont] = deaths_by_region[cont] / continent_populations_standardized[cont]
    recovered_by_region_standardized[cont] = recovered_by_region[cont] / continent_populations_standardized[cont]
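As a quick check of the result, we can peek at the standardized values:
deaths_by_region_standardized.tail()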
Finally, we'll plot the deaths, confirmed cases, and recoveries for all continents.
fig, axs = plt.subplots(ncols=3,figsize=(25,5))
axs[0].set_title("Deaths Per 100,000 Residents")
axs[1].set_title("Confirmed Cases Per 100,000 Residents")
axs[2].set_title("Recoveries Per 100,000 Residents")
# Gets end of every month to set as x ticks
month_dates = deaths_by_region.resample('M').sum().index.strftime('%Y-%m-%d')
axs[0].set_xticklabels(labels=month_dates, rotation=45, ha='right')
axs[1].set_xticklabels(labels=month_dates, rotation=45, ha='right')
axs[2].set_xticklabels(labels=month_dates, rotation=45, ha='right')
sns.lineplot(data=deaths_by_region_standardized, ax=axs[0])
sns.lineplot(data=confirmed_by_region_standardized, ax=axs[1])
sns.lineplot(data=recovered_by_region_standardized, ax=axs[2])
It's clear that not all continents were able to contain the Coronavirus equally. For example, Asia, Africa, and Oceania have been able to maintain relatively low death tolls and confirmed cases. All three have stayed below 500 confirmed cases and below 10 deaths per 100,000 residents.
On the other hand, the Americas and Europe have had a much more difficult time stopping the virus's spread. Cases and deaths have risen consistently in the Americas since March of 2020. Europe seemed to contain the virus from April 2020 to October 2020 after an initial outbreak in March, but has since lost control.
It's clear that Oceania, Asia, and Africa have the closest values for death totals. We want to see whether the mean increase in deaths per day is the same for each of those continents. We'll perform a paired t-test on the daily increase in deaths per 100,000 residents for each continent over the last 30 days.
Upon calculating the increase in deaths per day, it becomes clear that Oceania has nearly contained the virus: in the last 30 days, it has had 0 new COVID deaths each day (good for Oceania!). Because of this, we'll only perform the paired t-test between Africa and Asia.
Null hypothesis: The true mean difference between increase in deaths of Africa and Asia in the last 30 days is 0.
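For reference, stats.ttest_rel computes the standard paired t-statistic from the day-by-day differences $d_i$ between the two series:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where $\bar{d}$ is the mean of the differences, $s_d$ is their standard deviation, and $n$ is the number of paired observations.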
# Daily increases over the last month (diff() leaves the first row as NaN)
inc_deaths_last_30 = deaths_by_region_standardized[['Africa', 'Asia', 'Oceania']].iloc[-30:-1].diff()
inc_deaths_last_30.describe()
# Drop the leading NaN row and compare Africa's daily increases to Asia's
stats.ttest_rel(inc_deaths_last_30['Africa'][1:], inc_deaths_last_30['Asia'][1:])
Since the p-value is 0.01473902, we reject the null hypothesis that the mean difference between the daily increase in deaths in Africa and Asia over the last 30 days is 0. At the 95% confidence level, we can say that Asia has had a greater average increase in deaths over the last 30 days.
From the Deaths graph above, it appears that the number of deaths in the Americas increases linearly with time after 3/31/2020. In order to predict the number of deaths the Americas might see in the future, we'll fit a linear model to this data.
To see if a linear model fits the data well, we'll calculate an R^2 value, which tells us how much of the increase in total deaths is explained by the progression of time.
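For reference, the R^2 score compares the model's squared error to the total variance of the data:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ are the observed death totals, $\hat{y}_i$ are the model's predictions, and $\bar{y}$ is the mean of the observed values. A score near 1 means the progression of time accounts for nearly all of the variation.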
It's important to note that our linear model and R^2 calculation don't accept datetimes as parameters; we have to convert the datetimes to a scalar value. To do this, we'll encode each datetime as the number of days since 3/31/2020. This retains all the relevant information and converts the data into a format ingestible by sklearn.
# Row 69 corresponds to 3/31/2020, where the linear trend begins
deaths_americas = deaths_by_region_standardized['Americas'].iloc[69:]
# Encode each date as the number of days since 3/31/2020
days_since_3312020 = (deaths_americas.index - deaths_americas.index[0]).days.values.reshape(-1, 1)
reg = LinearRegression().fit(days_since_3312020, deaths_americas.values)
# r2_score expects (y_true, y_pred)
r2_score(deaths_americas.values, reg.predict(days_since_3312020))
The linear model receives an R^2 score of 0.998. This means the model explains 99.8% of the variance in total deaths in the Americas from 3/31/2020 to the current day. We'll confirm the accuracy of our model by plotting it against the true values.
fig, axs = plt.subplots(figsize=(10,5))
axs.set_title("Deaths Per 100,000 Residents")
# Gets end of every month to set as x ticks
month_dates = deaths_by_region.resample('M').sum().index.strftime('%Y-%m-%d')
axs.set_xticklabels(labels=month_dates, rotation=45, ha='right')
sns.lineplot(data=deaths_by_region_standardized[['Americas']], ax=axs)
sns.lineplot(x=deaths_americas.index, y=reg.predict(days_since_3312020).reshape(-1), color='red', ax=axs)
We'll now use the model to predict the total deaths in the upcoming weeks.
I am writing this section on Dec. 14, and the final deliverable is due on Dec. 21. We'll have our model predict the total deaths in the Americas for that date. If you're interested, you can run the cell during grading and see how close our prediction was.
import datetime
# Encode 12/21/2020 as days since 3/31/2020, then predict
pred = reg.predict(np.array([(datetime.datetime(2020, 12, 21) - deaths_americas.index[0]).days]).reshape(-1, 1))
print("The total deaths per 100,000 residents in the Americas on 12/21/2020 is predicted to be: ", pred[0])
From our analysis of COVID metrics by continent, it's clear that not every region has been able to contain the virus effectively. Some regions, for example Africa and Asia, have consistently maintained low numbers of deaths and confirmed cases. Oceania has managed to curb community transmission, with nearly 0 new confirmed cases in the last 30 days. The Americas and Europe, on the other hand, have been unable to contain the virus's spread. Total deaths in the Americas have been increasing linearly since March 2020. Cases in Europe plateaued between April and September 2020 but have since started increasing again.
Since we standardized all of our values by population, we can assume that there are certain policies or environmental factors in regions like Oceania and Africa that allowed them to control the virus. It isn't the job of data analysts (or 320 students) to make medical recommendations; however, we can say that it may be useful for Europe and the Americas to look to the tactics of those better-performing regions for ideas on how to reduce community transmission of COVID-19.