This project is also available at https://www.covid19.victoryap.com/

Table of contents

1 Importing Packages

Section A: Exploratory Data Analysis

2 Exploratory Data Analysis
3 Cases
= 3.1. Overview of the cases in Korea
== 3.1.1. Column to use
== 3.1.2. Map Visualisation of cases in Korea
= 3.2. Check the relationship between group and solo cases
== 3.2.1. Bar chart visualisation of cases in Korea
== 3.2.2. Calculation of individual & group cases
= 3.3. Source of infected cases sort by source and province
= 3.4. Risk Level of each province
4 Patient Info
= 4.1. Basic Analysis
== 4.1.1. Numbers Of Cases Per Day
== 4.1.2. Does infection have something to do with gender, age, or geographical location?
= 4.2. Average Survivor Treatment Day (Recovery Speed), By Age and Gender
= 4.3. Average Survivor Treatment Day (Recovery Speed) By City & Age
= 4.4. Average Deceased Treatment Day (Fatality Speed) By Gender and Age
= 4.5. Average Deceased Treatment Day (Fatality Speed) By City and Age
= 4.6. Number of Infection, Survivor, Deceased Per City by percentage
= 4.7. Number of Infection, Survivor, Deceased Sort By Gender
= 4.8. Network Diagram

Section B: Time Series Analysis

5 Policy & Search Trend
6 Time
= 6.1. Test & Negative cases daily average increase amount by day
= 6.2. Test & Negative cases daily average increase amount by month
= 6.3. Confirmed, released & deceased cases daily average increase amount by day
= 6.4. Confirmed, released & deceased cases daily average increase amount by month
= 6.5. Temporal Distribution and Prediction
= 6.6. Stationarity Testing
= 6.7. Seasonal Decomposition
= 6.8. ARIMA Modelling
= 6.9. ETS Modelling
= 6.10. Facebook Prophet
7 Time Age
= 7.1. Deceased vs confirmed cases across different age group over time
= 7.2. Testing of Homogeneous Mortality Hypothesis
= 7.3. Does the TimeAge dataset tally with patient.csv?
= 7.4. Deceased Rate
8 Time Gender
= 8.1. Deceased vs confirmed cases across different gender over time
= 8.2. Confirmed cases daily average increase amount by month (Sort by gender)
= 8.3. Confirmed cases daily average increase amount by day (Sort by gender)
= 8.4. Deceased cases daily average increase amount by month (Sort by gender)
= 8.5. Deceased cases daily average increase amount by day (Sort by gender)
= 8.6. Mortality Rates across Gender Groups
9 Time Province
= 9.1. Deceased vs confirmed cases across different province over time (With Daegu)
= 9.2. Deceased vs confirmed cases across different province over time (Without Daegu)
= 9.3. Mortality Rates across Provinces
= 9.4. Does the TimeProvince dataset tally with cases.csv & patient.csv?
10 In depth Analysis Of Daegu Cases using Technical Analysis
= 10.1 Technical Analysis
= 10.2 Stats By Month
= 10.3 Stats By Day
11 Time Series World Map Visualisation of COVID-19 cases in Korea
= 11.1 With Animation
= 11.2 Without animation
12 Weather
= 12.1 Busan Avg Temp (Mean) vs Most Wind Direction (Mean) Analysis for days with wind direction greater than 100
= 12.2 Do other regions have the same relationship between avg_temp and most_wind_direction?
13 Region
= 13.1 Number of nursing homes and elder population ratio across Korea
= 13.2 Correlation between elder population ratio and number of cases

Section C: Web Scraping

14 Web Scraping: Comparison across the world
= 14.1 Getting the data from the table
= 14.2 Convert to data
= 14.3 Worldwide Total Cases Chart
= 14.4 Worldwide Total Death Chart
= 14.5 Scatterplot between total cases and total death

Section D: Summary from EDA

15 Summarise Key findings from Exploratory Data Analysis
= 15.1 Overall Cases in Korea
= 15.2 Patient Cases
= 15.3 Policy
= 15.4 Cases Over Time Analysis
= 15.5 Cases Over Age & Time Analysis
= 15.6 Cases Over Gender & Time Analysis
= 15.7 Cases Over Province & Time Analysis
= 15.8 Weather
= 15.9 Number of nursing home vs Population Ratio

Section E: Predictive modelling

16 Preparing the data for modelling
= 16.1 Get the necessary columns
= 16.2 Calculate number of days of treatment before dropped date column
= 16.3 Convert state and gender columns to categorical data
= 16.4 Get dummy for province
17 Correlation Matrix
18 Building Machine Learning Models Part 1
= 18.1 Stochastic Gradient Descent (SGD)
= 18.2 Random Forest
= 18.3 Logistic Regression
= 18.4 Gaussian Naive Bayes
= 18.5 K Nearest Neighbor
= 18.6 Perceptron
= 18.7 Linear Support Vector Machine
= 18.8 Decision Tree
= 18.9 Getting the best model
= 18.10 Decision Tree Diagram
= 18.11 What is the most important feature ?
= 18.12 Confusion Matrix with Precision & Recall & F-Score
19 Building Machine Learning Models Part 2
= 19.1 Random Forest
= 19.2 Decision Tree
= 19.3 Getting the best model
= 19.4 Decision Tree Diagram
= 19.5 Importance
= 19.6 Confusion Matrix with Precision & Recall & F-Score
20 Refine Dataset for Machine Learning
= 20.1 Preparing the data for modelling
= 20.2 Building Machine Learning Models Part 3
= 20.3 Importances
= 20.4 Confusion Matrix with Precision & Recall & F-Score
= 20.5 Summary

Section F: Final Summary

21 Steps to take to control the COVID-19 situation even better
= 21.1 Abraham Wald and the Missing Bullet Holes
= 21.2 Survivorship Bias
= 21.3 Improving the situation
22 References
23 Appendix
24 Contribution Statements

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

1) Importing Packages

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import networkx as nx

import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
px.set_mapbox_access_token(open("./mapbox_token").read())
from plotly.subplots import make_subplots

from matplotlib import gridspec
from bs4 import BeautifulSoup
from warnings import filterwarnings
filterwarnings('ignore')
from datetime import datetime 
import requests


# Algorithms
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

# Matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve

# Diagram
from sklearn.tree import export_graphviz
import pydotplus
from sklearn.externals.six import StringIO  
from IPython.display import Image

from keras.models import Sequential
from keras.layers import LSTM, Dense

from dateutil.easter import easter
from fbprophet import Prophet

from statsmodels.tsa.holtwinters import ExponentialSmoothing, Holt
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller, kpss

import pmdarima as pm
import scipy.stats as scs

2) Exploratory Data Analysis

3) Cases

3.1) Overview of the cases in Korea

3.1.1) Column to use

In [3]:
case = pd.read_csv("covid/Case.csv")
case.groupby("city")["confirmed"].sum().sort_values(ascending = False)
case.groupby("city")["confirmed"].sum().sort_values(ascending = False).iloc[1:3]/case["confirmed"].sum() * 100
Out[3]:
city
-                  31.812198
from other city    10.680123
Name: confirmed, dtype: float64

Since around 42% of the confirmed cases have no classified city ('-' or 'from other city'), we shall use province instead.
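
As a quick sanity check (a minimal sketch reusing the case DataFrame loaded above), the unclassified share can be computed directly:

In [ ]:
# Minimal sketch: share of confirmed cases whose city field is unclassified ('-' or 'from other city')
unclassified = case.loc[case["city"].isin(["-", "from other city"]), "confirmed"].sum()
print("Unclassified city share: {:.1f}%".format(unclassified / case["confirmed"].sum() * 100))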

3.1.2) Map Visualisation of cases in Korea

In [4]:
caseData = pd.read_csv('covid/case.csv')
caseDataForMap = caseData.copy()
caseDataForMap = caseDataForMap[(caseDataForMap['latitude'] != '-') & (caseDataForMap['longitude'] != '-')]

cols = ['latitude', 'longitude']
caseData[cols] = caseData[cols].apply(pd.to_numeric, errors='coerce', axis=1)
display(caseData.head())

fig = px.scatter_mapbox(
    caseData[caseData.latitude != '-'], 
    text="<br>City: " + caseData["city"] +" <br>Province: " + caseData["province"],
    lat="latitude", 
    lon="longitude",     
    color="confirmed", 
    size="confirmed",
    color_continuous_scale=px.colors.sequential.Burg,
    size_max=100, 
    mapbox_style='dark',
    zoom=6,
    title="Overview of the cases in Korea")
fig.show()
case_id province city group infection_case confirmed latitude longitude
0 1000001 Seoul Yongsan-gu True Itaewon Clubs 139 37.538621 126.992652
1 1000002 Seoul Gwanak-gu True Richway 119 37.482080 126.901384
2 1000003 Seoul Guro-gu True Guro-gu Call Center 95 37.508163 126.884387
3 1000004 Seoul Yangcheon-gu True Yangcheon Table Tennis Club 43 37.546061 126.874209
4 1000005 Seoul Dobong-gu True Day Care Center 43 37.679422 127.044374

The majority of the cases are concentrated in Daegu, while the rest of the areas have lower numbers of cases.
There appear to be two large COVID-19 clusters in Korea: one in the province of Daegu and another in Seoul.

3.2) Check the relationship between group and solo cases

3.2.1) Bar chart visualisation of cases in Korea

In [5]:
caseData = pd.read_csv('covid/case.csv')
caseDataForMap = caseData.copy()
caseDataSorted = caseDataForMap.sort_values(by=['confirmed'], ascending=False)
display(caseDataSorted.head())

sortedValues = caseDataForMap.groupby(['province','group']).sum().sort_values(by=['confirmed'], ascending=False).reset_index()
sortedValues = sortedValues[['province','confirmed','group']]
display(sortedValues.head())

fig = px.bar(sortedValues, x='confirmed', y='province', color='group', barmode='group')
fig.update_layout(hovermode='y')
fig.show()
case_id province city group infection_case confirmed latitude longitude
48 1200001 Daegu Nam-gu True Shincheonji Church 4511 35.84008 128.5667
56 1200009 Daegu - False contact with patient 917 - -
57 1200010 Daegu - False etc 747 - -
145 6000001 Gyeongsangbuk-do from other city True Shincheonji Church 566 - -
109 2000020 Gyeonggi-do - False overseas inflow 305 - -
province confirmed group
0 Daegu 4975 True
1 Daegu 1705 False
2 Gyeongsangbuk-do 979 True
3 Seoul 720 True
4 Seoul 560 False

In general, there are more group cases than non-group cases in Korea.
The only prominent exception is Busan.
This may imply that people are more prone to catching COVID-19 when they gather in groups.

3.2.2) Calculation of individual & group cases

In [6]:
case_pivot = case.pivot_table(values = "confirmed",index = "province",columns = "group", aggfunc = "sum")
case_pivot = pd.DataFrame(case_pivot.to_records())
case_pivot.rename(columns = {"False": "Group_False","True": "Group_True"}, inplace = True)
case_pivot.set_index("province", inplace = True)
print("There are a total of {} individual cases".format(case_pivot["Group_False"].sum()))
print("There are a total of {} group cases".format(case_pivot["Group_True"].sum()))
There are a total of 3544 individual cases
There are a total of 7851 group cases

Where are the reported cases located? Are they in group or individual reports?

The majority of the reported cases are located in Daegu province. The virus spreads easily in a group setting.
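
As a quick illustration (a minimal sketch reusing the case_pivot table built above), group cases make up roughly 69% of the reported cases:

In [ ]:
# Minimal sketch: group cases as a share of all classified cases
group_total = case_pivot["Group_True"].sum()
solo_total = case_pivot["Group_False"].sum()
print("Group cases account for {:.1f}% of reported cases".format(group_total / (group_total + solo_total) * 100))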

3.3) Source of infected cases sort by source and province

In [7]:
locations = caseData.pivot("province", "infection_case", "confirmed")
display(locations.head())

f, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(locations, cmap="gnuplot", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Numbers of cases sort by source and province')
infection_case Anyang Gunpo Pastors Group Biblical Language study meeting Bonghwa Pureun Nursing Home Bundang Jesaeng Hospital Changnyeong Coin Karaoke Cheongdo Daenam Hospital Coupang Logistics Center Daejeon door-to-door sales Daesil Convalescent Hospital Daezayeon Korea ... Yangcheon Table Tennis Club Yechun-gun Yeonana News Class Yeongdeungpo Learning Institute Yongin Brothers contact with patient etc gym facility in Cheonan gym facility in Sejong overseas inflow
province
Busan NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 19.0 30.0 NaN NaN 36.0
Chungcheongbuk-do NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 8.0 11.0 NaN NaN 13.0
Chungcheongnam-do NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2.0 12.0 103.0 NaN 16.0
Daegu NaN NaN NaN NaN NaN 2.0 NaN NaN 101.0 NaN ... NaN NaN NaN NaN NaN 917.0 747.0 NaN NaN 41.0
Daejeon NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 15.0 15.0 NaN NaN 15.0

5 rows × 81 columns

Out[7]:
Text(0.5, 1, 'Numbers of cases sort by source and province')
In [9]:
fig = px.sunburst(caseDataSorted, 
                  path=['province', 'infection_case'], 
                  values='confirmed',
                  width=1000, height=600)
fig.show()

The top covid sources for

  • Daegu is Shincheonji church
  • Gyeongsangbuk-do is Shincheonji church
  • Seoul is overseas inflow
  • Gyeonggi-do is overseas inflow
  • Incheon is overseas inflow
In [10]:
fig = px.sunburst(caseDataSorted, 
                  path=['infection_case', 'province'], 
                  values='confirmed',
                  width=1000, height=600)
fig.show()

The top 5 sources for covid-19 comes from

  • Shincheonji Church
  • Contact with patient
  • Unknown Sources
  • Overseas inflow
  • Itaewon clubs

3.4) Risk Level of each province

In [11]:
caseData = pd.read_csv('covid/case.csv')
caseDataForMap = caseData.copy()
caseDataForMap['logConfirmed'] = np.log(caseDataForMap['confirmed'])
display(caseDataForMap.head())

fig = px.box(caseDataForMap, x="logConfirmed", y="province")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.update_layout(hovermode='y')
fig.show()
case_id province city group infection_case confirmed latitude longitude logConfirmed
0 1000001 Seoul Yongsan-gu True Itaewon Clubs 139 37.538621 126.992652 4.934474
1 1000002 Seoul Gwanak-gu True Richway 119 37.48208 126.901384 4.779123
2 1000003 Seoul Guro-gu True Guro-gu Call Center 95 37.508163 126.884387 4.553877
3 1000004 Seoul Yangcheon-gu True Yangcheon Table Tennis Club 43 37.546061 126.874209 3.761200
4 1000005 Seoul Dobong-gu True Day Care Center 43 37.679422 127.044374 3.761200
In [12]:
caseDataForMap = caseData.copy()
sortedValues = caseDataForMap.groupby(['province']).sum().sort_values(by=['confirmed'], ascending=False).reset_index()
sortedValues['logConfirmed'] = np.log(sortedValues['confirmed'])
sortedValues = sortedValues[['province','confirmed','logConfirmed']]
display(sortedValues.head())
province confirmed logConfirmed
0 Daegu 6680 8.806873
1 Gyeongsangbuk-do 1324 7.188413
2 Seoul 1280 7.154615
3 Gyeonggi-do 1000 6.907755
4 Incheon 202 5.308268
In [13]:
print('Mean: ', sortedValues.mean().values)
print('Standard Deviation: ', sortedValues.std().values)
Mean:  [670.29411765   4.96126521]
Standard Deviation:  [1610.8280776     1.65884891]

The number of COVID-19 cases per province has a mean of 670 and a standard deviation of 1610.

Hence we sort the provinces into 4 risk levels:

  • Very High Risk = 4 (2000 cases and above)
  • High Risk = 3 (1000 - 1999 cases)
  • Medium Risk = 2 (100 - 999 cases)
  • Low Risk = 1 (Below 100 cases)
In [14]:
def calculate_risk_level(sortedValues) :
    if sortedValues['confirmed'] >= 2000:
        return 4
    elif sortedValues['confirmed'] >= 1000 and sortedValues['confirmed'] < 2000:
        return 3
    elif sortedValues['confirmed'] >= 100 and sortedValues['confirmed'] < 1000:
        return 2
    else:
        return 1
    
sortedValues['Risk Level'] = sortedValues.apply(calculate_risk_level,axis=1);
sortedValues = sortedValues[['province','Risk Level']]
display(sortedValues.head())
province Risk Level
0 Daegu 4
1 Gyeongsangbuk-do 3
2 Seoul 3
3 Gyeonggi-do 3
4 Incheon 2

4) Patient Info

4.1) Basic Analysis

4.1.1) Numbers Of Cases Per Day

In [15]:
patientData = pd.read_csv('covid/patientinfo.csv')
display(patientData.head())

groupPatientData = patientData.groupby('confirmed_date').size().reset_index()
groupPatientData.columns = ['confirmed_date', 'count']

fig = px.line(groupPatientData, x="confirmed_date", y="count", title='Numbers Of Covid Cases Per Day')
fig.update_layout(hovermode='x')
fig.show()
patient_id sex age country province city infection_case infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state
0 1000000001 male 50s Korea Seoul Gangseo-gu overseas inflow NaN 75 2020-01-22 2020-01-23 2020-02-05 NaN released
1 1000000002 male 30s Korea Seoul Jungnang-gu overseas inflow NaN 31 NaN 2020-01-30 2020-03-02 NaN released
2 1000000003 male 50s Korea Seoul Jongno-gu contact with patient 2002000001 17 NaN 2020-01-30 2020-02-19 NaN released
3 1000000004 male 20s Korea Seoul Mapo-gu overseas inflow NaN 9 2020-01-26 2020-01-30 2020-02-15 NaN released
4 1000000005 female 20s Korea Seoul Seongbuk-gu contact with patient 1000000002 2 NaN 2020-01-31 2020-02-24 NaN released

The COVID case count spikes in the Feb-Mar period before gradually declining over the Mar-May period.
During June, the cases had a minor rebound.

4.1.2) Does infection have something to do with gender, age, or geographical location?

In [16]:
patient = pd.read_csv("covid/PatientInfo.csv")
patient.head()
Out[16]:
patient_id sex age country province city infection_case infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state
0 1000000001 male 50s Korea Seoul Gangseo-gu overseas inflow NaN 75 2020-01-22 2020-01-23 2020-02-05 NaN released
1 1000000002 male 30s Korea Seoul Jungnang-gu overseas inflow NaN 31 NaN 2020-01-30 2020-03-02 NaN released
2 1000000003 male 50s Korea Seoul Jongno-gu contact with patient 2002000001 17 NaN 2020-01-30 2020-02-19 NaN released
3 1000000004 male 20s Korea Seoul Mapo-gu overseas inflow NaN 9 2020-01-26 2020-01-30 2020-02-15 NaN released
4 1000000005 female 20s Korea Seoul Seongbuk-gu contact with patient 1000000002 2 NaN 2020-01-31 2020-02-24 NaN released
In [17]:
patient.isna().sum()
Out[17]:
patient_id               0
sex                   1122
age                   1380
country                  0
province                 0
city                    94
infection_case         919
infected_by           3819
contact_number        4374
symptom_onset_date    4475
confirmed_date           3
released_date         3578
deceased_date         5099
state                    0
dtype: int64

Since we're looking at gender, age and geographical location, dropping the rows with NaN values in the age column suffices, as age has the highest NaN count among these columns.

In [18]:
patient.dropna(subset = ["age"], inplace = True)
patient.isna().sum()
Out[18]:
patient_id               0
sex                      3
age                      0
country                  0
province                 0
city                    89
infection_case         827
infected_by           2824
contact_number        3008
symptom_onset_date    3231
confirmed_date           3
released_date         2209
deceased_date         3719
state                    0
dtype: int64
In [19]:
patient.patient_id.count()
Out[19]:
3785

We have 3,785 patient records as opposed to 11,395 reported cases in case.csv, so the patient info is a small subset of the total number of cases. Let's look into the dataset.

In [20]:
pct_gender = patient.groupby("sex")["patient_id"].count()/patient["patient_id"].count()
pct_gender
Out[20]:
sex
female    0.546367
male      0.452840
Name: patient_id, dtype: float64
In [21]:
def plot_bar(df, palette = None):
    fig, ax = plt.subplots(figsize = (8,5))
    sns.barplot(x = df.index, y = df, palette = palette)
    ax.tick_params(axis = "both", labelsize = 15)
In [22]:
plot_bar(pct_gender, "Paired")
plt.xlabel("Gender", size = 15)
plt.ylabel("Confirmed Cases in %", size = 15)
plt.title("Confirmed Cases by Gender", size = 15)
plt.show()

The gender distribution seems fairly even.

In [23]:
patient_age = patient.groupby("age")["patient_id"].count().drop(index = "100s").sort_index()
patient_age
Out[23]:
age
0s      66
10s    178
20s    899
30s    523
40s    518
50s    667
60s    482
70s    232
80s    170
90s     49
Name: patient_id, dtype: int64
In [24]:
plot_bar(patient_age)
plt.xlabel("Age", size = 15)
plt.ylabel("Number of patients", size = 15)
plt.title("Age Group of Patient", size = 15)
plt.show()

There are many confirmed cases in the 20s age group. This is probably because they have more social activities, such as attending university, hanging out with friends and dating.

In [25]:
patient_province = patient.groupby("province")["patient_id"].count()
In [26]:
plot_bar(patient_province, "rocket")
plt.xlabel("Age", size = 15)
plt.ylabel("Number of patients", size = 15)
plt.title("Province of confirmed cases", size = 15)
plt.xticks(rotation = 90)
plt.show()

The PatientInfo.csv data is not representative of the overall dataset, since Daegu does not have the highest number of confirmed cases here, unlike in Case.csv.
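
To make this concrete, here is a small sketch (assuming the case and patient DataFrames loaded earlier) that compares each province's share of patient records with its share of confirmed cases:

In [ ]:
# Minimal sketch: province shares in PatientInfo.csv vs Case.csv
patient_share = patient.groupby("province")["patient_id"].count() / patient["patient_id"].count() * 100
case_share = case.groupby("province")["confirmed"].sum() / case["confirmed"].sum() * 100
comparison = pd.concat([patient_share.rename("patient %"), case_share.rename("case %")], axis=1)
display(comparison.sort_values("case %", ascending=False).head())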

4.2) Average Survivor Treatment Day (Recovery Speed), By Age and Gender

In [27]:
confinedDaysAnalysis = patientData.copy()
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['age'] != '100s'] #Just a single data which is highly skewed
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['released_date'].notnull()]

cols = ['released_date', 'confirmed_date']
confinedDaysAnalysis[cols] = confinedDaysAnalysis[cols].apply(pd.to_datetime, errors='coerce', axis=1)
confinedDaysAnalysis['Total Treatment Days'] = (confinedDaysAnalysis['released_date'] - confinedDaysAnalysis['confirmed_date']).dt.days

groupedData = confinedDaysAnalysis.groupby(['sex', 'age'])['Total Treatment Days'].mean().unstack().stack().reset_index()
groupedData.columns = ["sex", "age", "Average Treatment Days"]

dataForHeatmap = groupedData.pivot("sex", "age", "Average Treatment Days")
display(dataForHeatmap.head())

f, ax = plt.subplots(figsize=(15, 5))
sns.heatmap(dataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Average Survivor Treatment Day, Sort by Age & Gender')
age 0s 10s 20s 30s 40s 50s 60s 70s 80s 90s
sex
female 28.375000 19.80 23.659091 23.415094 22.025806 23.310345 26.543689 31.344828 33.738095 28.461538
male 21.727273 21.45 23.275132 22.971429 26.581395 24.666667 27.257143 36.200000 37.055556 33.000000
Out[27]:
Text(0.5, 1, 'Average Survivor Treatment Day, Sort by Age & Gender')

In general, the older the patient, the longer it will take for them to recover

In [28]:
confinedDaysAnalysis['age'] = confinedDaysAnalysis['age'].dropna()
age = {'0s':0, '10s':1, '20s':2, '30s':3, '40s':4, '50s':5, '60s':6, '70s':7, '80s':8, '90s':9}
confinedDaysAnalysis['age'] = confinedDaysAnalysis['age'].map(age)
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['age'].notna()]

fig = px.scatter(confinedDaysAnalysis, x="age", y="Total Treatment Days", color="sex", trendline="ols", title="Does age group affect total treatment days?")
fig.update_layout(hovermode='y')
fig.show()

With an R-squared value of 0.03, it seems that the age of the patient has little effect on the number of days it takes them to recover from COVID-19.
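
As a side note, the R-squared can also be read directly off the figure. A small sketch, assuming a plotly version that provides px.get_trendline_results:

In [ ]:
# Minimal sketch: extract the per-gender OLS trendline results from the figure above
trend = px.get_trendline_results(fig)
for _, row in trend.iterrows():
    print(row["sex"], "R^2 =", round(row["px_fit_results"].rsquared, 3))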

4.3) Average Survivor Treatment Day (Recovery Speed) By City & Age

In [29]:
groupedData = confinedDaysAnalysis.groupby(['province', 'age'])['Total Treatment Days'].mean().unstack().stack().reset_index()
groupedData.columns = ["province", "age", "Average Treatment Days"]
groupedData['Average Treatment Days'] = np.ceil(groupedData['Average Treatment Days'].apply(pd.to_numeric, errors='coerce'))

dataForHeatmap = groupedData.pivot("province", "age", "Average Treatment Days")
display(dataForHeatmap.head())

f, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(dataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Average Survivor Treatment Day, Sort by City & Age')
age 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
province
Chungcheongbuk-do 49.0 20.0 24.0 23.0 23.0 25.0 23.0 26.0 26.0 20.0
Chungcheongnam-do 17.0 23.0 23.0 26.0 24.0 27.0 27.0 43.0 9.0 NaN
Daegu NaN NaN NaN NaN NaN 9.0 7.0 12.0 NaN NaN
Daejeon NaN 15.0 22.0 23.0 18.0 26.0 40.0 24.0 17.0 NaN
Gangwon-do NaN NaN 12.0 24.0 27.0 19.0 25.0 NaN NaN NaN
Out[29]:
Text(0.5, 1, 'Average Survivor Treatment Day, Sort by City & Age')

4.4) Average Deceased Treatment Day (Fatality Speed) By Gender and Age

In [30]:
confinedDaysAnalysis = patientData.copy()
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['age'] != '100s'] #Just a single data which is highly skewed
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['deceased_date'].notnull()]

cols = ['deceased_date', 'confirmed_date']
confinedDaysAnalysis[cols] = confinedDaysAnalysis[cols].apply(pd.to_datetime, errors='coerce', axis=1)
confinedDaysAnalysis['Total Treatment Days'] = (confinedDaysAnalysis['deceased_date'] - confinedDaysAnalysis['confirmed_date']).dt.days
display(confinedDaysAnalysis.head())

groupedData = confinedDaysAnalysis.groupby(['sex', 'age'])['Total Treatment Days'].mean().unstack().stack().reset_index()
groupedData.columns = ["sex", "age", "Average Treatment Days"]

dataForHeatmap = groupedData.pivot("sex", "age", "Average Treatment Days")
display(dataForHeatmap.head())

f, ax = plt.subplots(figsize=(15, 5))
sns.heatmap(dataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Average Deceased Treatment Day, Sort by Age & Gender')
patient_id sex age country province city infection_case infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state Total Treatment Days
1468 1200000038 female 50s Korea Daegu Nam-gu NaN NaN NaN NaN 2020-02-18 NaN 2020-02-23 deceased 5
1507 1200000114 male 70s Korea Daegu NaN Shincheonji Church NaN NaN NaN 2020-02-21 NaN 2020-02-26 deceased 5
1508 1200000620 male 70s Korea Daegu NaN NaN NaN NaN NaN 2020-02-24 NaN 2020-03-02 deceased 7
1509 1200000901 female 80s Korea Daegu NaN NaN NaN NaN NaN 2020-02-25 NaN 2020-03-04 deceased 8
1510 1200001064 female 70s Korea Daegu NaN NaN NaN NaN NaN 2020-02-26 NaN 2020-03-01 deceased 4
age 30s 50s 60s 70s 80s 90s
sex
female NaN 1.333333 8.0 27.0 11.615385 10.0
male 0.0 8.000000 10.9 5.5 10.666667 10.5
Out[30]:
Text(0.5, 1, 'Average Deceased Treatment Day, Sort by Age & Gender')

4.5) Average Deceased Treatment Day (Fatality Speed) By City and Age

In [31]:
groupedData = confinedDaysAnalysis.groupby(['province', 'age'])['Total Treatment Days'].mean().unstack().stack().reset_index()
groupedData.columns = ["province", "age", "Average Treatment Days"]
groupedData['Average Treatment Days'] = np.ceil(groupedData['Average Treatment Days'].apply(pd.to_numeric, errors='coerce'))

dataForHeatmap = groupedData.pivot("province", "age", "Average Treatment Days")
display(dataForHeatmap.head())

f, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(dataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Average Deceased Treatment Day, Sort by City & Age')
age 30s 50s 60s 70s 80s 90s
province
Daegu NaN 3.0 6.0 4.0 5.0 6.0
Daejeon NaN NaN NaN 52.0 NaN NaN
Gangwon-do NaN NaN NaN 15.0 24.0 NaN
Gyeonggi-do 0.0 NaN NaN NaN NaN NaN
Gyeongsangbuk-do NaN 7.0 11.0 17.0 13.0 12.0
Out[31]:
Text(0.5, 1, 'Average Deceased Treatment Day, Sort by City & Age')

4.6) Number of Infection, Survivor, Deceased Per City by percentage

In [32]:
survivorCountAnalysis = patientData.copy()

survivorCountAnalysis['survive'] = survivorCountAnalysis['released_date'].notnull()
survivorCountAnalysis['deceased'] = survivorCountAnalysis['deceased_date'].notnull()
survivorCountAnalysis['under treatment'] = survivorCountAnalysis['deceased_date'].isnull() & survivorCountAnalysis['released_date'].isnull()

provinceStats = survivorCountAnalysis.groupby(['province']).sum()
provinceStatsClean = provinceStats[['survive', 'deceased', 'under treatment']]

provinceStatsClean['survive %'] = np.round(provinceStatsClean['survive'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)
provinceStatsClean['deceased %'] = np.round(provinceStatsClean['deceased'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)
provinceStatsClean['under treatment %'] = np.round(provinceStatsClean['under treatment'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)

provinceStatsAbsolute = provinceStatsClean[['survive', 'deceased', 'under treatment']]
provinceStatsPercentage = provinceStatsClean[['survive %', 'deceased %', 'under treatment %']]

display(provinceStatsAbsolute.head())
display(provinceStatsPercentage.head())
survive deceased under treatment
province
Busan 0 0 151
Chungcheongbuk-do 50 0 6
Chungcheongnam-do 150 0 18
Daegu 4 20 113
Daejeon 44 1 74
survive % deceased % under treatment %
province
Busan 0.00 0.00 100.00
Chungcheongbuk-do 89.29 0.00 10.71
Chungcheongnam-do 89.29 0.00 10.71
Daegu 2.92 14.60 82.48
Daejeon 36.97 0.84 62.18
In [33]:
newTable = provinceStatsPercentage.reset_index()
display(newTable.head())

newTable = pd.melt(newTable, id_vars="province", var_name="stats", value_name="rate")
display(newTable.head())

fig = px.bar(newTable, x='rate', y='province', color='stats', barmode='group')
fig.update_layout(hovermode='y')
fig.show()
province survive % deceased % under treatment %
0 Busan 0.00 0.00 100.00
1 Chungcheongbuk-do 89.29 0.00 10.71
2 Chungcheongnam-do 89.29 0.00 10.71
3 Daegu 2.92 14.60 82.48
4 Daejeon 36.97 0.84 62.18
province stats rate
0 Busan survive % 0.00
1 Chungcheongbuk-do survive % 89.29
2 Chungcheongnam-do survive % 89.29
3 Daegu survive % 2.92
4 Daejeon survive % 36.97

4.7) Number of Infection, Survivor, Deceased Sort By Gender

In [34]:
survivorCountAnalysis = patientData.copy()

survivorCountAnalysis['survive'] = survivorCountAnalysis['released_date'].notnull()
survivorCountAnalysis['deceased'] = survivorCountAnalysis['deceased_date'].notnull()
survivorCountAnalysis['under treatment'] = survivorCountAnalysis['deceased_date'].isnull() & survivorCountAnalysis['released_date'].isnull()

provinceStats = survivorCountAnalysis.groupby(['sex']).sum()
provinceStatsClean = provinceStats[['survive', 'deceased', 'under treatment']]

provinceStatsClean['survive %'] = np.round(provinceStatsClean['survive'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)
provinceStatsClean['deceased %'] = np.round(provinceStatsClean['deceased'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)
provinceStatsClean['under treatment %'] = np.round(provinceStatsClean['under treatment'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)

provinceStatsAbsolute = provinceStatsClean[['survive', 'deceased', 'under treatment']]
provinceStatsPercentage = provinceStatsClean[['survive %', 'deceased %', 'under treatment %']]

display(provinceStatsAbsolute)
display(provinceStatsPercentage)
survive deceased under treatment
sex
female 909 26 1285
male 677 40 1108
survive % deceased % under treatment %
sex
female 40.95 1.17 57.88
male 37.10 2.19 60.71
In [35]:
newTable = provinceStatsPercentage.reset_index()
newTable = pd.melt(newTable, id_vars="sex", var_name="stats", value_name="rate")
display(newTable.head())

fig = px.bar(newTable, x='rate', y='sex', color='stats', barmode='group')
fig.update_layout(hovermode='y')
fig.show()
sex stats rate
0 female survive % 40.95
1 male survive % 37.10
2 female deceased % 1.17
3 male deceased % 2.19
4 female under treatment % 57.88

4.8) Network Diagram

In [36]:
networkData = patientData.copy()
networkData = networkData[networkData['infected_by'].notnull()]
networkData = networkData[['patient_id','sex','age','province','city','infection_case','infected_by','state']]
display(networkData.head())
patient_id sex age province city infection_case infected_by state
2 1000000003 male 50s Seoul Jongno-gu contact with patient 2002000001 released
4 1000000005 female 20s Seoul Seongbuk-gu contact with patient 1000000002 released
5 1000000006 female 50s Seoul Jongno-gu contact with patient 1000000003 released
6 1000000007 male 20s Seoul Jongno-gu contact with patient 1000000003 released
9 1000000010 female 60s Seoul Seongbuk-gu contact with patient 1000000003 released
In [37]:
A = list(networkData["infected_by"].unique())
B = list(networkData["patient_id"].unique())
node_list = set(A+B)

# Create Graph
G = nx.Graph()

for i in node_list:
    G.add_node(i)
# G.nodes()

for i,j in networkData.iterrows():
    G.add_edges_from([(j["infected_by"],j["patient_id"])])
    
pos = nx.spring_layout(G, k=0.5, iterations=50)

for n, p in pos.items():
    G.nodes[n]['pos'] = p

edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=0.5,color='#888'),
    hoverinfo='none',
    mode='lines')

for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])

node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='RdPu_r',
        reversescale=True,
        color=[],
        size=15,
        colorbar=dict(
            thickness=10,
            title='Number of Infected Cases',
            xanchor='left',
            titleside='right'
        ),
        line=dict(width=0)))

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])

for node, adjacencies in enumerate(G.adjacency()):
    node_trace['marker']['color']+=tuple([len(adjacencies[1])])
    node_info = str(adjacencies[0]) +' # of connections: '+ str(len(adjacencies[1]))
    node_trace['text']+=tuple([node_info])
    
fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='<br>Korea Covid Network Connections',
                titlefont=dict(size=16),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                annotations=[ dict(
                    text="",
                    showarrow=False,
                    xref="paper", yref="paper") ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))

iplot(fig)
In [38]:
fig = px.sunburst(networkData, path=['province','infected_by'])
fig.show()
In [39]:
networkData = patientData.copy()
networkData = networkData[networkData['infected_by'].notnull()]
networkData = networkData[['sex','age','province','state','infection_case']]

# print('With NA values')
# display(networkData.count())
# fig = px.parallel_categories(networkData, color_continuous_scale=px.colors.sequential.Inferno)
# fig.show()

# print('Without NA values')
networkData = networkData.dropna()
display(networkData.count())
fig = px.parallel_categories(networkData, color_continuous_scale=px.colors.sequential.Inferno)
fig.show()
sex               961
age               961
province          961
state             961
infection_case    961
dtype: int64

5) Policy & Search Trend

In [40]:
searchData = pd.read_csv('covid/searchtrend.csv')
searchDataCopy = searchData.copy()
searchDataMelted = pd.melt(searchDataCopy, 
                           id_vars=['date'], 
                           value_vars=['cold', 'flu','pneumonia','coronavirus'])
searchDataMelted = searchDataMelted[pd.to_datetime(searchDataMelted['date']).dt.year > 2019]
display(searchDataMelted.head())

policyData = pd.read_csv('covid/policy.csv')
policyDataCopy = policyData.copy()
policyDataCopy['end_date'] = policyDataCopy['end_date'].fillna(datetime.now().strftime('%Y-%m-%d'))
display(policyDataCopy.head())

df = []

for index, row in policyDataCopy.iterrows():
    df.append(dict(Task=row['gov_policy'], Start=row['start_date'], Finish=row['end_date'], Resource=row['type']))

fig = px.timeline(df, x_start="Start", x_end="Finish", y="Task", color="Resource", title="Policy over time")
fig.update_layout(hovermode='x')
fig.show()

patientData = pd.read_csv('covid/patientinfo.csv')
groupPatientData = patientData.groupby('confirmed_date').size().reset_index()
groupPatientData.columns = ['confirmed_date', 'count']

fig = px.line(searchDataMelted, x="date", y="value", color="variable", title='Search Trend over Time')
fig.update_layout(hovermode='x')
fig.show()

fig = px.line(groupPatientData, x="confirmed_date", y="count", title='Numbers Of Covid Cases Per Day')
fig.update_layout(hovermode='x')
fig.show()
date variable value
1461 2020-01-01 cold 0.14454
1462 2020-01-02 cold 0.19508
1463 2020-01-03 cold 0.19581
1464 2020-01-04 cold 0.60343
1465 2020-01-05 cold 0.20081
policy_id country type gov_policy detail start_date end_date
0 1 Korea Alert Infectious Disease Alert Level Level 1 (Blue) 2020-01-03 2020-01-19
1 2 Korea Alert Infectious Disease Alert Level Level 2 (Yellow) 2020-01-20 2020-01-27
2 3 Korea Alert Infectious Disease Alert Level Level 3 (Orange) 2020-01-28 2020-02-22
3 4 Korea Alert Infectious Disease Alert Level Level 4 (Red) 2020-02-23 2020-09-26
4 5 Korea Immigration Special Immigration Procedure from China 2020-02-04 2020-09-26

6) Time

In [41]:
timeData = pd.read_csv('covid/time.csv')
timeDataMelted = pd.melt(timeData, id_vars=['date'], value_vars=['test', 'negative','confirmed','released','deceased'])
display(timeDataMelted.head())

fig = px.line(timeDataMelted, x="date", y="value", color='variable', title="Overall cases over time")
fig.update_layout(hovermode='x')
fig.show()
date variable value
0 2020-01-20 test 1
1 2020-01-21 test 1
2 2020-01-22 test 4
3 2020-01-23 test 22
4 2020-01-24 test 27
In [42]:
timeDataMelted = pd.melt(timeData, id_vars=['date'], value_vars=['confirmed','released','deceased'])
display(timeDataMelted.head())

fig = px.line(timeDataMelted, x="date", y="value", color='variable', title="Confirmed, released & deceased over time")
fig.update_layout(hovermode='x')
fig.show()
date variable value
0 2020-01-20 confirmed 1
1 2020-01-21 confirmed 1
2 2020-01-22 confirmed 1
3 2020-01-23 confirmed 1
4 2020-01-24 confirmed 2
In [43]:
miniTimeData = timeData[['date','test','negative']]
miniTimeData['Test increase'] =  miniTimeData['test'] - miniTimeData['test'].shift(1) 
miniTimeData['Negative increase'] =  miniTimeData['negative'] - miniTimeData['negative'].shift(1) 
miniTimeData['Month'] = pd.to_datetime(miniTimeData['date']).dt.month
miniTimeData['Day'] = pd.to_datetime(miniTimeData['date']).dt.day_name()
display(miniTimeData.head())
date test negative Test increase Negative increase Month Day
0 2020-01-20 1 0 NaN NaN 1 Monday
1 2020-01-21 1 0 0.0 0.0 1 Tuesday
2 2020-01-22 4 3 3.0 3.0 1 Wednesday
3 2020-01-23 22 21 18.0 18.0 1 Thursday
4 2020-01-24 27 25 5.0 4.0 1 Friday

6.1) Test & Negative cases daily average increase amount by day

In [44]:
miniTimeDataDay = miniTimeData[['Test increase', 'Negative increase', 'Day']]
miniTimeDataDay = miniTimeDataDay.groupby('Day')['Test increase', 'Negative increase'].mean()
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
miniTimeDataDay = miniTimeDataDay.reindex(cats) 
miniTimeDataDay = miniTimeDataDay.reset_index()
miniTimeDataDayMelted = pd.melt(miniTimeDataDay, id_vars=['Day'], value_vars=['Test increase','Negative increase'])
display(miniTimeDataDayMelted.head())

fig = px.bar(miniTimeDataDayMelted, x='Day', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Day variable value
0 Monday Test increase 4852.565217
1 Tuesday Test increase 9753.958333
2 Wednesday Test increase 8594.391304
3 Thursday Test increase 8632.000000
4 Friday Test increase 9373.869565

In general, more tests are conducted during the weekdays and fewer during the weekend.
The results might take one day to process, which causes a lag in the reported test cases.
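
To probe the one-day lag idea, here is a minimal sketch (using the miniTimeData frame above) that compares the same-day correlation between new tests and new negative results with the correlation when tests are lagged by one day:

In [ ]:
# Minimal sketch: same-day vs one-day-lag correlation between new tests and new negative results
same_day = miniTimeData["Negative increase"].corr(miniTimeData["Test increase"])
lag_one = miniTimeData["Negative increase"].corr(miniTimeData["Test increase"].shift(1))
print("Same-day correlation: {:.2f}, one-day-lag correlation: {:.2f}".format(same_day, lag_one))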

6.2) Test & Negative cases daily average increase amount by month

In [45]:
miniTimeDataMonth = miniTimeData[['Test increase', 'Negative increase', 'Month']]
miniTimeDataMonth = miniTimeDataMonth.groupby('Month')['Test increase', 'Negative increase'].mean()
miniTimeDataMonth = miniTimeDataMonth.reset_index()
miniTimeDataMonthMelted = pd.melt(miniTimeDataMonth, id_vars=['Month'], value_vars=['Test increase','Negative increase'])
display(miniTimeDataMonthMelted.head())

fig = px.bar(miniTimeDataMonthMelted, x='Month', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Month variable value
0 1 Test increase 28.272727
1 2 Test increase 3232.517241
2 3 Test increase 10209.967742
3 4 Test increase 6977.233333
4 5 Test increase 9385.193548

More tests are conducted during March and June.
This trend also corresponds to the number of COVID-19 cases in Korea.
It seems that there is a spike in tests conducted whenever there is an increase in COVID-19 cases.
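
A small sketch (using the timeData frame loaded at the start of this section) puts a number on that relationship:

In [ ]:
# Minimal sketch: correlation between daily new tests and daily new confirmed cases
daily = timeData[["test", "confirmed"]].diff()
print("Correlation between daily new tests and new confirmed cases: {:.2f}".format(daily["test"].corr(daily["confirmed"])))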

In [46]:
miniTimeData = timeData[['date','confirmed','released','deceased']]
miniTimeData['confirmed increase'] =  miniTimeData['confirmed'] - miniTimeData['confirmed'].shift(1) 
miniTimeData['released increase'] =  miniTimeData['released'] - miniTimeData['released'].shift(1) 
miniTimeData['deceased increase'] =  miniTimeData['deceased'] - miniTimeData['deceased'].shift(1) 
miniTimeData['Month'] = pd.to_datetime(miniTimeData['date']).dt.month
miniTimeData['Day'] = pd.to_datetime(miniTimeData['date']).dt.day_name()
display(miniTimeData.head())
date confirmed released deceased confirmed increase released increase deceased increase Month Day
0 2020-01-20 1 0 0 NaN NaN NaN 1 Monday
1 2020-01-21 1 0 0 0.0 0.0 0.0 1 Tuesday
2 2020-01-22 1 0 0 0.0 0.0 0.0 1 Wednesday
3 2020-01-23 1 0 0 0.0 0.0 0.0 1 Thursday
4 2020-01-24 2 0 0 1.0 0.0 0.0 1 Friday

6.3) Confirmed, released & deceased cases daily average increase amount by day

In [47]:
miniTimeDataDay = miniTimeData[['confirmed increase', 'released increase','deceased increase', 'Day']]
miniTimeDataDay = miniTimeDataDay.groupby('Day')['confirmed increase', 'released increase','deceased increase'].mean()
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
miniTimeDataDay = miniTimeDataDay.reindex(cats) 
miniTimeDataDay = miniTimeDataDay.reset_index()
miniTimeDataDayMelted = pd.melt(miniTimeDataDay, id_vars=['Day'], value_vars=['confirmed increase', 'released increase','deceased increase'])
display(miniTimeDataDayMelted.head())

fig = px.bar(miniTimeDataDayMelted, x='Day', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Day variable value
0 Monday confirmed increase 67.217391
1 Tuesday confirmed increase 63.375000
2 Wednesday confirmed increase 76.130435
3 Thursday confirmed increase 79.565217
4 Friday confirmed increase 84.086957

6.4) Confirmed, released & deceased cases daily average increase amount by month

In [48]:
miniTimeDataMonth = miniTimeData[['confirmed increase', 'released increase','deceased increase', 'Month']]
miniTimeDataMonth = miniTimeDataMonth.groupby('Month')['confirmed increase', 'released increase','deceased increase'].mean()
miniTimeDataMonth = miniTimeDataMonth.reset_index()
miniTimeDataMonthMelted = pd.melt(miniTimeDataMonth, id_vars=['Month'], value_vars=['confirmed increase', 'released increase','deceased increase'])
display(miniTimeDataMonthMelted.head())

fig = px.bar(miniTimeDataMonthMelted, x='Month', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Month variable value
0 1 confirmed increase 0.909091
1 2 confirmed increase 108.241379
2 3 confirmed increase 214.064516
3 4 confirmed increase 32.633333
4 5 confirmed increase 22.677419

6.5) Temporal Distribution and Prediction

Visualisation and Description

In [49]:
times = pd.read_csv("covid/time.csv")
policies = pd.read_csv("covid/policy.csv")
weathers = pd.read_csv("covid/weather.csv")
times.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       163 non-null    object
 1   time       163 non-null    int64 
 2   test       163 non-null    int64 
 3   negative   163 non-null    int64 
 4   confirmed  163 non-null    int64 
 5   released   163 non-null    int64 
 6   deceased   163 non-null    int64 
dtypes: int64(6), object(1)
memory usage: 9.0+ KB
In [50]:
# Add the datetime index
times["date"] = pd.to_datetime(times.date, format="%Y-%m-%d")
times.set_index("date", inplace=True)
times.head()
Out[50]:
time test negative confirmed released deceased
date
2020-01-20 16 1 0 1 0 0
2020-01-21 16 1 0 1 0 0
2020-01-22 16 4 3 1 0 0
2020-01-23 16 22 21 1 0 0
2020-01-24 16 27 25 2 0 0
In [51]:
fig = plt.figure(figsize=(16, 9))
gs = gridspec.GridSpec(2, 1, height_ratios=[3, 1])
fig.suptitle("Cumulative Reported Cases", size=18)
ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1], sharex=ax1)
times[["test", "negative"]].plot(ax=ax1, figsize=(16, 9))
times[["confirmed", "released", "deceased"]].plot(ax=ax2, figsize=(16, 9))
plt.tight_layout()
plt.show()
In [52]:
# Create additional variables and plot the positive rate
data = times[["test", "confirmed"]].diff(1)
data["positive_rate"] = data["confirmed"] / data["test"]
fig = plt.figure(figsize=(16, 9))
gs = gridspec.GridSpec(2, 1, height_ratios=[3, 1])
plt.title("Cumulative Reported Cases", size=18)
ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1], sharex=ax1)
data[["test", "confirmed"]].plot(ax=ax1)
data["positive_rate"].plot(ax=ax2)
ax1.set_title("New Tested and Confirmed Cases", size=18)
ax2.set_title("Positive Rate", size=18)
plt.tight_layout()
plt.show()
In [53]:
# Check the distribution of confirmed cases and positive rates
data["confirmed"].value_counts()
Out[53]:
0.0     13
1.0     10
27.0     6
39.0     5
13.0     4
        ..
30.0     1
11.0     1
14.0     1
26.0     1
94.0     1
Name: confirmed, Length: 94, dtype: int64
In [54]:
data["confirmed"].plot(kind="hist", figsize=(16, 9))
plt.title("Distribution of Confirmed Cases", size=18)
Out[54]:
Text(0.5, 1.0, 'Distribution of Confirmed Cases')
In [55]:
data["confirmed"].apply(lambda x: x**0.5).plot(kind="hist", figsize=(16, 9))
plt.title("Distribution of Root Confirmed Cases")
Out[55]:
Text(0.5, 1.0, 'Distribution of Root Confirmed Cases')
In [56]:
data["confirmed"].apply(lambda x: np.log(x) if x!=0 else 0).plot(kind="hist", figsize=(16, 9))
plt.title("Distribution of Log Confirmed Cases", size=18)
Out[56]:
Text(0.5, 1.0, 'Distribution of Log Confirmed Cases')
In [57]:
# Overall positive rate
times.iloc[-1]["confirmed"] / times.iloc[-1]["test"]
Out[57]:
0.010048941485327761
In [58]:
data["positive_rate"].plot(kind="hist", bins=30, figsize=(16, 9))
plt.axvline(x=data["positive_rate"].mean(), color="red")
plt.annotate("x = {:.4f}".format(data["positive_rate"].mean()), xy=(0.02, 50), size=14, color="red")
plt.title("Distribution of Positive Rate", size=18)
Out[58]:
Text(0.5, 1.0, 'Distribution of Positive Rate')
In [59]:
# Check how many zeros there are for positive rate
(data["positive_rate"] == 0.0).sum()  # Even exclude these cases, there are still many small positive values
Out[59]:
11
In [60]:
data["positive_rate"].plot(kind="line", figsize=(16, 9))
plt.title("Line Graph of Positive Rate", size=18)
Out[60]:
Text(0.5, 1.0, 'Line Graph of Positive Rate')

As there are too few confirmed cases every day, we will use 5-day and 10-day SMA to show the pattern of positive rate.

In [61]:
# Get the smooth positive rate
fig, ax = plt.subplots(1, 1)
(data["confirmed"].rolling(5).mean() / data["test"].rolling(5).mean()).plot(ax=ax, kind="line", linewidth=2, label="SMA(5)", figsize=(16, 9))
(data["confirmed"].rolling(10).mean() / data["test"].rolling(10).mean()).plot(ax=ax, kind="line", color="darkblue", linewidth=2, label="SMA(10)", figsize=(16, 9))
data["positive_rate"].plot(ax=ax, kind="line", color="magenta", linewidth=2, label="Today", figsize=(16, 9))
plt.legend()
plt.tight_layout()
plt.title("5D and 10D Average Positive Rate", size=18)
Out[61]:
Text(0.5, 1, '5D and 10D Average Positive Rate')
In [62]:
# Get the smooth positive rate
fig, ax = plt.subplots(1, 1)
(data["confirmed"].rolling(5).mean() / data["test"].rolling(5).mean()).plot(ax=ax, kind="line", linewidth=2, label="SMA(5)", figsize=(16, 9))
(data["confirmed"].rolling(10).mean() / data["test"].rolling(10).mean()).plot(ax=ax, kind="line", color="darkblue", linewidth=2, label="SMA(10)", figsize=(16, 9))
plt.legend()
plt.tight_layout()
plt.title("MA of Positive Rate", size=18)
Out[62]:
Text(0.5, 1, 'MA of Positive Rate')
In [63]:
# Get the positive rate since May
fig, ax = plt.subplots(1, 1)
(data["2020-05":]["confirmed"].rolling(5).mean() / data["2020-05":]["test"].rolling(5).mean()).plot(ax=ax, kind="line", linewidth=2, label="SMA(5)", figsize=(16, 9))
(data["2020-05":]["confirmed"].rolling(10).mean() / data["2020-05":]["test"].rolling(10).mean()).plot(ax=ax, kind="line", color="darkblue", linewidth=2, label="SMA(10)", figsize=(16, 9))
data["2020-05":]["positive_rate"].plot(ax=ax, kind="line", color="magenta", linewidth=2, label="Today", figsize=(16, 9))
plt.legend()
plt.tight_layout()
plt.title("Positive Rate since May", size=18)
Out[63]:
Text(0.5, 1, 'Positive Rate since May')

6.6) Stationarity Testing

In [64]:
# Plot a histogram of the selected data again

subdata = data["2020-05":]
mu = subdata["confirmed"].mean()
sigma = subdata["confirmed"].std()
r_range = np.linspace(subdata["confirmed"].min(), subdata["confirmed"].max(), 10000)
norm_pdf = scs.norm.pdf(r_range, loc=mu, scale=sigma)
fig = plt.figure(figsize=(16, 9))
plt.hist(subdata["confirmed"], bins=30, color="darkblue", alpha=0.7, density=True)
plt.plot(r_range, norm_pdf, color="darkred", label=f"N({mu: .2f}, {sigma: .2f}$^2$)", lw=4)
plt.xlabel("Confirmed Cases Since May", fontsize=14)
plt.ylabel("Probability Density", fontsize=14)
plt.title("Confirmed Cases Histogram Since May", fontsize=24)
plt.legend(loc=(0.6, 0.6), fontsize=16)
plt.tight_layout()
plt.show()
In [65]:
# Get the 3rd and 4th Moment
print("Skewness: {:.2f}".format(scs.skew(subdata["confirmed"])))
print("Extra kurtosis: {:.2f}".format(scs.kurtosis(subdata["confirmed"])))
Skewness: 0.14
Extra kurtosis: -0.53

The daily confirmed cases in the selected period are not significantly skewed, and the distribution is platykurtic. Overall, the normal distribution is a fair approximation of the data and no additional scaling steps (taking logs, etc.) are needed at this point.
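
As a quick supporting check, a minimal sketch running the D'Agostino-Pearson normality test on the same series (scs is scipy.stats, imported above):

In [ ]:
# Minimal sketch: D'Agostino-Pearson normality test on the confirmed cases since May
stat, p = scs.normaltest(subdata["confirmed"])
print("normaltest statistic = {:.2f}, p-value = {:.3f}".format(stat, p))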

In [66]:
# Autocorrelation Functions
fig, ax = plt.subplots(2, 1, figsize=(16, 9))
plot_acf(subdata["confirmed"], ax=ax[0], lags=15, alpha=0.05)
plot_pacf(subdata["confirmed"], ax=ax[1], lags=15, alpha=0.05)
plt.tight_layout()
plt.show()

The autocorrelation gradually decreases with the time lag and becomes insignificant beyond lag 4. The PACF indicates that a first-order autoregressive model, AR(1), may be plausible for explaining these correlations.
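
As a quick follow-up sketch, we can fit the suggested AR(1) with the ARIMA class imported above and inspect the estimated coefficient:

In [ ]:
# Minimal sketch: fit an AR(1) to the confirmed cases since May and print the fitted parameters
ar1 = ARIMA(subdata["confirmed"], order=(1, 0, 0)).fit(disp=0)
print(ar1.params)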

In [67]:
def adf_test(ts):
    indices = ["Augmented Dickey-Fuller statistic", "p-value", "# of lags used", "# of observations used"]
    adf_val = adfuller(ts, autolag="AIC")
    critical_val = pd.Series(adf_val[:4], index=indices)
    for pct, val in adf_val[4].items():
        critical_val[f"Critical Value ({pct})"] = val
    return critical_val

def kpss_test(ts, h0_type="c"):
    indices = ["Kwiatkowski-Phillips-chmidt-Shin statistic", "p-value", "# of lags"]
    kpss_val = kpss(ts, regression=h0_type)
    critical_val = pd.Series(kpss_val[:3], index=indices)
    for pct, val in kpss_val[3].items():
        critical_val[f"Critical Value ({pct})"] = val
    return critical_val

print(adf_test(subdata["confirmed"]))
print(kpss_test(subdata["confirmed"]))
Augmented Dickey-Fuller statistic    -3.237299
p-value                               0.017925
# of lags used                        0.000000
# of observations used               60.000000
Critical Value (1%)                  -3.544369
Critical Value (5%)                  -2.911073
Critical Value (10%)                 -2.593190
dtype: float64
Kwiatkowski-Phillips-Schmidt-Shin statistic    0.559479
p-value                                        0.028271
# of lags                                     11.000000
Critical Value (10%)                           0.347000
Critical Value (5%)                            0.463000
Critical Value (2.5%)                          0.574000
Critical Value (1%)                            0.739000
dtype: float64

The ADF test rejects the unit-root null at the 5% significance level, suggesting stationarity, while the KPSS test rejects its stationarity null at 5% but not at 2.5%. Taken together, the selected data can be treated as approximately stationary.

6.7) Seasonal Decomposition

In [68]:
result = seasonal_decompose(subdata["confirmed"], model="multiplicative")
fig, ax = plt.subplots(4, 1, sharex=True, figsize=(16, 9))
result.observed.plot(ax=ax[0])
ax[0].set(title="Seasonal Decomposition of Confirmed Cases", ylabel="Observed", )
result.trend.plot(ax=ax[1])
ax[1].set(ylabel="Trend")
result.seasonal.plot(ax=ax[2])
ax[2].set(ylabel="Seasonal")
result.resid.plot(ax=ax[3])
ax[3].set(ylabel="Residual")
plt.tight_layout()
plt.show()
In [69]:
result = seasonal_decompose(subdata["confirmed"].rolling(5).mean().dropna(), model="multiplicative")
fig, ax = plt.subplots(4, 1, sharex=True, figsize=(16, 9))
result.observed.plot(ax=ax[0])
ax[0].set(title="Seasonal Decomposition of Confirmed Cases SMA(5)", ylabel="Observed", )
result.trend.plot(ax=ax[1])
ax[1].set(ylabel="Trend")
result.seasonal.plot(ax=ax[2])
ax[2].set(ylabel="Seasonal")
result.resid.plot(ax=ax[3])
ax[3].set(ylabel="Residual")
plt.tight_layout()
plt.show()

6.8) ARIMA Modelling

In [70]:
train = subdata["confirmed"][:45]
test = subdata["confirmed"][45:]
In [71]:
model = pm.auto_arima(subdata["confirmed"].dropna(), stepwise=True, error_action="ignore", support_warnings=True, n_jobs=-1)
model.summary()
Out[71]:
Statespace Model Results
Dep. Variable: y No. Observations: 61
Model: SARIMAX(2, 1, 3) Log Likelihood -225.641
Date: Sat, 26 Sep 2020 AIC 465.283
Time: 16:41:26 BIC 479.943
Sample: 0 HQIC 471.017
- 61
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 1.6496 0.338 4.885 0.000 0.988 2.311
ar.L1 -1.1268 0.154 -7.321 0.000 -1.428 -0.825
ar.L2 -0.2350 0.146 -1.613 0.107 -0.521 0.051
ma.L1 0.9010 13.050 0.069 0.945 -24.677 26.479
ma.L2 -0.9025 12.999 -0.069 0.945 -26.380 24.575
ma.L3 -0.9985 0.187 -5.340 0.000 -1.365 -0.632
sigma2 93.6280 0.403 232.419 0.000 92.838 94.418
Ljung-Box (Q): 18.88 Jarque-Bera (JB): 1.98
Prob(Q): 1.00 Prob(JB): 0.37
Heteroskedasticity (H): 1.90 Skew: 0.27
Prob(H) (two-sided): 0.16 Kurtosis: 3.70


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 3.02e+20. Standard errors may be unstable.
In [72]:
model.fit(train)
pred = model.predict(n_periods=16)
forecast = pd.DataFrame(pred, index=test.index, columns=["Prediction"])
pd.concat([subdata["confirmed"], forecast], axis=1).plot(figsize=(16, 9), title="AutoARIMA Model")
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a4e484b90>
In [73]:
train_smooth = train.rolling(5).mean() / train.rolling(5).std()
total_smooth = subdata["confirmed"].rolling(5).mean() / subdata["confirmed"].rolling(5).std()
model = ARIMA(train_smooth.dropna(), order=(2, 0, 2)).fit(disp=0)
model.summary()
Out[73]:
ARMA Model Results
Dep. Variable: confirmed No. Observations: 41
Model: ARMA(2, 2) Log Likelihood -63.379
Method: css-mle S.D. of innovations 1.094
Date: Sat, 26 Sep 2020 AIC 138.759
Time: 16:41:27 BIC 149.040
Sample: 05-05-2020 HQIC 142.503
- 06-14-2020
coef std err z P>|z| [0.025 0.975]
const 3.7150 0.254 14.623 0.000 3.217 4.213
ar.L1.confirmed 1.7059 0.113 15.049 0.000 1.484 1.928
ar.L2.confirmed -0.7618 0.124 -6.145 0.000 -1.005 -0.519
ma.L1.confirmed -0.7258 0.189 -3.848 0.000 -1.095 -0.356
ma.L2.confirmed -0.2742 0.169 -1.627 0.113 -0.605 0.056
Roots
Real Imaginary Modulus Frequency
AR.1 1.1197 -0.2429j 1.1457 -0.0340
AR.2 1.1197 +0.2429j 1.1457 0.0340
MA.1 1.0000 +0.0000j 1.0000 0.0000
MA.2 -3.6470 +0.0000j 3.6470 0.5000
In [74]:
pred = model.forecast(16, alpha=0.2)
forecast = [pd.DataFrame(pred[0], columns=["prediction"]), pd.DataFrame(pred[2], columns=["ci_lower", "ci_upper"])]
forecast = pd.concat(forecast, axis=1).set_index(test.index)
fig, ax = plt.subplots(1, figsize=(16, 9))
ax = sns.lineplot(data=total_smooth, color="slateblue", label="actual")
ax.plot(forecast.prediction, c="darkblue", label="ARMA(2, 2)")
ax.fill_between(forecast.index, forecast.ci_lower, forecast.ci_upper, alpha=0.2, facecolor="darkblue")
ax.set(title="ARMA(2,2) with SMA(5) and NORM(5) Smoothing")
plt.legend()
plt.tight_layout()
plt.show()

6.9) ETS Modelling

In [75]:
model1 = Holt(train).fit()
model2 = Holt(train, exponential=True).fit()
pred1 = model1.forecast(16)
pred2 = model2.forecast(16)
fig, ax = plt.subplots(1, figsize=(16, 9))
subdata["confirmed"].plot(color="slateblue", label="Actural", legend=True, title="Holt-Winters Model")
pred1.plot(color="darkblue", legend=True, label="Linear Trend")
pred2.plot(color="magenta", legend=True, label="Exponential Trend")
plt.show()
In [76]:
model3 = ExponentialSmoothing(train, trend="mul", seasonal="add", seasonal_periods=3).fit()
pred3 = model3.forecast(16)
fig, ax = plt.subplots(1, figsize=(16, 9))
subdata["confirmed"].plot(color="slateblue", label="Actural", legend=True, title="Holt-Winters Model")
pred3.plot(color="darkblue", legend=True, label="Holt-Winters Seasonal Model")
plt.show()

6.10) Facebook Prophet

In [77]:
model = Prophet()
model.fit(train.reset_index(drop=False).rename(columns={"date": "ds", "confirmed": "y"}))
period_pred = model.make_future_dataframe(periods=16)
pred = model.predict(period_pred)
model.plot(pred)
plt.title("Facebook Prophet Model")
plt.show()
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.

The Prophet model is not a good fit after all. As a final attempt, we fit a simple LSTM model on the training sequence.

In [78]:
def split_sequence(sequence, n_steps):
	X, y = list(), list()
	for i in range(len(sequence)):
		# find the end of this pattern
		end_ix = i + n_steps
		# check if we are beyond the sequence
		if end_ix > len(sequence)-1:
			break
		# gather input and output parts of the pattern
		seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
		X.append(seq_x)
		y.append(seq_y)
	return np.array(X), np.array(y)
n_steps = 4
X, y = split_sequence(train, n_steps)
for i in range(len(X)):
  print(X[i], y[i])
[ 9.  6. 13.  8.] 3.0
[ 6. 13.  8.  3.] 2.0
[13.  8.  3.  2.] 4.0
[8. 3. 2. 4.] 12.0
[ 3.  2.  4. 12.] 18.0
[ 2.  4. 12. 18.] 34.0
[ 4. 12. 18. 34.] 35.0
[12. 18. 34. 35.] 27.0
[18. 34. 35. 27.] 26.0
[34. 35. 27. 26.] 29.0
[35. 27. 26. 29.] 27.0
[27. 26. 29. 27.] 19.0
[26. 29. 27. 19.] 13.0
[29. 27. 19. 13.] 15.0
[27. 19. 13. 15.] 13.0
[19. 13. 15. 13.] 32.0
[13. 15. 13. 32.] 12.0
[15. 13. 32. 12.] 20.0
[13. 32. 12. 20.] 23.0
[32. 12. 20. 23.] 25.0
[12. 20. 23. 25.] 16.0
[20. 23. 25. 16.] 19.0
[23. 25. 16. 19.] 40.0
[25. 16. 19. 40.] 79.0
[16. 19. 40. 79.] 58.0
[19. 40. 79. 58.] 39.0
[40. 79. 58. 39.] 27.0
[79. 58. 39. 27.] 35.0
[58. 39. 27. 35.] 38.0
[39. 27. 35. 38.] 49.0
[27. 35. 38. 49.] 39.0
[35. 38. 49. 39.] 39.0
[38. 49. 39. 39.] 51.0
[49. 39. 39. 51.] 57.0
[39. 39. 51. 57.] 38.0
[39. 51. 57. 38.] 38.0
[51. 57. 38. 38.] 50.0
[57. 38. 38. 50.] 45.0
[38. 38. 50. 45.] 56.0
[38. 50. 45. 56.] 48.0
[50. 45. 56. 48.] 34.0
In [79]:
n_features = 1
X = X.reshape((X.shape[0], X.shape[1], n_features))
In [80]:
# A small univariate LSTM: one LSTM layer of 50 units feeding a single-output dense layer
model = Sequential()
model.add(LSTM(50, activation="relu", input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse")  # mean squared error on the one-step-ahead value
In [81]:
model.fit(X, y, epochs=100, verbose=1)
Epoch 1/100
2/2 [==============================] - 0s 6ms/step - loss: 1177.0583
Epoch 2/100
2/2 [==============================] - 0s 5ms/step - loss: 1143.4464
Epoch 3/100
2/2 [==============================] - 0s 5ms/step - loss: 1111.7444
...
Epoch 99/100
2/2 [==============================] - 0s 4ms/step - loss: 106.5375
Epoch 100/100
2/2 [==============================] - 0s 7ms/step - loss: 111.8336
Out[81]:
<tensorflow.python.keras.callbacks.History at 0x1a49e47610>
In [82]:
X_pred, y_test = split_sequence(subdata["confirmed"][42:], n_steps)
X_pred = X_pred.reshape((X_pred.shape[0], X_pred.shape[1], n_features))
y_pred = model.predict(X_pred)
y_pred
Out[82]:
array([[42.294037],
       [36.562725],
       [49.501846],
       [55.62399 ],
       [37.899746],
       [56.396465],
       [41.9376  ],
       [23.650091],
       [60.16797 ],
       [64.016396],
       [22.517004],
       [46.496487],
       [62.20671 ],
       [47.73969 ],
       [34.285145]], dtype=float32)
In [83]:
forecast = pd.DataFrame(y_pred.reshape(-1), index=test.index[1:], columns=["Prediction"])
pd.concat([subdata["confirmed"], forecast], axis=1).plot(figsize=(16, 9), title="LSTM Model")
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a4e03f350>

While some moves have been predicted correctly, others have gone in entirely the wrong direction. We do not have enough evidence to trust the forecast as a whole; a quick error metric is computed below.
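
A quick error metric backs this up (a minimal sketch; y_test and y_pred come from the two cells above):

# Root-mean-squared and mean absolute error of the LSTM one-step forecasts
rmse = np.sqrt(np.mean((y_test - y_pred.reshape(-1)) ** 2))
mae = np.mean(np.abs(y_test - y_pred.reshape(-1)))
print("RMSE: %.2f, MAE: %.2f" % (rmse, mae))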

7) TimeAge

In [84]:
timeAgeData = pd.read_csv('covid/timeage.csv')
display(timeAgeData.head())

fig = px.line(timeAgeData, x="date", y="confirmed", color='age', title="Confirmed cases of various age group over time")
fig.update_layout(hovermode='x')
fig.show()
date time age confirmed deceased
0 2020-03-02 0 0s 32 0
1 2020-03-02 0 10s 169 0
2 2020-03-02 0 20s 1235 0
3 2020-03-02 0 30s 506 1
4 2020-03-02 0 40s 633 1
In [85]:
timeAgeData = pd.read_csv('covid/timeage.csv')
fig = px.line(timeAgeData, x="date", y="deceased", color='age', title="Deceased cases of various age group over time")
fig.update_layout(hovermode='x')
fig.show()

7.1) Deceased vs confirmed cases across different age group over time

In [86]:
timeAgeData = pd.read_csv('covid/timeage.csv')
timeAgeData['month'] = pd.to_datetime(timeAgeData['date']).dt.month
display(timeAgeData.head())

fig = px.scatter(timeAgeData, 
                 x='confirmed', 
                 y='deceased', 
                 color="age",
                 size_max=100,
                 size="deceased",
                 animation_frame="date", 
                 animation_group="age",
                 range_x=[0, timeAgeData['confirmed'].max()], 
                 range_y=[0, timeAgeData['deceased'].max() + timeAgeData['deceased'].std()]
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
date time age confirmed deceased month
0 2020-03-02 0 0s 32 0 3
1 2020-03-02 0 10s 169 0 3
2 2020-03-02 0 20s 1235 0 3
3 2020-03-02 0 30s 506 1 3
4 2020-03-02 0 40s 633 1 3

Let's have a closer look by splitting the age groups and excluding the younger groups (0s-20s), which have no deaths.

In [87]:
display(timeAgeData.head())
timeAgeData = timeAgeData[~timeAgeData['age'].isin(['0s','10s','20s'])]
timeAgeData['month'] = pd.to_datetime(timeAgeData['date']).dt.month

fig = px.scatter(timeAgeData, x="confirmed", y="deceased", animation_frame="date", animation_group="age",
                 size="deceased", color="age", hover_name="age", facet_col="age",
                 range_x=[0, timeAgeData['confirmed'].max()], 
                 range_y=[0, timeAgeData['deceased'].max()])
fig.show()
date time age confirmed deceased month
0 2020-03-02 0 0s 32 0 3
1 2020-03-02 0 10s 169 0 3
2 2020-03-02 0 20s 1235 0 3
3 2020-03-02 0 30s 506 1 3
4 2020-03-02 0 40s 633 1 3

7.2) Testing of Homogeneous Mortality Hypothesis

Mortality Rates across Age Groups

In [88]:
timeage = pd.read_csv("covid/timeAge.csv")
timeage.head()
Out[88]:
date time age confirmed deceased
0 2020-03-02 0 0s 32 0
1 2020-03-02 0 10s 169 0
2 2020-03-02 0 20s 1235 0
3 2020-03-02 0 30s 506 1
4 2020-03-02 0 40s 633 1
In [89]:
timeagegrouped = timeage.groupby("age")["confirmed", "deceased"].sum()
timeagegrouped
Out[89]:
confirmed deceased
age
0s 16107 0
10s 68752 0
20s 345827 0
30s 137539 194
40s 168250 295
50s 230030 1537
60s 158505 3743
70s 82107 7599
80s 54086 12136

Single-Factor ANOVA: $H_0: p_{0s}=p_{10s}=p_{20s}=\dots=p_{80s}$ ; $H_a$: at least two of the group proportions are unequal

$$F=\frac{SSTr/(I-1)}{SSE/\sum_i(n_i-1)}$$
In [90]:
p = timeagegrouped["deceased"] / timeagegrouped["confirmed"]
p.plot(kind="barh", figsize=(16, 9))
Out[90]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a49e24f90>
In [91]:
p_I = p.mean()
SSTr = np.sum([(n*(x-p_I))**2 for x, n in zip(p, timeagegrouped["confirmed"])])
SSE = np.sum([x*(1-x)*n for x, n in zip(p, timeagegrouped["confirmed"])])
df = np.sum([n-1 for n in timeagegrouped["confirmed"]])
F = (SSTr/(len(p))) / (SSE/df) 
F
Out[91]:
2776866034.6319556

The F ratio is huge, which implies that mortality rates differ significantly across the age groups; the corresponding p-value is computed below. More specifically, older age groups tend to have higher mortality rates, which is consistent with the horizontal bar chart above.
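
For completeness, a rough p-value for this F statistic (a sketch; F, p and df come from the cell above, with the numerator degrees of freedom taken as I - 1 per the formula):

from scipy import stats

# Survival function of the F distribution; with an F this large it is effectively zero,
# so H0 of equal mortality rates across age groups is rejected.
p_value = stats.f.sf(F, len(p) - 1, df)
print(p_value)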

7.3) Does the TimeAge dataset tally with patient.csv?

In [92]:
age = pd.read_csv("covid/TimeAge.csv")
In [93]:
age.date = age.date.apply(lambda x: pd.to_datetime(x))
In [94]:
age_latest = age[age.date == max(age.date)]
age_latest
Out[94]:
date time age confirmed deceased
1080 2020-06-30 0 0s 193 0
1081 2020-06-30 0 10s 708 0
1082 2020-06-30 0 20s 3362 0
1083 2020-06-30 0 30s 1496 2
1084 2020-06-30 0 40s 1681 3
1085 2020-06-30 0 50s 2286 15
1086 2020-06-30 0 60s 1668 41
1087 2020-06-30 0 70s 850 82
1088 2020-06-30 0 80s 556 139
In [95]:
sns.barplot( x = "age", y = "confirmed", data = age_latest)
plt.title("Latest number of confirmed cases as of 30th June 2020", size = 15)
plt.tick_params(axis = "both", labelsize = 15)

This tallies with patient.csv, although patient.csv itself is incomplete; a rough cross-check is sketched below.
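
A rough cross-check against the patient-level file (a sketch; covid/patientinfo.csv and its 'age' column are the ones used later in Section 16):

patients = pd.read_csv('covid/patientinfo.csv')
print(patients['age'].value_counts().sort_index())   # patient-level counts per age group
print(age_latest.set_index('age')['confirmed'])      # TimeAge counts as of the latest date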

7.4) Deceased Rate

In [96]:
age["deceased rate"] = age.deceased/age.confirmed * 100.0
In [97]:
plt.figure(figsize = (16,10))
for group in age.groupby("age"):
    plt.plot(group[1].date.values, group[1]["deceased rate"].values, label = group[0])
    plt.xticks(rotation = 90)
plt.title("Deceased Rate over time", size = 15)
plt.ylabel("Deceased Rate %", size = 15)
plt.xlabel("Date", size = 15)
plt.legend()
plt.show()

As time goes by, the deceased rate for the elderly age groups has spiked rapidly.

8) Time Gender

In [98]:
timeGender = pd.read_csv('covid/timegender.csv')
display(timeGender.head())

fig = px.line(timeGender, x="date", y="confirmed", color='sex', title="Confirmed cases between genders over time")
fig.update_layout(hovermode='x')
fig.show()
date time sex confirmed deceased
0 2020-03-02 0 male 1591 13
1 2020-03-02 0 female 2621 9
2 2020-03-03 0 male 1810 16
3 2020-03-03 0 female 3002 12
4 2020-03-04 0 male 1996 20
In [99]:
fig = px.line(timeGender, x="date", y="deceased", color='sex', title="Deceased cases between genders over time")
fig.update_layout(hovermode='x')
fig.show()

8.1) Deceased vs confirmed cases across different gender over time

In [100]:
timeGender = pd.read_csv('covid/timegender.csv')
display(timeGender.head())

fig = px.scatter(timeGender, 
                 x='confirmed', 
                 y='deceased', 
                 color="sex",
                 size_max=100,
                 size="deceased",
                 animation_frame="date", 
                 animation_group="sex",
                 range_x=[0, timeGender['confirmed'].max() + timeGender['confirmed'].std()], 
                 range_y=[0, timeGender['deceased'].max() + timeGender['deceased'].std()]
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
date time sex confirmed deceased
0 2020-03-02 0 male 1591 13
1 2020-03-02 0 female 2621 9
2 2020-03-03 0 male 1810 16
3 2020-03-03 0 female 3002 12
4 2020-03-04 0 male 1996 20

8.2) Confirmed cases daily average increase amount by month (Sort by gender)

In [101]:
timeGender = pd.read_csv('covid/timegender.csv')
timeGenderAnalysis = timeGender.copy()

timeGenderAnalysis['confirmed increase'] = timeGenderAnalysis['confirmed'] - timeGenderAnalysis['confirmed'].shift(2)
timeGenderAnalysis['deceased increase'] = timeGenderAnalysis['deceased'] - timeGenderAnalysis['deceased'].shift(2)

timeGenderAnalysis['month'] = pd.to_datetime(timeGenderAnalysis['date']).dt.month
timeGenderAnalysis['day'] = pd.to_datetime(timeGenderAnalysis['date']).dt.day_name()

timeGenderAnalysisMonth = timeGenderAnalysis.groupby(['month','sex'])['confirmed increase'].mean()
timeGenderAnalysisMonth = timeGenderAnalysisMonth.reset_index()

timeGenderAnalysisMonthMelted = pd.melt(timeGenderAnalysisMonth, id_vars=['month','sex'], value_vars=['confirmed increase'])
display(timeGenderAnalysisMonthMelted.head())

fig = px.bar(timeGenderAnalysisMonthMelted, x='month', y='value', color='sex', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
month sex variable value
0 3 female confirmed increase 112.413793
1 3 male confirmed increase 79.793103
2 4 female confirmed increase 17.733333
3 4 male confirmed increase 14.900000
4 5 female confirmed increase 8.387097

8.3) Confirmed cases daily average increase amount by day (Sort by gender)

In [102]:
timeGenderAnalysisMonth = timeGenderAnalysis.groupby(['day','sex'])['confirmed increase'].mean()
timeGenderAnalysisMonth = timeGenderAnalysisMonth.reset_index()

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
timeGenderAnalysisMonth['day'] = pd.Categorical(timeGenderAnalysisMonth['day'],categories=weekdays)
timeGenderAnalysisMonth = timeGenderAnalysisMonth.sort_values('day')

timeGenderAnalysisMonthMelted = pd.melt(timeGenderAnalysisMonth, id_vars=['day','sex'], value_vars=['confirmed increase'])
display(timeGenderAnalysisMonthMelted.head())

fig = px.bar(timeGenderAnalysisMonthMelted, x='day', y='value', color='sex', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
day sex variable value
0 Monday female confirmed increase 25.352941
1 Monday male confirmed increase 21.764706
2 Tuesday female confirmed increase 41.500000
3 Tuesday male confirmed increase 34.833333
4 Wednesday female confirmed increase 48.941176

8.4) Deceased cases daily average increase amount by month (Sort by gender)

In [103]:
timeGenderAnalysisMonth = timeGenderAnalysis.groupby(['month','sex'])['deceased increase'].mean()
timeGenderAnalysisMonth = timeGenderAnalysisMonth.reset_index()

timeGenderAnalysisMonthMelted = pd.melt(timeGenderAnalysisMonth, id_vars=['month','sex'], value_vars=['deceased increase'])
display(timeGenderAnalysisMonthMelted.head())

fig = px.bar(timeGenderAnalysisMonthMelted, x='month', y='value', color='sex', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
month sex variable value
0 3 female deceased increase 2.448276
1 3 male deceased increase 2.379310
2 4 female deceased increase 1.233333
3 4 male deceased increase 1.600000
4 5 female deceased increase 0.322581

8.5) Deceased cases daily average increase amount by day (Sort by gender)

In [104]:
timeGenderAnalysisMonth = timeGenderAnalysis.groupby(['day','sex'])['deceased increase'].mean()
timeGenderAnalysisMonth = timeGenderAnalysisMonth.reset_index()

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
timeGenderAnalysisMonth['day'] = pd.Categorical(timeGenderAnalysisMonth['day'],categories=weekdays)
timeGenderAnalysisMonth = timeGenderAnalysisMonth.sort_values('day')

timeGenderAnalysisMonthMelted = pd.melt(timeGenderAnalysisMonth, id_vars=['day','sex'], value_vars=['deceased increase'])
display(timeGenderAnalysisMonthMelted.head())

fig = px.bar(timeGenderAnalysisMonthMelted, x='day', y='value', color='sex', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
day sex variable value
0 Monday female deceased increase 0.882353
1 Monday male deceased increase 0.823529
2 Tuesday female deceased increase 1.666667
3 Tuesday male deceased increase 1.166667
4 Wednesday female deceased increase 0.882353

8.6) Mortality Rates across Gender Groups

In [105]:
timegender = pd.read_csv("covid/timeGender.csv")
timegender.head()
Out[105]:
date time sex confirmed deceased
0 2020-03-02 0 male 1591 13
1 2020-03-02 0 female 2621 9
2 2020-03-03 0 male 1810 16
3 2020-03-03 0 female 3002 12
4 2020-03-04 0 male 1996 20
In [106]:
timegendergrouped = timegender.groupby("sex")["confirmed", "deceased"].sum()
timegendergrouped
Out[106]:
confirmed deceased
sex
female 747467 12019
male 513727 13484

z Test: $H_0: p_m=p_f$; $H_a: p_m\neq p_f$

$$z=\frac{\hat{p}_m - \hat{p}_f}{\sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m}+\frac{\hat{p}_f(1-\hat{p}_f)}{n_f}}}$$
In [107]:
p = timegendergrouped["deceased"] / timegendergrouped["confirmed"]
p.plot(kind="barh", figsize=(16, 9))
Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a495e3310>
In [108]:
z = (p["male"] - p["female"]) / np.sqrt(p["male"]*(1-p["male"])/timegendergrouped["confirmed"]["male"]+p["female"]*(1-p["female"])/timegendergrouped["confirmed"]["female"])
z
Out[108]:
38.18116728523888

The staggering z score indicates that the mortality rate also varies between genders; the corresponding two-sided p-value is computed below. More specifically, males are far more vulnerable to COVID-19 mortality than females.
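
The corresponding two-sided p-value (a sketch; z comes from the cell above):

from scipy import stats

# With a z score of ~38, the two-sided p-value is effectively zero,
# so H0 of equal mortality rates between genders is rejected.
p_value = 2 * stats.norm.sf(abs(z))
print(p_value)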

9) Time Province

In [109]:
timeProvince = pd.read_csv('covid/timeprovince.csv')
display(timeProvince.head())

fig = px.line(timeProvince, x="date", y="confirmed", color='province', title="Confirmed cases of various province over time")
fig.update_layout(hovermode='x')


fig.show()
date time province confirmed released deceased
0 2020-01-20 16 Seoul 0 0 0
1 2020-01-20 16 Busan 0 0 0
2 2020-01-20 16 Daegu 0 0 0
3 2020-01-20 16 Incheon 1 0 0
4 2020-01-20 16 Gwangju 0 0 0
In [110]:
fig = px.line(timeProvince, x="date", y="released", color='province', title="Released cases of various province over time")
fig.update_layout(hovermode='x')
fig.show()
In [111]:
fig = px.line(timeProvince, x="date", y="deceased", color='province', title="Deceased cases of various province over time")
fig.update_layout(hovermode='x')
fig.show()

9.1) Deceased vs confirmed cases across different province over time (With Daegu)

In [112]:
timeProvince = pd.read_csv('covid/timeprovince.csv')
display(timeProvince.head())

fig = px.scatter(timeProvince, 
                 y='released', 
                 x='confirmed', 
                 color="province",
                 size_max=100,
                 size="deceased",
                 animation_frame="date", 
                 animation_group="province",
                 range_y=[0, timeProvince['confirmed'].max() + timeProvince['confirmed'].std()], 
                 range_x=[0, timeProvince['released'].max() + timeProvince['released'].std()]
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
date time province confirmed released deceased
0 2020-01-20 16 Seoul 0 0 0
1 2020-01-20 16 Busan 0 0 0
2 2020-01-20 16 Daegu 0 0 0
3 2020-01-20 16 Incheon 1 0 0
4 2020-01-20 16 Gwangju 0 0 0

9.2) Deceased vs confirmed cases across different province over time (Without Daegu)

In [113]:
timeProvince = pd.read_csv('covid/timeprovince.csv')
timeProvince = timeProvince[timeProvince['province'] != 'Daegu']
display(timeProvince.head())

fig = px.scatter(timeProvince, 
                 y='released', 
                 x='confirmed', 
                 color="province",
                 size_max=100,
                 size="deceased",
                 animation_frame="date", 
                 animation_group="province",
                 range_y=[0, timeProvince['confirmed'].max() + timeProvince['confirmed'].std()], 
                 range_x=[0, timeProvince['released'].max() + timeProvince['released'].std()]
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
date time province confirmed released deceased
0 2020-01-20 16 Seoul 0 0 0
1 2020-01-20 16 Busan 0 0 0
3 2020-01-20 16 Incheon 1 0 0
4 2020-01-20 16 Gwangju 0 0 0
5 2020-01-20 16 Daejeon 0 0 0

9.3) Mortality Rates across Provinces

In [114]:
timeprovince = pd.read_csv("covid/timeProvince.csv")
timeprovince.head()
Out[114]:
date time province confirmed released deceased
0 2020-01-20 16 Seoul 0 0 0
1 2020-01-20 16 Busan 0 0 0
2 2020-01-20 16 Daegu 0 0 0
3 2020-01-20 16 Incheon 1 0 0
4 2020-01-20 16 Gwangju 0 0 0
In [115]:
timeprovincegrouped = timeprovince.groupby("province")["confirmed", "deceased"].sum()
timeprovincegrouped
Out[115]:
confirmed deceased
province
Busan 16341 299
Chungcheongbuk-do 5801 0
Chungcheongnam-do 16780 0
Daegu 807506 17624
Daejeon 5217 58
Gangwon-do 5908 225
Gwangju 3359 0
Gyeonggi-do 81059 1600
Gyeongsangbuk-do 161079 5393
Gyeongsangnam-do 13860 0
Incheon 16645 16
Jeju-do 1449 0
Jeollabuk-do 2108 0
Jeollanam-do 1763 0
Sejong 5111 0
Seoul 81923 298
Ulsan 5269 91
In [116]:
p = timeprovincegrouped["deceased"] / timeprovincegrouped["confirmed"]
p.plot(kind="barh", figsize=(16, 9))
Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a4d8d0910>
In [117]:
p_I = p.mean()
SSTr = np.sum([(n*(x-p_I))**2 for x, n in zip(p, timeprovincegrouped["confirmed"])])
SSE = np.sum([x*(1-x)*n for x, n in zip(p, timeprovincegrouped["confirmed"])])
df = np.sum([n-1 for n in timeprovincegrouped["confirmed"]])
F = (SSTr/(len(p))) / (SSE/df) 
F
Out[117]:
324758976.26049244

Once again, the province groups appear to explain mortality differentials well: the F ratio far exceeds the critical value at any reasonable significance level.

9.4) Does the TimeProvince dataset tally with cases.csv & patient.csv?

In [118]:
province = pd.read_csv("covid/TimeProvince.csv")
In [119]:
province.date = province.date.apply(lambda x: pd.to_datetime(x))
province_latest = province[province.date == max(province.date)]
province_latest.head()
Out[119]:
date time province confirmed released deceased
2754 2020-06-30 0 Seoul 1312 985 7
2755 2020-06-30 0 Busan 154 142 3
2756 2020-06-30 0 Daegu 6906 6700 189
2757 2020-06-30 0 Incheon 341 290 1
2758 2020-06-30 0 Gwangju 44 32 0
In [120]:
province_latest.confirmed.sum()
Out[120]:
12076
In [121]:
plt.figure(figsize = (14,12))
sns.barplot( x = "province", y = "confirmed", data = province_latest)
plt.title("Latest number of confirmed cases as of 30th June 2020", size = 15)
plt.tick_params(axis = "both", labelsize = 15, labelrotation = 90)

plt.show()

The TimeProvince dataset tallies with cases.csv, but not with patient.csv; a rough cross-check against the case-level data is sketched below.
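
A rough cross-check against the case-level data (a sketch; the file name covid/case.csv and its 'province'/'confirmed' columns are assumptions based on the Cases analysis in Section 3):

cases = pd.read_csv('covid/case.csv')
# Compare case-level totals per province with the latest TimeProvince snapshot
print(cases.groupby('province')['confirmed'].sum().sort_values(ascending=False).head())
print(province_latest.set_index('province')['confirmed'].sort_values(ascending=False).head())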

10) In depth Analysis Of Daegu Cases using Technical Analysis.

In [122]:
timeProvince = pd.read_csv('covid/timeprovince.csv')
dataForDaegu = timeProvince[timeProvince['province'] == 'Daegu']
dataForDaegu = dataForDaegu[['date','province','confirmed']]
dataForDaegu['increasePerDay'] = dataForDaegu['confirmed'] - dataForDaegu['confirmed'].shift(1)
dataForDaegu = dataForDaegu[['date','province','increasePerDay']]
dataForDaegu = dataForDaegu.rename({'increasePerDay':'Increase Per Day'}, axis=1)
display(dataForDaegu.head())

fig = px.line(dataForDaegu, x="date", y="Increase Per Day", title="Daegu Cases Per Day")
fig.update_layout(hovermode='x')
fig.show()
date province Increase Per Day
2 2020-01-20 Daegu NaN
19 2020-01-21 Daegu 0.0
36 2020-01-22 Daegu 0.0
53 2020-01-23 Daegu 0.0
70 2020-01-24 Daegu 0.0

Daegu's cases are most active during the Feb to Apr period.

10.1) Technical Analysis

In [123]:
# Uncomment to focus the data on the active period
# dataForDaegu = dataForDaegu.set_index('date')
# dataForDaegu = dataForDaegu['2020-02-01' : '2020-04-30'].reset_index()
dataForDaegu['10 Days Moving Average'] = dataForDaegu['Increase Per Day'].rolling(10).mean()
dataForDaegu['Upper Limit 10 day Bollinger Band']  = dataForDaegu['Increase Per Day'].rolling(10).mean() + (dataForDaegu['Increase Per Day'].rolling(10).std() * 2)
dataForDaegu['Lower Limit 10 day Bollinger Band']  = dataForDaegu['Increase Per Day'].rolling(10).mean() - (dataForDaegu['Increase Per Day'].rolling(10).std() * 2)
dataForDaeguNew = pd.melt(dataForDaegu, id_vars=['date'], value_vars=['Increase Per Day', '10 Days Moving Average','Upper Limit 10 day Bollinger Band','Lower Limit 10 day Bollinger Band'])
display(dataForDaeguNew.head())

fig = px.line(dataForDaeguNew, x="date", y="value", color='variable', title="Daegu Cases Per Day")
fig['data'][2]['line']['color']="rgba(147,212,219,0.8)"
fig['data'][3]['line']['color']="rgba(157,212,219,0.8)"
fig.update_layout(hovermode='x')
fig.show()
date variable value
0 2020-01-20 Increase Per Day NaN
1 2020-01-21 Increase Per Day 0.0
2 2020-01-22 Increase Per Day 0.0
3 2020-01-23 Increase Per Day 0.0
4 2020-01-24 Increase Per Day 0.0

10.2) Stats By Month

In [124]:
dataForDaeguNew['date'] = pd.to_datetime(dataForDaeguNew['date'])
dataForDaeguNew['Month'] = dataForDaeguNew['date'].dt.month 
dataForDaeguNew['Day'] = dataForDaeguNew['date'].dt.day_name()
dataForDaeguMonth = dataForDaeguNew[['variable','Month']]
dataForDaeguMonth = dataForDaeguNew.groupby(['Month','variable'])['value'].mean()
dataForDaeguMonth = dataForDaeguMonth.reset_index()
display(dataForDaeguMonth.head())

fig = px.bar(dataForDaeguMonth, x='Month', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Month variable value
0 1 10 Days Moving Average 0.000000
1 1 Increase Per Day 0.000000
2 1 Lower Limit 10 day Bollinger Band 0.000000
3 1 Upper Limit 10 day Bollinger Band 0.000000
4 2 10 Days Moving Average 25.265517

10.3) Stats By Day

In [125]:
dataForDaeguMonth = dataForDaeguNew[['variable','Day']]
dataForDaeguMonth = dataForDaeguNew.groupby(['Day','variable'])['value'].mean()
dataForDaeguMonth = dataForDaeguMonth.reset_index()


weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
dataForDaeguMonth['Day'] = pd.Categorical(dataForDaeguMonth['Day'],categories=weekdays)
dataForDaeguMonth = dataForDaeguMonth.sort_values('Day')

display(dataForDaeguMonth.head())

fig = px.bar(dataForDaeguMonth, x='Day', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Day variable value
4 Monday 10 Days Moving Average 46.068182
5 Monday Increase Per Day 36.434783
6 Monday Lower Limit 10 day Bollinger Band -8.111102
7 Monday Upper Limit 10 day Bollinger Band 100.247465
23 Tuesday Upper Limit 10 day Bollinger Band 92.607688

11) Time Series World Map Visualisation of COVID-19 cases in Korea

In [126]:
timeProvince = pd.read_csv('covid/timeprovince.csv')

data = {'province': ['Seoul', 'Busan', 'Daegu', 'Incheon', 'Gwangju', 'Daejeon',
       'Ulsan', 'Sejong', 'Gyeonggi-do', 'Gangwon-do',
       'Chungcheongbuk-do', 'Chungcheongnam-do', 'Jeollabuk-do',
       'Jeollanam-do', 'Gyeongsangbuk-do', 'Gyeongsangnam-do', 'Jeju-do'], 
        'longitude': [127.047325,129.066666,128.600006,126.705208,126.916664,127.385002,
                     129.316666,127.2822,127.143738,127.920158,
                     127.935905,126.979874,126.916664,126.9910,
                      129.263885,128.429581,126.5312],
        'latitude': [37.517235,35.166668,35.866669,37.456257,35.166668,36.351002,
                    35.549999,36.4870,37.603405,37.342220,
                    36.981304,36.806702,35.166668,34.8679,
                     35.835354,34.855228,33.4996]}
location = pd.DataFrame(data=data)

mergedData = timeProvince.merge(location, on='province', how='left')
mergedData = mergedData.dropna()
mergedData.head()
Out[126]:
date time province confirmed released deceased longitude latitude
0 2020-01-20 16 Seoul 0 0 0 127.047325 37.517235
1 2020-01-20 16 Busan 0 0 0 129.066666 35.166668
2 2020-01-20 16 Daegu 0 0 0 128.600006 35.866669
3 2020-01-20 16 Incheon 1 0 0 126.705208 37.456257
4 2020-01-20 16 Gwangju 0 0 0 126.916664 35.166668

11.1) With Animation

In [127]:
fig = px.scatter_mapbox(
    mergedData, lat="latitude", lon="longitude",
    size="confirmed", size_max=100,
    color="deceased", color_continuous_scale=px.colors.sequential.Burg,
    hover_name="province",           
    mapbox_style='dark', zoom=6,
    animation_frame="date", animation_group="province"
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10

fig.show()

11.2) Without animation

In [128]:
fig = px.scatter_mapbox(
    mergedData, lat="latitude", lon="longitude",
    size="confirmed", size_max=100,
    color="deceased", color_continuous_scale=px.colors.sequential.Burg,
    hover_name="province",           
    mapbox_style='dark', zoom=6,
)

fig.show()

12) Weather

In [129]:
weatherData = pd.read_csv('covid/weather.csv')
weatherData = weatherData.set_index('date')
weatherData = weatherData['2020-01-01' : '2020-08-31'].reset_index()
display(weatherData.head())

weatherDataForHeatmap = weatherData.pivot("date", "province", "avg_temp")
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(weatherDataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Weather Over Time')
date code province avg_temp min_temp max_temp precipitation max_wind_speed most_wind_direction avg_relative_humidity
0 2020-01-01 10000 Seoul -2.2 -6.5 0.3 0.0 2.6 50.0 64.4
1 2020-01-01 11000 Busan 1.9 -3.2 7.8 0.0 5.1 340.0 44.0
2 2020-01-01 12000 Daegu 0.2 -4.9 4.6 0.0 5.6 270.0 53.3
3 2020-01-01 13000 Gwangju -0.3 -4.9 5.7 0.0 4.3 50.0 58.0
4 2020-01-01 14000 Incheon -1.4 -5.4 1.9 0.0 3.8 160.0 66.6
Out[129]:
Text(0.5, 1, 'Weather Over Time')

12.1) Busan Avg Temp (Mean) vs Most Wind Direction (Mean) Analysis

In [130]:
high_direction_province = weatherData.copy()
high_direction_province['month'] = high_direction_province['date'].apply(pd.to_datetime).dt.month
high_direction_province.head()
Out[130]:
date code province avg_temp min_temp max_temp precipitation max_wind_speed most_wind_direction avg_relative_humidity month
0 2020-01-01 10000 Seoul -2.2 -6.5 0.3 0.0 2.6 50.0 64.4 1
1 2020-01-01 11000 Busan 1.9 -3.2 7.8 0.0 5.1 340.0 44.0 1
2 2020-01-01 12000 Daegu 0.2 -4.9 4.6 0.0 5.6 270.0 53.3 1
3 2020-01-01 13000 Gwangju -0.3 -4.9 5.7 0.0 4.3 50.0 58.0 1
4 2020-01-01 14000 Incheon -1.4 -5.4 1.9 0.0 3.8 160.0 66.6 1
In [131]:
high_direction_province_grouped = high_direction_province.groupby(['province','month']).agg({'avg_temp':['mean','std','min','max'],
                                                                                             'min_temp':['mean','std'],
                                                                                             'max_temp':['mean','std'],
                                                                                             'max_wind_speed':['mean','std'],
                                                                                             'most_wind_direction':['mean','std'],
                                                                                             'avg_relative_humidity':['mean','std']})
print('Agg Data')
display(high_direction_province_grouped.head())

high_direction_province_grouped = high_direction_province_grouped.stack().stack()
high_direction_province_grouped = high_direction_province_grouped.reset_index()
high_direction_province_grouped.columns = ['Province','Month', 'Measurement Type','Weather Type','Value']
high_direction_province_grouped = high_direction_province_grouped.set_index(['Weather Type','Measurement Type','Province','Month'])

print('Indexed Data')
display(high_direction_province_grouped.head())
Agg Data
avg_temp min_temp max_temp max_wind_speed most_wind_direction avg_relative_humidity
mean std min max mean std mean std mean std mean std mean std
province month
Busan 1 6.412903 2.652514 1.9 14.4 3.135484 2.938316 10.883871 2.618154 6.245161 2.547788 212.580645 132.362324 56.351613 18.844537
2 7.137931 3.514197 0.1 12.8 3.113793 4.129141 11.837931 3.533858 6.344828 2.403507 223.448276 121.428282 55.134483 18.707929
3 10.441935 2.377782 5.1 14.5 6.438710 2.874332 14.961290 2.177411 7.125806 2.507850 163.333333 100.458718 56.906452 16.644397
4 12.620000 2.059193 9.0 16.9 8.896667 2.074390 17.543333 3.003295 7.580000 2.217237 180.333333 96.614532 52.570000 15.572681
5 17.877419 1.552355 14.9 20.2 14.977419 1.664073 21.790323 2.448313 6.167742 2.057731 135.806452 74.867625 72.483871 14.036930
Indexed Data
Value
Weather Type Measurement Type Province Month
avg_temp max Busan 1 14.400000
avg_relative_humidity mean Busan 1 56.351613
avg_temp mean Busan 1 6.412903
max_temp mean Busan 1 10.883871
max_wind_speed mean Busan 1 6.245161
In [132]:
busan_data = high_direction_province_grouped.loc[(['avg_temp','most_wind_direction'],slice(None),'Busan'), :].sort_values(['Weather Type','Month'])
busan_data.head()
Out[132]:
Value
Weather Type Measurement Type Province Month
avg_temp max Busan 1 14.400000
mean Busan 1 6.412903
min Busan 1 1.900000
std Busan 1 2.652514
max Busan 2 12.800000
In [133]:
busan_data_mean = high_direction_province_grouped.loc[(['avg_temp','most_wind_direction'],'mean','Busan'), :].sort_values(['Weather Type','Month']).reset_index()
display(busan_data_mean.head())

fig = px.bar(busan_data_mean, x='Month', y='Value', color='Weather Type', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Weather Type Measurement Type Province Month Value
0 avg_temp mean Busan 1 6.412903
1 avg_temp mean Busan 2 7.137931
2 avg_temp mean Busan 3 10.441935
3 avg_temp mean Busan 4 12.620000
4 avg_temp mean Busan 5 17.877419
In [134]:
busan_temp = busan_data_mean[busan_data_mean['Weather Type'] == 'avg_temp']['Value']
busan_temp = pd.DataFrame(busan_temp).reset_index().drop('index',axis=1)
busan_temp.columns = ['avg_temp']
# display(busan_temp)

busan_wind_direction = busan_data_mean[busan_data_mean['Weather Type'] == 'most_wind_direction']['Value']
busan_wind_direction = pd.DataFrame(busan_wind_direction).reset_index().drop('index',axis=1)
busan_wind_direction.columns = ['most_wind_direction']
# display(busan_wind_direction)

plot_table = busan_wind_direction.join(busan_temp)
display(plot_table)

fig = px.scatter(plot_table, x="most_wind_direction", y="avg_temp", trendline="ols", title="Avg Temp & Wind Direction")
fig.update_layout(hovermode='x')
fig.show()
most_wind_direction avg_temp
0 212.580645 6.412903
1 223.448276 7.137931
2 163.333333 10.441935
3 180.333333 12.620000
4 135.806452 17.877419
5 140.689655 22.448276

With an R-squared of 0.79, there is a strong correlation between avg_temp and most_wind_direction for Busan: the wind direction value increases when the temperature decreases. A quick cross-check of the fit is sketched below.
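
A quick cross-check of that R-squared (a sketch; plot_table comes from the cell above):

import statsmodels.api as sm

# Ordinary least squares of avg_temp on most_wind_direction for the Busan monthly means
ols_fit = sm.OLS(plot_table['avg_temp'], sm.add_constant(plot_table['most_wind_direction'])).fit()
print(ols_fit.rsquared)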

12.2) Do other regions have the same relationship between avg_temp and most_wind_direction?

In [135]:
all_data = high_direction_province_grouped.loc[(['avg_temp','most_wind_direction'],'mean',slice(None),slice(None)), :].sort_values(['Weather Type','Month'])
all_data = all_data.reset_index()

all_data_temp = all_data[all_data['Weather Type'] == 'avg_temp']['Value']
all_data_temp = all_data_temp.reset_index()
all_data_temp = all_data_temp.drop('index',axis=1)
all_data_temp.columns = ['temp']
# all_data_temp.head()

all_data_wind = all_data[all_data['Weather Type'] == 'most_wind_direction']['Value']
all_data_wind = all_data_wind.reset_index()
all_data_wind = all_data_wind.drop('index',axis=1)
all_data_wind.columns = ['wind']
# all_data_wind.head()

all_data_plot = all_data_temp.join(all_data_wind)
display(all_data_plot.head())

fig = px.scatter(all_data_plot, x="wind", y="temp", trendline="ols", title="Avg Temp & Wind Direction")
fig.update_layout(hovermode='x')
fig.show()
temp wind
0 6.412903 212.580645
1 0.609677 126.451613
2 1.770968 152.258065
3 3.764516 213.870968
4 2.706452 246.129032

With an R-squared of 0.009, it seems that the other regions do not follow a trend similar to Busan's.

13) Region

13.1) Number of nursing homes and elderly population ratio across Korea

In [136]:
regionData = pd.read_csv('covid/region.csv')
pd.set_option('display.max_rows', regionData.shape[0]+1)
display(regionData.head())

fig = px.scatter_mapbox(
    regionData[regionData.city != 'Korea'], 
    text="city",
    lat="latitude", 
    lon="longitude",     
    color="elderly_population_ratio", 
    size="nursing_home_count",
    color_continuous_scale=px.colors.sequential.Burg, 
    size_max=100, 
    zoom=6,
    title="Number of nusing home and elder population ratio across Korea")
fig.show()
code province city latitude longitude elementary_school_count kindergarten_count university_count academy_ratio elderly_population_ratio elderly_alone_ratio nursing_home_count
0 10000 Seoul Seoul 37.566953 126.977977 607 830 48 1.44 15.38 5.8 22739
1 10010 Seoul Gangnam-gu 37.518421 127.047222 33 38 0 4.18 13.17 4.3 3088
2 10020 Seoul Gangdong-gu 37.530492 127.123837 27 32 0 1.54 14.55 5.4 1023
3 10030 Seoul Gangbuk-gu 37.639938 127.025508 14 21 0 0.67 19.49 8.5 628
4 10040 Seoul Gangseo-gu 37.551166 126.849506 36 56 1 1.17 14.39 5.7 1080

13.2) Correlation between elder population ratio and number of cases

In [137]:
region = pd.read_csv("covid/Region.csv")
region.head()
Out[137]:
code province city latitude longitude elementary_school_count kindergarten_count university_count academy_ratio elderly_population_ratio elderly_alone_ratio nursing_home_count
0 10000 Seoul Seoul 37.566953 126.977977 607 830 48 1.44 15.38 5.8 22739
1 10010 Seoul Gangnam-gu 37.518421 127.047222 33 38 0 4.18 13.17 4.3 3088
2 10020 Seoul Gangdong-gu 37.530492 127.123837 27 32 0 1.54 14.55 5.4 1023
3 10030 Seoul Gangbuk-gu 37.639938 127.025508 14 21 0 0.67 19.49 8.5 628
4 10040 Seoul Gangseo-gu 37.551166 126.849506 36 56 1 1.17 14.39 5.7 1080
In [138]:
region_elderly_mean = pd.DataFrame(region[region.province != "Korea"].groupby("province")["elderly_population_ratio"].mean().sort_values(ascending = False))
region_elderly_mean.head()
Out[138]:
elderly_population_ratio
province
Jeollanam-do 28.142174
Gyeongsangbuk-do 27.556250
Jeollabuk-do 27.470667
Gyeongsangnam-do 24.749474
Chungcheongnam-do 23.844375
In [139]:
plt.figure(figsize = (13,10))
sns.barplot(x = region_elderly_mean.index, y = region_elderly_mean["elderly_population_ratio"], palette = "muted")
plt.tick_params(axis = "both", labelsize = 15, rotation = 90)
plt.title("Elderly population by province", size = 15)
plt.ylabel("Elderly Population Ratio", size = 15)
plt.xlabel("Province", size = 15)
plt.show()
In [140]:
province_latest.sort_values("confirmed", ascending = False, inplace = True)
In [141]:
region_elderly_mean.loc[province_latest.province,:].head()
Out[141]:
elderly_population_ratio
province
Daegu 17.031111
Gyeongsangbuk-do 27.556250
Seoul 15.738077
Gyeonggi-do 14.429688
Incheon 16.399091
In [142]:
region_elderly_mean.loc[province_latest.province,:]
plt.figure(figsize = (13,10))
sns.barplot(x = region_elderly_mean.loc[province_latest.province,:].index, 
            y = region_elderly_mean.loc[province_latest.province,"elderly_population_ratio"], palette = "muted")
plt.tick_params(axis = "both", labelsize = 15, rotation = 90)
plt.title("Elderly population by province (order by the number of confirmed cases - Descending Order)", size = 15)
plt.ylabel("Elderly Population Ratio", size = 15)
plt.xlabel("Province", size = 15)
plt.show()

It seems that the elderly population ratio by province is not correlated with the number of confirmed cases; a quick correlation check is sketched below.
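
A quick correlation check (a sketch; region_elderly_mean and province_latest come from the cells above):

# Join the mean elderly ratio with the latest confirmed counts, both indexed by province
elderly_vs_cases = region_elderly_mean.join(province_latest.set_index('province')['confirmed'])
print(elderly_vs_cases.corr(method='spearman'))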

14) Web Scraping: Comparison across the world

In [143]:
page = requests.get("https://www.worldometers.info/coronavirus")
page.status_code
Out[143]:
200
In [144]:
# page.content
soup = BeautifulSoup(page.content, 'lxml')
In [145]:
# print(soup.prettify())
In [146]:
covidTable = soup.find('table', attrs={'id': 'main_table_countries_today'})
# covidTable

14.1) Getting the data from the table

In [147]:
rows = covidTable.find_all("tr", attrs={"style": ""})
In [148]:
covidData = []
for i,data in enumerate(rows):
    if i == 0:
        
        covidData.append(data.text.strip().split("\n")[:13])
        
    else:
        covidData.append(data.text.strip().split("\n")[:12])
# covidData

14.2) Convert to a DataFrame

In [149]:
covidTable = pd.DataFrame(covidData[1:], columns=covidData[0][:12])
covidTable = covidTable[~covidTable['#'].isin(['World', 'Total:'])]
covidTable = covidTable.drop('#', axis =1)
covidTable.head()
Out[149]:
Country,Other TotalCases NewCases TotalDeaths NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop
1 USA 7,244,184 208,440 4,480,719 2,555,025 14,141 21,855 629
2 India 5,908,748 +7,177 93,440 +30 4,849,584 +3,416 965,724 8,944 4,272 68
3 Brazil 4,692,579 140,709 4,040,949 510,921 8,318 22,039 661
4 Russia 1,136,048 20,056 934,146 181,846 2,300 7,784 137
5 Colombia 798,317 25,103 687,477 85,737 863 15,650 492

14.3) Worldwide Total Cases Chart

In [150]:
fig = px.bar(covidTable, y='TotalCases', x='Country,Other')
fig.update_layout(hovermode='x')
fig.show()

We will use a log scale because the data is highly skewed.

In [151]:
covidTable['logTotalCases'] = np.log(covidTable['TotalCases'].str.replace(r'\D', '').astype(int))
covidTable = covidTable.sort_values(by=['logTotalCases'])
fig = px.bar(covidTable, y='logTotalCases', x='Country,Other')
fig.update_layout(hovermode='x')
fig.show()

14.4) Worldwide Total Death Chart

In [152]:
pd.set_option('display.max_rows', covidTable.shape[0]+1)
# Strip whitespace so blank death counts can be replaced with NaN below
covidTable['TotalDeaths'] = covidTable['TotalDeaths'].str.strip()
covidTable['TotalDeaths'].replace('', np.nan, inplace=True)
covidTable['TotalDeaths'].replace(' ', np.nan, inplace=True)
covidTable['TotalDeaths'] = covidTable['TotalDeaths'].fillna(0)

covidTable['logTotalDeath'] = np.log(covidTable['TotalDeaths'].str.replace(r'\D', '').astype(float))
covidTable = covidTable.sort_values(by=['logTotalDeath'])
display(covidTable.head())

fig = px.bar(covidTable, y='logTotalDeath', x='Country,Other')
fig.update_layout(hovermode='x')
fig.show()
Country,Other TotalCases NewCases TotalDeaths NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop logTotalCases logTotalDeath
202 Western Sahara 10 1 8 1 17 2 2.302585 0.0
195 British Virgin Islands 71 1 62 8 2 2,345 33 4.262680 0.0
194 Caribbean Netherlands 85 1 21 63 3,234 38 4.442651 0.0
192 Liechtenstein 117 1 110 6 3,067 26 4.762174 0.0
186 Cayman Islands 210 1 207 2 3,187 15 5.347108 0.0

14.5) Scatterplot between total cases and total death

In [153]:
fig = px.scatter(covidTable, x="logTotalDeath", y="logTotalCases", trendline="ols", title="Total Cases against Total Death (Log)")
fig.update_layout(hovermode='x')
fig.show()

With an R-squared value of 0.88, there is a strong correlation between total cases and total deaths (on the log scale): the more cases a country has, the higher its death count. The fitted R-squared can also be read off the trendline, as sketched below.
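
The quoted R-squared can be read directly off the fitted trendline (a sketch; px.get_trendline_results is available in recent plotly versions):

# Retrieve the statsmodels OLS fit that plotly computed for the trendline above
trendline = px.get_trendline_results(fig)
print(trendline.px_fit_results.iloc[0].rsquared)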

15) Summarise Key findings from Exploratory Data Analysis

15.1) Overall Cases in Korea

  • On average, there are around 100 COVID-19 cases per infected city.
  • The numbers are highly skewed by Nam-gu, a district of Daegu, which has about 4,500 cases.
  • Most of the clusters are in the provinces of Daegu, Gyeonggi-do and Seoul.

Cases across the various Korean provinces

  • The majority of provinces have more individual COVID-19 cases than group cases, with the exception of Incheon and Ulsan.
  • Seoul, Daegu, Gyeonggi-do and Gyeongsangbuk-do have the most individual cases among the Korean provinces; the counts range from 150 to 200, with the exception of Daegu.
  • In addition, the same provinces have the most group cases; the counts range from 50 to 100, with the exception of Daegu.

Source of infection

  • Sources of infection mainly come from the Shincheonji Church, Itaewon clubs, contact with patients and overseas inflow.
  • Most infection sources across the provinces have fewer than 500 cases, with the exception of Daegu.
  • Daegu has more than 4,000 cases from the Shincheonji Church, and about 2,000 cases each from contact with patients and from other sources.

15.2) Patient Cases

  • There is a quick spike in daily COVID-19 cases from Feb to Mar before slowly dropping down to single digits in May. Cases returned to double digits in June.

Recovery Speed

  • Age has an R-squared value of 0.04 for males and 0.03 for females against total treatment days.
  • At first glance, age seems to have little to no effect on recovery speed.
  • However, when we used the mean recovery days for each age group, we found that younger people recovered faster from COVID-19 than older people.
  • The average recovery time for young people (<=30) is around 25 days or fewer. Middle-aged people (30-60) took about 25-30 days to recover, and older people (>=70) took about 32-36 days.
  • Interestingly, the infected people in Daegu only consist of people in their 50s to 70s, and they took less than 15 days to recover. People in Jeollabuk-do take longer to recover from COVID-19 than those in the rest of the provinces. The majority of people took about 20-25 days to recover.

Fatality Speed

  • The average number of treatment days a patient had before passing away is shorter than the average number of days it took a patient to recover.
  • In addition, younger patients had fewer treatment days than older patients. This might indicate that younger patients sought treatment at a later stage of COVID-19.
  • Coincidentally, the number of treatment days increases with age. The trend is similar to the recovery speed across age groups.
  • Daegu has the fastest fatality speed: an average patient was under treatment for fewer than 10 days before passing away.
  • Gyeonggi-do has an outlier: a patient in their 30s died less than 10 days after seeking treatment.

Percentage of patients recovered, under treatment and deceased

  • Daegu has the highest fatality count among all the provinces, with a fatality rate of more than 10%. In addition, Daejeon, Gyeongsangbuk-do and Ulsan are the only other provinces with fatalities.
  • Males have a higher fatality rate and a lower survival rate than females.

COVID-19 Transmission Amount

  • The majority of cases only spread to 1 other person.
  • Case #20000000205 is a super spreader who spread the virus to more than 50 people.
  • In addition, there are fewer than 10 cases where a patient spread the virus to 30 or more people.

15.3) Policy

  • There is a quick spike in daily COVID-19 cases from Feb to Mar before slowly dropping down to single digits in May. Cases returned to double digits in June.
  • Technology-related policies were introduced in Feb; thanks especially to open-data policies, people like us are able to analyse Korea's COVID-19 cases.
  • After the spike in COVID-19 cases in March, education, social, health and immigration policies were introduced rapidly. Furthermore, the alert level was upgraded to Level 4 (red).
  • After the introduction of the policies, the number of COVID-19 cases dropped drastically during April and May. Hence the policies appear to be effective.
  • However, control seems to have become lax as COVID-19 cases slowly crept back up to double digits in May/June. This forced the government to introduce administrative policies, such as the closure of clubs, to control the situation.

15.4) Cases Over Time Analysis

  • As of 30th June 2020, 1.27 million people had been tested: 1.24 million tested negative, there were 12.8 thousand confirmed cases and 11.5 thousand released cases, and 282 people had passed away.
  • The majority of people tested negative for COVID-19, and less than 1% tested positive for the virus.
  • Among those who tested positive, almost 90% have recovered and about 2% have passed away.

15.5) Cases Over Age & Time Analysis

  • The 20s, 40s, 50s and 60s age groups have more infection cases than the other age groups.
  • Furthermore, most of the fatalities are in the 60s, 70s and 80s age groups.
  • This indicates that the risk of death from COVID-19 increases with age.

15.6) Cases Over Gender & Time Analysis

  • There are more infected females than males.
  • The rate of increase of confirmed cases is about the same for both genders over the last 6 months.
  • On the other hand, males are more prone to death from COVID-19 than females.

15.7) Cases Over Province & Time Analysis

  • The provinces of Daegu, Gyeongsangbuk-do, Incheon and Seoul have the most COVID-19 infection cases among the 17 provinces, with Daegu accounting for more than half of the cases.
  • The fatality count is roughly proportional to the infection count, with the exception of Seoul.
  • Both Seoul and Gyeonggi-do have about 900 infected cases; however, Gyeonggi-do has about 22 fatalities while Seoul has about 7.

15.8) Weather

  • The weather is about the same across the provinces.
  • The temperature gradually increases from around 5 degrees to 25 degrees over the period of Jan 2020 to June 2020.
  • Since the weather is relatively similar across all provinces, it likely has little to do with the reported cases in the short term.

15.9) Region

  • Number of nursing homes vs elderly population ratio
  • We observed an interesting fact regarding the elderly population ratio and the number of nursing homes in each city.
  • The elderly population ratio and the number of nursing homes have an inverse relationship (see the sketch after this list).
  • Perhaps the absolute number of elderly people is smaller where the elderly population ratio is high, which may explain why there are fewer nursing homes in places with a higher elderly population ratio.
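
A minimal sketch of this check, assuming Region.csv (columns documented in the appendix) sits at covid/region.csv:

import pandas as pd

region = pd.read_csv('covid/region.csv')

# A negative coefficient supports the inverse relationship noted above.
corr = region['elderly_population_ratio'].corr(region['nursing_home_count'])
print('Correlation between elderly population ratio and nursing home count:', round(corr, 3))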

16) Preparing the data for modelling

We will create a model which predicts whether a COVID-19 patient will survive, given a certain set of conditions:

  • Sort city & province into different risk levels according to the number of confirmed cases (Low, Medium, High)
  • Sort age into different age levels (kid, adult, elderly)
  • Encode gender into the categorical groups 0 & 1
  • Number of treatment days
  • Encode the released and deceased statuses into different categories (we will ignore cases which are still under treatment)

16.1) Get the necessary columns

In [154]:
patientData = pd.read_csv('covid/patientinfo.csv')
patientModellingData = patientData[['sex','age','province','confirmed_date','released_date','deceased_date','state']]

## We remove isolated patients since it is not confirmed whether they survived or not
patientModellingData = patientModellingData[patientModellingData['state'] != 'isolated']
nullList = ['sex','age','confirmed_date']
for item in nullList:
     patientModellingData = patientModellingData[~patientModellingData[item].isnull()]
        
display(patientModellingData.head())
sex age province confirmed_date released_date deceased_date state
0 male 50s Seoul 2020-01-23 2020-02-05 NaN released
1 male 30s Seoul 2020-01-30 2020-03-02 NaN released
2 male 50s Seoul 2020-01-30 2020-02-19 NaN released
3 male 20s Seoul 2020-01-30 2020-02-15 NaN released
4 female 20s Seoul 2020-01-31 2020-02-24 NaN released

16.2) Calculate the number of treatment days before dropping the date columns

In [155]:
cols = ['released_date', 'deceased_date', 'confirmed_date']
patientModellingData[cols] = patientModellingData[cols].apply(pd.to_datetime, errors='coerce', axis=1)

def calculate_number_of_treatment_days(row):
    if row["released_date"] is not pd.NaT:
        treatmentDays = pd.to_numeric((row['released_date'] - row['confirmed_date']).days)
        return(treatmentDays)
    elif row["deceased_date"] is not pd.NaT:
        treatmentDays = pd.to_numeric((row['deceased_date'] - row['confirmed_date']).days)
        return(treatmentDays)
    else:
        return None

patientModellingData['Treatment Days'] = patientModellingData.apply(calculate_number_of_treatment_days, axis=1)

patientModellingData = patientModellingData[~patientModellingData['Treatment Days'].isnull()]
patientModellingDataTreatment = patientModellingData[['sex','age','province','state','Treatment Days']]
display(patientModellingDataTreatment.head())
sex age province state Treatment Days
0 male 50s Seoul released 13.0
1 male 30s Seoul released 32.0
2 male 50s Seoul released 20.0
3 male 20s Seoul released 16.0
4 female 20s Seoul released 24.0

16.3) Convert state, gender and age columns to categorical data

In [156]:
genders = {"male": 0, "female": 1}
patientModellingDataTreatment['sex'] = patientModellingDataTreatment['sex'].map(genders)
display(patientModellingDataTreatment.head())
sex age province state Treatment Days
0 0 50s Seoul released 13.0
1 0 30s Seoul released 32.0
2 0 50s Seoul released 20.0
3 0 20s Seoul released 16.0
4 1 20s Seoul released 24.0
In [157]:
state = {"released": 0, "deceased": 1}
patientModellingDataTreatment['state'] = patientModellingDataTreatment['state'].map(state)
display(patientModellingDataTreatment.head())
sex age province state Treatment Days
0 0 50s Seoul 0 13.0
1 0 30s Seoul 0 32.0
2 0 50s Seoul 0 20.0
3 0 20s Seoul 0 16.0
4 1 20s Seoul 0 24.0
In [158]:
age = {'0s':0, '10s':1, '20s':2, '30s':3, '40s':4, '50s':5, '60s':6, '70s':7, '80s':8, '90s':9}
patientModellingDataTreatment['age'] = patientModellingDataTreatment['age'].map(age)
display(patientModellingDataTreatment.head())
sex age province state Treatment Days
0 0 5.0 Seoul 0 13.0
1 0 3.0 Seoul 0 32.0
2 0 5.0 Seoul 0 20.0
3 0 2.0 Seoul 0 16.0
4 1 2.0 Seoul 0 24.0

16.4) Get dummies for province

In [159]:
provinceDummy = pd.get_dummies(patientModellingDataTreatment['province'])
provinceDummy
Out[159]:
Chungcheongbuk-do Chungcheongnam-do Daegu Daejeon Gangwon-do Gwangju Gyeonggi-do Gyeongsangbuk-do Gyeongsangnam-do Incheon Jeju-do Jeollabuk-do Jeollanam-do Sejong Seoul Ulsan
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5156 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5157 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5158 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5159 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5160 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

1635 rows × 16 columns

In [160]:
dataForModelling = patientModellingDataTreatment.join(provinceDummy)
dataForModelling = dataForModelling.drop('province',axis=1)
dataForModelling
Out[160]:
sex age state Treatment Days Chungcheongbuk-do Chungcheongnam-do Daegu Daejeon Gangwon-do Gwangju Gyeonggi-do Gyeongsangbuk-do Gyeongsangnam-do Incheon Jeju-do Jeollabuk-do Jeollanam-do Sejong Seoul Ulsan
0 0 5.0 0 13.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 3.0 0 32.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
2 0 5.0 0 20.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
3 0 2.0 0 16.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
4 1 2.0 0 24.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5156 0 3.0 0 46.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5157 1 2.0 0 32.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5158 1 1.0 0 12.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5159 1 3.0 0 34.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5160 1 3.0 0 14.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

1635 rows × 20 columns

17) Correlation Matrix

In [161]:
plt.figure(figsize=(10,5))
sns.heatmap(dataForModelling.corr(), cmap="RdPu")

dataForModelling['age'] = pd.to_numeric(dataForModelling['age'])

dataForModellingForCleaning = dataForModelling.copy()

display(dataForModellingForCleaning.count())
print(np.any(np.isnan(dataForModellingForCleaning)))
print(np.all(np.isfinite(dataForModellingForCleaning)))

def clean_dataset(df):
    # Drop rows containing NaN in place, then return a float64 copy of the
    # rows that are also free of +/- infinity.
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

clean_dataset(dataForModellingForCleaning)
display(dataForModellingForCleaning.count())
print(np.any(np.isnan(dataForModellingForCleaning)))
print(np.all(np.isfinite(dataForModellingForCleaning)))
sex                  1635
age                  1634
state                1635
Treatment Days       1635
Chungcheongbuk-do    1635
Chungcheongnam-do    1635
Daegu                1635
Daejeon              1635
Gangwon-do           1635
Gwangju              1635
Gyeonggi-do          1635
Gyeongsangbuk-do     1635
Gyeongsangnam-do     1635
Incheon              1635
Jeju-do              1635
Jeollabuk-do         1635
Jeollanam-do         1635
Sejong               1635
Seoul                1635
Ulsan                1635
dtype: int64
True
False
sex                  1634
age                  1634
state                1634
Treatment Days       1634
Chungcheongbuk-do    1634
Chungcheongnam-do    1634
Daegu                1634
Daejeon              1634
Gangwon-do           1634
Gwangju              1634
Gyeonggi-do          1634
Gyeongsangbuk-do     1634
Gyeongsangnam-do     1634
Incheon              1634
Jeju-do              1634
Jeollabuk-do         1634
Jeollanam-do         1634
Sejong               1634
Seoul                1634
Ulsan                1634
dtype: int64
False
True

18) Building Machine Learning Models Part 1

In [162]:
x = dataForModellingForCleaning.drop("state", axis=1)
y = dataForModellingForCleaning["state"]
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=42)

18.1) Stochastic Gradient Descent (SGD):

In [163]:
sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)

sgd.score(X_train, Y_train)

acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

18.2) Random Forest:

In [164]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

18.3) Logistic Regression:

In [165]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_test)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

18.4) Gaussian Naive Bayes:

In [166]:
gaussian = GaussianNB() 
gaussian.fit(X_train, Y_train)  
Y_pred = gaussian.predict(X_test)  
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

18.5) K Nearest Neighbor:

In [167]:
knn = KNeighborsClassifier(n_neighbors = 3) 
knn.fit(X_train, Y_train)  
Y_pred = knn.predict(X_test)  
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

18.6) Perceptron:

In [168]:
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)

acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)

18.7) Linear Support Vector Machine:

In [169]:
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

18.8) Decision Tree

In [170]:
decision_tree = DecisionTreeClassifier() 
decision_tree.fit(X_train, Y_train)  
Y_pred = decision_tree.predict(X_test)  
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

18.9) Getting the best model

In [171]:
results = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 
              'Decision Tree'],
    'Score': [acc_linear_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(9)
Out[171]:
Model
Score
99.63 Random Forest
99.63 Decision Tree
98.54 KNN
98.45 Support Vector Machines
98.17 Logistic Regression
97.90 Perceptron
97.26 Stochastic Gradient Descent
38.48 Naive Bayes

18.10) Decision Tree Diagram

In [172]:
feature_cols = x.columns

dot_data = StringIO()
export_graphviz(decision_tree, out_file = dot_data, 
                      feature_names = feature_cols,  
                     filled = True, rounded = True,  
                    special_characters = True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('covidTree.png')
Image(graph.create_png())

# Entropy 0 == no disorder // perfect knowledge, perfect classification
Out[172]:

18.11) What is the most important feature?

In [173]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances1 = importances.sort_values('importance',ascending=False).set_index('feature')
importances1.head(15)
Out[173]:
importance
feature
Treatment Days 0.576
age 0.214
Daegu 0.120
sex 0.023
Gyeongsangbuk-do 0.020
Seoul 0.010
Gangwon-do 0.010
Incheon 0.005
Ulsan 0.005
Chungcheongnam-do 0.005
Gyeongsangnam-do 0.004
Chungcheongbuk-do 0.003
Gyeonggi-do 0.003
Gwangju 0.001
Daejeon 0.001
In [174]:
barData = importances.reset_index()
fig = px.bar(barData, x='feature', y='importance')
fig.update_layout(hovermode='x')
fig.show()

18.12) Confusion Matrix with Precision & Recall & F-Score

In [175]:
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
result1_train = confusion_matrix(Y_train, predictions)
display(result1_train)

p1_train = precision_score(Y_train, predictions)
r1_train = recall_score(Y_train, predictions)
f1_train = f1_score(Y_train, predictions)
print("Precision:", p1_train)
print("Recall:", r1_train)
print("F-Score:", f1_train)
array([[1042,    7],
       [  16,   29]])
Precision: 0.8055555555555556
Recall: 0.6444444444444445
F-Score: 0.7160493827160493
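
The printed scores follow directly from the confusion matrix above, with the deceased class (1) treated as the positive class; scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]. A quick sanity check of the arithmetic:

tn, fp, fn, tp = 1042, 7, 16, 29

precision = tp / (tp + fp)                                 # 29 / 36 ≈ 0.806
recall = tp / (tp + fn)                                    # 29 / 45 ≈ 0.644
f_score = 2 * precision * recall / (precision + recall)    # ≈ 0.716

print(precision, recall, f_score)
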
In [176]:
predictions = cross_val_predict(random_forest, X_test, Y_test, cv=3)
result1_test = confusion_matrix(Y_test, predictions)
display(result1_test)

p1_test = precision_score(Y_test, predictions)
r1_test = recall_score(Y_test, predictions)
f1_test = f1_score(Y_test, predictions)
print("Precision:", p1_test)
print("Recall:", r1_test)
print("F-Score:", f1_test)
array([[513,   5],
       [  9,  13]])
Precision: 0.7222222222222222
Recall: 0.5909090909090909
F-Score: 0.65

19) Building Machine Learning Models Part 2

Let's observe what happens if we drop the treatment days column. Will the model be more accurate?

In [177]:
x2 = dataForModellingForCleaning.drop(["state","Treatment Days"], axis=1)
y2 = dataForModellingForCleaning["state"]
X_train, X_test, Y_train, Y_test = train_test_split(x2, y2, test_size=0.33, random_state=42)

19.1) Random Forest

In [178]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

19.2) Decision Tree

In [179]:
decision_tree = DecisionTreeClassifier() 
decision_tree.fit(X_train, Y_train)  
Y_pred = decision_tree.predict(X_test)  
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

19.3) Getting the best model

In [180]:
results = pd.DataFrame({
    'Model': ['Random Forest','Decision Tree'],
    'Score': [acc_random_forest, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(2)
Out[180]:
Model
Score
97.07 Random Forest
97.07 Decision Tree

19.4) Decision Tree Diagram

In [181]:
feature_cols = x2.columns

dot_data = StringIO()
export_graphviz(decision_tree, out_file = dot_data, 
                      feature_names = feature_cols,  
                     filled = True, rounded = True,  
                    special_characters = True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('covidTree2.png')
Image(graph.create_png())
Out[181]:

19.5) Importance

In [182]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances2 = importances.sort_values('importance',ascending=False).set_index('feature')
importances2.head(15)
Out[182]:
importance
feature
age 0.422
Daegu 0.378
sex 0.070
Gangwon-do 0.041
Gyeongsangbuk-do 0.034
Ulsan 0.010
Gyeonggi-do 0.008
Chungcheongbuk-do 0.008
Chungcheongnam-do 0.006
Incheon 0.005
Seoul 0.005
Gwangju 0.003
Gyeongsangnam-do 0.003
Jeollabuk-do 0.002
Jeollanam-do 0.002
In [183]:
barData = importances.reset_index()
fig = px.bar(barData, x='feature', y='importance')
fig.update_layout(hovermode='x')
fig.show()

19.6) Confusion Matrix with Precision & Recall & F-Score

In [184]:
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
result2_train = confusion_matrix(Y_train, predictions)
display(result2_train)

p2_train = precision_score(Y_train, predictions)
r2_train = recall_score(Y_train, predictions)
f2_train = f1_score(Y_train, predictions)
print("Precision:", p2_train)
print("Recall:", r2_train)
print("F-Score:", f2_train)
array([[1047,    2],
       [  34,   11]])
Precision: 0.8461538461538461
Recall: 0.24444444444444444
F-Score: 0.37931034482758624
In [185]:
predictions = cross_val_predict(random_forest, X_test, Y_test, cv=3)
result2_test = confusion_matrix(Y_test, predictions)
display(result2_test)

p2_test = precision_score(Y_test, predictions)
r2_test = recall_score(Y_test, predictions)
f2_test = f1_score(Y_test, predictions)
print("Precision:", p2_test)
print("Recall:", r2_test)
print("F-Score:", f2_test)
array([[510,   8],
       [ 13,   9]])
Precision: 0.5294117647058824
Recall: 0.4090909090909091
F-Score: 0.46153846153846156

20) Refine Dataset for Machine Learning

Will the model's accuracy improve if we include patients who died without any treatment and the month of infection?

In [186]:
patientData = pd.read_csv('covid/patientinfo.csv')
patientModellingData = patientData[['sex','age','province','confirmed_date','released_date','deceased_date','state']]
deadPatientModellingData = patientData[patientData['state'] == 'deceased']
totalDiedPatient = deadPatientModellingData.count()
display(totalDiedPatient)

display('***********')

pd.set_option('display.max_rows', deadPatientModellingData.shape[0]+1)
totalDiedPatientWithoutTreatment = deadPatientModellingData[deadPatientModellingData['deceased_date'].isnull()].count()
display(totalDiedPatientWithoutTreatment)
patient_id            78
sex                   75
age                   75
country               78
province              78
city                  59
infection_case        36
infected_by            3
contact_number         7
symptom_onset_date     6
confirmed_date        78
released_date          3
deceased_date         66
state                 78
dtype: int64
'***********'
patient_id            12
sex                    9
age                    9
country               12
province              12
city                  12
infection_case         8
infected_by            2
contact_number         4
symptom_onset_date     4
confirmed_date        12
released_date          1
deceased_date          0
state                 12
dtype: int64

Assuming that patients who died without any released_date or deceased_date are those who died without receiving any treatment,

there are 12 patients who died without any treatment.

In [187]:
str(12/78 * 100) + '%'
Out[187]:
'15.384615384615385%'

This accounts for about 15% of the total deaths.

20.1) Preparing the data for modelling

We will simply add another line of logic to return treatment days = 0
if the patient is deceased and both the released and deceased dates are empty.

In addition, we will add in the month in which the patient was confirmed to have COVID-19.

In [188]:
patientData = pd.read_csv('covid/patientinfo.csv')
patientModellingData = patientData[['sex','age','province','confirmed_date','released_date','deceased_date','state']]

## We remove isolated patients since it is not confirmed whether they survived or not
patientModellingData = patientModellingData[patientModellingData['state'] != 'isolated']
nullList = ['sex','age','confirmed_date']
for item in nullList:
     patientModellingData = patientModellingData[~patientModellingData[item].isnull()]
        
display(patientModellingData.head())
sex age province confirmed_date released_date deceased_date state
0 male 50s Seoul 2020-01-23 2020-02-05 NaN released
1 male 30s Seoul 2020-01-30 2020-03-02 NaN released
2 male 50s Seoul 2020-01-30 2020-02-19 NaN released
3 male 20s Seoul 2020-01-30 2020-02-15 NaN released
4 female 20s Seoul 2020-01-31 2020-02-24 NaN released
In [189]:
cols = ['released_date', 'deceased_date', 'confirmed_date']
patientModellingData[cols] = patientModellingData[cols].apply(pd.to_datetime, errors='coerce', axis=1)

def calculate_number_of_treatment_days(row):
    if row["released_date"] is not pd.NaT:
        treatmentDays = pd.to_numeric((row['released_date'] - row['confirmed_date']).days)
        return(treatmentDays)
    elif row["deceased_date"] is not pd.NaT:
        treatmentDays = pd.to_numeric((row['deceased_date'] - row['confirmed_date']).days)
        return(treatmentDays)
    elif row["deceased_date"] is pd.NaT and row["released_date"] is pd.NaT:
        if row["state"] == 'deceased' and row["state"] != 'released':
            return 0
        else:
            return None
    else:
        return None

patientModellingData['Treatment Days'] = patientModellingData.apply(calculate_number_of_treatment_days, axis=1)

patientModellingData = patientModellingData[~patientModellingData['Treatment Days'].isnull()]
patientModellingDataTreatment = patientModellingData[['sex','age','confirmed_date','province','state','Treatment Days']]
patientModellingDataTreatment['confirmed_month'] = patientModellingDataTreatment['confirmed_date'].dt.month
patientModellingDataTreatment = patientModellingDataTreatment.drop('confirmed_date', axis=1)
display(patientModellingDataTreatment.head())
sex age province state Treatment Days confirmed_month
0 male 50s Seoul released 13.0 1
1 male 30s Seoul released 32.0 1
2 male 50s Seoul released 20.0 1
3 male 20s Seoul released 16.0 1
4 female 20s Seoul released 24.0 1
In [190]:
genders = {"male": 0, "female": 1}
patientModellingDataTreatment['sex'] = patientModellingDataTreatment['sex'].map(genders)

state = {"released": 0, "deceased": 1}
patientModellingDataTreatment['state'] = patientModellingDataTreatment['state'].map(state)

age = {'0s':0, '10s':1, '20s':2, '30s':3, '40s':4, '50s':5, '60s':6, '70s':7, '80s':8, '90s':9}
patientModellingDataTreatment['age'] = patientModellingDataTreatment['age'].map(age)

display(patientModellingDataTreatment.head())
sex age province state Treatment Days confirmed_month
0 0 5.0 Seoul 0 13.0 1
1 0 3.0 Seoul 0 32.0 1
2 0 5.0 Seoul 0 20.0 1
3 0 2.0 Seoul 0 16.0 1
4 1 2.0 Seoul 0 24.0 1
In [191]:
dataForModelling = patientModellingDataTreatment.join(provinceDummy)
dataForModelling = dataForModelling.drop('province',axis=1)
dataForModelling
Out[191]:
sex age state Treatment Days confirmed_month Chungcheongbuk-do Chungcheongnam-do Daegu Daejeon Gangwon-do ... Gyeonggi-do Gyeongsangbuk-do Gyeongsangnam-do Incheon Jeju-do Jeollabuk-do Jeollanam-do Sejong Seoul Ulsan
0 0 5.0 0 13.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0 3.0 0 32.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0 5.0 0 20.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0 2.0 0 16.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 1 2.0 0 24.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5156 0 3.0 0 46.0 4 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5157 1 2.0 0 32.0 4 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5158 1 1.0 0 12.0 4 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5159 1 3.0 0 34.0 5 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5160 1 3.0 0 14.0 5 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0

1643 rows × 21 columns

20.2) Building Machine Learning Models Part 3

In [192]:
clean_dataset(dataForModelling)
display(dataForModelling.count())
print(np.any(np.isnan(dataForModelling)))
print(np.all(np.isfinite(dataForModelling)))

x3 = dataForModelling.drop(["state"], axis=1)
y3 = dataForModelling["state"]
X_train, X_test, Y_train, Y_test = train_test_split(x3, y3, test_size=0.33, random_state=42)

## Random_forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

## Decision Tree
decision_tree = DecisionTreeClassifier() 
decision_tree.fit(X_train, Y_train)  
Y_pred = decision_tree.predict(X_test)  
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

results = pd.DataFrame({
    'Model': ['Random Forest','Decision Tree'],
    'Score': [acc_random_forest, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(2)
sex                  1634
age                  1634
state                1634
Treatment Days       1634
confirmed_month      1634
Chungcheongbuk-do    1634
Chungcheongnam-do    1634
Daegu                1634
Daejeon              1634
Gangwon-do           1634
Gwangju              1634
Gyeonggi-do          1634
Gyeongsangbuk-do     1634
Gyeongsangnam-do     1634
Incheon              1634
Jeju-do              1634
Jeollabuk-do         1634
Jeollanam-do         1634
Sejong               1634
Seoul                1634
Ulsan                1634
dtype: int64
False
True
Out[192]:
Model
Score
99.73 Random Forest
99.73 Decision Tree

20.3) Importances

In [193]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances3 = importances.sort_values('importance',ascending=False).set_index('feature')
importances3.head(15)
Out[193]:
importance
feature
Treatment Days 0.568
age 0.217
Daegu 0.110
confirmed_month 0.027
sex 0.023
Gyeongsangbuk-do 0.017
Gangwon-do 0.009
Seoul 0.007
Ulsan 0.006
Chungcheongnam-do 0.003
Gyeongsangnam-do 0.003
Incheon 0.003
Gyeonggi-do 0.003
Chungcheongbuk-do 0.002
Gwangju 0.000
In [194]:
barData = importances.reset_index()
fig = px.bar(barData, x='feature', y='importance')
fig.update_layout(hovermode='x')
fig.show()

20.4) Confusion Matrix with Precision & Recall & F-Score

In [195]:
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
result3_train = confusion_matrix(Y_train, predictions)
display(result3_train)

p3_train = precision_score(Y_train, predictions)
r3_train = recall_score(Y_train, predictions)
f3_train = f1_score(Y_train, predictions)
print("Precision:", p3_train)
print("Recall:", r3_train)
print("F-Score:",f3_train)
array([[1044,    5],
       [  16,   29]])
Precision: 0.8529411764705882
Recall: 0.6444444444444445
F-Score: 0.7341772151898734
In [196]:
predictions = cross_val_predict(random_forest, X_test, Y_test, cv=3)
result3_test = confusion_matrix(Y_test, predictions)
display(result3_test)

p3_test = precision_score(Y_test, predictions)
r3_test = recall_score(Y_test, predictions)
f3_test = f1_score(Y_test, predictions)
print("Precision:", p3_test)
print("Recall:", r3_test)
print("F-Score:", f3_test)
array([[515,   3],
       [  9,  13]])
Precision: 0.8125
Recall: 0.5909090909090909
F-Score: 0.6842105263157896

20.5) Summary

In [198]:
print('******************************')
print('Model without treatment day: ')
print('importance:', importances2.head(3))
display(result2_train)
display(result2_test)

print('Model with treatment day :')
print('importance:', importances1.head(3))
display(result1_train)
display(result1_test)

print('******************************')
print('Model with treatment day, those who died without treatment and infected month:')
print('importance:', importances3.head(3))
display(result3_train)
display(result3_test)

print('******************************')
print('Model without treatment day: ')
print("Train Data Precision:", p2_train)
print("Train Data Recall:", r2_train)
print("Train Data F-Score:", f2_train)
print("Test Data Precision:", p2_test)
print("Test Data Recall:", r2_test)
print("Test Data F-Score:", f2_test)

print('******************************')
print('Model with treatment day :')
print("Train Data Precision:", p1_train)
print("Train Data Recall:", r1_train)
print("Train Data F-Score:", f1_train)
print("Test Data Precision:", p1_test)
print("Test Data Recall:", r1_test)
print("Test Data F-Score:", f1_test)

print('******************************')
print('Model with treatment day, those who died without treatment and infected month:')
print("Train Data Precision:", p3_train)
print("Train Data Recall:", r3_train)
print("Train Data F-Score:", f3_train)
print("Test Data Precision:", p3_test)
print("Test Data Recall:", r3_test)
print("Test Data F-Score:", f3_test)
******************************
Model without treatment day: 
importance:          importance
feature            
age           0.422
Daegu         0.378
sex           0.070
array([[1047,    2],
       [  34,   11]])
array([[510,   8],
       [ 13,   9]])
Model with treatment day :
importance:                 importance
feature                   
Treatment Days       0.576
age                  0.214
Daegu                0.120
array([[1042,    7],
       [  16,   29]])
array([[513,   5],
       [  9,  13]])
******************************
Model with treatment day, those who died without treatment and infected month:
importance:                 importance
feature                   
Treatment Days       0.568
age                  0.217
Daegu                0.110
array([[1044,    5],
       [  16,   29]])
array([[515,   3],
       [  9,  13]])
******************************
Model without treatment day: 
Train Data Precision: 0.8461538461538461
Train Data Recall: 0.24444444444444444
Train Data F-Score: 0.37931034482758624
Test Data Precision: 0.5294117647058824
Test Data Recall: 0.4090909090909091
Test Data F-Score: 0.46153846153846156
******************************
Model with treatment day :
Train Data Precision: 0.8055555555555556
Train Data Recall: 0.6444444444444445
Train Data F-Score: 0.7160493827160493
Test Data Precision: 0.7222222222222222
Test Data Recall: 0.5909090909090909
Test Data F-Score: 0.65
******************************
Model with treatment day, those who died without treatment and infected month:
Train Data Precision: 0.8529411764705882
Train Data Recall: 0.6444444444444445
Train Data F-Score: 0.7341772151898734
Test Data Precision: 0.8125
Test Data Recall: 0.5909090909090909
Test Data F-Score: 0.6842105263157896

The model with treatment days is better than the model without treatment days even though it has a slightly lower precision on the training data.
The number of survivors of COVID-19 is much higher than the number of deceased.
Hence, it is much harder to predict who will pass away from the infection.

The model without treatment days is unable to predict accurately whether a patient will pass away.
The number of false negatives outweighs the number of true positives.

On the other hand, the model with treatment days is able to predict the deceased much more accurately than the model without it.

With that in mind, this also tells us that the number of days of treatment is vital.

The model accuracy improved further after including the month in which the patient was confirmed and the patients who died without treatment.

21) Steps to take to control the COVID-19 situation even better

Generally, those who passed away from the infection had fewer days of treatment. However, is it right to conclude that the fewer the treatment days, the higher the fatality rate? Is the risk of dying higher during the first few days/weeks of treatment? Or is there something more to it?

21.1) Abraham Wald and the Missing Bullet Holes

This situation is similar to Abraham Wald and the Missing Bullet Holes. @penguinpress summed up the situation perfectly:

You don't want your planes to get shot down by enemy fighters, so you armor them. But armor makes the plane heavier, and heavier planes are less maneuverable and use more fuel. Armoring the planes too much is a problem; armoring the planes too little is a problem. Somewhere in between there's an optimum.

When American planes came back from engagements over Europe, they were covered in bullet holes. But the damage wasn't uniformly distributed across the aircraft. There were more bullet holes in the fuselage, not so many in the engines.

[Image: distribution of bullet holes across returning aircraft]

At first glance, it seems reasonable to focus the armor on the fuselage. However, the armor, said Wald, doesn't go where the bullet holes are. It goes where the bullet holes aren't: on the engines.

The missing bullet holes were on the missing planes. The reason planes were coming back with fewer hits to the engine is that planes that got hit in the engine weren't coming back.

If you go to the recovery room at the hospital, you'll see a lot more people with bullet holes in their legs than people with bullet holes in their chests. But that's not because people don't get shot in the chest; it's because the people who get shot in the chest don't recover.

The link to the excellent article written by @penguinpress is in the credit section below.

21.2) Survivorship Bias

In this case, our model only consists of people who passed away after they had gone through treatment. We have excluded people who died before they even had the opportunity to undergo treatment!

Even after excluding those who did not undergo treatment before they passed away, our model indicates that the number of treatment days is the strongest factor among gender, location and age. Treatment days has a whopping importance of about 60%, followed by age at about 20%.

21.3) Improving the situation

The chances of survival increase as the number of treatment days increases. Hence, the faster the detection, the earlier the treatment, the more days and opportunities there are to treat the infection, and the better the chances of survival. The Korean government put a very strong emphasis on its COVID-19 detection programme and hence managed to contain the COVID-19 situation.

Various cities around the world were also in lockdown to prevent the rapid spread of COVID-19. The healthcare system would be overwhelmed if COVID-19 were to spread widely, which would cause the death rate to escalate as more people would have less opportunity to undergo treatment.

The older the person, the higher the chances of infection and death. Hence, it is not advisable for elderly people to move around unless necessary.

In terms of location, the majority of the cases came from Daegu, and the sources of infection were places such as churches and clubs. The higher the number of contacts, the greater the chances of infection. It is advisable for the authorities to close such areas until a vaccine for COVID-19 is found. From May to June, there was an increase in COVID-19 cases due to complacency.

https://www.aa.com.tr/en/asia-pacific/s-korea-sees-mass-covid-19-cases-linked-to-night-clubs/1838031

https://www.channelnewsasia.com/news/asia/south-korea-covid-19-church-backlash-13092284

22) References

Soh, Z. (2020). 02a.bank-marketing.ipynb [Scholarly project]. Retrieved 2020.

Datartist. (2020, July 13). Data Science for COVID-19 (DS4C). Retrieved September 16, 2020, from https://www.kaggle.com/kimjihoo/coronavirusdataset

Press, P. (2016, July 14). Abraham Wald and the Missing Bullet Holes. Retrieved September 16, 2020, from https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d

Anand. (2019, March 11). Network Graph with AT&T data using Plotly. Retrieved September 16, 2020, from https://medium.com/@anand0427/network-graph-with-at-t-data-using-plotly-a319f9898a02

Donges, N. (2018, May 15). Predicting the Survival of Titanic Passengers. Retrieved September 16, 2020, from https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8

Rostami, D. (2020, March 22). Coronavirus Time Series Map Animation. Retrieved September 16, 2020, from https://datacrayon.com/posts/statistics/data-is-beautiful/coronavirus-time-series-map-animation/

23) Appendix

Case.csv

  • case_id - The ID of the infection case
  • province - Special City / Metropolitan City / Province(-do)
  • city - City(-si) / County (-gun) / District (-gu)
  • group - TRUE: group infection / FALSE: not group
  • infection_case - the infection case (the name of group or other cases)
  • confirmed - the accumulated number of the confirmed
  • latitude - the latitude of the group (WGS84)
  • longitude - the longitude of the group (WGS84)

PatientInfo.csv

  • patient_id - the ID of the patient
  • sex - the sex of the patient
  • age - the age of the patient
  • country - the country of the patient
  • province - the province of the patient
  • city - the city of the patient
  • infection_case - the case of infection
  • infected_by - the ID of who infected the patient
  • contact_number - the number of contacts with people
  • symptom_onset_date - the date of symptom onset
  • confirmed_date - the date on which the patient was confirmed to have the virus
  • released_date - the date which the patient is released
  • deceased_date - the date which the patient passed away
  • state - the state of the patient: isolated, deceased, released

Policy.csv

  • policy_id - the ID of the policy
  • country - the country that implemented the policy
  • type - the type of the policy
  • gov_policy - the policy of the government
  • detail - the detail of the policy
  • start_date - the start date of the policy
  • end_date - the end date of the policy

Region.csv

  • code - the code of the region
  • province - Special City / Metropolitan City / Province(-do)
  • city - City(-si) / County (-gun) / District (-gu)
  • latitude - the latitude of the visit (WGS84)
  • longitude - the longitude of the visit (WGS84)
  • elementary_school_count - the number of elementary schools
  • kindergarten_count - the number of kindergartens
  • university_count - the number of universities
  • academy_ratio - the ratio of academies
  • elderly_population_ratio - the ratio of the elderly population
  • elderly_alone_ratio - the ratio of the elderly who are alone
  • nursing_home_count - the number of nursing homes

SearchTrend.csv

  • date - YYYY-MM-DD
  • cold - the search volume of 'cold' in Korean language
  • flu - the search volume of 'flu' in Korean language
  • pneumonia - the search volume of 'pneumonia' in Korean language

Time.csv

  • date - YYYY-MM-DD
  • time - Time (0 = AM 12:00 / 16 = PM 04:00)
  • test - the accumulated number of tests
  • negative - the accumulated number of negative results
  • confirmed - the accumulated number of positive results
  • released - the accumulated number of releases
  • deceased - the accumulated number of the deceased

TimeAge.csv

  • date - YYYY-MM-DD
  • time - Time
  • age - the age of patients
  • confirmed - the accumulated number of the confirmed
  • deceased - the accumulated number of the deceased

TimeGender.csv

  • date - YYYY-MM-DD
  • time - Time
  • sex - the gender of patients
  • confirmed - the accumulated number of the confirmed
  • deceased - the accumulated number of the deceased

TimeProvince.csv

  • date - YYYY-MM-DD
  • time - Time
  • province - the province of South Korea
  • confirmed - the accumulated number of the confirmed in the province
  • released - the accumulated number of the released in the province
  • deceased - the accumulated number of the deceased in the province

Weather.csv

  • code - the code of the region
  • province - Special City / Metropolitan City / Province(-do)
  • date - YYYY-MM-DD
  • avg_temp - the average temperature
  • min_temp - the lowest temperature
  • max_temp - the highest temperature
  • precipitation - the daily precipitation
  • max_wind_speed - the maximum wind speed
  • most_wind_direction - the most frequent wind direction
  • avg_relative_humidity - the average relative humidity

24) Contribution Statements

Yap Chee Ann (Victor)

  • Build the overall structure of the project. Collaborate and compile project using parts from various teammates
  • Data analysis, visualization and animation using pandas, plotly and networkx
  • Implementation of machine learning models using scikit-learn to predict whether a patient will survive
  • Deploy project to website for easy viewing
  • Parts done: 1-5, 6.1-6.4, 7.1, 8.1-8.5, 9.1-9.2, 10-24

Yuan Yong Xi (Yannis)

  • Conducted time series analysis including stationarity tests for confirmed cases and forecasting with ARIMA, ETS and LSTM models;
  • Tested homogeneous mortality hypotheses across age, gender, and province groups with single-factor ANOVA;
  • Validated distributional visualisations and machine learning mortality predictive models
  • Parts done: 6.5-6.10, 7.2, 8.6, 9.3, 16, 20

Tan Zhi Yang

  • Data analysis using pandas and matplotlib
  • Gave good suggestions on how to approach data analysis and helped the team look out for pitfalls when doing data analysis
  • Parts done: 3.1.2, 3.2.2, 4.1.2, 7.3, 7.4, 8.1-8.2, 9.4, 13.2

You Yu Quan

  • Involved in discussion