Exploratory Data Analysis
Dependencies
To complete our data analysis we’ll be using the tidyverse library. To visualize our data we will be using ggplot2 and dplyr (both part of the tidyverse), along with maps and viridis; the time series plot uses dygraphs, xts and lubridate. To get the weather based on latitude and longitude, we will make an API call using weatherr. To get country-based data such as total population and country codes, we will use WDI, which queries the World Development Indicators DataBank, and the countrycode library.
# Tidyverse
library(tidyverse)
# Graphing Library
library(ggplot2)
library(dplyr)
library(maps)
library(viridis)
# Time Series and Dates
library(dygraphs)
library(xts)
library(lubridate)
# Weather and Country Data
library(weatherr)
library(WDI)
library(countrycode)
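Before loading these libraries it can help to confirm that they are installed. The following is a minimal sketch in base R; `required_pkgs`, `is_installed` and `missing_pkgs` are illustrative names that are not part of the original analysis. The list also includes mapproj and RColorBrewer on the assumption that `coord_map()` and the `RColorBrewer::brewer.pal()` call used later need them.

```r
# Packages named in the library() calls above, plus two assumed indirect dependencies
required_pkgs <- c("tidyverse", "maps", "viridis", "dygraphs", "xts",
                   "lubridate", "weatherr", "WDI", "countrycode",
                   "mapproj", "RColorBrewer")

# requireNamespace() returns FALSE (rather than raising an error) when a package is absent
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Any names reported as missing can then be passed to `install.packages()`, provided they are available from your configured repositories.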
Reading The Data
Let’s first see what our data looks like by printing out the first few rows.
# Reading
covid_19 <- read_csv("03-17-2020.csv")
Parsed with column specification:
cols(
`Province/State` = col_character(),
`Country/Region` = col_character(),
`Last Update` = col_datetime(format = ""),
Confirmed = col_double(),
Deaths = col_double(),
Recovered = col_double(),
Latitude = col_double(),
Longitude = col_double()
)
names(covid_19)[1] <- "Province"
names(covid_19)[2] <- "Country"
head(covid_19)
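The renaming above relies on the first two columns always being the province and the country. A hedged alternative is to rename by matching the old names, so the result does not depend on column order. The sketch below uses a made-up `toy` data frame rather than the real CSV, and `rename_map` is an illustrative name.

```r
# Toy data frame with the same first two column names as the CSV (values are made up)
toy <- data.frame(
  `Province/State` = c("Hubei", NA),
  `Country/Region` = c("China", "Italy"),
  Confirmed = c(10, 5),
  check.names = FALSE
)

# Map old names to new names; columns not listed here are left untouched
rename_map <- c("Province/State" = "Province", "Country/Region" = "Country")
matched <- names(toy) %in% names(rename_map)
names(toy)[matched] <- unname(rename_map[names(toy)[matched]])

names(toy)
```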
By the looks of the data we have the following fields:
1. Province: Region within the country where the virus was reported
2. Country: Country where it was reported
3. Last Update: The time when the total count was last updated in the dataset
4. Confirmed: Total number of confirmed cases of the virus
5. Deaths: Total number of deaths from the virus
6. Recovered: Total number of people who recovered after being infected
7. Latitude: Geographic latitude of the reported location
8. Longitude: Geographic longitude of the reported location
Graphing
Now let’s try to visualize our data to get a broader view of the problem at hand. Firstly we’ll take a look at the total number of confirmed cases.
# Confirmed cases by location; columns are referenced by name inside aes()
world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "grey", alpha = 0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Confirmed, color = Confirmed)) +
  scale_color_viridis(trans = "log") +
  theme_void() + coord_map()

Looking at the graph, we can see that the largest number of cases reported to date is in China. We also see some cases reported at sea; these are most probably ships that were traveling at the time.
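Because the color scale in these maps is log-transformed, locations with a count of zero need some care. The sketch below uses a made-up `confirmed` vector (not the dataset) to show why, along with two possible workarounds. Which one suits the maps is a judgment call.

```r
# Made-up counts, including a location with zero cases
confirmed <- c(0, 1, 10, 1000)

log(confirmed)    # the zero becomes -Inf, which has no finite position on a log scale
log1p(confirmed)  # log(1 + x) keeps the zero finite, since log1p(0) is 0

# Another option is to plot only the rows with at least one case
positive_only <- confirmed[confirmed > 0]
positive_only
```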
# Deaths by location
world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "grey", alpha = 0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Deaths, color = Deaths)) +
  scale_color_viridis(trans = "log") +
  theme_void() + coord_map()

From the graph we can see that the largest numbers of deaths occurred in China and Italy. Compared to the map of confirmed cases, this one is less dense, which is a good sign.
# Recovered cases by location
world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "grey", alpha = 0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Recovered, color = Recovered)) +
  scale_color_viridis(trans = "log") +
  theme_void() + coord_map()

From this graph we see that most of the recovered cases were again in China and Italy, with the number of recovered cases growing in North America.
Data Cleaning
Before we can begin to understand the data and use it effectively in our analysis, we first need to do some data cleaning. We should group by country and get the weather for each one. Population size will also be an important factor in our study, so we will factor that in as well. We will now clean our data accordingly.
# Build a new data frame with a Temperature column, then use each row's latitude and
# longitude to look up its weather through an API call to the weather service
country_data <- data.frame(covid_19)
country_data$Province <- NULL
country_data$Temperature <- NA_real_
for (i in seq_len(nrow(country_data))) {
  weather <- suppressWarnings(locationforecast(lat = country_data$Latitude[i], lon = country_data$Longitude[i]))
  # Average the temperatures returned for this location
  country_data$Temperature[i] <- suppressWarnings(mean(weather$temperature))
}
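A loop that makes one network call per row stops at the first failed request. One hedged way to guard against that is to wrap the lookup so that a failure records a missing value instead. In the sketch below, `safe_mean_temp` is a hypothetical helper and `fake_ok` / `fake_fail` are stand-in functions that let the example run without network access. They are not part of weatherr.

```r
# Hypothetical helper: `fetch` is any function(lat, lon) that returns a data frame
# with a numeric `temperature` column, which is how locationforecast() is used above
safe_mean_temp <- function(fetch, lat, lon) {
  tryCatch(
    mean(fetch(lat, lon)$temperature, na.rm = TRUE),
    error = function(e) NA_real_  # record a missing value instead of stopping the loop
  )
}

# Stand-in fetchers so the sketch runs offline
fake_ok   <- function(lat, lon) data.frame(temperature = c(10, 12, 14))
fake_fail <- function(lat, lon) stop("service unavailable")

safe_mean_temp(fake_ok, 41.9, 12.5)    # mean of the stub temperatures
safe_mean_temp(fake_fail, 41.9, 12.5)  # NA, because the stub raised an error
```

In the loop above, `fetch` could be `function(lat, lon) locationforecast(lat = lat, lon = lon)`. Rows left as `NA` can then be retried or excluded explicitly.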
# Get the weather data
Weather_data = country_data
Weather_data$Latitude = NULL
Weather_data$Longitude = NULL
# Aggregate to one row per country: total counts and mean temperature
Weather_data %>%
  group_by(Country) %>%
  summarise(`Confirmed Cases` = sum(Confirmed),
            `Death Cases` = sum(Deaths),
            `Recovered Cases` = sum(Recovered),
            `Average Temprature` = mean(Temperature)) %>%
  as.data.frame() -> Weather_data_country
country = Weather_data_country$Country
countrycode(country, origin = 'country.name', destination = 'iso2c') -> Country_code
cbind(Country_code ,Weather_data_country) -> Weather_data_country
WDI(country = Weather_data_country$Country_code, indicator="SP.POP.TOTL",start=2018, end=2018) -> population
#
head(Weather_data_country)
merge(Weather_data_country, population, by.x = "Country_code", by.y = "iso2c") -> df
df$country = NULL
df$year = NULL
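# Rename the population column by name so the result does not depend on column position
names(df)[names(df) == "SP.POP.TOTL"] <- "Total Population"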
head(df)
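Note that `merge()` keeps only rows that have a match in both tables by default, so any country whose code is missing or has no population record drops out of `df` silently. The sketch below uses made-up toy tables (`cases`, `pop`) to show how a left join with `all.x = TRUE` can reveal those rows. The `"XX"` code is a placeholder, not a real country code.

```r
# Toy tables (values are made up); "XX" has no population record
cases <- data.frame(Country_code = c("IT", "CN", "XX"),
                    Confirmed = c(100, 200, 5))
pop <- data.frame(iso2c = c("IT", "CN"),
                  SP.POP.TOTL = c(60e6, 1400e6))

# Default merge is an inner join: the unmatched row is dropped
inner <- merge(cases, pop, by.x = "Country_code", by.y = "iso2c")

# all.x = TRUE keeps every row of `cases` and fills missing population with NA
left <- merge(cases, pop, by.x = "Country_code", by.y = "iso2c", all.x = TRUE)

# Rows with no population match
left[is.na(left$SP.POP.TOTL), ]
```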
c_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"
d_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv"
r_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv"
# Reading in the Time Data-Set
confirmed_ts = read_delim(c_url, ",")
death_ts = read_delim(d_url, ",")
recover_ts = read_delim(r_url, ",")
# Confirmed Cases for dates
confirmed_ts %>%
pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Confirmed_Cases") -> confirmed_ts
# Death Cases for dates
death_ts %>%
pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Death_Cases") -> death_ts
# Recovered Cases for dates
recover_ts %>%
pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Recover_Cases") -> recover_ts
confirmed_ts %>%
group_by(Dates) %>%
summarise(Confirmed_Cases = sum(`Confirmed_Cases`))->cs
recover_ts %>%
group_by(Dates) %>%
summarise(Recover_Cases = sum(`Recover_Cases`)) -> rs
death_ts %>%
group_by(Dates) %>%
summarise(Death_Cases = sum(`Death_Cases`)) -> ds
merge(cs, rs, by.x = "Dates", by.y = "Dates") -> dfr
merge(dfr, ds, by.x = "Dates", by.y = "Dates" ) -> TotalCount_ByDate
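At this point `Dates` is still a character column, so any sorting done on it follows string order rather than calendar order. The rows shown by `head()` may therefore not be chronological until the column is parsed. A small base R sketch with a made-up `dates_chr` vector:

```r
dates_chr <- c("1/22/20", "2/1/20", "2/2/20", "2/10/20")

# Character sort (C-locale ordering): "2/10/20" lands before "2/2/20"
sort(dates_chr, method = "radix")

# Parse to Date first, then order chronologically
dates <- as.Date(dates_chr, format = "%m/%d/%y")
dates_chr[order(dates)]
```

The analysis does parse the dates with the same format string before building the time series, so this mainly matters when inspecting the intermediate table.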
Now that we have cleaned our data, let’s answer some questions.
Analysis
a. Which country has the highest number of casualties?
# desc() takes a single column, so each sort key gets its own desc() call
df %>%
  arrange(desc(`Confirmed Cases`), desc(`Death Cases`), desc(`Recovered Cases`)) -> conf_df
head(conf_df,4) -> dfmax
dfmax
As we can see, China ranks highest among all of them, followed by Italy, Iran and Spain.
b. Which country has the lowest number of casualties?
df %>%
group_by(Country) %>%
arrange(`Confirmed Cases`, `Death Cases`, `Recovered Cases`) -> conf_df
head(conf_df,4) -> conf_df
conf_df
As we can see, the Republic of the Congo ranks lowest among all of them, followed by Puerto Rico, the Palestinian territory, and Antigua and Barbuda.
c. What is the general trend of the COVID-19 virus as the days progress?
Let us call the data we cleaned up above and show some of the rows
head(TotalCount_ByDate)
We see one categorical variable called Dates and three quantitative variables called Confirmed Cases, Recover Cases and Death Cases. However, since this is time series data, we thought it would be interesting to try the xts library.
# Parse the dates, then keep only the numeric series in the xts object;
# including the Date column as data would coerce the whole matrix to character
dates <- as.Date(TotalCount_ByDate$Dates, format = "%m/%d/%y")
counts <- data.frame(
  `Confirmed Cases` = TotalCount_ByDate$Confirmed_Cases,
  `Death Cases` = TotalCount_ByDate$Death_Cases,
  `Recovered Cases` = TotalCount_ByDate$Recover_Cases,
  check.names = FALSE
)
don <- xts(x = as.matrix(counts), order.by = dates)
# Finally the plot
p <- dygraph(don) %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors = RColorBrewer::brewer.pal(3, "Set1")) %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
p
From the graph we can see a sharp global increase in the number of cases after March 12. The curve may be following an exponential trend, but it has yet to reach its peak. We also see that the number of confirmed cases is growing at a faster rate than the recovered and death cases.
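One hedged way to check whether a series looks exponential is to fit a straight line to the logarithm of the counts: under exponential growth, log(cases) is linear in time and the slope gives a doubling time. The sketch below uses a synthetic series (100 cases on day 0, doubling every 5 days), not the real data. `day`, `cases` and `doubling_time` are illustrative names.

```r
# Synthetic series: 100 cases on day 0, doubling every 5 days (not real data)
day <- 0:20
cases <- 100 * 2^(day / 5)

# Under exponential growth, log(cases) is linear in time
fit <- lm(log(cases) ~ day)
growth_rate <- unname(coef(fit)["day"])  # per-day slope on the log scale
doubling_time <- log(2) / growth_rate    # days needed for the count to double

doubling_time
```

Applied to the real totals, clear curvature in a plot of log(cases) against date would suggest the growth is not purely exponential over the whole period.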
d. Does Climate have an effect on the COVID-19 virus?
For this analysis we want to see if temperature has any effect on the COVID-19 virus. From question (a) we found that the countries with the highest number of confirmed cases are China and Italy. We start by taking the mean of their two average temperatures as a benchmark.
bench_temp = c(dfmax$`Average Temprature`[1],dfmax$`Average Temprature`[2])
mu = mean(bench_temp)
mu
[1] 11.62157
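Our mu here is about 11.6 degrees Celsius. We treat it as a rough reference point for the conditions in the two hardest-hit countries; two data points alone cannot show that this temperature is ideal for the virus to sustain itself.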
Now let’s split our dataset into 3 categories by number of confirmed cases, as follows:
1. Small: Confirmed cases between 0 and 200
2. Medium: Confirmed cases between 200 and 5000
3. Large: Confirmed cases of 5000 or more
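A compact way to assign such categories is `cut()`, which labels each value by the interval it falls into. The sketch below uses the thresholds 200 and 5000 that appear in the `filter()` calls, applied to a made-up `confirmed` vector. Unlike strict `<` / `>` filters, `right = FALSE` places a value of exactly 200 in the medium group, so the result is a close but not identical alternative.

```r
# Made-up counts chosen to sit on and around the boundaries
confirmed <- c(0, 199, 200, 4999, 5000, 81000)

size_group <- cut(confirmed,
                  breaks = c(-Inf, 200, 5000, Inf),
                  labels = c("small", "medium", "large"),
                  right = FALSE)  # intervals are [lower, upper)

table(size_group)
```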
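# Small: fewer than 200 confirmed cases
df %>%
  filter(`Confirmed Cases` < 200) -> small
head(small)
# Medium: more than 200 and fewer than 5000 (a country with exactly 200 falls in neither group)
df %>%
  filter(`Confirmed Cases` < 5000, `Confirmed Cases` > 200) -> medium
head(medium)
# Large: 5000 or more confirmed cases
df %>%
  filter(`Confirmed Cases` >= 5000) -> large
head(large)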
These are scatter plots of each of our separated datasets.
ggplot(small, aes(y = `Confirmed Cases`, x = `Average Temprature`)) + geom_point()

Small Graph: We can see that the confirmed cases are densest where the temperature is above 10 degrees Celsius. So we can speculate that temperature might play an important role when the number of confirmed cases is between 0 and 200.
ggplot(medium, aes(y = `Confirmed Cases`, x = `Average Temprature`)) + geom_point()

Medium Graph: We can see that most countries in this group have between 500 and 750 confirmed cases. For those countries the temperature is evenly spread out, suggesting that once a certain threshold of people has been affected, temperature may no longer shorten the virus’s lifespan enough to make a difference. We can also see that the countries with between 750 and 5000 cases have a lower average temperature, about 15 degrees Celsius and below.
ggplot(large, aes(y = `Confirmed Cases`, x = `Average Temprature`)) +geom_point()

Large Graph: We can see that almost all of our data points lie below 15 degrees Celsius. Also, only one data point is above 40000 confirmed cases (this is China, from our previous EDA).
Let’s start by checking whether there is any correlation between our variables.
cor.test(small$`Average Temprature`, small$`Confirmed Cases`)
Pearson's product-moment correlation
data: small$`Average Temprature` and small$`Confirmed Cases`
t = -0.96404, df = 110, p-value = 0.3371
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.27246040 0.09564811
sample estimates:
cor
-0.09153218
small.1 <- lm(small$`Average Temprature` ~ small$`Confirmed Cases`, data = small)
summary(small.1)
Call:
lm(formula = small$`Average Temprature` ~ small$`Confirmed Cases`,
data = small)
Residuals:
Min 1Q Median 3Q Max
-37.042 -7.943 3.804 7.070 14.597
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.76523 1.17111 16.877 <2e-16 ***
small$`Confirmed Cases` -0.01802 0.01869 -0.964 0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.776 on 110 degrees of freedom
Multiple R-squared: 0.008378, Adjusted R-squared: -0.0006366
F-statistic: 0.9294 on 1 and 110 DF, p-value: 0.3371
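The correlation test and the one-predictor regression for the small group report the same p-value (0.3371) because they test the same thing. As a worked check, the t statistic and R-squared can be recovered from the reported correlation alone. The numbers below are copied from the output above; `t_stat`, `r_squared` and `p_value` are illustrative names.

```r
r  <- -0.09153218  # sample correlation reported by cor.test()
df <- 110          # degrees of freedom (n - 2)

t_stat    <- r * sqrt(df) / sqrt(1 - r^2)  # t statistic for testing zero correlation
r_squared <- r^2                           # matches Multiple R-squared in a one-predictor lm
p_value   <- 2 * pt(-abs(t_stat), df)      # two-sided p-value

c(t = t_stat, r_squared = r_squared, p = p_value)
```

The same identity holds for the medium and large groups, so their regression p-values can also be read as tests of zero correlation.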
medium.1 <- lm(medium$`Average Temprature` ~ medium$`Confirmed Cases`, data = medium)
summary(medium.1)
Call:
lm(formula = medium$`Average Temprature` ~ medium$`Confirmed Cases`,
data = medium)
Residuals:
Min 1Q Median 3Q Max
-21.451 -5.933 -2.701 6.218 16.041
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.975510 2.853452 4.547 0.000111 ***
medium$`Confirmed Cases` -0.002692 0.003011 -0.894 0.379490
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.01 on 26 degrees of freedom
Multiple R-squared: 0.02983, Adjusted R-squared: -0.007487
F-statistic: 0.7994 on 1 and 26 DF, p-value: 0.3795
large.1 <- lm(large$`Average Temprature` ~ large$`Confirmed Cases`, data = large)
summary(large.1)
Call:
lm(formula = large$`Average Temprature` ~ large$`Confirmed Cases`,
data = large)
Residuals:
Min 1Q Median 3Q Max
-8.5335 -3.2256 -0.7415 1.5434 11.0840
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.250e+01 3.210e+00 3.896 0.00802 **
large$`Confirmed Cases` -9.623e-06 1.001e-04 -0.096 0.92658
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.728 on 6 degrees of freedom
Multiple R-squared: 0.001536, Adjusted R-squared: -0.1649
F-statistic: 0.009233 on 1 and 6 DF, p-value: 0.9266
After running a linear regression on all 3 datasets, we can see that the p-values in all 3 sets are above 0.05. This means that the relationship between confirmed cases and temperature is not statistically significant in any of the 3 sets. The R-squared values are also very low, meaning that the fitted lines explain almost none of the variation, so there is little linear association between temperature and confirmed cases across all 3 datasets.
After looking at our datasets, our graphs and our linear regressions, we cannot confidently say that there is a correlation between temperature and the number of confirmed cases. From the small dataset, it initially seemed that there might be a relation between higher temperatures and confirmed cases, but the medium and large datasets did not support that assumption. This makes sense, as the countries in the small dataset have few confirmed cases, so this set may not hold enough evidence to support proper conclusions. Overall, based on our evidence and findings, we cannot say with confidence that there is a relationship between temperature and the spread of the virus. It is also worth noting that the virus began to spread during the cooler months of the most affected countries, so it is hard to tell whether our conclusions would have been different had this happened at a different time of year.
e. Does Population have an effect on the COVID-19 virus?
Our dataset has multiple fields, but for the purpose of this question we are only concerned with confirmed cases and total population, so we drop everything else.
df %>% select(Country,`Confirmed Cases`, `Total Population`) %>%
arrange(desc(`Total Population`)) -> dfpop
head(dfpop)
ggplot(dfpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()

# Small: population under 1 million
dfpop %>%
  filter(`Total Population` < 1000000) -> smallpop
head(smallpop)
# Medium: population between 1 million and 50 million (exclusive)
dfpop %>%
  filter(`Total Population` > 1000000, `Total Population` < 50000000) -> mediumpop
head(mediumpop)
# Large: population over 50 million
dfpop %>%
  filter(`Total Population` > 50000000) -> largepop
head(largepop)
ggplot(smallpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()

ggplot(mediumpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()

ggplot(largepop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()

In all three groups, we can see that the number of confirmed cases is relatively low with respect to each country’s total population. There are a couple of outliers in each dataset, but this can be expected. The plots show no obvious trend suggesting that a higher population equates to more confirmed cases. We can run a linear regression once again to see if there is a trend that we cannot determine from our graphs.
smallpop.1 <- lm(smallpop$`Confirmed Cases` ~ smallpop$`Total Population`, data = smallpop)
summary(smallpop.1)
Call:
lm(formula = smallpop$`Confirmed Cases` ~ smallpop$`Total Population`,
data = smallpop)
Residuals:
Min 1Q Median 3Q Max
-37.948 -26.928 -23.287 -0.944 189.183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.364e+01 1.972e+01 1.199 0.244
smallpop$`Total Population` 2.029e-05 5.001e-05 0.406 0.689
Residual standard error: 57.5 on 20 degrees of freedom
Multiple R-squared: 0.008158, Adjusted R-squared: -0.04143
F-statistic: 0.1645 on 1 and 20 DF, p-value: 0.6893
mediumpop.1 <- lm(mediumpop$`Confirmed Cases` ~ mediumpop$`Total Population`, data = mediumpop)
summary(mediumpop.1)
Call:
lm(formula = mediumpop$`Confirmed Cases` ~ mediumpop$`Total Population`,
data = mediumpop)
Residuals:
Min 1Q Median 3Q Max
-1062.8 -332.8 -100.0 -24.1 10686.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.053e+00 1.765e+02 0.006 0.9953
mediumpop$`Total Population` 2.269e-05 9.458e-06 2.400 0.0183 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1205 on 97 degrees of freedom
Multiple R-squared: 0.05603, Adjusted R-squared: 0.0463
F-statistic: 5.758 on 1 and 97 DF, p-value: 0.01833
largepop.1 <- lm(largepop$`Confirmed Cases` ~ largepop$`Total Population`, data = largepop)
summary(largepop.1)
Call:
lm(formula = largepop$`Confirmed Cases` ~ largepop$`Total Population`,
data = largepop)
Residuals:
Min 1Q Median 3Q Max
-38122 -3840 -2701 -802 41662
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.215e+02 3.112e+03 0.039 0.96918
largepop$`Total Population` 2.820e-05 7.841e-06 3.596 0.00139 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13660 on 25 degrees of freedom
Multiple R-squared: 0.3409, Adjusted R-squared: 0.3146
F-statistic: 12.93 on 1 and 25 DF, p-value: 0.001386
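Based on our data above, a higher population does not necessarily mean more confirmed cases. The regressions do show a statistically significant positive slope for the medium and large population groups (p = 0.018 and p = 0.0014), but the R-squared values (0.056 and 0.34) indicate that population explains only part of the variation, and the slope for the small group is not significant (p = 0.689). Our dfpop data frame illustrates this: India has a total of 142 confirmed cases with a total population of 1352617328, whereas Italy has a total of 31506 confirmed cases with a total population of 60431283.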
Despite having less than a twentieth of India’s population, Italy has tens of thousands more confirmed cases. A number of factors contribute significantly to the number of confirmed cases, but having more people in a country does not always mean more people have the virus. The virus is more likely to spread quickly if precautions are not taken by the government and the public, as was the case in Italy. However, individuals can take many actions to slow the spread, such as social distancing, self-quarantining and practicing good hygiene and cleanliness. A high population doesn’t always mean a high number of confirmed cases, but it does mean more individuals are at risk of catching the virus.
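The India and Italy comparison can be made concrete by expressing confirmed cases per million inhabitants. The sketch below uses only the four figures quoted above; `per_million` is an illustrative name.

```r
# Figures quoted in the text above
confirmed  <- c(India = 142, Italy = 31506)
population <- c(India = 1352617328, Italy = 60431283)

# Confirmed cases per one million inhabitants
per_million <- confirmed / population * 1e6
round(per_million, 2)
```

On these figures Italy has several hundred confirmed cases per million people while India has fewer than one, which is why raw population size alone says little about case counts.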
Conclusion
It seems that the global fear of COVID-19 and its rapid increase stems from the same problem we encountered in this project: it is difficult to contain an unknown disease when we cannot predict how well it thrives in its environment. Our analysis could not help us conclude whether temperature played an effective role in controlling the growth of the virus, with the highest number of cases being in China and the lowest in the Republic of the Congo. Although there seemed to be signs of the virus being less effective in hotter climates when looking at the countries with the fewest casualties, as the number of victims grew exponentially, temperature’s effect on the lifespan of the virus possibly became insignificant. And since this growth happened during the colder months of the most affected countries, it is even more difficult to draw any strong conclusions. It is possible that at a different time of year the scenario could have been completely different, and it may still be too early to form an opinion. As we’ve seen from our time series graph, we have yet to reach a worldwide peak. Similarly, we could not conclude that the population size of a country played a clear part in the spread. There may be other factors that come into play, such as a country’s age distribution and travel restrictions, that could explain its current state. From recent events we do know that, in general, the more people are traveling, the more the virus is able to spread by contact, with something of a domino effect. So our solution has been to go into quarantine regardless of the level of cases in the area. Hopefully we can soon expect to gather enough data to know how to reduce casualties arising from COVID-19.
