Author

Abhinav Chaudhary

Preface

On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus responsible was not a known one, and this raised immediate concern for people’s safety, especially given the rate at which it was spreading across the globe. With its symptoms and treatments still not well understood, daily-level information on the affected people can give some interesting insights. We would like to thank Johns Hopkins University for making the data available for educational and academic research purposes.

Introduction

The 2019 Novel Coronavirus is a virus identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. At this time, it is unclear how easily or sustainably this virus spreads between people. This dataset has daily-level information on the number of confirmed cases, deaths and recoveries from the 2019 novel coronavirus, the name now shortened to COVID-19. For the purposes of this project, we will only be collecting data up to 17 March 2020.

Our target for this project is to study the effects of geography and population on the rate at which the virus is able to spread. Studying the geography of each case will help us analyze whether temperature plays a part in how well the virus is able to survive, and so how fast it is able to multiply. We will also look at the population of the affected countries and see whether the number of cases has a linear relationship with the population size itself. Most importantly, this project will help us better understand and prepare for the pandemic, and possibly help us react appropriately to each level of severity, i.e. decide what the appropriate safety measures are.

Exploratory Data Analysis

Dependencies

To complete our data analysis we’ll be using the tidyverse library. To visualize our data we will be using ggplot2, dplyr, maps and viridis, and for the time-series plot we also load dygraphs, xts and lubridate. To get the weather based on latitude and longitude, we will be making an API call using weatherr. To get country-level data such as population and country codes, we will be using WDI, which is the World Development Indicators DataBank, and the countrycode library.

# Tidyverse
library(tidyverse)
# Graphing Library
library(ggplot2)
library(dplyr)
library(maps)
library(viridis)
#------------------
library(dygraphs)
library(xts)
library(tidyverse)
library(lubridate)
# Weather and Country Data
library(weatherr)
library(WDI)
library(countrycode)

Reading The Data

Let’s first go ahead and see what our data looks like by printing out the first few rows.

# Reading
covid_19 <- read_csv("03-17-2020.csv")
Parsed with column specification:
cols(
  `Province/State` = col_character(),
  `Country/Region` = col_character(),
  `Last Update` = col_datetime(format = ""),
  Confirmed = col_double(),
  Deaths = col_double(),
  Recovered = col_double(),
  Latitude = col_double(),
  Longitude = col_double()
)
names(covid_19)[1] <- "Province"
names(covid_19)[2] <- "Country"
head(covid_19)

By the looks of the data we have the following fields:

1. Province: Region within the country where the cases were reported

2. Country: Country where the cases were reported

3. Last Update: The time at which the counts were last updated in the dataset

4. Confirmed: Total number of confirmed cases

5. Deaths: Total number of deaths from the virus

6. Recovered: Total number of people who recovered after being infected

7. Latitude: Geographic latitude of the reported location

8. Longitude: Geographic longitude of the reported location

Graphing

Now let’s try to visualize our data to get a broader view of the problem at hand. Firstly we’ll take a look at the total number of confirmed cases.

world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Confirmed, color = Confirmed)) +
  scale_color_viridis(trans = "log") +
  theme_void() + coord_map()

Looking at the graph, we can see that the highest number of cases reported to date is in China. We also see some cases reported at sea; these are most probably ships that were traveling at the time.

world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Deaths, color = Deaths)) +
  scale_color_viridis(trans="log") +
  theme_void()+ coord_map() 

From the graph we can see that the highest numbers of deaths occurred in China and Italy. Compared to the map of confirmed cases, this graph is less dense, which is a good sign.

world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Recovered, color = Recovered)) +
  scale_color_viridis(trans="log") +
  theme_void()+ coord_map() 

From this graph we see that most of the recovered cases were again in China and Italy, with the number of recovered cases growing in North America.

Data Cleaning

Before we can begin to understand the data and use it effectively in our analysis, we first need to do some data cleaning. We will group the data by country and get the weather for each location. Population size will also be an important factor in our study, so we will factor that in as well. We will now clean our data accordingly.


# Make a new data frame with a column called Temperature, then use the latitude and longitude of each row to get its weather through a weather API call
data.frame(covid_19) -> country_data
country_data$Province = NULL
country_data$Temperature = c(0)

n = nrow(country_data)
for (i in 1:n){
  weather = suppressWarnings( locationforecast(lat=country_data$Latitude[i], lon=country_data$Longitude[i]) )
  temp = suppressWarnings(mean(weather$temperature))
  country_data$Temperature[i] = temp
}

# Get the weather data
Weather_data  = country_data
Weather_data$Latitude = NULL
Weather_data$Longitude = NULL
Weather_data %>%
  group_by(Country) %>%
  summarise(Confirmed_Cases = sum(Confirmed)) -> con
Weather_data %>%
  group_by(Country) %>%
  summarise(Death_Cases = sum(Deaths)) -> dea
Weather_data %>%
  group_by(Country) %>%
  summarise(Recovered_Cases = sum(Recovered)) -> rec
Weather_data %>%
  group_by(Country) %>%
  summarise(Mean_Temprature = mean(Temperature)) -> tem

Weather_data_country = data.frame(con$Country, con$Confirmed_Cases, dea$Death_Cases, rec$Recovered_Cases, tem$Mean_Temprature)
colnames(Weather_data_country) <- c("Country", "Confirmed Cases", "Death Cases", "Recovered Cases", "Average Temprature")
country = Weather_data_country$Country
countrycode(country, origin = 'country.name', destination = 'iso2c') -> Country_code
cbind(Country_code ,Weather_data_country) -> Weather_data_country
WDI(country = Weather_data_country$Country_code, indicator="SP.POP.TOTL",start=2018, end=2018) -> population
#
head(Weather_data_country)
merge(Weather_data_country, population, by.x = "Country_code", by.y = "iso2c") -> df
df$country = NULL
df$year = NULL
colnames(df)[7] = "Total Population"
head(df)
c_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"
d_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv"
r_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv"

# Reading in the Time Data-Set
confirmed_ts = read_delim(c_url, ",")
death_ts = read_delim(d_url, ",")
recover_ts = read_delim(r_url, ",")
# Confirmed Cases for dates
confirmed_ts %>%
  pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Confirmed_Cases") -> confirmed_ts

# Death Cases for dates
death_ts %>%
  pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Death_Cases") -> death_ts

# Recovered Cases for dates
recover_ts %>%
  pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Recover_Cases") -> recover_ts
confirmed_ts %>%
  group_by(Dates) %>%
  summarise(Confirmed_Cases = sum(`Confirmed_Cases`))->cs

recover_ts %>%
  group_by(Dates) %>%
  summarise(Recover_Cases = sum(`Recover_Cases`)) -> rs

death_ts %>%
  group_by(Dates) %>%
  summarise(Death_Cases = sum(`Death_Cases`)) -> ds
merge(cs, rs, by.x = "Dates", by.y = "Dates") -> dfr
merge(dfr, ds, by.x = "Dates", by.y = "Dates" ) -> TotalCount_ByDate

Now that we have cleaned our data, let’s answer some questions.

Analysis

a. Which country has the highest number of casualties?

df %>%
  group_by(Country) %>%
  arrange(desc(`Confirmed Cases`), desc(`Death Cases`), desc(`Recovered Cases`)) -> conf_df
head(conf_df,4) -> dfmax
dfmax

As we can see, China ranks highest among all of them, followed by Italy, Iran and Spain.

b. Which country has the lowest number of casualties?

df %>%
  group_by(Country) %>%
  arrange(`Confirmed Cases`, `Death Cases`, `Recovered Cases`) -> conf_df
head(conf_df,4) -> conf_df
conf_df

As we can see, Republic of the Congo ranks lowest among all of them, followed by Puerto Rico, Palestinian territory and Antigua and Barbuda.

c. What is the general trend of the COVID-19 virus as the days progress?

Let us recall the data we prepared above and show some of its rows.

head(TotalCount_ByDate)

We see one categorical variable called Dates and three quantitative variables called Confirmed_Cases, Recover_Cases and Death_Cases. Since this is time-series data, we thought it would be interesting to try using the xts library.

as.Date(TotalCount_ByDate$Dates,format="%m/%d/%y") -> o

data.frame(o, TotalCount_ByDate$Confirmed_Cases, TotalCount_ByDate$Death_Cases, TotalCount_ByDate$Recover_Cases) -> data
don <- xts(x = data, order.by = data$o)
names(don)[2] = "Confirmed Cases"
names(don)[3] = "Death Cases"
names(don)[4] = "Recovered Cases"
# Finally the plot
p <- dygraph(don) %>%
  dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors = RColorBrewer::brewer.pal(3, "Set1")) %>%
  dyRangeSelector() %>%
  dyCrosshair(direction = "vertical") %>%
  dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE)  %>%
  dyRoller(rollPeriod = 1)
p

From the graph we can see that there has been a sharp global increase in the number of cases after March 12. The curve may be following an exponential trend but has yet to hit the surge. We also see that the number of confirmed cases is growing at a faster rate than the numbers of recovered cases and deaths.
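One way to check the exponential reading is to fit a straight line to the logarithm of the counts: if cases grow exponentially, log(cases) is linear in time and the slope is the daily growth rate. The sketch below uses synthetic counts, not the Johns Hopkins data, and the names day, cases and doubling_time are illustrative assumptions.

```r
# Synthetic cumulative counts growing by about 20% per day (not real data)
day   <- 0:13
cases <- round(100 * 1.2^day)

# Exponential growth is a straight line on the log scale
fit <- lm(log(cases) ~ day)

# Slope = daily growth rate on the log scale
growth_rate <- unname(coef(fit)["day"])

# Days needed for the count to double at that rate
doubling_time <- log(2) / growth_rate

growth_rate
doubling_time
```

Applied to the real series, a roughly straight line on the log scale would support the exponential reading, while a curve that flattens would suggest the growth is slowing.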

d. Does Climate have an effect on the COVID-19 virus?

For the purpose of this analysis we want to see if temperature has any effect on the COVID-19 virus. From question (a) we found that the countries with the highest numbers of confirmed cases are China and Italy. We start by taking the mean of their two average temperatures as a benchmark.

bench_temp = c(dfmax$`Average Temprature`[1],dfmax$`Average Temprature`[2])
mu = mean(bench_temp)
mu
[1] 11.62157

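Our mu here is about 11.6 degrees Celsius. We take this as a rough benchmark for the temperature at which the virus has spread most readily so far; two countries are not enough to show that this temperature is ideal for the virus.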

Now let’s split our dataset into 3 categories by number of confirmed cases:

1. Small: Confirmed cases between 0 and 200

2. Medium: Confirmed cases between 200 and 5000

3. Large: Confirmed cases of 5000 or more

df %>%
  filter(df$`Confirmed Cases`  < 200) -> small

head(small)
df %>%
  filter(df$`Confirmed Cases`  <5000 ,df$`Confirmed Cases`  >200 ) -> medium

head(medium)
df %>%
  filter(df$`Confirmed Cases`  >= 5000) -> large

head(large)
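The same three-way split can also be expressed as a single labelled column with cut(), which avoids keeping three separate data frames in sync. This is a minimal sketch on made-up counts; the vector confirmed and the label names are assumptions for illustration. Note that with right = FALSE a country with exactly 200 cases lands in the medium group, whereas the filters above leave it out of every group.

```r
# Hypothetical confirmed-case counts, including the boundary values
confirmed <- c(12, 199, 200, 750, 4999, 5000, 81000)

# Bins are [0, 200), [200, 5000) and [5000, Inf)
size <- cut(
  confirmed,
  breaks = c(0, 200, 5000, Inf),
  labels = c("small", "medium", "large"),
  right  = FALSE
)

# Count how many values fall in each group
table(size)
```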

Below are scatter plots for each of the three subsets.

ggplot(small, aes(y = `Confirmed Cases`, x = `Average Temprature`)) + geom_point()

Small Graph: The confirmed cases are most densely clustered where the temperature is above 10 degrees Celsius, so we can speculate that temperature might play an important role when the number of confirmed cases is between 0 and 200.

ggplot(medium, aes(y = `Confirmed Cases`, x = `Average Temprature`)) + geom_point()

Medium Graph: Most countries in this group have between 500 and 750 confirmed cases. For those countries the temperatures are spread evenly, which suggests that once a certain number of people have been infected, temperature may no longer shorten the virus’s lifespan enough to make a difference. Countries with between 750 and 5000 cases tend to have lower average temperatures, about 15 degrees Celsius and below.

ggplot(large, aes(y = `Confirmed Cases`, x = `Average Temprature`)) +geom_point()

Large Graph: Almost all of our data points lie below 15 degrees Celsius. There is also only one data point above 40000 confirmed cases (China, as seen in our earlier EDA).

Let’s start by checking whether there is any correlation between our variables.

cor.test(small$`Average Temprature`, small$`Confirmed Cases`)

    Pearson's product-moment correlation

data:  small$`Average Temprature` and small$`Confirmed Cases`
t = -0.96404, df = 110, p-value = 0.3371
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.27246040  0.09564811
sample estimates:
        cor
-0.09153218 
small.1 <- lm(small$`Average Temprature` ~ small$`Confirmed Cases`, data = small)
summary(small.1)

Call:
lm(formula = small$`Average Temprature` ~ small$`Confirmed Cases`,
    data = small)

Residuals:
    Min      1Q  Median      3Q     Max
-37.042  -7.943   3.804   7.070  14.597

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             19.76523    1.17111  16.877   <2e-16 ***
small$`Confirmed Cases` -0.01802    0.01869  -0.964    0.337
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.776 on 110 degrees of freedom
Multiple R-squared:  0.008378,  Adjusted R-squared:  -0.0006366
F-statistic: 0.9294 on 1 and 110 DF,  p-value: 0.3371
medium.1 <- lm(medium$`Average Temprature` ~ medium$`Confirmed Cases`, data = medium)
summary(medium.1)

Call:
lm(formula = medium$`Average Temprature` ~ medium$`Confirmed Cases`,
    data = medium)

Residuals:
    Min      1Q  Median      3Q     Max
-21.451  -5.933  -2.701   6.218  16.041

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              12.975510   2.853452   4.547 0.000111 ***
medium$`Confirmed Cases` -0.002692   0.003011  -0.894 0.379490
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.01 on 26 degrees of freedom
Multiple R-squared:  0.02983,   Adjusted R-squared:  -0.007487
F-statistic: 0.7994 on 1 and 26 DF,  p-value: 0.3795
large.1 <- lm(large$`Average Temprature` ~ large$`Confirmed Cases`, data = large)
summary(large.1)

Call:
lm(formula = large$`Average Temprature` ~ large$`Confirmed Cases`,
    data = large)

Residuals:
    Min      1Q  Median      3Q     Max
-8.5335 -3.2256 -0.7415  1.5434 11.0840

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              1.250e+01  3.210e+00   3.896  0.00802 **
large$`Confirmed Cases` -9.623e-06  1.001e-04  -0.096  0.92658
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.728 on 6 degrees of freedom
Multiple R-squared:  0.001536,  Adjusted R-squared:  -0.1649
F-statistic: 0.009233 on 1 and 6 DF,  p-value: 0.9266

After running a linear regression on all 3 datasets, we can see that the p-value for the slope is above 0.05 in every case (0.34, 0.38 and 0.93), so the linear relationship between temperature and Confirmed Cases is not statistically significant in any of the 3 sets. The R-squared values are also very low (0.008, 0.030 and 0.002), meaning that the linear fit explains almost none of the variation.

After looking at our datasets, our graphs and the linear regressions, we cannot confidently say that there is a correlation between temperature and the number of confirmed cases. From the small dataset, it may initially have seemed that there is a relation between higher temperatures and confirmed cases, but the medium and large datasets did not support that assumption. This makes sense: the countries in the small dataset have few confirmed cases, so that set may not contain enough evidence to support proper conclusions. Overall, based on our evidence and findings, we cannot say with confidence that there is a relationship between temperature and the spread of the virus. Something to note is that this virus began to spread during the cooler months of the most affected countries, so it is hard to tell whether our conclusions would have been different had this happened at a different time of year.
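The p-values and R-squared values quoted here can be read off the printed summaries, but they can also be pulled out of the fitted model programmatically. The sketch below does this on simulated data in which the two variables are unrelated by construction. The data frame toy and its columns are hypothetical, and unlike the models above it puts the case count on the left-hand side and refers to columns through the data argument, which is one common way to write the formula.

```r
# Simulated data: temperature and case counts are independent by construction
set.seed(1)
toy <- data.frame(temperature = runif(30, min = 0, max = 30))
toy$confirmed <- rpois(30, lambda = 50)

# Simple linear regression of cases on temperature
fit <- lm(confirmed ~ temperature, data = toy)
s   <- summary(fit)

# p-value of the slope and the share of variance explained
slope_p   <- coef(s)["temperature", "Pr(>|t|)"]
r_squared <- s$r.squared

slope_p
r_squared
```

For a regression with a single predictor, the R-squared value equals the squared Pearson correlation, so cor.test() and lm() are reporting the same relationship in two forms.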

e. Does Population have an effect on the COVID-19 virus?

Our dataset has multiple fields, but for the purpose of this question we are only concerned with confirmed cases and total population, so we drop everything else.

df %>% select(Country,`Confirmed Cases`, `Total Population`) %>%
  arrange(desc(`Total Population`)) -> dfpop
head(dfpop)
ggplot(dfpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()

dfpop %>%
  filter(dfpop$`Total Population`  < 1000000) -> smallpop
head(smallpop)
dfpop %>%
  filter(dfpop$`Total Population`  > 1000000, dfpop$`Total Population` < 50000000 ) -> mediumpop

head(mediumpop)
dfpop %>%
  filter(dfpop$`Total Population`  > 50000000 ) -> largepop
head(largepop)
ggplot(smallpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()

ggplot(mediumpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()

ggplot(largepop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()

From all three groups, we can see that for most countries the number of confirmed cases is low relative to the country’s total population. There are a couple of outliers in each group, but this can be expected. No clear trend showing that a higher population equates to more confirmed cases is visible in the plots. We can run a linear regression once again to see if there is a trend that we cannot determine from our graphs.

smallpop.1 <- lm(smallpop$`Confirmed Cases` ~ smallpop$`Total Population`, data = smallpop)
summary(smallpop.1)

Call:
lm(formula = smallpop$`Confirmed Cases` ~ smallpop$`Total Population`,
    data = smallpop)

Residuals:
    Min      1Q  Median      3Q     Max
-37.948 -26.928 -23.287  -0.944 189.183

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.364e+01  1.972e+01   1.199    0.244
smallpop$`Total Population` 2.029e-05  5.001e-05   0.406    0.689

Residual standard error: 57.5 on 20 degrees of freedom
Multiple R-squared:  0.008158,  Adjusted R-squared:  -0.04143
F-statistic: 0.1645 on 1 and 20 DF,  p-value: 0.6893
mediumpop.1 <- lm(mediumpop$`Confirmed Cases` ~ mediumpop$`Total Population`, data = mediumpop)
summary(mediumpop.1)

Call:
lm(formula = mediumpop$`Confirmed Cases` ~ mediumpop$`Total Population`,
    data = mediumpop)

Residuals:
    Min      1Q  Median      3Q     Max
-1062.8  -332.8  -100.0   -24.1 10686.6

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  1.053e+00  1.765e+02   0.006   0.9953
mediumpop$`Total Population` 2.269e-05  9.458e-06   2.400   0.0183 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1205 on 97 degrees of freedom
Multiple R-squared:  0.05603,   Adjusted R-squared:  0.0463
F-statistic: 5.758 on 1 and 97 DF,  p-value: 0.01833
largepop.1 <- lm(largepop$`Confirmed Cases` ~ largepop$`Total Population`, data = largepop)
summary(largepop.1)

Call:
lm(formula = largepop$`Confirmed Cases` ~ largepop$`Total Population`,
    data = largepop)

Residuals:
   Min     1Q Median     3Q    Max
-38122  -3840  -2701   -802  41662

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 1.215e+02  3.112e+03   0.039  0.96918
largepop$`Total Population` 2.820e-05  7.841e-06   3.596  0.00139 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13660 on 25 degrees of freedom
Multiple R-squared:  0.3409,    Adjusted R-squared:  0.3146
F-statistic: 12.93 on 1 and 25 DF,  p-value: 0.001386

Based on the regressions above, population alone is a weak predictor of confirmed cases. The slope is not significant for the small-population group (p = 0.69). It is significant for the medium and large groups (p = 0.018 and p = 0.0014), but the R-squared values (0.056 and 0.34) show that population explains only a small to moderate share of the variation. We therefore conclude that a higher population does not necessarily mean more confirmed cases. This can be seen clearly in our dfpop dataframe: India has a total of 142 confirmed cases with a total population of 1352617328, whereas Italy has a total of 31506 confirmed cases with a total population of 60431283.

Despite having less than a twentieth of India’s population, Italy has tens of thousands more confirmed cases. A number of factors contribute significantly to the number of confirmed cases, but having more people in a country does not always mean that more people have the virus. There is an increased likelihood of the virus spreading faster if precautions are not taken by the government and the people, as is the case in Italy. However, there are many actions individuals can take to slow the spread of the virus, such as social distancing, self-quarantining, and practicing good hygiene and cleanliness. A high population doesn’t always mean a high number of confirmed cases, but it does mean that more individuals are at risk of catching the virus.
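The comparison between India and Italy becomes more concrete when the counts are expressed per million inhabitants. The sketch below uses the two pairs of figures quoted above; the data frame countries and the column cases_per_million are names introduced here for illustration.

```r
# Confirmed cases and total population as quoted from dfpop above
countries <- data.frame(
  Country    = c("India", "Italy"),
  confirmed  = c(142, 31506),
  population = c(1352617328, 60431283)
)

# Normalise by population so countries of different size are comparable
countries$cases_per_million <- countries$confirmed / countries$population * 1e6

countries
```

On these figures India is at roughly 0.1 cases per million and Italy at a little over 520, which shows how little the raw population size says about the number of cases.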

Conclusion

It seems that the global fear of COVID-19 and its rapid increase stems from the same problem we ran into during this project: it is difficult to contain an unknown disease when we cannot predict how well it thrives in its environment. Our analysis could not help us conclude whether temperature played an effective role in controlling the growth of the virus, with the highest number of cases being in China and the lowest in the Republic of the Congo. Although there seemed to be signs of the virus being less effective in hotter climates when looking at the countries with the fewest casualties, as the number of victims grew exponentially, temperature’s effect on the lifespan of the virus possibly became insignificant. And since this growth happened during the colder months of the most affected countries, it is even more difficult to draw any strong conclusions. It is possible that at a different time of year the scenario would have been completely different, and it may still be too early to form an opinion. As we’ve seen from our time-series graph, we have yet to reach the worldwide peak.

Similarly, we could not conclude that the population size of a country played a part in the spread. There may be other factors that come into play, such as a country’s age distribution, travel restrictions etc., that could explain its current state. From recent events we do know that, in general, the more people are traveling, the more the virus is able to spread by contact, in something of a domino effect. So our solution has been to go into quarantine regardless of the number of cases in the area. Hopefully we can soon gather enough data to know how to reduce casualties arising from COVID-19.

---
title: 'Data Analysis: COVID-19 Virus'
output:
  html_notebook: default
  html_document:
    df_print: paged
---
# Authors
Abhinav Chaudhary : 1002707733

Adham Farag : 1003788190

Ethan Anada : 1003171907

Hamish Rajiv :

Lyla Tamim : 1003459465


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Preface


**On 31 December 2019**, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus responsible was not a known one, and this raised immediate concern for people's safety, especially given the rate at which it was spreading across the globe. With its symptoms and treatments still not well understood, daily-level information on the affected people can give some interesting insights. We would like to thank Johns Hopkins University for making the data available for educational and academic research purposes.


# Introduction
**The 2019 Novel Coronavirus** is a virus identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. At this time, it is unclear how easily or sustainably this virus spreads between people. This dataset has daily-level information on the number of confirmed cases, deaths and recoveries from the **2019 novel coronavirus**, the name now shortened to COVID-19. For the purposes of this project, we will only be collecting data up to **17 March 2020.**

Our target for this project is to study the effects of geography and population on the rate at which the virus is able to spread. Studying the geography of each case will help us analyze whether temperature plays a part in how well the virus is able to survive, and so how fast it is able to multiply. We will also look at the population of the affected countries and see whether the number of cases has a linear relationship with the population size itself. Most importantly, this project will help us better understand and prepare for the pandemic, and possibly help us react appropriately to each level of severity, i.e. decide what the appropriate safety measures are.


# Exploratory Data Analysis


### Dependencies

To complete our data analysis we'll be using the **tidyverse** library. To visualize our data we will be using **ggplot2, dplyr, maps and viridis**, and for the time-series plot we also load **dygraphs, xts and lubridate**. To get the weather based on latitude and longitude, we will be making an API call using **weatherr**. To get country-level data such as population and country codes, we will be using **WDI**, which is the *World Development Indicators* DataBank, and the **countrycode** library.
```{r}
# Tidyverse
library(tidyverse)
# Graphing Library
library(ggplot2)
library(dplyr)
library(maps)
library(viridis)
#------------------
library(dygraphs)
library(xts)         
library(tidyverse)
library(lubridate)
# Weather and Country Data
library(weatherr)
library(WDI)
library(countrycode)

```

### Reading The Data 
Let's first go ahead and see what our data looks like by printing out the first few rows.

```{r}
# Reading
covid_19 <- read_csv("03-17-2020.csv")

```

```{r}
names(covid_19)[1] <- "Province"
names(covid_19)[2] <- "Country"
head(covid_19)
```

By the looks of the data we have the following fields:


**1. Province:** Region within the country where the cases were reported

**2. Country:** Country where the cases were reported

**3. Last Update:** The time at which the counts were last updated in the dataset

**4. Confirmed:** Total number of confirmed cases

**5. Deaths:** Total number of deaths from the virus

**6. Recovered:** Total number of people who recovered after being infected

**7. Latitude:** Geographic latitude of the reported location

**8. Longitude:** Geographic longitude of the reported location

### Graphing 
Now let's try to visualize our data to get a broader view of the problem at hand. Firstly we'll take a look at the total number of **confirmed cases**.

```{r}
world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Confirmed, color = Confirmed)) +
  scale_color_viridis(trans = "log") +
  theme_void() + coord_map()
```
Looking at the graph, we can see that the highest number of cases reported to date is in China. We also see some cases reported at sea; these are most probably ships that were traveling at the time.




```{r}
world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Deaths, color = Deaths)) +
  scale_color_viridis(trans="log") +
  theme_void()+ coord_map() 
```
From the graph we can see that the highest numbers of deaths occurred in China and Italy. Compared to the map of confirmed cases, this graph is less dense, which is a good sign.
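One caveat when reading this map: the colour scale uses `trans="log"`, and any location with zero deaths maps to `log(0)`, which is `-Inf` and cannot be placed on the scale. The sketch below shows the effect on a made-up vector (`deaths` is hypothetical). Applying `log1p` before plotting is one possible workaround, not what the plots in this notebook do.

```r
# Hypothetical death counts, including locations with no deaths
deaths <- c(0, 0, 3, 27, 3111)

# Plain log: the zeros become -Inf
log_deaths <- log(deaths)

# log1p computes log(1 + x): the zeros stay at 0 and remain plottable
log1p_deaths <- log1p(deaths)

is.finite(log_deaths)
is.finite(log1p_deaths)
```

This may be part of the reason the deaths and recoveries maps look sparser than the confirmed-cases map.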



```{r}
world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data = covid_19, aes(x = Longitude, y = Latitude, size = Recovered, color = Recovered)) +
  scale_color_viridis(trans="log") +
  theme_void()+ coord_map() 
```
From this graph we see that most of the recovered cases were again in China and Italy, with the number of recovered cases growing in North America.


## Data Cleaning 
Before we can begin to understand the data and use it effectively in our analysis, we first need to do some data cleaning. We will group the data by country and get the weather for each location. Population size will also be an important factor in our study, so we will factor that in as well. We will now clean our data accordingly.

```{r}

# Make a new data frame with a column called Temperature, then use the latitude and longitude of each row to get its weather through a weather API call
data.frame(covid_19) -> country_data
country_data$Province = NULL
country_data$Temperature = c(0)

n = nrow(country_data)
for (i in 1:n){
  weather = suppressWarnings( locationforecast(lat=country_data$Latitude[i], lon=country_data$Longitude[i]) )
  temp = suppressWarnings(mean(weather$temperature))
  country_data$Temperature[i] = temp
}

```
```{r}

# Get the weather data 
Weather_data  = country_data
Weather_data$Latitude = NULL
Weather_data$Longitude = NULL
Weather_data %>% 
  group_by(Country) %>% 
  summarise(Confirmed_Cases = sum(Confirmed)) -> con
Weather_data %>% 
  group_by(Country) %>% 
  summarise(Death_Cases = sum(Deaths)) -> dea
Weather_data %>% 
  group_by(Country) %>% 
  summarise(Recovered_Cases = sum(Recovered)) -> rec
Weather_data %>% 
  group_by(Country) %>% 
  summarise(Mean_Temprature = mean(Temperature)) -> tem

Weather_data_country = data.frame(con$Country, con$Confirmed_Cases, dea$Death_Cases, rec$Recovered_Cases, tem$Mean_Temprature)
colnames(Weather_data_country) <- c("Country", "Confirmed Cases", "Death Cases", "Recovered Cases", "Average Temprature")

```
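The chunk above builds four separate summaries and glues their columns together with `data.frame()`, which works only as long as all four come back in the same row order. A minimal sketch of an alternative that joins on the country key instead, written in base R on a made-up table (`toy` and its values are hypothetical):

```r
# Hypothetical province-level rows (values are made up)
toy <- data.frame(
  Country     = c("A", "A", "B"),
  Confirmed   = c(10, 5, 7),
  Deaths      = c(1, 0, 2),
  Recovered   = c(3, 1, 0),
  Temperature = c(12, 8, 20)
)

# Sum the three counts per country in one call
totals <- aggregate(cbind(Confirmed, Deaths, Recovered) ~ Country, data = toy, FUN = sum)

# Average the temperature per country
temps <- aggregate(Temperature ~ Country, data = toy, FUN = mean)

# Join on the key rather than relying on row order
by_country <- merge(totals, temps, by = "Country")
by_country
```

With dplyr, the same result can be had from a single `summarise()` call that lists all four expressions after one `group_by(Country)`.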
```{r}
country = Weather_data_country$Country
countrycode(country, origin = 'country.name', destination = 'iso2c') -> Country_code
cbind(Country_code ,Weather_data_country) -> Weather_data_country
WDI(country = Weather_data_country$Country_code, indicator="SP.POP.TOTL",start=2018, end=2018) -> population
# 
head(Weather_data_country)
```
```{r}
merge(Weather_data_country, population, by.x = "Country_code", by.y = "iso2c") -> df
df$country = NULL
df$year = NULL
colnames(df)[7] = "Total Population"
head(df)

```
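Note that `merge()` performs an inner join by default, so any row whose country name could not be matched to an ISO code, or whose code has no population record, is silently dropped from `df`. The sketch below illustrates this on made-up tables (all names and values here are hypothetical, including the "Cruise Ship" row) and shows how `all.x = TRUE` keeps such rows visible.

```r
# Hypothetical case table; the last row has no ISO code
cases <- data.frame(
  Country_code = c("IT", "CN", NA),
  Country      = c("Italy", "China", "Cruise Ship"),
  Confirmed    = c(100, 200, 5)
)

# Hypothetical population table keyed by ISO code
population <- data.frame(
  iso2c       = c("IT", "CN"),
  SP.POP.TOTL = c(6e7, 1.4e9)
)

# Default inner join: the unmatched row disappears
inner <- merge(cases, population, by.x = "Country_code", by.y = "iso2c")

# Left join: the unmatched row is kept with NA population
left <- merge(cases, population, by.x = "Country_code", by.y = "iso2c", all.x = TRUE)

# Which entries were lost by the inner join?
dropped <- setdiff(cases$Country, inner$Country)
dropped
```

Checking the dropped names once is a cheap way to confirm that only non-country entries were lost.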



```{r}
c_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"
d_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv"
r_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv"

# Reading in the Time Data-Set  
confirmed_ts = read_delim(c_url, ",")
death_ts = read_delim(d_url, ",")
recover_ts = read_delim(r_url, ",")
```
```{r}
# Confirmed Cases for dates
confirmed_ts %>% 
  pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Confirmed_Cases") -> confirmed_ts

# Death Cases for dates
death_ts %>% 
  pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Death_Cases") -> death_ts

# Recovered Cases for dates
recover_ts %>% 
  pivot_longer("1/22/20":"3/21/20", names_to = "Dates", values_to = "Recover_Cases") -> recover_ts
```
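
To make the reshaping step concrete, here is a minimal sketch of `pivot_longer()` on an invented two-country table. Instead of naming the first and last date columns, it selects every column except the identifier, which is one way to avoid hard-coding the last date; `toy_wide` and its values are assumptions for illustration only.

```{r}
library(tidyr)

# Hypothetical wide table: one column per date
toy_wide <- data.frame(
  Country   = c("A", "B"),
  `1/22/20` = c(1, 0),
  `1/23/20` = c(2, 1),
  check.names = FALSE
)

# Every column except Country becomes a (Dates, Confirmed_Cases) pair
toy_long <- pivot_longer(toy_wide, cols = -Country,
                         names_to = "Dates", values_to = "Confirmed_Cases")
toy_long
```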
```{r}
confirmed_ts %>% 
  group_by(Dates) %>%
  summarise(Confirmed_Cases = sum(`Confirmed_Cases`))->cs

recover_ts %>%
  group_by(Dates) %>% 
  summarise(Recover_Cases = sum(`Recover_Cases`)) -> rs

death_ts %>%
  group_by(Dates) %>% 
  summarise(Death_Cases = sum(`Death_Cases`)) -> ds
```
```{r}
merge(cs, rs, by.x = "Dates", by.y = "Dates") -> dfr   
merge(dfr, ds, by.x = "Dates", by.y = "Dates" ) -> TotalCount_ByDate
```
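
One caveat at this point: `Dates` is still a character column, so sorting it gives alphabetical rather than chronological order. The short sketch below, on three made-up date strings, shows the difference once the strings are parsed with `as.Date()`.

```{r}
toy_dates <- c("2/10/20", "2/2/20", "1/22/20")

# Character sort: compares the text, so "2/10/20" comes before "2/2/20"
toy_sorted_chr <- sort(toy_dates, method = "radix")

# Date sort: parse first, then sort chronologically
toy_sorted_date <- sort(as.Date(toy_dates, format = "%m/%d/%y"))

toy_sorted_chr
toy_sorted_date
```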


Now that we have cleaned our data, let's answer some questions.

# Analysis

### a. Which country has the most casualties?
```{r}
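# desc() takes a single column, so each sort key is wrapped separately.
# arrange() ignores grouping by default, so no group_by() is needed.
df %>% 
  arrange(desc(`Confirmed Cases`), desc(`Death Cases`), desc(`Recovered Cases`)) -> conf_df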
head(conf_df,4) -> dfmax
dfmax
```
As we can see, **China** has the most confirmed cases, followed by **Italy**, **Iran** and **Spain**.
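
An alternative way to pull out the top rows, shown here only as a sketch, is `slice_max()` from dplyr (version 1.0.0 or later), which combines the sort and the `head()` step. The table `toy_counts` and its numbers are invented.

```{r}
library(dplyr)

# Hypothetical counts for five placeholder countries
toy_counts <- data.frame(
  Country   = c("A", "B", "C", "D", "E"),
  Confirmed = c(10, 500, 80, 3, 120)
)

# The two rows with the largest Confirmed values, largest first
toy_top <- slice_max(toy_counts, order_by = Confirmed, n = 2)
toy_top
```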
 
 
### b. Which country has the fewest casualties?
```{r}
df %>% 
  group_by(Country) %>% 
  arrange(`Confirmed Cases`, `Death Cases`, `Recovered Cases`) -> conf_df
head(conf_df,4) -> conf_df
conf_df
```
As we can see, **Republic of the Congo** has the fewest confirmed cases, followed by **Puerto Rico**, **Palestinian territory** and **Antigua and Barbuda**.
 
 
### c. What is the general trend of the COVID-19 virus as the days progress?

Let us recall the data we cleaned up above and show some of its rows.
```{r}
head(TotalCount_ByDate)
```
We see one categorical variable called **Dates** and three quantitative variables called **Confirmed Cases**, **Recover Cases** and **Death Cases**. However, since we have time series data, we thought it would be interesting to try using the **xts** library.

```{r}
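library(xts)  # provides xts(); harmless if already loaded above
as.Date(TotalCount_ByDate$Dates,format="%m/%d/%y") -> o

data.frame(o, TotalCount_ByDate$Confirmed_Cases, TotalCount_ByDate$Death_Cases, TotalCount_ByDate$Recover_Cases) -> data
# Keep only the numeric columns in the xts body; the dates go in order.by.
# Including the Date column would coerce the whole matrix to character.
don <- xts(x = data[, -1], order.by = data$o)
colnames(don)[1] = "Confirmed Cases"
colnames(don)[2] = "Death Cases"
colnames(don)[3] = "Recovered Cases"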
```

```{r}
library(dygraphs)  # provides dygraph() and the dy*() helpers; harmless if already loaded
# Finally the plot
p <- dygraph(don) %>%
  dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors = RColorBrewer::brewer.pal(3, "Set1")) %>%
  dyRangeSelector() %>%
  dyCrosshair(direction = "vertical") %>%
  dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE)  %>%
  dyRoller(rollPeriod = 1)
p
```
From the graph we can see a sharp global increase in the number of cases after March 12. The curve may be following an exponential trend, but it has yet to reach its peak. We also see that confirmed cases are growing at a faster rate than recovered and death cases.
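
One rough way to check whether a series grows exponentially is to regress the logarithm of the counts on time: exponential growth shows up as a straight line, and the slope gives the growth rate. The sketch below uses synthetic counts that double every 5 days by construction; it is not fitted to the data in this report, and the names `growth_days` and `growth_cases` are made up.

```{r}
# Synthetic series: 100 cases at day 0, doubling every 5 days
growth_days  <- 0:20
growth_cases <- 100 * 2^(growth_days / 5)

# Log-linear fit: log(cases) = intercept + rate * day
growth_fit  <- lm(log(growth_cases) ~ growth_days)
growth_rate <- unname(coef(growth_fit)["growth_days"])

# Doubling time implied by the fitted daily growth rate
growth_doubling_days <- log(2) / growth_rate
growth_doubling_days
```

On real counts the points will not sit exactly on a line, so the residuals of the log-linear fit are worth inspecting before reading much into the doubling time.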


### d. Does climate have an effect on the COVID-19 virus?
For the purpose of this analysis, we want to see whether temperature has any effect on the COVID-19 virus. From question (a), we found that the countries with the most confirmed cases are China and Italy. We start by taking the mean of their two average temperatures as a benchmark.
```{r}
bench_temp = c(dfmax$`Average Temprature`[1],dfmax$`Average Temprature`[2])
mu = mean(bench_temp)
mu
```
Our mu here is 11.6 degrees Celsius, which we take as a rough benchmark for a temperature at which the virus appears able to sustain itself.

Now let's split our dataset into 3 categories by number of confirmed cases:

1. Small: confirmed cases below 200
2. Medium: confirmed cases between 200 and 5000
3. Large: confirmed cases of 5000 or more
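
The same three-way split can also be expressed as a single labelled column with `cut()`, which guarantees that every country falls into exactly one category. This is only a sketch on invented counts, using the 200 and 5000 thresholds from the `filter()` calls in this section.

```{r}
# Hypothetical confirmed-case counts, including both boundary values
toy_confirmed <- c(50, 200, 4999, 5000, 80000)

# right = FALSE makes each interval closed on the left: [0, 200), [200, 5000), [5000, Inf)
toy_size <- cut(toy_confirmed,
                breaks = c(0, 200, 5000, Inf),
                labels = c("Small", "Medium", "Large"),
                right  = FALSE)

table(toy_size)
```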


```{r}
df %>% 
  filter(`Confirmed Cases` < 200) -> small

head(small)
```

```{r}
df %>% 
  filter(`Confirmed Cases` >= 200, `Confirmed Cases` < 5000) -> medium  # >= so that exactly 200 is not dropped

head(medium)
```


```{r}
df %>% 
  filter(`Confirmed Cases` >= 5000) -> large

head(large)
```




Below are scatter plots for each of the three subsets.
```{r}
ggplot(small, aes(y = `Confirmed Cases`, x = `Average Temprature`)) + geom_point()
```

**Small Graph:** We can see that the confirmed cases are most densely clustered where the temperature is above 10 degrees Celsius. So we can speculate that temperature might play an important role when the number of confirmed cases is below 200.



```{r}
ggplot(medium, aes(y = `Confirmed Cases`, x = `Average Temprature`)) + geom_point()
```

**Medium Graph:** Most countries in this group have between 500 and 750 confirmed cases. For those countries the temperatures are evenly spread out, which suggests that once a certain number of people have been infected, temperature may no longer shorten the virus's lifespan enough to make a difference. Countries with between 750 and 5000 cases tend to have lower average temperatures, about 15 degrees Celsius and below.



```{r}
ggplot(large, aes(y = `Confirmed Cases`, x = `Average Temprature`)) +geom_point()
```

**Large Graph:** Almost all of the data points lie below 15 degrees Celsius. Only one data point is above 40000 confirmed cases (this is China, as seen in our earlier EDA).

Let's start by checking whether there is any correlation between our variables.

```{r}
cor.test(small$`Average Temprature`, small$`Confirmed Cases`)
```
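
By default `cor.test()` computes Pearson's correlation, which measures linear association only. Since case counts tend to be heavily skewed, a rank-based (Spearman) correlation is one possible alternative; we have not applied it to the data here. The sketch below uses made-up values where the relationship is perfectly monotone but not linear.

```{r}
# Made-up data: y increases with x, but along a cubic curve
rank_x <- 1:10
rank_y <- rank_x^3

pearson_res  <- cor.test(rank_x, rank_y)                       # linear association
spearman_res <- cor.test(rank_x, rank_y, method = "spearman")  # monotone association

unname(pearson_res$estimate)
unname(spearman_res$estimate)
```

The Spearman estimate reaches 1 because the ranks agree perfectly, while the Pearson estimate stays below 1 because the points do not lie on a straight line.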




```{r}
small.1 <- lm(`Average Temprature` ~ `Confirmed Cases`, data = small)
summary(small.1)
```
```{r}
medium.1 <- lm(`Average Temprature` ~ `Confirmed Cases`, data = medium)
summary(medium.1)
```

```{r}
large.1 <- lm(`Average Temprature` ~ `Confirmed Cases`, data = large)
summary(large.1)
```
After running a linear regression on all three datasets, we can see that the p-values are all above 0.05, so Confirmed Cases is not a significant predictor in any of them. The R-squared values are also very low, meaning that the linear relationship between temperature and confirmed cases explains very little of the variation in any of the three datasets.
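
The models above regress temperature on confirmed cases rather than the other way around. For a regression with a single predictor this does not change the slope's p-value or the R-squared, as the sketch below illustrates on simulated data (the variables `sym_temp` and `sym_cases` are invented, not taken from our dataset).

```{r}
set.seed(1)

# Simulated data with a noisy linear relationship
sym_temp  <- runif(30, min = 0, max = 30)
sym_cases <- 50 + 2 * sym_temp + rnorm(30, sd = 20)

fit_cases_on_temp <- lm(sym_cases ~ sym_temp)
fit_temp_on_cases <- lm(sym_temp ~ sym_cases)

# Slope p-value and R-squared from each direction
p_a  <- summary(fit_cases_on_temp)$coefficients[2, "Pr(>|t|)"]
p_b  <- summary(fit_temp_on_cases)$coefficients[2, "Pr(>|t|)"]
r2_a <- summary(fit_cases_on_temp)$r.squared
r2_b <- summary(fit_temp_on_cases)$r.squared

c(p_a = p_a, p_b = p_b, r2_a = r2_a, r2_b = r2_b)
```

The slope estimates themselves do differ between the two directions, so the choice still matters when interpreting the coefficient.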

After looking at our datasets and graphs and running linear regressions on each subset, we cannot confidently say that there is a correlation between temperature and the number of confirmed cases. In the small dataset it initially seemed that there might be a relation between higher temperatures and confirmed cases, but the medium and large datasets did not support that assumption. This makes sense: the small dataset covers countries with few confirmed cases, so it may not hold enough evidence to support firm conclusions. Overall, based on our evidence and findings, we cannot say with confidence that there is a relationship between temperature and the spread of the virus. One thing to note is that the virus began to spread during the cooler months of the most affected countries, so it is hard to tell whether our conclusions would have been different had the outbreak happened at a different time of year.



### e. Does population have an effect on the COVID-19 virus?

Our dataset has multiple fields, but for the purpose of this question we are only concerned with confirmed cases and total population, so we drop everything else.


```{r}
df %>% select(Country,`Confirmed Cases`, `Total Population`) %>% 
  arrange(desc(`Total Population`)) -> dfpop
head(dfpop)
```

```{r}
ggplot(dfpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()
```
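
Because populations span several orders of magnitude, points in a plot like this tend to bunch up near the origin. One option, sketched below on invented numbers, is to put both axes on a log scale; note that a log axis cannot show countries with zero cases.

```{r}
library(ggplot2)

# Hypothetical populations and case counts spanning several orders of magnitude
toy_pop_df <- data.frame(
  Population = c(5e5, 8e6, 6e7, 1.3e9),
  Confirmed  = c(12, 900, 31000, 140)
)

# Same scatter plot, but with log10-scaled axes
toy_log_plot <- ggplot(toy_pop_df, aes(x = Population, y = Confirmed)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
toy_log_plot
```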


```{r}
dfpop %>% 
  filter(`Total Population` < 1000000) -> smallpop
head(smallpop)
```

```{r}
dfpop %>% 
  filter(`Total Population` >= 1000000, `Total Population` < 50000000) -> mediumpop

head(mediumpop)
```


```{r}
dfpop %>% 
  filter(`Total Population` >= 50000000) -> largepop
head(largepop)
```


```{r}
ggplot(smallpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()
```

```{r}
ggplot(mediumpop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()
```

```{r}
ggplot(largepop, aes(x = `Total Population`, y = `Confirmed Cases`)) +geom_point()
```


From all three groups, we can see that the number of confirmed cases is mostly low relative to each country's total population. There are a couple of outliers in each dataset, but this is to be expected. No trend shows that a higher population equates to more confirmed cases.
We can run a linear regression once again to see whether there is a trend that we cannot make out from our graphs.

```{r}
smallpop.1 <- lm(`Confirmed Cases` ~ `Total Population`, data = smallpop)
summary(smallpop.1)
```
```{r}
mediumpop.1 <- lm(`Confirmed Cases` ~ `Total Population`, data = mediumpop)
summary(mediumpop.1)
```
```{r}
largepop.1 <- lm(`Confirmed Cases` ~ `Total Population`, data = largepop)
summary(largepop.1)
```
Based on our data above, we can conclude that a higher population does not necessarily mean more confirmed cases. This can be seen clearly in our dfpop dataframe: India has a total of 142 confirmed cases with a total population of 1352617328, whereas Italy has a total of 31506 confirmed cases with a total population of 60431283.

Despite having less than a twentieth of India's population, Italy has tens of thousands more confirmed cases. A number of factors contribute significantly to the number of confirmed cases, but having more people in a country does not always mean more people have the virus. There is an increased likelihood of the virus spreading faster if precautions are not taken by the government and the people, as is the case in Italy. However, there are many actions individuals can take to slow the spread of the virus, such as social distancing, self-quarantining and practicing good hygiene and cleanliness. A high population doesn't always mean a high number of confirmed cases, but it does mean more individuals are at risk of catching the virus.
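
The India and Italy comparison can be made concrete by converting the figures quoted above into cases per million inhabitants. The helper `per_million()` below is a hypothetical name introduced only for this illustration; the four numbers are the ones from our dfpop dataframe.

```{r}
# Hypothetical helper: confirmed cases per million inhabitants
per_million <- function(cases, population) {
  cases / population * 1e6
}

india_rate <- per_million(142, 1352617328)
italy_rate <- per_million(31506, 60431283)

# How many times higher Italy's per-capita rate is than India's
rate_ratio <- italy_rate / india_rate

c(India = india_rate, Italy = italy_rate, Ratio = rate_ratio)
```

On a per-capita basis the gap between the two countries is far larger than the raw counts alone suggest.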

### Conclusion

It seems that the global fear of COVID-19 and its rapid increase stems from the same problem we ran into in this project: it is difficult to contain an unknown disease when we cannot predict how well it thrives in its environment. Our analysis could not tell us whether temperature plays an effective role in controlling the growth of the virus, with the most cases being in China and the fewest in the Republic of the Congo. Although there were signs of the virus being less effective in hotter climates among the countries with the fewest casualties, as the number of victims grew exponentially, temperature's effect on the lifespan of the virus possibly became insignificant. And since this growth happened during the colder months of the most affected countries, it is even more difficult to draw strong conclusions. It is possible that at a different time of year the picture would have been completely different, and it may still be too early to form an opinion. As we saw in our first graph, we have yet to reach the worldwide peak. Similarly, we could not conclude that the population size of a country plays a part in the spread. Other factors, such as a country's age distribution or travel restrictions, may explain its current state. From recent events we do know that, in general, the more people travel, the more the virus is able to spread by contact, somewhat like a domino effect. So our solution has been to go into quarantine regardless of the number of cases in the area. Hopefully we can soon gather enough data to learn how to reduce casualties arising from COVID-19.





