---
title: "Covid19 Analysis"
author: "Sindhu T P"
date: "2023-10-17"
output:
  html_document : default
  pdf_document  : default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Importing essential packages
```{r import packages}

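# Load the packages
library(tidyverse)
library(lubridate) # provides mdy() and year(); attached explicitly because tidyverse only includes it from version 2.0.0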
library(ggplot2)
library(dplyr)
library(gridExtra)
library(reshape2)
library(plotly)
```
## Covid-19 Data Import
The data set comes from the Johns Hopkins University GitHub repository at the URL shown in the code below. Because of permission changes to the repository, importing the data directly from the URL was not possible, so the files were downloaded from the repository to my local machine and read from there, as shown in the 'Reading data' section of this report.
```{r data import}
url_in <- "https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/"
files <- c("time_series_covid19_confirmed_global.csv","time_series_covid19_deaths_global.csv","time_series_covid19_confirmed_US.csv","time_series_covid19_deaths_US.csv")
url <- str_c(url_in,files)
```
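
The `/blob/` form of a GitHub URL serves the HTML page that displays a file, not the file itself, which may be one reason the direct import did not work. The following is a minimal sketch, assuming the repository layout is unchanged and that the `raw.githubusercontent.com` host is reachable. The names `raw_base` and `raw_urls` are illustrative and are not part of the original analysis.

```r
# Sketch only: build "raw" URLs that serve the CSV contents directly.
# Assumption: the files still live at the same path on the master branch.
raw_base <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
files <- c("time_series_covid19_confirmed_global.csv",
           "time_series_covid19_deaths_global.csv")
raw_urls <- paste0(raw_base, files)

# With network access, the files could then be read without a manual download:
# global_cases  <- readr::read_csv(raw_urls[1])
# global_deaths <- readr::read_csv(raw_urls[2])
print(raw_urls)
```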
### Reading data
```{r read data}
global_cases <- read_csv("D:/Sindhu/CUB/MSDS_CUB/R all/time_series_covid19_confirmed_global.csv")
global_deaths <- read_csv("D:/Sindhu/CUB/MSDS_CUB/R all/time_series_covid19_deaths_global.csv")

```
### Data Preprocessing
```{r pivoting}
#Pivoting date columns to make it a single column
global_cases <- pivot_longer(global_cases,cols = -c('Province/State', 'Country/Region','Lat','Long'),names_to = 'Date',values_to = 'Cases')%>%select(-c('Lat','Long'))

global_deaths <- pivot_longer(global_deaths,cols= -c('Province/State', 'Country/Region','Lat','Long'),names_to = 'Date',values_to = 'Deaths') %>% select(-c('Lat','Long'))

```
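
To make the reshaping step concrete, here is a small sketch on a made-up two-country table in the same wide layout (one column per date). The values and the names `wide` and `long` are invented for illustration only.

```r
library(tidyr)
library(dplyr)

# Toy table in the same wide layout as the JHU files (values are made up)
wide <- tibble::tibble(
  `Province/State` = c(NA, NA),
  `Country/Region` = c("A", "B"),
  Lat = c(0, 1),
  Long = c(0, 1),
  `1/22/20` = c(0, 1),
  `1/23/20` = c(2, 3)
)

# Every column except the identifiers becomes a (Date, Cases) pair
long <- wide %>%
  pivot_longer(cols = -c(`Province/State`, `Country/Region`, Lat, Long),
               names_to = "Date", values_to = "Cases") %>%
  select(-c(Lat, Long))

print(long)
```

Each country now contributes one row per date, which is the shape the later joins and summaries rely on.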

```{r merge}
#Joining global cases and deaths
global <- global_cases %>% full_join(global_deaths) %>% rename(Province_State = `Province/State`, Country_Region = `Country/Region`)
global <- global %>% filter(Cases>0)

```
### Population data import
```{r population import}
#Since the global data set does not have the "population" column, we are importing it from the below mentioned URL and file. Later, this will be joined to the global table.

#Importing population data from the below url
url_for_population <- "https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
```

### Reading data from the csv file
```{r read popu data}
popu <- read_csv("D:/Sindhu/CUB/MSDS_CUB/R all/population.csv")
popu <- popu %>% select(-c('UID','iso2','iso3','code3','FIPS','Admin2','Lat','Long_'))
```
```{r joining global and population}
#joining global dataframe and population dataset
global <- global %>% unite("Combined_Key", c(Province_State,Country_Region),na.rm = TRUE,remove = FALSE, sep = ", ")

global <- global %>% left_join(popu , by = c('Province_State','Country_Region'))

global <- global %>% select(-c(Combined_Key.y))

global <- global %>% mutate(Date = mdy(Date))


```

### Summarizing the data by date
```{r globalsummary data}
global_totl <- global %>%
  group_by(Country_Region, Date) %>%
  summarise(cases = max(Cases), deaths = max(Deaths), population = sum(Population)) %>%
  mutate(death_per_mill = deaths * 1000000 / population,
         cases_per_mill = cases * 1000000 / population) %>%
  select(c(Date, cases, deaths, death_per_mill, cases_per_mill, population)) %>%
  ungroup()

print(global_totl,n=50)
```
```{r}
global_totl <- global_totl %>% drop_na()


```


### Summarizing the data by country_region
```{r global summary data by country}
global_totl_by_country <- global %>% group_by(Country_Region) %>% summarise(cases = max(Cases),deaths = max(Deaths),population = max(Population))%>%mutate(death_per_mil = deaths*1000000/population)%>%
  mutate(cases_per_mil = cases*1000000/population)%>%select(c(Country_Region,cases,deaths,death_per_mil,cases_per_mil,population))%>% ungroup()

print(global_totl_by_country)
```
```{r}
global_totl_by_country<- global_totl_by_country %>% drop_na()
global_cases <- global_cases %>%mutate(Date=mdy(Date))
```


### Visualizing the data - Cases by year
```{r data viz1}

p<-global_totl %>%ggplot(aes(x= year(Date),y=cases))+
    geom_point(aes(color="cases"))+
    geom_line()+
    theme(legend.position="bottom",axis.text.x=element_text(angle=90))+
    labs(title="Covid19 Cases Around The World", y=NULL)

ggplotly(p, tooltip = c("x", "y"))
```


### Visualizing the data - Cases and Deaths by Country
```{r data India viz}
# Summarize cases and deaths by country
df <- global_totl_by_country %>% select(c(Country_Region,cases,deaths))

# Melt the data frame
df_melted <- melt(df, id.vars = "Country_Region")

# Create the grouped bar plot; axis options are passed to layout() so that they take effect
p <- plot_ly(df_melted, x = ~Country_Region, y = ~value, color = ~variable, type = "bar", width = 800, height = 400) %>%
  plotly::layout(title = "Covid19 by Country",
                 yaxis = list(rangemode = "nonnegative"),
                 # Sort the countries alphabetically and rotate the x-axis labels
                 xaxis = list(rangeslider = list(), categoryorder = "category ascending", tickangle = 90),
                 barmode = "group", bargap = 0.2)

# Display the plot
p

```


### Analyzing the data to look for new cases and new deaths
```{r data analysis}
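# Difference the cumulative counts within each country so that values are not carried across countries
global_totl <- global_totl %>% group_by(Country_Region) %>% arrange(Date, .by_group = TRUE) %>% mutate(New_Cases = cases - lag(cases), New_Deaths = deaths - lag(deaths)) %>% ungroup()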
global_totl %>% select(c(Country_Region,New_Cases,New_Deaths,everything()))
```
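
Because `cases` and `deaths` are cumulative, daily counts are differences between consecutive dates of the same country. The following is a minimal sketch on made-up numbers, assuming the differencing is done within each `Country_Region` group so that the first date of one country is not compared with the last date of the previous one. The names `toy` and `toy_new` are illustrative.

```r
library(dplyr)

# Toy cumulative counts for two countries (values are made up)
toy <- tibble::tibble(
  Country_Region = c("A", "A", "A", "B", "B", "B"),
  Date = rep(as.Date(c("2020-03-01", "2020-03-02", "2020-03-03")), 2),
  cases = c(1, 4, 9, 100, 150, 180)
)

# Difference within each country; the first date of each country has no predecessor
toy_new <- toy %>%
  group_by(Country_Region) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(New_Cases = cases - dplyr::lag(cases)) %>%
  ungroup()

print(toy_new)
```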
### Visualizing the data for India(New_Cases and New_Deaths by date)
```{r Everyday cases for India}
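# Compare against a Date object; a bare 2020-01-30 would be evaluated as the arithmetic 2020 - 1 - 30
df_india_new <- global_totl %>% filter(Country_Region == 'India') %>% filter(Date > as.Date("2020-01-30")) %>% select(c(Date, New_Cases, New_Deaths))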
# Melt the data frame
df_melted_new <- melt(df_india_new, id.vars = "Date")

# Create the plot; axis options and buttons are passed to layout() so that they take effect
p <- plot_ly(df_melted_new, x = ~Date, y = ~value, color = ~variable, type = "bar",
             colors = c("green", "red"), width = 800, height = 400) %>%
  plotly::layout(
    title = "Covid19 by Date",
    yaxis = list(type = "linear", rangemode = "nonnegative"),
    # Rotate the x-axis labels and keep the range slider
    xaxis = list(rangeslider = list(), tickangle = 90),
    barmode = "group", bargap = 0.5,
    # Buttons that show one series at a time (trace order: New_Cases, New_Deaths)
    updatemenus = list(
      list(
        type = "buttons",
        showactive = FALSE,
        buttons = list(
          list(
            label = "New Cases",
            method = "update",
            args = list(list(visible = c(TRUE, FALSE)), list(title = "Covid19 by Date - New Cases"))
          ),
          list(
            label = "New Deaths",
            method = "update",
            args = list(list(visible = c(FALSE, TRUE)), list(title = "Covid19 by Date - New Deaths"))
          )
        )
      )
    )
  )

# Display the plot
p
```


### Visualizing the data - New cases and new deaths for India by year
```{r data viz India}
data <- global_totl%>% filter(Country_Region == "India") %>% filter(year(Date) %in% c(2022,2023))

pl1 <- data %>% ggplot(aes(x= Date,y=New_Cases))+
    geom_line(aes(color="New_Cases"))+
    geom_point(aes(color="New_Cases"))+
        theme(legend.position="bottom",axis.text.x=element_text(angle=90))+
    labs(title="Covid19 in India - New_Cases", y=NULL)

pl2 <- data %>% ggplot(aes(x= Date,y=New_Deaths))+
    geom_line(aes(color="New_Deaths"))+
    geom_point(aes(color="New_Deaths"))+
        theme(legend.position="bottom",axis.text.x=element_text(angle=90))+
    labs(title="Covid19 in India - New_Deaths", y=NULL)

grid.arrange(pl1,pl2,ncol=2, widths=c(4,4))

```


### Analyzing the data further to look for min and max deaths and cases

```{r min data}
global_totl %>% filter(cases > 0 & deaths >0) %>% slice_min(n=10,order_by = death_per_mill) %>% select(death_per_mill,everything())
```
```{r max data}
global_totl %>% filter(cases > 0 & deaths >0 & population >0) %>% slice_max(n=10,order_by = death_per_mill) %>% select(death_per_mill,everything())
```

## Modeling the Data
```{r data model}
global_totl <- global_totl %>% filter(cases >0 & population >0)
modl <- lm(data = global_totl,formula = deaths ~ cases)
summary(modl)
# Pearson correlation between cumulative cases and cumulative deaths
cor_coef <- cor(x = global_totl$cases, y = global_totl$deaths, use = "everything", method = "pearson")
print(paste("correlation coefficient for cases and deaths is ",cor_coef))

```
##### Inference:
The correlation coefficient between cases and deaths is 0.885711219050053, which indicates a strong positive correlation. The coefficient is computed on the cumulative case and death counts, not on the per-million values: as the number of cases increases, the number of deaths also tends to increase.
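
For a simple linear regression with a single predictor, the squared Pearson correlation equals the model's R-squared, so the coefficient above also describes how much of the variation in deaths the fitted line accounts for. A small sketch on synthetic data illustrates this; the numbers are generated for the example and are not taken from the Covid19 data set.

```r
# Synthetic data: deaths roughly proportional to cases plus noise (made up)
set.seed(1)
cases  <- seq(1000, 100000, length.out = 50)
deaths <- 0.02 * cases + rnorm(50, sd = 200)

fit <- lm(deaths ~ cases)
r   <- cor(cases, deaths, method = "pearson")

# For one predictor, r^2 and R-squared coincide (up to floating-point error)
print(c(r_squared_from_cor = r^2, r_squared_from_lm = summary(fit)$r.squared))
```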

### Predictions by the model
```{r prediction}
global_totl <- global_totl %>% mutate(pred_deaths = predict(modl)) %>% filter(Country_Region=='India') %>% select(c(cases,deaths,pred_deaths,Date))
global_totl
```
### Visualizing the data - Actual and predicted deaths by date for India
```{r data viz}
global_totl %>% ggplot()+
    geom_point(aes(x=Date,y=deaths),color="blue")+
    geom_point(aes(x=Date,y=pred_deaths),color="red")+
    scale_y_log10()+
    theme(legend.position="bottom",axis.text.x=element_text(angle=90))+
    labs(title="Covid19 deaths prediction", y=NULL)
```


##### Conclusion:
Although the above model does a good job of predicting the number of deaths after 2021, the initial ascent does not resemble the actual data, and hence further modifications to the model are required to account for the unexplained variance.
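
One possible modification, sketched here on synthetic data and not part of the analysis above, is to fit the regression on a log scale. On a log-scaled plot, a straight-line fit on the raw scale can look poor for small counts because its intercept dominates there. The data and the names `toy`, `fit_lin`, and `fit_log` are assumptions made for illustration.

```r
# Synthetic cumulative series in which deaths follow a power law of cases (made up)
set.seed(42)
cases  <- round(10^seq(2, 7, length.out = 60))
deaths <- round(0.05 * cases^0.9 * exp(rnorm(60, sd = 0.1)))
toy    <- data.frame(cases = cases, deaths = deaths)

# Baseline: straight line on the raw scale, the same form as deaths ~ cases
fit_lin <- lm(deaths ~ cases, data = toy)

# Alternative (an assumption, not the author's model): fit on the log10 scale
fit_log <- lm(log10(deaths) ~ log10(cases), data = toy)

# Back-transform the log-scale predictions so both can be compared with `deaths`
toy$pred_lin <- predict(fit_lin)
toy$pred_log <- 10^predict(fit_log)

print(head(toy))
```

Whether such a transformation actually improves the fit for the real data would need to be checked against the observed deaths.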

From the above data visualizations and analysis performed on this data set we can infer following conclusions:

1. Worldwide, the overall number of Covid19 cases reached 20M by mid-2020, and by 2023 it had increased drastically to 100M cases.

2. New cases for India: May 2021 had the highest number of confirmed Covid19 cases in India, followed by the second-highest peak in Jan 2022.

3. New deaths for India: The year 2022 saw a huge number of new deaths in India.

4. The US has recorded the highest overall number of Covid19 cases compared to other countries.

Possible Biases:

1. Testing Bias: The number of confirmed cases is heavily dependent on the testing rate in each country. Countries with more extensive testing will report more cases, which doesn’t necessarily mean they have a higher prevalence of the disease.

2. Reporting Bias: Different countries have different criteria for reporting COVID-19 cases and deaths. Some countries might only report hospital cases, while others might also include cases and deaths in care homes or the wider community.

3. Population Bias: Comparisons between countries can be misleading if not adjusted for population size. A country with a larger population will likely have more cases than a smaller one.

4. Healthcare System Bias: The quality and capacity of healthcare systems can affect the number of reported deaths. Countries with better healthcare systems might have lower death rates because they can provide better care for the sick.

5. Time Lag Bias: There is often a delay between when someone gets infected, when they get tested and confirmed positive, and when their case gets reported in the data. This can lead to underestimation of recent cases.

6. Geographical Bias: The spread of COVID-19 is not uniform within a country. Some regions might be more affected than others, but when we analyze country-level data, this regional variation is lost.
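
The population bias in item 3 can be illustrated with a small sketch that applies the same per-million formula used for `cases_per_mil` earlier in the report. The two countries and their numbers are hypothetical and chosen only to show how the ranking can flip after normalization.

```r
# Made-up totals for two hypothetical countries
toy <- data.frame(
  Country_Region = c("Large", "Small"),
  cases = c(1000000, 100000),
  population = c(500000000, 5000000)
)

# Same per-million formula as used for cases_per_mil in the report
toy$cases_per_mil <- toy$cases * 1000000 / toy$population

print(toy)
```

In this toy example the larger country has ten times as many cases in absolute terms, yet the smaller country has ten times as many cases per million people.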