---
title: "Health Insurance Analysis"
output: html_document
author: Aman Mahajan
---
#1.Introduction
#1.1 Problem Statement:
Our analysis is focused on a problem insurance providers face today: defining their target market and planning sales strategies that help them increase their market share and thereby maximize their profitability.
#1.2 Model:
Since the response is categorical, i.e. "whether the respondent has insurance", we have conducted our analysis using stepwise regression on a Generalized Linear Model (GLM) with a binomial distribution, and then validated the results using a partial F-test and the Area Under the Curve (AUC).
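The binomial GLM with a logit link models the log-odds of having insurance as a linear function of the predictors (the standard logistic-regression formulation):

$$
\log\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik},
\qquad p_i = P(\text{insurance}_i = \text{yes} \mid x_i)
$$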
#1.3 Load Libraries:
```{r}
library(e1071)     #skewness() function used for data analysis
library(stats)     #Cook's distance
library(ggplot2)   #visualisation of data
library(Amelia)    #visual display of missing values
library(gridExtra) #arranging different plots in a single grid
library(caTools)   #train/test splitting and model validation
library(pscl)      #McFadden pseudo R-squared
library(ROCR)      #ROC curves
```
#2. Data Pre-Processing:
Cross-section data originating from the Medical Expenditure Panel Survey conducted in 1996.
Usage:
data("HealthInsurance")
Format:
A data frame containing 8,802 observations on 11 variables.
health factor. Is the self-reported health status good?
age age in years.
limit factor. Is there any limitation?
gender factor indicating gender.
insurance factor. Does the individual have health insurance? (our response variable)
married factor. Is the individual married?
selfemp factor. Is the individual self-employed?
family family size.
region factor indicating region.
ethnicity factor indicating ethnicity: African-American, Caucasian, other.
education factor indicating highest degree attained: no degree, GED (high school equivalent), high school, bachelor, master, PhD, other.
#2.1 Load the data:
```{r Export, echo=TRUE}
library("AER")
data("HealthInsurance")
```
First, we process the data by discovering and labeling missing values with NA, and converting categorical variables to proper factors with meaningful labels.
#2.2 File Structure and Content
```{r}
head(HealthInsurance)
str(HealthInsurance)
```
#2.3 Missing Values:
```{r Missing Value, echo=TRUE}
#To replace the possible missing values with NA
summary(HealthInsurance)
HealthInsurance$age[HealthInsurance$age==0]<- NA
HealthInsurance$family[HealthInsurance$family==0]<- NA
#To Verify that the data is complete
#install.packages("Amelia")
missmap(HealthInsurance, main = "Missing values vs observed")
```
#2.4 Validating the factor variables in the dataset.
```{r}
is.factor(HealthInsurance$ethnicity)
is.factor(HealthInsurance$health)
is.factor(HealthInsurance$limit)
is.factor(HealthInsurance$gender)
is.factor(HealthInsurance$age)
is.factor(HealthInsurance$insurance)
is.factor(HealthInsurance$selfemp)
is.factor(HealthInsurance$family)
is.factor(HealthInsurance$region)
is.factor(HealthInsurance$married)
is.factor(HealthInsurance$education)
#We can see that all the variables are factors except the age and family variables.
```
For a better understanding of how R will deal with the categorical variables, we can use the contrasts() function on the factors.
```{r}
contrasts(HealthInsurance$health)
contrasts(HealthInsurance$limit)
contrasts(HealthInsurance$gender)
contrasts(HealthInsurance$insurance)
contrasts(HealthInsurance$married)
contrasts(HealthInsurance$selfemp)
contrasts(HealthInsurance$region)
contrasts(HealthInsurance$ethnicity)
contrasts(HealthInsurance$education)
```
It can be said that the raw data is already processed and does not need any further cleaning or formatting.
#3 Exploratory Data Visualisation:
#3.1 Distribution Plots for the Continuous Variables:
```{r}
#pairs(~health+age+limit+gender+insurance+married+selfemp+family+region+ethnicity+education, data=HealthInsurance)
#pairs() only handles numeric variables, so a full scatterplot
#matrix is not informative here; instead we plot the continuous
#variable age with a histogram, a density plot and a Q-Q plot.
plot1 = qplot(age, data = HealthInsurance, xlab = "age")
plot2 = qplot(age, data = HealthInsurance, geom = "density", fill = "red")
plot3 = qplot(sample = age, data = HealthInsurance)
grid.arrange(plot1, plot2, plot3, ncol = 3)
#plot4 = qplot(family, data = HealthInsurance, xlab = "Evaluation")
#plot5 = qplot(family, data = HealthInsurance, geom = "density", fill = "red")
#plot6 = qplot(sample = family, data = HealthInsurance)
#grid.arrange(plot4, plot5, plot6, ncol = 3)
```
Inference:
On observing the density plot, we can conclude that the age variable is approximately normally distributed: the curve is neither left- nor right-skewed.
Since the variable is not skewed, we do not need transformations such as log or sqrt to make it normally distributed.
The Q-Q plot shows age clustered fairly uniformly over the quantile range [-4, 4].
#3.2 Skewness:
We can also check the skewness using skewness() function:
```{r}
skewness(HealthInsurance$age)
skewness(HealthInsurance$family)
```
It can be seen that the skewness is close to zero for age, so age is not skewed. However, the family variable exhibits right skew, as its skewness is positive; hence we can apply a log or sqrt transformation to it.
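For reference, the moment-based sample skewness reported here is (a standard definition; e1071's skewness() offers several closely related variants via its type argument):

$$
g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}}
$$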
```{r}
log.y<-log(HealthInsurance$family)
plot(log.y,xlab="family",ylab="Log(family)")
sqrt.y<-sqrt(HealthInsurance$family)
plot(sqrt.y,xlab="family",ylab="Sqrt(family)")
```
#3.3 BoxPlots and Stripcharts:
Box plots of the continuous variables, viewing the insurance variable as the output variable.
```{r}
plot4<- qplot(insurance,family, data=HealthInsurance, geom=c("boxplot"))
plot5<- qplot(insurance,age, data=HealthInsurance, geom=c("boxplot"))
grid.arrange(plot4, plot5)
numeric_data <- HealthInsurance[,c("family","age")]
numeric_data <- data.frame(scale(numeric_data ))
stripchart(numeric_data,
vertical = TRUE,
method = "jitter",
col = "orange",
pch=1,
main="Stripcharts")
```
Visual inspection of the family vs insurance box plot shows a considerable number of potential outliers in the family variable.
#3.4 Z-Scores Scaling:
```{r}
#numeric_data was already z-scaled above, so we summarise it directly
summary(numeric_data)
```
After rescaling, the mean of both variables is 0.
Checking the 1st and 3rd quartiles of both variables, we can infer that they lie between -2 and +2 with a few exceptions.
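The rescaling applied by scale() is the usual z-score:

$$
z_i = \frac{x_i - \bar{x}}{s}
$$

where $\bar{x}$ is the sample mean and $s$ the sample standard deviation, so for an approximately normal variable roughly 95% of values fall within $|z| \le 2$.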
We will now plot boxplot and strip charts on the basis of z-score
```{r}
boxplot(numeric_data, main = "Boxplot of re-scaled variables",col = (c("gold","darkgreen")))
stripchart(numeric_data, vertical = TRUE, method = "jitter", col = (c("gold","darkgreen")), pch = 1, main = "Stripcharts of re-scaled variables")
```
This confirms the transformation: both variables now have mean 0. Also, the potential outliers in family are distinctly visible after the z-score transformation.
#3.5 Correlation Matrix:
```{r}
cor(numeric_data)
```
Family size and age show a negative correlation, which is counter-intuitive, as family size would be expected to be positively related to a person's age.
#3.6 CHI Square Test of Independence:
The Chi Square test of independence is used to determine if there is a significant relationship between two categorical variables.
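The test statistic compares the observed counts $O_{ij}$ in the contingency table against the counts expected under independence:

$$
\chi^2 = \sum_{i,j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{(\text{row } i \text{ total})\times(\text{column } j \text{ total})}{N}
$$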
```{r}
#Run the test for each categorical predictor against insurance
cat_vars <- c("health", "limit", "gender", "married", "selfemp",
              "region", "ethnicity", "education")
for (v in cat_vars) {
  a.data <- table(HealthInsurance$insurance, HealthInsurance[[v]])
  print(a.data)
  barplot(a.data, beside = TRUE, legend = levels(HealthInsurance[[v]]))
  print(chisq.test(a.data))
}
```
From the Chi-square tests we can infer that the limit variable is relatively less significant than the other variables.
#4. Predictive Modelling:
#4.1 Split the data into training and test sets in an 80-20 ratio:
```{r}
set.seed(100)
split <- sample.split(HealthInsurance$insurance, SplitRatio = 0.8)
training <- subset(HealthInsurance, split == TRUE)
testing  <- subset(HealthInsurance, split == FALSE)
dim(training)
dim(testing)
```
#4.2 Generalized linear Model:
We will not fit a linear or polynomial model, as the response is not a continuous variable; we need a Yes/No prediction. Hence we fit a GLM with a binomial distribution. A Poisson distribution is not applicable because the response is a "Yes"/"No" factor rather than a count.
```{r}
#Big Model
model1 <- glm (insurance ~. , data = training, family = binomial(link='logit'))
summary(model1)
```
The summary helps us judge the goodness of fit of the training model: if a coefficient's p-value is less than or equal to alpha (p <= .05), it is statistically significant; if the p-value is greater than alpha (p > .05), it is not.
#4.3 Hypothesis Testing for Stepwise Regression:
No matter how significant a model is, we can still improve it using hypothesis testing so that all the variable coefficients are significant.
Null hypothesis: H0: coef(ethnicityafam) = 0.
Since the p-value for ethnicityafam (.78) is greater than .05, we fail to reject the null hypothesis, so we can remove ethnicity to construct a new model with better significance.
```{r}
#Remove Ethnicity to construct the model
model2 <- glm(insurance ~ health+age+limit+gender+married+selfemp+family+region+education, family =binomial (link='logit'),data=training)
summary(model2)
```
We can still observe insignificant variables with p-values greater than .05. Applying the same hypothesis test to model 2, we see that the p-value for regionmidwest is greater than .05.
Hence we remove the region variable as well.
```{r}
#Model after removing region variable
model3 <- glm(insurance ~ health+age+limit+gender+married+selfemp+family+education, family =binomial (link='logit'),data=training)
summary(model3)
```
We remove the limit variable as well.
```{r}
#Model after removing limit variable
model4 <- glm(insurance ~ health+age+gender+married+selfemp+family+education, family =binomial (link='logit'),data=training)
summary(model4)
```
We can see that all the variables are now significant, so model 4 is the candidate model for predicting the response variable.
#4.4 Partial F-test for Confidence Intervals:
For alpha = .05:
We find the 95% interval from the 0 and 0.95 quantiles of the F distribution with (6388, 6401) degrees of freedom for model 4.
```{r}
anova(model4)
lwr <- qf(0, 6388, model4$df.residual)
upr <- qf(0.95, 6388, model4$df.residual)
c(lwr, upr)
```
So if the z-value falls outside this interval, we reject the null hypothesis in favour of the alternative.
Here are two decision rules based on the F distribution for our dataset:
Risk: a=0.05 Rule: If the z-value falls within |1.042|, accept the null hypothesis.
Risk: a=0.05 Rule: If the z-value is greater than |1.042|, accept the alternative hypothesis.
To see the z-value we use the summary() function:
```{r}
summary(model4)
```
It can be inferred that even at the 5% level all the variables are significant predictors of the outcome, as the z-values do not lie within the interval. Hence the alternative hypothesis is accepted for each, making each a significant predictor.
Conclusion of the test with risk a=0.05 using the P-value
```{r}
lwrpf <- pf(0, 6388, model4$df.residual)
uprpf <- pf(0.95, 6388, model4$df.residual)
c(lwrpf, uprpf)
```
So if the p-value falls outside this interval, we reject the null hypothesis.
Here are two decision rules based on the F distribution for our dataset:
Risk: a=0.05 Rule: If p-value > .0209, accept the null hypothesis.
Risk: a=0.05 Rule: If p-value < .0209, accept the alternative hypothesis.
To see the p-value we use the summary() function:
```{r}
summary(model4)
```
It can be inferred that at the 5% level all the variables remain significant predictors.
#4.5 Tests for Validating Model Significance:
Anova Test
```{r}
anova(model1, test="Chisq")
anova(model2, test="Chisq")
anova(model3, test="Chisq")
anova(model4, test="Chisq")
```
The difference between the null deviance and the residual deviance shows how well the model performs against the null model (a model with only the intercept). The wider this gap, the better; it is largest for model 4.
Analyzing the tables, we can see how the deviance changes as each variable is removed one at a time, and that the fit (as measured by AIC) improves as we go from model 1 to model 4.
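The null-vs-residual deviance gap can itself be tested formally with a chi-square likelihood-ratio test; a minimal sketch, assuming model4 has been fitted as above:

```{r}
# Likelihood-ratio test of model4 against the intercept-only null model
lr_stat <- model4$null.deviance - model4$deviance   # drop in deviance
lr_df   <- model4$df.null - model4$df.residual      # parameters added
pchisq(lr_stat, df = lr_df, lower.tail = FALSE)     # small p-value favours model4
```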
McFadden Test:
```{r}
#install.packages("pscl")
pR2(model1)
pR2(model2)
pR2(model3)
pR2(model4)
```
While no exact equivalent of the R-squared of linear regression exists, the McFadden R-squared index can be used to assess model fit; it is highest for model 4, validating it as the most significant model.
#4.6 Potential Outliers:
```{r}
(plot1 <- qplot(insurance, model4$fitted.values, geom = "boxplot", data=training)+labs(y="Fitted Values")+ggtitle("Fitted Values vs Insurance"))
```
The box plot of fitted values by the insurance factor reveals a considerable number of potential outliers.
#4.7 ROC Curve:
To assess the predictive ability of the model we use the ROC curve and calculate the AUC (Area Under the Curve), which are typical performance measurements for a binary classifier.
```{r}
pred<- predict(model4, training, type= 'response')
pred<- prediction(pred, training$insurance)
eval<- performance(pred,'tpr','fpr')
plot(eval, colorize = TRUE)
```
The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve.
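In terms of the confusion-matrix counts, the two rates plotted are:

$$
\text{TPR} = \frac{TP}{TP+FN},
\qquad
\text{FPR} = \frac{FP}{FP+TN}
$$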
#4.8 Assessing the predictive ability of the model:
```{r}
model.probs=predict(model4,training,type="response")
# misclassification error:train data
pred1<- ifelse(model.probs>0.5, 1, 0)
#Confusion Matrix
tab1<- table(Predicted= pred1, Actual= training$insurance)
tab1
# misclassification error:train data
trainerror<- 1- sum(diag(tab1))/ sum(tab1)
trainerror
#Accuracy of Training data
print(paste('Accuracy',1-trainerror))
# Test Error
model.test=predict(model4,testing,type="response")
# misclassification error:test data
pred_test<- ifelse(model.test>0.5, 1, 0)
#Confusion Matrix
tab_test<- table(Predicted= pred_test, Actual = testing$insurance)
tab_test
# misclassification error:test data
testerror<- 1- sum(diag(tab_test))/ sum(tab_test)
testerror
#Accuracy of Test data
print(paste('Accuracy',1-testerror))
```
The training error rate is roughly 19% and the accuracy is 81.03%, which is high for real-world predictions of this kind.
After fitting the model to the test data, we observe an accuracy of 80.83%, which indicates a satisfactory goodness of fit.
#4.9 Area Under the Curve:
```{r}
auc<- performance(pred,"auc")
auc <- auc@y.values[[1]]
auc
```
As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5; this holds for model 4, which has an AUC of 0.7486.
#4.10 Predicting whether a new person has insurance:
To predict insurance status for new individuals, we simply use the predict() function and feed in a data frame of new predictor values.
```{r}
predict(model4, newdata=data.frame(health="yes", age=20, family=3, gender="female", education="bachelor", married="yes", selfemp="no"), type="response")
```
As shown, we can obtain the predicted probability that a person has insurance by creating a new data frame. Suppose a person walks into a hospital with:
health = yes
age = 20
family = 3
gender = female
education = bachelor
married = yes
selfemp = no
Then the model predicts a 92.38% probability that she has insurance.
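That probability can equivalently be read as odds; a quick check using the 92.38% figure above:

```{r}
p <- 0.9238
p / (1 - p)   # roughly 12:1 odds in favour of having insurance
```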
#5 Conclusion:
Such predictions can be widely used by insurance companies to estimate whether a customer has health insurance. This in turn helps identify potential insurance buyers and lets companies target the right audience to maximize health insurance sales.