US Health Insurance Data Set
Insurance Premium Charges in US with important details for risk underwriting.
https://www.kaggle.com/teertha/ushealthinsurancedataset?select=insurance.csv
This data set contains 1338 rows of insured data, where the insurance charges are given against the following attributes of the insured: age, sex, BMI, number of children, smoker status, and region. The dataset is relatively simple and was proposed as an excellent starting point for exploratory data analysis (EDA), general statistical analysis, hypothesis testing, and training linear regression models for predicting insurance premium charges. It was also proposed as a helpful, simple set for understanding risk underwriting in health insurance and how different lifestyle and medical attributes of the insured population affect the insurance premium.
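# Packages used by the code in this report
library(ggplot2); library(dplyr); library(crosstable); library(corrplot)
insurance <- read.csv("data/insurance.csv", header = TRUE)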
head(insurance)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
There are four factors used to determine the premium amount for an individual: the location where they live, smoking status, age, and the number of people the policy covers. All of these variables are included in the aforementioned dataset. An insurance company can set an individual’s premium based on their location of residence, using factors such as differences in competition, state versus local rules, and cost of living. The dataset accounts for this by including a region variable for every individual. Insurance companies can charge tobacco users up to 50% more than non-tobacco users [4]. This information is included under the smoker variable. Age affects insurance premiums through a rate whose baseline is set at 21 years old. Generally, as an individual gets older, their premium costs increase, and the rate continues to increase until the age of 65 [5]. The rate differs by state. Insurance companies can also change premium costs depending on the number of people a policy covers: the more people on the plan, the higher the premium [4]. This information is covered under the children variable, which represents the number of dependents an individual’s policy covers.
The dataset also includes sex and BMI variables. Neither is a factor that can be used to determine an individual’s premium amount. Body Mass Index (BMI) is calculated as the individual’s weight (kg) divided by the square of their height (m). It is standard in epidemiological studies of body weight to consider an adult with a BMI between 25 and 30 “overweight” and an adult with a BMI of 30 or above “obese”. BMI should not be regarded as a medical diagnosis because many people who would be counted as obese on the basis of body mass index are physically fit when measured in other ways, such as percentage of body fat [6]. Although BMI is not a foolproof medical measurement, it is easy to calculate and to include in surveys, so it is included in this dataset as a generalized measure of body size relating to health.
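The BMI formula and the cut-offs quoted above can be sketched in a few lines of R. The helper names `bmi` and `bmi_category` and the example weight and height are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helpers (not from the original report).
# BMI = weight in kilograms divided by height in meters squared.
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

# Classify BMI using the cut-offs quoted in the text:
# 25 up to 30 is "overweight", 30 or above is "obese".
bmi_category <- function(bmi_value) {
  cut(bmi_value,
      breaks = c(-Inf, 25, 30, Inf),
      labels = c("below overweight", "overweight", "obese"),
      right = FALSE)  # intervals are closed on the left: [lower, upper)
}

# Illustrative person: 85 kg and 1.75 m tall.
example_bmi <- bmi(weight_kg = 85, height_m = 1.75)
example_bmi
bmi_category(example_bmi)
```

Because `cut()` is vectorized, the same helper could be applied to a whole column such as `insurance$bmi`.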
There are no missing or undefined values in the data set. The variables age, bmi, charges, and children are numeric types. The variables sex, region, and smoker are character types.
# Count missing values per column. There are no missing values in this data set.
sapply(insurance, function(x) sum(is.na(x)))
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
#check dimension of the data set
dim(insurance)
## [1] 1338 7
#check each variable's data type and brief statistic information
summary(insurance)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
The ages of those included in the dataset have a relatively uniform distribution with the exception of those around 18-19 years of age. Those individuals who are 18-19 years of age appear to have triple the representation of any other age in the data set. Using a log2 transformation on age would not help the distribution become closer to normal.
#Check numeric variables age distributions by plotting histogram.
hist(insurance$age, xlab="Age", breaks = 20, main="Age Distribution")
Body Mass Index (BMI) is an individual’s weight (kg) divided by height (m) squared. The CDC defines adults with a BMI between 25 and 30 as overweight and those with a BMI of 30 or above as obese. The overall BMI distribution in this dataset is close to a normal distribution with a small right skew and a center around 30.
hist(insurance$bmi, xlab="BMI", breaks = 20, main="BMI Distribution")
The children variable is discrete and takes only six values (0 to 5). The distribution of the number of children in this dataset is right skewed, with zero children being the most common value.
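# Bar plot of counts per value; hist() on integer data would merge 0 and 1 into one bin.
barplot(table(insurance$children), xlab="Number of children", ylab="Frequency", main="Number of Children Distribution")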
The distribution of charges is extremely right skewed, with insurance premium charges ranging from $1122 to $63770. A log2 transformation was applied to bring the distribution of this variable closer to normal, which better suits the statistical analysis that follows. We use the transformed charges variable in all of the following analyses.
hist(log2(insurance$charges), xlab="log2charges", breaks = 20, main="Charges Distribution")
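The effect of a log2 transformation on a right-skewed variable can be sketched with synthetic data. The log-normal sample below is an assumed stand-in, not the report's charges, and the `skewness` helper is a hypothetical name defined here for illustration.

```r
# Synthetic, right-skewed stand-in for charges (NOT the report's data).
set.seed(1)
fake_charges <- rlnorm(1000, meanlog = 9, sdlog = 0.9)

# Sample skewness: the mean of the cubed standardized values.
# Values near 0 suggest a roughly symmetric distribution.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(fake_charges)        # strongly positive (right skewed)
skew_log <- skewness(log2(fake_charges))  # close to zero after the transform
c(raw = skew_raw, log2 = skew_log)
```

The real charges need not behave exactly like this idealized sample, so the histogram above remains the relevant check for this dataset.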
The percentage of males to females included in the data set is 50.52% to 49.48%. This is the inverse of what we see in the United States 2020 population, which is split 50.75% and 49.25% for female and male respectively. However, since the split is very close to 50/50 both in the dataset and in the current US population, we chose not to investigate any sampling bias.
The percentage of non-smoker to smoker included in the dataset is 79.52% to 20.48%. This percentage represents the approximate prevalence in the United States at the time of the data collection.
The data was entered in a way that separated the regions of the United States by quadrant (northeast/northwest/southeast/southwest). The data set includes relatively even samples from each region with slightly more individuals belonging to the southeast group.
# Check the distributions of the categorical variables
crosstable(insurance,sex,smoker,region) %>% as_flextable(keep_id=FALSE)
| label | variable | value |
| --- | --- | --- |
| sex | female | 662 (49.48%) |
| sex | male | 676 (50.52%) |
| smoker | no | 1064 (79.52%) |
| smoker | yes | 274 (20.48%) |
| region | northeast | 324 (24.22%) |
| region | northwest | 325 (24.29%) |
| region | southeast | 364 (27.20%) |
| region | southwest | 325 (24.29%) |
We plot pairs of variables against each other to examine the relationships between them.
The medians and IQRs of BMI for males and females are very similar, with a median around 30 for both. However, the spread of BMI is larger for males: their upper outliers and maximum BMI are higher than those for females.
# Boxplot of BMI on sex. There is no significant trend difference of BMI between male and female.
ggplot(insurance, aes(x=sex, y=bmi)) +
geom_boxplot(col="red")+
stat_boxplot(col="red", geom="errorbar", width=0.5)+
geom_jitter(position=position_jitter(0.02))+
ggtitle("Boxplot of BMI in Male and Female")
In the group of people who have zero children, the median age is around 36 years old. The groups with 1 and 2 children have the same median age of 40. The group with 3 children has the highest median age at 41. The groups with 4 and 5 children have median ages around 38-39. The groups with 0 to 3 children all have a very similar age range, from 18 to 64. The ages of those with 0 children are spread more widely between the first and third quartiles, whereas the groups with 1-3 children have a narrower IQR. The groups with 4 and 5 children contain far fewer people and therefore have a slightly smaller age range. Across the six categories of number of children, there appears to be no significant trend of older individuals having more dependents.
# Boxplot of age by number of children. No clear difference between groups.
ggplot(insurance, aes(x=as.factor(children), y=age)) +
geom_boxplot(col="red")+
stat_boxplot(col="red", geom="errorbar", width=0.5)+
xlab("Children Number")+
geom_jitter(position=position_jitter(0.02))+
ggtitle("Boxplot of Children Numbers in Age Condition")
The BMI in each region has a distribution similar to a normal distribution. The BMI values for the northeast, northwest and southwest regions have a similar average around 30 (kg/m^2). The BMI distribution of the southeast region is shifted right compared to the other three regions. It indicates that the overall BMI in southeast region is higher than other regions.
# Histogram of BMI distribution in different region. Average BMI in southeast region is a bit higher than the other three regions.
ggplot(data=insurance, aes(bmi)) +
geom_histogram(binwidth =1) +
facet_grid(rows=vars(region))+
ggtitle("Histogram of BMI Distribution in Different Regions")
There are spikes at ages 18 and 19 in all four regions. Apart from these ages, the distribution of age in the four regions is relatively even and similar. There appears to be no significant difference in age between regions.
# Histogram of age distribution in different regions. There is no significant difference in the age distribution across the four regions.
ggplot(data=insurance, aes(age)) +
geom_histogram() +
facet_grid(rows=vars(region))+
ggtitle("Histogram of Age Distribution in Different Regions")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The log2-transformed charges in the four regions have a similar average. The longer left tails indicate that the southern regions include lower charges than the northern regions. At the same time, the higher density in the right tails shows that more people pay very high insurance premium charges in the southern regions than in the northern regions.
# Histogram of insurance charges in different regions.
# After taking log2 of charges, the distribution of charges is closer to a normal distribution.
ggplot(data=insurance, aes(log2(charges))) +
geom_histogram() +
facet_grid(rows=vars(region))+
ggtitle("Histogram of Insurance Charges in Different Regions")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The spread of BMI is relatively similar at each age. Three individuals, all between 18 and 23 years old, have a BMI above 50, far beyond the obesity threshold of 30. People older than 60 show a narrower BMI range, between 20 and 40. Overall, however, BMI values at every age concentrate between about 19 and 43. BMI increases slightly with age: a Pearson’s correlation test gives a correlation of 0.1092719 with a significant p-value (<0.001). The correlation is weak, so this might not be a compelling finding on its own; however, background research into the CDC research on BMI in the US showed trends of increased BMI in certain age ranges when compared to other ages.
# Plot the bmi vs. age.
ggplot(insurance, aes(x=age, y=bmi)) +
geom_point()+
geom_smooth(method = "lm", se = FALSE, color = "red", size=1.4)+
ggtitle("Plot of BMI vs. Age")
## `geom_smooth()` using formula 'y ~ x'
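The Pearson's correlation test mentioned above is not shown in the code, so the following is only a sketch of how such a test might be run with base R's `cor.test()`. The helper name `pearson_summary` and the toy vectors are assumptions for illustration.

```r
# Hypothetical helper (not from the original report): run a Pearson
# correlation test and return the estimate and its p-value.
pearson_summary <- function(x, y) {
  ct <- cor.test(x, y, method = "pearson")
  c(r = unname(ct$estimate), p_value = ct$p.value)
}

# Toy vectors standing in for two numeric columns (NOT the report's data).
toy_x <- 1:10
toy_y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
toy_result <- pearson_summary(toy_x, toy_y)
toy_result

# On the report's data the call would presumably look like:
# pearson_summary(insurance$age, insurance$bmi)
```

With a sample as large as 1338 rows, even a weak correlation can produce a very small p-value, so the size of `r` matters as much as its significance.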
The plot clearly shows the minimum insurance premium charges are associated with age. As age increases the minimum charges increase gradually as well. However, the ranges of the high premium charges are not significantly different among all ages. There are three levels of insurance charges in all ages. This level difference probably occurs because the charges are determined by two other main criteria besides age.
# Plot the charges vs. age. There is an obvious trend: as age increases, the charges increase.
ggplot(insurance, aes(x=age, y=log2(charges))) +
geom_point()+
geom_smooth(method = "lm", se = FALSE, color = "red", size=1.4)+
ggtitle("Plot of Insurance Charges vs. Age")
## `geom_smooth()` using formula 'y ~ x'
For each BMI value, the log2 charges are spread between about 10 and 15. Interestingly, the log2 charges drop once BMI exceeds 41. Smokers clearly pay higher charges than non-smokers: the smokers’ points, drawn in orange (#E69F00), cluster above log2(charges) = 14. The linear regression line indicates a positive trend, with charges increasing as BMI increases, although the trend appears weak.
# Plot the charges vs. bmi, colored by smoker status. Charges tend to be higher when bmi is above 30, and smokers clearly pay higher charges.
ggplot(insurance, aes(x=bmi, y=log2(charges),col=smoker)) +
geom_point()+
scale_color_manual(values=c("#999999", "#E69F00")) +
geom_smooth(method = "lm", se = FALSE, color = "yellow", size=1.4)+
ggtitle("Plot of Insurance Charges vs. BMI")
## `geom_smooth()` using formula 'y ~ x'
The minimum log2 charges increase with the number of children until plateauing at 4 children. This may indicate that the minimum insurance premium charges are associated with the number of children. However, the maximum log2 charges for 0, 1, and 4 children are similar and higher than those of the other three groups. This difference may occur because the fewest people have 5 children; the range of log2 charges for 5 children is the narrowest.
# Plot the charges vs. number of children.
ggplot(insurance, aes(x=children, y=log2(charges))) +
geom_point()+
geom_smooth(method = "lm", se = FALSE, color = "orange", size=1.4)+
ggtitle("Plot of Insurance Charges vs. Number of Children")
## `geom_smooth()` using formula 'y ~ x'
Correlation plots help to see the correlations between variables and help decide which variables will be used in a multiple linear regression that models charges. Smoker and charges show the highest correlation (0.7~0.8). Second is age, followed by BMI, then children number. There is no visible correlation of sex and region with charges. So in the multiple linear regression analysis, we consider smoker, age, BMI and children as the four independent variables to model charges.
#change character variables to numeric.
insuranceNum=insurance
insuranceNum$sex <- ifelse(insuranceNum$sex=="male", 1, 0)
insuranceNum$smoker <- ifelse(insuranceNum$smoker=="no", 0, 1)
regionType <- c(northeast = 1, northwest = 2, southeast = 3, southwest=4)
insuranceNum$region <- regionType[insuranceNum$region]
#Plot the correlation matrix.
corrplot(cor(insuranceNum), method="shade")
Age may be a confounding variable. This variable could be confounding because it directly affects the insurance charges and could also affect the children variable. This relationship between age and children occurs because the youngest and oldest individuals included are less likely to have dependents on their insurance plan, while the middle-aged population is more likely to have children on their insurance plan.
BMI may also be a confounding variable because, according to the CDC, the prevalence of obesity in American adults is about 40% among adults between the ages of 20 and 39, 44.8% among adults aged 40-59, and 42.8% among adults aged 60 and older. Both age and BMI drive up medical costs, and they appear to be related to one another, with a higher prevalence of high BMI (risk of obesity) as age increases. Therefore, we cannot say with 100% confidence whether it is age, BMI, or the interplay of the two that is driving up the medical costs for this sample population.
Null hypothesis 1: The average premium insurance charge is the same for those who smoke when compared to those who do not smoke. \[H_{0,1}: \beta_1 = 0\]
Alternative hypothesis 1: The average premium insurance charge is not the same for those who smoke when compared to those who do not smoke. \[H_{A,1}: \beta_1 \neq 0\]
Null hypothesis 2: The average premium insurance charge is the same for individuals of all BMI measurements. \[H_{0,2}: \beta_2 = 0\]
Alternative hypothesis 2: The average premium insurance charge is not the same for individuals of all BMI measurements. \[H_{A,2}: \beta_2 \neq 0\]
Null hypothesis 3: The average premium insurance charge is the same for all ages. \[H_{0,3}: \beta_3 = 0\]
Alternative hypothesis 3: The average premium insurance charge is not the same for all ages. \[H_{A,3}: \beta_3 \neq 0\]
Null hypothesis 4: The average premium insurance charge is the same for individuals with any number of children. \[H_{0,4}: \beta_4 = 0\]
Alternative hypothesis 4: The average premium insurance charge is not the same for individuals with any number of children. \[H_{A,4}: \beta_4 \neq 0\]
We will consider the age, children, BMI, and smoker variables to model the log2 transformation of the charges variable. We model the log2-transformed charges because the original charges are not normally distributed; once the variable is transformed, its distribution is closer to normal.
There are four factors used to determine the premium amount for an individual: the location where they live, smoking status, age, and the number of people the policy covers. All of these variables were included in the model except for location. We decided to remove region because when it was modeled as a simple linear regression with charges, the resulting p-value was not statistically significant. Additionally, the correlation between region and charges was very close to 0 (~-0.006), demonstrating a small or no relationship between the two variables. By the same reasoning, we chose to include BMI, as its simple linear regression model produced a statistically significant p-value as well as a larger Pearson’s correlation coefficient (~0.20). Finally, the sex of an individual cannot be used to determine the premium and did not have a strong correlation with charges (~0.06). Therefore, sex was not included in our model.
We will use a multiple linear regression model to predict the charges. We will use this model because we have multiple variables and want to test if a linear relationship exists between the explanatory variables and the dependent variable.
The multiple linear regression model will be based on Equation I shown below.
Equation I: \[\log_2(\text{charges}) = \beta_0 + \beta_1 \cdot \text{smoker} + \beta_2 \cdot \text{bmi} + \beta_3 \cdot \text{age} + \beta_4 \cdot \text{children}\]
model <- lm(log2(charges) ~ smoker + bmi + age + children, data = insurance)
model
##
## Call:
## lm(formula = log2(charges) ~ smoker + bmi + age + children, data = insurance)
##
## Coefficients:
## (Intercept) smokeryes bmi age children
## 10.07402 2.22643 0.01531 0.05018 0.14600
The model calculated coefficients for each variable and an intercept as shown in Equation II.
Equation II: \[\log_2(\text{charges}) = 10.07 + 2.23 \cdot \text{smoker} + 0.02 \cdot \text{bmi} + 0.05 \cdot \text{age} + 0.15 \cdot \text{children}\]
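As an illustration of how the fitted equation turns into a dollar prediction, the sketch below plugs one made-up individual into the coefficients printed in the model output and undoes the log2 transformation. The helper name `predict_charges` and the example inputs are assumptions, not part of the original analysis.

```r
# Coefficients copied from the printed model output above.
coefs <- c(intercept = 10.07402, smoker = 2.22643, bmi = 0.01531,
           age = 0.05018, children = 0.14600)

# Hypothetical helper: predicted charges in dollars for one individual.
# smoker is TRUE/FALSE; the result is 2 raised to the predicted log2(charges).
predict_charges <- function(smoker, bmi, age, children) {
  log2_charges <- coefs[["intercept"]] +
    coefs[["smoker"]] * as.numeric(smoker) +
    coefs[["bmi"]] * bmi +
    coefs[["age"]] * age +
    coefs[["children"]] * children
  2^log2_charges
}

# Illustrative individual: 40-year-old non-smoker, BMI 30, two children.
example_charges <- predict_charges(smoker = FALSE, bmi = 30, age = 40, children = 2)
example_charges
```

With the fitted model object available, `2^predict(model, newdata = ...)` would presumably give the same kind of result without copying coefficients by hand. Because the model is fit on the log scale, the back-transformed value is better read as a typical (geometric-mean) charge than as an arithmetic mean.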
sum_model <- summary(model)
p_vals <- sum_model$coefficients[2:5,4]
sum_model
##
## Call:
## lm(formula = log2(charges) ~ smoker + bmi + age + children, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.60629 -0.28684 -0.06763 0.10384 2.99476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.074017 0.100589 100.151 < 2e-16 ***
## smokeryes 2.226430 0.043912 50.703 < 2e-16 ***
## bmi 0.015306 0.002923 5.236 1.91e-07 ***
## age 0.050181 0.001270 39.502 < 2e-16 ***
## children 0.145997 0.014714 9.922 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6479 on 1333 degrees of freedom
## Multiple R-squared: 0.7622, Adjusted R-squared: 0.7614
## F-statistic: 1068 on 4 and 1333 DF, p-value: < 2.2e-16
confint(model)
## 2.5 % 97.5 %
## (Intercept) 9.876688044 10.27134617
## smokeryes 2.140286907 2.31257350
## bmi 0.009571357 0.02104162
## age 0.047688543 0.05267271
## children 0.117132350 0.17486221
Table I: This table shows the p-values and the 95% confidence intervals for each parameter in the model.
According to the results shown in the Table I, all the variables showed that they have a statistically significant association to charges. This interpretation is made because the significance level was set at 0.05. Each p-value was below this threshold; therefore, we can reject each null hypothesis. We estimate there is an association between each explanatory variable and charges when adjusting for the other variables.
Each explanatory variable has a differing effect on the log2 charges outcome, as shown by the equation calculated by the multiple linear regression model. We estimate that the average log2 charges go up by approximately 2.23 (log2 dollars) when an individual is a smoker compared to a non-smoker while keeping the other variables constant. We estimate that the average log2 charges go up by approximately 0.02 (log2 dollars) when an individual’s BMI increases by one unit (kg/m^2) while keeping the other variables constant. We estimate that the average log2 charges go up by approximately 0.05 (log2 dollars) when an individual’s age increases by one year while keeping the other variables constant. We estimate that the average log2 charges go up by approximately 0.15 (log2 dollars) when an individual’s number of dependents increases by one while keeping the other variables constant. These results mean that for each variable there is low compatibility with the null hypothesis. Therefore, the situation in which a variable has no effect on the charges variable, that is, its coefficient is equal to 0, is very unlikely, and we can reject each null hypothesis. Finally, the intercept in the equation, approximately 10.07, is the predicted log2(charges) for an individual when all variables are equal to zero. Therefore, an individual who is 0 years old, has a BMI of 0, does not smoke, and has 0 children would be billed approximately $1077.91 for health insurance. This is an extrapolation outside the observed data (the minimum age is 18 and the minimum BMI is 15.96), so the intercept has no practical interpretation on its own.
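Since "log2 dollars" are hard to read directly, it may help to convert each coefficient back to the dollar scale: an additive change of b in log2(charges) corresponds to multiplying charges by 2^b. The sketch below applies this to the estimates in the summary output; the variable names are illustrative assumptions.

```r
# Estimates copied from the summary output above.
coefs <- c(smokeryes = 2.226430, bmi = 0.015306,
           age = 0.050181, children = 0.145997)

# An additive change of b on the log2 scale multiplies charges by 2^b.
multipliers <- 2^coefs
percent_change <- (multipliers - 1) * 100

round(multipliers, 3)     # multiplicative effect per one-unit increase
round(percent_change, 2)  # the same effect expressed as a percentage

# The intercept converted back to dollars (all predictors set to zero).
baseline_dollars <- 2^10.074017
baseline_dollars
```

Read this way, being a smoker multiplies the expected charges by about 4.68, while each additional year of age adds roughly 3.5%, holding the other variables constant.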
Additionally, the 95% confidence intervals are shown with the p-values in Table I. A 95% confidence interval means that if the data were resampled and the model refit many times, about 95% of the intervals constructed this way would contain the true coefficient value.
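The intervals in Table I can be reproduced by hand from the estimate, its standard error, and the residual degrees of freedom. The following is a minimal sketch for the smoker coefficient, using the values printed in the summary output; the variable names are assumptions.

```r
# Values copied from the summary output for the smokeryes coefficient.
estimate  <- 2.226430
std_error <- 0.043912
df_resid  <- 1333   # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = df_resid)
ci_smoker <- estimate + c(-1, 1) * t_crit * std_error
ci_smoker
```

The result should agree with the `smokeryes` row of the interval output up to rounding of the copied inputs.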
We decided not to include region in our multiple linear regression model. This decision was made due to two factors. First, there was minimal correlation (~-0.006) between charges and region. Second, when a simple linear regression model was run with region as the explanatory variable, the result was statistically insignificant (p-value = 0.119).
As previously stated, the BMI metric can inaccurately measure the health of an individual. If BMI is used as a metric for an individual’s health, we would expect larger BMI values to increase insurance charges. Due to BMI’s unreliability, this assumption will not be true in all cases. Second, the region variable takes only four values: northeast, northwest, southeast, and southwest. Splitting the country into four quadrants does not allow us to draw specific conclusions about where individuals live. We know that location is used as a factor for determining insurance cost. Therefore, we would expect location to be effective in predicting the charge amounts. However, this expectation may not hold for this dataset due to the manner in which location was recorded. Third, the age data contains far more 18 and 19 year-old individuals than any other age bin. No reason for this is provided with the dataset; therefore, we cannot explain its occurrence. Additionally, this skew in the data could impact the results. Finally, the children variable is a good estimate of the number of dependents or people on an individual’s plan. However, this variable does not include instances where a spouse is a dependent on a plan. Therefore, some data may misrepresent how many individuals are covered by a certain plan.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] corrplot_0.90 gmodels_2.18.1 dplyr_1.0.7 crosstable_0.2.1
## [5] ggplot2_3.3.5 webshot_0.5.2 magick_2.7.3 kableExtra_1.3.4
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 tidyr_1.1.3 jsonlite_1.7.2 viridisLite_0.4.0
## [5] splines_4.1.1 gtools_3.9.2 assertthat_0.2.1 highr_0.9
## [9] yaml_2.2.1 gdtools_0.2.3 pillar_1.6.2 backports_1.2.1
## [13] lattice_0.20-44 glue_1.4.2 uuid_0.1-4 digest_0.6.27
## [17] checkmate_2.0.0 rvest_1.0.1 colorspace_2.0-2 htmltools_0.5.2
## [21] Matrix_1.3-4 pkgconfig_2.0.3 purrr_0.3.4 scales_1.1.1
## [25] processx_3.5.2 gdata_2.18.0 svglite_2.0.0 officer_0.4.0
## [29] tibble_3.1.4 mgcv_1.8-36 generics_0.1.0 farver_2.1.0
## [33] ellipsis_0.3.2 withr_2.4.2 survival_3.2-11 magrittr_2.0.1
## [37] crayon_1.4.1 evaluate_0.14 ps_1.6.0 fansi_0.5.0
## [41] nlme_3.1-152 MASS_7.3-54 forcats_0.5.1 xml2_1.3.2
## [45] tools_4.1.1 data.table_1.14.0 lifecycle_1.0.0 stringr_1.4.0
## [49] flextable_0.6.9 munsell_0.5.0 zip_2.2.0 callr_3.7.0
## [53] compiler_4.1.1 jquerylib_0.1.4 systemfonts_1.0.2 rlang_0.4.11
## [57] grid_4.1.1 rstudioapi_0.13 base64enc_0.1-3 labeling_0.4.2
## [61] rmarkdown_2.11 gtable_0.3.0 DBI_1.1.1 R6_2.5.1
## [65] knitr_1.34 fastmap_1.1.0 utf8_1.2.2 nortest_1.0-4
## [69] stringi_1.7.4 Rcpp_1.0.7 vctrs_0.3.8 tidyselect_1.1.1
## [73] xfun_0.26