First Project-US Insurance Charges Analysis

Introduction - The interplay of attributes of the insured population and the effect they may have on insurance premium and medical risk underwriting when it is allowed

What is a Health Insurance Premium?

In the United States the health insurance premium is the amount a person pays for their health insurance every month. Insurance in the US is a way people in a community cover health costs collectively. Therefore, during months when a person is healthy and does not use health care services, the premium dollars paid monthly are used to pay health care costs of other people covered under the same plan.

In 2020 the average national cost for health insurance was about $456 a month for an individual and $1,152 a month for a family. However, these monthly costs do vary greatly based upon the particular health plan used. Even the same insurer may have different plan tiers with different costs. In addition to these premiums, consumers may also have to pay out-of-pocket costs, like deductibles, co-pays, and coinsurance when they seek medical care¹.

What is Underwriting?

Underwriting in the United States Health Insurance arena was a process that insurance companies routinely used prior to 2014 to determine if an applicant was an acceptable risk. If they were, underwriting determined how much their premiums should cost based on their medical history. This practice is no longer completely allowed under new rules laid out by the Affordable Care Act (ACA) in 2014².

The ACA states that all new individual major medical plans are guaranteed issue. However, premiums still vary in most states based on age and tobacco use. For individual market coverage and small group coverage plans, insurers cannot consider the individual or group’s overall medical history when setting premiums or determining eligibility for coverage (note that tobacco use can still be used as a factor for determining insurance premiums). Large group coverage, and many medium-sized groups, differ from their individual and small group counterparts in that, when they buy coverage from an insurance company, premiums can be based on the group’s overall claims history. This means less “healthy” groups can be charged higher total premiums than healthier groups. Even so, under the ACA the individual employees within a large or medium group are guaranteed coverage and are not charged different rates based on their individual medical history.

Medical underwriting, although somewhat phased out, still is used on some policies, including fixed-indemnity plans, short-term plans, limited benefit plans, and other non-ACA regulated plans. Individual life insurance and disability insurance policies also still use medical underwriting to determine premiums.

Excepted benefits include short-term health insurance and supplemental insurance products, like dental/vision plans, accident supplements, fixed indemnity plans, and critical illness plans. Most excepted benefits are designed to supplement major medical coverage, rather than replace it. Short-term plans are typically used as stand-alone coverage for a limited period of time (~364 days), with most states having more restrictive rules than the federal rules. These short-term plans, however, may include blanket exclusions for any pre-existing conditions. What counts as a pre-existing condition depends on how far back the plan looks into the person’s medical history. Short-term health insurance plans often use post-claims underwriting, meaning the plan can comb through the consumer’s medical records after the person has enrolled and submitted a claim, rather than before the policy is issued. Therefore, if the post-claims underwriting process determines that the submitted claim is rooted in a pre-existing condition, the insurer can deny the claim even though the individual is enrolled in the plan³.

Introduction - Data Source

US Health Insurance Data Set
Insurance Premium Charges in US with important details for risk underwriting.

https://www.kaggle.com/teertha/ushealthinsurancedataset?select=insurance.csv

This data set contains 1338 rows of insured data, where the insurance charges are given against the following attributes of the insured: age, sex, BMI, number of children, smoker, and region. The dataset is relatively simple and was proposed as an excellent starting point for exploratory data analysis (EDA), general statistical analysis, hypothesis testing, and training linear regression models for predicting insurance premium charges. It was also proposed as a helpful set for understanding risk underwriting in health insurance and how different lifestyle and medical attributes of the insured population affect the insurance premium.

insurance=read.csv("data/insurance.csv", header=TRUE)
head(insurance)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Data Background

There are four factors used to determine the premium amount for an individual: where they live, smoking status, age, and the number of people the policy covers. All of these variables are included in the aforementioned dataset. An insurance company can set an individual’s premium based on their location of residence, taking into account differences in competition, state and local rules, and cost of living. The dataset accounts for this by including a region variable for every individual. Insurance companies can charge tobacco users up to 50% more than non-tobacco users⁴. This information is included under the smoker variable. Age affects insurance premiums through a rate whose baseline is set at 21 years old. Generally, as an individual gets older, their premium costs increase, and the rate continues to increase until the age of 65⁵. The rate differs by state. Insurance companies can also change premium costs depending on the number of people a policy covers. If there are more people on the plan, the premium cost will increase⁴. This information is covered under the children variable, which represents the number of dependents an individual’s policy covers.

The dataset also includes sex and BMI variables. Neither is a factor that can be used to determine an individual’s premium amount. Body Mass Index (BMI) is calculated as the individual’s weight (kg) divided by height (m) squared. It is standard in epidemiological studies of body weight to consider an adult with a BMI between 25 and 30 to be “overweight” and an adult with a BMI above 30 to be “obese”. BMI should not be regarded as a medical diagnosis because many people who would be counted as obese on the basis of body mass index are physically fit when measured in other ways, like percentage of body fat⁶. Although BMI is not a foolproof medical measurement, it is easy to calculate and to include in surveys, so it is included in this dataset as a generalized measurement of body size relating to health.
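The BMI formula and the cut-offs quoted above can be sketched in a few lines of R. The helper names `bmi_from` and `bmi_category` are hypothetical, and the handling of the exact boundary values (25 and 30) is an assumption, since the text does not say which side they fall on.

```r
# Hypothetical helper: BMI from weight in kilograms and height in metres.
bmi_from <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Label BMI values using the cut-offs quoted in the text (25 and 30).
# right = FALSE closes each interval on the left, so exactly 25 is labelled
# "overweight" and exactly 30 is labelled "obese" (an assumption).
bmi_category <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 25, 30, Inf),
      labels = c("below 25", "overweight", "obese"),
      right = FALSE)
}

# BMI values of the first six rows printed by head(insurance)
bmi_head <- c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740)
table(bmi_category(bmi_head))

# A made-up person for illustration: 70 kg and 1.75 m tall
bmi_from(70, 1.75)
```

With the full data set loaded, `bmi_category(insurance$bmi)` would give the same labels for all 1338 rows.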

Exploration of the data

Check missing values and dimension.

There are no missing or undefined values in the data set. The variables age, bmi, charges, and children are numeric types. The variables sex, region, and smoker are character types.

#Count missing values per variable. There are no missing values in this data set
sapply(insurance, function(x) sum(is.na(x)))

##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0

#check dimension of the data set
dim(insurance)

## [1] 1338    7

#check each variable's data type and brief statistic information
summary(insurance)

##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770
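Because sex, smoker, and region are stored as character columns, `summary()` only reports their length and class. One possible alternative, not used in this report, is to convert them to factors so that the summary shows counts per level. The sketch below rebuilds only the six rows printed by `head(insurance)` so that it runs on its own; the name `insurance_head` is hypothetical.

```r
# First six rows, copied from the head(insurance) output above
insurance_head <- data.frame(
  age = c(19, 18, 28, 33, 32, 31),
  sex = c("female", "male", "male", "male", "male", "female"),
  bmi = c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740),
  children = c(0, 1, 3, 0, 0, 0),
  smoker = c("yes", "no", "no", "no", "no", "no"),
  region = c("southwest", "southeast", "southeast",
             "northwest", "northwest", "southeast"),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor
is_chr <- sapply(insurance_head, is.character)
insurance_head[is_chr] <- lapply(insurance_head[is_chr], factor)

# summary() now reports counts per level instead of "Class :character"
summary(insurance_head[c("sex", "smoker", "region")])
```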

Exploration of each variable - Visualization of quantitative variables through plots and categorical variables through ratio of sample population in crosstable.

Age

The ages of those included in the dataset have a relatively uniform distribution with the exception of those around 18-19 years of age. Those individuals who are 18-19 years of age appear to have triple the representation of any other age in the data set. Using a log2 transformation on age would not help the distribution become closer to normal.

#Check numeric variables age distributions by plotting histogram.
hist(insurance$age, xlab="Age", breaks = 20, main="Age Distribution")

BMI

Body Mass Index (BMI) is an individual’s weight (kg) divided by height (m) squared. The CDC defines adults with BMI between 25 and 30 as overweight and those over 30 as obese. The overall BMI distribution in this dataset is close to a normal distribution with a small right skew and a center around 30.

hist(insurance$bmi, xlab="BMI", breaks = 20, main="BMI Distribution")

Number of Children

The children variable is discrete and takes only six values (0 to 5). The distribution of number of children in this dataset is right skewed, with the majority of individuals having zero children.

hist(insurance$children, xlab="number of children",breaks = 6, main="Number of Children Distribution")

Charges

The distribution of charges is extremely right skewed, and the insurance premium charges cover a wide range, from $1122 to $63770. A log2 transformation was applied to bring the distribution of this variable closer to normal. This transformation allows us to perform more accurate statistical analysis. We use the transformed charges variable in all of the following analyses.

hist(log2(insurance$charges), xlab="log2charges", breaks = 20, main="Charges Distribution")
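As a small illustration of what the log2 scale means, the sketch below uses only the six charges printed by `head(insurance)`. It shows that the transformation is undone by raising 2 to the transformed value, and that a difference of 1 on the log2 scale corresponds to a doubling of charges. The variable names are hypothetical.

```r
# Charges of the first six rows printed by head(insurance)
charges_head <- c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)

# Forward transformation used throughout the analysis
log2_charges <- log2(charges_head)

# Back-transformation: 2 raised to the log2 value recovers the dollar amount
back <- 2^log2_charges
all.equal(back, charges_head)

# Doubling a charge adds exactly 1 on the log2 scale
log2(2 * charges_head) - log2_charges

# The reported minimum and maximum charges on the log2 scale
log2(c(1122, 63770))
```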

Sex

The percentage of females to males included in the data set is 49.48% to 50.52%. This is the inverse of what we see in the United States 2020 population, which is split 50.75% and 49.25% for female and male respectively. However, since the split is very close to 50/50 both in the dataset and in the current US population, we chose not to investigate any sampling bias.

Smoker

The percentage of non-smoker to smoker included in the dataset is 79.52% to 20.48%. This percentage represents the approximate prevalence in the United States at the time of the data collection.

Region

The data was entered in a way that separated the regions of the United States by quadrant (northeast/northwest/southeast/southwest). The data set includes relatively even samples from each region with slightly more individuals belonging to the southeast group.

#Check categorical variables distributions
crosstable(insurance,sex,smoker,region) %>% as_flextable(keep_id=FALSE)

label	variable	value
sex	female	662 (49.48%)
sex	male	676 (50.52%)
smoker	no	1064 (79.52%)
smoker	yes	274 (20.48%)
region	northeast	324 (24.22%)
region	northwest	325 (24.29%)
region	southeast	364 (27.20%)
region	southwest	325 (24.29%)
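The percentages in the crosstable can be re-derived from the counts with base R alone. The sketch below hard-codes the counts from the table above; the helper name `pct` is hypothetical.

```r
# Counts copied from the crosstable output
sex_counts    <- c(female = 662, male = 676)
smoker_counts <- c(no = 1064, yes = 274)
region_counts <- c(northeast = 324, northwest = 325,
                   southeast = 364, southwest = 325)

# Hypothetical helper: share of each level in percent, rounded to 2 decimals
pct <- function(counts) round(100 * prop.table(counts), 2)

pct(sex_counts)
pct(smoker_counts)
pct(region_counts)

# Each variable should account for all 1338 rows
c(sum(sex_counts), sum(smoker_counts), sum(region_counts))
```

With the full data frame loaded, `prop.table(table(insurance$sex))` would produce the same shares directly from the raw column.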

Explore the data by plotting graphs to compare variables.

The following plots compare pairs of variables to look for relationships between them.

BMI in male and female

The medians and IQRs of BMI for males and females are very similar, with a median around 30 for both. The spread is somewhat larger for males: the upper outliers and the maximum BMI are higher among males than among females.

# Boxplot of BMI on sex. There is no significant trend difference of BMI between male and female.
ggplot(insurance, aes(x=sex, y=bmi)) +
  geom_boxplot(col="red")+
  stat_boxplot(col="red", geom="errorbar", width=0.5)+
  geom_jitter(position=position_jitter(0.02))+
  ggtitle("Boxplot of BMI in Male and Female")

Age in Children Number

In the group of people who have zero children, the median age is around 36 years old. The groups with 1 and 2 children have the same median age of 40. The group with 3 children has the highest median age at 41. The groups with 4 and 5 children have median ages around 38-39. The 0 to 3 children groups all have a very similar age range, from 18 to 64. The ages of those with 0 children are spread more widely between the first and third quartiles than the ages of those with 1-3 children, which have a narrower IQR. The 4 and 5 children groups contain far fewer people and hence have a slightly smaller age range. Across the six categories of children number, there appears to be no clear trend that older individuals have more dependents.

# Boxplot of children number on age condition. No difference.
ggplot(insurance, aes(x=as.factor(children), y=age)) +
  geom_boxplot(col="red")+
  stat_boxplot(col="red", geom="errorbar", width=0.5)+
  xlab("Children Number")+
  geom_jitter(position=position_jitter(0.02))+
  ggtitle("Boxplot of Children Numbers in Age Condition")

BMI Distribution in Different Regions

The BMI in each region has a distribution similar to a normal distribution. The BMI values for the northeast, northwest and southwest regions have a similar average around 30 (kg/m^2). The BMI distribution of the southeast region is shifted right compared to the other three regions. It indicates that the overall BMI in southeast region is higher than other regions.

# Histogram of BMI distribution in each region. Average BMI in the southeast region is a bit higher than in the other three regions.
ggplot(data=insurance, aes(bmi)) + 
  geom_histogram(binwidth =1) +
  facet_grid(rows=vars(region))+
  ggtitle("Histogram of BMI Distribution in Different Regions")

Age Distribution in Different Regions

There are spikes at ages 18 and 19 in all four regions. Outside this age range, the distribution of age in the four regions is relatively even and similar. There appears to be no significant difference in age between regions.

# Histogram of age distribution in each region. There is no significant difference in the age distribution across the four regions.
ggplot(data=insurance, aes(age)) + 
  geom_histogram() +
  facet_grid(rows=vars(region))+
  ggtitle("Histogram of Age Distribution in Different Regions")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Charges Distribution in Different Regions

The log2 transformed charges in the four regions have a similar average. There are lower charges in the southern regions compared to the northern regions indicated by the longer left tails. More people pay very high insurance premium charges in the southern regions compared to northern regions shown by the higher density in the right tails.

# Histogram of insurance charge in different region.
# after taking log2 on charges, the distribution of charges are closer to normal distribution.
ggplot(data=insurance, aes(log2(charges))) + 
  geom_histogram() +
  facet_grid(rows=vars(region))+
  ggtitle("Histogram of Insurance Charges in Different Regions")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

BMI vs Age

The BMI spread at each age is relatively similar. The most extreme values are 3 cases with a BMI above 50, well beyond the obesity threshold of 30, and these occur between 18 and 23 years old. People older than 60 show a narrower BMI range, between 20 and 40. However, the main spread of BMI is generally very similar across ages, concentrating between 19 and 43. There is a slight trend that BMI increases as age increases. A Pearson’s correlation test gives a correlation of 0.1092719 with a significant p-value (<0.001). A correlation this weak is not a compelling finding on its own; however, CDC research on BMI in the US also shows higher BMI in certain age ranges than in others.
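The Pearson test mentioned above can be run with `cor.test()` from base R. The sketch below uses only the six rows printed by `head(insurance)` so that it runs on its own; with so few rows it will not reproduce the reported correlation of 0.1092719, which comes from all 1338 rows. The variable names are hypothetical.

```r
# Age and BMI of the first six rows printed by head(insurance)
age_head <- c(19, 18, 28, 33, 32, 31)
bmi_head <- c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740)

# Pearson's product-moment correlation test (two-sided by default)
ct <- cor.test(age_head, bmi_head, method = "pearson")

ct$estimate   # sample correlation coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation
ct$conf.int   # 95% confidence interval for the correlation
```

On the full data the equivalent call would be `cor.test(insurance$age, insurance$bmi)`.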

# Plot the bmi vs. age.  
ggplot(insurance, aes(x=age, y=bmi)) +
  geom_point()+
  geom_smooth(method = "lm", se = FALSE, color = "red", size=1.4)+
  ggtitle("Plot of BMI vs. Age")

## `geom_smooth()` using formula 'y ~ x'

Charges vs Age

The plot clearly shows the minimum insurance premium charges are associated with age. As age increases the minimum charges increase gradually as well. However, the ranges of the high premium charges are not significantly different among all ages. There are three levels of insurance charges in all ages. This level difference probably occurs because the charges are determined by two other main criteria besides age.

# Plot the charges vs. age. There is an obvious trend that charges are higher as age increases.
ggplot(insurance, aes(x=age, y=log2(charges))) +
  geom_point()+
  geom_smooth(method = "lm", se = FALSE, color = "red", size=1.4)+
  ggtitle("Plot of Insurance Charges vs. Age")

## `geom_smooth()` using formula 'y ~ x'

Charges vs BMI by Smoking Condition

For each BMI value, the log2 charges are spread from about 10 to 15. Interestingly, the log2 charges drop once BMI is larger than 41. Smokers, shown by the orange dots, clearly pay higher charges than non-smokers; these dots cluster above log2(charges) = 14. The linear regression line indicates a positive trend: as BMI increases, the charges increase as well, although the trend appears small.

# Plot the charges vs. bmi. When bmi is larger than 30, the charges increase. Smokers clearly pay higher charges.
ggplot(insurance, aes(x=bmi, y=log2(charges),col=smoker)) +
  geom_point()+
  scale_color_manual(values=c("#999999", "#E69F00")) +
  geom_smooth(method = "lm", se = FALSE, color = "yellow", size=1.4)+
  ggtitle("Plot of Insurance Charges vs. BMI")

## `geom_smooth()` using formula 'y ~ x'

Charges vs Children Number

The lower whiskers of log2 charges rise as the children number increases, until plateauing at children number 4. This may indicate that the minimum insurance premium charges are associated with children number. However, the upper whiskers for children numbers 0, 1, and 4 are similar to one another and higher than those of the other three groups. The log2 charges range for 5 children is the narrowest, which may be because the fewest people have 5 children.

# Plot the charges vs. number of children. 
ggplot(insurance, aes(x=children, y=log2(charges))) +
  geom_point()+
  geom_smooth(method = "lm", se = FALSE, color = "orange", size=1.4)+
  ggtitle("Plot of Insurance Charges vs. Number of Children")

## `geom_smooth()` using formula 'y ~ x'

Check the correlation between variables.

Correlation plots help to see the correlations between variables and help decide which variables will be used in a multiple linear regression that models charges. Smoker and charges show the highest correlation (0.7~0.8). Second is age, followed by BMI, then children number. There is no visible correlation of sex and region with charges. So in the multiple linear regression analysis, we consider smoker, age, BMI and children as the four independent variables to model charges.

#change character variables to numeric. 
insuranceNum=insurance
insuranceNum$sex <- ifelse(insuranceNum$sex=="male", 1, 0)
insuranceNum$smoker <- ifelse(insuranceNum$smoker=="no", 0, 1)
regionType <- c(northeast = 1, northwest = 2, southeast = 3, southwest=4)
insuranceNum$region <- regionType[insuranceNum$region]

#Plot the correlation matrix. 
corrplot(cor(insuranceNum), method="shade")
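Coding region as 1 to 4 imposes an arbitrary order on the four regions, so its correlation with charges depends on that order. One possible alternative, not used in this report, is to expand region into one 0/1 indicator column per level with `model.matrix()`. The sketch below uses only the regions and charges of the six rows printed by `head(insurance)`; the data frame name `toy` is hypothetical.

```r
# Region and charges of the first six rows printed by head(insurance).
# All four levels are declared even though northeast does not occur here.
toy <- data.frame(
  region = factor(c("southwest", "southeast", "southeast",
                    "northwest", "northwest", "southeast"),
                  levels = c("northeast", "northwest", "southeast", "southwest")),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

# One indicator column per region; "- 1" drops the intercept so that
# no level is absorbed into a baseline
dummies <- model.matrix(~ region - 1, data = toy)
dummies

# With the full data set, each indicator could be correlated with the
# transformed charges, for example:
#   cor(model.matrix(~ region - 1, data = insurance), log2(insurance$charges))
```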

Potential Confounders and Covariates

Age may be a confounding variable. This variable could be confounding because it directly affects the insurance charges and could also affect the children variable. This relationship between age and children occurs because the youngest and oldest population included have a lower prevalence of having dependents on their insurance plan while the middle aged population is more likely to have children on their insurance plan.

BMI may also be a confounding variable because, according to the CDC, the prevalence of obesity in American adults is about 40% among adults between the ages of 20 and 39, 44.8% among adults aged 40-59, and 42.8% among adults aged 60 and older. Both age and BMI drive up medical costs, and they appear to be related to one another, with a higher prevalence of high BMI (risk of obesity) as age increases. Therefore, we cannot say with 100% confidence whether it is age, BMI, or the interplay of the two that is driving up the medical costs for this sample population.

Hypothesis Testing and Interpretation

Hypotheses

Null hypothesis 1: The average insurance premium charge is the same for those who smoke and those who do not smoke. \[H_{0,1}: \beta_1 = 0\]

Alternative hypothesis 1: The average insurance premium charge is not the same for those who smoke and those who do not smoke. \[H_{A,1}: \beta_1 \neq 0\]

Null hypothesis 2: The average insurance premium charge is the same for individuals of all BMI measurements. \[H_{0,2}: \beta_2 = 0\]

Alternative hypothesis 2: The average insurance premium charge is not the same for individuals of all BMI measurements. \[H_{A,2}: \beta_2 \neq 0\]

Null hypothesis 3: The average insurance premium charge is the same for all ages. \[H_{0,3}: \beta_3 = 0\]

Alternative hypothesis 3: The average insurance premium charge is not the same for all ages. \[H_{A,3}: \beta_3 \neq 0\]

Null hypothesis 4: The average insurance premium charge is the same for individuals with any number of children. \[H_{0,4}: \beta_4 = 0\]

Alternative hypothesis 4: The average insurance premium charge is not the same for individuals with any number of children. \[H_{A,4}: \beta_4 \neq 0\]

Multiple Linear Regression Model

We will consider the age, children, BMI, and smoker variables to model the log 2 transformation of the charges variable. We will model the log 2 transformation of the charges variable because the original data is not normally distributed. Once this variable is transformed a distribution closer to normal occurs.

There are four factors used to determine the premium amount for an individual: where they live, smoking status, age, and the number of people the policy covers. All of these variables were included in the model except for location. We decided to remove region because, when it was modeled as a simple linear regression with charges, the resulting p-value was not statistically significant. Additionally, the correlation between region and charges was very close to 0 (~-0.006), demonstrating little or no relationship between the two variables. By the same reasoning, we chose to include BMI, as its simple linear regression model produced a statistically significant p-value as well as a larger Pearson’s correlation coefficient (~0.20). Finally, the sex of an individual cannot be used to determine the premium and did not have a strong correlation with charges (~0.06). Therefore, sex was not included in our model.

We will use a multiple linear regression model to predict the charges. We will use this model because we have multiple variables and want to test if a linear relationship exists between the explanatory variables and the dependent variable.

Equation without Coefficients

The multiple linear regression model will be based on Equation I shown below.

Equation I: \[log_2(charges) = \beta_0 + \beta_1 * (smoker) + \beta_2 * (bmi) + \beta_3 * (age) + \beta_4 * (children)\]

model <- lm(log2(charges) ~ smoker + bmi + age + children, data = insurance)
model

## 
## Call:
## lm(formula = log2(charges) ~ smoker + bmi + age + children, data = insurance)
## 
## Coefficients:
## (Intercept)    smokeryes          bmi          age     children  
##    10.07402      2.22643      0.01531      0.05018      0.14600

Equation with Coefficients

The model calculated coefficients for each variable and an intercept as shown in Equation II.

Equation II: \[log_2(charges) = 10.07 + 2.23 * (smoker) + 0.02 * (bmi) + 0.05 * (age) + (0.15) * children\]
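Equation II can be evaluated directly to turn a set of attributes into a predicted charge. The sketch below hard-codes the coefficients printed by `lm()` above; the function name `predict_charges` is hypothetical. Note that raising 2 to the predicted log2 value gives a prediction on the dollar scale that is closer to a typical (geometric mean) charge than to the arithmetic mean.

```r
# Coefficients copied from the lm() output above
coefs <- c(intercept = 10.07402, smokeryes = 2.22643,
           bmi = 0.01531, age = 0.05018, children = 0.14600)

# Hypothetical helper: predicted charges in dollars from Equation II.
# smoker is coded 1 for "yes" and 0 for "no".
predict_charges <- function(smoker, bmi, age, children) {
  log2_charges <- coefs[["intercept"]] +
    coefs[["smokeryes"]] * smoker +
    coefs[["bmi"]] * bmi +
    coefs[["age"]] * age +
    coefs[["children"]] * children
  2^log2_charges  # back-transform from the log2 scale to dollars
}

# Intercept only: every explanatory variable set to zero
predict_charges(smoker = 0, bmi = 0, age = 0, children = 0)

# Attributes of the first row printed by head(insurance)
predict_charges(smoker = 1, bmi = 27.9, age = 19, children = 0)
```

With the fitted object available, `2^predict(model, newdata = ...)` would give the same values without copying coefficients by hand.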

Multiple linear regression analysis and results

sum_model <- summary(model)
p_vals <- sum_model$coefficients[2:5,4]
sum_model

## 
## Call:
## lm(formula = log2(charges) ~ smoker + bmi + age + children, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.60629 -0.28684 -0.06763  0.10384  2.99476 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.074017   0.100589 100.151  < 2e-16 ***
## smokeryes    2.226430   0.043912  50.703  < 2e-16 ***
## bmi          0.015306   0.002923   5.236 1.91e-07 ***
## age          0.050181   0.001270  39.502  < 2e-16 ***
## children     0.145997   0.014714   9.922  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6479 on 1333 degrees of freedom
## Multiple R-squared:  0.7622, Adjusted R-squared:  0.7614 
## F-statistic:  1068 on 4 and 1333 DF,  p-value: < 2.2e-16

##                   2.5 %      97.5 %
## (Intercept) 9.876688044 10.27134617
## smokeryes   2.140286907  2.31257350
## bmi         0.009571357  0.02104162
## age         0.047688543  0.05267271
## children    0.117132350  0.17486221

Table I: This table shows the p-values and the 95% confidence intervals for each parameter in the model.
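The confidence intervals in Table I can be re-derived from the estimates and standard errors in the model summary, as estimate ± t × standard error with 1333 residual degrees of freedom. The sketch below hard-codes those printed values; the variable names are hypothetical.

```r
# Estimates and standard errors copied from the summary(model) output
est <- c(smokeryes = 2.226430, bmi = 0.015306,
         age = 0.050181, children = 0.145997)
se  <- c(smokeryes = 0.043912, bmi = 0.002923,
         age = 0.001270, children = 0.014714)

# Residual degrees of freedom reported in the summary
df_resid <- 1333

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Lower and upper bounds for each coefficient
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 4)
```

With the fitted object available, `confint(model)` from base R returns these intervals directly.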

Results

Interpretation

According to the results shown in the Table I, all the variables showed that they have a statistically significant association to charges. This interpretation is made because the significance level was set at 0.05. Each p-value was below this threshold; therefore, we can reject each null hypothesis. We estimate there is an association between each explanatory variable and charges when adjusting for the other variables.

Each explanatory variable has a different effect on the log2 charges outcome, as shown by the equation calculated by the multiple linear regression model. We estimate that the average log2 charges goes up by approximately 2.23 (log2 dollars) when an individual is a smoker compared to a non-smoker while keeping the other variables constant. We estimate that the average log2 charges goes up by approximately 0.02 (log2 dollars) when an individual’s BMI increases by one unit (kg/m^2) while keeping the other variables constant. We estimate that the average log2 charges goes up by approximately 0.05 (log2 dollars) when an individual’s age increases by one year while keeping the other variables constant. We estimate that the average log2 charges goes up by approximately 0.15 (log2 dollars) when an individual’s number of dependents increases by one while keeping the other variables constant. These results mean that, for each variable, there is low compatibility with the null hypothesis. Therefore, the situation in which a variable has no effect on the charges variable, that is, its coefficient is equal to 0, is very unlikely, and we can reject each null hypothesis. Finally, the intercept in the equation, approximately 10.07, is the value of log2(charges) for an individual when all variables are equal to zero. Therefore, an individual who is 0 years old, has a BMI of 0, does not smoke, and has 0 children would be billed approximately $1077.91 for health insurance. Because no one in the data set has an age or BMI of 0, this value is an extrapolation rather than a meaningful prediction.
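Because the outcome is on the log2 scale, each coefficient can also be read as a multiplicative effect on charges: raising 2 to the coefficient gives the factor by which charges change for a one-unit increase. The sketch below applies this to the estimates from the model summary; the variable names are hypothetical.

```r
# Estimates copied from the summary(model) output
coefs <- c(smokeryes = 2.226430, bmi = 0.015306,
           age = 0.050181, children = 0.145997)

# Factor by which charges are multiplied per one-unit increase
fold_change <- 2^coefs

# The same effect expressed as a percentage change
percent_change <- 100 * (fold_change - 1)

round(fold_change, 3)
round(percent_change, 1)
```

Read this way, the smoker coefficient corresponds to charges roughly 4.7 times those of a comparable non-smoker, and each additional year of age to an increase of about 3.5%, holding the other variables constant.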

Additionally, the 95% confidence intervals are shown with the p-values in Table I. A 95% confidence interval means that if the data were sampled and the model refit many times, about 95% of the resulting intervals would contain the true coefficient value.

We decided not to include region in our multiple linear regression model. This decision was made due to two factors. First, there was minimal correlation (~-0.006) between charges and region. Second, when a simple linear regression model was run with region as the explanatory variable, the result was statistically insignificant (p-value = 0.119).

Limitations

As previously stated, the BMI metric can inaccurately measure the health of an individual. If BMI is used as a metric for an individual’s health, we would expect larger BMI values to increase insurance charges. Due to BMI’s unreliability, this assumption will not be true in all cases. Second, the region variable takes only four values: northeast, northwest, southeast, and southwest. Splitting the country into four quadrants does not allow us to draw specific conclusions about where individuals live. We know that location is used as a factor for determining insurance cost. Therefore, we would expect location to be effective in predicting the charge amounts, though this expectation may not hold for this dataset due to the manner in which region was recorded. Third, the age data contains far more 18 and 19 year-old individuals than any other age bin. The dataset provides no reason for this; therefore, we cannot explain its occurrence, and this imbalance could affect the results. Finally, the children variable is a good estimate of the number of dependents or people on an individual’s plan. However, this variable does not include instances where a spouse is a dependent on a plan. Therefore, some rows may misrepresent how many individuals are covered by a certain plan.

References

1. Kurt, D. (2021, September 8). Health Insurance Premium. Investopedia. Retrieved October 7, 2021, from https://www.investopedia.com/health-insurance-premium-4773146.
2. What is underwriting? (2021, March 13). Healthinsurance.org. https://www.healthinsurance.org/glossary/underwriting/
3. When Is Medical Underwriting Still Used? (2021, April 20). Verywell Health. https://www.verywellhealth.com/what-is-medical-underwriting-4178117
4. How Health Insurance Marketplace® Plans Set Your Premiums. HealthCare.gov. https://www.healthcare.gov/how-plans-set-your-premiums/
5. How Age Affects Health Insurance Costs. ValuePenguin. https://www.valuepenguin.com/how-age-affects-health-insurance-costs
6. Adult Obesity Facts. (2021, June 7). Centers for Disease Control and Prevention. https://www.cdc.gov/obesity/data/adult.html

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrplot_0.90    gmodels_2.18.1   dplyr_1.0.7      crosstable_0.2.1
## [5] ggplot2_3.3.5    webshot_0.5.2    magick_2.7.3     kableExtra_1.3.4
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2        tidyr_1.1.3       jsonlite_1.7.2    viridisLite_0.4.0
##  [5] splines_4.1.1     gtools_3.9.2      assertthat_0.2.1  highr_0.9        
##  [9] yaml_2.2.1        gdtools_0.2.3     pillar_1.6.2      backports_1.2.1  
## [13] lattice_0.20-44   glue_1.4.2        uuid_0.1-4        digest_0.6.27    
## [17] checkmate_2.0.0   rvest_1.0.1       colorspace_2.0-2  htmltools_0.5.2  
## [21] Matrix_1.3-4      pkgconfig_2.0.3   purrr_0.3.4       scales_1.1.1     
## [25] processx_3.5.2    gdata_2.18.0      svglite_2.0.0     officer_0.4.0    
## [29] tibble_3.1.4      mgcv_1.8-36       generics_0.1.0    farver_2.1.0     
## [33] ellipsis_0.3.2    withr_2.4.2       survival_3.2-11   magrittr_2.0.1   
## [37] crayon_1.4.1      evaluate_0.14     ps_1.6.0          fansi_0.5.0      
## [41] nlme_3.1-152      MASS_7.3-54       forcats_0.5.1     xml2_1.3.2       
## [45] tools_4.1.1       data.table_1.14.0 lifecycle_1.0.0   stringr_1.4.0    
## [49] flextable_0.6.9   munsell_0.5.0     zip_2.2.0         callr_3.7.0      
## [53] compiler_4.1.1    jquerylib_0.1.4   systemfonts_1.0.2 rlang_0.4.11     
## [57] grid_4.1.1        rstudioapi_0.13   base64enc_0.1-3   labeling_0.4.2   
## [61] rmarkdown_2.11    gtable_0.3.0      DBI_1.1.1         R6_2.5.1         
## [65] knitr_1.34        fastmap_1.1.0     utf8_1.2.2        nortest_1.0-4    
## [69] stringi_1.7.4     Rcpp_1.0.7        vctrs_0.3.8       tidyselect_1.1.1 
## [73] xfun_0.26