Arthritis in the USA

Peter Kamerman

9 March 2019

Overview

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephonic survey that takes place across all 50 states of the United States (US), the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands. The goal of the survey is the collection of state and territory data on preventative health practices and risk behaviours that are linked to chronic diseases, injuries, and preventable infectious diseases. The survey is limited to non-institutionalized US/US territory residents who are 18 years old.¹

The survey uses probability sampling methods to select the sample, and the sampling is stratified by state/territory. Within each state, sampling is based on random digit dialing, using records of household land-line numbers and cellphone numbers within each state (there is almost universal access to a land-line in US households). The sampling of cellphone numbers is by simple random sampling, while land-line number sampling uses disproportionate stratified sampling (DSS), which stratifies the sampling according to presumed housing density. Sampling is also geographically stratified within each state, and the timing of the calls is spread across the time of day, and across weekdays/weekends/holidays to accommodate the work schedules of potential participants.

Non-response and targeted under/over sampling of particular groups can introduce distortions in the data relative to the underlying population structure, hence weighting factors are calculated for the data. BRFSS weighting corrections are applied using a method called raking, and the stratification (_psu: population stratification unit) and weighting factors (_llcpwt: final weight) are provided in the brfss2013 dataset². Both the stratification and weighting of the data can be factored into an analysis using the ‘survey’ package.

Research questions

I chose to focus on the data from the core questionnaire (mandatory across all states/territories) that deals with the presence of arthritic pain. Arthritic pain is the most common cause of chronic pain in the general population³, and it contributes to a lower quality of life. With the BRFSS dataset we have an opportunity to undertake an exploratory analysis of the burden of arthritis in a large, representative sample of the US adult population.

I limited the data analysed to the 50 US states. In all cases I analysed the data by state.

Case ascertainment for arthritis was based on question ‘havarth3’ in the Chronic Health Conditions section of the BRFSS Main Survey. The question asks the following:

Has a doctor, nurse, or other health professional ever told you that you had any of the following? For each, tell me ‘Yes’, ‘No’, or you’re ‘Not sure’:

(Ever told) you have some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia? (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, etc.)

Only participants who answered ‘Yes’ or ‘No’ where included in the analyses (n = 488,794).

Because the analysis is based on 2013 data only, and I am not comparing across years, I have not adjusted (transformed) the data to a reference population (e.g., the US 2000 Standard Population). As such, prevalence estimates are crude (unadjusted) estimates.

Question 1:

What is the crude prevalence of arthritis in each state?

Question 2:

What is the prevalence of arthritis in each state by age group?

Question 3:

What is the prevalence of arthritis in each state by sex?

NOTE: In line with the exploratory nature of the exercise, I only report the outcome of exploratory numerical and graphical analyses, and have not extended the analyses to include hypothesis tests or modelling.

Data extraction and cleaning

In preparation for the analysis I did the following:

Imported the data into R;
Extract the required columns of data;
Generated a survey design object (includes sample stratification and weighting information).

For part 2, I extracted the following columns to assess their suitability for analysis:

State (_state):

state [categorical: names of states and territories]

Age (_age_g):

Age (in years) in six mutually exclusive groups? [categorical: Age 18 to 24, Age 25 to 34, Age 35 to 44, Age 45 to 54, Age 55 to 64, Age 65 or older]

Sex (sex):

Indicate sex of respondent [categorical: Male, Female]

Diagnosed with arthritis (havarth3):

Ever told you have arthritis? [categorical: Yes, No, NA = other/missing]

In addition, I also extracted variables required for building the design object, namely, the primary sampling unit (_psu), sample design stratification variable (_ststr), and final weight for landline And cellphone data (_llcpwt).

My initial cleaning steps included: i) selecting the columns I wanted; ii) Guam, Puerto Rico, and Washington DC; iii) cleaned the column names; iv) fixed some ordinal variables; and v) removed missing data from the design variables.

#-- Import data --#
load(url('https://www.dropbox.com/s/al9wx65xywredzq/brfss2013.RData?dl=1'))

#-- Select columns --#
brfss2013 %<>%
    select(X_psu, X_llcpwt, X_ststr, # Stratification and weighting factors
           X_state, # State and residency information
           X_age_g, sex, # Demographics
           havarth3) # Arthritis

#-- Reduce to 50 states --#
brfss2013 %<>%
    # Filter out District of Columbia, Guam, Puerto Rico, and odd values (80, 0)
    filter(!X_state %in% c('District of Columbia',
                           'Guam', 'Puerto Rico',
                           '80', '0')) %>%
    # Remove resulting empty state factors
    mutate(X_state = fct_drop(X_state))

#-- Rename columns --#
brfss2013 %<>%
    rename(psu = X_psu,
           llcpwt = X_llcpwt,
           strata = X_ststr,
           state = X_state,
           age = X_age_g,
           dx_arthritis = havarth3)

#-- Order age factor levels --#
brfss2013 %<>%
    mutate(age = factor(age, 
                        ordered = TRUE))

#-- Remove missing survey design variables --#
brfss2013 %<>%
    filter(!is.na(psu)) %>%
    filter(!is.na(strata)) %>% 
    filter(!is.na(llcpwt))

After this preliminary clean of the data, I inspected the variables of interest and the survey design variables.

Tabular summary of variables of interest

brfss2013 %>% 
    select(-psu, -strata, -llcpwt) %>% 
    skim() %>% 
    kable()

skim_type	skim_variable	n_missing	complete_rate	factor.ordered	factor.n_unique	factor.top_counts
factor	state	0	1.0000000	FALSE	50	Flo: 33668, Kan: 23282, Neb: 17139, Mas: 15071
factor	age	5	0.9999896	TRUE	6	Age: 156868, Age: 107011, Age: 81549, Age: 58564
factor	sex	2	0.9999958	FALSE	2	Fem: 282562, Mal: 196380
factor	dx_arthritis	2866	0.9940160	FALSE	2	No: 314571, Yes: 161507

There are some missing data for arthritis diagnosis, but they only make-up a small percentage of the total survey population.

Survey design variables

The sum of survey weight should approximate the population of the US. In this case the sum of the weights is 242782057, and so I am satisfied with the weights.

The number of unique primary sampling units (PSU) is 33909, and the number of unique strata is 1294

Code to generate a design object

Generating a design object is a crucial part of analysing complex survey data because the design object factors in the stratification and weightings that are then incorporated into all subsequent analyses.

#-- Set options for control standard error estimation --#
options(survey.lonely.psu = 'certainty')

#-- Create survey design object --#
brfss2013_design <- svydesign(ids = ~psu,
                              weight = ~llcpwt,
                              strata = ~strata,
                              nest = TRUE,
                              data = brfss2013)

#-- Summary of the design --#
brfss2013_design

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (467659) clusters.
## svydesign(ids = ~psu, weight = ~llcpwt, strata = ~strata, nest = TRUE, 
##     data = brfss2013)

Research questions

Question 1

What is the prevalence of arthritis in each state?

Code to generate the arthritis by state data

#-- Calculate proportion with arthritis by state --#
arthritis_state <- svyby(formula = ~dx_arthritis, 
                         by = ~state, 
                         design = brfss2013_design, 
                         FUN = svymean, 
                         na.rm = TRUE)

Code to clean and plot the data

#-- Clean-up and convert to a percentage --#
prop_state <- arthritis_state %>%
    select(1:2) %>% 
    set_names(c('state', 'proportion')) %>% 
    mutate(state = as.character(state)) %>% 
    mutate_if(is.numeric, list(~round(. * 100)))

#-- Get conficence intervals --#
ci_state <- confint(arthritis_state) %>% 
    as.data.frame(.) %>% 
    rownames_to_column() %>% 
    mutate_if(is.numeric, list(~round(. * 100))) %>% 
    set_names(c('state', 'proportion_low', 'proportion_upp')) %>% 
    filter(str_detect(state, 'Yes')) %>% 
    mutate(state = str_remove(state, ':.+')) 

#-- Merge proportions and CI --#
prop_state %<>% 
    left_join(ci_state)
    
#-- Add interactive tooltip info --#
prop_state %<>%
    mutate(tooltip = paste0('<b>', state, '</b><br>',
                            '<em>Prevalence of arthritis</em><br>',
                            proportion, '% (95% CI: ',
                            proportion_low, ' to ', 
                            proportion_upp, ')')) 

#-- Create a simple features (sf) map of the 50 US states --#
map_us <- fifty_states %>% 
    # Create map
    st_as_sf(., 
             coords = c("long", "lat"),
             crs = "+proj=longlat +datum=WGS84 +no_defs") %>% 
    group_by(id, piece) %>% 
    summarise(do_union = FALSE) %>% 
    st_cast("POLYGON") %>% 
    st_cast("MULTIPOLYGON") %>% 
    ungroup() %>% 
    # Add a state column
    mutate(state = str_to_title(id)) %>% 
    # Remove old id column
    select(-id)

#-- Join map_us with arthritis_state --#
map_state <- map_us %>%
    left_join(prop_state) 

# Plot
gg_state <- ggplot(map_state) +
    geom_sf_interactive(aes(tooltip = tooltip,
                            fill = proportion), 
             color = '#FFFFFF',
             size = 0.25) +
    scale_fill_viridis_c(name = 'Prevalence (%)',
                         direction = -1) +
    labs(title = 'Prevalence of diagnosed arthritis by US state*',
         subtitle = '(adult non-institutionalized population)', 
         caption = '* Move cursor over each state to see state-specific estimate and 95% confidence interval.') +
    theme_void() +
    theme(panel.grid = element_line(colour = '#FFFFFF'),
          plot.caption = element_text(face = 'italic',
                                      hjust = 0))

ggiraph(ggobj = gg_state,
        width = 1,
        hover_css = 'cursor:pointer;
        fill:#FFFFFF;
        stroke:#000000;',
        tooltip_extra_css = 'background-color:#000000;
font-family:sans-serif,arial;
font-size:80%;
color:#FFFFFF;
padding:10px;
border-radius:10px 10px 10px 10px;')

Question 2

What is the prevalence of arthritis in each state by age group?

Code to generate the arthritis by state for each age group data

#-- Calculate proportion (95%CI) with arthritis by state --#
arthritis_age <- svyby(formula = ~dx_arthritis, 
                       by = ~state + age, 
                       design = brfss2013_design, 
                       FUN = svymean, 
                       na.rm = TRUE)

Code to clean and plot the data

#-- Clean-up and convert to a percentage --#
prop_age <- arthritis_age %>%
    select(1:3) %>% 
    set_names(c('state', 'age', 'proportion')) %>% 
    mutate(state = as.character(state)) %>% 
    mutate_if(is.numeric, list(~round(. * 100)))

#-- Get the confidence intervals --#
ci_age <- confint(arthritis_age) %>% 
    as.data.frame(.) %>% 
    rownames_to_column() %>% 
    mutate_if(is.numeric, list(~round(. * 100))) %>% 
    set_names(c('state', 'proportion_low', 'proportion_upp')) %>% 
    filter(str_detect(state, 'Yes')) %>% 
    mutate(state = str_remove(state, ':.+')) %>% 
    separate(col = state, 
             into = c('state', 'age'),
             sep = '\\.') %>% 
    mutate(age = factor(age, ordered = TRUE))

#-- Merge proportion and CI --#
prop_age %<>% 
    left_join(ci_age)


#-- Add interactive tooltip info --#
prop_age %<>%
    mutate(tooltip = paste0('<b>', state, ': ', age, '</b><br>',
                            '<em>Prevalence of arthritis</em><br>',
                            proportion, '% (95% CI: ',
                            proportion_low, ' to ', 
                            proportion_upp, ')')) 

#-- Join map_us with arthritis_state --#
map_age <- map_us %>%
    left_join(prop_age) %>% 
    filter(!is.na(age))

# Plot
gg_age <- ggplot(map_age) +
    geom_sf_interactive(aes(tooltip = tooltip,
                            fill = proportion), 
             color = '#FFFFFF',
             size = 0.25) +
    scale_fill_viridis_c(name = 'Prevalence (%)',
                         direction = -1) +
    labs(title = 'Prevalence of diagnosed arthritis by US state and age*',
         subtitle = '(adult non-institutionalized population)', 
         caption = '* Move cursor over each state to see state-specific estimate and 95% confidence interval.') +
    facet_wrap(~age, ncol = 3) +
    theme_void() +
    theme(panel.grid = element_line(colour = '#FFFFFF'),
          strip.text = element_text(size = 11,
                                    margin = margin(t = 1, r = 0, b = 0.1, l = 0, unit = "line")),
          plot.caption = element_text(face = 'italic',
                                      hjust = 0))

ggiraph(ggobj = gg_age,
        width = 1,
        hover_css = 'cursor:pointer;
        fill:#FFFFFF;
        stroke:#000000;',
        tooltip_extra_css = 'background-color:#000000;
font-family:sans-serif,arial;
font-size:80%;
color:#FFFFFF;
padding:10px;
border-radius:10px 10px 10px 10px;')

Question 3

What is the prevalence of arthritis in each state by sex?

Code to generate the arthritis by state for each sex data

#-- Calculate proportion (95%CI) with arthritis by state --#
arthritis_sex <- svyby(formula = ~dx_arthritis, 
                       by = ~state + sex, 
                       design = brfss2013_design, 
                       FUN = svymean, 
                       na.rm = TRUE)

Code to clean and plot the data

#-- Clean-up and convert to a percentage --#
prop_sex <- arthritis_sex %>%
    select(1:3) %>% 
    set_names(c('state', 'sex', 'proportion')) %>% 
    mutate(state = as.character(state)) %>% 
    mutate_if(is.numeric, list(~round(. * 100)))

#-- Get the confidence intervals --#
ci_sex <- confint(arthritis_sex) %>% 
    as.data.frame(.) %>% 
    rownames_to_column() %>% 
    mutate_if(is.numeric, list(~round(. * 100))) %>% 
    set_names(c('state', 'proportion_low', 'proportion_upp')) %>% 
    filter(str_detect(state, 'Yes')) %>% 
    mutate(state = str_remove(state, ':.+')) %>% 
    separate(col = state, 
             into = c('state', 'age'),
             sep = '\\.') %>% 
    mutate(age = factor(age, ordered = TRUE))

#-- Merge proportion and CI --#
prop_sex %<>% 
    left_join(ci_sex)

#-- Add interactive tooltip info --#
prop_sex %<>%
    mutate(tooltip = paste0('<b>', state, ': ', sex, '</b><br>',
                            '<em>Prevalence of arthritis</em><br>',
                            proportion, '% (95% CI: ',
                            proportion_low, ' to ', 
                            proportion_upp, ')')) 

#-- Join map_us with arthritis_state --#
map_sex <- map_us %>%
    left_join(prop_sex) %>% 
    filter(!is.na(sex))

# Plot
gg_sex <- ggplot(map_sex) +
    geom_sf_interactive(aes(tooltip = tooltip,
                            fill = proportion), 
             color = '#FFFFFF',
             size = 0.25) +
    scale_fill_viridis_c(name = 'Prevalence (%)',
                         direction = -1) +
    labs(title = 'Prevalence of diagnosed arthritis by US state and sex*',
         subtitle = '(adult non-institutionalized population)', 
         caption = '* Move cursor over each state to see state-specific estimate and 95% confidence interval.') +
    facet_wrap(~sex, ncol = 2) +
    theme_void() +
    theme(panel.grid = element_line(colour = '#FFFFFF'),
          strip.text = element_text(size = 11,
                                    margin = margin(t = 1, r = 0, b = 0.1, l = 0, unit = "line")),
          plot.caption = element_text(face = 'italic',
                                      hjust = 0))

ggiraph(ggobj = gg_sex,
        width = 1,
        hover_css = 'cursor:pointer;
        fill:#FFFFFF;
        stroke:#000000;',
        tooltip_extra_css = 'background-color:#000000;
font-family:sans-serif,arial;
font-size:80%;
color:#FFFFFF;
padding:10px;
border-radius:10px 10px 10px 10px;')

Summary

These data are by no means novel, but they nicely illustrate some key epidemiological patterns.

Firstly, the prevalence of diagnosed arthritis differs substantially by state, with Eastern states tending to have a greater burden of disease. I did not directly address the question, but I suspect that differences in the age profile of states may play a major part in inter-state differences in disease burden.

I did however look at differences in arthritis across age groups. There is a marked increase in the prevalence of diagnosed cases of arthritis with increasing age. for example, the prevalence of arthritis in adults around the age of 20 years is about 2 to 5%, whereas the prevalence in disease is closer to 50% in people aged 65 years and older. There does not seem to be a sharp increase in the prevalence of arthritis at any particular age. Rather, the increased prevalence a appears to show a progress increase as age increases.

The last risk factor I looked at was the effect of sex on the prevalence of diagnosed cases of arthritis. Here, women showed a fairly marked increase in disease prevalence compared to men. The reasons for this difference is not understood.

Acknowledgements

This post was inspired by an assignment in the Coursera Introduction to Probability and Data topic in the Statistics with R Specialization offered through Duke University.

Session information

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## Random number generation:
##  RNG:     Mersenne-Twister 
##  Normal:  Inversion 
##  Sample:  Rounding 
##  
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] geofacet_0.2.0      survey_4.0          survival_3.2-9     
##  [4] Matrix_1.3-2        sf_0.9-7            fiftystater_1.0.1  
##  [7] skimr_2.1.3         magrittr_2.0.1      geojsonio_0.9.4    
## [10] lubridate_1.7.10    xml2_1.3.2          sp_1.4-5           
## [13] leaflet_2.0.4.1     highcharter_0.8.2   flexdashboard_0.5.2
## [16] gganimate_1.0.7     gdtools_0.2.3       ggiraph_0.7.8      
## [19] svglite_2.0.0       pander_0.6.3        knitr_1.31         
## [22] forcats_0.5.1       stringr_1.4.0       dplyr_1.0.5        
## [25] purrr_0.3.4         readr_1.4.0         tidyr_1.1.3        
## [28] tibble_3.1.0        ggplot2_3.3.3       tidyverse_1.3.0    
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-0    ellipsis_0.3.1      class_7.3-18       
##  [4] rgdal_1.5-23        base64enc_0.1-3     fs_1.5.0           
##  [7] rstudioapi_0.13     httpcode_0.3.0      farver_2.1.0       
## [10] ggrepel_0.9.1       fansi_0.4.2         splines_4.0.4      
## [13] rlist_0.4.6.1       jsonlite_1.7.2      broom_0.7.5        
## [16] dbplyr_2.1.0        png_0.1-7           rgeos_0.5-5        
## [19] compiler_4.0.4      httr_1.4.2          backports_1.2.1    
## [22] assertthat_0.2.1    lazyeval_0.2.2      cli_2.3.1          
## [25] tweenr_1.0.1        htmltools_0.5.1.1   prettyunits_1.1.1  
## [28] tools_4.0.4         igraph_1.2.6        gtable_0.3.0       
## [31] glue_1.4.2          geojson_0.3.4       V8_3.4.0           
## [34] Rcpp_1.0.6          imguR_1.0.3         cellranger_1.1.0   
## [37] jquerylib_0.1.3     vctrs_0.3.6         crul_1.1.0         
## [40] crosstalk_1.1.1     xfun_0.22           rvest_1.0.0        
## [43] lifecycle_1.0.0     jqr_1.2.0           zoo_1.8-9          
## [46] scales_1.1.1        hms_1.0.0           yaml_2.2.1         
## [49] quantmod_0.4.18     curl_4.3            gridExtra_2.3      
## [52] sass_0.3.1          stringi_1.5.3       highr_0.8          
## [55] maptools_1.0-2      e1071_1.7-4         TTR_0.24.2         
## [58] boot_1.3-27         repr_1.1.3          rlang_0.4.10       
## [61] pkgconfig_2.0.3     systemfonts_1.0.1   geogrid_0.1.1      
## [64] evaluate_0.14       lattice_0.20-41     htmlwidgets_1.5.3  
## [67] labeling_0.4.2      tidyselect_1.1.0    geojsonsf_2.0.1    
## [70] R6_2.5.0            magick_2.7.0        generics_0.1.0     
## [73] DBI_1.1.1           pillar_1.5.1        haven_2.3.1        
## [76] foreign_0.8-81      withr_2.4.1         units_0.7-0        
## [79] xts_0.12.1          modelr_0.1.8        crayon_1.4.1       
## [82] uuid_0.1-4          KernSmooth_2.23-18  utf8_1.2.1         
## [85] rmarkdown_2.7       jpeg_0.1-8.1        progress_1.2.2     
## [88] rnaturalearth_0.1.0 readxl_1.3.1        data.table_1.14.0  
## [91] reprex_1.0.0        digest_0.6.27       classInt_0.4-3     
## [94] webshot_0.5.2       munsell_0.5.0       viridisLite_0.3.0  
## [97] kableExtra_1.3.4    bslib_0.2.4         mitools_2.4

Source: BRFSS website ↩︎
Preparing 2013 BRFSS module data for analysis: Link ↩︎
Source: Breivik et al., 2006. DOI: 10.1016/j.ejpain.2005.06.009 ↩︎

Arthritis in the USA

Peter Kamerman

Overview

Research questions

Data extraction and cleaning

Tabular summary of variables of interest

Survey design variables

Code to generate a design object

Research questions

Question 1

What is the prevalence of arthritis in each state?

Code to generate the arthritis by state data

Code to clean and plot the data

Question 2

What is the prevalence of arthritis in each state by age group?

Code to generate the arthritis by state for each age group data

Code to clean and plot the data

Question 3

What is the prevalence of arthritis in each state by sex?

Code to generate the arthritis by state for each sex data

Code to clean and plot the data

Summary

Acknowledgements

Session information

Share