Arthritis in the USA
Peter Kamerman
9 March 2019Overview
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephonic survey that takes place across all 50 states of the United States (US), the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands. The goal of the survey is the collection of state and territory data on preventative health practices and risk behaviours that are linked to chronic diseases, injuries, and preventable infectious diseases. The survey is limited to non-institutionalized US/US territory residents who are \(\geq\) 18 years old.1
The survey uses probability sampling methods to select the sample, and the sampling is stratified by state/territory. Within each state, sampling is based on random digit dialing, using records of household land-line numbers and cellphone numbers within each state (there is almost universal access to a land-line in US households). The sampling of cellphone numbers is by simple random sampling, while land-line number sampling uses disproportionate stratified sampling (DSS), which stratifies the sampling according to presumed housing density. Sampling is also geographically stratified within each state, and the timing of the calls is spread across the time of day, and across weekdays/weekends/holidays to accommodate the work schedules of potential participants.
Non-response and targeted under/over sampling of particular groups can introduce distortions in the data relative to the underlying population structure, hence weighting factors are calculated for the data. BRFSS weighting corrections are applied using a method called raking, and the stratification (_psu: population stratification unit) and weighting factors (_llcpwt: final weight) are provided in the brfss2013 dataset2. Both the stratification and weighting of the data can be factored into an analysis using the ‘survey’ package.
Research questions
I chose to focus on the data from the core questionnaire (mandatory across all states/territories) that deals with the presence of arthritic pain. Arthritic pain is the most common cause of chronic pain in the general population3, and it contributes to a lower quality of life. With the BRFSS dataset we have an opportunity to undertake an exploratory analysis of the burden of arthritis in a large, representative sample of the US adult population.
I limited the data analysed to the 50 US states. In all cases I analysed the data by state.
Case ascertainment for arthritis was based on question ‘havarth3’ in the Chronic Health Conditions section of the BRFSS Main Survey. The question asks the following:
Has a doctor, nurse, or other health professional ever told you that you had any of the following? For each, tell me ‘Yes’, ‘No’, or you’re ‘Not sure’:
(Ever told) you have some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia? (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, etc.)
Only participants who answered ‘Yes’ or ‘No’ where included in the analyses (n = 488,794).
Because the analysis is based on 2013 data only, and I am not comparing across years, I have not adjusted (transformed) the data to a reference population (e.g., the US 2000 Standard Population). As such, prevalence estimates are crude (unadjusted) estimates.
Question 1:
What is the crude prevalence of arthritis in each state?
Question 2:
What is the prevalence of arthritis in each state by age group?
Question 3:
What is the prevalence of arthritis in each state by sex?
NOTE: In line with the exploratory nature of the exercise, I only report the outcome of exploratory numerical and graphical analyses, and have not extended the analyses to include hypothesis tests or modelling.
Data extraction and cleaning
In preparation for the analysis I did the following:
- Imported the data into R;
- Extract the required columns of data;
- Generated a survey design object (includes sample stratification and weighting information).
For part 2, I extracted the following columns to assess their suitability for analysis:
State (_state):
- state [categorical: names of states and territories]
Age (_age_g):
- Age (in years) in six mutually exclusive groups? [categorical: Age 18 to 24, Age 25 to 34, Age 35 to 44, Age 45 to 54, Age 55 to 64, Age 65 or older]
Sex (sex):
- Indicate sex of respondent [categorical: Male, Female]
Diagnosed with arthritis (havarth3):
- Ever told you have arthritis? [categorical: Yes, No, NA = other/missing]
In addition, I also extracted variables required for building the design object, namely, the primary sampling unit (_psu), sample design stratification variable (_ststr), and final weight for landline And cellphone data (_llcpwt).
My initial cleaning steps included: i) selecting the columns I wanted; ii) Guam, Puerto Rico, and Washington DC; iii) cleaned the column names; iv) fixed some ordinal variables; and v) removed missing data from the design variables.
#-- Import data --#
load(url('https://www.dropbox.com/s/al9wx65xywredzq/brfss2013.RData?dl=1'))
#-- Select columns --#
%<>%
brfss2013 select(X_psu, X_llcpwt, X_ststr, # Stratification and weighting factors
# State and residency information
X_state, # Demographics
X_age_g, sex, # Arthritis
havarth3)
#-- Reduce to 50 states --#
%<>%
brfss2013 # Filter out District of Columbia, Guam, Puerto Rico, and odd values (80, 0)
filter(!X_state %in% c('District of Columbia',
'Guam', 'Puerto Rico',
'80', '0')) %>%
# Remove resulting empty state factors
mutate(X_state = fct_drop(X_state))
#-- Rename columns --#
%<>%
brfss2013 rename(psu = X_psu,
llcpwt = X_llcpwt,
strata = X_ststr,
state = X_state,
age = X_age_g,
dx_arthritis = havarth3)
#-- Order age factor levels --#
%<>%
brfss2013 mutate(age = factor(age,
ordered = TRUE))
#-- Remove missing survey design variables --#
%<>%
brfss2013 filter(!is.na(psu)) %>%
filter(!is.na(strata)) %>%
filter(!is.na(llcpwt))
After this preliminary clean of the data, I inspected the variables of interest and the survey design variables.
Tabular summary of variables of interest
%>%
brfss2013 select(-psu, -strata, -llcpwt) %>%
skim() %>%
kable()
skim_type | skim_variable | n_missing | complete_rate | factor.ordered | factor.n_unique | factor.top_counts |
---|---|---|---|---|---|---|
factor | state | 0 | 1.0000000 | FALSE | 50 | Flo: 33668, Kan: 23282, Neb: 17139, Mas: 15071 |
factor | age | 5 | 0.9999896 | TRUE | 6 | Age: 156868, Age: 107011, Age: 81549, Age: 58564 |
factor | sex | 2 | 0.9999958 | FALSE | 2 | Fem: 282562, Mal: 196380 |
factor | dx_arthritis | 2866 | 0.9940160 | FALSE | 2 | No: 314571, Yes: 161507 |
There are some missing data for arthritis diagnosis, but they only make-up a small percentage of the total survey population.
Survey design variables
The sum of survey weight should approximate the population of the US. In this case the sum of the weights is 242782057, and so I am satisfied with the weights.
The number of unique primary sampling units (PSU) is 33909, and the number of unique strata is 1294
Code to generate a design object
Generating a design object is a crucial part of analysing complex survey data because the design object factors in the stratification and weightings that are then incorporated into all subsequent analyses.
#-- Set options for control standard error estimation --#
options(survey.lonely.psu = 'certainty')
#-- Create survey design object --#
<- svydesign(ids = ~psu,
brfss2013_design weight = ~llcpwt,
strata = ~strata,
nest = TRUE,
data = brfss2013)
#-- Summary of the design --#
brfss2013_design
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (467659) clusters.
## svydesign(ids = ~psu, weight = ~llcpwt, strata = ~strata, nest = TRUE,
## data = brfss2013)
Research questions
Question 1
What is the prevalence of arthritis in each state?
Code to generate the arthritis by state data
#-- Calculate proportion with arthritis by state --#
<- svyby(formula = ~dx_arthritis,
arthritis_state by = ~state,
design = brfss2013_design,
FUN = svymean,
na.rm = TRUE)
Code to clean and plot the data
#-- Clean-up and convert to a percentage --#
<- arthritis_state %>%
prop_state select(1:2) %>%
set_names(c('state', 'proportion')) %>%
mutate(state = as.character(state)) %>%
mutate_if(is.numeric, list(~round(. * 100)))
#-- Get conficence intervals --#
<- confint(arthritis_state) %>%
ci_state as.data.frame(.) %>%
rownames_to_column() %>%
mutate_if(is.numeric, list(~round(. * 100))) %>%
set_names(c('state', 'proportion_low', 'proportion_upp')) %>%
filter(str_detect(state, 'Yes')) %>%
mutate(state = str_remove(state, ':.+'))
#-- Merge proportions and CI --#
%<>%
prop_state left_join(ci_state)
#-- Add interactive tooltip info --#
%<>%
prop_state mutate(tooltip = paste0('<b>', state, '</b><br>',
'<em>Prevalence of arthritis</em><br>',
'% (95% CI: ',
proportion, ' to ',
proportion_low, ')'))
proportion_upp,
#-- Create a simple features (sf) map of the 50 US states --#
<- fifty_states %>%
map_us # Create map
st_as_sf(.,
coords = c("long", "lat"),
crs = "+proj=longlat +datum=WGS84 +no_defs") %>%
group_by(id, piece) %>%
summarise(do_union = FALSE) %>%
st_cast("POLYGON") %>%
st_cast("MULTIPOLYGON") %>%
ungroup() %>%
# Add a state column
mutate(state = str_to_title(id)) %>%
# Remove old id column
select(-id)
#-- Join map_us with arthritis_state --#
<- map_us %>%
map_state left_join(prop_state)
# Plot
<- ggplot(map_state) +
gg_state geom_sf_interactive(aes(tooltip = tooltip,
fill = proportion),
color = '#FFFFFF',
size = 0.25) +
scale_fill_viridis_c(name = 'Prevalence (%)',
direction = -1) +
labs(title = 'Prevalence of diagnosed arthritis by US state*',
subtitle = '(adult non-institutionalized population)',
caption = '* Move cursor over each state to see state-specific estimate and 95% confidence interval.') +
theme_void() +
theme(panel.grid = element_line(colour = '#FFFFFF'),
plot.caption = element_text(face = 'italic',
hjust = 0))
ggiraph(ggobj = gg_state,
width = 1,
hover_css = 'cursor:pointer;
fill:#FFFFFF;
stroke:#000000;',
tooltip_extra_css = 'background-color:#000000;
font-family:sans-serif,arial;
font-size:80%;
color:#FFFFFF;
padding:10px;
border-radius:10px 10px 10px 10px;')
Question 2
What is the prevalence of arthritis in each state by age group?
Code to generate the arthritis by state for each age group data
#-- Calculate proportion (95%CI) with arthritis by state --#
<- svyby(formula = ~dx_arthritis,
arthritis_age by = ~state + age,
design = brfss2013_design,
FUN = svymean,
na.rm = TRUE)
Code to clean and plot the data
#-- Clean-up and convert to a percentage --#
<- arthritis_age %>%
prop_age select(1:3) %>%
set_names(c('state', 'age', 'proportion')) %>%
mutate(state = as.character(state)) %>%
mutate_if(is.numeric, list(~round(. * 100)))
#-- Get the confidence intervals --#
<- confint(arthritis_age) %>%
ci_age as.data.frame(.) %>%
rownames_to_column() %>%
mutate_if(is.numeric, list(~round(. * 100))) %>%
set_names(c('state', 'proportion_low', 'proportion_upp')) %>%
filter(str_detect(state, 'Yes')) %>%
mutate(state = str_remove(state, ':.+')) %>%
separate(col = state,
into = c('state', 'age'),
sep = '\\.') %>%
mutate(age = factor(age, ordered = TRUE))
#-- Merge proportion and CI --#
%<>%
prop_age left_join(ci_age)
#-- Add interactive tooltip info --#
%<>%
prop_age mutate(tooltip = paste0('<b>', state, ': ', age, '</b><br>',
'<em>Prevalence of arthritis</em><br>',
'% (95% CI: ',
proportion, ' to ',
proportion_low, ')'))
proportion_upp,
#-- Join map_us with arthritis_state --#
<- map_us %>%
map_age left_join(prop_age) %>%
filter(!is.na(age))
# Plot
<- ggplot(map_age) +
gg_age geom_sf_interactive(aes(tooltip = tooltip,
fill = proportion),
color = '#FFFFFF',
size = 0.25) +
scale_fill_viridis_c(name = 'Prevalence (%)',
direction = -1) +
labs(title = 'Prevalence of diagnosed arthritis by US state and age*',
subtitle = '(adult non-institutionalized population)',
caption = '* Move cursor over each state to see state-specific estimate and 95% confidence interval.') +
facet_wrap(~age, ncol = 3) +
theme_void() +
theme(panel.grid = element_line(colour = '#FFFFFF'),
strip.text = element_text(size = 11,
margin = margin(t = 1, r = 0, b = 0.1, l = 0, unit = "line")),
plot.caption = element_text(face = 'italic',
hjust = 0))
ggiraph(ggobj = gg_age,
width = 1,
hover_css = 'cursor:pointer;
fill:#FFFFFF;
stroke:#000000;',
tooltip_extra_css = 'background-color:#000000;
font-family:sans-serif,arial;
font-size:80%;
color:#FFFFFF;
padding:10px;
border-radius:10px 10px 10px 10px;')
Question 3
What is the prevalence of arthritis in each state by sex?
Code to generate the arthritis by state for each sex data
#-- Calculate proportion (95%CI) with arthritis by state --#
<- svyby(formula = ~dx_arthritis,
arthritis_sex by = ~state + sex,
design = brfss2013_design,
FUN = svymean,
na.rm = TRUE)
Code to clean and plot the data
#-- Clean-up and convert to a percentage --#
<- arthritis_sex %>%
prop_sex select(1:3) %>%
set_names(c('state', 'sex', 'proportion')) %>%
mutate(state = as.character(state)) %>%
mutate_if(is.numeric, list(~round(. * 100)))
#-- Get the confidence intervals --#
<- confint(arthritis_sex) %>%
ci_sex as.data.frame(.) %>%
rownames_to_column() %>%
mutate_if(is.numeric, list(~round(. * 100))) %>%
set_names(c('state', 'proportion_low', 'proportion_upp')) %>%
filter(str_detect(state, 'Yes')) %>%
mutate(state = str_remove(state, ':.+')) %>%
separate(col = state,
into = c('state', 'age'),
sep = '\\.') %>%
mutate(age = factor(age, ordered = TRUE))
#-- Merge proportion and CI --#
%<>%
prop_sex left_join(ci_sex)
#-- Add interactive tooltip info --#
%<>%
prop_sex mutate(tooltip = paste0('<b>', state, ': ', sex, '</b><br>',
'<em>Prevalence of arthritis</em><br>',
'% (95% CI: ',
proportion, ' to ',
proportion_low, ')'))
proportion_upp,
#-- Join map_us with arthritis_state --#
<- map_us %>%
map_sex left_join(prop_sex) %>%
filter(!is.na(sex))
# Plot
<- ggplot(map_sex) +
gg_sex geom_sf_interactive(aes(tooltip = tooltip,
fill = proportion),
color = '#FFFFFF',
size = 0.25) +
scale_fill_viridis_c(name = 'Prevalence (%)',
direction = -1) +
labs(title = 'Prevalence of diagnosed arthritis by US state and sex*',
subtitle = '(adult non-institutionalized population)',
caption = '* Move cursor over each state to see state-specific estimate and 95% confidence interval.') +
facet_wrap(~sex, ncol = 2) +
theme_void() +
theme(panel.grid = element_line(colour = '#FFFFFF'),
strip.text = element_text(size = 11,
margin = margin(t = 1, r = 0, b = 0.1, l = 0, unit = "line")),
plot.caption = element_text(face = 'italic',
hjust = 0))
ggiraph(ggobj = gg_sex,
width = 1,
hover_css = 'cursor:pointer;
fill:#FFFFFF;
stroke:#000000;',
tooltip_extra_css = 'background-color:#000000;
font-family:sans-serif,arial;
font-size:80%;
color:#FFFFFF;
padding:10px;
border-radius:10px 10px 10px 10px;')
Summary
These data are by no means novel, but they nicely illustrate some key epidemiological patterns.
Firstly, the prevalence of diagnosed arthritis differs substantially by state, with Eastern states tending to have a greater burden of disease. I did not directly address the question, but I suspect that differences in the age profile of states may play a major part in inter-state differences in disease burden.
I did however look at differences in arthritis across age groups. There is a marked increase in the prevalence of diagnosed cases of arthritis with increasing age. for example, the prevalence of arthritis in adults around the age of 20 years is about 2 to 5%, whereas the prevalence in disease is closer to 50% in people aged 65 years and older. There does not seem to be a sharp increase in the prevalence of arthritis at any particular age. Rather, the increased prevalence a appears to show a progress increase as age increases.
The last risk factor I looked at was the effect of sex on the prevalence of diagnosed cases of arthritis. Here, women showed a fairly marked increase in disease prevalence compared to men. The reasons for this difference is not understood.
Acknowledgements
This post was inspired by an assignment in the Coursera Introduction to Probability and Data topic in the Statistics with R Specialization offered through Duke University.
Session information
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## Random number generation:
## RNG: Mersenne-Twister
## Normal: Inversion
## Sample: Rounding
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] geofacet_0.2.0 survey_4.0 survival_3.2-9
## [4] Matrix_1.3-2 sf_0.9-7 fiftystater_1.0.1
## [7] skimr_2.1.3 magrittr_2.0.1 geojsonio_0.9.4
## [10] lubridate_1.7.10 xml2_1.3.2 sp_1.4-5
## [13] leaflet_2.0.4.1 highcharter_0.8.2 flexdashboard_0.5.2
## [16] gganimate_1.0.7 gdtools_0.2.3 ggiraph_0.7.8
## [19] svglite_2.0.0 pander_0.6.3 knitr_1.31
## [22] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5
## [25] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
## [28] tibble_3.1.0 ggplot2_3.3.3 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-0 ellipsis_0.3.1 class_7.3-18
## [4] rgdal_1.5-23 base64enc_0.1-3 fs_1.5.0
## [7] rstudioapi_0.13 httpcode_0.3.0 farver_2.1.0
## [10] ggrepel_0.9.1 fansi_0.4.2 splines_4.0.4
## [13] rlist_0.4.6.1 jsonlite_1.7.2 broom_0.7.5
## [16] dbplyr_2.1.0 png_0.1-7 rgeos_0.5-5
## [19] compiler_4.0.4 httr_1.4.2 backports_1.2.1
## [22] assertthat_0.2.1 lazyeval_0.2.2 cli_2.3.1
## [25] tweenr_1.0.1 htmltools_0.5.1.1 prettyunits_1.1.1
## [28] tools_4.0.4 igraph_1.2.6 gtable_0.3.0
## [31] glue_1.4.2 geojson_0.3.4 V8_3.4.0
## [34] Rcpp_1.0.6 imguR_1.0.3 cellranger_1.1.0
## [37] jquerylib_0.1.3 vctrs_0.3.6 crul_1.1.0
## [40] crosstalk_1.1.1 xfun_0.22 rvest_1.0.0
## [43] lifecycle_1.0.0 jqr_1.2.0 zoo_1.8-9
## [46] scales_1.1.1 hms_1.0.0 yaml_2.2.1
## [49] quantmod_0.4.18 curl_4.3 gridExtra_2.3
## [52] sass_0.3.1 stringi_1.5.3 highr_0.8
## [55] maptools_1.0-2 e1071_1.7-4 TTR_0.24.2
## [58] boot_1.3-27 repr_1.1.3 rlang_0.4.10
## [61] pkgconfig_2.0.3 systemfonts_1.0.1 geogrid_0.1.1
## [64] evaluate_0.14 lattice_0.20-41 htmlwidgets_1.5.3
## [67] labeling_0.4.2 tidyselect_1.1.0 geojsonsf_2.0.1
## [70] R6_2.5.0 magick_2.7.0 generics_0.1.0
## [73] DBI_1.1.1 pillar_1.5.1 haven_2.3.1
## [76] foreign_0.8-81 withr_2.4.1 units_0.7-0
## [79] xts_0.12.1 modelr_0.1.8 crayon_1.4.1
## [82] uuid_0.1-4 KernSmooth_2.23-18 utf8_1.2.1
## [85] rmarkdown_2.7 jpeg_0.1-8.1 progress_1.2.2
## [88] rnaturalearth_0.1.0 readxl_1.3.1 data.table_1.14.0
## [91] reprex_1.0.0 digest_0.6.27 classInt_0.4-3
## [94] webshot_0.5.2 munsell_0.5.0 viridisLite_0.3.0
## [97] kableExtra_1.3.4 bslib_0.2.4 mitools_2.4
Source: BRFSS website↩︎
Source: Breivik et al., 2006. DOI: 10.1016/j.ejpain.2005.06.009↩︎