Using multiple years of BRFSS data and running svydesign takes a million years

07:40 22 Apr 2026

I'd like help troubleshooting a slow survey analysis. I have 5 years of BRFSS data. I select the variables I need, run svydesign on the whole dataset, subset the resulting design, and then try to produce estimates with survey_mean. I have tried several ways to reduce the run time, including svy_mean and tbl_svysummary, but nothing helped and the job is still running on the server. Is there any way around this problem? Here's the code:

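library(dplyr)      # tibble(), select(), mutate(), across(), %>%
library(tidyr)      # pivot_wider()
library(survey)     # svydesign()
library(gtsummary)  # tbl_svysummary(), add_ci()

set.seed(1)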

# ---- Simulate BRFSS-like data ----
n <- 1e6                  # total rows; lower this if it eats your RAM
n_strata <- 200           # strata (e.g., state-region cells)
psus_per_stratum <- 30    # PSUs within each stratum

brfss_combined <- tibble(   # simulated stand-in for the real combined BRFSS data
  `_STSTR` = sample(seq_len(n_strata), n, replace = TRUE),
  `_PSU`   = sample(seq_len(psus_per_stratum), n, replace = TRUE),
  finalwt  = runif(n, 50, 5000),
  race = factor(sample(
    c("White only", "Black or African American only",
      "American Indian Native Alaskan", "Asian only",
      "Native Hawaiian or other Pacific Islander only", "Other"),
    n, replace = TRUE,
    prob = c(0.70, 0.12, 0.02, 0.06, 0.01, 0.09))),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  age = factor(sample(
    c("18-24","25-29","30-34","35-39","40-44","45-49","50-54",
      "55-59","60-64","65-69","70-74","75-79","80+"),
    n, replace = TRUE)),
  smokerstat = factor(sample(
    c("Current Smoker", "Former Smoker", "Never Smoked"),
    n, replace = TRUE, prob = c(0.15, 0.25, 0.60))),
  education  = factor(sample(
    c("Did not graduate High School", "Graduated High School",
      "Attended College or Technical School",
      "Graduated College or Technical School"),
    n, replace = TRUE, prob = c(0.10, 0.30, 0.30, 0.30)))
)

# The real combined dataset has 1,719,106 obs
brfss_design <- brfss_combined %>%
  select("_PSU", "_STSTR", "finalwt", "age", "race", "sex", "smokerstat", "education")

options(survey.lonely.psu = "adjust")

brfss_design <- svydesign(
  id = ~`_PSU`,          
  strata = ~`_STSTR`,    
  weights = ~`finalwt`,  
  data = brfss_design,
  nest = TRUE
)

subpop <- subset(brfss_design,
                 race == "American Indian Native Alaskan" &
                   age %in% c("18-24", "25-29"))

# Drop levels that don't appear in the subset
subpop$variables <- subpop$variables |>
  mutate(across(where(is.factor), droplevels))
 
# Convert the design to a srvyr survey object
library(srvyr)
subpop <- as_survey(subpop)

# one variable at a time
tab_smoker <- subpop |>
  filter(!is.na(smokerstat)) |>
  group_by(sex, smokerstat) |>
  summarise(p = survey_mean(vartype = "ci", na.rm = TRUE)) |>
  mutate(cell = sprintf("%.1f%% (%.1f–%.1f)", p*100, p_low*100, p_upp*100)) |>
  select(sex, smokerstat, cell) |>
  pivot_wider(names_from = sex, values_from = cell)

# Original code I was trying to run
tbl_svysummary(subpop, by = sex,
               include = c(smokerstat, age, education),
               statistic = all_categorical() ~ "{p}% ({p.std.error})",
               digits = all_categorical() ~ 1) |>
  add_ci(include = everything())
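
To narrow down which step is slow, each stage can be timed separately on a much smaller sample. The following is only a minimal sketch under stated assumptions: the names `sim`, `strat`, `psu`, `wt`, and `n_small` are hypothetical and not part of the code above, and the simulated data is far smaller and simpler than the real BRFSS file, so the timings only show relative cost per step.

```r
library(survey)

set.seed(1)
n_small <- 2e4  # hypothetical reduced size so every step finishes quickly

# Hypothetical, simplified stand-in for the BRFSS columns used above
sim <- data.frame(
  strat  = sample(seq_len(20), n_small, replace = TRUE),
  psu    = sample(seq_len(30), n_small, replace = TRUE),
  wt     = runif(n_small, 50, 5000),
  sex    = factor(sample(c("Male", "Female"), n_small, replace = TRUE)),
  smoker = factor(sample(c("Current Smoker", "Former Smoker", "Never Smoked"),
                         n_small, replace = TRUE))
)

options(survey.lonely.psu = "adjust")

# Time each stage on its own: design construction, subsetting, estimation
t_design <- system.time(
  des <- svydesign(id = ~psu, strata = ~strat, weights = ~wt,
                   data = sim, nest = TRUE)
)
t_subset <- system.time(sub <- subset(des, sex == "Female"))
t_mean   <- system.time(est <- svymean(~smoker, sub))

# Elapsed seconds per stage
timings <- c(design = t_design[["elapsed"]],
             subset = t_subset[["elapsed"]],
             mean   = t_mean[["elapsed"]])
print(timings)
print(est)
```

Increasing `n_small` in steps (for example 2e4, 2e5, 1e6) and watching which entry of `timings` grows fastest should indicate whether the design construction, the subset, or the estimation dominates the run time.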

Any advice would be appreciated. Thank you.

r survey
