I'm looking for troubleshooting advice. I have 5 years of BRFSS data with a filtered set of variables. I call svydesign on the whole dataset first, then subset, and then try to produce estimates with survey_mean. I have tried several approaches to cut the run time, including svy_mean and tbl_svysummary, but none of them helped, and the job is still running on the server. Is there any way around this problem? Here's the code, with simulated data standing in for the real file:
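# Packages used below (srvyr is loaded later, just before as_survey)
library(tibble)
library(dplyr)
library(tidyr)
library(survey)
library(gtsummary)
set.seed(1)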
# ---- Simulate BRFSS-like data ----
n <- 1e6 # total rows; lower this if it eats your RAM
n_strata <- 200 # strata (e.g., state-region cells)
psus_per_stratum <- 30 # PSUs within each stratum
df <- tibble(
  `_STSTR` = sample(seq_len(n_strata), n, replace = TRUE),
  `_PSU` = sample(seq_len(psus_per_stratum), n, replace = TRUE),
  finalwt = runif(n, 50, 5000),
  race = factor(sample(
    c("White only", "Black or African American only",
      "American Indian Native Alaskan", "Asian only",
      "Native Hawaiian or other Pacific Islander only", "Other"),
    n, replace = TRUE,
    prob = c(0.70, 0.12, 0.02, 0.06, 0.01, 0.09))),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  age = factor(sample(
    c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54",
      "55-59", "60-64", "65-69", "70-74", "75-79", "80+"),
    n, replace = TRUE)),
  smokerstat = factor(sample(
    c("Current Smoker", "Former Smoker", "Never Smoked"),
    n, replace = TRUE, prob = c(0.15, 0.25, 0.60))),
  education = factor(sample(
    c("Did not graduate High School", "Graduated High School",
      "Attended College or Technical School",
      "Graduated College or Technical School"),
    n, replace = TRUE, prob = c(0.10, 0.30, 0.30, 0.30)))
)
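# The real combined data has 1,719,106 obs; the simulated df stands in for it here
brfss_combined <- df
# Keep only the design variables and the analysis variables
brfss_design <- brfss_combined %>%
  select("_PSU", "_STSTR", "finalwt", "age", "race", "sex", "smokerstat", "education")
options(survey.lonely.psu = "adjust")
# Build the design on the full dataset before subsetting
brfss_design <- svydesign(
  id = ~`_PSU`,
  strata = ~`_STSTR`,
  weights = ~`finalwt`,
  data = brfss_design,
  nest = TRUE
)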
subpop <- subset(brfss_design,
                 race == "American Indian Native Alaskan" &
                   age %in% c("18-24", "25-29"))
# Drop levels that don't appear in the subset
subpop$variables <- subpop$variables |>
  mutate(across(where(is.factor), droplevels))
# Convert the design to a srvyr object using the srvyr package
library(srvyr)
subpop <- as_survey(subpop)
# one variable at a time
tab_smoker <- subpop |>
  filter(!is.na(smokerstat)) |>
  group_by(sex, smokerstat) |>
  summarise(p = survey_mean(vartype = "ci", na.rm = TRUE)) |>
  mutate(cell = sprintf("%.1f%% (%.1f–%.1f)", p*100, p_low*100, p_upp*100)) |>
  select(sex, smokerstat, cell) |>
  pivot_wider(names_from = sex, values_from = cell)
# Original code I was trying to run
tbl_svysummary(subpop, by = sex,
               include = c(smokerstat, age, education),
               statistic = all_categorical() ~ "{p}% ({p.std.error})",
               digits = all_categorical() ~ 1) |>
  add_ci(include = everything())
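
In case it helps anyone reproduce the slowdown without waiting on the full data, here is a minimal timing sketch on a much smaller simulated sample. It only separates the time spent building the design from the time spent estimating. The sample size `n_small`, the column names (`psu`, `stratum`, `wt`, `smoker`), and the use of `svyby` with `svymean` are assumptions for illustration, not my real setup.

```r
library(survey)

set.seed(1)
n_small <- 20000  # hypothetical size, chosen only to keep the run short

# Small stand-in data with the same design structure (200 strata, 30 PSUs each)
small <- data.frame(
  psu     = sample(1:30, n_small, replace = TRUE),
  stratum = sample(1:200, n_small, replace = TRUE),
  wt      = runif(n_small, 50, 5000),
  sex     = factor(sample(c("Male", "Female"), n_small, replace = TRUE)),
  smoker  = factor(sample(c("Current", "Former", "Never"), n_small, replace = TRUE))
)

options(survey.lonely.psu = "adjust")

# Time the design construction on its own
timing_design <- system.time(
  d <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                 data = small, nest = TRUE)
)

# Time one grouped estimate (proportions of smoker status by sex)
timing_est <- system.time(
  est <- svyby(~smoker, ~sex, d, svymean)
)

print(timing_design)
print(timing_est)
```

Rerunning this with a few increasing values of `n_small` should show which of the two steps grows fastest as the row count goes up.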
Any advice would be appreciated. Thank you.