Use the full dataset or the analysis subset for multiple imputation of missing race & ethnicity?
15:31 08 Jun 2026

My research team is planning to use multiple imputation to impute missing race and ethnicity in our dataset. Our analysis will be restricted to infants; however, we have data on all children up to age 18. My question is: is it better to impute missing race and ethnicity using all available data (children up to 18), or to restrict to our population of interest (infants) first?

We have debated the pros and cons of each approach. Including the additional ages would give us more information, which could improve the imputation. However, it would also add to the time and complexity of the imputation, and age-related differences in which races and ethnicities are represented in our dataset could affect our results. Our dataset is large, so if we do restrict to infants, we are not worried about having too little data to impute accurately.

mi