I built a regression pipeline for predicting a continuous target (accident_risk) using XGBoost with a focus on avoiding data leakage and following best practices.
What I’ve already done
- Used Pipeline + ColumnTransformer for XGBoost to ensure leakage-free preprocessing during CV
- Used OneHotEncoder(handle_unknown='ignore') instead of get_dummies
- Ensured imputers are fit only on training data
- Used RandomizedSearchCV instead of GridSearchCV
- Used early stopping for both models
- Built a simple weighted ensemble of XGBoost + CatBoost
- Performed final retraining on the full dataset before submission
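For concreteness, the "weighted ensemble" above is just a convex combination of the two models' predictions. A minimal sketch (the function name and weights are illustrative, not my tuned values):

```python
import numpy as np

def weighted_ensemble(pred_xgb, pred_cat, w_xgb=0.6):
    """Convex combination of two prediction vectors; w_xgb in [0, 1]."""
    pred_xgb = np.asarray(pred_xgb, dtype=float)
    pred_cat = np.asarray(pred_cat, dtype=float)
    return w_xgb * pred_xgb + (1.0 - w_xgb) * pred_cat

# Example: equal-weight blend of two prediction vectors
blended = weighted_ensemble([1.0, 2.0], [3.0, 4.0], w_xgb=0.5)
# -> array([2., 3.])
```

In my actual pipeline the weight was chosen on a validation split; this is just the blending step itself.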
My questions
1. Are there any improvements I can make to model performance or to the code structure/design?
2. Is my approach to handling categorical variables, cross-validation, and final retraining correct and optimal?
3. Are there better alternatives to my current ensemble strategy (a simple weighted average)?
Code
Preprocessor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

def build_preprocessor(num_cols, cat_cols):
    # Numeric features: median imputation only (no scaling needed for trees)
    num_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
    ])
    # Categorical features: mode imputation, then one-hot encoding that
    # tolerates categories unseen during fit
    cat_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])
    return ColumnTransformer(
        transformers=[
            ("num", num_transformer, num_cols),
            ("cat", cat_transformer, cat_cols),
        ]
    )
XGBoost Pipeline
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

xgb_pipeline = Pipeline([
    ("preprocessor", build_preprocessor(num_cols, cat_cols)),
    ("xgb", XGBRegressor(tree_method="hist", eval_metric="rmse")),
])

param_dist = {
    "xgb__n_estimators": [200, 400, 600],
    "xgb__max_depth": [3, 5, 7],
    "xgb__learning_rate": [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    xgb_pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)
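One detail on the final retraining: with refit=True (the default), RandomizedSearchCV already refits the best configuration on the full data passed to fit, so an explicit retrain is mainly useful for folding a held-out set back in before submission. A self-contained sketch of that clone-and-refit pattern (using sklearn's GradientBoostingRegressor as a stand-in so it runs without xgboost, on synthetic data):

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the real training set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100], "max_depth": [2, 3]},
    n_iter=4,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)

# Final retraining: clone the best estimator (tuned hyperparameters are
# preserved by clone) and refit on the full dataset before predicting.
final_model = clone(search.best_estimator_).fit(X, y)
```

In my real pipeline the clone step would refit on train + validation combined rather than the same X, y the search saw.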
Additional context
Dataset contains both numerical and categorical features
Evaluation metric: R²
Using 5-fold CV
I would appreciate any suggestions on improving performance, readability, or best practices.