Best practices to optimize this multi-step data processing code
00:03 27 Mar 2026

I built a regression pipeline to predict a continuous target (accident_risk) with XGBoost, focusing on avoiding data leakage and following best practices.

What I’ve already done

  • Used Pipeline + ColumnTransformer for XGBoost to ensure leakage-free preprocessing during CV

  • Used OneHotEncoder(handle_unknown='ignore') instead of get_dummies

  • Ensured imputers are fit only on training data

  • Used RandomizedSearchCV instead of GridSearch

  • Used early stopping for both models

  • Built a simple weighted ensemble of XGBoost + CatBoost

  • Performed final retraining on full dataset before submission
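For context, the weighted ensemble mentioned above is just a convex combination of the two models' predictions. A minimal sketch (the helper name and the example weight are illustrative, not taken from my actual code):

```python
import numpy as np

def weighted_ensemble(pred_a, pred_b, w=0.6):
    """Blend two regressors' predictions; w is the weight on pred_a."""
    return w * np.asarray(pred_a) + (1 - w) * np.asarray(pred_b)

# Example: equal weighting of two prediction vectors
blended = weighted_ensemble([1.0, 2.0], [3.0, 4.0], w=0.5)
# blended == [2.0, 3.0]
```

In practice I pick the weight by evaluating the blend on a holdout or out-of-fold predictions rather than hard-coding it.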

My questions

  1. Are there any improvements I can make to:

    • model performance?

    • code structure / design?

  2. Is my approach to:

    • handling categorical variables

    • cross-validation

    • final retraining
      correct and optimal?

  3. Are there better alternatives to my current ensemble strategy (weighted average)?

Code

Preprocessor

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(num_cols, cat_cols):
    num_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
    ])

    cat_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])

    return ColumnTransformer(
        transformers=[
            ("num", num_transformer, num_cols),
            ("cat", cat_transformer, cat_cols),
        ]
    )

XGBoost Pipeline

from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

xgb_pipeline = Pipeline([
    ("preprocessor", build_preprocessor(num_cols, cat_cols)),
    ("xgb", XGBRegressor(tree_method="hist", eval_metric="rmse", random_state=42)),
])

param_dist = {
    "xgb__n_estimators": [200, 400, 600],
    "xgb__max_depth": [3, 5, 7],
    "xgb__learning_rate": [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    xgb_pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring="r2",
    random_state=42,  # makes the sampled configurations reproducible
)

search.fit(X_train, y_train)
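The final-retraining step I mention above follows the pattern below. This is a self-contained sketch: the data is synthetic, and Ridge stands in for XGBRegressor so it runs without xgboost installed; only the search-then-refit flow mirrors my actual code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y = X["a"] * 2 + rng.normal(scale=0.1, size=200)

pipe = Pipeline([("imputer", SimpleImputer()), ("model", Ridge())])
search = RandomizedSearchCV(
    pipe,
    {"model__alpha": [0.1, 1.0, 10.0]},
    n_iter=3,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)

# With refit=True (the default), best_estimator_ is already refit on all
# data passed to fit, so "final retraining" means calling fit on the
# combined train + holdout data before predicting for submission.
final_model = search.best_estimator_
preds = final_model.predict(X)
```

One caveat I'm aware of: `best_estimator_` is refit only on the data given to `search.fit`, so retraining on the full dataset requires an explicit extra `fit` call on the concatenated data.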

Additional context

  • Dataset contains both numerical and categorical features

  • Evaluation metric: R²

  • Using 5-fold CV

I would appreciate any suggestions on improving performance, readability, or best practices.
