I built a regression pipeline for predicting a continuous target (accident_risk) using XGBoost with a focus on avoiding data leakage and following best practices.
What I’ve already done
- Used Pipeline + ColumnTransformer for XGBoost to ensure leakage-free preprocessing during CV
- Used OneHotEncoder(handle_unknown='ignore') instead of get_dummies
- Ensured imputers are fit only on training data
- Used RandomizedSearchCV instead of GridSearchCV
- Used early stopping for both models
- Built a simple weighted ensemble of XGBoost + CatBoost
- Performed final retraining on the full dataset before submission
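For concreteness, the "weighted ensemble" above is just a convex combination of the two models' predictions. A minimal sketch (the function name and weights are illustrative, not my tuned values):

```python
import numpy as np

def weighted_ensemble(pred_xgb, pred_cat, w_xgb=0.6):
    """Convex combination of two prediction vectors; w_xgb in [0, 1]."""
    pred_xgb = np.asarray(pred_xgb, dtype=float)
    pred_cat = np.asarray(pred_cat, dtype=float)
    return w_xgb * pred_xgb + (1.0 - w_xgb) * pred_cat

# Example: equal-weight blend of two prediction vectors
blended = weighted_ensemble([1.0, 2.0], [3.0, 4.0], w_xgb=0.5)
# -> array([2., 3.])
```

In my actual pipeline the weight was chosen on a validation split; this is just the blending step itself.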
My questions
1. Are there any improvements I can make to model performance or to the code structure/design?
2. Is my approach to handling categorical variables, cross-validation, and final retraining correct and optimal?
3. Are there better alternatives to my current ensemble strategy (a simple weighted average)?
Code
Preprocessor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

def build_preprocessor(num_cols, cat_cols):
    # Numeric features: median imputation only (no scaling needed for trees)
    num_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
    ])
    # Categorical features: mode imputation, then one-hot encoding that
    # tolerates categories unseen during fit
    cat_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])
    return ColumnTransformer(
        transformers=[
            ("num", num_transformer, num_cols),
            ("cat", cat_transformer, cat_cols),
        ]
    )
XGBoost Pipeline
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

xgb_pipeline = Pipeline([
    ("preprocessor", build_preprocessor(num_cols, cat_cols)),
    ("xgb", XGBRegressor(tree_method="hist", eval_metric="rmse")),
])

param_dist = {
    "xgb__n_estimators": [200, 400, 600],
    "xgb__max_depth": [3, 5, 7],
    "xgb__learning_rate": [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    xgb_pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)
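One detail on the final retraining: with refit=True (the default), RandomizedSearchCV already refits the best configuration on the full data passed to fit, so an explicit retrain is mainly useful for folding a held-out set back in before submission. A self-contained sketch of that clone-and-refit pattern (using sklearn's GradientBoostingRegressor as a stand-in so it runs without xgboost, on synthetic data):

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the real training set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100], "max_depth": [2, 3]},
    n_iter=4,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)

# Final retraining: clone the best estimator (tuned hyperparameters are
# preserved by clone) and refit on the full dataset before predicting.
final_model = clone(search.best_estimator_).fit(X, y)
```

In my real pipeline the clone step would refit on train + validation combined rather than the same X, y the search saw.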
Additional context
Dataset contains both numerical and categorical features
Evaluation metric: R²
Using 5-fold CV
I would appreciate any suggestions on improving performance, readability, or best practices.