CatBoost evaluation/test dataset with weights for observations
14:50 09 Jan 2019

I'm working on a dataset containing a list of people (indexed by their fiscal code). The target variable is binary (1: buy a book, 0: otherwise).

All the predictors are categorical (e.g. nationality, city, road, income bin, and so on).

A fiscal code can appear twice, and each instance/observation has a weight (1 if the code is not repeated, a value between 0 and 1 if it is).

For example, the dataset looks like:

fiscal_code weight target categorical info
AAAAA1 0.98 0 ...
AAAAA1 0.02 1 ...
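
In code, the layout can be sketched like this. The `BBBBB2` row is made up for illustration, the categorical columns are omitted, and the check assumes that the weights of a repeated fiscal code sum to 1, as they do in the example above.

```python
from collections import defaultdict

# Rows as (fiscal_code, weight, target); categorical columns omitted.
rows = [
    ("AAAAA1", 0.98, 0),
    ("AAAAA1", 0.02, 1),
    ("BBBBB2", 1.00, 0),  # hypothetical non-repeated fiscal code
]

# Accumulate the total weight carried by each fiscal code.
weight_per_code = defaultdict(float)
for fiscal_code, weight, _target in rows:
    weight_per_code[fiscal_code] += weight

# Assumption: every fiscal code carries a total weight of 1.
codes_with_bad_weight = {
    code: total
    for code, total in weight_per_code.items()
    if abs(total - 1.0) > 1e-9
}
print(codes_with_bad_weight)
```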

I have two datasets:

  • train:
    • X_train: matrix of categorical variables
    • y_train: target variable
    • train_weight: weight for every observation in the train set
  • test: X_test, y_test, and test_weight variables, which correspond to those in the train dataset

I tried CatBoost's CatBoostClassifier:

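import numpy as np
from catboost import CatBoostClassifier

# Initialize booster and hyperparameters
# Column indices of the categorical predictors (pandas 'category' dtype)
categorical_features_indices = np.where(X_train.dtypes == 'category')[0]

model = CatBoostClassifier(iterations=5000, learning_rate=0.1, depth=7, loss_function='Logloss', eval_metric='AUC')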

# Fit model
model.fit(X_train,
          y_train,
          eval_set=(X_test,y_test),
          cat_features=categorical_features_indices,
          use_best_model=True,
          verbose=True,
          sample_weight=train_weight)

How can I take into account, during evaluation, that the observations in the test dataset also have weights (test_weight)?
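
To make concrete what I mean by a weighted evaluation: each test row should contribute to the metric in proportion to its weight. Below is a minimal sketch for a weighted log loss in plain NumPy. It is not CatBoost API; `weighted_logloss` is a hypothetical helper and the toy values are made up.

```python
import numpy as np

def weighted_logloss(y_true, p_pred, weight, eps=1e-15):
    """Weighted average of the per-row binary log loss (hypothetical helper)."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 to keep the logarithms finite.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    w = np.asarray(weight, dtype=float)
    per_row = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.sum(w * per_row) / np.sum(w))

# Toy data: the second row is badly predicted but carries little weight.
y_demo = [0, 1, 1]
p_demo = [0.1, 0.1, 0.9]       # predicted probability of class 1
w_demo = [0.98, 0.02, 1.0]     # plays the role of test_weight

unweighted = weighted_logloss(y_demo, p_demo, [1, 1, 1])
weighted = weighted_logloss(y_demo, p_demo, w_demo)
print(unweighted, weighted)
```

With these toy values the weighted loss is lower than the unweighted one, because the poorly predicted row has a weight of only 0.02. I would like the metric reported on eval_set to behave the same way.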

I read the CatBoost documentation but did not find anything useful, unlike the LightGBM documentation (in case another boosting model is an option).

python catboost