I'm working on a dataset containing a list of people (indexed by their fiscal code). The target variable is binary (1: bought a book, 0: otherwise).
All the predictors are categorical (e.g. nationality, city, road, income bin, and so on).
A fiscal code can appear twice, and each instance/observation has a weight (1 if the fiscal code is not repeated, a value between 0 and 1 if it is).
For example, the dataset looks like:
| fiscal_code | weight | target | categorical info |
|---|---|---|---|
| AAAAA1 | 0.98 | 0 | ... |
| AAAAA1 | 0.02 | 1 | ... |
I have two datasets:
- train:
  - `X_train`: matrix of categorical variables
  - `y_train`: target variable
  - `train_weight`: weight for every observation in the train set
- test:
  - `X_test`, `y_test`, and `test_weight` variables, which correspond to those in the train dataset
I tried a CatBoost `CatBoostClassifier` model:

    import numpy as np
    from catboost import CatBoostClassifier

    # Initialize booster and hyperparameters
    # Column positions of the categorical features (pandas 'category' dtype)
    categorical_features_indices = np.where(X_train.dtypes == 'category')[0]
    model = CatBoostClassifier(iterations=5000, learning_rate=0.1, depth=7,
                               loss_function='Logloss', eval_metric='AUC')

    # Fit model, weighting each training observation by train_weight
    model.fit(X_train,
              y_train,
              eval_set=(X_test, y_test),
              cat_features=categorical_features_indices,
              use_best_model=True,
              verbose=True,
              sample_weight=train_weight)

How can I take into account that the observations in the test dataset have weights too (i.e. `test_weight`)?
I read the CatBoost documentation but did not find anything useful, unlike the LightGBM documentation (in case I should consider another boosting library).
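
To make the goal concrete, one reading of "taking the test weights into account" is a weighted evaluation metric, where each test row contributes in proportion to its weight. Below is a minimal sketch of a weighted log loss under that assumption. The helper name `weighted_logloss` and the predicted probabilities are hypothetical and only for illustration; they are not part of CatBoost.

```python
import math

def weighted_logloss(y_true, p_pred, weights, eps=1e-15):
    """Weighted binary log loss: sum(w_i * loss_i) / sum(w_i)."""
    total_loss = 0.0
    total_weight = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total_loss += w * loss
        total_weight += w
    return total_loss / total_weight

# Toy data mirroring the two example rows (same fiscal code, weights 0.98 and 0.02)
y_test = [0, 1]
test_weight = [0.98, 0.02]
p_hat = [0.1, 0.1]  # hypothetical predicted probabilities of class 1

print("weighted:  ", weighted_logloss(y_test, p_hat, test_weight))
print("unweighted:", weighted_logloss(y_test, p_hat, [1.0, 1.0]))
```

With these toy numbers the low-weight row barely moves the weighted score, which is the behaviour I would like on the evaluation set. On the CatBoost side, `Pool` has a `weight` argument and `eval_set` accepts a `Pool`, so that may be where `test_weight` belongs, but I have not confirmed how the evaluation metrics use it.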