Using XGBoost in R for a regression-based model
06:41 01 Mar 2017

I'm trying to use XGBoost as a replacement for gbm.

The scores I'm getting are rather odd, so I'm thinking maybe I'm doing something wrong in my code.

My data contains several factor variables; all the others are numeric.

The response variable is continuous, representing a house price.

I understand that in order to use XGBoost, I need to one-hot encode those factor variables. I'm doing so with the following code:

library(caret)  # dummyVars() comes from caret

# Set the response aside before combining train & test
Xtest <- test.data
Xtrain <- train.data
XSalePrice <- Xtrain$SalePrice
Xtrain$SalePrice <- NULL

# Combine data  
Xall <- data.frame(rbind(Xtrain, Xtest))

# Get categorical features names
ohe_vars <- names(Xall)[which(sapply(Xall, is.factor))]

# Convert them
dummies <- dummyVars(~., data = Xall)
Xall_ohe <- as.data.frame(predict(dummies, newdata = Xall))

# Replace factor variables in data with OHE
Xall <- cbind(Xall[, -c(which(colnames(Xall) %in% ohe_vars))], Xall_ohe)
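As a point of reference, base R's model.matrix performs the same kind of expansion without caret; a minimal sketch on toy data (the column and level names here are purely illustrative):

```r
# Toy data: one factor column, one numeric column
toy <- data.frame(style = factor(c("ranch", "colonial", "ranch")),
                  area  = c(1200, 2100, 1500))

# "- 1" drops the intercept so every factor level gets its own 0/1 column
toy_ohe <- model.matrix(~ style + area - 1, data = toy)
```

Here `toy_ohe` has one 0/1 column per level of `style` plus the untouched `area` column.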

After that, I'm splitting the data back into the train and test sets:

Xtrain <- Xall[1:nrow(train.data), ]
Xtest <- Xall[-(1:nrow(train.data)), ]
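The combine-then-split round trip can be checked on toy frames; this sketch just confirms that rbind preserves row order, so the 1:nrow(...) indexing recovers the original rows:

```r
tr <- data.frame(x = 1:3)
te <- data.frame(x = 4:5)
all_rows <- rbind(tr, te)

# First nrow(tr) rows are the training rows, the rest are test rows
tr2 <- all_rows[1:nrow(tr), , drop = FALSE]
te2 <- all_rows[-(1:nrow(tr)), , drop = FALSE]
```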

And then building a model, and printing the RMSE & Rsquared:

# Model
xgb.fit <- xgboost(data = data.matrix(Xtrain), label = XSalePrice,
  booster = "gbtree", objective = "reg:linear",
  colsample_bytree = 0.2, gamma = 0.0,
  learning_rate = 0.05, max_depth = 6,
  min_child_weight = 1.5, n_estimators = 7300,
  reg_alpha = 0.9, reg_lambda = 0.5,
  subsample = 0.2, seed = 42,
  silent = 1, nrounds = 25)

xgb.pred <- predict(xgb.fit, data.matrix(Xtrain))
postResample(xgb.pred, XSalePrice)

The problem is that I'm getting RMSE and R-squared values that are very off:

        RMSE     Rsquared 
1.877639e+05 5.308910e-01 

Those are VERY far from the results I get when using GBM.
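For reference, the two numbers that caret's postResample reports can be reproduced by hand: RMSE is the root mean squared error, and Rsquared is the squared correlation between predictions and observations. A minimal sketch with made-up values:

```r
# Hand-rolled versions of the metrics caret::postResample reports
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rsq  <- function(pred, obs) cor(pred, obs)^2

pred <- c(100, 150, 200)
obs  <- c(110, 140, 210)
```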

I'm thinking I'm doing something wrong; my best guess is that it's the one-hot encoding phase, which I'm unfamiliar with, so I used code I found online, adjusted to my data.

Can someone point out what I'm doing wrong and how to fix it?

UPDATE:

After reviewing @Codutie's answer, my code still has some errors:

Xtrain <- sparse.model.matrix(SalePrice ~. , data = train.data)
XDtrain <- xgb.DMatrix(data = Xtrain, label = "SalePrice")

xgb.DMatrix produces:

Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) : 
  The length of labels must equal to the number of rows in the input data

train.data is a data frame with 1453 rows. The label SalePrice also contains 1453 values (no missing values).
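For what it's worth, the error message points at the label argument: xgb.DMatrix expects a numeric vector of length nrow(data), not the column name as a string (a single string has length 1, which is what triggers the length mismatch). A minimal sketch on toy data (Matrix ships with R as a recommended package; the DMatrix line is shown commented out):

```r
library(Matrix)  # provides sparse.model.matrix

toy <- data.frame(SalePrice = c(100, 200, 150),
                  rooms     = c(3, 5, 4),
                  style     = factor(c("a", "b", "a")))

# The formula drops the response from the design matrix
X <- sparse.model.matrix(SalePrice ~ ., data = toy)

# The label must be the numeric vector itself, not the string "SalePrice"
y <- toy$SalePrice
stopifnot(length(y) == nrow(X))

# dtrain <- xgboost::xgb.DMatrix(data = X, label = y)
```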

Thanks

r regression decision-tree xgboost one-hot-encoding