Difference between predict() and predict_proba() in scikit-learn
18:50 09 Jul 2023

After training a tree:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

I decided to test its performance on the training data itself, so that I can later compare it with the test data in terms of sensitivity, specificity and ROC-AUC. Instead of applying only predict(), I also applied predict_proba() to the X_train data.
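For context, the training-set evaluation described above could look roughly like the sketch below. The synthetic data from make_classification, the random_state values and the variable names y_pred and y_score are assumptions made for illustration; they are not the data or code behind this question.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data for a binary problem (not the real dataset)
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Hard class labels and the probability of class 1, both on the training data
y_pred = clf.predict(X_train)
y_score = clf.predict_proba(X_train)[:, 1]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
auc = roc_auc_score(y_train, y_score)

print(sensitivity, specificity, auc)
```

The same three lines of metric code can be reused with X_test and y_test for the later comparison.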

As you can see from the dataframe below, observation 4 has a 50% probability of being class 0 and a 50% probability of being class 1 (according to predict_proba()), yet predict() classified it as 0.

Image with the dataframe, where the first column is the result of predict_proba() and the second column is the result of predict()
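A tie like this can be reproduced with a small sketch. The toy data is made up for illustration (an assumption, not the data behind the dataframe): two rows share the same feature value but carry different labels, so they end up in the same leaf with a 50/50 class split.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the two rows with feature value 0 have conflicting labels
X_toy = [[0], [0], [1], [1]]
y_toy = [0, 1, 1, 1]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_toy, y_toy)

# Query the feature value whose leaf holds one sample of each class
proba = clf.predict_proba([[0]])
pred = clf.predict([[0]])

print(clf.classes_)  # [0 1]
print(proba[0])      # [0.5 0.5]
print(pred[0])       # 0
```

On this toy data predict() returns 0 for the tied row. That matches what np.argmax does with equal values (it returns the first maximum), but the sketch alone does not show whether scikit-learn documents this as guaranteed behaviour.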

Did the predict() function classify as 0 by "chance" or did it classify as 0 because it comes first (as in order matters)?

The documentation of these functions did not resolve my doubt (source).

python scikit-learn statistics data-science