Why is there such a big difference in accuracy when applying the Gini test and entropy?

Hello everyone. I'm slowly continuing to study ML and got to the well-known 'Wine' dataset. I ran into the following: if I use entropy as the criterion instead of the Gini criterion, accuracy drops by 4-10%. Using this code:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.322, random_state=42)
dt_wine = tree.DecisionTreeClassifier()
dt_wine.fit(X_train, y_train)
y_pred = dt_wine.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", '%.2f' % (test_accuracy * 100), "%")

I get Accuracy: 94.83 %, but if I run

dt_wine = tree.DecisionTreeClassifier(criterion='entropy')

then the accuracy is 84.48%.

I tried playing with the data, but entropy still performs worse than Gini. Is entropy simply a poor fit for this dataset, or is it really worse than the Gini criterion in most cases? Or did I do something wrong (for example, in how I calculated the accuracy)? I read the theory and found no reason to expect such a big difference.

Author: MaxU, 2020-06-07

1 answer

If you want reproducible model training results, always set the random_state parameter; otherwise you may get different results on exactly the same data.

Example: we run the same code twice on the same data:

In [29]: %paste
dt_wine = tree.DecisionTreeClassifier()
dt_wine = dt_wine.fit(X_train,y_train)
y_pred = dt_wine.predict(X_test)
test_accuracy= metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", '%.2f'% (test_accuracy*100),"%")

## -- End pasted text --
Accuracy:  96.55 %

In [30]: %paste
dt_wine = tree.DecisionTreeClassifier()
dt_wine = dt_wine.fit(X_train,y_train)
y_pred = dt_wine.predict(X_test)
test_accuracy= metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", '%.2f'% (test_accuracy*100),"%")

## -- End pasted text --
Accuracy:  94.83 %

The accuracy of the predictions changed between the two runs, even though nothing else did. To avoid this, set the random_state parameter:
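A minimal sketch of what that looks like in practice (the value 42 for random_state and the 0.322 test split are just the values from your code; any fixed integer works): fitting the same tree twice with a fixed seed gives identical accuracy on repeated runs.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.322, random_state=42)

# With a fixed random_state, repeated fits build the same tree,
# so the accuracy no longer fluctuates between runs.
accs = []
for _ in range(2):
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X_train, y_train)
    accs.append(accuracy_score(y_test, dt.predict(X_test)))

print(accs)
assert accs[0] == accs[1]  # identical results on every run
```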


Regarding the choice of criterion: if one criterion were always better than the other, there would be no reason to keep the one that always gives the worse result. ;)

The criterion is one of the model's hyperparameters, which can and should be tuned (GridSearchCV, RandomizedSearchCV, hyperopt, etc.).
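For example, a small GridSearchCV over the criterion (the max_depth values in the grid are just illustrative choices, not a recommendation) lets cross-validation pick gini vs. entropy for you instead of comparing them on a single train/test split:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.322, random_state=42)

# Tune the criterion (and, for illustration, max_depth) with 5-fold CV
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3, 5, 7],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)          # the criterion CV preferred
print(search.score(X_test, y_test)) # accuracy of the refit best model
```

This way the gini-vs-entropy decision is made on cross-validated scores rather than on one lucky or unlucky split.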

If you use algorithms based on decision trees, I would recommend decision tree ensembles: they are much more resistant to overfitting and almost always give better results than single trees.

Example for your data:

In [52]: from sklearn.ensemble import RandomForestClassifier

In [53]: rf = RandomForestClassifier(random_state=123)

In [54]: rf.fit(X_train, y_train)
Out[54]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=123,
                       verbose=0, warm_start=False)

In [55]: y_pred2 = rf.predict(X_test)

In [56]: metrics.accuracy_score(y_test, y_pred2)
Out[56]: 1.0
Author: MaxU, 2020-06-07 17:31:12