Different output values with the same parameters when classifying data

I am tuning parameters to get the best result for my classification model.

I do it like this:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

print('Initial score:         ', lgb_m_REZ)
g = 775
max_score = 0
g_best = 0
i_best = 0
while g < 779:
    i = 25
    X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.33, random_state=g)
    while i < 28:
        param_grid = {
            'max_features': ['auto', 'sqrt', 'log2'],
            'learning_rate': [0.05, 0.1, 0.2, 0.3],
            'random_state': [i],
        }
        svc = GradientBoostingClassifier()
        clf = GridSearchCV(svc, param_grid)
        clf.fit(X_train2, y_train2)
        print('random_state sample:       ', g)
        print('random_state model:        ', i)
        print('With parameter tuning:       ', clf.best_score_)
        print('With parameter tuning:       ', clf.best_params_)
        if clf.best_score_ > lgb_m_REZ and clf.best_score_ > max_score:
            max_score = clf.best_score_
            g_best = g
            i_best = i
        print('Best value with parameter tuning: ', max_score, 'i ', i_best, 'g ', g_best)
        i += 1
    g += 1

For example, the search finds values that give a better score than the first run of the model:

{'learning_rate': 0.3, 'max_features': 'sqrt', 'random_state': 25}
g = 775
i = 25

I substitute them like this:

from sklearn.metrics import accuracy_score

X_train3, X_test3, y_train3, y_test3 = train_test_split(X, y, test_size=0.33, random_state=775)
lgb_m1 = GradientBoostingClassifier(max_features='sqrt', learning_rate=0.3, random_state=25)
lgb_m1.fit(X_train3, y_train3)
print(lgb_m_REZ)
res3 = lgb_m1.predict(X_test3)
print('Proportion of correctly predicted values: ', accuracy_score(y_test3, res3))

But the resulting score differs from what the search reported.

What am I doing wrong?

Author: 0xdb, 2020-06-11

1 answer

It looks like you don't quite understand why and how the random_state parameter is used in Scikit-Learn.

This parameter exists solely so that repeated runs of the same call with the same inputs produce identical results. In other words, it makes functions and class methods that rely on a random number generator behave deterministically.
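To illustrate, here is a minimal, self-contained sketch (the data and variable names are made up for the example) showing that a fixed random_state reproduces exactly the same split, while a different value simply gives a different but equally valid shuffle:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state -> identical split on every run
a, _, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b, _, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(np.array_equal(a, b))   # True: fully reproducible

# Different random_state -> a different, but equally valid, shuffle
c, _, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)
print(np.array_equal(a, c))   # almost certainly False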

Tuning random_state to get a slightly better result when splitting the sample or when training the model makes no sense: the model is trained to predict values for unseen samples.

A random_state chosen to make the training/test split look good does not guarantee a better result on unknown data.


So there is no point in wasting resources on tuning random_state; it is only there to reproduce results.

It is better to tune the model's real hyperparameters, for example n_estimators, criterion, min_samples_split, min_samples_leaf, max_depth, etc.

Sometimes, to reproduce a result, you also have to explicitly set np.random.seed().

I would rewrite your code as follows:

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

random_state = 123

np.random.seed(random_state)
X_train2, X_test2, y_train2, y_test2 = \
    train_test_split(X, y, test_size=0.33, random_state=random_state)

param_grid = {
    'max_features': ['auto', 'sqrt', 'log2'],
    'learning_rate': [0.05, 0.1, 0.2, 0.3],
    'n_estimators': [50, 100, 250, 500],
    'criterion': ['friedman_mse', 'mse', 'mae']
}
svc = GradientBoostingClassifier(random_state=random_state)
np.random.seed(random_state)
# GridSearchCV itself has no random_state parameter; reproducibility comes
# from the estimator's random_state and the deterministic default CV splitter
clf = GridSearchCV(svc, param_grid)
clf.fit(X_train2, y_train2)

After that, use the best parameters found, or simply use the model that has already been refit with them: clf.best_estimator_

clf.best_estimator_.score(X_test2, y_test2)
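For completeness, a short sketch of how the tuned model could be evaluated the same way the question does (assuming accuracy_score is imported from sklearn.metrics, as in the original code):

from sklearn.metrics import accuracy_score

print(clf.best_params_)                      # parameters chosen by the grid search
pred = clf.best_estimator_.predict(X_test2)  # best_estimator_ is already fitted (refit=True by default)
print(accuracy_score(y_test2, pred))         # same metric as in the question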
Author: MaxU, 2020-06-11 11:17:38