Different output values with the same parameters when classifying data
I am tuning parameters to get the best-trained classification model.
I do it like this:
print('Baseline score: ', lgb_m_REZ)
g = 775
max_score = 0
g_best = 0
i_best = 0
while g < 779:
    i = 25
    X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.33, random_state=g)
    while i < 28:
        param_grid = {
            'max_features': ['auto', 'sqrt', 'log2'],
            'learning_rate': [0.05, 0.1, 0.2, 0.3],
            'random_state': [i],
        }
        svc = GradientBoostingClassifier()
        clf = GridSearchCV(svc, param_grid)
        clf.fit(X_train2, y_train2)
        print('random_state sample: ', g)
        print('random_state model: ', i)
        print('Grid-search best score: ', clf.best_score_)
        print('Grid-search best params: ', clf.best_params_)
        if clf.best_score_ > lgb_m_REZ and clf.best_score_ > max_score:
            max_score = clf.best_score_
            g_best = g
            i_best = i
            print('Best score found so far: ', max_score, 'i ', i_best, 'g ', g_best)
        i += 1
    g += 1
For example, I get values that give a better result than the first run of the model:
{'learning_rate': 0.3, 'max_features': 'sqrt', 'random_state': 25}
g = 775
i = 25
I substitute them like this:
X_train3, X_test3, y_train3, y_test3 = train_test_split(X, y, test_size=0.33, random_state=775)
lgb_m1 = GradientBoostingClassifier(max_features='sqrt', learning_rate=0.3, random_state=25)
lgb_m1.fit(X_train3, y_train3)
print(lgb_m_REZ)
res3 = lgb_m1.predict(X_test3)
print('Share of correctly predicted values: ', accuracy_score(y_test3, res3))
But the resulting score is different. Where is my mistake? What am I doing wrong?
1 answer
It looks like you don't quite understand why and how the random_state parameter is used in Scikit-Learn.
This parameter exists solely so that repeated runs of the same command with the same input parameters produce identical results. In other words, it makes functions and class methods that rely on a random number generator deterministic.
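To illustrate the point above, here is a minimal sketch (with toy data invented for the example) showing that a fixed random_state reproduces exactly the same split on every call:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data just for illustration: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state yields the identical split, run after run.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

print((X_test_a == X_test_b).all())  # the two splits match exactly
```

Changing the seed (or omitting it) would generally select a different test subset, which is exactly why scores computed on splits with different random_state values need not agree.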
Tuning random_state to squeeze out a slightly better score when splitting the sample or training the model makes no sense. After all, the model is trained to predict values for unseen data, and a random_state chosen for a particular train/test split does not guarantee the best result on an unknown data sample.
So there is no point wasting resources on picking random_state - it is only used to reproduce results.
It is better to tune the model's real hyperparameters instead, for example n_estimators, criterion, min_samples_split, min_samples_leaf, max_depth, etc.
Sometimes, to reproduce a result, you also have to set np.random.seed() explicitly.
I would rewrite your code as follows:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

random_state = 123
np.random.seed(random_state)

X_train2, X_test2, y_train2, y_test2 = \
    train_test_split(X, y, test_size=0.33, random_state=random_state)

param_grid = {
    'max_features': ['auto', 'sqrt', 'log2'],
    'learning_rate': [0.05, 0.1, 0.2, 0.3],
    'n_estimators': [50, 100, 250, 500],
    'criterion': ['friedman_mse', 'mse', 'mae'],
}

svc = GradientBoostingClassifier(random_state=random_state)
# Note: GridSearchCV itself has no random_state parameter; determinism
# comes from the estimator's random_state and the default (unshuffled) CV folds.
clf = GridSearchCV(svc, param_grid)
clf.fit(X_train2, y_train2)
After that, use the best parameters found, or simply take the model that has already been refit with them, clf.best_estimator_:
clf.best_estimator_.score(X_test2, y_test2)