cross validation

The cross-validation in the program code below seems to give wrong results — please help me fix it.

# NOTE(review): original (buggy) script from the question — kept as-is to match the answer below.
import numpy as np
from pandas import DataFrame
import pandas as pd
import warnings 
from sklearn import cross_validation  # NOTE(review): deprecated module, removed in sklearn 0.20 — use sklearn.model_selection
warnings.simplefilter('ignore') # suppress Anaconda warnings
data = pd.read_csv('C:\\Users\\Vika\\Downloads\\ENB2012.csv', ';')  # ';'-separated ENB2012 dataset
data.head()
from sklearn.cross_validation import train_test_split, cross_val_score
kfold = 5  # number of folds for cross-validation
itog_val = {}  # dict collecting the mean CV score of each algorithm
X = data.values[::, 0:8]  # BUG: columns 0..7 of the frame — the target column is included here
y = data.values[::, 0:1]  # target = column 0, which is also present in X -> data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print ('обучающая выборка:\n', X_train[:9])
print ('\n')
print ('тестовая выборка:\n', X_test[:7])
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=70) 
# Scales/rounds the continuous target into integer labels so the classifier accepts it.
scores = cross_validation.cross_val_score(clf, X_train, (y_train.ravel()*1000).astype(int), cv=kfold)
itog_val['AdaBoostClassifier'] = scores.mean()
print ('итог', itog_val)  # always 1.0 because the classifier can read the leaked target out of X
clf.fit(X_train, y_train) 
clf.score(X_test, y_test) 
clf.predict(X_test) 
print ('AdaBoostClassifier:\n', X_test[:9])

The output is always the same: {'AdaBoostClassifier': 1.0}

Original selection https://ru.files.fm/u/aempdy95

Author: user280357, 2018-01-13

1 answers

The "target" vector that you are trying to predict (column Y1) is also present in the dataset you are predicting from. The classifier apparently noticed this one-to-one correspondence and always makes correct predictions from the known column (the first column of X is Y1).

It's as if we wanted to predict a person's gender (М, Ж) from height, weight, and the already-known gender:

X1   X2  X3 -> Y1
190  98  M  ->  M
160  50  Ж  ->  Ж

The classifier may notice that for predictions it is enough to use only the last column (X3) and, voila, you always have predictions with 100% accuracy...

Try this way:

import numpy as np
import pandas as pd
# Both helpers live in sklearn.model_selection; importing train_test_split from
# the deprecated sklearn.cross_validation breaks on sklearn >= 0.20.
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import AdaBoostClassifier

# Load the ';'-separated ENB2012 dataset (pass the separator as the `sep`
# keyword — the positional form is deprecated in pandas).
data = pd.read_csv(r'D:\download\ENB2012.csv', sep=';')
data.head()

kfold = 5  # number of folds for cross-validation
itog_val = {}  # dict collecting the mean CV score of each algorithm

# The target column must NOT be part of the feature matrix — otherwise the
# classifier simply reads the answer off the leaked column and scores 1.0.
X = data.drop('Y1', axis=1).values[:, :7]
y = data['Y1'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('обучающая выборка:\n', X_train[:9])
print('\n')
print('тестовая выборка:\n', X_test[:7])

clf = AdaBoostClassifier(n_estimators=70)
# 5-fold cross-validation on the training part only; the held-out test set
# stays untouched until the final clf.score / clf.predict below.
scores = cross_val_score(clf, X_train, y_train, cv=kfold)
itog_val['AdaBoostClassifier'] = scores.mean()
print('итог', itog_val)

clf.fit(X_train, y_train)
clf.score(X_test, y_test)
y_test_predicted = clf.predict(X_test)
print('AdaBoostClassifier first 9 predicted values:\n', y_test_predicted[:9])

Result:

обучающая выборка:
 [[   2   60    3    3 7418    5    3]
 [   2   18    4    2 7374    1    1]
 [   1   12    4    0 3499    1    3]
 [   2   48    3    6 6224    1    5]
 [   4   21    3    2 1591    2    4]
 [   2   18    2    2 1924    5    2]
 [   4   24    2    9 1258    1    4]
 [   1    6    2    6  448    1    2]
 [   4   18    4    1 3850    1    4]]


тестовая выборка:
 [[   1   18    0    9 3104    1    4]
 [   2   36    2    1 9398    1    2]
 [   2    9    4    6 1136    4    5]
 [   2   24    2    0 1201    1    2]
 [   4   12    4    3 1240    5    5]
 [   4    6    1    0  783    5    3]
 [   1   16    4    0 2625    1    5]]
итог {'AdaBoostClassifier': 0.67554019014693167}
AdaBoostClassifier first 9 predicted values:
 [0 0 1 0 1 0 0 0 0]

UPDATE:

To see clearly which predictions are correct, you can build a DataFrame with a new column YP holding the predicted values next to the actual ones:

In [123]: res = pd.DataFrame(np.column_stack((y_test, y_test_predicted, X_test)),
     ...:                    columns=['Y1','YP'] + data.columns[1:8].tolist())
     ...:

In [124]: res
Out[124]:
     Y1  YP  X1  X2  X3  X4    X5  X6  X7
0     0   1   1  18   2   1  7511   5   5
1     1   0   1  42   4   5  3394   1   1
2     1   1   4  42   4   2  4041   3   3
3     0   1   1  12   2   3   709   1   5
4     0   1   4  24   2   9  4591   4   3
5     0   0   1  20   4   0  2235   1   3
6     1   1   1  24   4   3  1231   4   5
7     0   1   1  12   2   1  3386   1   5
8     1   1   4  24   2   3  1278   1   5
9     0   0   1  18   2   3  1882   1   3
..   ..  ..  ..  ..  ..  ..   ...  ..  ..
102   1   1   1  21   2   2  3599   1   4
103   1   1   4  18   4   3  2238   1   3
104   0   1   1  60   2   9  7297   1   5
105   0   0   2  12   2   2   951   2   2
106   1   0   4  10   2   6   727   3   5
107   0   0   2  12   2   3  1534   1   2
108   1   0   2  26   2   1  7966   1   2
109   1   1   4  18   2   3  1126   5   2
110   1   0   2  24   2   0  1201   1   2
111   1   1   4   6   2   3   518   1   3

[112 rows x 9 columns]

UPDATE2:

To see all the rows, you can save res to Excel file:

res.to_excel(r'/path/to/test_data_set_and_prediction.xlsx', index=False)
 2
Author: MaxU, 2018-01-13 14:29:13