Cross-validation
The cross-validation score computed by the code below seems wrong; please help me fix it.
import numpy as np
from pandas import DataFrame
import pandas as pd
import warnings
from sklearn import cross_validation
warnings.simplefilter('ignore') # suppress Anaconda warnings
data = pd.read_csv('C:\\Users\\Vika\\Downloads\\ENB2012.csv', ';')
data.head()
from sklearn.cross_validation import train_test_split, cross_val_score
kfold = 5 # number of folds for cross-validation
itog_val = {} # dict for storing the cross-validation results of different algorithms
X = data.values[::, 0:8]
y = data.values[::, 0:1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print ('training set:\n', X_train[:9])
print ('\n')
print ('test set:\n', X_test[:7])
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=70)
scores = cross_validation.cross_val_score(clf, X_train, (y_train.ravel()*1000).astype(int), cv=kfold)
itog_val['AdaBoostClassifier'] = scores.mean()
print ('total', itog_val)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
clf.predict(X_test)
print ('AdaBoostClassifier:\n', X_test[:9])
The output is always only this:
total {'AdaBoostClassifier': 1.0}
Original dataset: https://ru.files.fm/u/aempdy95
1 answer
You have the "target" vector that you are trying to predict (column Y1) also present in the dataset that you are making the prediction from. The classifier apparently noticed this one-to-one correspondence and always makes correct predictions from a known column (the first column in X is Y1).
It's as if we wanted to predict a person's gender (M, F) by height, weight, and an already-known gender:
X1  X2  X3 -> Y1
190 98  M  -> M
160 50  F  -> F
The classifier may notice that it is enough to use only the last column (X3) for predictions and, voila, you always get predictions with 100% accuracy...
Try this way:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import AdaBoostClassifier
#import warnings
#from sklearn import cross_validation
#warnings.simplefilter('ignore') # suppress Anaconda warnings
data = pd.read_csv(r'D:\download\ENB2012.csv', ';')
data.head()
kfold = 5 # number of folds for cross-validation
itog_val = {} # dict for storing the cross-validation results of different algorithms
X = data.drop('Y1', axis=1).values[:, :7] # the target column must not end up in X
y = data['Y1'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print ('training set:\n', X_train[:9])
print ('\n')
print ('test set:\n', X_test[:7])
clf = AdaBoostClassifier(n_estimators=70)
scores = cross_val_score(clf, X_train, y_train, cv=kfold)
itog_val['AdaBoostClassifier'] = scores.mean()
print ('total', itog_val)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
y_test_predicted = clf.predict(X_test)
print ('AdaBoostClassifier first 9 predicted values:\n', y_test_predicted[:9])
Result:
training set:
[[ 2 60 3 3 7418 5 3]
[ 2 18 4 2 7374 1 1]
[ 1 12 4 0 3499 1 3]
[ 2 48 3 6 6224 1 5]
[ 4 21 3 2 1591 2 4]
[ 2 18 2 2 1924 5 2]
[ 4 24 2 9 1258 1 4]
[ 1 6 2 6 448 1 2]
[ 4 18 4 1 3850 1 4]]
test set:
[[ 1 18 0 9 3104 1 4]
[ 2 36 2 1 9398 1 2]
[ 2 9 4 6 1136 4 5]
[ 2 24 2 0 1201 1 2]
[ 4 12 4 3 1240 5 5]
[ 4 6 1 0 783 5 3]
[ 1 16 4 0 2625 1 5]]
total {'AdaBoostClassifier': 0.67554019014693167}
AdaBoostClassifier first 9 predicted values:
[0 0 1 0 1 0 0 0 0]
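A note on the cv=kfold argument: passing a plain integer lets scikit-learn pick the splitter for you (stratified, unshuffled folds for a classifier). As a sketch with made-up stand-in data (the synthetic X_train/y_train below are illustrative, not the question's dataset), the folds can be made explicit with KFold, which also allows shuffling with a fixed seed so the score is reproducible:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 7))
y_train = (X_train[:, 0] > 0).astype(int)

# Explicit 5-fold splitter with shuffling and a fixed seed, so repeated
# runs assign samples to the same folds and give the same scores
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(AdaBoostClassifier(n_estimators=70), X_train, y_train, cv=cv)
print(scores)        # one accuracy value per fold
print(scores.mean())
```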
UPDATE:
To show clearly which predictions "worked", you can build a DataFrame with a new column YP holding the predicted values:
In [123]: res = pd.DataFrame(np.column_stack((y_test, y_test_predicted, X_test)),
...: columns=['Y1','YP'] + data.columns[1:8].tolist())
...:
In [124]: res
Out[124]:
Y1 YP X1 X2 X3 X4 X5 X6 X7
0 0 1 1 18 2 1 7511 5 5
1 1 0 1 42 4 5 3394 1 1
2 1 1 4 42 4 2 4041 3 3
3 0 1 1 12 2 3 709 1 5
4 0 1 4 24 2 9 4591 4 3
5 0 0 1 20 4 0 2235 1 3
6 1 1 1 24 4 3 1231 4 5
7 0 1 1 12 2 1 3386 1 5
8 1 1 4 24 2 3 1278 1 5
9 0 0 1 18 2 3 1882 1 3
.. .. .. .. .. .. .. ... .. ..
102 1 1 1 21 2 2 3599 1 4
103 1 1 4 18 4 3 2238 1 3
104 0 1 1 60 2 9 7297 1 5
105 0 0 2 12 2 2 951 2 2
106 1 0 4 10 2 6 727 3 5
107 0 0 2 12 2 3 1534 1 2
108 1 0 2 26 2 1 7966 1 2
109 1 1 4 18 2 3 1126 5 2
110 1 0 2 24 2 0 1201 1 2
111 1 1 4 6 2 3 518 1 3
[112 rows x 9 columns]
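With such a comparison frame at hand, the share of correct predictions can be computed directly by comparing the two columns. A minimal sketch with made-up values (not the rows from the table above):

```python
import pandas as pd

# Toy comparison frame in the same shape: actual Y1 vs predicted YP
res = pd.DataFrame({'Y1': [0, 1, 1, 0, 1, 0],
                    'YP': [1, 1, 1, 0, 0, 0]})

accuracy = (res['Y1'] == res['YP']).mean()  # fraction of rows where the prediction matches
print(accuracy)                             # 4 of 6 rows match -> ~0.667
```

This is the same number clf.score(X_test, y_test) reports, just computed by hand from the saved predictions.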
UPDATE2:
To see all the rows, you can save res to an Excel file:
res.to_excel(r'/path/to/test_data_set_and_prediction.xlsx', index=False)