Error: Found input variables with inconsistent numbers of samples

I am training a RandomForestClassifier.

Here is the code:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.8,
                                                    random_state=241)

RFC = RandomForestClassifier(n_estimators=37, random_state=241)
RFC.fit(X_train, y_train)

scor_test = []
for predict in RFC.predict_proba(X_test):
    x_scor = log_loss(y_test, predict)
    scor_test.append(x_scor)

After executing the last block, an error occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-152-01347a72f1da> in <module>
      1 scor_test = []
      2 for predict in RFC.predict_proba(X_test):
----> 3     x_scor = log_loss(y_test, predict)
      4     scor_test.append(x_scor)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
   1762     """
   1763     y_pred = check_array(y_pred, ensure_2d=False)
-> 1764     check_consistent_length(y_pred, y_true, sample_weight)
   1765 
   1766     lb = LabelBinarizer()

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    233     if len(uniques) > 1:
    234         raise ValueError("Found input variables with inconsistent numbers of"
--> 235                          " samples: %r" % [int(l) for l in lengths])
    236 
    237 

ValueError: Found input variables with inconsistent numbers of samples: [2, 3001]

Where did I go wrong?

Additional information:

y_test.shape - (3001,)
RFC.predict_proba(X_test).shape - (3001, 2)

Maybe the problem is the dimensions of the matrices?
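A quick sketch of the shapes involved (inside the loop, predict is a single row of the probability matrix, which is where the [2, 3001] in the error comes from):

proba = RFC.predict_proba(X_test)
print(proba.shape)     # (3001, 2) -- one row of class probabilities per sample
print(proba[0].shape)  # (2,)      -- what `predict` is on each loop iteration
print(y_test.shape)    # (3001,)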

Author: 0xdb, 2019-07-02

1 answer

Try it like this:

In [6]: X_train.shape
Out[6]: (750, 1776)

In [7]: RFC = RandomForestClassifier(n_estimators=37, random_state=241)
   ...: RFC.fit(X_train, y_train)
   ...:
Out[7]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=37, n_jobs=None,
            oob_score=False, random_state=241, verbose=0, warm_start=False)

In [8]: predicted = RFC.predict(X_test)

In [9]: loss = log_loss(y_test, predicted)

In [10]: loss
Out[10]: 9.27641427545646

P.S. This answer shows how to get rid of the error described in the question, but it is not clear from the question what the author originally wanted to do. Why compute the logistic loss, and in a loop at that...
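If the goal was simply the overall log-loss on the test set, a sketch of the more usual approach (assuming that is what was intended) is to pass the whole probability matrix from predict_proba to log_loss instead of looping over its rows:

from sklearn.metrics import log_loss

# log_loss accepts the full (n_samples, n_classes) array of probabilities,
# so no loop is needed
proba = RFC.predict_proba(X_test)
loss = log_loss(y_test, proba)

This yields a single log-loss value for the whole test set.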


Let's also check the accuracy of the model on the test set:

In [11]: RFC.score(X_test, y_test)
Out[11]: 0.7314228590469843
Author: MaxU, 2019-07-02 20:13:55