Python, ML: preprocessing data at prediction time

The training data goes through preprocessing (normalization, standardization, etc.). However, the predict module for the same ML model is separate, and it has no way to preprocess the prediction data using the values computed from the training dataset (mean and standard deviation for a standard scaler, the maximum absolute value for a max-abs scaler, etc.). At the moment I train the model on standardized data but predict on non-standardized data. Question: what is the best way to proceed from a best-practice point of view? And in general, is training the model on standardized data but predicting on non-standardized data a significant error?

Author: imitusov, 2021-02-04

4 answers

"And in general, is training the model on standardized data but predicting on non-standardized data a significant error?" Yes, in general this is an error that will keep you from getting meaningful results.

And yes, if you train on standardized data and then feed the model standardized data, the result will most often be better than if you train on and use non-standardized data. But that has nothing to do with the scheme "train on standardized, serve non-standardized", which is always flawed.

If you are (for example) using a third-party module that performs the predict step, then obviously it should either have its own means of standardizing your data or give you a method that accepts already-standardized data. Otherwise the whole setup is very strange and unreliable.
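One way to read this advice is to keep the training-set statistics inside the object that does the predicting, so callers always pass raw values. Below is a minimal standard-library sketch of that idea; the class `StandardizedLinearModel` and the toy data are hypothetical, not taken from the question. In scikit-learn, a `Pipeline` combining `StandardScaler` with an estimator plays the same role.

```python
from statistics import fmean, pstdev


class StandardizedLinearModel:
    """Hypothetical one-feature model that carries its own scaling statistics."""

    def fit(self, xs, ys):
        # Scaling statistics are computed from the training data only.
        self.mean_ = fmean(xs)
        self.std_ = pstdev(xs) or 1.0  # guard against a constant feature
        zs = [self._scale(x) for x in xs]
        # Ordinary least squares on the standardized feature.
        z_mean, y_mean = fmean(zs), fmean(ys)
        self.slope_ = (
            sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys))
            / sum((z - z_mean) ** 2 for z in zs)
        )
        self.intercept_ = y_mean - self.slope_ * z_mean
        return self

    def _scale(self, x):
        return (x - self.mean_) / self.std_

    def predict(self, xs):
        # Callers pass raw values; standardization happens here.
        return [self.intercept_ + self.slope_ * self._scale(x) for x in xs]


train_x = [100.0, 200.0, 300.0, 400.0]
train_y = [1.0, 2.0, 3.0, 4.0]  # toy relationship: y = x / 100

model = StandardizedLinearModel().fit(train_x, train_y)

# Correct: raw input goes through the stored statistics.
correct = model.predict([250.0])[0]

# The flawed scheme: the raw value is plugged straight into coefficients
# that were fitted on standardized inputs.
flawed = model.intercept_ + model.slope_ * 250.0

print(correct, flawed)
```

On this toy data the correct prediction is close to 2.5, while the flawed one is off by two orders of magnitude.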

 1
Author: passant, 2021-02-04 11:08:10

There are too many unknowns in your problem statement:

  1. The data may by its nature already be close to standardized, in which case it makes little difference whether it is standardized before being fed to the model.
  2. The model in use also matters. Trees tie their splits to specific intervals of input values, so feeding processed and raw data to the same tree-based model can give radically different results. With simple linear regression, the result will be shifted and the slope of the regression line will be off, but in principle this can be corrected in post-processing by adjusting the parameters.
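The linear-regression remark in item 2 can be made concrete. If the model was fitted on z = (x - mu) / sigma, the scaling can be folded into the coefficients so that the model accepts raw x directly: the raw-input slope is slope_z / sigma and the raw-input intercept is intercept_z - slope_z * mu / sigma. A minimal sketch with one feature, where the helper `fit_line` and the toy data are hypothetical:

```python
from statistics import fmean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    return slope, y_mean - slope * x_mean


train_x = [10.0, 20.0, 30.0, 40.0, 50.0]
train_y = [3.0, 5.0, 7.0, 9.0, 11.0]  # toy relationship: y = 0.2 * x + 1

# Train on the standardized feature, as in the question.
mu, sigma = fmean(train_x), pstdev(train_x)
train_z = [(x - mu) / sigma for x in train_x]
slope_z, intercept_z = fit_line(train_z, train_y)

# Fold the scaling into the parameters so that raw inputs can be used.
slope_raw = slope_z / sigma
intercept_raw = intercept_z - slope_z * mu / sigma

raw_input = 35.0
via_scaling = intercept_z + slope_z * (raw_input - mu) / sigma
via_folded = intercept_raw + slope_raw * raw_input

print(via_scaling, via_folded)
```

Both paths give the same prediction, but note that the correction still requires mu and sigma from the training set, so it has to be done on the training side before the model is handed over.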

Naturally, the only advice I can give if you can't control the data processing for predict is to train the model on raw data. Then the result will be the most relevant. In any case, you should really study the data: see how processing affects it, how the model learns from it, and so on. As long as this is spherical data in a vacuum, one can only guess how your model behaves under such conditions.

 1
Author: CrazyElf, 2021-02-04 13:21:40

I will answer only the first question: one option is to use ML algorithms that are invariant to this kind of preprocessing (feature scaling), for example tree-based algorithms: DecisionTree, RF, boosting.
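To illustrate why tree-based models tolerate feature scaling, here is a minimal sketch of a one-split regression tree (a stump) written with the standard library only. The functions `fit_stump` and `predict_stump` and the toy data are hypothetical. The stump picks the same split whether it is trained on raw or standardized values, so scaling can simply be skipped. Mixing the two (training on standardized data, predicting on raw data) still breaks it.

```python
from statistics import fmean, pstdev


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        l_mean, r_mean = fmean(left), fmean(right)
        error = (sum((y - l_mean) ** 2 for y in left)
                 + sum((y - r_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (error, threshold, l_mean, r_mean)
    return best[1:]  # (threshold, left mean, right mean)


def predict_stump(stump, xs):
    threshold, l_mean, r_mean = stump
    return [l_mean if x <= threshold else r_mean for x in xs]


train_x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
train_y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
mu, sigma = fmean(train_x), pstdev(train_x)


def scale(xs):
    # Standardize with statistics taken from the training set.
    return [(x - mu) / sigma for x in xs]


raw_stump = fit_stump(train_x, train_y)
scaled_stump = fit_stump(scale(train_x), train_y)

new_x = [2.5, 4.0, 9.0, 11.5]
raw_preds = predict_stump(raw_stump, new_x)               # raw train, raw predict
scaled_preds = predict_stump(scaled_stump, scale(new_x))  # scaled train, scaled predict
mixed_preds = predict_stump(scaled_stump, new_x)          # scaled train, raw predict

print(raw_preds, scaled_preds, mixed_preds)
```

The first two prediction lists agree; the mixed one does not, because the threshold learned in standardized units is compared against raw values.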

 0
Author: imitusov, 2021-02-04 09:12:06

Thank you very much for your answers. So: 1) making predictions on non-standardized data when the model was trained on standardized data is definitely not a good idea; 2) the fact that in this case I got a better score than when using a non-standardized dataset for both train and predict was apparently caused by the relatively small spread in the values of the source data. At the moment I am considering two equally acceptable solutions: either abandon standardization altogether, or transfer the scaling coefficients from one isolated module to the other through the database (there is no other way in my case).
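The second option, passing the scaling coefficients through the database, could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the table name `scaler_params`, the `model_version` key, the feature names, and the in-memory SQLite connection that stands in for the real shared database. With scikit-learn's `StandardScaler`, the same numbers are available after fitting as the `mean_` and `scale_` attributes.

```python
import json
import sqlite3
from statistics import fmean, pstdev

# --- training module: compute the statistics and store them ---
train_columns = {
    "age": [20.0, 30.0, 40.0],
    "income": [1000.0, 2000.0, 6000.0],
}
scaler_params = {
    name: {"mean": fmean(values), "std": pstdev(values) or 1.0}
    for name, values in train_columns.items()
}

conn = sqlite3.connect(":memory:")  # stand-in for the shared database
conn.execute(
    "CREATE TABLE scaler_params (model_version TEXT PRIMARY KEY, params TEXT)"
)
conn.execute(
    "INSERT INTO scaler_params VALUES (?, ?)",
    ("v1", json.dumps(scaler_params)),
)
conn.commit()

# --- predict module: load the statistics and apply them to raw input ---
row = conn.execute(
    "SELECT params FROM scaler_params WHERE model_version = ?", ("v1",)
).fetchone()
loaded = json.loads(row[0])


def standardize(record, params):
    """Scale one raw record with the statistics saved at training time."""
    return {
        name: (value - params[name]["mean"]) / params[name]["std"]
        for name, value in record.items()
    }


scaled = standardize({"age": 40.0, "income": 3000.0}, loaded)
print(scaled)
```

Keying the stored coefficients by model version keeps a retrained model from being paired with statistics from an older training run.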

 0
Author: derdentyler, 2021-02-05 08:12:35