How can I encode categorical features containing NaN without adding a new category?

For example, given a feature that takes the values {'Male', 'Female', NaN}, I want OneHotEncoder (or some other tool) to turn it into two numerical features, encoded as follows:

         f1      f2  
Male      1       0
Female    0       1
NaN       0       0

Another example:
Source dataset:

           Gender        City
Person1      Male      Moscow
Person2       NaN       Kazan
Person3    Female     Saratov

The resulting dataset:

            f1(Male)   f2(Female)   f3(Moscow)   f4(Saratov)   f5(Kazan)
Person1            1            0            1             0           0
Person2            0            0            0             0           1
Person3            0            1            0             1           0

Such a dataset should be produced regardless of whether the data the encoder was fitted on contained NaN in any of the categorical features.

Author: 0xdb, 2019-11-18

2 answers

I need NaN to be treated not as a separate attribute value, but as the absence of any value

The easiest way is to encode NaN as a placeholder category and then simply drop the columns that correspond to it.

Example:

Let's say we have the following frame:

In [39]: df
Out[39]:
         Gender     City
Person1    Male   Moscow
Person2     NaN    Kazan
Person3  Female  Saratov

Solution:

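import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')

# Replace NaN with the placeholder 'N/A' so it gets its own (temporary) column.
# Passing a NumPy array keeps the generic x0_/x1_ column prefixes.
# get_feature_names_out() needs scikit-learn >= 1.0; older releases
# used get_feature_names(), which was removed in 1.2.
df_encoded = pd.DataFrame.sparse.from_spmatrix(
    enc.fit_transform(df.fillna('N/A').to_numpy()),
    columns=enc.get_feature_names_out()
)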

The result is the following DataFrame:

In [41]: df_encoded
Out[41]:
   x0_Female  x0_Male  x0_N/A  x1_Kazan  x1_Moscow  x1_Saratov
0        0.0      1.0     0.0       0.0        1.0         0.0
1        0.0      0.0     1.0       1.0        0.0         0.0
2        1.0      0.0     0.0       0.0        0.0         1.0

Now we need to get rid of all the columns that end in "_N/A":

In [48]: mask = df_encoded.columns.str.contains(r"_N/A$")

In [49]: df_encoded = df_encoded.loc[:, ~mask]

Result:

In [50]: df_encoded
Out[50]:
   x0_Female  x0_Male  x1_Kazan  x1_Moscow  x1_Saratov
0        0.0      1.0       0.0        1.0         0.0
1        0.0      0.0       1.0        0.0         0.0
2        1.0      0.0       0.0        0.0         1.0

You can also use the categories=[<list_of_categories>] and handle_unknown='ignore' parameters of sklearn.preprocessing.OneHotEncoder, but this is more laborious, since you have to spell out the list of valid values for every column as its categories.
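A minimal sketch of that second approach, assuming the valid values of every column are known in advance. The sentinel string `__missing__` and the helper `encode` are illustrative names, not part of scikit-learn. NaN is replaced by the sentinel, which is deliberately absent from `categories`, so `handle_unknown='ignore'` encodes it as all zeros. Here one list per column is passed to a single encoder, and the column names are built by hand to avoid depending on a particular scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

MISSING = "__missing__"  # hypothetical sentinel; must not be a real category

# Valid values for each column, in column order (assumed known in advance).
columns = ["Gender", "City"]
categories = [["Female", "Male"], ["Kazan", "Moscow", "Saratov"]]

enc = OneHotEncoder(categories=categories, handle_unknown="ignore")


def encode(frame):
    """One-hot encode `frame`; NaN and unseen values become all-zero groups."""
    filled = frame[columns].fillna(MISSING)
    dense = enc.transform(filled).toarray()
    names = [f"{col}_{cat}" for col, cats in zip(columns, categories) for cat in cats]
    return pd.DataFrame(dense, index=frame.index, columns=names)


# The training data here contains no NaN (and no 'Kazan') at all.
train = pd.DataFrame({"Gender": ["Male", "Female"], "City": ["Moscow", "Saratov"]})
enc.fit(train[columns])

df = pd.DataFrame(
    {"Gender": ["Male", np.nan, "Female"], "City": ["Moscow", "Kazan", "Saratov"]},
    index=["Person1", "Person2", "Person3"],
)
df_encoded = encode(df)
print(df_encoded)
```

Because the output columns come from `categories` rather than from the fitted data, the layout stays the same whether or not the training set contained NaN.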

Author: MaxU, 2019-11-18 20:00:29

If you need to encode the attribute values so that no new columns appear when new values show up, you can use hashing encoding, also known as the hashing trick. In this case you fix the number of columns of the output (encoded) dataset in advance.

PS Brief explanation of the "hashing trick" algorithm
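As a rough illustration of the hashing approach, here is a sketch using scikit-learn's `FeatureHasher`. The `column=value` token format and the width of 8 columns are assumptions made for this example, not something the answer prescribes. Cells holding NaN simply produce no token, so they contribute nothing to the row.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame(
    {"Gender": ["Male", np.nan, "Female"], "City": ["Moscow", "Kazan", "Saratov"]},
    index=["Person1", "Person2", "Person3"],
)

# One "column=value" token per non-missing cell; NaN cells produce no token.
tokens = [
    [f"{col}={val}" for col, val in row.items() if pd.notna(val)]
    for _, row in df.iterrows()
]

# n_features fixes the output width up front; 8 is an arbitrary choice here.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = pd.DataFrame(hasher.transform(tokens).toarray(), index=df.index)
print(hashed)
```

Keep in mind that distinct values can collide in the same column, and that `FeatureHasher` by default writes signed values (+1 or -1), so the result is not a strict one-hot matrix and the columns cannot be mapped back to category names.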

Author: MaxU, 2019-11-18 19:33:22