How can I encode categorical features containing NaN without adding a new category?

For example, a feature that takes the values {'Male', 'Female', NaN} should be translated by OneHotEncoder (or some other means) into two numerical features, encoded as follows:

         f1      f2  
Male      1       0
Female    0       1
NaN       0       0

Another example:
Source dataset:

           Gender        City
Person1      Male      Moscow
Person2       NaN       Kazan
Person3    Female     Saratov

The resulting dataset:

            f1(Male)   f2(Female)   f3(Moscow)   f4(Saratov)   f5(Kazan)
Person1            1            0            1             0           0
Person2            0            0            0             0           1
Person3            0            1            0             1           0

At the same time, this result should be obtained regardless of whether the data on which the encoder was fitted contained NaN in a given feature.

I need NaN to be treated not as a separate category but as the absence of any value.

The easiest way is to simply delete the columns related to the NaN values.


Let's say we have the following frame:

In [39]: df
         Gender     City
Person1    Male   Moscow
Person2     NaN    Kazan
Person3  Female  Saratov


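First replace NaN with a placeholder string ("N/A" here), then one-hot encode the frame:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = df.fillna("N/A")  # NaN becomes an ordinary, recognizable category
enc = OneHotEncoder(handle_unknown='ignore')

# Fitting on the raw values yields the generic names x0_..., x1_...
df_encoded = pd.DataFrame.sparse.from_spmatrix(
    enc.fit_transform(df.to_numpy()),
    columns=enc.get_feature_names_out())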

The result is the following DataFrame:

In [41]: df_encoded
   x0_Female  x0_Male  x0_N/A  x1_Kazan  x1_Moscow  x1_Saratov
0        0.0      1.0     0.0       0.0        1.0         0.0
1        0.0      0.0     1.0       1.0        0.0         0.0
2        1.0      0.0     0.0       0.0        0.0         1.0

Now we need to get rid of all the columns that end in "_N/A":

In [48]: mask = df_encoded.columns.str.contains(r"_N/A$")

In [49]: df_encoded = df_encoded.loc[:, ~mask]


In [50]: df_encoded
   x0_Female  x0_Male  x1_Kazan  x1_Moscow  x1_Saratov
0        0.0      1.0       0.0        1.0         0.0
1        0.0      0.0       1.0        0.0         0.0
2        1.0      0.0       0.0        0.0         1.0

You can also pass categories=[<list_of_categories>] together with handle_unknown='ignore' to sklearn.preprocessing.OneHotEncoder, but this takes more work, since you have to supply the list of unique (non-NaN) values for each column yourself.
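A minimal sketch of that variant, assuming a reasonably recent scikit-learn in which OneHotEncoder accepts NaN in its input. The category lists are written out by hand for the example frame; in practice they would have to be collected from the data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# object dtype keeps the strings and NaN side by side
df = pd.DataFrame(
    {"Gender": ["Male", np.nan, "Female"], "City": ["Moscow", "Kazan", "Saratov"]},
    dtype=object,
)

# One list of known values per column, in column order; NaN is deliberately left out
categories = [["Female", "Male"], ["Kazan", "Moscow", "Saratov"]]

# Anything not listed (including NaN) is encoded as all zeros for that column
enc = OneHotEncoder(categories=categories, handle_unknown="ignore")
encoded = enc.fit_transform(df).toarray()
```

Since NaN is not among the listed categories, no column is created for it and nothing has to be dropped afterwards.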

If you need an encoding in which no new columns appear when new values appear, you can use hashing encoding, also known as the hashing trick. In this case you specify the maximum number of columns in the encoded dataset in advance.
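One possible way to try this is scikit-learn's FeatureHasher. The sketch below is an illustration under a few assumptions: the "column=value" token format and the choice of 8 output columns are arbitrary, and NaN values are simply skipped so that they contribute nothing to the row.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame(
    {"Gender": ["Male", np.nan, "Female"], "City": ["Moscow", "Kazan", "Saratov"]}
)

# Build one list of "column=value" tokens per row, skipping NaN entirely
tokens = [
    [f"{col}={val}" for col, val in row.items() if pd.notna(val)]
    for _, row in df.iterrows()
]

# The output width is fixed up front; alternate_sign=False keeps all entries non-negative
hasher = FeatureHasher(n_features=8, input_type="string", alternate_sign=False)
hashed = hasher.transform(tokens).toarray()
```

The price for the fixed width is that different values can collide in the same column, and the columns no longer have readable names.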

PS Brief explanation of the "hashing trick" algorithm

