How can I encode categorical features containing NaN without adding a new category?
For example, a feature that takes the values {'Male', 'Female', NaN} should, when using OneHotEncoder (or some other means), be translated into two numerical features encoded as follows:
        f1  f2
Male     1   0
Female   0   1
NaN      0   0
Another example:
Source dataset:
         Gender  City
Person1  Male    Moscow
Person2  NaN     Kazan
Person3  Female  Saratov
The resulting dataset:
         f1(Male)  f2(Female)  f3(Moscow)  f4(Kazan)  f5(Saratov)
Person1  1         0           1           0          0
Person2  0         0           0           1          0
Person3  0         1           0           0          1
The same result should be obtained regardless of whether the data the encoder was trained on contained NaN in any of the columns.
2 answers
I need NaN not to be considered a separate attribute value, but to be considered the absence of any value
The easiest way is to simply delete the columns that correspond to the NaN values.
Example:
Let's say we have the following frame:
In [39]: df
Out[39]:
         Gender     City
Person1    Male   Moscow
Person2     NaN    Kazan
Person3  Female  Saratov
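Solution:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# NaN is replaced with the placeholder 'N/A' so that it gets its own column
enc = OneHotEncoder(handle_unknown='ignore')
df_encoded = pd.DataFrame.sparse.from_spmatrix(
    # a plain array is passed so the feature names keep the generic x0_, x1_ prefixes
    enc.fit_transform(df.fillna('N/A').to_numpy()),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=enc.get_feature_names_out()
)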
The result is the following DataFrame:
In [41]: df_encoded
Out[41]:
   x0_Female  x0_Male  x0_N/A  x1_Kazan  x1_Moscow  x1_Saratov
0        0.0      1.0     0.0       0.0        1.0         0.0
1        0.0      0.0     1.0       1.0        0.0         0.0
2        1.0      0.0     0.0       0.0        0.0         1.0
Now we need to get rid of all the columns that end in "_N/A":
In [48]: mask = df_encoded.columns.str.contains(r"_N/A$")
In [49]: df_encoded = df_encoded.loc[:, ~mask]
Result:
In [50]: df_encoded
Out[50]:
   x0_Female  x0_Male  x1_Kazan  x1_Moscow  x1_Saratov
0        0.0      1.0       0.0        1.0         0.0
1        0.0      0.0       1.0        0.0         0.0
2        1.0      0.0       0.0        0.0         1.0
You can also use the categories=[<list_of_categories>] and handle_unknown='ignore' parameters of sklearn.preprocessing.OneHotEncoder, but this takes more work, since you have to specify the list of allowed values for each column yourself. Any value that is not in the list, including NaN, is then encoded as all zeros.
If you need to encode attribute values so that no new columns appear when new values show up, you can use hashing encoding, also known as the hashing trick. In this case, you specify the maximum number of columns in the encoded dataset in advance.
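To show the idea behind the hashing trick, here is a minimal pure-Python sketch. The function `hash_encode`, the choice of md5 and the value `n_features=8` are assumptions made for this example, not part of any library. Missing values are simply skipped, so they contribute nothing to the output row.

```python
import hashlib
import math


def hash_encode(row, n_features=8):
    """Encode a {column: value} dict into a fixed-length count vector."""
    vec = [0] * n_features
    for column, value in row.items():
        # None/NaN means "no value": it does not touch any column
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        token = f"{column}={value}".encode("utf-8")
        # md5 is stable across runs, unlike the built-in hash() for strings
        bucket = int(hashlib.md5(token).hexdigest(), 16) % n_features
        vec[bucket] += 1
    return vec


rows = [
    {"Gender": "Male", "City": "Moscow"},
    {"Gender": float("nan"), "City": "Kazan"},
    {"Gender": "Female", "City": "Saratov"},
    {"Gender": "Female", "City": "Omsk"},  # unseen city, still 8 columns
]
encoded = [hash_encode(r) for r in rows]
```

The price for the fixed width is that different values can collide in the same column, and the columns are no longer interpretable. In practice you would more likely use a ready-made implementation such as sklearn.feature_extraction.FeatureHasher rather than writing your own.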