What is the difference between Label Encoding and One-Hot Encoding (or Creating Dummies)? Which to apply when?
Let’s start with the explanation. Before modeling, the data must be pre-processed so the model can achieve the best possible performance. One situation we often face in the pre-processing phase is the conversion of Categorical data into Numeric: a mathematical model (simply put, the computer’s brain) can only understand numbers.
Aha! Now let’s look at the difference between Label Encoding and One-Hot Encoding (or Creating Dummies) and when to apply each! As we remember from statistics, Categorical data can be divided into two groups: Nominal and Ordinal. Examples of Nominal features are “Gender”, “Hair color”, and “Country”; examples of Ordinal features are “Position (Junior, Middle, Senior)”, “Education (BS, MS, PhD)”, and “Satisfaction level (Bad, Average, Good, Excellent)”. As the examples show, Ordinal values have a natural ranking, while Nominal values do not, and imposing one on them would be artificial!
If our data is Ordinal and the count of unique values is large, apply Label Encoding, which also prevents the creation of a large number of features, saves memory, and does not increase the complexity of the model. Conversely, if the data is Nominal and the count of unique values is low, apply One-Hot Encoding (or Creating Dummies).
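A minimal sketch of both approaches, assuming a hypothetical toy DataFrame with an Ordinal “education” column and a Nominal “gender” column (the column names and category order are illustrative, not from the article):

```python
import pandas as pd

# Hypothetical toy data: "education" is Ordinal, "gender" is Nominal
df = pd.DataFrame({
    "education": ["BS", "PhD", "MS", "BS"],
    "gender": ["Male", "Female", "Female", "Male"],
})

# Label (ordinal) encoding: map each ranked category to an integer,
# preserving the natural order BS < MS < PhD
education_order = {"BS": 0, "MS": 1, "PhD": 2}
df["education_encoded"] = df["education"].map(education_order)

# One-Hot Encoding for the Nominal column: one dummy column per category
df = pd.get_dummies(df, columns=["gender"])
print(df.columns.tolist())
```

Note that the integer mapping is written explicitly here so the ranking is under our control; an automatic encoder that assigns integers alphabetically could accidentally scramble the Ordinal order.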
We can also create dummy features using “pd.get_dummies()”; it is convenient to apply and provides access to important parameters. For example, the “drop_first” parameter deletes the first dummy feature, reducing the total number of dummy features by one. The reduction by one is related to Degrees of Freedom: the dropped category can be fully inferred from the remaining dummies, so keeping it adds redundant information, and removing it prevents Multicollinearity. I will talk about this in detail in another article, stay tuned! :)
If you liked the article or have any ideas, we would be glad if you like the post and leave a comment. Thanks!