⭐ Encoding Categorical Variables
🔍 Why Encoding Matters
Most machine-learning models operate on numbers and cannot consume text categories directly. Encoding converts each category into numerical values, and the choice of encoding determines whether the model sees the real structure in the data or an artificial one, such as an order that does not exist.
1. Label Encoding
Assigns each category an integer. ✔ Best for ordinal features, where the integers follow the real order ❌ Risky for nominal data because the numbers imply an order that does not exist.
Example:
Small → 1
Medium → 2
Large → 3
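A minimal sketch of label encoding in plain Python. The helper name fit_label_map is hypothetical, and assigning codes in sorted order is an assumption made here for illustration. It shows why automatically assigned integers need not match the intended ranking.

```python
def fit_label_map(values):
    """Map each distinct category to an integer, in sorted (alphabetical) order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

sizes = ["Small", "Large", "Medium", "Small"]
label_map = fit_label_map(sizes)
encoded_sizes = [label_map[s] for s in sizes]

print(label_map)      # {'Large': 0, 'Medium': 1, 'Small': 2}
print(encoded_sizes)  # [2, 0, 1, 2]
```

Here Large receives a lower code than Small, so the integers contradict the real size order. When order matters, an explicit mapping is safer.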
2. One-Hot Encoding
Creates one binary (0/1) column per category. ✔ Removes order bias ❌ Produces very wide, sparse feature matrices for high-cardinality columns, which can trigger the curse of dimensionality.
Example:
Color = Red → Color_Red: 1, Color_Blue: 0, Color_Green: 0
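The Color example can be sketched in plain Python as follows. The function name one_hot and the list-of-dicts output are illustrative choices, not a library API.

```python
def one_hot(values, categories, prefix):
    """Return one dict per row with a 0/1 entry for every known category."""
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]

colors = ["Red", "Blue", "Green"]
rows = one_hot(["Red", "Green"], colors, prefix="Color")

print(rows[0])  # {'Color_Red': 1, 'Color_Blue': 0, 'Color_Green': 0}
print(rows[1])  # {'Color_Red': 0, 'Color_Blue': 0, 'Color_Green': 1}
```

Each extra category adds one more column, which is why this approach becomes expensive for columns with thousands of distinct values.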
3. Ordinal Encoding
Used when categories have a real ranked order. Example:
Beginner → 0
Intermediate → 1
Advanced → 2
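A minimal sketch of ordinal encoding with an explicit, hand-written order. The names LEVEL_ORDER and encode_levels are hypothetical. The key point is that the ranking comes from domain knowledge rather than from alphabetical sorting.

```python
# The order is stated explicitly, from lowest to highest.
LEVEL_ORDER = ["Beginner", "Intermediate", "Advanced"]
level_to_rank = {level: rank for rank, level in enumerate(LEVEL_ORDER)}

def encode_levels(values):
    """Replace each level with its rank; raises KeyError on an unexpected level."""
    return [level_to_rank[v] for v in values]

encoded_levels = encode_levels(["Advanced", "Beginner", "Intermediate"])
print(encoded_levels)  # [2, 0, 1]
```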
4. Target Encoding
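Replaces each category with the mean of the target variable over the rows in that category. ✔ Performs well in competitions ❌ Prone to target leakage, so apply smoothing and compute the encoding out-of-fold (with cross-validation) rather than on the same rows the model trains on.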
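A minimal sketch of smoothed target encoding, assuming a simple additive smoothing formula: (category sum + m × global mean) / (category count + m). The function name fit_target_encoding, the parameter name smoothing and the toy data are hypothetical. The sketch covers only the smoothing step; to limit leakage, the mapping would be fitted on training folds only and then applied to the held-out fold.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=2.0):
    """Return (smoothed mean per category, global mean)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Rare categories are pulled toward the global mean.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean

cities = ["A", "A", "A", "B", "B", "C"]
bought = [1, 1, 0, 0, 0, 1]
encoding, global_mean = fit_target_encoding(cities, bought, smoothing=2.0)

print({k: round(v, 3) for k, v in encoding.items()})
# {'A': 0.6, 'B': 0.25, 'C': 0.667}

# Categories never seen during fitting fall back to the global mean.
unseen_value = encoding.get("D", global_mean)
```

Category C appears once with target 1, so its raw mean would be 1.0. Smoothing shrinks it to about 0.667, which reduces the chance of memorizing a single row.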
5. Frequency Encoding
Encodes each category by how often it occurs. ✔ Helpful for high-cardinality features ✔ Works well with tree models
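A minimal sketch of frequency encoding using relative frequencies. The helper name fit_frequency_map and the sample data are hypothetical. Raw counts would work the same way.

```python
from collections import Counter

def fit_frequency_map(values):
    """Map each category to its share of the rows."""
    counts = Counter(values)
    total = len(values)
    return {cat: count / total for cat, count in counts.items()}

browsers = ["Chrome", "Chrome", "Firefox", "Chrome", "Safari"]
freq_map = fit_frequency_map(browsers)
freq_encoded = [freq_map[b] for b in browsers]

print(freq_map)      # {'Chrome': 0.6, 'Firefox': 0.2, 'Safari': 0.2}
print(freq_encoded)  # [0.6, 0.6, 0.2, 0.6, 0.2]
```

Note that Firefox and Safari end up with the same value: categories that occur equally often become indistinguishable under this encoding.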
6. Binary Encoding
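Assigns each category an integer, then writes that integer in binary with one column per bit, so N categories need only about log2(N) columns. It sits between one-hot encoding and hashing in compactness. ✔ Reduces dimensionality ✔ Efficient for large datasets.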
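A minimal sketch of binary encoding in plain Python. The function names and the choice to reserve code 0 for unknown categories are assumptions made for this illustration. Library implementations may number categories differently.

```python
def fit_binary_encoder(values):
    """Assign integer codes starting at 1 and compute how many bits they need."""
    categories = sorted(set(values))
    codes = {cat: i + 1 for i, cat in enumerate(categories)}  # 0 is kept for unknowns
    n_bits = max(1, len(categories).bit_length())  # bits needed for the largest code
    return codes, n_bits

def binary_encode(value, codes, n_bits):
    """Return the code of value as a list of bits, most significant bit first."""
    code = codes.get(value, 0)
    return [(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]

capitals = ["Paris", "Tokyo", "Lima", "Oslo", "Cairo"]
codes, n_bits = fit_binary_encoder(capitals)

print(n_bits)                                  # 3
print(binary_encode("Paris", codes, n_bits))   # [1, 0, 0]
print(binary_encode("Cairo", codes, n_bits))   # [0, 0, 1]
print(binary_encode("Berlin", codes, n_bits))  # [0, 0, 0]
```

Five categories fit into three columns here, where one-hot encoding would need five.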
Handling Unknown Categories
When deploying models, new categories may appear. Use:
handle_unknown="ignore" (scikit-learn's OneHotEncoder; unseen categories are encoded as all zeros)
Fallback bucket: "Other"
Keep consistent category maps from training.
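The options above can be sketched in plain Python. The names KNOWN_COLORS, one_hot_ignore_unknown and to_other_bucket are hypothetical. The all-zero row imitates the behaviour of handle_unknown="ignore", and the category list stands in for a map saved at training time.

```python
# Category map saved from training; reuse it unchanged at prediction time.
KNOWN_COLORS = ["Red", "Blue", "Green"]

def one_hot_ignore_unknown(value, categories=KNOWN_COLORS):
    """One-hot encode value; an unseen category yields an all-zero row."""
    return [int(value == cat) for cat in categories]

def to_other_bucket(value, categories=KNOWN_COLORS, fallback="Other"):
    """Replace any unseen category with a single fallback label."""
    return value if value in categories else fallback

print(one_hot_ignore_unknown("Red"))     # [1, 0, 0]
print(one_hot_ignore_unknown("Purple"))  # [0, 0, 0]
print(to_other_bucket("Purple"))         # Other
```

For the fallback bucket to help, the "Other" label generally has to exist in the training data too, for example by grouping rare categories into it before fitting.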
Which Encoding for Which Model?
1. Label / Ordinal Encoding
Best for:
Tree-based models (Random Forest, XGBoost, LightGBM, Decision Trees)
Why:
Tree models split on thresholds rather than computing distances, so integer codes distort results far less than they do in linear or distance-based models.
2. One-Hot Encoding
Best for:
Linear models (Logistic Regression, Linear Regression)
Neural networks
KNN, SVM
Why:
Avoids implying numerical order; keeps categories independent.
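A small illustration of why this matters for distance-based models such as KNN. The integer codes and the euclidean helper are assumptions chosen for the example. With integer codes some categories look closer than others, while one-hot vectors keep every pair equally far apart.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

label_codes = {"Red": [0], "Blue": [1], "Green": [2]}
one_hot_codes = {"Red": [1, 0, 0], "Blue": [0, 1, 0], "Green": [0, 0, 1]}

# Integer codes: Red looks twice as far from Green as from Blue.
print(euclidean(label_codes["Red"], label_codes["Blue"]))   # 1.0
print(euclidean(label_codes["Red"], label_codes["Green"]))  # 2.0

# One-hot: every pair of distinct categories is the same distance apart.
d_red_blue = euclidean(one_hot_codes["Red"], one_hot_codes["Blue"])
d_red_green = euclidean(one_hot_codes["Red"], one_hot_codes["Green"])
print(d_red_blue == d_red_green)  # True
```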
3. Target Encoding
Best for:
High-cardinality categorical features
Models sensitive to dimensionality (GBMs, linear models)
Why:
Collapses many categories into one numerical signal without creating hundreds of dummy variables.
4. Frequency Encoding
Best for:
Large datasets
Mixed models
Why:
Converts categories into counts; useful when category frequency carries predictive power.
5. Binary Encoding
Best for:
Very high-cardinality data
When one-hot encoding explodes dimensionality
Why:
Reduces feature space by encoding categories into binary digits.

















