How to use CatBoost and Optuna to solve a Price Prediction Problem?
Understanding CatBoost and EDA
1. What is CatBoost?
CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex. It is optimized for handling categorical data and is widely used for structured data in machine learning projects.
Key Features of CatBoost:
- Handles categorical features automatically without requiring encoding (like one-hot encoding or label encoding).
- Uses Ordered Boosting, which helps prevent overfitting.
- Works well with imbalanced datasets and missing values.
- Efficient with default hyperparameters, requiring minimal tuning.
- Supports GPU acceleration, making training faster.
Why Use CatBoost for Price Prediction?
- Many price prediction datasets have categorical features such as item_type, brand, location, etc.
- CatBoost naturally handles these features and finds optimal splits, reducing preprocessing effort.
- It is one of the top-performing models for structured/tabular data.
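A minimal sketch of this end-to-end usage (the file name train.csv, the target column price, and the categorical columns item_type, brand, and location are placeholder assumptions, not from a specific dataset):

```python
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

# Placeholder dataset: assumes a CSV with a numeric "price" target.
df = pd.read_csv("train.csv")
X = df.drop(columns=["price"])
y = df["price"]

# Tell CatBoost which columns are categorical; no one-hot or
# label encoding is needed beforehand.
cat_features = ["item_type", "brand", "location"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = CatBoostRegressor(cat_features=cat_features, verbose=100)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
preds = model.predict(X_valid)
```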
2. What is EDA (Exploratory Data Analysis)?
Exploratory Data Analysis (EDA) is the process of understanding a dataset through visualization and statistical summaries before applying machine learning models.
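For example, a few typical EDA steps with pandas and matplotlib (reusing the placeholder DataFrame df and column names from the sketch above):

```python
import matplotlib.pyplot as plt

# Shape, dtypes, and missing values give a first overview.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Statistical summary of the numeric columns.
print(df.describe())

# Distribution of the target; prices are often right-skewed.
df["price"].hist(bins=50)
plt.xlabel("price")
plt.ylabel("count")
plt.show()

# Cardinality of a categorical feature (relevant for CatBoost).
print(df["brand"].value_counts().head(10))
```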
3. CatBoost Modeling Explanation
Explanation of the catboost_params settings (a complete example dictionary follows this list):
1. loss_function
- Loss function used to optimize the model.
2. eval_metric
- Metric used to evaluate the model during training (on validation data).
3. learning_rate
- Controls how much the model updates at each iteration.
- Lower values (e.g., 0.01 - 0.1):
- Slower learning but may increase accuracy.
- Helps prevent overfitting.
- Higher values (e.g., >0.1):
- Model learns faster but can overshoot and miss the optimal solution.
4. iterations
- Number of boosting rounds (trees) the model trains.
- More iterations → More complex model (higher risk of overfitting).
- Best practice:
- Use early stopping (early_stopping_rounds=100) to prevent unnecessary training.
- Usually, 500-2000 iterations is optimal, but it depends on dataset size.
5. depth
- Maximum depth of each tree (controls model complexity).
- Deeper trees (e.g., depth=8) → More complex model, higher risk of overfitting.
- Shallower trees (e.g., depth=4) → More generalizable, faster training.
6. random_strength
- Controls the amount of randomness added to splits.
- Higher values help regularization by introducing randomness into tree building.
- When to use?
- If random_strength=0, the splits are deterministic (fixed).
- If random_strength > 0, the model introduces randomness to prevent overfitting.
7. l2_leaf_reg
- L2 regularization term for leaves (controls model complexity).
- Higher values (e.g., 10+) → More regularization (simpler model).
- Lower values (e.g., <1) → Less regularization (risk of overfitting).
8. random_seed
- Sets the random seed for reproducibility.
- Ensures that running the same code multiple times gives the same result.
9. verbose
- Controls how much output CatBoost prints during training.
- verbose=False → No logging (silent training).
- verbose=100 → Prints metrics every 100 iterations.
10. task_type
- Setting task_type='GPU' trains the model on GPU instead of CPU.
- Benefits of using GPU:
- Faster training (often a 5-10x speedup on large datasets).
- Better for high iteration counts.
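Putting these together, a hedged sketch of a full parameter dictionary and training call, reusing the placeholder data from the first sketch (the values shown are illustrative, not tuned for any particular dataset):

```python
from catboost import CatBoostRegressor

catboost_params = {
    "loss_function": "RMSE",     # objective optimized during training
    "eval_metric": "RMSE",       # metric reported on the validation set
    "learning_rate": 0.05,       # small step size: slower but more stable
    "iterations": 2000,          # upper bound; early stopping cuts it short
    "depth": 6,                  # moderate tree depth
    "random_strength": 1.0,      # randomness in split scoring (regularization)
    "l2_leaf_reg": 3.0,          # L2 regularization on leaf values
    "random_seed": 42,           # reproducibility
    "verbose": 100,              # log metrics every 100 iterations
    "task_type": "GPU",          # remove this line if no GPU is available
}

model = CatBoostRegressor(**catboost_params, cat_features=cat_features)
model.fit(
    X_train, y_train,
    eval_set=(X_valid, y_valid),
    early_stopping_rounds=100,  # stop if the eval metric stalls for 100 rounds
)
print("Best iteration:", model.get_best_iteration())
```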
