How to Use CatBoost and Optuna to Solve a Price Prediction Problem


Kaggle Notebook Link

Understanding CatBoost and EDA

1. What is CatBoost?

CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex. It is optimized for handling categorical data and is widely used for structured data in machine learning projects.

Key Features of CatBoost:
  • Handles categorical features natively, without requiring manual encoding (such as one-hot or label encoding).
  • Uses Ordered Boosting, which helps prevent overfitting.
  • Handles missing values out of the box and copes well with imbalanced datasets.
  • Performs well with default hyperparameters, requiring minimal tuning.
  • Supports GPU acceleration, making training faster.

Why Use CatBoost for Price Prediction?
  • Many price prediction datasets have categorical features such as item_type, brand, and location.
  • CatBoost handles these features natively and finds good splits, reducing preprocessing effort (see the sketch below).
  • It is consistently among the top-performing model families for structured/tabular data.
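
Below is a minimal sketch of this native categorical handling. It assumes a hypothetical DataFrame df with columns item_type, brand, location, and a price target; the names are illustrative, not taken from the original notebook.

```python
# Minimal sketch: passing categorical columns directly to CatBoost.
# `df` and its column names are hypothetical placeholders.
from catboost import CatBoostRegressor, Pool

X = df[["item_type", "brand", "location"]]
y = df["price"]

# Declare the categorical columns -- no one-hot or label encoding needed;
# CatBoost builds its own encodings for them internally.
train_pool = Pool(X, y, cat_features=["item_type", "brand", "location"])

model = CatBoostRegressor(verbose=False)
model.fit(train_pool)
```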

2. What is EDA (Exploratory Data Analysis)?

Exploratory Data Analysis (EDA) is the process of understanding a dataset through visualization and statistical summaries before applying machine learning models.
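
As a quick illustration, here is a minimal EDA sketch using pandas and matplotlib. It assumes the data lives in a train.csv file with a numeric price column; both names are placeholders.

```python
# Minimal EDA sketch; the file name and column names are illustrative.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

print(df.shape)          # number of rows and columns
print(df.dtypes)         # which features are numeric vs. categorical
print(df.isna().sum())   # missing values per column
print(df.describe())     # summary statistics for numeric features

# Price targets are often right-skewed; a histogram makes that visible.
df["price"].hist(bins=50)
plt.xlabel("price")
plt.ylabel("count")
plt.show()
```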

3. CatBoost Modeling Explanation

Explanation of each key in catboost_params (a consolidated example dictionary appears after the list):

1. loss_function

  • The objective function the model optimizes during training (e.g., RMSE for regression).

2. eval_metric

  • The metric tracked on validation data during training; CatBoost reports it in the logs and uses it for early stopping and best-iteration selection.

3. learning_rate

  • Controls how much the model updates at each iteration.
  • Lower values (e.g., 0.01 - 0.1):
    • Slower learning but may increase accuracy.
    • Helps prevent overfitting.
  • Higher values (e.g., >0.1):
    • Model learns faster but can overshoot and miss the optimal solution.

4. iterations

  • Number of boosting rounds (trees) the model trains.
  • More iterations → More complex model (higher risk of overfitting).
  • Best practice:
    • Use early stopping (early_stopping_rounds=100) so training halts once the validation metric stops improving; see the sketch below.
    • 500-2000 iterations is a common range, but the right number depends on dataset size and the learning rate.
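
A minimal early-stopping sketch, assuming pre-split X_train/y_train and X_val/y_val arrays (hypothetical names):

```python
# Early stopping sketch; the train/validation split is assumed to exist.
from catboost import CatBoostRegressor

model = CatBoostRegressor(iterations=2000, learning_rate=0.05, verbose=200)
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=100,  # stop if the eval metric stalls for 100 rounds
)
print(model.get_best_iteration())  # the boosting round actually kept
```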

5. depth

  • Maximum depth of each tree (controls model complexity).
  • Deeper trees (e.g., depth=8) → More complex model, higher risk of overfitting.
  • Shallower trees (e.g., depth=4) → More generalizable, faster training.

6. random_strength

  • Controls the amount of randomness added to split scoring during tree construction.
  • Higher values help regularization by introducing randomness into tree building.
  • When to use?
    • If random_strength=0, the splits are deterministic (fixed).
    • If random_strength > 0, the model introduces randomness to prevent overfitting.

7. l2_leaf_reg

  • L2 regularization term for leaves (controls model complexity).
  • Higher values (e.g., 10+) → More regularization (simpler model).
  • Lower values (e.g., <1) → Less regularization (risk of overfitting).

8. random_seed

  • Sets the random seed for reproducibility.
  • Ensures that running the same code multiple times gives the same result.

9. verbose

  • Controls how much output CatBoost prints during training.
  • verbose=False → No logging (silent training).
  • verbose=100 → Prints metrics every 100 iterations.

10. task_type

  • Setting task_type='GPU' trains the model on a GPU instead of the CPU.
  • Benefits of using a GPU:
    • Faster training on large datasets (several-fold speedups are common, though the exact gain depends on data size and hardware).
    • Especially useful for high iteration counts.
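
Putting the pieces together, here is a sketch of what the full catboost_params dictionary and training call might look like. The values are illustrative starting points, not tuned results, and X_train/X_val are assumed to come from an earlier split.

```python
# Consolidated sketch of catboost_params; values are illustrative defaults.
from catboost import CatBoostRegressor

catboost_params = {
    "loss_function": "RMSE",   # objective optimized during training
    "eval_metric": "RMSE",     # metric tracked on the validation set
    "learning_rate": 0.05,     # step size per boosting round
    "iterations": 2000,        # upper bound; early stopping trims it
    "depth": 6,                # tree depth (model complexity)
    "random_strength": 1.0,    # randomness in split scoring
    "l2_leaf_reg": 3.0,        # L2 regularization on leaf values
    "random_seed": 42,         # reproducibility
    "verbose": 100,            # log metrics every 100 iterations
    "task_type": "GPU",        # remove this line to train on CPU
}

model = CatBoostRegressor(**catboost_params)
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=100,
)
```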