How to Use CatBoost and Optuna to Solve a Price Prediction Problem


Kaggle Notebook Link

Understanding CatBoost and EDA

1. What is CatBoost?

CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex. It is optimized for handling categorical data and is widely used for structured data in machine learning projects.

Key Features of CatBoost:
  • Handles categorical features natively, without requiring manual encoding (such as one-hot or label encoding).
  • Uses Ordered Boosting, which helps prevent overfitting.
  • Handles missing values out of the box and copes well with imbalanced datasets.
  • Performs well with default hyperparameters, requiring minimal tuning.
  • Supports GPU acceleration, making training faster.

Why Use CatBoost for Price Prediction?
  • Many price prediction datasets have categorical features such as item_type, brand, and location.
  • CatBoost handles these features natively and finds good splits, reducing preprocessing effort (see the sketch below).
  • It is consistently among the top-performing model families for structured/tabular data.
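
Below is a minimal sketch of this native categorical handling. It assumes a hypothetical DataFrame df with columns item_type, brand, location, and a price target; the names are illustrative, not taken from the original notebook.

```python
# Minimal sketch: passing categorical columns directly to CatBoost.
# `df` and its column names are hypothetical placeholders.
from catboost import CatBoostRegressor, Pool

X = df[["item_type", "brand", "location"]]
y = df["price"]

# Declare the categorical columns -- no one-hot or label encoding needed;
# CatBoost builds its own encodings for them internally.
train_pool = Pool(X, y, cat_features=["item_type", "brand", "location"])

model = CatBoostRegressor(verbose=False)
model.fit(train_pool)
```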

2. What is EDA (Exploratory Data Analysis)?

Exploratory Data Analysis (EDA) is the process of understanding a dataset through visualization and statistical summaries before applying machine learning models.
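
As a quick illustration, here is a minimal EDA sketch using pandas and matplotlib. It assumes the data lives in a train.csv file with a numeric price column; both names are placeholders.

```python
# Minimal EDA sketch; the file name and column names are illustrative.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

print(df.shape)          # number of rows and columns
print(df.dtypes)         # which features are numeric vs. categorical
print(df.isna().sum())   # missing values per column
print(df.describe())     # summary statistics for numeric features

# Price targets are often right-skewed; a histogram makes that visible.
df["price"].hist(bins=50)
plt.xlabel("price")
plt.ylabel("count")
plt.show()
```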

3. CatBoost Modeling Explanation

Explanation of each key in catboost_params (a consolidated example dictionary appears after the list):

1. loss_function

  • The objective function the model optimizes during training (e.g., RMSE for regression).

2. eval_metric

  • The metric tracked on validation data during training; CatBoost reports it in the logs and uses it for early stopping and best-iteration selection.

3. learning_rate

  • Controls how much the model updates at each iteration.
  • Lower values (e.g., 0.01 - 0.1):
    • Slower learning but may increase accuracy.
    • Helps prevent overfitting.
  • Higher values (e.g., >0.1):
    • Model learns faster but can overshoot and miss the optimal solution.

4. iterations

  • Number of boosting rounds (trees) the model trains.
  • More iterations → More complex model (higher risk of overfitting).
  • Best practice:
    • Use early stopping (early_stopping_rounds=100) so training halts once the validation metric stops improving; see the sketch below.
    • 500-2000 iterations is a common range, but the right number depends on dataset size and the learning rate.
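
A minimal early-stopping sketch, assuming pre-split X_train/y_train and X_val/y_val arrays (hypothetical names):

```python
# Early stopping sketch; the train/validation split is assumed to exist.
from catboost import CatBoostRegressor

model = CatBoostRegressor(iterations=2000, learning_rate=0.05, verbose=200)
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=100,  # stop if the eval metric stalls for 100 rounds
)
print(model.get_best_iteration())  # the boosting round actually kept
```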

5. depth

  • Maximum depth of each tree (controls model complexity).
  • Deeper trees (e.g., depth=8) → More complex model, higher risk of overfitting.
  • Shallower trees (e.g., depth=4) → More generalizable, faster training.

6. random_strength

  • Controls the amount of randomness added to split scoring during tree construction.
  • Higher values help regularization by introducing randomness into tree building.
  • When to use?
    • If random_strength=0, the splits are deterministic (fixed).
    • If random_strength > 0, the model introduces randomness to prevent overfitting.

7. l2_leaf_reg

  • L2 regularization term for leaves (controls model complexity).
  • Higher values (e.g., 10+) → More regularization (simpler model).
  • Lower values (e.g., <1) → Less regularization (risk of overfitting).

8. random_seed

  • Sets the random seed for reproducibility.
  • Ensures that running the same code multiple times gives the same result.

9. verbose

  • Controls how much output CatBoost prints during training.
  • verbose=False → No logging (silent training).
  • verbose=100 → Prints metrics every 100 iterations.

10. task_type

  • Setting task_type='GPU' trains the model on a GPU instead of the CPU.
  • Benefits of using a GPU:
    • Faster training on large datasets (several-fold speedups are common, though the exact gain depends on data size and hardware).
    • Especially useful for high iteration counts.
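
Putting the pieces together, here is a sketch of what the full catboost_params dictionary and training call might look like. The values are illustrative starting points, not tuned results, and X_train/X_val are assumed to come from an earlier split.

```python
# Consolidated sketch of catboost_params; values are illustrative defaults.
from catboost import CatBoostRegressor

catboost_params = {
    "loss_function": "RMSE",   # objective optimized during training
    "eval_metric": "RMSE",     # metric tracked on the validation set
    "learning_rate": 0.05,     # step size per boosting round
    "iterations": 2000,        # upper bound; early stopping trims it
    "depth": 6,                # tree depth (model complexity)
    "random_strength": 1.0,    # randomness in split scoring
    "l2_leaf_reg": 3.0,        # L2 regularization on leaf values
    "random_seed": 42,         # reproducibility
    "verbose": 100,            # log metrics every 100 iterations
    "task_type": "GPU",        # remove this line to train on CPU
}

model = CatBoostRegressor(**catboost_params)
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=100,
)
```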