Kaggle Playground Series

Ongoing participation in monthly Kaggle Playground Series competitions, practicing machine learning techniques on diverse datasets.

Competition Timeline

November 2025 - S5E11: Loan Default Prediction

Competition Overview

Binary classification challenge: predict the probability that a loan will default. The dataset contained 593,994 training samples with 11 features and significant class imbalance (roughly 80/20). Submissions were evaluated by AUC-ROC.
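An imbalance like this is easy to confirm before modeling. A minimal sketch, using a synthetic stand-in frame since the real `train.csv` schema is not shown here (the column name `loan_default` is an assumption):

```python
import pandas as pd

# Hypothetical target column; stand-in data mimicking the ~80/20 split described above
df = pd.DataFrame({'loan_default': [0] * 80 + [1] * 20})

# Relative class frequencies: confirms the majority/minority ratio before choosing CV strategy
class_ratio = df['loan_default'].value_counts(normalize=True)
print(class_ratio)
```

With an imbalance this strong, stratified cross-validation (as used below) keeps the class ratio stable across folds.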

Final Results

Rank: #950 / 3,850 participants (Top 25%)
Final AUC: 0.92450
Approach: Single LightGBM with Optuna-optimized hyperparameters and multi-seed averaging

Experimental Phases

Phase 1: Cross-Validation Foundation SUCCESS

  • 5-fold stratified cross-validation with LightGBM
  • Optuna hyperparameter optimization (30 trials)
  • CV: 0.923351 ± 0.000664, Private LB: 0.92450
  • Excellent generalization (delta = 0.00115)

Phase 2: Feature Engineering FAILED

  • Created 10 financial features (net income, payment-to-income ratio, risk score)
  • Performance dropped to 0.922625 (-0.000726)
  • Root cause: Features highly correlated with existing features (r > 0.95)
  • Example: net_income correlated with annual_income at r = 0.9873
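Redundancy like this is cheap to detect before training. A sketch with fabricated numbers (the 0.82 scaling and noise level are illustrative assumptions, not the real relationship between these columns):

```python
import numpy as np
import pandas as pd

# Illustrative data: an engineered feature that is a near-linear transform of an original one
rng = np.random.default_rng(42)
annual_income = rng.lognormal(mean=11, sigma=0.5, size=1000)
net_income = annual_income * 0.82 + rng.normal(0, 500, size=1000)  # hypothetical derivation

df = pd.DataFrame({'annual_income': annual_income, 'net_income': net_income})

# Flag engineered features that are redundant with originals (|r| > 0.95)
r = df['annual_income'].corr(df['net_income'])
print(f"r = {r:.4f}")
```

Tree models split on monotone transforms of a feature just as well as on the feature itself, so near-linear derivations add noise to the feature-selection process without adding information.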

Phase 3: Model Ensembling FAILED

  • Attempted combining LightGBM, XGBoost, CatBoost
  • Models achieved correlation rho ~ 0.99, yielding only 0.62% variance reduction
  • Models were too similar on this relatively simple tabular dataset; varying hyperparameters alone could not create meaningful diversity
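The diminishing return quantifies directly. For an average of n equally correlated predictors with common variance, the retained variance is 1/n + (n-1)/n · rho, which approaches 1 as rho approaches 1. A small sketch of that standard formula (the exact 0.62% figure above depends on the measured pairwise correlations; this yields a figure of the same magnitude):

```python
# Variance of an average of n equally correlated predictors with common variance sigma^2:
#   Var(mean) = sigma^2 * (1/n + (n - 1)/n * rho)
# As rho -> 1, the achievable reduction shrinks toward zero.

def ensemble_variance_ratio(n: int, rho: float) -> float:
    """Variance of the n-model average relative to a single model."""
    return 1.0 / n + (n - 1) / n * rho

ratio = ensemble_variance_ratio(n=3, rho=0.99)
print(f"variance retained:  {ratio:.4f}")
print(f"variance reduction: {1 - ratio:.2%}")  # well under 1% at rho = 0.99
```

With uncorrelated models (rho = 0) the same three-model average would cut variance to one third, which is why diversity, not model count, drives ensemble gains.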

Phase 4: Multi-Seed Averaging & External Data SUCCESS

  • Trained LightGBM with 5 different random seeds
  • Incorporated 18 statistical features from external dataset (20,000 real samples)
  • Final improvement: +0.00115 AUC
  • External features showed correlation r < 0.8 versus existing features

Code: Optuna Hyperparameter Optimization

import lightgbm as lgb
import optuna
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 5-fold stratified CV preserves the ~80/20 class ratio in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def objective(trial):
    # Search space over the main LightGBM hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }

    # X, y are the competition's training features and target
    model = lgb.LGBMClassifier(**params, random_state=42, verbose=-1)
    scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)

Key Insights

Prioritized deep understanding over leaderboard optimization. The results validated an information-theoretic principle: the external data provided genuinely new information, while the engineered features were redundant transformations of signals the model had already captured through its splits.

Tech stack: LightGBM 4.0+ · Optuna · scikit-learn (StratifiedKFold) · Python 3.10+

At a glance:

  • Final ranking: Top 25%
  • AUC-ROC: 0.92450
  • Training samples: 593,994
  • Task type: Binary classification

December 2025 - Coming Soon

December 2025 competition details will be added once the competition is complete.

Overall Approach

For each monthly competition, I follow a systematic workflow:

  • Data Exploration: Understanding distributions, missing values, and feature relationships
  • Feature Engineering: Creating meaningful features based on domain knowledge and EDA insights
  • Model Selection: Experimenting with appropriate algorithms for the problem type
  • Validation: Using cross-validation to ensure robust performance estimates
  • Iteration: Learning from each submission to improve subsequent attempts

Skills Developed

Regular participation in Kaggle competitions has helped develop practical skills in:

  • Working with diverse dataset types and problem domains
  • Efficient data preprocessing and feature engineering
  • Model selection and hyperparameter tuning with Optuna
  • Understanding evaluation metrics and their implications
  • Learning from the Kaggle community and top solutions