Kaggle Playground Series

Ongoing participation in monthly Kaggle Playground Series competitions, practicing machine learning techniques on diverse datasets.

Competition Timeline

November 2025 - S5E11: Loan Default Prediction

Competition Overview

Binary classification challenge: predict the probability that a loan will default. The dataset contained 593,994 training samples with 11 features and significant class imbalance (roughly 80/20). Submissions were evaluated by AUC-ROC.
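An imbalance like this is easy to confirm before modeling. A minimal sketch, using a synthetic stand-in frame since the real `train.csv` schema is not shown here (the column name `loan_default` is an assumption):

```python
import pandas as pd

# Hypothetical target column; stand-in data mimicking the ~80/20 split described above
df = pd.DataFrame({'loan_default': [0] * 80 + [1] * 20})

# Relative class frequencies: confirms the majority/minority ratio before choosing CV strategy
class_ratio = df['loan_default'].value_counts(normalize=True)
print(class_ratio)
```

With an imbalance this strong, stratified cross-validation (as used below) keeps the class ratio stable across folds.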

Final Results

Rank: #950 / 3,850 participants (Top 25%)
Final AUC: 0.92450
Approach: Single LightGBM with Optuna-optimized hyperparameters and multi-seed averaging

Experimental Phases

Phase 1: Cross-Validation Foundation SUCCESS

  • 5-fold stratified cross-validation with LightGBM
  • Optuna hyperparameter optimization (30 trials)
  • CV: 0.923351 ± 0.000664, Private LB: 0.92450
  • Excellent generalization (delta = 0.00115)

Phase 2: Feature Engineering FAILED

  • Created 10 financial features (net income, payment-to-income ratio, risk score)
  • Performance dropped to 0.922625 (-0.000726)
  • Root cause: Features highly correlated with existing features (r > 0.95)
  • Example: net_income correlated with annual_income at r = 0.9873
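Redundancy like this is cheap to detect before training. A sketch with fabricated numbers (the 0.82 scaling and noise level are illustrative assumptions, not the real relationship between these columns):

```python
import numpy as np
import pandas as pd

# Illustrative data: an engineered feature that is a near-linear transform of an original one
rng = np.random.default_rng(42)
annual_income = rng.lognormal(mean=11, sigma=0.5, size=1000)
net_income = annual_income * 0.82 + rng.normal(0, 500, size=1000)  # hypothetical derivation

df = pd.DataFrame({'annual_income': annual_income, 'net_income': net_income})

# Flag engineered features that are redundant with originals (|r| > 0.95)
r = df['annual_income'].corr(df['net_income'])
print(f"r = {r:.4f}")
```

Tree models split on monotone transforms of a feature just as well as on the feature itself, so near-linear derivations add noise to the feature-selection process without adding information.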

Phase 3: Model Ensembling FAILED

  • Attempted combining LightGBM, XGBoost, CatBoost
  • Models achieved correlation rho ~ 0.99, yielding only 0.62% variance reduction
  • Models were too similar on this relatively simple tabular dataset; varying hyperparameters alone could not create meaningful diversity
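The diminishing return quantifies directly. For an average of n equally correlated predictors with common variance, the retained variance is 1/n + (n-1)/n · rho, which approaches 1 as rho approaches 1. A small sketch of that standard formula (the exact 0.62% figure above depends on the measured pairwise correlations; this yields a figure of the same magnitude):

```python
# Variance of an average of n equally correlated predictors with common variance sigma^2:
#   Var(mean) = sigma^2 * (1/n + (n - 1)/n * rho)
# As rho -> 1, the achievable reduction shrinks toward zero.

def ensemble_variance_ratio(n: int, rho: float) -> float:
    """Variance of the n-model average relative to a single model."""
    return 1.0 / n + (n - 1) / n * rho

ratio = ensemble_variance_ratio(n=3, rho=0.99)
print(f"variance retained:  {ratio:.4f}")
print(f"variance reduction: {1 - ratio:.2%}")  # well under 1% at rho = 0.99
```

With uncorrelated models (rho = 0) the same three-model average would cut variance to one third, which is why diversity, not model count, drives ensemble gains.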

Phase 4: Multi-Seed Averaging & External Data SUCCESS

  • Trained LightGBM with 5 different random seeds
  • Incorporated 18 statistical features from external dataset (20,000 real samples)
  • Final improvement: +0.00115 AUC
  • External features showed correlation r < 0.8 versus existing features

Code: Optuna Hyperparameter Optimization

import lightgbm as lgb
import optuna
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 5-fold stratified CV preserves the ~80/20 class ratio in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def objective(trial):
    # Search space over the main LightGBM hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }

    # X, y are the competition's training features and target
    model = lgb.LGBMClassifier(**params, random_state=42, verbose=-1)
    scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)

Key Insights

Prioritized deep understanding over leaderboard optimization. The results validated an information-theoretic principle: the external data provided genuinely new information, while the engineered features were redundant transformations of signals the model had already captured through its splits.

Tech stack: LightGBM 4.0+ · Optuna · scikit-learn (StratifiedKFold) · Python 3.10+

At a glance:

  • Final ranking: Top 25%
  • AUC-ROC: 0.92450
  • Training samples: 593,994
  • Task type: Binary classification

December 2025 - Coming Soon

December 2025 competition details will be added once the competition is complete.

Overall Approach

For each monthly competition, I follow a systematic workflow:

  • Data Exploration: Understanding distributions, missing values, and feature relationships
  • Feature Engineering: Creating meaningful features based on domain knowledge and EDA insights
  • Model Selection: Experimenting with appropriate algorithms for the problem type
  • Validation: Using cross-validation to ensure robust performance estimates
  • Iteration: Learning from each submission to improve subsequent attempts

Skills Developed

Regular participation in Kaggle competitions has helped develop practical skills in:

  • Working with diverse dataset types and problem domains
  • Efficient data preprocessing and feature engineering
  • Model selection and hyperparameter tuning with Optuna
  • Understanding evaluation metrics and their implications
  • Learning from the Kaggle community and top solutions