QMUL ML Society Valentine's Hackathon

2nd Place Solution - Predicting Valentine's date likelihood from demographic and social attributes using a tuned gradient boosting ensemble

2nd Place
QMUL Machine Learning Society Valentine's Day Hackathon 2026 — 0.59235 AUC (1st: 0.59275)

Problem Statement

The challenge was to predict whether a person has a Valentine's date, based on demographic and social attributes — a binary classification problem evaluated on ROC-AUC. The signal-to-noise ratio was intentionally low, making feature engineering and model calibration the key differentiators.

700K
Training Rows
0.59235
Final AUC
3
Models Ensembled
150
Optuna Trials

Feature Engineering

With a noisy target and large dataset, careful preprocessing and feature construction were essential to extract signal before model training.

Date Features

  • Parsed Survey_Date into month, hour, and day-of-week components
  • Captures any systematic temporal bias in survey responses
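The date parsing step above can be sketched as follows; the timestamp format shown is an assumption, only the `Survey_Date` column name comes from the write-up:

```python
import pandas as pd

# Illustrative rows standing in for the competition data
df = pd.DataFrame({
    "Survey_Date": ["2026-02-10 14:30:00", "2026-02-13 09:05:00"],
})

dt = pd.to_datetime(df["Survey_Date"])
df["survey_month"] = dt.dt.month
df["survey_hour"] = dt.dt.hour
df["survey_dayofweek"] = dt.dt.dayofweek  # Monday=0 ... Sunday=6
```

Tree models can then split directly on these integer components to pick up any temporal bias.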

Interaction Terms

  • Created cross-feature interactions between key demographic variables
  • Allows models to capture non-linear relationships without deep trees
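A minimal sketch of the interaction step; the column names (`age`, `income`, `social_media_hours`) are illustrative placeholders, not the actual competition features:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [21, 34, 28],
    "income": [18000.0, 42000.0, 30000.0],
    "social_media_hours": [5, 2, 3],
})

# Products and ratios expose joint effects that would otherwise
# require deep trees to approximate
df["age_x_social"] = df["age"] * df["social_media_hours"]
df["income_per_age"] = df["income"] / df["age"]
```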

Transformations & Flags

  • Log-transformed income to reduce skew and improve gradient flow
  • Missingness flags added for columns with null values — treating missingness itself as a signal
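The two transforms above can be sketched together; median imputation after flagging is an assumption about the pipeline, and the `income` column name is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [18000.0, np.nan, 42000.0]})

# Flag missingness first, so the signal survives imputation
df["income_missing"] = df["income"].isna().astype(int)

# log1p compresses the right tail and handles zero incomes safely
df["income_log"] = np.log1p(df["income"].fillna(df["income"].median()))
```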

Target Encoding

  • Applied target encoding to high-cardinality categorical columns
  • Used KFold cross-validation during encoding to prevent target leakage
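A hedged sketch of the KFold encoding scheme: each row's encoding is computed only from folds that exclude it, so no row sees its own label. The `city` and `has_date` names are illustrative stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b"],
    "has_date": [1, 0, 1, 1, 0, 0],
})

global_mean = df["has_date"].mean()
df["city_te"] = np.nan

kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Category means computed on the training folds only
    fold_means = df.iloc[train_idx].groupby("city")["has_date"].mean()
    df.loc[df.index[val_idx], "city_te"] = (
        df.iloc[val_idx]["city"].map(fold_means).fillna(global_mean).values
    )
```

Categories unseen in a training fold fall back to the global mean, which also serves as the natural encoding for unseen test categories.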

Modelling Approach

Three gradient boosting frameworks were tuned independently, then combined into a rank-based ensemble.

Hyperparameter Tuning with Optuna

Each model was tuned with 50 Optuna trials, optimising ROC-AUC via cross-validation:

  • LightGBM — fast training on large datasets, tuned learning rate, num_leaves, min_child_samples
  • XGBoost — tuned max_depth, subsample, colsample_bytree, eta
  • CatBoost — native categorical handling, tuned depth, learning_rate, l2_leaf_reg

Code: Rank-Based Ensemble

from scipy.stats import rankdata

# Test-set probability predictions from each tuned model
lgb_preds = lgb_model.predict_proba(X_test)[:, 1]
xgb_preds = xgb_model.predict_proba(X_test)[:, 1]
cat_preds = cat_model.predict_proba(X_test)[:, 1]

# Rank-based ensemble: convert to ranks before averaging
# Reduces sensitivity to outlier probability estimates
ensemble = (
    rankdata(lgb_preds) +
    rankdata(xgb_preds) +
    rankdata(cat_preds)
) / 3

Rank-based ensembling normalises each model's output before averaging, making it more robust to differences in probability calibration across models.

Key Success Factors

  • Target encoding with KFold allowed safe encoding of categorical features on the full training set without leakage
  • Missingness as signal: Adding null-indicator flags gave models information about why data was missing, not just that it was
  • Diverse model types: LightGBM, XGBoost, and CatBoost each handle data differently — ensembling reduces variance across their errors
  • Rank-based ensembling sidesteps probability calibration differences between models, producing a more reliable combined score
  • Thorough tuning: 50 Optuna trials per model on a noisy task made a meaningful difference at the margin

Tools & Technologies

  • Python: Primary programming language
  • Jupyter Notebook: Interactive development and rapid iteration
  • Pandas & NumPy: Data manipulation and feature engineering
  • LightGBM, XGBoost, CatBoost: Gradient boosting frameworks
  • Optuna: Hyperparameter optimisation framework
  • Scikit-learn: KFold cross-validation, ROC-AUC scoring
  • SciPy: Rank-based ensemble aggregation

Key Learnings

  • On noisy tasks, ensembling is worth the effort: The margin between 1st and 2nd place (0.00040 AUC) shows how small gains compound — each modelling decision matters
  • Treat missingness as a feature: Null-indicator flags can carry predictive signal that imputation alone would destroy
  • KFold target encoding is non-negotiable: Naive target encoding leaks label information and produces overfit models that collapse on test data
  • Rank-based ensembling is robust: When models have different probability scales, ranking before averaging produces more stable combined predictions than simple averaging