QMUL ML Society Valentine's Hackathon

2nd Place Solution - Predicting Valentine's date likelihood from demographic and social attributes using a tuned gradient boosting ensemble

2nd Place
QMUL Machine Learning Society Valentine's Day Hackathon 2026 — 0.59235 AUC (1st: 0.59275)

Problem Statement

The challenge was to predict whether a person has a Valentine's date, based on demographic and social attributes — a binary classification problem evaluated on ROC-AUC. The signal-to-noise ratio was intentionally low, making feature engineering and model calibration the key differentiators.

700K
Training Rows
0.59235
Final AUC
3
Models Ensembled
150
Optuna Trials

Feature Engineering

With a noisy target and large dataset, careful preprocessing and feature construction were essential to extract signal before model training.

Date Features

  • Parsed Survey_Date into month, hour, and day-of-week components
  • Captures any systematic temporal bias in survey responses
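The date parsing step above can be sketched as follows; the timestamp format shown is an assumption, only the `Survey_Date` column name comes from the write-up:

```python
import pandas as pd

# Illustrative rows standing in for the competition data
df = pd.DataFrame({
    "Survey_Date": ["2026-02-10 14:30:00", "2026-02-13 09:05:00"],
})

dt = pd.to_datetime(df["Survey_Date"])
df["survey_month"] = dt.dt.month
df["survey_hour"] = dt.dt.hour
df["survey_dayofweek"] = dt.dt.dayofweek  # Monday=0 ... Sunday=6
```

Tree models can then split directly on these integer components to pick up any temporal bias.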

Interaction Terms

  • Created cross-feature interactions between key demographic variables
  • Allows models to capture non-linear relationships without deep trees
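A minimal sketch of the interaction step; the column names (`age`, `income`, `social_media_hours`) are illustrative placeholders, not the actual competition features:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [21, 34, 28],
    "income": [18000.0, 42000.0, 30000.0],
    "social_media_hours": [5, 2, 3],
})

# Products and ratios expose joint effects that would otherwise
# require deep trees to approximate
df["age_x_social"] = df["age"] * df["social_media_hours"]
df["income_per_age"] = df["income"] / df["age"]
```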

Transformations & Flags

  • Log-transformed income to reduce skew and improve gradient flow
  • Missingness flags added for columns with null values — treating missingness itself as a signal
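The two transforms above can be sketched together; median imputation after flagging is an assumption about the pipeline, and the `income` column name is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [18000.0, np.nan, 42000.0]})

# Flag missingness first, so the signal survives imputation
df["income_missing"] = df["income"].isna().astype(int)

# log1p compresses the right tail and handles zero incomes safely
df["income_log"] = np.log1p(df["income"].fillna(df["income"].median()))
```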

Target Encoding

  • Applied target encoding to high-cardinality categorical columns
  • Used KFold cross-validation during encoding to prevent target leakage
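A hedged sketch of the KFold encoding scheme: each row's encoding is computed only from folds that exclude it, so no row sees its own label. The `city` and `has_date` names are illustrative stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b"],
    "has_date": [1, 0, 1, 1, 0, 0],
})

global_mean = df["has_date"].mean()
df["city_te"] = np.nan

kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Category means computed on the training folds only
    fold_means = df.iloc[train_idx].groupby("city")["has_date"].mean()
    df.loc[df.index[val_idx], "city_te"] = (
        df.iloc[val_idx]["city"].map(fold_means).fillna(global_mean).values
    )
```

Categories unseen in a training fold fall back to the global mean, which also serves as the natural encoding for unseen test categories.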

Modelling Approach

Three gradient boosting frameworks were tuned independently, then combined into a rank-based ensemble.

Hyperparameter Tuning with Optuna

Each model was tuned with 50 Optuna trials, optimising ROC-AUC via cross-validation:

  • LightGBM — fast training on large datasets, tuned learning rate, num_leaves, min_child_samples
  • XGBoost — tuned max_depth, subsample, colsample_bytree, eta
  • CatBoost — native categorical handling, tuned depth, learning_rate, l2_leaf_reg

Code: Rank-Based Ensemble

from scipy.stats import rankdata

# Test-set probability predictions from each tuned model
lgb_preds = lgb_model.predict_proba(X_test)[:, 1]
xgb_preds = xgb_model.predict_proba(X_test)[:, 1]
cat_preds = cat_model.predict_proba(X_test)[:, 1]

# Rank-based ensemble: convert to ranks before averaging
# Reduces sensitivity to outlier probability estimates
ensemble = (
    rankdata(lgb_preds) +
    rankdata(xgb_preds) +
    rankdata(cat_preds)
) / 3

Rank-based ensembling normalises each model's output before averaging, making it more robust to differences in probability calibration across models.

Key Success Factors

  • Target encoding with KFold allowed safe encoding of categorical features on the full training set without leakage
  • Missingness as signal: Adding null-indicator flags gave models information about why data was missing, not just that it was
  • Diverse model types: LightGBM, XGBoost, and CatBoost each handle data differently — ensembling reduces variance across their errors
  • Rank-based ensembling sidesteps probability calibration differences between models, producing a more reliable combined score
  • Thorough tuning: 50 Optuna trials per model on a noisy task made a meaningful difference at the margin

Tools & Technologies

  • Python: Primary programming language
  • Jupyter Notebook: Interactive development and rapid iteration
  • Pandas & NumPy: Data manipulation and feature engineering
  • LightGBM, XGBoost, CatBoost: Gradient boosting frameworks
  • Optuna: Hyperparameter optimisation framework
  • Scikit-learn: KFold cross-validation, ROC-AUC scoring
  • SciPy: Rank-based ensemble aggregation

Key Learnings

  • On noisy tasks, ensembling is worth the effort: The margin between 1st and 2nd place (0.00040 AUC) shows how small gains compound — each modelling decision matters
  • Treat missingness as a feature: Null-indicator flags can carry predictive signal that imputation alone would destroy
  • KFold target encoding is non-negotiable: Naive target encoding leaks label information and produces overfit models that collapse on test data
  • Rank-based ensembling is robust: When models have different probability scales, ranking before averaging produces more stable combined predictions than simple averaging