This project tackles the Kaggle Spaceship Titanic classification problem using a structured, iterative machine learning workflow.
Each iteration builds on insights from the previous one, focusing on better preprocessing, feature engineering, model selection, and validation strategy.
The goal was not just leaderboard performance, but correct ML methodology, interpretability, and learning progression.
- Basic preprocessing (imputation, encoding)
- Train–validation split (no cross-validation)
- Classical models (a baseline sketch follows this section):
- Naive Bayes
- Logistic Regression (with regularization)
- Random Forest
- Logistic Regression improved after tuning
- Random Forest outperformed linear models
- Performance sensitive to random state
- No meaningful feature engineering
- Validation Accuracy: ~0.77–0.79
- Kaggle Score: 0.798
- Leaderboard Rank: 1522 / 2692
A solid baseline, but:
- No CV → unstable estimates
- Feature interactions poorly captured
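A minimal sketch of this baseline, assuming the standard Spaceship Titanic column names; the imputation and encoding choices are illustrative, not the exact notebook code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")  # assumed file name
y = train["Transported"].astype(int)
# Dropping Cabin/Name here is an assumption; later iterations may treat them differently
X = train.drop(columns=["Transported", "PassengerId", "Name", "Cabin"])

# Basic preprocessing: median imputation for numericals, one-hot for the rest
num_cols = X.select_dtypes("number").columns
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
X = pd.get_dummies(X.fillna("missing"))

# Single train-validation split: no CV yet, hence the random-state sensitivity
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = [
    ("Naive Bayes", GaussianNB()),
    ("Logistic Regression", LogisticRegression(C=1.0, max_iter=1000)),  # L2-regularized
    ("Random Forest", RandomForestClassifier(n_estimators=300, random_state=42)),
]
for name, clf in models:
    clf.fit(X_tr, y_tr)
    print(f"{name}: {clf.score(X_val, y_val):.3f}")
```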
- Proper ColumnTransformer pipelines (a full pipeline sketch follows this section)
- Feature groups:
- Numerical
- Boolean
- Categorical
- Engineered / count-based
- New engineered features:
- Total_spend
- Has_spent
- Spend-based ratios
- Feature selection using:
- L1 (Logistic Regression)
- 5-Fold Cross-Validation
- Models evaluated:
- RF + LR-based feature selection (RF_lr)
- RF + RF-based feature selection
- LR baselines
- RF with LR feature selection was the most stable
- Cross-validation removed the random-split sensitivity seen in Iteration 1
- Feature selection slightly improved generalization
- Accuracy gains were incremental
- Best CV Accuracy (RF_lr): ~0.800
- Kaggle Score: 0.799
- Leaderboard Rank: 1453 / 2692
- Feature engineering helps only when it adds new signal
- Stability and correctness > raw accuracy
- RF_lr became the strongest classical ML baseline
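A minimal sketch of the Iteration 2 RF_lr pipeline; the exact ratio definitions, dropped columns, and hyperparameters are assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")  # assumed file name
spend = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Engineered features named in this section (ratio definition is an assumption)
train["Total_spend"] = train[spend].sum(axis=1)
train["Has_spent"] = (train["Total_spend"] > 0).astype(int)
for c in spend:
    train[f"{c}_ratio"] = train[c] / (train["Total_spend"] + 1)  # +1 avoids /0

y = train["Transported"].astype(int)
X = train.drop(columns=["Transported", "PassengerId", "Name", "Cabin"])

num_cols = X.select_dtypes("number").columns.tolist()
cat_cols = [c for c in X.columns if c not in num_cols]

preprocess = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

# RF_lr: an L1-penalized LR selects features, a Random Forest classifies
rf_lr = Pipeline([
    ("prep", preprocess),
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5))),
    ("clf", RandomForestClassifier(n_estimators=300, random_state=42)),
])

scores = cross_val_score(rf_lr, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Keeping imputation, encoding, and selection inside the pipeline means each CV fold fits them on its own training split, which is what rules out the leakage discussed later.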
- Introduced XGBoost (XGBClassifier)
- Same clean preprocessing pipeline
- Cross-validation retained
- Minimal tuning (see the configuration sketch after this section):
- Controlled tree depth
- Moderate learning rate
- No aggressive regularization
- XGBoost outperformed RF-based pipelines
- Higher mean accuracy with slightly higher variance
- Indicates lower bias but increased sensitivity
- No evidence of data leakage or overfitting
- CV Accuracy: ~0.804–0.805
- Kaggle Score: ~0.805
- Leaderboard Rank: 741 / 2719
- Boosting extracted additional non-linear structure
- Performance approached the practical ceiling for this dataset
- Further gains likely require ensembling or stronger features
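A minimal sketch of this configuration, reusing `preprocess`, `X`, and `y` from the Iteration 2 sketch; the hyperparameter values are assumptions consistent with the description above:

```python
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

xgb = Pipeline([
    ("prep", preprocess),  # same preprocessing as Iteration 2
    ("clf", XGBClassifier(
        n_estimators=500,
        max_depth=4,         # controlled tree depth
        learning_rate=0.05,  # moderate learning rate
        eval_metric="logloss",
        random_state=42,     # no aggressive regularization terms set
    )),
])

scores = cross_val_score(xgb, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```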
- Introduced LightGBM (LGBMClassifier)
- Retained the same preprocessing and engineered features
- Stratified 5-Fold Cross-Validation
- Leveraged leaf-wise tree growth (vs level-wise in XGBoost)
- Moderate hyperparameters (sketched after this section):
- Controlled max depth
- Stable learning rate
- No heavy regularization
- LightGBM consistently outperformed XGBoost
- Higher mean CV accuracy with lower variance
- Better bias–variance tradeoff
- Leaf-wise growth enabled faster and more effective loss reduction
- No signs of overfitting or leakage
- CV Accuracy: ~0.807–0.808
- CV Std Dev: ~0.009
- Kaggle Score: ~0.807
- Leaderboard Rank: 293 / 2292
- LightGBM extracted additional structured and non-linear signal
- Achieved the best single-model performance in this workflow
- Model performance is now very close to the dataset’s empirical ceiling
- Meaningful gains beyond this point require:
- Ensembling
- Stacking
- Domain-heavy feature engineering
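A minimal sketch, again reusing the Iteration 2 preprocessing; the `num_leaves` and `max_depth` values are assumptions:

```python
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

lgbm = Pipeline([
    ("prep", preprocess),  # unchanged preprocessing and engineered features
    ("clf", LGBMClassifier(
        n_estimators=500,
        num_leaves=31,       # caps leaf-wise growth
        max_depth=6,         # controlled max depth
        learning_rate=0.05,  # stable learning rate
        random_state=42,
    )),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(lgbm, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because LightGBM grows the leaf with the largest loss reduction rather than whole levels, `num_leaves` is the primary capacity control here, with `max_depth` acting as a safety bound.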
| Iteration | Model | CV / Val Accuracy | Kaggle Score | Rank |
|---|---|---|---|---|
| Iter 1 | RF / LR | ~0.77–0.79 | 0.798 | 1522/2692 |
| Iter 2 | RF + LR FS | ~0.80 | 0.799 | 1453/2692 |
| Iter 3 | XGBoost | ~0.805 | 0.805 | 741/2719 |
| Iter 4 | LightGBM | ~0.808 | 0.807 | 293/2292 |
| Iter 5 | CatBoost | ~0.816 | 0.8024 | No improvement |
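For reference, the Iteration 5 CatBoost run in the table relies on native categorical handling, so raw categorical columns are passed directly instead of being one-hot encoded. A minimal sketch, reusing `X` and `y` from the Iteration 2 sketch (the column list and parameters are assumptions); the CV-to-Kaggle gap in that row is why it is marked as no improvement:

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# CatBoost takes categorical columns as-is but requires them to be
# non-null strings; numeric NaNs are handled natively. (Assumed columns.)
cat_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]
X_cb = X.copy()
X_cb[cat_cols] = X_cb[cat_cols].fillna("missing").astype(str)

cb = CatBoostClassifier(
    iterations=500,
    depth=6,
    learning_rate=0.05,
    cat_features=cat_cols,  # native categorical handling
    random_seed=42,
    verbose=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cb, X_cb, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```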
- Cross-validation is mandatory for reliable evaluation
- Feature engineering only helps when it adds new information
- Tree-based models outperform linear methods on this dataset
- Boosting reduces bias but must be variance-controlled
- Small leaderboard gains can represent significant methodological progress
This project demonstrates a disciplined and correct ML workflow, progressing from baseline models to advanced gradient boosting while maintaining reproducibility and interpretability.
Achieving ~80.7% accuracy with clean pipelines and CV reflects strong modeling rather than leaderboard tricks.
- Model ensembling (RF + XGB + LGBM; a sketch follows this list)
- Feature interaction discovery
- Explore CatBoost for native categorical handling (explored in Iteration 5; no leaderboard improvement)
- Automating feature creation by wrapping it in a reusable function
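For the ensembling item, a possible starting point is soft voting over the three pipelines sketched above; the components and equal weighting are untested assumptions:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

# Soft voting averages the predicted probabilities of the three pipelines
ensemble = VotingClassifier(
    estimators=[("rf_lr", rf_lr), ("xgb", xgb), ("lgbm", lgbm)],
    voting="soft",
)

scores = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If soft voting plateaus, stacking these pipelines under a logistic-regression meta-learner would be the natural follow-up.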