This project demonstrates a complete machine learning workflow on the Kaggle Titanic dataset. The goal is to predict passenger survival; along the way it covers missing-value handling, categorical encoding, Scikit-learn pipelines, cross-validation, and an XGBoost classifier.
The project emphasizes:
- Data cleaning and missing value imputation
- Feature engineering and categorical encoding
- Creating reusable and efficient ML pipelines
- Model evaluation with cross-validation
- Avoiding data leakage to ensure model integrity
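The points above can be sketched as a single Scikit-learn pipeline evaluated with cross-validation. This is a minimal illustration, not the repository's actual code: the synthetic DataFrame stands in for Kaggle's `train.csv` (only the column names match the real dataset), the hyperparameters are arbitrary, and the fallback to `GradientBoostingClassifier` is an assumption added so the sketch still runs when `xgboost` is not installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
except ImportError:
    # Assumption: fall back to a scikit-learn booster if xgboost is missing.
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(n_estimators=50, max_depth=3)

# Synthetic stand-in for train.csv (illustrative values, not real passengers).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 100, n),
    "Embarked": rng.choice(["S", "C", "Q"], n),
})
df.loc[rng.choice(n, 30, replace=False), "Age"] = np.nan      # simulate missing ages
df.loc[rng.choice(n, 5, replace=False), "Embarked"] = np.nan  # simulate missing ports
# Toy target: mostly determined by Sex, with 20% of labels flipped.
y = ((df["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

numeric_cols = ["Pclass", "Age", "Fare"]
categorical_cols = ["Sex", "Embarked"]

preprocessor = ColumnTransformer([
    # Median imputation for numeric columns.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Most-frequent imputation, then one-hot encoding, for categorical columns.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

pipeline = Pipeline([("preprocess", preprocessor), ("model", model)])

# The whole pipeline is refit inside each fold, so imputation and encoding
# statistics never see the held-out rows.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, df, y, cv=cv, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
```

Passing the full pipeline to `cross_val_score`, rather than preprocessing the data first, is what keeps the preprocessing steps inside the cross-validation loop.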
The dataset is sourced from the Titanic - Machine Learning from Disaster competition on Kaggle.
- Handling of missing values (e.g., Age, Embarked)
- Encoding of categorical variables (e.g., Sex, Embarked)
- Use of Scikit-learn Pipelines for a clean workflow
- Cross-validation for robust evaluation
- XGBoost classifier for improved predictive power
- Clear separation of training and validation data to prevent data leakage
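To illustrate the last point, here is a small sketch of leakage-free imputation on a hold-out split: the imputer learns its median from the training rows only and then reuses it on the validation rows. The `Age` values are synthetic, and the split size and random seeds are arbitrary assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic Age column with some missing entries (illustrative only).
rng = np.random.default_rng(42)
ages = pd.DataFrame({"Age": rng.uniform(1, 80, 100)})
ages.loc[rng.choice(100, 20, replace=False), "Age"] = np.nan

# Split first, so validation rows cannot influence any fitted statistic.
X_train, X_valid = train_test_split(ages, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)  # median learned from training rows only
X_valid_filled = imputer.transform(X_valid)      # training median reused, no refit

train_median = imputer.statistics_[0]
print("Median used for both splits:", train_median)
```

Calling `fit_transform` on the full dataset before splitting would instead let validation ages shift the median, which is the kind of leakage the separation is meant to prevent.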
- Clone this repository:
git clone https://github.com/yourusername/titanic-ml-pipeline-xgboost.git
cd titanic-ml-pipeline-xgboost