This project predicts whether a passenger survived the Titanic disaster using demographic and travel-related information. Multiple machine learning classification models are trained, cross-validated, and compared to select the best-performing model.
The project follows an end-to-end machine learning workflow including data preprocessing, model evaluation, visualization, and final prediction on unseen test data.
- Predict passenger survival (`Survived`)
- Perform exploratory data analysis (EDA)
- Train and compare multiple classification models
- Select the best model using F1-score
- Generate predictions for unseen test data
Source: Kaggle Titanic Dataset
Files:
- `train.csv` – Training and validation data
- `test.csv` – Unseen data for final prediction

Features:
- `Pclass` – Passenger class
- `Sex` – Gender
- `Age` – Age of passenger
- `SibSp` – Number of siblings/spouses aboard
- `Parch` – Number of parents/children aboard
- `Fare` – Ticket fare
- `Embarked` – Port of embarkation

Target: `Survived` (`0` = Did not survive, `1` = Survived)
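
As a rough illustration of how the listed columns might be pulled out of the data and checked for gaps, here is a minimal sketch using pandas. The helper name `split_features_target` and the three sample rows are made up for this example and are not taken from the project notebook or from the Kaggle files.

```python
import io

import pandas as pd

FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
TARGET = "Survived"


def split_features_target(df):
    """Return (X, y) using the feature and target columns listed above.

    Hypothetical helper for illustration; raises if a column is absent.
    """
    missing = [c for c in FEATURES + [TARGET] if c not in df.columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")
    return df[FEATURES], df[TARGET]


# Invented rows standing in for train.csv (NOT real passenger records).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
101,0,3,male,30,0,0,8.0,S
102,1,1,female,,1,0,60.0,C
103,1,2,female,40,0,1,20.0,
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))
X, y = split_features_target(df)

# Count missing values per feature to decide what needs imputing.
missing_counts = X.isna().sum()
print(missing_counts)
```

With the real data, `io.StringIO(SAMPLE_CSV)` would be replaced by the path to `train.csv`.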
- Logistic Regression (Baseline)
- Random Forest Classifier
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
- XGBoost Classifier
- Data loading and inspection
- Missing value handling and data cleaning
- Encoding categorical variables
- Feature scaling
- Train–validation split
- 5-fold cross-validation
- Model comparison using F1-score
- Best model selection
- Prediction on test dataset
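
The workflow steps above can be sketched with scikit-learn roughly as follows. This is an illustrative sketch, not the notebook's actual code: the imputation strategies, hyperparameters, helper names (`make_synthetic_passengers`, `build_pipeline`, `compare_models`), and the synthetic stand-in data are all assumptions. XGBoost is left out to avoid an extra dependency; an `XGBClassifier` could be added to the `models` dictionary if the `xgboost` package is installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

NUMERIC = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
CATEGORICAL = ["Sex", "Embarked"]


def make_synthetic_passengers(n=200, seed=0):
    """Synthetic stand-in for train.csv (NOT the real Kaggle data)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Pclass": rng.integers(1, 4, n),
        "Sex": rng.choice(["male", "female"], n),
        "Age": rng.uniform(1, 70, n),
        "SibSp": rng.integers(0, 4, n),
        "Parch": rng.integers(0, 3, n),
        "Fare": rng.uniform(5, 100, n),
        "Embarked": rng.choice(["S", "C", "Q"], n),
    })
    # Blank out some ages so the imputation step has work to do.
    df.loc[rng.random(n) < 0.2, "Age"] = np.nan
    # Arbitrary rule: make the label depend on Sex so models can learn something.
    p_survive = np.where(df["Sex"] == "female", 0.7, 0.2)
    df["Survived"] = (rng.random(n) < p_survive).astype(int)
    return df


def build_pipeline(model):
    """Impute, encode, and scale inside the pipeline so each CV fold is fit separately."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


def compare_models(df, n_splits=5, seed=0):
    """Return (best_name, {name: mean F1}) from stratified k-fold cross-validation."""
    X, y = df[NUMERIC + CATEGORICAL], df["Survived"]
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
    }
    scores = {
        name: cross_val_score(build_pipeline(model), X, y, cv=cv, scoring="f1").mean()
        for name, model in models.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


if __name__ == "__main__":
    best, results = compare_models(make_synthetic_passengers())
    for name, f1 in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: mean F1 = {f1:.3f}")
    print("Best model:", best)
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are refit on each training fold, which avoids leaking validation-fold statistics into training. The selected pipeline would then be fit on the full training data before calling `predict` on the test features.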
- Accuracy
- Precision
- Recall
- F1-score (primary metric)
- Confusion Matrix
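F1-score was prioritized because the classes are imbalanced: in the Kaggle training set, roughly 38% of passengers survived, so accuracy alone can overstate how well a model identifies survivors.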
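
To make the relationship between these metrics concrete, the sketch below derives accuracy, precision, recall, and F1 from confusion-matrix counts in plain Python. The function names and the label lists are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = survived."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 (0.0 when undefined)."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 5 non-survivors and 3 survivors.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]

# A model that always predicts "did not survive" still gets 5 of 8 right,
# but never identifies a survivor, so its F1 is zero.
baseline = classification_metrics(y_true, [0] * 8)

# A model that finds 2 of the 3 survivors with 1 false alarm.
better = classification_metrics(y_true, [0, 0, 0, 0, 1, 1, 1, 0])

print(baseline)
print(better)
```

The always-negative baseline shows why accuracy can look acceptable on imbalanced labels while F1 exposes that no positive case was found.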
The complete analysis and implementation are available in the notebook:
pip install -r requirements.txt