Tanzanian Waterwells Project

Overview

The Tanzanian Waterwells Project aims to predict water pump functionality in support of clean water access for the people of Tanzania. The project uses machine learning (ML) techniques to classify water wells as functional, non-functional, or functional but in need of repair. Through data analysis and model optimization, our goal is to provide an ML model that accurately predicts which water pumps are functional and produce clean water.

Team

Shefat Moral
Gabriel Santorelli
David Jimenez

Project Goals

Our goal for this project was to build an effective ML model that predicts water pump functionality, in support of bringing clean water to the people of Tanzania 🇹🇿.

Data Exploration (EDA)

Key Findings:

The dataset contained three well statuses:
- Functional
- Non-Functional
- Functional Needs Repair
Functionality by Region: We analyzed geographic patterns of well status.
Water Quality Analysis: Functional wells did not always provide clean water.
- Categories such as Unknown, Fluoride Abandoned, and Salty were among the lowest in quality.
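
The comparison of well status against water quality can be sketched as a simple cross-tabulation. This is a minimal illustration using only the standard library, not the code from our notebooks; the records and the category spellings below are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical records: (water quality, well status). Values are illustrative only.
records = [
    ("soft", "functional"),
    ("soft", "functional"),
    ("soft", "non functional"),
    ("salty", "functional"),
    ("salty", "non functional"),
    ("unknown", "non functional"),
    ("unknown", "functional needs repair"),
]


def status_share_by_quality(rows):
    """Return {quality: {status: share}}, where shares sum to 1 per quality."""
    counts = defaultdict(Counter)
    for quality, status in rows:
        counts[quality][status] += 1

    shares = {}
    for quality, status_counts in counts.items():
        total = sum(status_counts.values())
        shares[quality] = {s: n / total for s, n in status_counts.items()}
    return shares


shares = status_share_by_quality(records)
for quality, dist in sorted(shares.items()):
    print(quality, {s: round(p, 2) for s, p in dist.items()})
```

Reading the shares row by row shows, for each water quality category, what fraction of wells is functional. That is the kind of view that reveals functional wells with poor water.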

Model Development

Baseline Model: Decision Tree

Initial accuracy: 75%
Strengths: Classified Class 0 (Functional) and Class 2 (Non-Functional) accurately.
Weakness: Struggled with Class 1 (Functional Needs Repair).
Class Balancing: Addressed class imbalance with class_weight and SMOTE (Synthetic Minority Over-sampling Technique).
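
To illustrate the class weighting idea, the sketch below computes "balanced" weights with the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn applies when `class_weight="balanced"` is set. The label counts are hypothetical, not the real distribution in our data, and SMOTE (which synthesizes new minority samples rather than reweighting) is not shown.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses n_samples / (n_classes * class_count), so rare classes get
    larger weights and the weighted class totals come out equal.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical distribution: 0 = Functional, 1 = Needs Repair, 2 = Non-Functional.
labels = [0] * 54 + [1] * 7 + [2] * 39
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

With these made-up counts the Needs Repair class receives by far the largest weight, so misclassifying it costs the model more during training.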

Improved Model: Random Forest

Initial accuracy: 79%
Improved precision for Class 1 (60%) and Class 2 (84%).
Struggled to distinguish Class 1 from Class 2.
Next step: further EDA to decide how to optimize the model.

Feature Engineering & Model Optimization

Correlating Features: Engineered features based on relationships (e.g., gps_height, function_years).
Hyperparameter Tuning: Adjusted parameters for better classification performance.
Error Analysis: Investigated the features that were potentially causing false positives.
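
As an illustration of an engineered feature, the sketch below derives a `function_years` value as the number of years between a well's construction and the date it was recorded. This is an assumption about what such a feature could look like, not the definition used in our notebooks. The input names `date_recorded` and `construction_year` are also assumed, as is the convention that a construction year of 0 means "missing".

```python
from datetime import date


def function_years(date_recorded, construction_year):
    """Years between construction and recording, or None if not computable.

    Assumptions (hypothetical, for illustration):
    - construction_year of 0 or None means the year is missing
    - a construction year later than the recording year is treated as invalid
    """
    if not construction_year or construction_year > date_recorded.year:
        return None
    return date_recorded.year - construction_year


print(function_years(date(2013, 3, 14), 2000))  # 13
print(function_years(date(2013, 3, 14), 0))     # None
```

Returning None for missing or invalid years keeps those rows distinguishable, so they can be imputed or flagged separately instead of producing implausible ages.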

Final Model Performance

Optimized Random Forest Model
- Error Analysis: Pumps that needed repair were not producing good water, so we grouped Needs Repair with Non-Functional.
- Accuracy: up to 81%, a 6 percentage point increase over our baseline model (75%).
- Precision: up to 0.82, a 0.36 increase over our baseline model (0.46).
- Recall: up to 0.74, a 0.42 increase over our baseline model (0.32).
- F1-Score: up to 0.78, a 0.40 increase over our baseline model (0.38).
- Reduced false positives by 200 cases.
- Mean CV Accuracy: 0.793, close to the 81% accuracy reported above, which suggests the model is not substantially overfitting.
- Standard deviation of CV accuracy: 0.0025, indicating stable performance across folds.
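
The grouping step and the metrics above can be sketched in plain Python. This is a minimal illustration rather than our notebook code: it assumes Functional is the positive class (so a false positive is a well predicted functional that is not), and the labels are made up. The last helper shows that the reported gains are differences in percentage points, not relative increases.

```python
def to_binary(label):
    """Collapse three classes into two.

    Assumption: 0 = Functional becomes the positive class (1); Needs Repair (1)
    and Non-Functional (2) are merged into the negative class (0).
    """
    return 1 if label == 0 else 0


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
        "false_positives": fp,
    }


def point_change(before, after):
    """Difference between two rates, in percentage points."""
    return round((after - before) * 100, 1)


# Hypothetical three-class labels (0, 1, 2 as in the sections above).
y_true_3 = [0, 0, 0, 0, 1, 2, 2, 2, 1, 0]
y_pred_3 = [0, 0, 1, 0, 0, 2, 2, 0, 2, 0]

metrics = binary_metrics(
    [to_binary(y) for y in y_true_3],
    [to_binary(y) for y in y_pred_3],
)
print(metrics)
print(point_change(0.75, 0.81), "percentage points")
```

Note that after grouping, a prediction of Needs Repair for a Non-Functional well counts as correct, which is one reason the binary metrics are higher than the three-class ones.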

Future Improvements

Continue Error Analysis: Keep exploring features that are causing false positives.
Explore Additional Models: Try models such as SVM and Gradient Boosting to further improve accuracy.

Acknowledgments

We appreciate the Tanzanian government and local organizations for their work bringing clean water to communities. Special thanks to our peers and instructor for their guidance throughout this project.

