This project focuses on predicting the forest cover type from cartographic and environmental variables using machine learning.
It trains and compares three classification models (Decision Tree, Random Forest, and XGBoost) and further improves performance with hyperparameter tuning.
The project demonstrates the full machine learning pipeline: data cleaning, preprocessing, training, evaluation, visualization, and tuning.
- Source: UCI Machine Learning Repository — Covertype Dataset
- Format: `.data` (converted to `.csv` for processing)
- Number of Instances: 581,012
- Number of Features: 54 cartographic variables (e.g., elevation, slope, soil type, distance to hydrology)
- Target Variable: `Cover_Type` (multi-class, 7 forest cover categories)
The target variable `Cover_Type` has 7 categories representing different types of forest cover:
- Spruce/Fir
- Lodgepole Pine
- Ponderosa Pine
- Cottonwood/Willow
- Aspen
- Douglas-fir
- Krummholz
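Assuming the raw UCI file (`covtype.data`, no header row) and the column order given in the dataset description, loading it might look like:

```python
import pandas as pd

# Column names reconstructed from the UCI Covertype description (assumed):
# 10 numeric features, 4 wilderness-area dummies, 40 soil-type dummies, target.
NUMERIC = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon",
    "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points",
]
COLUMNS = (NUMERIC
           + [f"Wilderness_Area_{i}" for i in range(1, 5)]
           + [f"Soil_Type_{i}" for i in range(1, 41)]
           + ["Cover_Type"])

def load_covtype(path="covtype.data"):
    """Read the raw UCI file (no header row) into a labelled DataFrame."""
    return pd.read_csv(path, header=None, names=COLUMNS)
```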
**Decision Tree**
- A tree-like model that classifies by splitting on feature thresholds, branch by branch.
- Easy to interpret, but can overfit if the tree grows too deep.
- A good baseline model for classification.
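A minimal sketch of such a baseline, on synthetic stand-in data rather than the real Covertype features (`max_depth=10` is an illustrative cap, not the project's setting):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 7-class data stands in for the Covertype features (illustration).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Capping max_depth is one way to limit overfitting.
tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_train, y_train)
print(round(tree.score(X_test, y_test), 3))
```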
**Random Forest**
- An ensemble of many decision trees (a "forest").
- Each tree is trained on a random subset of the rows and features.
- More accurate and robust than a single decision tree because averaging reduces overfitting.
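The same sketch with a forest, again on synthetic stand-in data (`n_estimators=100` and `max_features="sqrt"` are illustrative defaults, not the project's tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Each tree fits a bootstrap sample of rows and considers a random subset of
# features (max_features="sqrt") at every split; averaging reduces variance.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=42)
forest.fit(X_train, y_train)
print(round(forest.score(X_test, y_test), 3))
```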
**XGBoost**
- A gradient-boosting algorithm that builds trees sequentially.
- Each new tree corrects the errors of the previous ones.
- Very powerful on structured/tabular data, often reaching state-of-the-art results.
- Requires careful hyperparameter tuning for best performance.
**Data Cleaning & Preprocessing**
- Added column names from the dataset description.
- Checked for missing values (none were found).
- Converted categorical features into usable formats.
- (Optional) Outlier detection using Z-score.
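The optional Z-score check can be sketched as follows (the threshold of 3 standard deviations is the conventional choice; the project's exact cutoff is not stated):

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking values more than `threshold`
    standard deviations away from the mean."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# One extreme elevation among otherwise uniform readings is flagged.
elevations = np.array([2800.0] * 19 + [9000.0])
mask = zscore_outliers(elevations)
print(mask.sum())  # → 1
```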
**Train-Test Split**
- 80% training, 20% testing.
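With scikit-learn the split is one call; the `stratify` argument is an assumption here, since the README does not say whether the split preserved class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 toy samples over 7 classes, mimicking Cover_Type labels.
X = np.arange(100).reshape(50, 2)
y = np.tile(np.arange(1, 8), 8)[:50]

# stratify=y keeps the 7-class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_train), len(X_test))  # → 40 10
```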
**Model Training**
- Trained Decision Tree, Random Forest, and XGBoost classifiers.
**Model Evaluation**
- Accuracy Score
- Precision, Recall, F1-Score
- Confusion Matrix (heatmap visualization)
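These metrics can be computed in a few lines; the data below is a synthetic stand-in, and the seaborn call in the comment is the usual way such a heatmap is drawn, not necessarily the project's exact plotting code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 7-class data stands in for the Covertype split (illustration only).
X, y = make_classification(n_samples=1500, n_features=10, n_informative=6,
                           n_classes=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", round(accuracy_score(y_test, y_pred), 3))
print(classification_report(y_test, y_pred))   # precision / recall / F1 per class
cm = confusion_matrix(y_test, y_pred)          # rows = true class, cols = predicted
# Heatmap visualization, e.g.: sns.heatmap(cm, annot=True, fmt="d")
```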
**Feature Importance**
- Visualized the most important features for tree-based models.
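A sketch of extracting the ranking from a fitted model; the feature names below are illustrative placeholders, not the project's actual top features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           random_state=0)
names = ["Elevation", "Slope", "Aspect", "Hillshade",
         "Dist_Hydrology", "Dist_Roadways"]  # placeholder names

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1; sort descending to rank features.
# (A bar chart of this ranking can be drawn with matplotlib's plt.barh.)
for i in np.argsort(model.feature_importances_)[::-1]:
    print(f"{names[i]:<16} {model.feature_importances_[i]:.3f}")
```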
**Hyperparameter Tuning**
- Used GridSearchCV and RandomizedSearchCV to optimize hyperparameters.
- Reduced overfitting and improved accuracy.
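A small GridSearchCV sketch on stand-in data; the grid below is illustrative, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Illustrative grid; the project's actual search spaces are in the notebook.
param_grid = {"n_estimators": [50, 100], "max_depth": [8, None]}

# GridSearchCV tries every combination with 3-fold cross-validation;
# RandomizedSearchCV samples the space instead, which scales better.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```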
| Model | Accuracy Score |
|---|---|
| Decision Tree (Default) | 0.9059 |
| Random Forest (Default) | 0.9308 |
| XGBoost (Default) | 0.8711 |
| Decision Tree (Tuned) | 0.9124 |
| Random Forest (Tuned) | 0.9524 |
| XGBoost (Tuned) | 0.9581 |
✅ Best Model: XGBoost (Tuned) with 95.8% accuracy.
- Confusion Matrix Heatmap → to check per-class predictions.
- Feature Importance Bar Chart → to identify top predictive features (e.g., Elevation, Horizontal Distance to Roadways).
1. Clone this repository:

   ```bash
   git clone https://github.com/Adeeba-Shahzadi/ForestCoverClassification-MultiClassificationModel.git
   ```

2. Navigate to the project folder:

   ```bash
   cd ForestCoverClassification-MultiClassificationModel
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Run the notebook or the script:

   ```bash
   jupyter notebook ForestCoverTypeClassification.ipynb
   ```

   or

   ```bash
   python forestcovertypeclassification.py
   ```