---
layout: single
title: "Breast Cancer Classification with Logistic Regression, CART, and Random Forest (R)"
date: 2025-03-03
category: analysis
author_profile: true
toc: true
toc_label: "Table of Contents"
toc_icon: "file"
toc_sticky: true
order: 4
#classes: wide
---

Date Posted: 2025-03-03

Category: [Data Projects](https://meng-kiat.github.io/analysis/){: .btn .btn--info .btn--small}

In this analysis, I used the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) to build and evaluate models for classifying tumors as benign or malignant. This project demonstrates a complete machine learning pipeline in R, including:

- Data cleaning and feature selection
- Correlation analysis
- Model training and tuning
- Handling class imbalance via sampling
- Comparison of model performance (Logistic Regression, CART, Random Forest)
# Project Objectives

The primary objective of this project was to investigate the efficacy of using ML models to assist in breast cancer diagnosis. Beyond that, I also look into:

1. Feature selection through variable importance
2. Methods of handling class imbalance through sampling
3. Hyperparameter tuning in random forests

The full code can be found below:

[View Notebook](){: .btn .btn--info .btn--small}
# Dataset

The dataset comes from the UCI Machine Learning Repository and contains numerical features computed from digitized images of breast masses. Key features include radius, texture, perimeter, area, and more — measured via mean, standard error, and worst-case metrics.

## Preprocessing the Dataset
### Removing Multicollinear Variables

To avoid multicollinearity, I used findCorrelation() from the caret package to identify and remove highly correlated variables (threshold > 0.8).

{% highlight r %}
# Correlation matrix of the numeric features
corr1 <- cor(data.dt)
corrplot(corr1, type = 'lower', method = 'color', ...)

# Drop the highly correlated features flagged above
to_drop <- c("concavity_mean", "compactness_mean", ...)
initial_data <- select(data.dt, -all_of(to_drop))

# Factorise the target for classification
initial_data$diagnosis <- factor(initial_data$diagnosis)
{% endhighlight %}

![corrplot](/assets/images/wisconsin/corrplot.png)

The target variable was also factorised to prepare for classification tasks.
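The findCorrelation() call mentioned above is not shown in the snippet; here is a minimal sketch of how it might be used, on a small synthetic matrix (the column names are illustrative stand-ins, not the real feature set):

{% highlight r %}
# Sketch of caret's findCorrelation() on a synthetic correlation matrix;
# the column names here are stand-ins for the real features.
library(caret)

set.seed(1)
x <- matrix(rnorm(300), ncol = 3)
x[, 3] <- x[, 1] + rnorm(100, sd = 0.05)  # make column 3 nearly collinear with column 1
colnames(x) <- c("radius_mean", "texture_mean", "perimeter_mean")

corr_synth <- cor(x)
to_drop <- findCorrelation(corr_synth, cutoff = 0.8, names = TRUE)
to_drop  # one of the collinear pair is flagged for removal
{% endhighlight %}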
### Feature Importance

To identify the most influential features, a baseline logistic regression and a random forest model were trained initially.

{% highlight r %}
# Baseline logistic regression and its variable importance
logmodel1 <- glm(diagnosis ~ ., data = initial_data, family = binomial())
vim1 <- varImp(logmodel1)

# Baseline random forest and its variable importance
rf_model1 <- randomForest(diagnosis ~ ., data = initial_data)
vim2 <- varImp(rf_model1)
{% endhighlight %}

The feature importance values were exported and reviewed in Excel to guide further feature reduction.
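The export step can be as simple as write.csv(); a sketch (the scores and file name here are assumptions, not the real output):

{% highlight r %}
# Sketch of exporting variable-importance scores to CSV for review in Excel;
# the data frame and file name are illustrative stand-ins.
vim <- data.frame(Overall = c(3.4, 2.1),
                  row.names = c("concave.points_mean", "texture_worst"))
write.csv(vim, "feature_importance.csv")
file.exists("feature_importance.csv")
{% endhighlight %}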

![Feat_impt](/assets/images/wisconsin/Feature_Importance.png)

# Model Building
## Logistic Regression

{% highlight r %}
# Full logistic regression on the training set
logmodel1 <- glm(diagnosis ~ ., data = trainset, family = binomial())
summary(logmodel1)

# Remove insignificant variables
logmodel2 <- glm(diagnosis ~ perimeter_mean + concave.points_mean + texture_worst + symmetry_worst, data = trainset, family = binomial())
summary(logmodel2)

# Evaluate model with confusion matrix
logmodel2.test <- predict(logmodel2, newdata = testset, type = 'response')

# Classify with threshold = 0.9
logmodel2.predict.test <- ifelse(logmodel2.test > 0.9, "1", "0")
{% endhighlight %}
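The confusion matrix referenced in the comments above can be computed with base R's table(); a minimal sketch on synthetic vectors (stand-ins for logmodel2.predict.test and testset$diagnosis, not the real results):

{% highlight r %}
# Sketch of a confusion matrix and accuracy from predicted vs. actual labels;
# these vectors are synthetic stand-ins, not the real test-set output.
predicted <- factor(c("0", "0", "1", "1", "0"), levels = c("0", "1"))
actual    <- factor(c("0", "1", "1", "1", "0"), levels = c("0", "1"))

cm <- table(Predicted = predicted, Actual = actual)
cm
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 0.8 on this toy example
{% endhighlight %}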

## Decision Tree (CART)
We use a CART decision tree, pruned using the cross-validated complexity parameter (CP).

{% highlight r %}
# Grow a maximal tree first
set.seed(100)
Cart1 <- rpart(diagnosis ~ ., data = trainset, method = 'class', control = rpart.control(minsplit = 2, cp = 0.0))

print(Cart1)
# View prune sequences, prune triggers, and 10-fold CV errors
printcp(Cart1)

# Identify the optimal CP from the CP plot
plotcp(Cart1)
cp1 = sqrt(0.0135135*0.0090090)
{% endhighlight %}

![cpplot](/assets/images/wisconsin/cpplot.png)

{% highlight r %}
# Prune the tree at the geometric mean of the bracketing CP values
cp1 = sqrt(0.006*0.008)
cp1
Cart2 <- prune(Cart1, cp = cp1)
printcp(Cart2)

rpart.plot(Cart2, nn = T, main = "Pruned Tree with cp = 0.006928203")
Cart2$variable.importance
{% endhighlight %}

![decisiontree](/assets/images/wisconsin/decisiontree.png)

## RandomForest Model
### Random Forest Tuning

I implemented a custom loop to test various mtry and ntree combinations and recorded their test set accuracy.

{% highlight r %}
rf_parameter_test <- function(mtry, ntree) {
  randomForest(diagnosis ~ ., data = trainset, mtry = mtry, ntree = ntree)
}

results <- data.frame()
for (mtry in 1:(ncol(trainset) - 1)) {   # mtry cannot exceed the number of predictors
  for (ntree in c(25, 100, 500)) {
    model <- rf_parameter_test(mtry, ntree)
    accuracy <- mean(predict(model, testset) == testset$diagnosis)
    results <- rbind(results, data.frame(mtry, ntree, accuracy))
  }
}
{% endhighlight %}

You can find the parameters and their results below:

![rf_parameters](/assets/images/wisconsin/rf_parameters.png)

I proceeded with the tuned model, RSF = 6 and B = 25 (mtry = 6, ntree = 25), as it performed well across both seeds.

{% highlight r %}
set.seed(100)
rfmodel1 <- randomForest(diagnosis ~ ., data = trainset, importance = T, ntree = 25, mtry = 6)
rfmodel1
var.impt <- importance(rfmodel1)

varImpPlot(rfmodel1, type = 1)

plot(rfmodel1)
{% endhighlight %}

![rfvarimp](/assets/images/wisconsin/rfmodelvarimp.png)
![rfmodel1](/assets/images/wisconsin/rfmodel1.png)

## Balancing Data

The data are imbalanced, with noticeably more benign than malignant diagnoses. Class imbalance can lead to issues such as overfitting, or the final model being inaccurate at identifying the minority class (malignant, in this context).

![diagnosis_imbalance](/assets/images/wisconsin/diagnosis_imbalance.png)

We will investigate the effects of upsampling/downsampling, with the original dataset as a control.

The downsampling/upsampling steps can be found below.

{% highlight r %}
# Balanced dataset, downsampled (caret's downSample appends a Class column)
trainset <- downSample(trainset_original, trainset_original$diagnosis)
View(trainset)
table(trainset$diagnosis)
{% endhighlight %}

{% highlight r %}
# Balanced dataset, upsampled
trainset <- upSample(trainset_original, trainset_original$diagnosis)
View(trainset)
table(trainset$diagnosis)
{% endhighlight %}

After generating the balanced datasets, we repeated the above process of building the three models and compared the results:

# Overall Evaluation
![accuracytable](/assets/images/wisconsin/accuracy.png)

Models trained on upsampled data generally did better than those trained on downsampled data. This is likely because the dataset is very small: even dropping a few observations during downsampling meant a relatively significant loss of information for the models.

In the context of breast cancer diagnosis, false negatives are the most dangerous outcomes, as they mean a person with cancer has gone undetected; the logistic regression model generally had the best performance at minimizing false negatives.

The random forest generally had the best performance in terms of accuracy and the other metrics.
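To quantify the false-negative point above, sensitivity (recall on the malignant class) can be read directly from caret's confusionMatrix(); a sketch on synthetic labels ("M" = malignant, "B" = benign, stand-ins for real model output):

{% highlight r %}
# Sketch of checking sensitivity (malignant cases caught) with caret;
# the label vectors are synthetic stand-ins for real predictions.
library(caret)

predicted <- factor(c("M", "B", "M", "B", "B", "M"), levels = c("B", "M"))
actual    <- factor(c("M", "M", "M", "B", "B", "B"), levels = c("B", "M"))

cm <- confusionMatrix(predicted, actual, positive = "M")
sens <- unname(cm$byClass["Sensitivity"])
sens  # 2 of the 3 malignant cases caught on this toy example
{% endhighlight %}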

# Conclusion

This project demonstrated the importance of:

- Careful feature selection to reduce multicollinearity
- Tuning hyperparameters for tree-based models
- Handling class imbalance in medical data
- Comparing multiple models for both interpretability and accuracy

In real-world applications like cancer diagnosis, model choice and threshold tuning have life-and-death implications. Balancing recall and precision is vital.
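A sketch of what that threshold tuning looks like in practice, sweeping the cutoff and tracing the recall/precision trade-off (the probabilities and labels below are synthetic stand-ins, with 1 = malignant):

{% highlight r %}
# Sweep the probability cutoff and report recall and precision at each;
# probs/actual are synthetic stand-ins for model output (1 = malignant).
probs  <- c(0.95, 0.80, 0.65, 0.40, 0.30, 0.10)
actual <- c(1, 1, 0, 1, 0, 0)

for (t in c(0.25, 0.50, 0.75)) {
  pred      <- as.integer(probs > t)
  tp        <- sum(pred == 1 & actual == 1)
  recall    <- tp / sum(actual == 1)
  precision <- if (sum(pred == 1) > 0) tp / sum(pred == 1) else NA
  cat(sprintf("threshold %.2f: recall %.2f, precision %.2f\n", t, recall, precision))
}
{% endhighlight %}

Lowering the cutoff catches more malignant cases (higher recall) at the cost of more false alarms (lower precision), which is why the choice matters so much in a diagnostic setting.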

This project shows that there is some efficacy in using ML models to diagnose breast cancer. It is also worth noting that certain features carried much of the predictive power - for example, **concave.points_mean**. However, the dataset is small, and it is recommended that future projects use a larger dataset for a more accurate evaluation of ML models for breast cancer diagnosis.

docs/_analysis/decisiontree.png

142 KB
Loading
92.5 KB
Loading
121 KB
Loading
387 KB
Loading
90.8 KB
Loading
142 KB
Loading
56 KB
Loading
155 KB
Loading

0 commit comments

Comments
 (0)