---
title: "KOI analysis - Models"
author: "Davide Tonetto"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    theme: cosmo
    highlight: tango
    toc: true
    toc_float: true
    toc_depth: 3
    number_sections: true
    df_print: paged
    code_folding: show
    fig_width: 10
    fig_height: 6
---
```{r setup, include=FALSE}
# Install parallelly and future as binaries (guarded so the download is not
# repeated on every knit)
if (!all(c("parallelly", "future") %in% rownames(installed.packages()))) {
  install.packages(c("parallelly", "future"),
    type = "binary", repos = "https://cloud.r-project.org"
  )
}

# Install (if needed) and load the required packages
packages <- c("tidyverse", "caret", "pROC", "car", "ggplot2", "mgcv", "glmnet")
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages], repos = "https://cloud.r-project.org")
}
invisible(lapply(packages, library, character.only = TRUE))

# Set global chunk options
knitr::opts_chunk$set(
  warning = FALSE, # Don't show warnings in the output
  message = FALSE, # Don't show package loading messages
  echo = TRUE, # Show R code chunks in the output
  fig.width = 10, # Set default figure width
  fig.height = 6 # Set default figure height
)

# Helper function for mode calculation (most frequent value)
calculate_mode <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  ux <- unique(x)
  # Handle empty or all-NA input
  if (length(ux) == 0) {
    return(NA_character_)
  }
  ux[which.max(tabulate(match(x, ux)))]
}
```
# Load data
Load the cleaned data from **`data_preparation.rmd`**. Assumes `koi_data.Rda` contains the necessary columns.
```{r load-data}
data_file_path <- "data/Rdas/koi_data.Rda"
# readRDS() expects a file written with saveRDS(); if data_preparation.rmd
# used save() instead, switch to load(data_file_path) here.
koi_data_raw <- readRDS(data_file_path)
```
# Logistic regression models
## Select columns
Define target and predictor columns. **Review this list carefully.**
```{r select-columns}
target_variable_name <- "koi_pdisposition"
predictor_cols <- c(
  "koi_period", "koi_duration", "koi_depth", "koi_prad", "koi_teq",
  "koi_insol", "koi_model_snr", "koi_steff", "koi_slogg", "koi_srad",
  "koi_smass", "koi_impact", "koi_ror", "koi_srho", "koi_sma", "koi_incl",
  "koi_dor", "koi_ldm_coeff1", "koi_ldm_coeff2", "koi_smet"
)
selected_cols <- c(target_variable_name, predictor_cols)
# Check which selected columns actually exist in the loaded data
selected_cols_exist <- selected_cols[selected_cols %in% names(koi_data_raw)]
missing_cols <- setdiff(selected_cols, selected_cols_exist)
if (length(missing_cols) > 0) {
  print(paste("Warning: The following selected columns were not found:", paste(missing_cols, collapse = ", ")))
}
if (!target_variable_name %in% selected_cols_exist) {
  stop(paste("Target variable", target_variable_name, "not found in the data!"))
}
# Subset the data
koi_data <- koi_data_raw[, selected_cols_exist]
print(paste("Selected", ncol(koi_data), "columns for analysis."))
```
## Split data into training and test sets
Stratified split based on the target variable.
```{r split-data}
set.seed(42) # for reproducibility
train_indices <- createDataPartition(koi_data[[target_variable_name]], p = 0.8, list = FALSE)
train_data <- koi_data[train_indices, ]
test_data <- koi_data[-train_indices, ]
print(paste("Training set size:", nrow(train_data)))
print(paste("Test set size:", nrow(test_data)))
```
## Handle Missing Values (NA) - AFTER Splitting
Apply simple median/mode imputation. **Consider more advanced methods if NAs are numerous or potentially non-random.**
```{r handle-nas}
# --- Impute Training Data ---
numeric_predictors_train <- names(train_data)[sapply(train_data, is.numeric)]
factor_predictors_train <- names(train_data)[sapply(train_data, function(x) is.character(x) || is.factor(x))]
factor_predictors_train <- setdiff(factor_predictors_train, target_variable_name)

# Impute numeric with median (calculated from training data)
train_medians <- list()
for (col in numeric_predictors_train) {
  if (any(is.na(train_data[[col]]))) {
    median_val <- median(train_data[[col]], na.rm = TRUE)
    if (is.na(median_val)) {
      median_val <- 0
      print(paste("Warning: Train median NA for", col))
    }
    train_data[[col]][is.na(train_data[[col]])] <- median_val
    train_medians[[col]] <- median_val # Store median for applying to test set
  }
}

# Impute character/factor with mode (calculated from training data)
train_modes <- list()
for (col in factor_predictors_train) {
  if (any(is.na(train_data[[col]]))) {
    mode_val <- calculate_mode(train_data[[col]], na.rm = TRUE)
    if (is.na(mode_val)) {
      mode_val <- "Missing"
      print(paste("Warning: Train mode NA for", col))
    }
    train_data[[col]][is.na(train_data[[col]])] <- mode_val
    train_modes[[col]] <- mode_val # Store mode for applying to test set
  }
}

# Remove rows with NA in target (TRAIN)
rows_before_na_target_train <- nrow(train_data)
train_data <- train_data %>% filter(!is.na(.data[[target_variable_name]]))
if (nrow(train_data) < rows_before_na_target_train) print("Removed rows with NA target in TRAIN")

# --- Impute Test Data (using values from training data) ---
# Impute numeric with *training* median
for (col in names(train_medians)) {
  if (col %in% names(test_data) && any(is.na(test_data[[col]]))) {
    test_data[[col]][is.na(test_data[[col]])] <- train_medians[[col]]
  }
}
# Impute character/factor with *training* mode
for (col in names(train_modes)) {
  if (col %in% names(test_data) && any(is.na(test_data[[col]]))) {
    test_data[[col]][is.na(test_data[[col]])] <- train_modes[[col]]
  }
}

# Handle any remaining NAs in test set predictors if medians/modes couldn't be calculated or applied
na_check_test <- colSums(is.na(test_data %>% select(-all_of(target_variable_name))))
if (any(na_check_test > 0)) {
  print("Warning: NAs still present in test set predictors after imputation:")
  print(na_check_test[na_check_test > 0])
  test_data <- test_data[complete.cases(test_data %>% select(-all_of(target_variable_name))), ]
  print("Removed test rows with remaining NAs in predictors.")
}

# Remove rows with NA in target (TEST)
rows_before_na_target_test <- nrow(test_data)
test_data <- test_data %>% filter(!is.na(.data[[target_variable_name]]))
if (nrow(test_data) < rows_before_na_target_test) print("Removed rows with NA target in TEST")
```
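As an alternative to the manual median loop above, the same train/test imputation discipline can be expressed with caret's `preProcess(method = "medianImpute")`, which learns medians from the training data only and applies them to new data. A minimal sketch on hypothetical toy values (note it imputes numeric columns only, so mode imputation for factors would still be manual):

```r
library(caret)

# Toy data standing in for the KOI predictors (hypothetical values)
toy_train <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 40))
toy_test <- data.frame(a = c(NA, 5), b = c(NA, 50))

# Medians are learned from toy_train only...
imp <- preProcess(toy_train, method = "medianImpute")

# ...and applied to both sets, mirroring the train/test discipline above
toy_test_imputed <- predict(imp, toy_test)
toy_test_imputed # a[1] becomes 2 (train median of a), b[1] becomes 30
```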
## Convert Predictors and Target to Factors
Ensure categorical predictors and the target variable are factors with consistent levels.
```{r convert-factors}
# Convert target variable to factor with desired levels and labels
train_data[[target_variable_name]] <- factor(train_data[[target_variable_name]],
  levels = c("CANDIDATE", "FALSE POSITIVE"),
  labels = c("candidate", "false_positive")
)
test_data[[target_variable_name]] <- factor(test_data[[target_variable_name]],
  levels = c("CANDIDATE", "FALSE POSITIVE"),
  labels = c("candidate", "false_positive")
)
print("Converted target variable to factor with levels:")
print(levels(train_data[[target_variable_name]]))

# Define positive class based on the new factor levels
positive_class <- levels(train_data[[target_variable_name]])[2]
if (is.na(positive_class)) stop("Could not determine positive class level after factoring.")
print(paste("Positive class for evaluation:", positive_class))
```
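A quick sanity check on the convention used here: `stats::glm()` with a binomial family and a factor response models the probability of the second factor level, which is why `positive_class` is taken as `levels(...)[2]`. A toy illustration with synthetic data (not the KOI set):

```r
set.seed(1)
# Synthetic response with the same level order as the KOI target
y <- factor(rep(c("candidate", "false_positive"), each = 50),
  levels = c("candidate", "false_positive")
)
# A predictor that is systematically higher for the second level
x <- as.numeric(y == "false_positive") + rnorm(100, sd = 0.3)
fit <- glm(y ~ x, family = binomial)

# Fitted values are P(y == "false_positive"): higher for the second level
mean(fitted(fit)[y == "false_positive"]) > mean(fitted(fit)[y == "candidate"])
```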
## Scale Numerical Predictors
Scale numeric predictors AFTER splitting and handling NAs/factors. Fit scaler only on training data.
```{r scale-numerics}
numeric_predictors_final <- names(train_data)[sapply(train_data, is.numeric)]

# Check for zero-variance columns before scaling
nzv <- nearZeroVar(train_data[, numeric_predictors_final], saveMetrics = TRUE)
cols_to_scale <- rownames(nzv)[!nzv$zeroVar]
if (length(cols_to_scale) < length(numeric_predictors_final)) {
  print("Warning: Found zero-variance numeric columns, excluding from scaling:")
  print(rownames(nzv)[nzv$zeroVar])
}

if (length(cols_to_scale) > 0) {
  # Fit the scaler on training data only, then apply it to both sets
  scaler <- preProcess(train_data[, cols_to_scale], method = c("center", "scale"))
  train_data_scaled <- predict(scaler, train_data)
  test_data_scaled <- predict(scaler, test_data)
  print(paste("Scaled numeric predictors:", paste(cols_to_scale, collapse = ", ")))
} else {
  print("No non-zero variance numeric predictors found to scale.")
  train_data_scaled <- train_data
  test_data_scaled <- test_data
}
```
## Model Fitting
Define the formula and fit the GLM.
```{r model-fitting}
glm_original_formula <- as.formula(paste(target_variable_name, "~ ."))
print(paste("Using formula:", deparse(glm_original_formula)))
glm_original_model <- glm(glm_original_formula,
  data = train_data_scaled,
  family = binomial(link = "logit")
)
summary(glm_original_model)
```
## Model interpretation
Variables Associated with an **INCREASED** Likelihood of being a **"FALSE POSITIVE"**:
- **koi_duration (1.15093 \*\*\*)**: Longer transit durations are associated with a higher chance of being a "false_positive".
This might seem counterintuitive, as real planets have specific durations. However, some types of false positives (e.g., grazing eclipsing binaries, certain stellar variability) might also produce long-duration events that get flagged.
- **koi_depth (2.19920 \*\*\*)**: Deeper transits are associated with a higher chance of being a "false_positive".
This is also potentially counterintuitive, as deep transits are strong signals. However, very deep events might be more likely to be stellar eclipsing binaries (another star, not a planet) rather than a planet, especially if `koi_ror` is also large.
- **koi_teq (1.82665 \*\*\*)**: Higher equilibrium temperatures are associated with a higher chance of being a "false_positive".
This could mean that objects with extremely high calculated T_eq are more likely to be misclassified astrophysical phenomena or certain types of eclipsing binaries rather than stable planetary candidates.
- **koi_slogg (0.67690 \*\*\*)**: Higher stellar surface gravity (log₁₀(cm s⁻²)) is associated with a higher chance of being a "false_positive".
Stars with very high surface gravity are typically compact (e.g., some types of dwarfs or even white dwarfs if they were in the sample). Transits around these might have different characteristics or be mimicked by phenomena that are ultimately classified as false positives. Alternatively, if true candidates are mostly around typical main-sequence stars, deviations from that `slogg` might increase FP likelihood.
- **koi_smass (0.44922 \*\*\*)**: Higher stellar mass is associated with a higher chance of being a "false_positive".
Similar to `koi_slogg`, transits around more massive stars in the dataset may be more prone to being astrophysical false positives, or harder to distinguish from them.
- **koi_impact (0.53968 \*\*\*)**: A higher impact parameter is associated with a higher chance of being a "false_positive".
This makes sense: a higher impact parameter means a more grazing transit. Grazing transits often have a V-shape (rather than a U-shape) and are a common signature of eclipsing binaries and other false positive scenarios.
- **koi_ror (2.55320 \*\*\*)**: A larger planet-to-star radius ratio is associated with a higher chance of being a "false_positive".
This is a very strong indicator. While a larger planet is easier to detect, a very large `koi_ror` often points towards an eclipsing binary system (two stars orbiting each other) rather than a planet transiting a star, as planets are generally much smaller than their host stars. This is a classic false positive signature.
- **koi_dor (0.58369 \*\*\*)**: A larger planet-star distance over star radius (distance at mid-transit normalized by stellar radius) is associated with a higher chance of being a "false_positive".
`koi_dor` is related to the impact parameter and how far from the star's center the transit occurs, scaled by the star's radius. A larger value could indicate a more grazing transit geometry, consistent with `koi_impact` also pointing towards "false_positive".
Variables Associated with a **DECREASED** Likelihood of being a **"FALSE POSITIVE"** (i.e., **INCREASED** Likelihood of being a **"CANDIDATE"**):
- **koi_insol (-0.44363 \*\*\*)**: Higher insolation flux is associated with a lower chance of being a "false_positive" (i.e., more likely a "candidate").
This suggests that objects receiving more stellar flux (often closer planets) are more likely to be genuine candidates in the dataset once other factors are taken into account.
- **koi_steff (-0.48874 \*\*)**: Higher stellar effective temperature is associated with a lower chance of being a "false_positive" (i.e., more likely a "candidate").
Transits around hotter stars might be cleaner or less prone to certain types of astrophysical mimics that get flagged as false positives when occurring around cooler stars (which can have more stellar activity).
- **koi_incl (-0.41634 \*\*\*)**: Higher inclination is associated with a lower chance of being a "false_positive" (i.e., more likely a "candidate").
This is expected: an inclination close to 90 degrees (edge-on) is required for a transit, so a higher `koi_incl` corresponds to a more central transit geometry. The model correctly identifies that central (or near-central) transits are more likely to be genuine candidates.
- **koi_ldm_coeff1 (-1.05599 \*)**, **koi_ldm_coeff2 (-0.98381 \*\*)**: Specific trends in these limb darkening coefficients are associated with a lower chance of being a "false_positive".
This implies that transit light curve shapes that are well-fit with these particular limb darkening characteristics are more typical of genuine planetary candidates than false positives.
- **koi_smet (-0.56487 \*\*\*)**: Higher stellar metallicity is associated with a lower chance of being a "false_positive" (i.e., more likely a "candidate").
This aligns very well with astrophysical understanding since stars with higher metallicity are known to be more likely to host planets (especially gas giants). So, the model finds that higher metallicity makes an object more likely to be a genuine candidate.
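Because the predictors are centered and scaled, each coefficient is the change in the log-odds of "false_positive" per one standard deviation of that predictor; exponentiating converts it to an odds ratio (on the fitted model, `exp(coef(glm_original_model))`). A small sketch using three of the coefficient values quoted above, hard-coded purely for illustration:

```r
# Log-odds coefficients quoted in the interpretation above
coefs <- c(koi_duration = 1.15093, koi_ror = 2.55320, koi_smet = -0.56487)

# Odds ratios: multiplicative change in the odds of "false_positive"
# per one standard deviation of the (scaled) predictor
odds_ratios <- exp(coefs)
round(odds_ratios, 2) # koi_duration ~3.16, koi_ror ~12.85, koi_smet ~0.57
```

So one standard deviation more transit duration roughly triples the odds of "false_positive", all else equal, while higher metallicity reduces them.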
## Model Performance
Predict on the test set and evaluate.
```{r model-performance}
original_probs <- predict(glm_original_model, newdata = test_data_scaled, type = "response")
# Ensure probabilities are valid
original_probs <- pmax(pmin(original_probs, 1 - 1e-15), 1e-15)

# Convert probabilities to class predictions using a 0.5 threshold
threshold <- 0.5
# Use the levels from the *test* data target factor for creating the prediction factor
test_target_levels <- levels(test_data_scaled[[target_variable_name]])
negative_class <- test_target_levels[1]
original_preds <- factor(ifelse(original_probs > threshold, positive_class, negative_class),
  levels = test_target_levels
)

test_target_factor <- test_data_scaled[[target_variable_name]]
cm <- confusionMatrix(original_preds,
  test_target_factor,
  positive = positive_class # Ensure the positive class is correctly specified
)
cm
```
Plot confusion matrix
```{r model-performance-heatmap}
cm_data <- as.data.frame(cm$table)
# Create heatmap
ggplot(cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 0.5) + # Add lines between tiles
  geom_text(aes(label = Freq), color = "white", size = 6) +
  # Use a color scheme less prone to issues for colorblindness if possible
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Frequency") +
  theme_minimal(base_size = 12) + # Set base font size
  labs(
    title = "Confusion Matrix Heatmap",
    x = "Actual Class",
    y = "Predicted Class"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # Center title
    axis.text.x = element_text(angle = 0, vjust = 1, hjust = 0.5),
    axis.title = element_text(size = 14),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )
```
ROC curve
```{r model-performance-plot}
roc_original <- roc(
  response = test_target_factor,
  predictor = original_probs,
  levels = test_target_levels # Specify levels explicitly
)
print(paste("GLM Original Features - AUC:", round(auc(roc_original), 4)))
```
Plot ROC curve
```{r model-performance-roc}
plot(roc_original,
  main = "ROC Curve - GLM Original Features",
  col = "blue",
  lwd = 2,
  print.auc = TRUE,
  print.auc.y = 0.75,
  print.auc.x = 0.75
)
```
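The 0.5 cutoff used above is only a default; pROC can also report the threshold that maximizes Youden's J (sensitivity + specificity − 1) via `coords(roc_obj, "best", best.method = "youden")`. A toy sketch on synthetic scores (hypothetical data, not the model above):

```r
library(pROC)

set.seed(9)
# Synthetic probabilities: positives score higher on average
resp <- factor(rep(c("neg", "pos"), each = 100))
pred <- c(rnorm(100, mean = 0.3, sd = 0.15), rnorm(100, mean = 0.7, sd = 0.15))

r <- roc(response = resp, predictor = pred, levels = c("neg", "pos"))
best <- coords(r, "best", best.method = "youden")
best # reports threshold, specificity, sensitivity
```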
## Outlier detection
Check for influential points using Cook's distance.
```{r}
influenceIndexPlot(glm_original_model, vars = "C")
```
Find and remove all influential points with Cook's distance > 0.5
```{r}
cooks_d <- cooks.distance(glm_original_model)
outlier_indices <- which(cooks_d > 0.5)
print(paste("Number of influential points to remove:", length(outlier_indices)))
# Remove all influential points
if (length(outlier_indices) > 0) {
  train_data_scaled <- train_data_scaled[-outlier_indices, ]
  print(paste("Removed", length(outlier_indices), "influential points"))
}
```
Refit the GLM with the updated training data.
```{r}
glm_original_model <- glm(glm_original_formula,
  data = train_data_scaled,
  family = binomial(link = "logit")
)
summary(glm_original_model)
```
Check outliers again.
```{r}
influenceIndexPlot(glm_original_model, vars = "C")
```
Find and remove all influential points with Cook's distance > 0.5
```{r}
cooks_d <- cooks.distance(glm_original_model)
outlier_indices <- which(cooks_d > 0.5)
print(paste("Number of influential points to remove:", length(outlier_indices)))
# Remove all influential points
if (length(outlier_indices) > 0) {
  train_data_scaled <- train_data_scaled[-outlier_indices, ]
  print(paste("Removed", length(outlier_indices), "influential points"))
}
```
Refit the GLM with the updated training data.
```{r}
glm_original_model <- glm(glm_original_formula,
  data = train_data_scaled,
  family = binomial(link = "logit")
)
summary(glm_original_model)
```
Check outliers again.
```{r}
influenceIndexPlot(glm_original_model, vars = "C")
```
No points exceed the Cook's distance threshold anymore, so the fit is no longer driven by a handful of influential observations.
## Model Evaluation without outliers
Predict on the test set and evaluate.
```{r}
original_probs <- predict(glm_original_model, newdata = test_data_scaled, type = "response")
# Ensure probabilities are valid
original_probs <- pmax(pmin(original_probs, 1 - 1e-15), 1e-15)

# Convert probabilities to class predictions using a 0.5 threshold
threshold <- 0.5
# Use the levels from the *test* data target factor for creating the prediction factor
test_target_levels <- levels(test_data_scaled[[target_variable_name]])
negative_class <- test_target_levels[1] # Should be 'candidate'
# Note: positive_class was defined earlier based on train_data levels
original_preds <- factor(ifelse(original_probs > threshold, positive_class, negative_class),
  levels = test_target_levels
)

test_target_factor <- test_data_scaled[[target_variable_name]]
cm <- confusionMatrix(original_preds,
  test_target_factor,
  positive = positive_class # Ensure the positive class is correctly specified
)
cm
```
Plot confusion matrix
```{r}
cm_data <- as.data.frame(cm$table)
# Create heatmap
ggplot(cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 0.5) + # Add lines between tiles
  geom_text(aes(label = Freq), color = "white", size = 6) +
  # Use a color scheme less prone to issues for colorblindness if possible
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Frequency") +
  theme_minimal(base_size = 12) + # Set base font size
  labs(
    title = "Confusion Matrix Heatmap",
    x = "Actual Class",
    y = "Predicted Class"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # Center title
    axis.text.x = element_text(angle = 0, vjust = 1, hjust = 0.5),
    axis.title = element_text(size = 14),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )
```
ROC curve
```{r}
roc_original <- roc(
  response = test_target_factor,
  predictor = original_probs,
  levels = test_target_levels # Specify levels explicitly
)
print(paste("GLM Original Features - AUC:", round(auc(roc_original), 4)))
```
Plot ROC curve
```{r}
plot(roc_original,
  main = "ROC Curve - GLM Original Features",
  col = "blue",
  lwd = 2,
  print.auc = TRUE,
  print.auc.y = 0.75,
  print.auc.x = 0.75
)
```
## Residual analysis
- **Plot 1: Residuals vs Fitted (Linear Predictor)**
Look for a lack of strong patterns, random scatter around the smoothed line (though some banding is expected for binary data). Curvature might indicate link function issues or missing non-linear terms.
- **Plot 2: Residuals vs Leverage** (`which = 5` in `plot.lm`)
Look for points with high leverage (far right) AND large standardized residuals (far top/bottom). Points outside Cook's distance contours (dashed lines) are influential.
```{r}
plot(glm_original_model, which = c(1, 5))
```
Calculate Variance Inflation Factors (VIFs). This check is especially important when using the original, untransformed predictors.
A high VIF (> 5, or > 10 by a more lenient convention) suggests multicollinearity is inflating the standard errors of the coefficients, making them unstable and unreliable.
```{r}
vif_values <- vif(glm_original_model)
print("VIF Values:")
print(vif_values)
# Check for high values
max_vif <- max(vif_values, na.rm = TRUE)
print(paste("Max VIF:", round(max_vif, 2)))
```
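For intuition, the VIF of predictor j equals 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on all the other predictors. A toy demonstration with two artificially correlated variables (synthetic data, not KOI columns):

```r
set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.5) # strongly correlated with x1

# R^2 of x1 regressed on the other predictor(s)
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1 # well above 1, reflecting variance shared with x2
```

This is why physically linked quantities like `koi_period` and `koi_sma` inflate each other's VIFs: most of each one's variance is predictable from the other.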
From the VIF values, we notice the following:
1. Physical Relationships: `koi_period` and `koi_sma` are expected to be highly correlated due to Kepler's Third Law.
2. Parameter Dependencies: Limb darkening coefficients (`koi_ldm_coeff1`, `koi_ldm_coeff2`) are often highly correlated with each other and can also be correlated with stellar temperature (`koi_steff`). Stellar parameters (`koi_steff`, `koi_teq`, `koi_smass`) are often correlated among themselves.
We proceed by removing highly correlated predictors from the model.
```{r}
# Remove highly correlated predictors
predictors_to_remove <- c("koi_sma", "koi_ldm_coeff2")
train_data_scaled <- train_data_scaled[, !names(train_data_scaled) %in% predictors_to_remove]
test_data_scaled <- test_data_scaled[, !names(test_data_scaled) %in% predictors_to_remove]
# Update predictor_cols
predictor_cols <- predictor_cols[!predictor_cols %in% predictors_to_remove]

# Refit the model with updated data
glm_original_model <- glm(glm_original_formula,
  data = train_data_scaled,
  family = binomial(link = "logit")
)
summary(glm_original_model)
```
## Updated interpretation
Overall Model Changes & Fit:
- **AIC**: The AIC has decreased from **5241.5** in the previous model to **5092.1** in this model. A lower AIC generally suggests a better trade-off between model fit and complexity. This is an improvement.
- **Deviance**: The residual deviance has also decreased (**5054.1** vs. **5199.5**), indicating a better fit to the data with this new model specification.
- **Predictors**: `koi_sma` and `koi_ldm_coeff2` are no longer present. This change in model structure is likely contributing to the improved AIC.
Coefficients Interpretation (Positive Class: _"false_positive"_):
- **(Intercept) 1.12710 (\*\*\*)**: When all scaled predictor variables are at their mean value (0), the log-odds of the KOI being a "false_positive" is 1.12710. This is highly significant.
Variables Associated with an **INCREASED** Likelihood of being a **"FALSE POSITIVE"**:
- **koi_duration (1.26967 \*\*\*)**: Longer transit durations are strongly associated with an increased likelihood of being a "false_positive". Consistent with the previous model, slightly stronger effect.
- **koi_depth (0.78093 \*)**: Deeper transits are associated with an increased likelihood of being a "false_positive". Still positive, but the effect size and significance are notably reduced compared to the previous 2.19920 \*\*\*.
- **koi_teq (2.12831 \*\*\*)**: Higher equilibrium temperatures are very strongly associated with an increased likelihood of being a "false_positive". Consistent, effect size increased from 1.82665 \*\*\*.
- **koi_slogg (0.79903 \*\*\*)**: Higher stellar surface gravity is very strongly associated with an increased likelihood of being a "false_positive". Consistent, slightly stronger effect than 0.67690 \*\*\*.
- **koi_smass (0.46510 \*\*\*)**: Higher stellar mass is strongly associated with an increased likelihood of being a "false_positive". Consistent, similar magnitude to 0.44922 \*\*\*.
- **koi_impact (0.41540 \*\*\*)**: A higher impact parameter (more grazing transit) is strongly associated with an increased likelihood of being a "false_positive". Consistent, slightly smaller magnitude than 0.53968 \*\*\*.
- **koi_ror (5.69350 \*\*\*)**: A larger planet-to-star radius ratio is very strongly associated with an increased likelihood of being a "false_positive". This effect has become much stronger—previously 2.55320 \*\*\*—highlighting `koi_ror` as an even more critical indicator of false positives.
- **koi_dor (0.62047 \*\*\*)**: Larger planet-star distance over star radius (at mid-transit) is very strongly associated with an increased likelihood of being a "false_positive". Consistent, similar magnitude to 0.58369 \*\*\*.
- **koi_ldm_coeff1 (0.18397 \*\*)**: This coefficient has flipped sign and changed significance. Previously -1.05599 \*, now positive. Indicates that higher values of this coefficient are now associated with false positives, possibly due to the removal of `koi_ldm_coeff2`.
Variables Associated with a **DECREASED** Likelihood of being a **"FALSE POSITIVE"**:
- **koi_insol (-1.20622 \*\*\*)**: Higher insolation flux is very strongly associated with a decreased likelihood of being a "false_positive". Consistent, and effect size more than doubled from -0.44363 \*\*\*.
- **koi_steff (-0.17450 \*)**: Higher stellar effective temperature is associated with a decreased likelihood of being a "false_positive". Consistent, but effect size and significance reduced from -0.48874 \*\*.
- **koi_incl (-0.24948 \*)**: Higher inclination (closer to 90°, more edge-on) is associated with a decreased likelihood of being a "false_positive". Consistent, but effect size and significance reduced from -0.41634 \*\*\*.
- **koi_smet (-0.66341 \*\*\*)**: Higher stellar metallicity is very strongly associated with a decreased likelihood of being a "false_positive". Consistent, slightly stronger effect than -0.56487 \*\*\*.
## Model Evaluation without correlated predictors
Predict on the test set and evaluate.
```{r}
original_probs <- predict(glm_original_model, newdata = test_data_scaled, type = "response")
# Ensure probabilities are valid
original_probs <- pmax(pmin(original_probs, 1 - 1e-15), 1e-15)

# Convert probabilities to class predictions using a 0.5 threshold
threshold <- 0.5
# Use the levels from the *test* data target factor for creating the prediction factor
test_target_levels <- levels(test_data_scaled[[target_variable_name]])
negative_class <- test_target_levels[1] # Should be 'candidate'
# Note: positive_class was defined earlier based on train_data levels
original_preds <- factor(ifelse(original_probs > threshold, positive_class, negative_class),
  levels = test_target_levels
)

test_target_factor <- test_data_scaled[[target_variable_name]]
cm <- confusionMatrix(original_preds,
  test_target_factor,
  positive = positive_class # Ensure the positive class is correctly specified
)
cm
```
Plot confusion matrix
```{r}
cm_data <- as.data.frame(cm$table)
# Create heatmap
ggplot(cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 0.5) + # Add lines between tiles
  geom_text(aes(label = Freq), color = "white", size = 6) +
  # Use a color scheme less prone to issues for colorblindness if possible
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Frequency") +
  theme_minimal(base_size = 12) + # Set base font size
  labs(
    title = "Confusion Matrix Heatmap",
    x = "Actual Class",
    y = "Predicted Class"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # Center title
    axis.text.x = element_text(angle = 0, vjust = 1, hjust = 0.5),
    axis.title = element_text(size = 14),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )
```
ROC curve
```{r}
roc_original <- roc(
  response = test_target_factor,
  predictor = original_probs,
  levels = test_target_levels # Specify levels explicitly
)
print(paste("GLM Original Features - AUC:", round(auc(roc_original), 4)))
```
Plot ROC curve
```{r}
plot(roc_original,
main = "ROC Curve - GLM Original Features",
col = "blue",
lwd = 2,
print.auc = TRUE,
print.auc.y = 0.75,
print.auc.x = 0.75
)
```
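The 0.5 cutoff used above is only a convention. As a sketch, pROC's `coords()` can report the threshold that maximises Youden's J statistic (sensitivity + specificity - 1) on this ROC curve; note that a threshold tuned on the test set should be treated as exploratory, not as reportable test performance.
```{r}
# Threshold on the ROC curve maximising Youden's J (exploratory only)
coords(roc_original,
  x = "best", best.method = "youden",
  ret = c("threshold", "sensitivity", "specificity")
)
```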
```{r}
vif_values <- vif(glm_original_model)
print("VIF Values:")
print(vif_values)
# Check for high values
max_vif <- max(vif_values, na.rm = TRUE)
print(paste("Max VIF:", round(max_vif, 2)))
```
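As a quick follow-up, a common rule of thumb flags VIF values above 5 (moderate) or 10 (severe) as signs of problematic collinearity; a minimal sketch using the `vif_values` computed above:
```{r}
# List predictors above the usual VIF cutoff of 5, if any
high_vif <- vif_values[vif_values > 5]
if (length(high_vif) > 0) {
  print(sort(high_vif, decreasing = TRUE))
} else {
  print("No predictor exceeds VIF = 5")
}
```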
With all VIF values below the usual cutoffs, multicollinearity is no longer a concern for this model.
## Summary of GLM Results
The logistic regression model was fitted on the scaled original numeric features (excluding the binary flag variables) to predict `koi_pdisposition`.
1. **Model Fit & Significant Predictors**:
The overall model shows a highly significant improvement over the null model (intercept only), indicated by the large drop in deviance (Null: 8866.2, Residual: 5045.5).
Several predictors showed a statistically significant relationship (p < 0.05) with the target variable in this model: koi_duration, koi_depth, koi_teq, koi_insol, koi_steff, koi_slogg, koi_smass, koi_impact, koi_ror, koi_incl, koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, and koi_smet.
Notable predictors that were not statistically significant in this specific model include koi_period, koi_prad, koi_model_snr, koi_srad, koi_srho, and koi_sma. This could be due to collinearity with other included variables.
2. **Residual Analysis (Non-linearity Check)**:
Tests for non-linearity (curvature) were performed on the model's residuals.
Significant evidence (p < 0.05) of non-linear patterns remaining in the residuals was found for the following predictors: koi_depth, koi_teq, koi_insol, koi_slogg, koi_srad, koi_smass, koi_impact, koi_sma, koi_incl, koi_dor, and koi_smet. koi_period was borderline significant.
This indicates that the assumption of a purely linear relationship between these predictors and the log-odds of the outcome is likely violated.
3. **Conclusion**:
The GLM identifies many significant predictors for exoplanet disposition. However, the residual analysis strongly suggests that the model's linear structure is insufficient to capture the underlying relationships for numerous variables. This indicates that model performance could potentially be improved by incorporating non-linear effects or using models better suited to capturing such complexities.
# Model improvements
## GAM model
Let's start by fitting a GAM model in order to use smooth functions s() for predictors showing non-linearity in residual analysis.
```{r}
nonlinear_predictors_gam <- c(
"koi_depth", "koi_teq", "koi_insol", "koi_slogg",
"koi_srad", "koi_smass", "koi_impact", "koi_sma",
"koi_incl", "koi_dor", "koi_smet", "koi_period"
)
nonlinear_predictors_gam <- nonlinear_predictors_gam[nonlinear_predictors_gam %in% names(train_data_scaled)]
linear_predictors_gam <- setdiff(predictor_cols, nonlinear_predictors_gam)
```
Prepare the formula for the GAM model.
```{r}
smooth_terms <- if (length(nonlinear_predictors_gam) > 0) paste(paste0("s(", nonlinear_predictors_gam, ")"), collapse = " + ") else NULL
linear_terms <- if (length(linear_predictors_gam) > 0) paste(linear_predictors_gam, collapse = " + ") else NULL
gam_formula_str <- paste(target_variable_name, "~", paste(c(smooth_terms, linear_terms), collapse = " + "))
# Remove potential trailing/leading "+" if one group was empty
gam_formula_str <- gsub("\\+ $", "", gsub("^\\+ ", "", gsub(" \\+ \\+ ", " + ", gam_formula_str)))
gam_formula <- as.formula(gam_formula_str)
print(paste("Using GAM formula:", gam_formula_str))
```
Fit the GAM model.
```{r}
gam_model <- gam(gam_formula,
data = train_data_scaled,
family = binomial(link = "logit"),
method = "REML"
)
summary(gam_model)
```
Plot of the splines effect:
```{r}
# One panel per fitted smooth term (the model has 12 smooths, so loop over all of them)
n_smooths <- length(gam_model$smooth)
par(mfrow = c(3, 4))
for (i in seq_len(n_smooths)) {
  plot(gam_model, select = i, shade = TRUE, shade.col = "lightblue")
  abline(h = 0, lty = "dashed")
}
```
As we can see from the plots, not all of the splines are significant: for example, `koi_period`, `koi_insol`, `koi_smass`, `koi_incl`, `koi_depth` and `koi_impact` are not. We will remove them from the model.
```{r}
gam_model <- update(gam_model, . ~ . - s(koi_period) - s(koi_insol) - s(koi_smass) - s(koi_incl) - s(koi_depth) - s(koi_impact))
summary(gam_model)
```
### Parametric Coefficients (Linear Effects)
These coefficients are interpreted like those in a standard GLM: they represent the change in the log-odds of being a `"false_positive"` for a one-unit increase in the (scaled) predictor, holding other variables constant.
- **(Intercept) 1.00430 (\*\*\*)**: The baseline log-odds of a KOI being a `"false_positive"` is 1.00430 when all linear predictors are zero and the smooth functions are at their average effect. This is highly significant.
- **koi_duration 1.41230 (\*\*\*)**: Longer transit durations significantly increase the log-odds of being a `"false_positive"`. (Odds Ratio ≈ e^1.41230 ≈ 4.10). This effect is consistent with and slightly stronger than in the previous GLM.
- **koi_prad -0.19688 (p=0.45961)**: Planetary radius is not a significant linear predictor of `"false_positive"` status in this model.
- **koi_model_snr -0.04401 (p=0.53163)**: The transit signal-to-noise ratio is not a significant linear predictor.
- **koi_steff 0.23853 (\*)**: Higher stellar effective temperature (koi_steff) significantly increases the log-odds of being a `"false_positive"`. (Odds Ratio ≈ e^0.23853 ≈ 1.27).
**Important Change**: This is a flip in direction compared to the previous GLM where higher `koi_steff` decreased the odds of being a false positive. Now, in the context of this GAM (with other variables modeled non-linearly), higher stellar temperatures are associated with a higher likelihood of being a false positive. This suggests that the relationship is complex and was being influenced by the enforced linearity of other terms in the GLM.
- **koi_ror 7.67484 (\*\*\*)**: The planet-to-star radius ratio (`koi_ror`) has a very large and highly significant positive effect on the log-odds of being a `"false_positive"`. (Odds Ratio ≈ e^7.67484 ≈ 2153). This remains an extremely strong indicator, likely identifying eclipsing binaries. The effect size is even larger than in the GLM.
- **koi_srho 0.02748 (p=0.35658)**: Fitted stellar density is not a significant linear predictor.
- **koi_ldm_coeff1 0.19630 (\*\*)**: The first limb darkening coefficient has a significant positive linear effect on the log-odds of being a `"false_positive"`. (Odds Ratio ≈ e^0.19630 ≈ 1.22). This is consistent with the GLM.
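The odds ratios quoted in the bullets above can be recovered directly from the fitted object. A sketch: mgcv names smooth-term coefficients with an `s(...)` prefix, so the parametric (linear) coefficients are the ones without it.
```{r}
# Odds ratios for the parametric (linear) terms only
all_coefs <- coef(gam_model)
param_coefs <- all_coefs[!grepl("^s\\(", names(all_coefs))]
round(exp(param_coefs), 3)
```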
### Approximate Significance of Smooth Terms (Non-Linear Effects)
For these variables, the relationship with the log-odds of being a `"false_positive"` is not assumed to be linear. The `edf` (estimated degrees of freedom) indicates the complexity of the curve; an `edf` close to 1 suggests a near-linear relationship, while higher values indicate more "wiggles" or non-linearity.
- **s(koi_teq) edf=5.699 (\*\*\*)**: Equilibrium temperature (`koi_teq`) has a highly significant and non-linear relationship with the likelihood of being a `"false_positive"`. The `edf` of ~5.7 suggests a fairly complex curve. The linear GLM showed a simple positive association; this GAM indicates the true relationship is more nuanced across the range of `koi_teq`.
- **s(koi_slogg) edf=4.830 (\*\*)**: Stellar surface gravity (`koi_slogg`) also has a highly significant and non-linear effect. The `edf` of ~4.8 indicates a complex relationship, not just a simple linear trend as was suggested by the GLM.
- **s(koi_srad) edf=6.249 (p=0.49)**: The smooth term for stellar radius (`koi_srad`) is not significant. Even when allowing for a non-linear relationship, `koi_srad` does not significantly help predict `"false_positive"` status in this model.
- **s(koi_dor) edf=8.097 (\*\*\*)**: The planet-star distance over star radius (`koi_dor`) has a highly significant and very complex non-linear relationship (high `edf` of ~8.1). This indicates that specific ranges or patterns in `koi_dor` are associated with `"false_positive"` likelihood in a way that a straight line cannot capture.
- **s(koi_smet) edf=5.997 (\*\*\*)**: Stellar metallicity (`koi_smet`) has a highly significant and non-linear effect. The `edf` of ~6 suggests a complex curve. The GLM showed a linear negative effect (higher metallicity → less likely FP). The GAM suggests this relationship may not be consistently linear across all metallicity values; for instance, the benefit might plateau or change at very low/high metallicities.
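High `edf` values such as those above are only trustworthy if the basis dimension `k` of each smooth was large enough; `mgcv::gam.check()` provides an approximate test (a k-index near 1 with a non-significant p-value suggests the default `k` was adequate), along with residual diagnostics.
```{r}
# Basis-dimension checks and residual diagnostics for the fitted GAM
gam.check(gam_model)
```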
### Summary and Key Insights from the GAM
- **Non-Linearity Matters**: Several key predictors (`koi_teq`, `koi_slogg`, `koi_dor`, `koi_smet`) have significant non-linear relationships with the probability of a KOI being a `"false_positive"`. This means their influence is more complex than a simple positive or negative trend. Visualizing these smooth terms is crucial.
- **koi_ror is King (for FPs)**: The planet-to-star radius ratio (`koi_ror`) remains an overwhelmingly strong linear predictor, with a massive coefficient, strongly indicating that large `koi_ror` values are characteristic of `"false_positives"` (likely eclipsing binaries).
- **Reversal of koi_steff Effect**: The most notable change for linear terms is that `koi_steff` (stellar effective temperature) now increases the likelihood of a `"false_positive"`. This highlights how accounting for non-linearities in other variables can change the apparent effect of those kept linear. It suggests the relationship between `koi_steff` and disposition is context-dependent.
- **Consistent Linear Effects**: `koi_duration` and `koi_ldm_coeff1` continue to show a positive linear association with `"false_positive"` likelihood.
- **Persistent Non-Significance**: `koi_prad`, `koi_model_snr`, `koi_srho` (linear terms), and `koi_srad` (even as a smooth term) do not significantly contribute to the model.
- **Improved Model Fit**: The GAM explains more deviance (45.7%) and has a higher pseudo R-squared (0.539) compared to the previous GLMs, suggesting it captures the underlying data patterns more effectively by allowing for these non-linearities.
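The claim of improved fit can be made concrete with AIC, which penalises the extra effective degrees of freedom consumed by the smooths (a sketch, assuming `glm_original_model` is still in the workspace):
```{r}
# Lower AIC = better fit after accounting for model complexity
AIC(glm_original_model, gam_model)
```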
With those terms removed, the remaining splines are all significant, with the exception of `s(koi_srad)` noted above.
Plot of the splines effect:
```{r}
par(mfrow = c(2, 3))
for (i in 1:6) {
plot(gam_model, select = i, shade = TRUE, shade.col = "lightblue")
abline(h = 0, lty = "dashed")
}
```
Predict on the test set and evaluate.
```{r}
gam_probs <- predict(gam_model, newdata = test_data_scaled, type = "response")
gam_probs <- pmax(pmin(gam_probs, 1 - 1e-15), 1e-15)
gam_preds <- factor(ifelse(gam_probs > 0.5, positive_class, levels(test_target_factor)[1]), levels = levels(test_target_factor))
```
Confusion matrix
```{r}
cm_gam <- confusionMatrix(gam_preds, test_target_factor, positive = positive_class)
cm_gam
```
Plot confusion matrix
```{r}
cm_data <- as.data.frame(cm_gam$table)
# Create heatmap
ggplot(cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(colour = "white", linewidth = 0.5) + # Add lines between tiles
geom_text(aes(label = Freq), color = "white", size = 6) + # Adjust text size
# Use a color scheme less prone to issues for colorblindness if possible
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Frequency") +
theme_minimal(base_size = 12) + # Set base font size
labs(
title = "Confusion Matrix Heatmap", # Corrected title
x = "Actual Class",
y = "Predicted Class"
) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # Center title
axis.text.x = element_text(angle = 0, vjust = 1, hjust = 0.5), # Adjust text angle if needed
axis.title = element_text(size = 14),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
```
ROC curve
```{r}
roc_gam <- roc(response = test_target_factor, predictor = gam_probs, levels = levels(test_target_factor))
print(paste("GAM - AUC:", round(auc(roc_gam), 4)))
```
Plot ROC curve
```{r}
plot(roc_gam,
main = "ROC Curve - GAM",
col = "blue",
lwd = 2,
print.auc = TRUE,
print.auc.y = 0.75,
print.auc.x = 0.75
)
```
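To check whether the GAM's AUC improvement over the GLM is statistically meaningful rather than noise, pROC's `roc.test()` can compare the two curves; since both were computed on the same test cases, DeLong's paired test applies.
```{r}
# DeLong's test for the difference between two paired ROC curves
roc.test(roc_original, roc_gam, method = "delong")
```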
## Interaction terms
Let's try adding interaction terms to the original GLM model.
```{r}
interaction_subset <- c("koi_teq", "koi_slogg", "koi_ror", "koi_smet", "koi_impact")
# Ensure these exist in the data
interaction_subset <- interaction_subset[interaction_subset %in% names(train_data_scaled)]
# Get all other predictors to include as main effects
all_predictors <- setdiff(names(train_data_scaled), target_variable_name)
other_predictors <- setdiff(all_predictors, interaction_subset)
```
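Before building the formula, recall how R expands the `^2` operator: `(a + b + c)^2` produces all main effects plus every pairwise interaction. A tiny illustration with placeholder variable names:
```{r}
# "^2" expands to main effects plus all pairwise interactions:
# a, b, c, a:b, a:c, b:c
attr(terms(~ (a + b + c)^2), "term.labels")
```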
Prepare the formula for the GLM model with interaction terms.
```{r}
interaction_formula_str <- paste(
target_variable_name, "~",
paste0("(", paste(interaction_subset, collapse = " + "), ")^2"), # Pairwise interactions + main effects
"+",
paste(other_predictors, collapse = " + ")
) # Add main effects of others
interaction_formula <- as.formula(interaction_formula_str)
print(paste("Using interaction formula:", interaction_formula_str))
```
```
Fit the GLM model.
```{r}
glm_interaction_model <- glm(interaction_formula,
data = train_data_scaled,
family = binomial(link = "logit")
)
summary(glm_interaction_model)
```
From the summary, we notice that this specific model output is not interpretable due to severe numerical problems, likely stemming from (quasi-)complete separation caused by the complexity of the interaction terms.
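One quick way to confirm (quasi-)complete separation is to look for coefficients with implausibly large magnitudes paired with enormous standard errors; a sketch (the cutoff of 10 is arbitrary):
```{r}
# Separation typically shows up as huge estimates with huge standard errors
coefs <- summary(glm_interaction_model)$coefficients
coefs[abs(coefs[, "Estimate"]) > 10, c("Estimate", "Std. Error"), drop = FALSE]
```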
Predict on the test set and evaluate.
```{r}
interaction_probs <- predict(glm_interaction_model, newdata = test_data_scaled, type = "response")
interaction_probs <- pmax(pmin(interaction_probs, 1 - 1e-15), 1e-15)
interaction_preds <- factor(ifelse(interaction_probs > 0.5, positive_class, levels(test_target_factor)[1]), levels = levels(test_target_factor))
```
Confusion matrix
```{r}
cm_interaction <- confusionMatrix(interaction_preds, test_target_factor, positive = positive_class)
cm_interaction
```
Plot confusion matrix
```{r}
cm_data <- as.data.frame(cm_interaction$table)
# Create heatmap
ggplot(cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(colour = "white", linewidth = 0.5) + # Add lines between tiles
geom_text(aes(label = Freq), color = "white", size = 6) + # Adjust text size
# Use a color scheme less prone to issues for colorblindness if possible
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Frequency") +
theme_minimal(base_size = 12) + # Set base font size
labs(
title = "Confusion Matrix Heatmap", # Corrected title
x = "Actual Class",
y = "Predicted Class"
) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # Center title
axis.text.x = element_text(angle = 0, vjust = 1, hjust = 0.5), # Adjust text angle if needed
axis.title = element_text(size = 14),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
```
ROC curve
```{r}
roc_glm_interaction <- roc(response = test_target_factor, predictor = interaction_probs, levels = levels(test_target_factor))
print(paste("GLM with interaction - AUC:", round(auc(roc_glm_interaction), 4)))
```
Plot ROC curve
```{r}
plot(roc_glm_interaction,
main = "ROC Curve - GLM with interaction",
col = "blue",
lwd = 2,
print.auc = TRUE,
print.auc.y = 0.75,
print.auc.x = 0.75
)
```
# Lasso Regression
Prepare the dataset and formula.
```{r}
train_formula_lasso <- as.formula("~ .") # Include all predictors
x_train <- model.matrix(train_formula_lasso, data = train_data_scaled[, all_predictors])[, -1] # Predictor matrix, remove intercept
y_train <- train_data_scaled[[target_variable_name]] # Target factor
```
## Fit the model
```{r}
set.seed(42) # for reproducibility of CV folds
lasso <- glmnet(x_train, y_train, family = "binomial", alpha = 1)
cv_lasso_fit <- cv.glmnet(x_train, y_train,
family = "binomial",
alpha = 1,
type.measure = "auc", # Use AUC for CV evaluation
nfolds = 10
)
```
## Plot the results
Value of lambda selected via cross validation:
```{r}
plot(cv_lasso_fit)
```
```{r}
cv_lasso_fit$lambda.min
```
The selected lambda, marked on the lasso coefficient path plot:
```{r}
plot(lasso, xvar = "lambda")
abline(v = log(cv_lasso_fit$lambda.min), col = "red", lty = "dashed")
```
The LASSO coefficient estimates corresponding to the best lambda are:
```{r}
lasso_coef <- coef(lasso, s = cv_lasso_fit$lambda.min)
lasso_coef[lasso_coef[, 1] != 0, ]
```
## LASSO Model Interpretation
The LASSO (Least Absolute Shrinkage and Selection Operator) model performs feature selection by shrinking some coefficients to zero. The non-zero coefficients below are those selected by LASSO as important predictors, using the `lambda.min` value from cross-validation.
**Variables Associated with an INCREASED Likelihood of being a "FALSE POSITIVE" (Positive Coefficients):**
- **`(Intercept) 0.3658801`**: This is the baseline log-odds of a KOI being a "FALSE POSITIVE" when all other predictor variables in the model are zero.
- **`koi_ror (0.9173861)`**: A larger planet-to-star radius ratio. This is a very strong indicator. A very large `koi_ror` often points towards an eclipsing binary system rather than a planet.
- **`koi_teq (0.9106747)`**: Higher equilibrium temperature. Objects with extremely high calculated T_eq might be misclassified astrophysical phenomena.
- **`koi_depth (0.8213715)`**: Deeper transits. Very deep events might be more likely to be stellar eclipsing binaries.
- **`koi_duration (0.3712789)`**: Longer transit durations. Some types of false positives (e.g., grazing eclipsing binaries) can produce long-duration events.
- **`koi_impact (0.3371364)`**: Higher impact parameter. This means the transit is more grazing, which is common in false positive scenarios.
- **`koi_period (0.1994419)`**: Longer orbital periods. This suggests a slight increase in the likelihood of being a false positive.
- **`koi_model_snr (0.1879024)`**: Higher model signal-to-noise ratio. While counterintuitive, extremely high SNR events might sometimes be non-planetary or indicative of issues that lead to a false positive classification.
- **`koi_dor (0.1320083)`**: Larger planet-star distance over star radius. This can indicate a more grazing transit geometry.
**Variables Associated with a DECREASED Likelihood of being a "FALSE POSITIVE" (i.e., INCREASED Likelihood of being a "CANDIDATE") (Negative Coefficients):**
- **`koi_incl (-0.6193954)`**: Higher orbital inclination. An inclination closer to 90 degrees (edge-on transit) is required for a true planetary transit, making it less likely to be a false positive.
- **`koi_smet (-0.4631438)`**: Higher stellar metallicity. Stars with higher metallicity are known to be more likely to host planets, making an object more likely to be a genuine candidate.
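An alternative to `lambda.min` is `lambda.1se`, the largest lambda whose CV score lies within one standard error of the best; it usually yields a sparser model at a small cost in CV performance. A sketch comparing the number of selected predictors under the two choices:
```{r}
# lambda.1se trades a little CV performance for a sparser model
coef_min <- coef(cv_lasso_fit, s = "lambda.min")
coef_1se <- coef(cv_lasso_fit, s = "lambda.1se")
c(
  nonzero_min = sum(coef_min[-1, 1] != 0),
  nonzero_1se = sum(coef_1se[-1, 1] != 0)
)
```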
## Evaluate the model
```{r}
best_lambda <- cv_lasso_fit$lambda.min
x_test <- model.matrix(train_formula_lasso, data = test_data_scaled[, all_predictors])[, -1]
# Evaluate using the best lambda found by CV
lasso_probs <- predict(cv_lasso_fit, newx = x_test, s = best_lambda, type = "response")
lasso_probs <- as.vector(pmax(pmin(lasso_probs, 1 - 1e-15), 1e-15)) # Ensure stability
lasso_preds <- factor(ifelse(lasso_probs > 0.5, positive_class, levels(test_target_factor)[1]), levels = levels(test_target_factor))
```
Confusion matrix
```{r}
lasso_cm <- confusionMatrix(lasso_preds, test_target_factor, positive = positive_class)
lasso_cm
```
Plot confusion matrix
```{r}
cm_data <- as.data.frame(lasso_cm$table)
# Create heatmap
ggplot(cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(colour = "white", linewidth = 0.5) + # Add lines between tiles
geom_text(aes(label = Freq), color = "white", size = 6) + # Adjust text size
# Use a color scheme less prone to issues for colorblindness if possible
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Frequency") +