ehsanx
diff --git a/‎.Rhistory‎
Lines changed: 510 additions & 510 deletions b/‎.Rhistory‎
Lines changed: 510 additions & 510 deletions
diff --git a/‎_freeze/surveydata5/execute-results/html.json‎
Lines changed: 2 additions & 2 deletions b/‎_freeze/surveydata5/execute-results/html.json‎
Lines changed: 2 additions & 2 deletions
diff --git a/‎_freeze/surveydata9/execute-results/html.json‎
Lines changed: 2 additions & 2 deletions b/‎_freeze/surveydata9/execute-results/html.json‎
Lines changed: 2 additions & 2 deletions
diff --git a/‎_freeze/surveydata9c/execute-results/html.json‎
Lines changed: 19 additions & 0 deletions b/‎_freeze/surveydata9c/execute-results/html.json‎
Lines changed: 19 additions & 0 deletions
diff --git a/‎_freeze/surveydata9c/figure-html/unnamed-chunk-4-1.png‎
16.7 KB b/‎_freeze/surveydata9c/figure-html/unnamed-chunk-4-1.png‎
16.7 KB
diff --git a/‎_freeze/surveydata9c/figure-html/unnamed-chunk-5-1.png‎
24.9 KB b/‎_freeze/surveydata9c/figure-html/unnamed-chunk-5-1.png‎
24.9 KB
diff --git a/‎_freeze/surveydata9c/libs/bootstrap/bootstrap-bb462d781dde1847d9e3ccf7736099dd.min.css‎
Lines changed: 12 additions & 0 deletions b/‎_freeze/surveydata9c/libs/bootstrap/bootstrap-bb462d781dde1847d9e3ccf7736099dd.min.css‎
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,19 @@
+{
+  "hash": "501e2d3f0dee34a35f32d4a1035e93ec",
+  "result": {
+    "engine": "knitr",
+    "markdown": "## NHANES: Performance {.unnumbered}\n\nThis tutorial outlines how to evaluate the performance of a logistic regression model fitted to complex survey data using R, focusing on the NHANES dataset.\n\nWe will assess the model's predictive accuracy using a design-correct **Area Under the Curve (AUC)** and evaluate its fit with the **Archer-Lemeshow** test.\n\n------------------------------------------------------------------------\n\n## 1. Data Preparation ⚙️\n\nFirst, we need to load the necessary R packages and prepare the NHANES data. The code below loads data from the `Flegal2016.RData` file, recodes variables like age, race, smoking, and education, and creates the final survey design object (`analytic_design`) for our analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# --- Load Libraries ---\nlibrary(survey)\nlibrary(dplyr)\n# install.packages(\"devtools\")\n# devtools::install_github(\"ehsanx/svyTable1\", \n#                          build_vignettes = TRUE, \n#                          dependencies = TRUE)\nlibrary(svyTable1)\nlibrary(ggplot2)\n\n# --- Load Data ---\nload(\"Data/surveydata/Flegal2016.RData\") \n\n# --- Create Analytic Sample (N=5,455) ---\ndat.analytic <- subset(dat.full, RIDAGEYR >= 20)\ndat.analytic <- subset(dat.analytic, !is.na(BMXBMI))\ndat.analytic <- subset(dat.analytic, is.na(RIDEXPRG) | RIDEXPRG != \"Yes, positive lab pregnancy test\")\n\n# --- Recode Analysis Variables ---\ndat.full <- dat.full %>%\n  mutate(\n    Age = cut(RIDAGEYR,\n              breaks = c(20, 40, 60, Inf),\n              right = FALSE,\n              labels = c(\"20-39\", \"40-59\", \">=60\")),\n\n    race = case_when(\n      RIDRETH3 == \"Non-Hispanic White\" ~ \"Non-Hispanic white\",\n      RIDRETH3 == \"Non-Hispanic Black\" ~ \"Non-Hispanic black\",\n      RIDRETH3 == \"Non-Hispanic Asian\" ~ \"Non-Hispanic Asian\",\n      RIDRETH3 %in% c(\"Mexican American\", \"Other Hispanic\") ~ \"Hispanic\",\n      TRUE ~ \"Other\" # Consolidating \"Other Race\" for simplicity\n    ),\n    race = factor(race, levels = c(\"Non-Hispanic white\", \"Non-Hispanic black\", \"Non-Hispanic Asian\", \"Hispanic\", \"Other\")),\n\n    obese = factor(ifelse(BMXBMI >= 30, \"Yes\", \"No\"), levels = c(\"No\", \"Yes\")),\n\n    smoking = case_when(\n      SMQ020 == \"No\" ~ \"Never smoker\",\n      SMQ020 == \"Yes\" & SMQ040 == \"Not at all\" ~ \"Former smoker\",\n      SMQ020 == \"Yes\" ~ \"Current smoker\",\n      TRUE ~ NA_character_\n    ),\n    smoking = factor(smoking, levels = c(\"Never smoker\", \"Former smoker\", \"Current smoker\")),\n\n    education = case_when(\n        DMDEDUC2 %in% c(\"Less than 9th grade\", \"9-11th grade (Includes 12th grad)\") ~ \"<High school\",\n        DMDEDUC2 == \"High school graduate/GED or equi\" ~ \"High school\",\n        DMDEDUC2 %in% c(\"Some college or AA degree\", \"College graduate or above\") ~ \">High school\",\n        TRUE ~ NA_character_\n    ),\n    education = factor(education, levels = c(\"High school\", \"<High school\", \">High school\"))\n  )\n\n# --- Create Survey Design Object ---\noptions(survey.lonely.psu = \"adjust\")\nsvy.design.full <- svydesign(id = ~SDMVPSU,\n                             strata = ~SDMVSTRA,\n                             weights = ~WTINT2YR,\n                             nest = TRUE,\n                             data = dat.full)\nanalytic_design <- subset(svy.design.full, SEQN %in% dat.analytic$SEQN)\n```\n:::\n\n\n------------------------------------------------------------------------\n\n## 2. Fitting a Logistic Regression Model 📊\n\nNext, we fit a survey-weighted logistic regression model using `svyglm()`. Our goal is to predict obesity status based on **Age**, **race**, **smoking status**, and **education level**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the survey-weighted logistic regression model\nfit_obesity <- svyglm(\n  I(obese == \"Yes\") ~ Age + race + smoking + education,\n  design = analytic_design,\n  family = quasibinomial()\n)\n```\n:::\n\n\n------------------------------------------------------------------------\n\n## 3. Model Performance: ROC Curve and AUC 📈\n\nWe calculate the **Area Under the ROC Curve (AUC)** to measure predictive performance. The `svyAUC()` function from `svyTable1` computes a design-corrected AUC and confidence interval, which requires a replicate-weights survey design.\n\n### Create Replicate Design & Refit Model\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# svyAUC() requires a replicate-weights design object\nrep_design <- as.svrepdesign(analytic_design)\n# Restrict the replicate design to complete cases\nvars_for_model <- c(\"obese\", \"Age\", \"race\", \"smoking\", \"education\")\nrep_design_complete <- subset(rep_design, \n                              complete.cases(rep_design$variables[, vars_for_model]))\n\n# Refit model on the complete-case replicate design\nfit_obesity_rep <- svyglm(\n  I(obese == \"Yes\") ~ Age + race + smoking + education,\n  design = rep_design_complete,\n  family = quasibinomial()\n)\n```\n:::\n\n\n### Calculate AUC\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Calculate the design-correct AUC\nauc_results_list <- svyAUC(fit_obesity_rep, rep_design_complete, plot = TRUE)\n```\n\n::: {.cell-output-display}\n![](surveydata9c_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n\n```{.r .cell-code}\n\n# Display the AUC summary\nknitr::kable(auc_results_list$summary)\n```\n\n::: {.cell-output-display}\n\n\n|       AUC|        SE|  CI_Lower| CI_Upper|\n|---------:|---------:|---------:|--------:|\n| 0.5967987| 0.0108604| 0.5755123| 0.618085|\n\n\n:::\n:::\n\n\n**Interpretation:** An AUC of **0.5967987** indicates poor discrimination. While better than simpler models, it suggests that the predictors do not have strong predictive power for obesity in this dataset.\n\n### Visualize the ROC Curve\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(auc_results_list$roc_data, aes(x = FPR, y = TPR)) +\n  geom_line(color = \"blue\", linewidth = 1) +\n  geom_abline(linetype = \"dashed\") +\n  labs(\n    title = \"Survey-Weighted ROC Curve\",\n    subtitle = paste0(\"AUC = \", round(auc_results_list$summary$AUC, 3)),\n    x = \"1 - Specificity (FPR)\",\n    y = \"Sensitivity (TPR)\"\n  ) +\n  theme_minimal()\n```\n\n::: {.cell-output-display}\n![](surveydata9c_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\n------------------------------------------------------------------------\n\n## 4. Goodness-of-Fit: Archer and Lemeshow Test ✅\n\nWe assess model fit using the **Archer-Lemeshow test**. A non-significant p-value (p \\> 0.05) suggests a good model fit.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Run the goodness-of-fit test\ngof_results <- svygof(fit_obesity, analytic_design)\n\n# Display the results\nknitr::kable(gof_results)\n```\n\n::: {.cell-output-display}\n\n\n| F_statistic| df1| df2|   p_value|\n|-----------:|---:|---:|---------:|\n|    2.168827|   9|   6| 0.1791469|\n\n\n:::\n:::\n\n\n**Interpretation:** The p-value (**0.1791469**) is greater than 0.05, indicating no evidence of a poor fit. The model’s predictions align well with observed outcomes, suggesting good calibration.\n\n------------------------------------------------------------------------\n\n## Summary\n\n-   The AUC shows poor discriminatory power, meaning the model does not strongly distinguish between obese and non-obese individuals.\n-   The Archer-Lemeshow test indicates good model fit (p \\> 0.05).\n\n**Conclusion:** While well-calibrated, the model’s low predictive power implies other factors (e.g., cholesterol, blood pressure, diet) may better explain obesity risk.\n",
+    "supporting": [],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {
+      "include-in-header": [
+        "<link href=\"site_libs/pagedtable-1.1/css/pagedtable.css\" rel=\"stylesheet\" />\n<script src=\"site_libs/pagedtable-1.1/js/pagedtable.js\"></script>\n"
+      ]
+    },
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}