
Commit 0320858

Proof reading: logistic regression
1 parent 1a218c8 commit 0320858

3 files changed: 16 additions, 16 deletions


individual_modules/regression_analysis_with_R/logistic_regression.ipynb

Lines changed: 14 additions & 14 deletions
@@ -93,7 +93,7 @@
"\n",
"$$\\sigma(t) = \\frac{e^t}{e^t+1}$$\n",
"\n",
- "We can visualise the transform between the right hand side and the left hand side of the equation in the graph below"
+ "We can visualise the transform between the right-hand side and the left-hand side of the equation in the graph below"
]
},
{
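For readers working through this cell, a minimal R sketch of how the transform could be visualised; this is illustrative only, and the notebook's own plotting code below the cell remains the reference (the axis labels and range here are assumptions):

```r
# Illustrative sketch: visualise the logistic transform sigma(t) = e^t / (e^t + 1)
t <- seq(-6, 6, by = 0.1)        # unbounded values on the linear-predictor scale
sigma_t <- exp(t) / (exp(t) + 1) # transformed values, bounded between 0 and 1

plot(t, sigma_t, type = "l",
     xlab = "t (unbounded linear predictor)",
     ylab = expression(sigma(t)),
     main = "Logistic transform")
abline(h = c(0, 1), lty = 2)     # the transform never reaches 0 or 1
```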
@@ -135,9 +135,9 @@
"\n",
"$$ln(odds) = ln(\\frac{p}{(1-p)}) = \\beta_0 + \\beta_1*x$$\n",
"\n",
- "We are no longer in the class of linear regression, we are in a more general class of generalised linear models. These permit a more varied number of regression models with different types of outcomes.They use a link function to transform from the unbounded prediction on the right hand side to the properties of the outcome variable on the left hand side. For logistic regression the link function is the logistic function. This means we also need a new R function, `glm()` to fit them. \n",
+ "We are no longer in the class of linear regression; we are in the more general class of generalised linear models. These permit a more varied number of regression models with different types of outcomes. They use a link function to transform from the unbounded prediction on the right-hand side to the properties of the outcome variable on the left-hand side. For logistic regression, the link function is the logistic function. This means we also need a new R function, `glm()`, to fit them. \n",
"\n",
- "Let's look at an example we are going to predict Type II diabetes status from bmi."
+ "Let's look at an example. We are going to predict Type II diabetes status from BMI."
]
},
{
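A minimal sketch of the `glm()` call this cell introduces; the data frame name `diabetes` and the column names `t2d_status` and `bmi` are assumptions, so substitute the names used in the course dataset:

```r
# Sketch: fit the logistic regression described above with glm().
# Assumed names: data frame `diabetes` with columns `t2d_status` (0/1 or factor)
# and `bmi` -- substitute the names used in the course dataset.
model.log <- glm(t2d_status ~ bmi,
                 data   = diabetes,
                 family = binomial(link = "logit"))

summary(model.log)  # coefficients are reported on the log odds scale
```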
@@ -225,7 +225,7 @@
"id": "537e3273-ce41-415d-96cc-641afa3adedf",
"metadata": {},
"source": [
- "We can see that the odds ratios is `r signif(exp(coef(model.log))[\"bmi\"], 3)` which can be reported as for a 1 unit increase of BMI an individual is `r signif(exp(coef(model.log))[\"bmi\"], 3)` times more likely to develop Type 2 Diabetes. This effect is not very big!\n",
+ "We can see that the odds ratio is `r signif(exp(coef(model.log))[\"bmi\"], 3)`, which can be reported as: for a 1 unit increase in BMI, an individual is `r signif(exp(coef(model.log))[\"bmi\"], 3)` times more likely to develop Type II Diabetes. This effect is not very big!\n",
"\n",
"Significance testing is conceptually the same as for linear regression, whereby each regression coefficient (i.e. log odds ratio) is tested to see if it is non-zero. It differs, though, in how it is calculated. As we are no longer able to derive an exact solution, we have to use an iterative method to find the best estimates. This means you sometimes might get warnings that your model failed to converge. A failure to converge means that the algorithm was not able to settle on an appropriate solution for the best regression coefficients and the result should be treated with caution. Typically, this is due to not enough data, trying to fit too many predictor variables simultaneously, or a poor choice of model between X and Y. \n",
"\n",
@@ -303,15 +303,15 @@
"id": "f3267dd2-dd18-4e27-bef5-928fc8ee4a8c",
"metadata": {},
"source": [
- "We can use the confidence interval to determine whether our estimated regression coefficient has a non-zero effect, by whether it contains the null value. For example at a significance level of $\\alpha = 0.05$, if the estimated coefficient is significantly non zero (i.e. $p-value < 0.05$) then the 100(1-$\\alpha$) = 95% confidence interval will not contain 0. The null value for the log(OR) is 0, and the null value for the OR is 1. Therefore, if we don't convert our confidence interval to the correct units we may draw the wrong conclusion. \n",
+ "We can use the confidence interval to determine whether our estimated regression coefficient has a non-zero effect, by whether it contains the null value. For example, at a significance level of $\\alpha = 0.05$, if the estimated coefficient is significantly non-zero (i.e. $p-value < 0.05$) then the 100(1-$\\alpha$) = 95% confidence interval will not contain 0. The null value for the log(OR) is 0, and the null value for the OR is 1. Therefore, if we don't convert our confidence interval to the correct units, we may draw the wrong conclusion. \n",
"\n",
"Logistic regression is all about appropriately handling the non-continuous outcome variable. The predictor variables can be as complex as your dataset can handle and include categorical variables etc. in the same way as we described for linear regression. \n",
"\n",
"Let's practise some examples:\n",
"\n",
"## Logistic Regression with Multiple Predictor Variables Exercise\n",
"\n",
- "*Fit a logistic regression model to test for an association between age and type 2 diabetes status*\n"
+ "*Fit a logistic regression model to test for an association between age and type II diabetes status*\n"
]
},
{
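To make the unit conversion concrete, a minimal sketch using the notebook's `model.log` object (`confint()` on a glm uses profile likelihood in a standard R session and may print a "profiling" message):

```r
# Confidence intervals are returned on the log(OR) scale, where the null value is 0.
confint(model.log)

# Exponentiate to get intervals on the OR scale, where the null value is 1.
exp(confint(model.log))
```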
@@ -1365,7 +1365,7 @@
"id": "d896fb6c-f2ad-4152-9244-a3f51c941649",
"metadata": {},
"source": [
- "*Fit a logistic regression model to test for an association between age, alcohol units and exercise time and type 2 diabetes status*"
+ "*Fit a logistic regression model to test for an association between age, alcohol units and exercise time and type II diabetes status*"
]
},
{
@@ -2440,7 +2440,7 @@
"id": "58f0efbf-af31-4ffe-8945-92f41ced9afb",
"metadata": {},
"source": [
- "*Fit a logistic regression model to test for an association between socioeconomic status and type 2 diabetes status, controlling for age and bmi.*"
+ "*Fit a logistic regression model to test for an association between socioeconomic status and type II diabetes status, controlling for age and BMI.*"
]
},
{
@@ -3512,13 +3512,13 @@
"source": [
"\n",
"\n",
- "You may have noticed in the last example above that while the ANOVA for the ethnicity variable was significant neither of the two dummy variables were significantly associated at P < 0.05. The `ethnicityEuropean` showed a trend for significance with P~0.05. This happens sometimes becuase when you use an ANOVA to test for the joint effect of both dummy variables, you are using a 2 degree of freedom test (see third column in the ANOVA output), while in the tests for the individual coefficients you are using a 1 degree of freedom test. Mathematically the threshold for a two degree of freedom test is slightly lower to be significant. You could think of this as rather than needing a really strong effect in one variable, a small effect, but in both variables would be meaningful. In reality these results are not contradicting each other, its just a chance thing related to the fact that we have used a hard threshold to determine significance. Where you only just have enough statitstica power to detect an effect, it is chance whether it falls just above the trheshold or just below. \n",
+ "You may have noticed in the last example above that while the ANOVA for the ethnicity variable was significant, neither of the two dummy variables was significantly associated at P < 0.05. The `ethnicityEuropean` variable showed a trend for significance with P~0.05. This happens sometimes because when you use an ANOVA to test for the joint effect of both dummy variables, you are using a 2 degree of freedom test (see the third column in the ANOVA output), while in the tests for the individual coefficients you are using a 1 degree of freedom test. Mathematically, the threshold to be significant is slightly lower for a two degree of freedom test. You could think of this as: rather than needing a really strong effect in one variable, a small effect in both variables would be meaningful. In reality, these results are not contradicting each other; it's just a chance thing related to the fact that we have used a hard threshold to determine significance. Where you only just have enough statistical power to detect an effect, it is chance whether it falls just above the threshold or just below. \n",
"\n",
"## Predictions with the Logistic Regression Model\n",
"\n",
- "We are going to make some predictions from a logiistic regression model to show how the model goes from a weighted sum of prediction variables to the binary outcome variable. \n",
+ "We are going to make some predictions from a logistic regression model to show how the model goes from a weighted sum of predictor variables to the binary outcome variable. \n",
"\n",
- "Let's revist the example of prediction type 2 diabetes as a function of alcohol units and exercise hours. First we need to fit the model.\n"
+ "Let's revisit the example of predicting type II diabetes as a function of alcohol units and exercise hours. First, we need to fit the model.\n"
]
},
{
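A hypothetical sketch of the 1 df versus 2 df comparison described above; the model and variable names (`diabetes`, `t2d_status`, `age`, a three-level `ethnicity` factor) are assumptions standing in for the notebook's own objects:

```r
# Hypothetical sketch of the joint vs individual tests discussed above.
model.eth <- glm(t2d_status ~ age + ethnicity,
                 data = diabetes, family = binomial)

summary(model.eth)$coefficients      # a separate 1 df Wald test per dummy variable
anova(model.eth, test = "Chisq")     # a single joint test for ethnicity; the Df
                                     # column shows 2 degrees of freedom for the factor
```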
@@ -3550,7 +3550,7 @@
"id": "dc5114f8-dd4a-47c8-be6a-1324e72d79ad",
"metadata": {},
"source": [
- "Using our estimated regression coefficients we can write our fitted regression model as"
+ "Using our estimated regression coefficients, we can write our fitted regression model as `r logEq`."
]
},
{
@@ -3570,9 +3570,9 @@
"id": "b91a9b68-eeca-4a3b-850d-7f428418b019",
"metadata": {},
"source": [
- "`r logEq`.\n",
"\n",
- "Let's say we have a new observation we want to make a prediction for, we know that they exercise for on average 4 hours a week and consume 10 units of alcohol per week. We can input these values into our equation to estimate the log odds of the this individual having type 2 diabetes. \n"
+ "\n",
+ "Let's say we have a new observation we want to make a prediction for: we know that they exercise for on average 4 hours a week and consume 10 units of alcohol per week. We can input these values into our equation to estimate the log odds of this individual having type II diabetes. \n"
]
},
{
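A minimal sketch of how that prediction could be computed in R; the data frame name `diabetes` and the column names `alcohol_units` and `exercise_hours` are assumptions, so substitute the names used in the notebook's data:

```r
# Sketch: prediction for a new observation (4 hours of exercise, 10 alcohol units per week).
model.pred <- glm(t2d_status ~ alcohol_units + exercise_hours,
                  data = diabetes, family = binomial)

new_obs <- data.frame(alcohol_units = 10, exercise_hours = 4)

predict(model.pred, newdata = new_obs, type = "link")      # log odds (the weighted sum)
predict(model.pred, newdata = new_obs, type = "response")  # probability, via the logistic transform
```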

individual_modules/regression_analysis_with_R/questions/logistic_regression_3.json

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
]
},
{
- "question": "Considering the estimated regression coefficients for the two binary variabels for ethnicty, which of the following statements are correct? Ignore the P-values we are just interested in interpeting the regression coefficients. Select all that apply.",
+ "question": "Considering the estimated regression coefficients for the two binary variables for ethnicity, which of the following statements are correct? Ignore the P-values; we are just interested in interpreting the regression coefficients. Select all that apply.",
"type": "many_choice",
"answers": [
{

individual_modules/regression_analysis_with_R/questions/summary_logistic_regression.json

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
"type": "many_choice",
"answers": [
{
- "answer": "TO predict a continuous outcome",
+ "answer": "To predict a continuous outcome",
"correct": false,
"feedback": "Incorrect"
},
