CLS-Data
diff --git a/.gitignore b/.gitignore
Lines changed: 3 additions & 2 deletions
diff --git a/docs/intro.md b/docs/intro.md
Lines changed: 2 additions & 2 deletions
diff --git a/docs/mcs-creating_identifiers.md b/docs/mcs-creating_identifiers.md
Lines changed: 0 additions & 8 deletions
diff --git a/docs/mcs-data_structures.md b/docs/mcs-data_structures.md
Lines changed: 73 additions & 63 deletions
diff --git a/docs/mcs-household_grid.md b/docs/mcs-household_grid.md
Lines changed: 103 additions & 81 deletions
@@ -20,5 +20,6 @@ vendor/
 # Ignore R environment
 .Renviron
 
-# Ignore Notes
-Notes/
+# Ignore Notes and Old Files
+Notes/
+old/
@@ -28,9 +28,9 @@ The code presented on this website will presume you have downloaded the data fro
 
 To make using the datasets easier, we provide code to reorganise the `.dta` files into a simple directory structure with a folder for each sweep. This code is described under each study section (e.g., `MCS -> Creating a Simple Folder Structure`). We will assume you have organised the files in this way in other code we present.
 
-We use the `tidyverse` (an `R` package) extensively in the code presented on this website. Many of the functions we use are repeated in multiple places, so we have provided [a short primer](https://cls-data.github.io/docs/r_primer.html) on the main functions we will use. If you are new to the language, this primer also contains links to more detailed resources for learning `R` and `tidyverse`. Even if you are experienced with `R`, as there may still be some material that is new.
+We use the `tidyverse` (an `R` package) extensively in the code presented on this website. If you are new to the `tidyverse`, we recommend Hadley Wickham and colleagues' book, R for Data Science, which is [available for free online](https://r4ds.had.co.nz/).
 
 # Code Sharing
-This website can obviously not provide all the code you may need to carry out the analyses you may want to with CLS data. We have therefore set up the [`#britishcohorts` hashtag on GitHub Gist](https://gist.github.com/search?q=%23britishcohorts) for people to share code snippets that are useful for CLS analyses. Please consider sharing your own code snipetts (for instance, code to derive a useful variable) on GitHub Gist adding the `#britishcohorts` hashtag and a study specific hashtag (`#mcs`, `#bcs70`, `#nextsteps`, `#ncds`) to the Gist description to make it findable. 
+This website cannot provide all the code you may need for the analyses you want to carry out with CLS data. We have therefore set up the [`#britishcohorts` hashtag on GitHub Gist](https://gist.github.com/search?q=%23britishcohorts) for people to share code snippets that are useful for CLS analyses. Please consider sharing your own code snippets (for instance, code to derive a useful variable) on GitHub Gist, adding the `#britishcohorts` hashtag and a study-specific hashtag (`#mcs`, `#bcs70`, `#nextsteps`, `#ncds`) to the Gist description to make it findable.
 
 Please also consider sharing the full code for papers you publish on the [Open Science Framework website (OSF)](https://osf.io) - it helps others reproduce your work and lowers the cost for others in learning CLS's data. You can add the tag 'british-cohorts' (plus a study-specific tag) to make your project findable.
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Working with the Household Grid
-nav_order: 4
+nav_order: 3
 parent: MCS
 format: docusaurus-md
 ---
@@ -11,12 +11,17 @@ format: docusaurus-md
 
 # Introduction
 
-In this tutorial, we will learn the basics of using the household grid.
-Specifically, we will see how to identify particular family members, how
-to use the household grid to create family-member specific variables,
-and how to determine the relationships between family members. We will
-use the example of finding natural mothers smoking status at the first
-sweep.
+In this section, we describe the basics of using the household grid.
+Specifically, we show how to use the household grid to:
+
+1.  Identify particular family members
+
+2.  Create family-member specific variables
+
+3.  Determine the relationships between non-cohort members within a
+    family.
+
+We use the following packages:
 
 ```r
 # Load Packages
@@ -26,26 +31,28 @@ library(haven) # For importing .dta files
 
 # Finding Mother of Cohort Members
 
-We will load just four variables from the household grid: `MCSID` and
-`APNUM00`, which uniquely identify an individual, and `AHPSEX00` and
-`AHCREL00`, which contain information on the individual’s sex and their
-relationship to the household’s cohort member(s). `AHCREL00 == 7`
-identifies natural parents and `AHPSEX00 == 2` identifies females.
-Combining the two identifies natural mothers. Below, we use `count()` to
-show the different (observed) values for the sex and relationship
-variables. We also use the `filter()` function (which retains
-observations where the conditions are `TRUE`) to create a dataset
-containing the identifiers (`MCSID` and `APNUM00` of natural mothers
-only; we will merge this will smoking information shortly.
-`add_count(MCSID) %>% filter(n == 1)` is included as an interim step to
-ensure there is just one natural mother per family.
+To show how to perform 1 & 2, we use the example of finding natural
+mothers’ smoking status at the first sweep. We load just four variables
+from the Sweep 1 household grid: `MCSID` and `APNUM00`, which together
+uniquely identify an individual, and `AHPSEX00` and `AHCREL00`, which
+contain information on the individual’s sex and their relationship to
+the household’s cohort member(s). `AHCREL00 == 7` identifies natural
+parents and `AHPSEX00 == 2` identifies females. Combining the two
+identifies natural mothers. Below, we use `count()` to show the
+different (observed) values for the sex and relationship variables. We
+also use the `filter()` function (which retains observations where the
+conditions are `TRUE`) to create a dataset containing the identifiers
+(`MCSID` and `APNUM00`) of natural mothers only; we will merge this with
+the smoking information shortly. `add_count(MCSID) %>% filter(n == 1)`
+is included as an interim step to ensure there is just one natural
+mother per family.[^1]
 
 ```r
 df_0y_hhgrid <- read_dta("0y/mcs1_hhgrid.dta") %>%
-  select(MCSID, APNUM00, AHPSEX00, AHCREL00)
+  select(MCSID, APNUM00, AHPSEX00, AHCREL00) # Retains the listed variables
 
 df_0y_hhgrid %>%
-  count(AHPSEX00)
+  count(AHPSEX00) # Tabulates each sex; AHPSEX00 does not record the sex of cohort members
 ```
 
 ``` text
@@ -60,7 +67,7 @@ df_0y_hhgrid %>%
 
 ```r
 df_0y_hhgrid %>%
-  count(AHCREL00)
+  count(AHCREL00) # Tabulates each relationship to a cohort member
 ```
 
 ``` text
@@ -87,32 +94,35 @@ df_0y_hhgrid %>%
 
 ```r
 df_0y_mothers <- df_0y_hhgrid %>%
-  filter(AHCREL00 == 7,
-         AHPSEX00 == 2) %>%
-  add_count(MCSID) %>%
-  filter(n == 1) %>%
-  select(MCSID, APNUM00)
+  filter(
+    AHCREL00 == 7, # Keep natural parents...
+    AHPSEX00 == 2 # ...who are female.
+  ) %>%
+  add_count(MCSID) %>% # Creates new variable (n) containing # of records with given MCSID
+  filter(n == 1) %>% # Keep where only one recorded natural mother per family
+  select(MCSID, APNUM00) # Keep identifier variables
 ```
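
The filtering steps above can be tried out on a small made-up household grid. This is a minimal sketch only: the `toy_hhgrid` data and its family IDs are hypothetical, the variables are plain numerics rather than the labelled variables `read_dta()` returns, and only the codes stated above (`AHCREL00 == 7` for natural parents, `AHPSEX00 == 2` for females) are taken from the text.

```r
library(dplyr)

# Hypothetical toy household grid (made-up values, not MCS data)
toy_hhgrid <- tibble(
  MCSID    = c("F1", "F1", "F2", "F2", "F3"),
  APNUM00  = c(1, 2, 1, 2, 1),
  AHPSEX00 = c(2, 1, 2, 2, 1), # 2 = female, as in the text above
  AHCREL00 = c(7, 7, 7, 7, 7)  # 7 = natural parent, as in the text above
)

toy_mothers <- toy_hhgrid %>%
  filter(AHCREL00 == 7, AHPSEX00 == 2) %>% # Female natural parents
  add_count(MCSID) %>%                     # n = number of such records per family
  filter(n == 1) %>%                       # Drop families with more than one
  select(MCSID, APNUM00)

toy_mothers
```

In this toy data, family `F1` keeps its single recorded natural mother, `F2` is dropped because two records meet the conditions, and `F3` is dropped because none do.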
 
 Note, where a cohort member is part of a family (`MCSID`) with two or
 more cohort members, the cohort member will have been a multiple birth
 (i.e., twin or triplet), so familial relationships should apply to all
 cohort members in the family, which is why there is just one
 relationship (`[A-G]HCREL00`) variable per household grid file. This
-will change as the cohort members age, move into separate residences and
-start their own families.
+will change as the cohort members age, moving into separate residences
+and starting their own families.
 
 # Creating a Mother’s Smoking Variable
 
 Now we have a dataset containing the IDs of natural mothers, we can load
-the smoking information from the Sweep 1 parent interview file. The
-smoking variable used is called `APSMUS0A` which contains information on
-the tobacco products a parent uses. We classify a parent as a smoker if
-they use any tobacco product (`mutate(smoker = case_when(...))`).
+the smoking information from the Sweep 1 parent interview file
+(`mcs1_parent_interview.dta`). The smoking variable we use is called
+`APSMUS0A` and contains information on the tobacco product (if any) a
+parent consumes. We classify a parent as a smoker if they use any
+tobacco product (`mutate(parent_smoker = case_when(...))`).
 
 ```r
 df_0y_parent <- read_dta("0y/mcs1_parent_interview.dta") %>%
-  select(MCSID, APNUM00, APSMUS0A)
+  select(MCSID, APNUM00, APSMUS0A) # Retains only the variables we need
 
 df_0y_parent %>%
   count(APSMUS0A)
@@ -135,18 +145,18 @@ df_0y_parent %>%
 
 ```r
 df_0y_smoking <- df_0y_parent %>%
-  mutate(smoker = case_when(APSMUS0A %in% 2:95 ~ 1,
-                            APSMUS0A == 1 ~ 0)) %>%
-  select(MCSID, APNUM00, smoker)
+  mutate(parent_smoker = case_when(APSMUS0A %in% 2:95 ~ 1, # If APSMUS0A is integer between 2 and 95, then 1
+                                   APSMUS0A == 1 ~ 0)) %>% # If APSMUS0A is 1, then 0
+  select(MCSID, APNUM00, parent_smoker)
 ```
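
One property of `case_when()` worth keeping in mind is that values matching none of the conditions are set to `NA`. The sketch below illustrates this with a hypothetical `toy_parent` data frame; the value `-1` is an assumed stand-in for a negative missing-value code, not a documented `APSMUS0A` category.

```r
library(dplyr)

# Hypothetical values (not MCS data); -1 is an assumed missing-value code
toy_parent <- tibble(APSMUS0A = c(1, 2, 95, -1, NA))

toy_smoking <- toy_parent %>%
  mutate(parent_smoker = case_when(
    APSMUS0A %in% 2:95 ~ 1, # Any value from 2 to 95 is recoded to 1
    APSMUS0A == 1 ~ 0       # Value 1 is recoded to 0
  ))                        # Anything else (here -1 and NA) becomes NA

toy_smoking
```

Because the negative code falls through both conditions, it ends up as `NA` in `parent_smoker` without needing a separate recoding step.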
 
 Now we can merge the two datasets together to ensure we only keep rows
 in `df_0y_smoking` that appear in `df_0y_mothers`. We use `left_join()`
 to do this, with `df_0y_mothers` as the dataset determining the
-outputted rows, so that we have one row per identified mother. The
+outputted rows, so that we have one row per identified mother.[^2] The
 result is a dataset with one row per family with an identified mother.
-We rename the `smoker` variable to `mother_smoker` to clarify that it
-refers to the mother’s smoking status.
+We rename the `parent_smoker` variable to `mother_smoker` to clarify
+that it refers to the mother’s smoking status.
 
 Below we also pipe this dataset into the `tabyl()` function (from
 `janitor`) to tabulate the number and proportions of mothers who smoke
@@ -155,23 +165,9 @@ and those who do not.
 ```r
 # install.packages("janitor") # Uncomment if you need to install
 library(janitor)
-```
-
-``` text
-
-Attaching package: 'janitor'
-```
-
-``` text
-The following objects are masked from 'package:stats':
-
-    chisq.test, fisher.test
-```
-
-```r
 df_0y_mothers %>%
   left_join(df_0y_smoking, by = c("MCSID", "APNUM00")) %>%
-  select(MCSID, mother_smoker = smoker) %>%
+  select(MCSID, mother_smoker = parent_smoker) %>%
   tabyl(mother_smoker)
 ```
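
To see what the join does when a mother has no matching interview record, the same steps can be run on made-up data. This is an illustrative sketch: both toy data frames and their family IDs are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (not MCS data)
toy_mothers <- tibble(
  MCSID   = c("F1", "F2"),
  APNUM00 = c(1, 1)
)

toy_smoking <- tibble(
  MCSID         = c("F1", "F1", "F3"),
  APNUM00       = c(1, 2, 1),
  parent_smoker = c(1, 0, 1)
)

toy_joined <- toy_mothers %>%
  left_join(toy_smoking, by = c("MCSID", "APNUM00")) %>% # Keeps every row of toy_mothers
  select(MCSID, mother_smoker = parent_smoker)

toy_joined
```

The mother in `F1` picks up her own smoking value, the mother in `F2` has no interview row and so gets `NA`, and rows in `toy_smoking` that do not belong to an identified mother are not carried over.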
 
@@ -185,22 +181,25 @@ df_0y_mothers %>%
 # Determining Relationships between Non-Cohort Members
 
 The household grids include another set of relationship variables
-(`[A-G]HPREL[A-Z]0`). These can be used to identify the relationships
-between family members. These variables record the person in the row’s
-(ego) relationship to the person denoted by the column (alt); the
-penultimate letter `[A-Z]` in `[A-G]HPREL[A-Z]0` corresponds to the
-person’s `PNUM00`. For instance, the variable `AHPRELB0` would denote
-the relationship of the person in the row to the person with
-`APNUM00 == 2`. We will extract a small set of data from the Sweep 1
-household grid to show this in action.
+besides `[A-G]HCREL00`. These vary in name slightly between sweeps:
+`[A-D]HPREL[A-Z]0` in `mcs[1-4]_hhgrid.dta`, `EPREL0[A-Z]00` in
+`mcs5_hhgrid.dta`, and `[F-G]HPREL0[A-Z]` in `mcs[6-7]_hhgrid.dta`.
+These variables can be used to identify the relationships between
+non-cohort member family members. Specifically, they record the person
+in the row’s (ego) relationship to the person denoted by the column
+(alt); the letter `[A-Z]` in the variable name corresponds to the alt’s
+`[A-D]PNUM00`. For instance, the variable `AHPRELB0` denotes the
+relationship of the person in the row to the person in the same family
+with `APNUM00 == 2`. Below, we extract a small set of data from the
+Sweep 1 household grid to show this in action.
 
 ```r
 df_0y_hhgrid_prel <- read_dta("0y/mcs1_hhgrid.dta") %>%
   select(MCSID, APNUM00, matches("AHPREL[A-Z]0"))
 
 df_0y_hhgrid_prel %>%
   filter(MCSID == "M10001N") %>% # To look at just one family
-  select(APNUM00, AHPRELA0, AHPRELB0, AHPRELC0)
+  select(APNUM00, AHPRELA0, AHPRELB0, AHPRELC0) # To look at first few relationship variables
 ```
 
 ``` text
@@ -220,24 +219,26 @@ There are seven members in this family, one of whom is a cohort member
 (`APNUM00 == 100`). `APNUM00`’s 1 and 2 are the (natural) parents, and
 `APNUM00`’s 3-6 and 100 are the (natural) children. The relationship
 variables show that `APNUM00`’s 1 and 2 are married, and `APNUM00`’s 3-7
-are siblings. Note, the symmetry in the relationships. Where,
-`APNUM00 == 1`, `AHPRELC0 == 7 [Natural Parent]` and where
-`APNUM00 == 3`, `AHPRELA0 == 3 [Natural Child]`.
-
-If we want to find the particular person occupying a particular
-relationship for an individual (e.g., we want to know the `PNUM00` of
-the person’s partner), we need to reshape the data into long-format with
-one row per ego-alt relationship within a family. For instance, if we
-want to find each person’s spouse (conditional on one being present), we
-can do the following:
+are siblings (`AHPRELC0 == 11 [Natural brother/sister]`) and biological
+offspring of `APNUM00`’s 1 and 2
+(`AHPREL[A-B]0 == 3 [Natural son/daughter]`). Note the symmetry in the
+relationships: where `APNUM00 == 1`, `AHPRELC0 == 7 [Natural Parent]`,
+and where `APNUM00 == 3`, `AHPRELA0 == 3 [Natural son/daughter]`.
+
+If we want to find the particular person occupying a specific
+relationship for an individual (e.g., we want to know the `[A-G]PNUM00`
+of the person’s partner), we need to reshape the data into long-format
+with one row per ego-alt relationship within a family. For instance, if
+we want to find each person’s spouse (conditional on one being present),
+we can do the following:[^3]
 
 ```r
 df_0y_hhgrid_prel %>%
   pivot_longer(cols = matches("AHPREL[A-Z]0"),
                names_to = "alt",
                values_to = "relationship") %>%
-  mutate(APNUM00_alt = match(str_sub(alt, 7, 7), LETTERS)) %>%
-  filter(relationship == 1) %>%
+  mutate(APNUM00_alt = match(str_sub(alt, -2, -2), LETTERS)) %>% # Creates alt's PNUM00 by matching penultimate letter to position in alphabet
+  filter(relationship == 1) %>% # Keep where husband or wife
   select(MCSID, APNUM00, partner_pnum = APNUM00_alt)
 ```
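
The reshaping logic can be checked on a made-up three-person family. This is a sketch under stated assumptions: `toy_prel` is hypothetical, each person's relationship to themselves is simply left as `NA` here (which may differ from the real coding), and the codes 1, 3 and 7 follow the meanings given in the text above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy family (not MCS data): persons 1 and 2 are married,
# person 3 is their natural child. Self-relationships are left as NA.
toy_prel <- tibble(
  MCSID    = c("F1", "F1", "F1"),
  APNUM00  = c(1, 2, 3),
  AHPRELA0 = c(NA, 1, 3), # Relationship of row person to person 1
  AHPRELB0 = c(1, NA, 3), # Relationship of row person to person 2
  AHPRELC0 = c(7, 7, NA)  # Relationship of row person to person 3
)

toy_partners <- toy_prel %>%
  pivot_longer(cols = matches("AHPREL[A-Z]0"),
               names_to = "alt",
               values_to = "relationship") %>%
  mutate(APNUM00_alt = match(str_sub(alt, -2, -2), LETTERS)) %>% # A -> 1, B -> 2, ...
  filter(relationship == 1) %>% # Keep husband/wife pairs (drops NA rows too)
  select(MCSID, APNUM00, partner_pnum = APNUM00_alt)

toy_partners
```

Each spouse appears once as ego, so the married couple yields two rows, one pointing in each direction.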
 
@@ -261,7 +262,28 @@ df_0y_hhgrid_prel %>%
 # Coda
 
 This only scratches the surface of what can be achieved with the
-household grid. The `mcs[1-7]_hhgrid.dta` also contain information on
-cohort-member and family-member’s dates of birth, which can be used to,
-for example, identify the number of resident younger siblings, determine
-maternal and paternal age at birth, and so on.
+household grid. The `mcs[1-7]_hhgrid.dta` files also contain information
+on cohort members' and family members' dates of birth, which can be used
+to, for example, identify the number of resident younger siblings,
+determine maternal and paternal age at birth, and so on.
+
+[^1]: Loading the `.dta` files into `R` with `haven::read_dta()` retains
+    the dataset metadata, including variable names and labels, mainly by
+    storing variables as `labelled` class objects. See the [`labelled`
+    package help
+    files](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html)
+    for more information on working with this metadata - for instance,
+    converting `labelled` variables to standard `R` factor variables or
+    replacing negative values (generally reserved in MCS data to
+    indicate missingness) with `R`’s native `NA` value.
+
+[^2]: `left_join()` takes two data frames and keeps every row of the
+    first, whether or not it has a match in the second (unmatched rows
+    get `NA` in the joined columns). See [*Combining Data Across
+    Sweeps*](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html)
+    for more discussion of the `*_join()` functions.
+
+[^3]: For more on reshaping data, see
+    [*Reshaping Data from Long to Wide
+    (or Wide to
+    Long)*](https://cls-data.github.io/docs/mcs-reshape_long_wide.html).