
Commit 0004e52

committed
edited mcs household grid
1 parent 9b6b3db commit 0004e52

12 files changed, with 489 additions and 779 deletions

.gitignore

Lines changed: 3 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -20,5 +20,6 @@ vendor/
2020
# Ignore R environment
2121
.Renviron
2222

23-
# Ignore Notes
24-
Notes/
23+
# Ignore Notes and Old Files
24+
Notes/
25+
old/

docs/intro.md

Lines changed: 2 additions & 2 deletions
@@ -28,9 +28,9 @@ The code presented on this website will presume you have downloaded the data fro
2828

2929
To make using the datasets easier, we provide code to reorganise the `.dta` files into a simple directory structure with a folder for each sweep. This code is described under each study section (e.g., `MCS -> Creating a Simple Folder Structure`). We will assume you have organised the files in this way in other code we present.
3030

31-
We use the `tidyverse` (an `R` package) extensively in the code presented on this website. Many of the functions we use are repeated in multiple places, so we have provided [a short primer](https://cls-data.github.io/docs/r_primer.html) on the main functions we will use. If you are new to the language, this primer also contains links to more detailed resources for learning `R` and `tidyverse`. Even if you are experienced with `R`, as there may still be some material that is new.
31+
We use the `tidyverse` (an `R` package) extensively in the code presented on this website. If you are new to the `tidyverse`, we recommend Hadley Wickham and colleagues' book, R for Data Science, which is [available for free online](https://r4ds.had.co.nz/).
3232

3333
# Code Sharing
34-
This website can obviously not provide all the code you may need to carry out the analyses you may want to with CLS data. We have therefore set up the [`#britishcohorts` hashtag on GitHub Gist](https://gist.github.com/search?q=%23britishcohorts) for people to share code snippets that are useful for CLS analyses. Please consider sharing your own code snipetts (for instance, code to derive a useful variable) on GitHub Gist adding the `#britishcohorts` hashtag and a study specific hashtag (`#mcs`, `#bcs70`, `#nextsteps`, `#ncds`) to the Gist description to make it findable.
34+
This website cannot provide all the code you may need for the analyses you want to carry out with CLS data. We have therefore set up the [`#britishcohorts` hashtag on GitHub Gist](https://gist.github.com/search?q=%23britishcohorts) for people to share code snippets that are useful for CLS analyses. Please consider sharing your own code snippets (for instance, code to derive a useful variable) on GitHub Gist, adding the `#britishcohorts` hashtag and a study-specific hashtag (`#mcs`, `#bcs70`, `#nextsteps`, `#ncds`) to the Gist description to make it findable.
3535

3636
Please also consider sharing the full code for papers you publish on the [Open Science Framework website (OSF)](https://osf.io) - it helps others reproduce your work and lowers the cost for others in learning CLS's data. You can add the tag 'british-cohorts' (plus a study-specific tag) to make your project findable.

docs/mcs-creating_identifiers.md

Lines changed: 0 additions & 8 deletions
This file was deleted.

docs/mcs-data_structures.md

Lines changed: 73 additions & 63 deletions
Large diffs are not rendered by default.

docs/mcs-household_grid.md

Lines changed: 103 additions & 81 deletions
@@ -1,7 +1,7 @@
11
---
22
layout: default
33
title: Working with the Household Grid
4-
nav_order: 4
4+
nav_order: 3
55
parent: MCS
66
format: docusaurus-md
77
---
@@ -11,12 +11,17 @@ format: docusaurus-md
1111

1212
# Introduction
1313

14-
In this tutorial, we will learn the basics of using the household grid.
15-
Specifically, we will see how to identify particular family members, how
16-
to use the household grid to create family-member specific variables,
17-
and how to determine the relationships between family members. We will
18-
use the example of finding natural mothers smoking status at the first
19-
sweep.
14+
In this section, we describe the basics of using the household grid.
15+
Specifically, we show how to use the household grid to:
16+
17+
1. Identify particular family members
18+
19+
2. Create family-member specific variables
20+
21+
3. Determine the relationships between non-cohort members within a
22+
family.
23+
24+
We use the following packages:
2025

2126
```r
2227
# Load Packages
@@ -26,26 +31,28 @@ library(haven) # For importing .dta files
2631

2732
# Finding Mother of Cohort Members
2833

29-
We will load just four variables from the household grid: `MCSID` and
30-
`APNUM00`, which uniquely identify an individual, and `AHPSEX00` and
31-
`AHCREL00`, which contain information on the individual’s sex and their
32-
relationship to the household’s cohort member(s). `AHCREL00 == 7`
33-
identifies natural parents and `AHPSEX00 == 2` identifies females.
34-
Combining the two identifies natural mothers. Below, we use `count()` to
35-
show the different (observed) values for the sex and relationship
36-
variables. We also use the `filter()` function (which retains
37-
observations where the conditions are `TRUE`) to create a dataset
38-
containing the identifiers (`MCSID` and `APNUM00` of natural mothers
39-
only; we will merge this will smoking information shortly.
40-
`add_count(MCSID) %>% filter(n == 1)` is included as an interim step to
41-
ensure there is just one natural mother per family.
34+
To show how to perform 1 & 2, we use the example of finding natural
35+
mothers’ smoking status at the first sweep. We load just four variables
36+
from the Sweep 1 household grid: `MCSID` and `APNUM00`, which together
37+
uniquely identify an individual, and `AHPSEX00` and `AHCREL00`, which
38+
contain information on the individual’s sex and their relationship to
39+
the household’s cohort member(s). `AHCREL00 == 7` identifies natural
40+
parents and `AHPSEX00 == 2` identifies females. Combining the two
41+
identifies natural mothers. Below, we use `count()` to show the
42+
different (observed) values for the sex and relationship variables. We
43+
also use the `filter()` function (which retains observations where the
44+
conditions are `TRUE`) to create a dataset containing the identifiers
45+
(`MCSID` and `APNUM00`) of natural mothers only; we will merge this with
46+
the smoking information shortly. `add_count(MCSID) %>% filter(n == 1)`
47+
is included as an interim step to ensure there is just one natural
48+
mother per family.[^1]
4249

4350
```r
4451
df_0y_hhgrid <- read_dta("0y/mcs1_hhgrid.dta") %>%
45-
select(MCSID, APNUM00, AHPSEX00, AHCREL00)
52+
select(MCSID, APNUM00, AHPSEX00, AHCREL00) # Retains the listed variables
4653

4754
df_0y_hhgrid %>%
48-
count(AHPSEX00)
55+
count(AHPSEX00) # Tabulates each sex; AHPSEX00 does not record the sex of cohort members
4956
```
5057

5158
``` text
@@ -60,7 +67,7 @@ df_0y_hhgrid %>%
6067

6168
```r
6269
df_0y_hhgrid %>%
63-
count(AHCREL00)
70+
count(AHCREL00) # Tabulates each relationship to a cohort member
6471
```
6572

6673
``` text
@@ -87,32 +94,35 @@ df_0y_hhgrid %>%
8794

8895
```r
8996
df_0y_mothers <- df_0y_hhgrid %>%
90-
filter(AHCREL00 == 7,
91-
AHPSEX00 == 2) %>%
92-
add_count(MCSID) %>%
93-
filter(n == 1) %>%
94-
select(MCSID, APNUM00)
97+
filter(
98+
AHCREL00 == 7, # Keep natural parents...
99+
AHPSEX00 == 2 # ...who are female.
100+
) %>%
101+
add_count(MCSID) %>% # Creates new variable (n) containing # of records with given MCSID
102+
filter(n == 1) %>% # Keep where only one recorded natural mother per family
103+
select(MCSID, APNUM00) # Keep identifier variables
95104
```
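
The effect of the `add_count(MCSID) %>% filter(n == 1)` step can be seen on a small made-up grid. The sketch below is illustrative only: `toy_grid` and its family identifiers (`"F1"` to `"F3"`) are hypothetical, not MCS data, and it assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy grid (not MCS data): family "F2" has two records
# coded as female natural parents, so it should be dropped.
toy_grid <- tibble(
  MCSID    = c("F1", "F1", "F2", "F2", "F3"),
  APNUM00  = c(1, 2, 1, 2, 1),
  AHPSEX00 = c(2, 1, 2, 2, 2),
  AHCREL00 = c(7, 7, 7, 7, 7)
)

toy_mothers <- toy_grid %>%
  filter(AHCREL00 == 7, AHPSEX00 == 2) %>% # Keep female natural parents
  add_count(MCSID) %>%                     # n = number of such records per family
  filter(n == 1) %>%                       # Drop families with more than one
  select(MCSID, APNUM00)

toy_mothers # Families "F1" and "F3" remain; "F2" is excluded
```

Families with more than one candidate mother are removed entirely rather than resolved, which is a simplification you may want to revisit in your own analyses.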
96105

97106
Note, where a cohort member is part of a family (`MCSID`) with two or
98107
more cohort members, the cohort member will have been a multiple birth
99108
(i.e., twin or triplet), so familial relationships should apply to all
100109
cohort members in the family, which is why there is just one
101110
relationship (`[A-G]HCREL00`) variable per household grid file. This
102-
will change as the cohort members age, move into separate residences and
103-
start their own families.
111+
will change as the cohort members age, moving into separate residences
112+
and starting their own families.
104113

105114
# Creating a Mother’s Smoking Variable
106115

107116
Now we have a dataset containing the IDs of natural mothers, we can load
108-
the smoking information from the Sweep 1 parent interview file. The
109-
smoking variable used is called `APSMUS0A` which contains information on
110-
the tobacco products a parent uses. We classify a parent as a smoker if
111-
they use any tobacco product (`mutate(smoker = case_when(...))`).
117+
the smoking information from the Sweep 1 parent interview file
118+
(`mcs1_parent_interview.dta`). The smoking variable we use is called
119+
`APSMUS0A` and contains information on the tobacco product (if any) a
120+
parent consumes. We classify a parent as a smoker if they use any
121+
tobacco product (`mutate(parent_smoker = case_when(...))`).
112122

113123
```r
114124
df_0y_parent <- read_dta("0y/mcs1_parent_interview.dta") %>%
115-
select(MCSID, APNUM00, APSMUS0A)
125+
select(MCSID, APNUM00, APSMUS0A) # Retains only the variables we need
116126

117127
df_0y_parent %>%
118128
count(APSMUS0A)
@@ -135,18 +145,18 @@ df_0y_parent %>%
135145

136146
```r
137147
df_0y_smoking <- df_0y_parent %>%
138-
mutate(smoker = case_when(APSMUS0A %in% 2:95 ~ 1,
139-
APSMUS0A == 1 ~ 0)) %>%
140-
select(MCSID, APNUM00, smoker)
148+
mutate(parent_smoker = case_when(APSMUS0A %in% 2:95 ~ 1, # If APSMUS0A is integer between 2 and 95, then 1
149+
APSMUS0A == 1 ~ 0)) %>% # If APSMUS0A is 1, then 0
150+
select(MCSID, APNUM00, parent_smoker)
141151
```
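
One consequence of this recode is worth spelling out: `case_when()` returns `NA` for any value that matches none of its conditions, so codes outside `1` and `2:95` (for example, negative values or values that are already missing) become `NA` in `parent_smoker`. A minimal sketch on made-up values (the `-1` is a hypothetical negative code chosen for illustration, not a documented MCS value):

```r
library(dplyr)

# Hypothetical values: 1 is treated as not smoking and 2:95 as smoking,
# following the recode above; -1 and NA stand in for missing codes.
toy_parent <- tibble(APSMUS0A = c(1, 2, 95, -1, NA))

toy_smoking <- toy_parent %>%
  mutate(parent_smoker = case_when(APSMUS0A %in% 2:95 ~ 1,
                                   APSMUS0A == 1 ~ 0))

toy_smoking$parent_smoker # Unmatched values (-1 and NA) are returned as NA
```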
142152

143153
Now we can merge the two datasets together to ensure we only keep rows
144154
in `df_0y_smoking` that appear in `df_0y_mothers`. We use `left_join()`
145155
to do this, with `df_0y_mothers` as the dataset determining the
146-
outputted rows, so that we have one row per identified mother. The
156+
outputted rows, so that we have one row per identified mother.[^2] The
147157
result is a dataset with one row per family with an identified mother.
148-
We rename the `smoker` variable to `mother_smoker` to clarify that it
149-
refers to the mother’s smoking status.
158+
We rename the `parent_smoker` variable to `mother_smoker` to clarify
159+
that it refers to the mother’s smoking status.
150160

151161
Below we also pipe this dataset into the `tabyl()` function (from
152162
`janitor`) to tabulate the number and proportions of mothers who smoke
@@ -155,23 +165,9 @@ and those who do not.
155165
```r
156166
# install.packages("janitor") # Uncomment if you need to install
157167
library(janitor)
158-
```
159-
160-
``` text
161-
162-
Attaching package: 'janitor'
163-
```
164-
165-
``` text
166-
The following objects are masked from 'package:stats':
167-
168-
chisq.test, fisher.test
169-
```
170-
171-
```r
172168
df_0y_mothers %>%
173169
left_join(df_0y_smoking, by = c("MCSID", "APNUM00")) %>%
174-
select(MCSID, mother_smoker = smoker) %>%
170+
select(MCSID, mother_smoker = parent_smoker) %>%
175171
tabyl(mother_smoker)
176172
```
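
To see why `df_0y_mothers` is placed on the left of the join, the following sketch repeats the merge on made-up data. All values are hypothetical: family `"F3"` has an identified mother with no parent interview record, and family `"F4"` has an interview record but no identified mother.

```r
library(dplyr)

# Hypothetical toy data (not MCS data)
toy_mothers <- tibble(
  MCSID   = c("F1", "F2", "F3"),
  APNUM00 = c(1, 1, 2)
)

toy_smoking <- tibble(
  MCSID         = c("F1", "F2", "F2", "F4"),
  APNUM00       = c(1, 1, 2, 1),
  parent_smoker = c(0, 1, 0, 1)
)

toy_joined <- toy_mothers %>%
  left_join(toy_smoking, by = c("MCSID", "APNUM00")) %>% # Keeps every row of toy_mothers
  select(MCSID, mother_smoker = parent_smoker)

toy_joined # One row per identified mother; "F3" has mother_smoker = NA
```

The unmatched mother is kept with a missing smoking status, while the partner record in `"F2"` and the whole of `"F4"` are left out.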
177173

@@ -185,22 +181,25 @@ df_0y_mothers %>%
185181
# Determining Relationships between Non-Cohort Members
186182

187183
The household grids include another set of relationship variables
188-
(`[A-G]HPREL[A-Z]0`). These can be used to identify the relationships
189-
between family members. These variables record the person in the row’s
190-
(ego) relationship to the person denoted by the column (alt); the
191-
penultimate letter `[A-Z]` in `[A-G]HPREL[A-Z]0` corresponds to the
192-
person’s `PNUM00`. For instance, the variable `AHPRELB0` would denote
193-
the relationship of the person in the row to the person with
194-
`APNUM00 == 2`. We will extract a small set of data from the Sweep 1
195-
household grid to show this in action.
184+
besides `[A-G]HCREL00`. These vary in name slightly between sweeps:
185+
`[A-D]HPREL[A-Z]0` in `mcs[1-4]_hhgrid.dta`, `EPREL0[A-Z]00` in
186+
`mcs5_hhgrid.dta`, and `[F-G]HPREL0[A-Z]` in `mcs[6-7]_hhgrid.dta`.
187+
These variables can be used to identify the relationships between
188+
non-cohort-member family members. Specifically, they record the person
189+
in the row’s (ego) relationship to the person denoted by the column
190+
(alt); the letter `[A-Z]` in the variable name corresponds to the alt’s
191+
`[A-D]PNUM00`. For instance, the variable `AHPRELB0` denotes the
192+
relationship of the person in the row to the person in the same family
193+
with `APNUM00 == 2`. Below, we extract a small set of data from the
194+
Sweep 1 household grid to show this in action.
196195

197196
```r
198197
df_0y_hhgrid_prel <- read_dta("0y/mcs1_hhgrid.dta") %>%
199198
select(MCSID, APNUM00, matches("AHPREL[A-Z]0"))
200199

201200
df_0y_hhgrid_prel %>%
202201
filter(MCSID == "M10001N") %>% # To look at just one family
203-
select(APNUM00, AHPRELA0, AHPRELB0, AHPRELC0)
202+
select(APNUM00, AHPRELA0, AHPRELB0, AHPRELC0) # To look at first few relationship variables
204203
```
205204

206205
``` text
@@ -220,24 +219,26 @@ There are seven members in this family, one of whom is a cohort member
220219
(`APNUM00 == 100`). `APNUM00`’s 1 and 2 are the (natural) parents, and
221220
`APNUM00`’s 3-6 and 100 are the (natural) children. The relationship
222221
variables show that `APNUM00`’s 1 and 2 are married, and `APNUM00`’s 3-7
223-
are siblings. Note, the symmetry in the relationships. Where,
224-
`APNUM00 == 1`, `AHPRELC0 == 7 [Natural Parent]` and where
225-
`APNUM00 == 3`, `AHPRELA0 == 3 [Natural Child]`.
226-
227-
If we want to find the particular person occupying a particular
228-
relationship for an individual (e.g., we want to know the `PNUM00` of
229-
the person’s partner), we need to reshape the data into long-format with
230-
one row per ego-alt relationship within a family. For instance, if we
231-
want to find each person’s spouse (conditional on one being present), we
232-
can do the following:
222+
are siblings (`AHPRELC0 == 11 [Natural brother/sister]`) and biological
223+
offspring of `APNUM00`’s 1 and 2
224+
(`AHPREL[A-B]0 == 3 [Natural son/daughter]`). Note the symmetry in the
225+
relationships. Where `APNUM00 == 1`, `AHPRELC0 == 7 [Natural Parent]`
226+
and where `APNUM00 == 3`, `AHPRELA0 == 3 [Natural son/daughter]`.
227+
228+
If we want to find the particular person occupying a specific
229+
relationship for an individual (e.g., we want to know the `[A-G]PNUM00`
230+
of the person’s partner), we need to reshape the data into long-format
231+
with one row per ego-alt relationship within a family. For instance, if
232+
we want to find each person’s spouse (conditional on one being present),
233+
we can do the following:[^3]
233234

234235
```r
235236
df_0y_hhgrid_prel %>%
236237
pivot_longer(cols = matches("AHPREL[A-Z]0"),
237238
names_to = "alt",
238239
values_to = "relationship") %>%
239-
mutate(APNUM00_alt = match(str_sub(alt, 7, 7), LETTERS)) %>%
240-
filter(relationship == 1) %>%
240+
mutate(APNUM00_alt = match(str_sub(alt, -2, -2), LETTERS)) %>% # Creates alt's PNUM00 by matching penultimate letter to position in alphabet
241+
filter(relationship == 1) %>% # Keep where husband or wife
241242
select(MCSID, APNUM00, partner_pnum = APNUM00_alt)
242243
```
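
The reshaping logic may be easier to follow on a two-person example. The sketch below uses a hypothetical family (`"F1"`) in which persons 1 and 2 are recorded as husband and wife (code `1`, as in the filter above); the self-relationship cells are set to `NA` purely for illustration, which is an assumption rather than how the MCS files code them.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical two-person family (not MCS data)
toy_prel <- tibble(
  MCSID    = c("F1", "F1"),
  APNUM00  = c(1, 2),
  AHPRELA0 = c(NA, 1), # Relationship of each row's person to person 1
  AHPRELB0 = c(1, NA)  # Relationship of each row's person to person 2
)

toy_partners <- toy_prel %>%
  pivot_longer(cols = matches("AHPREL[A-Z]0"),
               names_to = "alt",
               values_to = "relationship") %>%                    # One row per ego-alt pair
  mutate(APNUM00_alt = match(str_sub(alt, -2, -2), LETTERS)) %>%  # "A" -> 1, "B" -> 2
  filter(relationship == 1) %>%                                   # Keep husband/wife pairs
  select(MCSID, APNUM00, partner_pnum = APNUM00_alt)

toy_partners # Person 1's partner is person 2, and vice versa
```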
243244

@@ -261,7 +262,28 @@ df_0y_hhgrid_prel %>%
261262
# Coda
262263

263264
This only scratches the surface of what can be achieved with the
264-
household grid. The `mcs[1-7]_hhgrid.dta` also contain information on
265-
cohort-member and family-member’s dates of birth, which can be used to,
266-
for example, identify the number of resident younger siblings, determine
267-
maternal and paternal age at birth, and so on.
265+
household grid. The `mcs[1-7]_hhgrid.dta` files also contain information
266+
on cohort members’ and family members’ dates of birth, which can be used
267+
to, for example, identify the number of resident younger siblings,
268+
determine maternal and paternal age at birth, and so on.
269+
270+
[^1]: Loading the `.dta` files into `R` with `haven::read_dta()` retains
271+
the dataset metadata, including variable names and labels, mainly by
272+
storing variables as `labelled` class objects. See the [`labelled`
273+
package help
274+
files](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html)
275+
for more information on working with this metadata - for instance,
276+
converting `labelled` variables to standard `R` factor variables or
277+
replacing negative values (generally reserved in MCS data to
278+
indicate missingness) with `R`’s native `NA` value.
279+
280+
[^2]: `left_join()` takes two data frames as arguments and keeps every
281+
row of the first data frame, regardless of whether there is a
282+
match with the second. See [*Combining Data Across
283+
Sweeps*](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html)
284+
for more discussion of the `*_join()` functions.
285+
286+
[^3]: For more on reshaping data, see [*Reshaping Data from Long to Wide
287+
(or Wide to
288+
Long)*](https://cls-data.github.io/docs/mcs-reshape_long_wide.html).
