Commit 4649fbe

added mcs-household_grid v1
1 parent efd05e9 commit 4649fbe

9 files changed

+516
-54
lines changed

.gitignore

Lines changed: 1 addition & 2 deletions
@@ -18,5 +18,4 @@ vendor/
18 18
/.quarto/
19 19

20 20
# Ignore R environment
21-
.Renviron
22-
scripts
21+
.Renviron

docs/mcs-household_grid.md

Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
1+
---
2+
layout: default
3+
title: Working with the Household Grid
4+
nav_order: 4
5+
parent: MCS
6+
format: docusaurus-md
7+
---
8+
9+
10+
11+
12+
# Introduction
13+
14+
In this tutorial, we will learn the basics of using the household grid.
15+
Specifically, we will see how to identify particular family members, how
16+
to use the household grid to create family-member specific variables,
17+
and how to determine the relationships between family members. We will
18+
use the example of finding natural mothers’ smoking status at the first
19+
sweep.
20+
21+
```r
22+
# Load Packages
23+
library(tidyverse) # For data manipulation
24+
library(haven) # For importing .dta files
25+
```
26+
27+
# Finding Mother of Cohort Members
28+
29+
We will load just four variables from the household grid: `MCSID` and
30+
`APNUM00`, which uniquely identify an individual, and `AHPSEX00` and
31+
`AHCREL00`, which contain information on the individual’s sex and their
32+
relationship to the household’s cohort member(s). `AHCREL00 == 7`
33+
identifies natural parents and `AHPSEX00 == 2` identifies females.
34+
Combining the two identifies natural mothers. Below, we use `count()` to
35+
show the different (observed) values for the sex and relationship
36+
variables. We also use the `filter()` function (which retains
37+
observations where the conditions are `TRUE`) to create a dataset
38+
containing the identifiers (`MCSID` and `APNUM00`) of natural mothers
39+
only; we will merge this with smoking information shortly.
40+
`add_count(MCSID) %>% filter(n == 1)` is included as an interim step to
41+
ensure there is just one natural mother per family.
42+
43+
```r
44+
df_0y_hhgrid <- read_dta("0y/mcs1_hhgrid.dta") %>%
45+
select(MCSID, APNUM00, AHPSEX00, AHCREL00)
46+
47+
df_0y_hhgrid %>%
48+
count(AHPSEX00)
49+
```
50+
51+
``` text
52+
# A tibble: 4 × 2
53+
AHPSEX00 n
54+
<dbl+lbl> <int>
55+
1 -2 [Unknown] 55
56+
2 -1 [Not applicable] 18734
57+
3 1 [Male] 26438
58+
4 2 [Female] 29567
59+
```
60+
61+
```r
62+
df_0y_hhgrid %>%
63+
count(AHCREL00)
64+
```
65+
66+
``` text
67+
# A tibble: 16 × 2
68+
AHCREL00 n
69+
<dbl+lbl> <int>
70+
1 -9 [Refusal] 5
71+
2 -8 [Dont Know] 1
72+
3 7 [Natural parent] 33812
73+
4 8 [Adoptive parent] 2
74+
5 9 [Foster parent] 3
75+
6 10 [Step-parent/partner of parent] 50
76+
7 11 [Natural brother/Natural sister] 13873
77+
8 12 [Half-brother/Half-sister] 3486
78+
9 13 [Step-brother/Step-sister] 16
79+
10 14 [Adopted brother/Adopted sister] 8
80+
11 15 [Foster brother/Foster sister] 9
81+
12 17 [Grandparent] 2164
82+
13 18 [Nanny/au pair] 20
83+
14 19 [Other relative] 2326
84+
15 20 [Other non-relative] 233
85+
16 96 [Self] 18786
86+
```
87+
88+
```r
89+
df_0y_mothers <- df_0y_hhgrid %>%
90+
filter(AHCREL00 == 7,
91+
AHPSEX00 == 2) %>%
92+
add_count(MCSID) %>%
93+
filter(n == 1) %>%
94+
select(MCSID, APNUM00)
95+
```
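
To see what the `add_count(MCSID) %>% filter(n == 1)` step does, here is a minimal sketch on made-up data (the `toy_mothers` object and its IDs are invented for illustration and do not come from the MCS files). Families with more than one row are dropped entirely, rather than one row being picked arbitrarily.

```r
library(tidyverse)

# Invented example: family X2 has two people flagged as natural mother
toy_mothers <- tibble(
  MCSID   = c("X1", "X2", "X2", "X3"),
  APNUM00 = c(1, 1, 2, 1)
)

toy_unique <- toy_mothers %>%
  add_count(MCSID) %>%   # adds column n = number of rows per MCSID
  filter(n == 1) %>%     # keeps only families with exactly one row
  select(-n)

toy_unique
```

Only the rows for `X1` and `X3` remain. Whether dropping ambiguous families is appropriate depends on the analysis, so it may be worth checking how many families are removed before relying on this step.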
96+
97+
Note, where a cohort member is part of a family (`MCSID`) with two or
98+
more cohort members, the cohort member will have been a multiple birth
99+
(i.e., twin or triplet), so familial relationships should apply to all
100+
cohort members in the family, which is why there is just one
101+
relationship (`[A-G]HCREL00`) variable per household grid file. This
102+
will change as the cohort members age, move into separate residences and
103+
start their own families.
104+
105+
# Creating a Mother’s Smoking Variable
106+
107+
Now that we have a dataset containing the IDs of natural mothers, we can load
108+
the smoking information from the Sweep 1 parent interview file. The
109+
smoking variable used is called `APSMUS0A`, which contains information on
110+
the tobacco products a parent uses. We classify a parent as a smoker if
111+
they use any tobacco product (`mutate(smoker = case_when(...))`).
112+
113+
```r
114+
df_0y_parent <- read_dta("0y/mcs1_parent_interview.dta") %>%
115+
select(MCSID, APNUM00, APSMUS0A)
116+
117+
df_0y_parent %>%
118+
count(APSMUS0A)
119+
```
120+
121+
``` text
122+
# A tibble: 9 × 2
123+
APSMUS0A n
124+
<dbl+lbl> <int>
125+
1 -9 [Refusal] 4
126+
2 -8 [Don't Know] 3
127+
3 -1 [Not applicable] 10
128+
4 1 [No, does not smoke] 21229
129+
5 2 [Yes, cigarettes] 9003
130+
6 3 [Yes, roll-ups] 1246
131+
7 4 [Yes, cigars] 217
132+
8 5 [Yes, a pipe] 6
133+
9 95 [Yes, other tobacco product] 16
134+
```
135+
136+
```r
137+
df_0y_smoking <- df_0y_parent %>%
138+
mutate(smoker = case_when(APSMUS0A %in% 2:95 ~ 1,
139+
APSMUS0A == 1 ~ 0)) %>%
140+
select(MCSID, APNUM00, smoker)
141+
```
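
One thing worth noting about the `case_when()` call is that any value matching neither condition becomes `NA`. A small sketch on a plain numeric vector (the `toy_codes` values are chosen by us to mirror the codes tabulated above) shows that the negative missing-value codes are therefore treated as missing rather than as non-smokers.

```r
library(tidyverse)

# Invented vector using the same codes as APSMUS0A
toy_codes <- c(-9, -1, 1, 2, 95)

toy_smoker <- case_when(toy_codes %in% 2:95 ~ 1,  # any tobacco product
                        toy_codes == 1 ~ 0)       # does not smoke

toy_smoker
# → NA NA  0  1  1
```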
142+
143+
Now we can merge the two datasets together to ensure we only keep rows
144+
in `df_0y_smoking` that appear in `df_0y_mothers`. We use `left_join()`
145+
to do this, with `df_0y_mothers` as the dataset determining the
146+
outputted rows, so that we have one row per identified mother. The
147+
result is a dataset with one row per family with an identified mother.
148+
We rename the `smoker` variable to `mother_smoker` to clarify that it
149+
refers to the mother’s smoking status.
150+
151+
Below we also pipe this dataset into the `tabyl()` function (from
152+
`janitor`) to tabulate the number and proportions of mothers who smoke
153+
and those who do not.
154+
155+
```r
156+
# install.packages("janitor") # Uncomment if you need to install
157+
library(janitor)
158+
```
159+
160+
``` text
161+
162+
Attaching package: 'janitor'
163+
```
164+
165+
``` text
166+
The following objects are masked from 'package:stats':
167+
168+
chisq.test, fisher.test
169+
```
170+
171+
```r
172+
df_0y_mothers %>%
173+
left_join(df_0y_smoking, by = c("MCSID", "APNUM00")) %>%
174+
select(MCSID, mother_smoker = smoker) %>%
175+
tabyl(mother_smoker)
176+
```
177+
178+
``` text
179+
mother_smoker n percent valid_percent
180+
0 12883 0.695814205 0.6968304
181+
1 5605 0.302727518 0.3031696
182+
NA 27 0.001458277 NA
183+
```
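
The `NA` row in the table can arise in two ways: the mother has no row in the smoking data at all, or she has a row but no valid answer. The sketch below, on invented data (`toy_mothers` and `toy_smoking` are hypothetical stand-ins for `df_0y_mothers` and `df_0y_smoking`), shows one way to separate the two cases with `anti_join()` and `inner_join()`.

```r
library(tidyverse)

# Invented stand-ins for the mother IDs and the smoking data
toy_mothers <- tibble(MCSID = c("X1", "X2", "X3"),
                      APNUM00 = c(1, 1, 2))
toy_smoking <- tibble(MCSID = c("X1", "X2"),
                      APNUM00 = c(1, 1),
                      smoker = c(0, NA))

# Mothers with no matching row in the smoking data
no_interview <- toy_mothers %>%
  anti_join(toy_smoking, by = c("MCSID", "APNUM00"))

# Mothers with a matching row but a missing smoking value
no_answer <- toy_mothers %>%
  inner_join(toy_smoking, by = c("MCSID", "APNUM00")) %>%
  filter(is.na(smoker))
```

Here `no_interview` contains family `X3` and `no_answer` contains family `X2`.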
184+
185+
# Determining Relationships between Non-Cohort Members
186+
187+
The household grids include another set of relationship variables
188+
(`[A-G]HPREL[A-Z]0`). These can be used to identify the relationships
189+
between family members. These variables record the person in the row’s
190+
(ego) relationship to the person denoted by the column (alt); the
191+
penultimate letter `[A-Z]` in `[A-G]HPREL[A-Z]0` corresponds to the
192+
person’s `PNUM00`. For instance, the variable `AHPRELB0` would denote
193+
the relationship of the person in the row to the person with
194+
`APNUM00 == 2`. We will extract a small set of data from the Sweep 1
195+
household grid to show this in action.
196+
197+
```r
198+
df_0y_hhgrid_prel <- read_dta("0y/mcs1_hhgrid.dta") %>%
199+
select(MCSID, APNUM00, matches("AHPREL[A-Z]0"))
200+
201+
df_0y_hhgrid_prel %>%
202+
select(MCSID, APNUM00, AHPRELA0, AHPRELB0, AHPRELC0, AHPRELD0) %>%
203+
filter(MCSID == "M10001N") # To look at just one family
204+
```
205+
206+
``` text
207+
# A tibble: 7 × 6
208+
MCSID APNUM00 AHPRELA0 AHPRELB0 AHPRELC0 AHPRELD0
209+
<chr> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+lb>
210+
1 M10001N 1 96 [Self] 1 [Husband/Wife] 7 [Nat… 7 [Nat…
211+
2 M10001N 2 1 [Husband/Wife] 96 [Self] 7 [Nat… 7 [Nat…
212+
3 M10001N 3 3 [Natural son/daughter] 3 [Natural son/d… 96 [Sel… 11 [Nat…
213+
4 M10001N 4 3 [Natural son/daughter] 3 [Natural son/d… 11 [Nat… 96 [Sel…
214+
5 M10001N 5 3 [Natural son/daughter] 3 [Natural son/d… 11 [Nat… 11 [Nat…
215+
6 M10001N 6 3 [Natural son/daughter] 3 [Natural son/d… 11 [Nat… 11 [Nat…
216+
7 M10001N 100 3 [Natural son/daughter] 3 [Natural son/d… 11 [Nat… 11 [Nat…
217+
```
218+
219+
There are seven members in this family, one of whom is a cohort member
220+
(`APNUM00 == 100`). `APNUM00`’s 1 and 2 are the (natural) parents, and
221+
`APNUM00`’s 3-6 and 100 are the (natural) children. The relationship
222+
variables show that `APNUM00`’s 1 and 2 are married, and `APNUM00`’s 3-6 and 100
223+
are siblings. Note the symmetry in the relationships: where
224+
`APNUM00 == 1`, `AHPRELC0 == 7 [Natural Parent]` and where
225+
`APNUM00 == 3`, `AHPRELA0 == 3 [Natural Child]`.
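
The symmetry can also be inspected programmatically. The following is a rough sketch on an invented three-person family (the `toy_prel` data and the `ego_to_alt`, `alt_to_ego` and `alt_pnum` names are ours, not MCS variables): we reshape to one row per ego-alt pair, convert the penultimate letter of the variable name to a person number, and join each pair to its mirror image.

```r
library(tidyverse)

# Invented family: persons 1 and 2 are spouses (code 1), person 3 is their
# natural child (code 3 towards the parents, code 7 from the parents)
toy_prel <- tibble(
  MCSID    = "X1",
  APNUM00  = c(1, 2, 3),
  AHPRELA0 = c(96, 1, 3),
  AHPRELB0 = c(1, 96, 3),
  AHPRELC0 = c(7, 7, 96)
)

toy_long <- toy_prel %>%
  pivot_longer(cols = matches("AHPREL[A-Z]0"),
               names_to = "alt_var",
               values_to = "ego_to_alt") %>%
  # Letter A -> 1, B -> 2, ...; stored as double to match APNUM00
  mutate(alt_pnum = as.numeric(match(str_sub(alt_var, 7, 7), LETTERS))) %>%
  select(-alt_var)

# Pair each ego-alt row with the corresponding alt-ego row
toy_pairs <- toy_long %>%
  filter(APNUM00 != alt_pnum) %>%
  inner_join(toy_long %>% rename(alt_to_ego = ego_to_alt),
             by = c("MCSID", "APNUM00" = "alt_pnum", "alt_pnum" = "APNUM00"))

toy_pairs
```

Each row of `toy_pairs` shows a relationship and its reverse side by side, for example code 7 from person 1 to person 3 alongside code 3 from person 3 to person 1.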
226+
227+
If we want to find the particular person occupying a particular
228+
relationship for an individual (e.g., we want to know the `PNUM00` of
229+
the person’s partner), we need to reshape the data into long-format with
230+
one row per ego-alt relationship within a family. For instance, if we
231+
want to find each person’s spouse (conditional on one being present), we
232+
can do the following:
233+
234+
```r
235+
df_0y_hhgrid_prel %>%
236+
pivot_longer(cols = matches("AHPREL[A-Z]0"),
237+
names_to = "alt",
238+
values_to = "relationship") %>%
239+
mutate(APNUM00_alt = match(str_sub(alt, 7, 7), LETTERS)) %>%
240+
filter(relationship == 1) %>%
241+
select(MCSID, APNUM00, spouse_pnum = APNUM00_alt)
242+
```
243+
244+
``` text
245+
# A tibble: 23,616 × 3
246+
MCSID APNUM00 spouse_pnum
247+
<chr> <dbl> <int>
248+
1 M10001N 1 2
249+
2 M10001N 2 1
250+
3 M10002P 1 2
251+
4 M10002P 2 1
252+
5 M10007U 1 2
253+
6 M10007U 2 1
254+
7 M10011Q 1 2
255+
8 M10011Q 2 1
256+
9 M10015U 1 2
257+
10 M10015U 2 1
258+
# ℹ 23,606 more rows
259+
```
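
The same reshaping works for other relationship codes. As a sketch, the hypothetical helper `find_alts()` below (our own name, not part of any package) wraps the steps above so the code can be swapped; here we use code 3 (`[Natural son/daughter]`) on an invented family to list each child's natural parents.

```r
library(tidyverse)

# Invented family: persons 1 and 2 are spouses, person 3 is their child
toy_prel <- tibble(
  MCSID    = "X1",
  APNUM00  = c(1, 2, 3),
  AHPRELA0 = c(96, 1, 3),
  AHPRELB0 = c(1, 96, 3),
  AHPRELC0 = c(7, 7, 96)
)

# Hypothetical helper: for each person (ego), return the person numbers of
# everyone (alt) to whom ego has the relationship given by `code`
find_alts <- function(df, code) {
  df %>%
    pivot_longer(cols = matches("AHPREL[A-Z]0"),
                 names_to = "alt",
                 values_to = "relationship") %>%
    mutate(alt_pnum = match(str_sub(alt, 7, 7), LETTERS)) %>%
    filter(relationship == code) %>%
    select(MCSID, APNUM00, alt_pnum)
}

# Code 3: ego is the natural son/daughter of alt, so alt is a natural parent
toy_parents <- find_alts(toy_prel, 3)
toy_parents
```

In this toy family, person 3 is returned twice, once with each parent (persons 1 and 2).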
260+
261+
# Coda
262+
263+
This only scratches the surface of what can be achieved with the
264+
household grid. The `mcs[1-7]_hhgrid.dta` files also contain information on
265+
cohort members’ and family members’ dates of birth, which can be used to,
266+
for example, identify the number of resident younger siblings, determine
267+
maternal and paternal age at birth, and so on.
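
As a rough illustration of the last idea, the sketch below computes maternal age at birth on invented data. The column names `mother_dob` and `cm_dob` are hypothetical; the household grid files may store dates of birth differently (for example, as separate month and year variables), so this would need adapting to the actual variables.

```r
library(tidyverse)

# Invented data with hypothetical date-of-birth columns
toy_dob <- tibble(
  MCSID      = c("X1", "X2"),
  mother_dob = as.Date(c("1972-03-15", "1980-11-02")),
  cm_dob     = as.Date(c("2000-09-20", "2001-01-10"))
)

toy_age <- toy_dob %>%
  # Difference in days, converted to approximate years
  mutate(maternal_age = as.numeric(cm_dob - mother_dob) / 365.25)

toy_age
```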

quarto/README.txt

Lines changed: 6 additions & 4 deletions
@@ -1,5 +1,7 @@
1-
Use the following command to execute render in the correct folders:
1+
This folder contains the quarto (.qmd) files that are used to generate the markdown files for the webpages.
2 2

3-
quarto_render("quarto/next_steps-test.qmd",
4-
output_file = "next_steps-test.md",
5-
execute_dir = Sys.getenv("ns_fld"))
3+
Use the following command to render in the correct folders:
4+
5+
quarto::quarto_render("quarto/mcs-merging_across_sweeps.qmd",
6+
output_file = "mcs-merging_across_sweeps.md",
7+
execute_dir = Sys.getenv("mcs_fld"))

quarto/mcs-household_grid.qmd

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
1+
---
2+
layout: default
3+
title: "Working with the Household Grid"
4+
nav_order: 4
5+
parent: MCS
6+
format: docusaurus-md
7+
---
8+
9+
# Introduction
10+
11+
In this tutorial, we will learn the basics of using the household grid. Specifically, we will see how to identify particular family members, how to use the household grid to create family-member specific variables, and how to determine the relationships between family members. We will use the example of finding natural mothers' smoking status at the first sweep.
12+
13+
```{r}
14+
#| warning: false
15+
# Load Packages
16+
library(tidyverse) # For data manipulation
17+
library(haven) # For importing .dta files
18+
```
19+
20+
```{r}
21+
#| include: false
22+
# setwd(Sys.getenv("mcs_fld"))
23+
```
24+
25+
# Finding Mother of Cohort Members
26+
We will load just four variables from the household grid: `MCSID` and `APNUM00`, which uniquely identify an individual, and `AHPSEX00` and `AHCREL00`, which contain information on the individual's sex and their relationship to the household's cohort member(s).
27+
`AHCREL00 == 7` identifies natural parents and `AHPSEX00 == 2` identifies females. Combining the two identifies natural mothers. Below, we use `count()` to show the different (observed) values for the sex and relationship variables. We also use the `filter()` function (which retains observations where the conditions are `TRUE`) to create a dataset containing the identifiers (`MCSID` and `APNUM00`) of natural mothers only; we will merge this with smoking information shortly. `add_count(MCSID) %>% filter(n == 1)` is included as an interim step to ensure there is just one natural mother per family.
28+
29+
```{r}
30+
df_0y_hhgrid <- read_dta("0y/mcs1_hhgrid.dta") %>%
31+
select(MCSID, APNUM00, AHPSEX00, AHCREL00)
32+
33+
df_0y_hhgrid %>%
34+
count(AHPSEX00)
35+
36+
df_0y_hhgrid %>%
37+
count(AHCREL00)
38+
39+
df_0y_mothers <- df_0y_hhgrid %>%
40+
filter(AHCREL00 == 7,
41+
AHPSEX00 == 2) %>%
42+
add_count(MCSID) %>%
43+
filter(n == 1) %>%
44+
select(MCSID, APNUM00)
45+
```
46+
47+
Note, where a cohort member is part of a family (`MCSID`) with two or more cohort members, the cohort member will have been a multiple birth (i.e., twin or triplet), so familial relationships should apply to all cohort members in the family, which is why there is just one relationship (`[A-G]HCREL00`) variable per household grid file. This will change as the cohort members age, move into separate residences and start their own families.
48+
49+
# Creating a Mother's Smoking Variable
50+
51+
Now that we have a dataset containing the IDs of natural mothers, we can load the smoking information from the Sweep 1 parent interview file. The smoking variable used is called `APSMUS0A`, which contains information on the tobacco products a parent uses. We classify a parent as a smoker if they use any tobacco product (`mutate(smoker = case_when(...))`).
52+
53+
```{r}
54+
df_0y_parent <- read_dta("0y/mcs1_parent_interview.dta") %>%
55+
select(MCSID, APNUM00, APSMUS0A)
56+
57+
df_0y_parent %>%
58+
count(APSMUS0A)
59+
60+
df_0y_smoking <- df_0y_parent %>%
61+
mutate(smoker = case_when(APSMUS0A %in% 2:95 ~ 1,
62+
APSMUS0A == 1 ~ 0)) %>%
63+
select(MCSID, APNUM00, smoker)
64+
```
65+
66+
Now we can merge the two datasets together to ensure we only keep rows in `df_0y_smoking` that appear in `df_0y_mothers`. We use `left_join()` to do this, with `df_0y_mothers` as the dataset determining the outputted rows, so that we have one row per identified mother. The result is a dataset with one row per family with an identified mother. We rename the `smoker` variable to `mother_smoker` to clarify that it refers to the mother's smoking status.
67+
68+
Below we also pipe this dataset into the `tabyl()` function (from `janitor`) to tabulate the number and proportions of mothers who smoke and those who do not.
69+
70+
```{r}
71+
# install.packages("janitor") # Uncomment if you need to install
72+
library(janitor)
73+
df_0y_mothers %>%
74+
left_join(df_0y_smoking, by = c("MCSID", "APNUM00")) %>%
75+
select(MCSID, mother_smoker = smoker) %>%
76+
tabyl(mother_smoker)
77+
```
78+
79+
# Determining Relationships between Non-Cohort Members
80+
The household grids include another set of relationship variables (`[A-G]HPREL[A-Z]0`). These can be used to identify the relationships between family members. These variables record the person in the row's (ego) relationship to the person denoted by the column (alt); the penultimate letter `[A-Z]` in `[A-G]HPREL[A-Z]0` corresponds to the person's `PNUM00`. For instance, the variable `AHPRELB0` would denote the relationship of the person in the row to the person with `APNUM00 == 2`. We will extract a small set of data from the Sweep 1 household grid to show this in action.
81+
82+
```{r}
83+
df_0y_hhgrid_prel <- read_dta("0y/mcs1_hhgrid.dta") %>%
84+
select(MCSID, APNUM00, matches("AHPREL[A-Z]0"))
85+
86+
df_0y_hhgrid_prel %>%
87+
select(MCSID, APNUM00, AHPRELA0, AHPRELB0, AHPRELC0, AHPRELD0) %>%
88+
filter(MCSID == "M10001N") # To look at just one family
89+
```
90+
91+
There are seven members in this family, one of whom is a cohort member (`APNUM00 == 100`). `APNUM00`'s 1 and 2 are the (natural) parents, and `APNUM00`'s 3-6 and 100 are the (natural) children. The relationship variables show that `APNUM00`'s 1 and 2 are married, and `APNUM00`'s 3-6 and 100 are siblings. Note the symmetry in the relationships: where `APNUM00 == 1`, `AHPRELC0 == 7 [Natural Parent]` and where `APNUM00 == 3`, `AHPRELA0 == 3 [Natural Child]`.
92+
93+
If we want to find the particular person occupying a particular relationship for an individual (e.g., we want to know the `PNUM00` of the person's partner), we need to reshape the data into long-format with one row per ego-alt relationship within a family. For instance, if we want to find each person's spouse (conditional on one being present), we can do the following:
94+
95+
```{r}
96+
df_0y_hhgrid_prel %>%
97+
pivot_longer(cols = matches("AHPREL[A-Z]0"),
98+
names_to = "alt",
99+
values_to = "relationship") %>%
100+
mutate(APNUM00_alt = match(str_sub(alt, 7, 7), LETTERS)) %>%
101+
filter(relationship == 1) %>%
102+
select(MCSID, APNUM00, spouse_pnum = APNUM00_alt)
103+
```
104+
105+
# Coda
106+
This only scratches the surface of what can be achieved with the household grid. The `mcs[1-7]_hhgrid.dta` files also contain information on cohort members' and family members' dates of birth, which can be used to, for example, identify the number of resident younger siblings, determine maternal and paternal age at birth, and so on.
