 ---
 layout: default
 title: Working with the Household Grid
-nav_order: 4
+nav_order: 3
 parent: MCS
 format: docusaurus-md
 ---
@@ -11,12 +11,17 @@ format: docusaurus-md

 # Introduction

-In this tutorial, we will learn the basics of using the household grid.
-Specifically, we will see how to identify particular family members, how
-to use the household grid to create family-member specific variables,
-and how to determine the relationships between family members. We will
-use the example of finding natural mothers smoking status at the first
-sweep.
+In this section, we describe the basics of using the household grid.
+Specifically, we show how to use the household grid to:
+
+1. Identify particular family members
+
+2. Create family-member specific variables
+
+3. Determine the relationships between non-cohort members within a
+   family.
+
+We use the following packages:

 ``` r
 # Load Packages
@@ -26,26 +31,28 @@ library(haven) # For importing .dta files

 # Finding Mother of Cohort Members

-We will load just four variables from the household grid: `MCSID` and
-`APNUM00`, which uniquely identify an individual, and `AHPSEX00` and
-`AHCREL00`, which contain information on the individual’s sex and their
-relationship to the household’s cohort member(s). `AHCREL00 == 7`
-identifies natural parents and `AHPSEX00 == 2` identifies females.
-Combining the two identifies natural mothers. Below, we use `count()` to
-show the different (observed) values for the sex and relationship
-variables. We also use the `filter()` function (which retains
-observations where the conditions are `TRUE`) to create a dataset
-containing the identifiers (`MCSID` and `APNUM00` of natural mothers
-only; we will merge this will smoking information shortly.
-`add_count(MCSID) %>% filter(n == 1)` is included as an interim step to
-ensure there is just one natural mother per family.
+To show how to perform 1 & 2, we use the example of finding natural
+mothers’ smoking status at the first sweep. We load just four variables
+from the Sweep 1 household grid: `MCSID` and `APNUM00`, which together
+uniquely identify an individual, and `AHPSEX00` and `AHCREL00`, which
+contain information on the individual’s sex and their relationship to
+the household’s cohort member(s). `AHCREL00 == 7` identifies natural
+parents and `AHPSEX00 == 2` identifies females. Combining the two
+identifies natural mothers. Below, we use `count()` to show the
+different (observed) values for the sex and relationship variables. We
+also use the `filter()` function (which retains observations where the
+conditions are `TRUE`) to create a dataset containing the identifiers
+(`MCSID` and `APNUM00`) of natural mothers only; we will merge this with
+the smoking information shortly. `add_count(MCSID) %>% filter(n == 1)`
+is included as an interim step to ensure there is just one natural
+mother per family.[^1]

 ``` r
 df_0y_hhgrid <- read_dta("0y/mcs1_hhgrid.dta") %>%
-  select(MCSID, APNUM00, AHPSEX00, AHCREL00)
+  select(MCSID, APNUM00, AHPSEX00, AHCREL00) # Retains the listed variables

 df_0y_hhgrid %>%
-  count(AHPSEX00)
+  count(AHPSEX00) # Tabulates each sex; AHPSEX00 does not record the sex of cohort members
 ```

 ``` text
@@ -60,7 +67,7 @@ df_0y_hhgrid %>%

 ``` r
 df_0y_hhgrid %>%
-  count(AHCREL00)
+  count(AHCREL00) # Tabulates each relationship to a cohort member
 ```

 ``` text
@@ -87,32 +94,35 @@ df_0y_hhgrid %>%

 ``` r
 df_0y_mothers <- df_0y_hhgrid %>%
-  filter(AHCREL00 == 7,
-         AHPSEX00 == 2) %>%
-  add_count(MCSID) %>%
-  filter(n == 1) %>%
-  select(MCSID, APNUM00)
+  filter(
+    AHCREL00 == 7, # Keep natural parents...
+    AHPSEX00 == 2 # ...who are female.
+  ) %>%
+  add_count(MCSID) %>% # Creates new variable (n) containing # of records with given MCSID
+  filter(n == 1) %>% # Keep where only one recorded natural mother per family
+  select(MCSID, APNUM00) # Keep identifier variables
 ```

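To see what the `add_count(MCSID) %>% filter(n == 1)` step does, the following is a minimal sketch on a made-up household grid. The family IDs (`F1` to `F3`) and the object names `toy_hhgrid` and `toy_mothers` are hypothetical and are not taken from the MCS files; only the variable names and codes follow the text above.

``` r
library(dplyr)

# Toy household grid: hypothetical records, not real MCS data
toy_hhgrid <- tibble(
  MCSID    = c("F1", "F1", "F2", "F2", "F3"),
  APNUM00  = c(1, 2, 1, 2, 1),
  AHPSEX00 = c(2, 1, 2, 2, 2), # 2 = female
  AHCREL00 = c(7, 7, 7, 7, 7)  # 7 = natural parent
)

toy_mothers <- toy_hhgrid %>%
  filter(AHCREL00 == 7, AHPSEX00 == 2) %>% # Keep female natural parents
  add_count(MCSID) %>%                     # n = number of such records per family
  filter(n == 1) %>%                       # Drop families with more than one candidate
  select(MCSID, APNUM00)

toy_mothers
```

In this toy data, family `F2` has two records coded as female natural parents, so it is dropped, while `F1` and `F3` are kept with one row each.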
 Note, where a cohort member is part of a family (`MCSID`) with two or
 more cohort members, the cohort member will have been a multiple birth
 (i.e., twin or triplet), so familial relationships should apply to all
 cohort members in the family, which is why there is just one
 relationship (`[A-G]HCREL00`) variable per household grid file. This
-will change as the cohort members age, move into separate residences and
-start their own families.
+will change as the cohort members age, moving into separate residences
+and starting their own families.

 # Creating a Mother’s Smoking Variable

 Now we have a dataset containing the IDs of natural mothers, we can load
-the smoking information from the Sweep 1 parent interview file. The
-smoking variable used is called `APSMUS0A` which contains information on
-the tobacco products a parent uses. We classify a parent as a smoker if
-they use any tobacco product (`mutate(smoker = case_when(...))`).
+the smoking information from the Sweep 1 parent interview file
+(`mcs1_parent_interview.dta`). The smoking variable we use is called
+`APSMUS0A` and contains information on the tobacco product (if any) a
+parent consumes. We classify a parent as a smoker if they use any
+tobacco product (`mutate(parent_smoker = case_when(...))`).

 ``` r
 df_0y_parent <- read_dta("0y/mcs1_parent_interview.dta") %>%
-  select(MCSID, APNUM00, APSMUS0A)
+  select(MCSID, APNUM00, APSMUS0A) # Retains only the variables we need

 df_0y_parent %>%
   count(APSMUS0A)
@@ -135,18 +145,18 @@ df_0y_parent %>%

 ``` r
 df_0y_smoking <- df_0y_parent %>%
-  mutate(smoker = case_when(APSMUS0A %in% 2:95 ~ 1,
-                            APSMUS0A == 1 ~ 0)) %>%
-  select(MCSID, APNUM00, smoker)
+  mutate(parent_smoker = case_when(APSMUS0A %in% 2:95 ~ 1, # If APSMUS0A is integer between 2 and 95, then 1
+                                   APSMUS0A == 1 ~ 0)) %>% # If APSMUS0A is 1, then 0
+  select(MCSID, APNUM00, parent_smoker)
 ```

 Now we can merge the two datasets together to ensure we only keep rows
 in `df_0y_smoking` that appear in `df_0y_mothers`. We use `left_join()`
 to do this, with `df_0y_mothers` as the dataset determining the
-outputted rows, so that we have one row per identified mother. The
+outputted rows, so that we have one row per identified mother.[^2] The
 result is a dataset with one row per family with an identified mother.
-We rename the `smoker` variable to `mother_smoker` to clarify that it
-refers to the mother’s smoking status.
+We rename the `parent_smoker` variable to `mother_smoker` to clarify
+that it refers to the mother’s smoking status.

 Below we also pipe this dataset into the `tabyl()` function (from
 `janitor`) to tabulate the number and proportions of mothers who smoke
@@ -155,23 +165,9 @@ and those who do not.
 ``` r
 # install.packages("janitor") # Uncomment if you need to install
 library(janitor)
-```
-
-``` text
-
-Attaching package: 'janitor'
-```
-
-``` text
-The following objects are masked from 'package:stats':
-
-    chisq.test, fisher.test
-```
-
-``` r
 df_0y_mothers %>%
   left_join(df_0y_smoking, by = c("MCSID", "APNUM00")) %>%
-  select(MCSID, mother_smoker = smoker) %>%
+  select(MCSID, mother_smoker = parent_smoker) %>%
   tabyl(mother_smoker)
 ```

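The recode-then-join logic in this hunk can be sketched on toy data, without `janitor` or the `.dta` files. All values and the `toy_*` object names below are hypothetical; the `APSMUS0A` codes are used only as described in the text (1 for non-smokers, 2 to 95 for any tobacco product).

``` r
library(dplyr)

# Hypothetical mothers identified from the household grid
toy_mothers <- tibble(
  MCSID   = c("F1", "F2", "F3"),
  APNUM00 = c(1, 1, 2)
)

# Hypothetical parent interview records; F3's mother has no interview row
toy_parent <- tibble(
  MCSID    = c("F1", "F1", "F2"),
  APNUM00  = c(1, 2, 1),
  APSMUS0A = c(1, 2, 3)
)

toy_smoking <- toy_parent %>%
  mutate(parent_smoker = case_when(APSMUS0A %in% 2:95 ~ 1, # Any tobacco product
                                   APSMUS0A == 1 ~ 0)) %>% # No tobacco product
  select(MCSID, APNUM00, parent_smoker)

toy_result <- toy_mothers %>%
  left_join(toy_smoking, by = c("MCSID", "APNUM00")) %>% # One row per mother
  select(MCSID, mother_smoker = parent_smoker)

toy_result
```

Because `toy_mothers` is the left-hand data frame, the father's record in family `F1` is not carried over, and the mother in `F3`, who has no interview record in this toy example, is kept with a missing `mother_smoker` value.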
@@ -185,22 +181,25 @@ df_0y_mothers %>%
 # Determining Relationships between Non-Cohort Members

 The household grids include another set of relationship variables
-(`[A-G]HPREL[A-Z]0`). These can be used to identify the relationships
-between family members. These variables record the person in the row’s
-(ego) relationship to the person denoted by the column (alt); the
-penultimate letter `[A-Z]` in `[A-G]HPREL[A-Z]0` corresponds to the
-person’s `PNUM00`. For instance, the variable `AHPRELB0` would denote
-the relationship of the person in the row to the person with
-`APNUM00 == 2`. We will extract a small set of data from the Sweep 1
-household grid to show this in action.
+besides `[A-G]HCREL00`. These vary in name slightly between sweeps:
+`[A-D]HPREL[A-Z]0` in `mcs[1-4]_hhgrid.dta`, `EPREL0[A-Z]00` in
+`mcs5_hhgrid.dta`, and `[F-G]HPREL0[A-Z]` in `mcs[6-7]_hhgrid.dta`.
+These variables can be used to identify the relationships between
+non-cohort member family members. Specifically, they record the person
+in the row’s (ego) relationship to the person denoted by the column
+(alt); the letter `[A-Z]` in the variable name corresponds to the alt’s
+`[A-D]PNUM00`. For instance, the variable `AHPRELB0` denotes the
+relationship of the person in the row to the person in the same family
+with `APNUM00 == 2`. Below, we extract a small set of data from the
+Sweep 1 household grid to show this in action.

 ``` r
 df_0y_hhgrid_prel <- read_dta("0y/mcs1_hhgrid.dta") %>%
   select(MCSID, APNUM00, matches("AHPREL[A-Z]0"))

 df_0y_hhgrid_prel %>%
   filter(MCSID == "M10001N") %>% # To look at just one family
-  select(APNUM00, AHPRELA0, AHPRELB0, AHPRELC0)
+  select(APNUM00, AHPRELA0, AHPRELB0, AHPRELC0) # To look at first few relationship variables
 ```

 ``` text
@@ -220,24 +219,26 @@ There are seven members in this family, one of whom is a cohort member
 (`APNUM00 == 100`). `APNUM00`’s 1 and 2 are the (natural) parents, and
 `APNUM00`’s 3-6 and 100 are the (natural) children. The relationship
 variables show that `APNUM00`’s 1 and 2 are married, and `APNUM00`’s 3-7
-are siblings. Note, the symmetry in the relationships. Where,
-`APNUM00 == 1`, `AHPRELC0 == 7 [Natural Parent]` and where
-`APNUM00 == 3`, `AHPRELA0 == 3 [Natural Child]`.
-
-If we want to find the particular person occupying a particular
-relationship for an individual (e.g., we want to know the `PNUM00` of
-the person’s partner), we need to reshape the data into long-format with
-one row per ego-alt relationship within a family. For instance, if we
-want to find each person’s spouse (conditional on one being present), we
-can do the following:
+are siblings (`AHPRELC0 == 11 [Natural brother/sister]`) and biological
+offspring of `APNUM00`’s 1 and 2
+(`AHPREL[A-B]0 == 3 [Natural son/daughter]`). Note the symmetry in the
+relationships. Where `APNUM00 == 1`, `AHPRELC0 == 7 [Natural Parent]`,
+and where `APNUM00 == 3`, `AHPRELA0 == 3 [Natural son/daughter]`.
+
+If we want to find the particular person occupying a specific
+relationship for an individual (e.g., we want to know the `[A-G]PNUM00`
+of the person’s partner), we need to reshape the data into long-format
+with one row per ego-alt relationship within a family. For instance, if
+we want to find each person’s spouse (conditional on one being present),
+we can do the following:[^3]

 ``` r
 df_0y_hhgrid_prel %>%
   pivot_longer(cols = matches("AHPREL[A-Z]0"),
                names_to = "alt",
                values_to = "relationship") %>%
-  mutate(APNUM00_alt = match(str_sub(alt, 7, 7), LETTERS)) %>%
-  filter(relationship == 1) %>%
+  mutate(APNUM00_alt = match(str_sub(alt, -2, -2), LETTERS)) %>% # Creates alt's PNUM00 by matching penultimate letter to position in alphabet
+  filter(relationship == 1) %>% # Keep where husband or wife
   select(MCSID, APNUM00, partner_pnum = APNUM00_alt)
 ```

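The reshape in this hunk can be tried on a small hypothetical family of three (two married parents and one child). The `toy_*` names and the family ID are made up, and the value `96` is used here only as an assumed placeholder for a person's relationship to themselves; codes 1, 3 and 7 are used as described in the text.

``` r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical ego-by-alt relationship grid for one family
toy_prel <- tibble(
  MCSID    = "F1",
  APNUM00  = c(1, 2, 3),
  AHPRELA0 = c(96, 1, 3), # Relationship to person 1
  AHPRELB0 = c(1, 96, 3), # Relationship to person 2
  AHPRELC0 = c(7, 7, 96)  # Relationship to person 3
)

toy_spouses <- toy_prel %>%
  pivot_longer(cols = matches("AHPREL[A-Z]0"),
               names_to = "alt",
               values_to = "relationship") %>%                  # One row per ego-alt pair
  mutate(APNUM00_alt = match(str_sub(alt, -2, -2), LETTERS)) %>% # "B" -> 2, and so on
  filter(relationship == 1) %>%                                 # Keep husband or wife
  select(MCSID, APNUM00, partner_pnum = APNUM00_alt)

toy_spouses
```

The result has one row per married person: person 1 is paired with person 2 and vice versa, while the child has no row because none of their relationships is coded 1.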
@@ -261,7 +262,28 @@ df_0y_hhgrid_prel %>%
 # Coda

 This only scratches the surface of what can be achieved with the
-household grid. The `mcs[1-7]_hhgrid.dta` also contain information on
-cohort-member and family-member’s dates of birth, which can be used to,
-for example, identify the number of resident younger siblings, determine
-maternal and paternal age at birth, and so on.
+household grid. The `mcs[1-7]_hhgrid.dta` files also contain information
+on cohort-member and family-member’s dates of birth, which can be used
+to, for example, identify the number of resident younger siblings,
+determine maternal and paternal age at birth, and so on.
+
+[^1]: Loading the `.dta` files into `R` with `haven::read_dta()` retains
+    the dataset metadata, including variable names and labels, mainly by
+    storing variables as `labelled` class objects. See the [`labelled`
+    package help
+    files](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html)
+    for more information on working with this metadata - for instance,
+    converting `labelled` variables to standard `R` factor variables or
+    replacing negative values (generally reserved in MCS data to
+    indicate missingness) with `R`’s native `NA` value.
+
+[^2]: `left_join()` takes as arguments two data frames and retains only
+    the rows in the first data frame, regardless of whether there is a
+    match with the second. See [*Combining Data Across
+    Sweeps*](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html)
+    for more discussion of the `*_join()` functions.
+
+[^3]: For more on reshaping data, see [*Reshaping Data from Long to Wide
+    (or Wide to
+    Long)*](https://cls-data.github.io/docs/mcs-reshape_long_wide.html).