Commit f79de02

add mcs_sweep_folders
1 parent 0165de9 commit f79de02

7 files changed: +148 -54 lines changed

docs/intro.md

Lines changed: 9 additions & 4 deletions
@@ -13,12 +13,17 @@ The Centre for Longitudinal Studies (CLS) manages four cohort studies for which
 3. [Next Steps](https://doi.org/10.5334/ohd.16), a cohort of English schoolchildren followed from age 13/14 and born in 1989/90.
 4. [The Millennium Cohort Study (MCS)](https://doi.org/10.1093/ije/dyu001), a birth cohort of individuals born in Britain in 2000/02.

-This website provides `R` and `Stata` code for common data management tasks in each of the studies. This include merging files across survey sweeps, reshaping data from wide to long format, and identifying correct identifier variables for observational units in each study (e.g., cohort members, families, parents, and so on);
+This website provides `R` and `Stata` code for common data management tasks in each of the studies. This includes merging files across survey sweeps, reshaping data from wide to long format, and using the correct variables to identify observational units (e.g., cohort members, families, parents, and so on).

-For background on these studies, please see cohort profile papers (linked above) and the [CLS website](https://cls.ucl.ac.uk/cls-studies/). Queries on the data can be sent to the [CLS Data team](mailto:[email protected]).
+For background on these studies, please see cohort profile papers (linked above) and the [CLS website](https://cls.ucl.ac.uk/cls-studies/). Queries about the data can be sent to the [CLS Data team](mailto:clsdata@ucl.ac.uk). Queries and comments about this website can be directed to [Liam Wright](mailto:liam.wright@ucl.ac.uk).

 # Data Access

-Most of the data is available to researchers via the UK Data Service (links: [NCDS](https://doi.org/10.5255/UKDA-Series-2000032), [BCS70](https://doi.org/10.5255/UKDA-Series-200001), [Next Steps](https://doi.org/10.5255/UKDA-Series-2000030), and [MCS](https://doi.org/10.5255/UKDA-Series-2000031)). This includes a series of harmonized measures created by [CLOSER](https://doi.org/10.5255/UKDA-Series-2000111). Most of the UKDS data is available via the least restrictive End User Licence, though more sensitive variables, such as low-level geographies, are available by Special Licence or Secure Access only.
+Most of the data is available to researchers via the UK Data Service (links: [NCDS](https://doi.org/10.5255/UKDA-Series-2000032), [BCS70](https://doi.org/10.5255/UKDA-Series-200001), [Next Steps](https://doi.org/10.5255/UKDA-Series-2000030), and [MCS](https://doi.org/10.5255/UKDA-Series-2000031)). This includes a series of harmonized measures created by [CLOSER](https://doi.org/10.5255/UKDA-Series-2000111). Most of the UKDS data is available via the minimally restrictive End User Licence. More sensitive variables, such as low-level geographies, are available by Special Licence or Secure Access only.

-Further, some data, such as raw genetic data and biological samples, are only available by application to CLS directly. More information is available on the [CLS website](https://cls.ucl.ac.uk/data-access-training/data-access/).
+Some data, such as raw genetic data and biological samples, are only available by application to CLS directly. More information is available on the [CLS website](https://cls.ucl.ac.uk/data-access-training/data-access/).
+
+# Preliminaries
+The code presented on this website presumes you have downloaded the data from the UKDS in `Stata` (`.dta`) format. For historical reasons, data on the UKDS for the NCDS, BCS70 and MCS are separated by survey sweep. To get all of the survey data for a study, you therefore need to download multiple individual datasets. This can make merging data across sweeps a little challenging, as the downloaded data are dispersed across multiple folders. The file and folder names are also often hard to interpret.
+
+To make using the datasets easier, we provide code to reorganise the `.dta` files into a simple directory structure with a folder for each sweep. This code is described under each study section (e.g., `MCS -> Creating a Simple Folder Structure`). We will assume you have organised the files in this way in other code we present.

docs/misc-sweep_folders.md

Lines changed: 0 additions & 48 deletions
This file was deleted.

docs/msc-sweep_folders.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+---
+layout: default
+title: "Creating a Simple Folder Structure"
+nav_order: 1
+parent: Miscellaneous
+format: docusaurus-md
+---
+
+# Introduction {#introduction}
+
+This page introduces code for taking [MCS UKDS End User Licence](https://doi.org/10.5255/UKDA-Series-2000031) zipped Stata (`.dta`) files, unzipping them and placing them into per-sweep folders. The code is available on GitHub: https://github.com/CLS-Data/make-directories-mcs.
+
+To use the code, first download or clone the GitHub directory. To download the directory, on the GitHub website, click `Code -> Download Zip` (see screenshot below), then unzip the downloaded file and place it in a suitable location on your computer. To clone the directory, open your computer's command line or terminal, navigate to an appropriate location (`cd ...`) and type `git clone https://github.com/CLS-Data/make-directories-mcs`. You may want to rename the folder from `make-directories-mcs` to `MCS` or something similar.
+
+![Downloading the GitHub directory](../images/mcs-sweep_folders_1.png)
+
+When the folder is downloaded, open the `README.md` file and follow the instructions. You will need to download `R` and `RStudio`, as well as the appropriate MCS Stata files from the UK Data Service. The `README.md` file lists the asset numbers of the files the code works with.
+
+Once completed, the folder should look like the below. You will see the code also creates a data dictionary (in `.csv` and `R` [`.Rdata`] formats) which you can use to search for variables.
+
+![Directory after code completed](../images/mcs-sweep_folders_2.png)
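
The per-sweep layout described in this file can be illustrated with a minimal sketch. This is not the code from the `make-directories-mcs` repository: it does not unzip anything (base R's `utils::unzip()` could be used for that step), and it assumes file names begin with `mcs<sweep number>_`, as in `mcs2_family_derived.dta`. The function name `organise_by_sweep` and the `sweep<number>` folder names are hypothetical.

```r
# Hypothetical sketch: copy flat .dta files into one folder per sweep.
# Assumes file names start with "mcs<sweep number>_", e.g. "mcs2_cm_derived.dta".
organise_by_sweep <- function(source_dir, target_dir) {
  dta_files <- list.files(source_dir, pattern = "\\.dta$", full.names = TRUE)
  for (path in dta_files) {
    file_name <- basename(path)
    # Skip files that do not follow the assumed naming pattern.
    if (!grepl("^mcs[0-9]+_", file_name)) next
    # Pull the sweep number out of the "mcs<number>_" prefix.
    sweep <- sub("^mcs([0-9]+)_.*$", "\\1", file_name)
    sweep_dir <- file.path(target_dir, paste0("sweep", sweep))  # hypothetical folder name
    dir.create(sweep_dir, recursive = TRUE, showWarnings = FALSE)
    file.copy(path, file.path(sweep_dir, file_name), overwrite = TRUE)
  }
  # Return the names of the sweep folders that now exist.
  invisible(list.dirs(target_dir, recursive = FALSE, full.names = FALSE))
}

# Demonstration with empty placeholder files in a temporary directory.
source_dir <- file.path(tempdir(), "ukds_download")
target_dir <- file.path(tempdir(), "MCS")
dir.create(source_dir, showWarnings = FALSE)
file.create(file.path(source_dir, c("mcs1_family_derived.dta",
                                    "mcs2_family_derived.dta",
                                    "mcs2_cm_derived.dta")))
sweep_folders <- organise_by_sweep(source_dir, target_dir)
print(sweep_folders)
```

The files are copied rather than moved so that the original download stays untouched if the naming assumption turns out not to hold.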

images/mcs-sweep_folders_1.png

1.6 MB

images/mcs-sweep_folders_2.png

125 KB

quarto/README.txt

Lines changed: 4 additions & 2 deletions
@@ -1,3 +1,5 @@
-Use the following command to save and render into the correct folder:
+Use the following command to execute and render in the correct folders:

-quarto_render("quarto/next_steps-test.qmd", output_file = "next_steps-test.md")
+quarto_render("quarto/next_steps-test.qmd",
+              output_file = "next_steps-test.md",
+              execute_dir = Sys.getenv("ns_fld"))
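
The new command reads the execution folder from the `ns_fld` environment variable. `Sys.getenv()` returns an empty string when a variable is unset, so a small check before rendering may be useful. The sketch below is an illustration only: the helper name `get_data_dir` is hypothetical and is not part of this repository or of the `quarto` package.

```r
# Hypothetical helper: read a folder path from an environment variable and
# stop with a clear message if the variable is unset or the folder is missing.
get_data_dir <- function(env_var) {
  path <- Sys.getenv(env_var, unset = "")
  if (!nzchar(path)) {
    stop("Environment variable '", env_var, "' is not set.")
  }
  if (!dir.exists(path)) {
    stop("Folder '", path, "' does not exist.")
  }
  path
}

# Demonstration: a temporary folder stands in for the real data folder.
Sys.setenv(ns_fld = tempdir())
ns_dir <- get_data_dir("ns_fld")

# The checked path could then be passed to the render call, e.g.:
# quarto::quarto_render("quarto/next_steps-test.qmd",
#                       output_file = "next_steps-test.md",
#                       execute_dir = ns_dir)
```

In practice `ns_fld` would typically be set once, for example in `.Renviron`, rather than with `Sys.setenv()` in the session.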

quarto/mcs-merge_within_sweep.qmd

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
+---
+layout: default
+title: "Merge within sweep"
+nav_order: 2
+parent: MCS
+format: docusaurus-md
+---
+
+This page shows code for merging MCS files which use different data structures within a given sweep. In this demonstration, we will use data from sweep 2 (age 3y) of the survey.
+
+```{r}
+#| warning: false
+library(tidyverse)
+library(haven)
+```
+
+We will demonst
+
+```{r}
+family <- read_dta("mcs2_family_derived.dta")
+cm <- read_dta("mcs2_cm_derived.dta")
+parent <- read_dta("mcs2_parent_derived.dta")
+parent_cm <- read_dta("mcs2_parent_cm_interview.dta")
+hhgrid <- read_dta("mcs2_hhgrid.dta")
+```
+
+library(tidyverse)
+library(haven)
+
+setwd("/Users/liamwright/Documents/Data/MCS/3y")
+
+# 1. Load Data ----
+family <- read_dta("mcs2_family_derived.dta")
+cm <- read_dta("mcs2_cm_derived.dta")
+parent <- read_dta("mcs2_parent_derived.dta")
+parent_cm <- read_dta("mcs2_parent_cm_interview.dta")
+hhgrid <- read_dta("mcs2_hhgrid.dta")
+
+family
+cm
+parent
+parent_cm
+hhgrid
+
+# 2. Clean Data ----
+# family: BACTRY00 Country
+# cm: BDC08E00 Ethnicity
+# parent_cm: BPOFRE00 Any parent reads to child
+# parent_cm: BPPIAW00 Main / Secondary Carer warm relationship with child
+# parent: BDD05S00 NS-SEC for the family
+# parent: BDDNVQ00 Parental Education (NVQ)
+# hhgrid: BHCREL00 Relationship to CM
+
+df_ethnic_group <- cm %>%
+  select(MCSID, BCNUM00, ethnic_group = BDC08E00)
+
+df_country <- family %>%
+  select(MCSID, country = BACTRY00)
+
+df_reads <- parent_cm %>%
+  select(MCSID, BPNUM00, BCNUM00, BPOFRE00) %>%
+  mutate(parent_reads = case_when(between(BPOFRE00, 1, 3) ~ 1,
+                                  between(BPOFRE00, 4, 6) ~ 0)) %>%
+  drop_na() %>%
+  group_by(MCSID, BCNUM00) %>%
+  summarise(parent_reads = max(parent_reads),
+            .groups = "drop")
+
+df_warm <- parent_cm %>%
+  select(MCSID, BCNUM00, BELIG00, BPPIAW00) %>%
+  mutate(variable = ifelse(BELIG00 == 1, "main_warm", "secondary_warm"),
+         value = case_when(BPPIAW00 == 5 ~ 1,
+                           between(BPPIAW00, 1, 6) ~ 0)) %>%
+  select(MCSID, BCNUM00, variable, value) %>%
+  pivot_wider(names_from = variable, values_from = value)
+
+df_nssec <- parent %>%
+  select(MCSID, BPNUM00, parent_nssec = BDD05S00) %>%
+  mutate(parent_nssec = if_else(parent_nssec < 0, NA, parent_nssec)) %>%
+  drop_na() %>%
+  group_by(MCSID) %>%
+  summarise(family_nssec = min(parent_nssec))
+
+hhgrid %>% count(BHCREL00)
+hhgrid %>% count(BHPSEX00)
+
+df_mother <- hhgrid %>%
+  select(MCSID, BPNUM00, BHCREL00, BHPSEX00) %>%
+  filter(between(BPNUM00, 1, 99),
+         BHCREL00 == 7,
+         BHPSEX00 == 2) %>%
+  distinct(MCSID, BPNUM00) %>%
+  add_count(MCSID) %>%
+  filter(n == 1) %>%
+  select(MCSID, BPNUM00)
+
+df_mother_edu <- parent %>%
+  select(MCSID, BPNUM00, mother_nvq = BDDNVQ00) %>%
+  right_join(df_mother, by = c("MCSID", "BPNUM00")) %>%
+  select(-BPNUM00)
+
+df_ethnic_group
+df_country
+df_reads
+df_warm
+df_nssec
+df_mother_edu
+
+df_ethnic_group %>%
+  left_join(df_country, by = "MCSID") %>%
+  left_join(df_reads, by = c("MCSID", "BCNUM00")) %>%
+  left_join(df_warm, by = c("MCSID", "BCNUM00")) %>%
+  left_join(df_nssec, by = "MCSID") %>%
+  left_join(df_mother_edu, by = "MCSID")
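
The merge pattern in this file (collapse each source to one row per family or per cohort member, then `left_join` onto the cohort-member file) can be tried without access to the MCS data. The sketch below uses small made-up tibbles; only the identifier names `MCSID`, `BCNUM00` and `BPNUM00` come from the code above, while the `toy_*` objects and the `parent_score` variable are invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy data (made up): one row per cohort member; family "M1" has two children.
toy_cm <- tibble(MCSID = c("M1", "M1", "M2"),
                 BCNUM00 = c(1, 2, 1))

# Toy data (made up): one row per family.
toy_family <- tibble(MCSID = c("M1", "M2"),
                     country = c(1, 2))

# Toy data (made up): one row per parent, with an invented parent-level score.
toy_parent <- tibble(MCSID = c("M1", "M1", "M2"),
                     BPNUM00 = c(1, 2, 1),
                     parent_score = c(3, 1, 2))

# Collapse parent-level rows to one row per family before joining,
# so the join cannot duplicate cohort-member rows.
toy_family_score <- toy_parent %>%
  group_by(MCSID) %>%
  summarise(family_score = min(parent_score), .groups = "drop")

# Join family-level information onto the cohort-member rows.
toy_merged <- toy_cm %>%
  left_join(toy_family, by = "MCSID") %>%
  left_join(toy_family_score, by = "MCSID")

print(toy_merged)
```

Comparing `nrow(toy_merged)` with `nrow(toy_cm)` after the joins is a quick way to confirm that no join key was duplicated on the right-hand side.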
