CLS-Data
diff --git a/docs/intro.md b/docs/intro.md
Lines changed: 9 additions & 4 deletions
diff --git a/docs/misc-sweep_folders.md b/docs/misc-sweep_folders.md
Lines changed: 0 additions & 48 deletions
diff --git a/docs/msc-sweep_folders.md b/docs/msc-sweep_folders.md
Lines changed: 21 additions & 0 deletions
diff --git a/images/mcs-sweep_folders_1.png b/images/mcs-sweep_folders_1.png
1.6 MB
diff --git a/images/mcs-sweep_folders_2.png b/images/mcs-sweep_folders_2.png
125 KB
diff --git a/quarto/README.txt b/quarto/README.txt
Lines changed: 4 additions & 2 deletions
diff --git a/quarto/mcs-merge_within_sweep.qmd b/quarto/mcs-merge_within_sweep.qmd
Lines changed: 114 additions & 0 deletions
@@ -13,12 +13,17 @@ The Centre for Longitudinal Studies (CLS) manages four cohort studies for which
 3. [Next Steps](https://doi.org/10.5334/ohd.16), a cohort of English schoolchildren followed from age 13/14 and born in 1989/90.
 4. [The Millennium Cohort Study (MCS)](https://doi.org/10.1093/ije/dyu001), a birth cohort of individuals born in Britain in 2000/02.
 
-This website provides `R` and `Stata` code for common data management tasks in each of the studies. This include merging files across survey sweeps, reshaping data from wide to long format, and identifying correct identifier variables for observational units in each study (e.g., cohort members, families, parents, and so on);
+This website provides `R` and `Stata` code for common data management tasks in each of the studies. This includes merging files across survey sweeps, reshaping data from wide to long format, and using the correct variables to identify observational units (e.g., cohort members, families, parents, and so on).
 
-For background on these studies, please see cohort profile papers (linked above) and the [CLS website](https://cls.ucl.ac.uk/cls-studies/). Queries on the data can be sent to the [CLS Data team](mailto:clsdata@ucl.ac.uk).
+For background on these studies, please see cohort profile papers (linked above) and the [CLS website](https://cls.ucl.ac.uk/cls-studies/). Queries about the data can be sent to the [CLS Data team](mailto:clsdata@ucl.ac.uk). Queries and comments about this website can be directed to [Liam Wright](mailto:liam.wright@ucl.ac.uk).
 
 # Data Access
 
-Most of the data is available to researchers via the UK Data Service (links: [NCDS](https://doi.org/10.5255/UKDA-Series-2000032), [BCS70](https://doi.org/10.5255/UKDA-Series-200001), [Next Steps](https://doi.org/10.5255/UKDA-Series-2000030), and [MCS](https://doi.org/10.5255/UKDA-Series-2000031)). This includes a series of harmonized measures created by [CLOSER](https://doi.org/10.5255/UKDA-Series-2000111). Most of the UKDS data is available via the least restrictive End User Licence, though more sensitive variables, such as low-level geographies, are available by Special Licence or Secure Access only. 
+Most of the data is available to researchers via the UK Data Service (links: [NCDS](https://doi.org/10.5255/UKDA-Series-2000032), [BCS70](https://doi.org/10.5255/UKDA-Series-200001), [Next Steps](https://doi.org/10.5255/UKDA-Series-2000030), and [MCS](https://doi.org/10.5255/UKDA-Series-2000031)). This includes a series of harmonized measures created by [CLOSER](https://doi.org/10.5255/UKDA-Series-2000111). Most of the UKDS data is available via the minimally restrictive End User Licence. More sensitive variables, such as low-level geographies, are available by Special Licence or Secure Access only. 
 
-Further, some data, such as raw genetic data and biological samples, are only available by application to CLS directly. More information is available on the [CLS website](https://cls.ucl.ac.uk/data-access-training/data-access/).
+Some data, such as raw genetic data and biological samples, are only available by application to CLS directly. More information is available on the [CLS website](https://cls.ucl.ac.uk/data-access-training/data-access/).
+
+# Preliminaries
+The code presented on this website will presume you have downloaded the data from the UKDS in `Stata` (`.dta`) format. For historical reasons, data on the UKDS for the NCDS, BCS70 and MCS are separated by survey sweep. To get all of the survey data for a study, you therefore need to download multiple individual datasets. This can make merging data across sweeps a little challenging as the data as downloaded are dispersed across multiple folders. The file and folder names are also often not comprehensible.
+
+To make using the datasets easier, we provide code to reorganise the `.dta` files into a simple directory structure with a folder for each sweep. This code is described under each study section (e.g., `MCS -> Creating a Simple Folder Structure`). We will assume you have organised the files in this way in other code we present.
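
As a rough illustration of the per-sweep layout the Preliminaries text describes, the sketch below builds a path into a sweep folder with base `R`. The helper `sweep_path()` and the root folder name `MCS` are assumptions made for illustration; the `3y` folder and the file name are borrowed from the script later in this diff. The real layout is whatever the reorganising code produces.

```r
# Hypothetical helper: build the path to a file inside a per-sweep folder.
# `root` and `sweep` are illustrative assumptions, not names fixed by the website.
sweep_path <- function(root, sweep, file) {
  file.path(root, sweep, file)
}

# Simulate the layout in a temporary directory so the sketch runs anywhere.
root <- file.path(tempdir(), "MCS")
dir.create(file.path(root, "3y"), recursive = TRUE, showWarnings = FALSE)

path <- sweep_path(root, "3y", "mcs2_family_derived.dta")
basename(dirname(path))  # the sweep folder the file sits in

# With real data in place, the file could then be read with, e.g.:
# family <- haven::read_dta(path)
```

Building paths from a single root in this way avoids hard-coding an absolute working directory in every script.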
@@ -0,0 +1,21 @@
+---
+layout: default
+title: "Creating a Simple Folder Structure"
+nav_order: 1
+parent: Miscellaneous
+format: docusaurus-md
+---
+
+# Introduction {#introduction}
+
+This page introduces code for taking [MCS UKDS End User Licence](https://doi.org/10.5255/UKDA-Series-2000031) zipped Stata (`.dta`) files, unzipping them and placing them into per-sweep folders. The code is available on GitHub: https://github.com/CLS-Data/make-directories-mcs.
+
+To use the code, first download or clone the GitHub directory. To download the directory, on the GitHub website, click `Code -> Download Zip` (see screenshot below), then unzip the downloaded file and place it in a suitable location on your computer. To clone the directory, open your computer's command line or terminal, navigate to an appropriate location (`cd ...`) and type `git clone https://github.com/CLS-Data/make-directories-mcs`. You may want to rename the folder from `make-directories-mcs` to `MCS` or something similar.
+
+![Downloading the GitHub directory](../images/mcs-sweep_folders_1.png)
+
+When the folder is downloaded, open the `README.md` file and follow the instructions. You will need to download `R` and `RStudio`, as well as the appropriate MCS Stata files from the UK Data Service. The `README.md` file lists the asset numbers of the files the code will work for.
+
+Once completed, the folder should look like the screenshot below. You will see the code also creates a data dictionary (in `.csv` and `R` [`.Rdata`] formats) which you can use to search for variables.
+
+![Directory after code completed](../images/mcs-sweep_folders_2.png)
@@ -1,3 +1,5 @@
-Use the following command to save and render into the correct folder:
+Use the following command to render the file, executing its code in the correct folder:
 
-quarto_render("quarto/next_steps-test.qmd", output_file = "next_steps-test.md")
+quarto_render("quarto/next_steps-test.qmd", 
+              output_file = "next_steps-test.md",
+              execute_dir = Sys.getenv("ns_fld"))
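
The README command above reads the execution directory from an environment variable named `ns_fld`. A minimal sketch of how that variable might be set and checked before rendering is shown below; the folder used here is a temporary placeholder (an assumption), and in practice the variable could instead be defined once in a `.Renviron` file.

```r
# Assumption: `ns_fld` should point at the folder holding the Next Steps data.
# A temporary folder stands in for the real location so this runs anywhere.
ns_dir <- file.path(tempdir(), "next_steps")
dir.create(ns_dir, showWarnings = FALSE)
Sys.setenv(ns_fld = ns_dir)

# Check the variable is set and points at an existing folder before rendering.
execute_dir <- Sys.getenv("ns_fld")
ok <- nzchar(execute_dir) && dir.exists(execute_dir)
ok

# If `ok` is TRUE, the README command can be run as written:
# quarto::quarto_render("quarto/next_steps-test.qmd",
#                       output_file = "next_steps-test.md",
#                       execute_dir = Sys.getenv("ns_fld"))
```

`Sys.getenv()` returns an empty string for an unset variable, which is why the check uses `nzchar()` rather than testing for `NULL`.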
@@ -0,0 +1,114 @@
+---
+layout: default
+title: "Merge within sweep"
+nav_order: 2
+parent: MCS
+format: docusaurus-md
+---
+
+This page shows code for merging MCS files which use different data structures within a given sweep. In this demonstration, we will use data from sweep 2 (age 3y) of the survey.
+
+```{r}
+#| warning: false
+library(tidyverse)
+library(haven)
+```
+
+We will demonst
+
+```{r}
+family <- read_dta("mcs2_family_derived.dta")
+cm <- read_dta("mcs2_cm_derived.dta")
+parent <- read_dta("mcs2_parent_derived.dta")
+parent_cm <- read_dta("mcs2_parent_cm_interview.dta")
+hhgrid <- read_dta("mcs2_hhgrid.dta")
+```
+
+library(tidyverse)
+library(haven)
+
+setwd("/Users/liamwright/Documents/Data/MCS/3y")
+
+# 1. Load Data ----
+family <- read_dta("mcs2_family_derived.dta")
+cm <- read_dta("mcs2_cm_derived.dta")
+parent <- read_dta("mcs2_parent_derived.dta")
+parent_cm <- read_dta("mcs2_parent_cm_interview.dta")
+hhgrid <- read_dta("mcs2_hhgrid.dta")
+
+family
+cm
+parent
+parent_cm
+hhgrid
+
+# 2. Clean Data ----
+# family: BACTRY00 Country
+# cm: BDC08E00 Ethnicity 
+# parent_cm: BPOFRE00 Any parent reads to child 
+# parent_cm: BPPIAW00 Main / Secondary Carer warm relationship with child
+# parent: BDD05S00 NS-SEC for the family
+# parent: BDDNVQ00 Parental Education (NVQ)
+# hhgrid: BHCREL00 Relationship to CM
+
+df_ethnic_group <- cm %>%
+  select(MCSID, BCNUM00, ethnic_group = BDC08E00)
+
+df_country <- family %>%
+  select(MCSID, country = BACTRY00)
+
+df_reads <- parent_cm %>%
+  select(MCSID, BPNUM00, BCNUM00, BPOFRE00) %>%
+  mutate(parent_reads = case_when(between(BPOFRE00, 1, 3) ~ 1,
+                                  between(BPOFRE00, 4, 6) ~ 0)) %>%
+  drop_na() %>%
+  group_by(MCSID, BCNUM00) %>%
+  summarise(parent_reads = max(parent_reads),
+            .groups = "drop")
+
+df_warm <- parent_cm %>%
+  select(MCSID, BCNUM00, BELIG00, BPPIAW00) %>%
+  mutate(variable = ifelse(BELIG00 == 1, "main_warm", "secondary_warm"),
+         value = case_when(BPPIAW00 == 5 ~ 1,
+                           between(BPPIAW00, 1, 6) ~ 0)) %>%
+  select(MCSID, BCNUM00, variable, value) %>%
+  pivot_wider(names_from = variable, values_from = value)
+
+df_nssec <- parent %>%
+  select(MCSID, BPNUM00, parent_nssec = BDD05S00) %>%
+  mutate(parent_nssec = if_else(parent_nssec < 0, NA, parent_nssec)) %>%
+  drop_na() %>%
+  group_by(MCSID) %>%
+  summarise(family_nssec = min(parent_nssec))
+
+hhgrid %>% count(BHCREL00)
+hhgrid %>% count(BHPSEX00)
+
+df_mother <- hhgrid %>%
+  select(MCSID, BPNUM00, BHCREL00, BHPSEX00) %>%
+  filter(between(BPNUM00, 1, 99),
+         BHCREL00 == 7,
+         BHPSEX00 == 2) %>%
+  distinct(MCSID, BPNUM00) %>%
+  add_count(MCSID) %>%
+  filter(n == 1) %>%
+  select(MCSID, BPNUM00)
+
+df_mother_edu <- parent %>%
+  select(MCSID, BPNUM00, mother_nvq = BDDNVQ00) %>%
+  right_join(df_mother, by = c("MCSID", "BPNUM00")) %>%
+  select(-BPNUM00)
+
+df_ethnic_group
+df_country
+df_reads
+df_warm
+df_nssec
+df_mother_edu
+
+df_ethnic_group %>%
+  left_join(df_country, by = "MCSID") %>%
+  left_join(df_reads, by = c("MCSID", "BCNUM00")) %>%
+  left_join(df_warm, by = c("MCSID", "BCNUM00")) %>%
+  left_join(df_nssec, by = "MCSID") %>%
+  left_join(df_mother_edu, by = "MCSID")
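
The script above collapses parent-level rows to one row per family before joining them onto cohort-member-level data. The sketch below reproduces that pattern on made-up toy data; the `toy_*` objects and their values are invented for illustration and are not MCS data, though the column names mirror those used in the script.

```r
library(dplyr)

# Toy data (made up): one row per cohort member, with a twin pair in family "M1".
toy_cm <- tibble(MCSID = c("M1", "M1", "M2"),
                 BCNUM00 = c(1, 2, 1),
                 ethnic_group = c(1, 1, 3))

# Toy data (made up): one row per parent.
toy_parent <- tibble(MCSID = c("M1", "M1", "M2"),
                     BPNUM00 = c(1, 2, 1),
                     parent_nssec = c(3, 1, NA))

# Collapse parent-level rows to one row per family, as df_nssec does.
toy_family <- toy_parent %>%
  filter(!is.na(parent_nssec)) %>%
  group_by(MCSID) %>%
  summarise(family_nssec = min(parent_nssec), .groups = "drop")

# Family-level values are repeated for each cohort member in the family;
# families with no valid parent value get NA.
toy_merged <- toy_cm %>%
  left_join(toy_family, by = "MCSID")

# A left join onto one-row-per-family data should not change the row count.
nrow(toy_merged) == nrow(toy_cm)
```

Comparing row counts before and after each join is a cheap way to catch a lookup table that unexpectedly has more than one row per key.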