Commit c18e95d: Merge pull request #289 from Boston-area-Women-in-Bioinformatics/add-biobank-blog-04 (add blog post 4 in the biobank tutorial series)

---
publishDate: 2026-03-16T00:00:00-05:00
title: 'Biobank Intro Series: UK Biobank Observational Data (Part II)'
excerpt: 'Loading phenotype data in the UK Biobank RAP (Research Analysis Platform) environment'
slug: blog/biobank-intro-series/03-ukb-observational-data-partII
image: /blog_images/biobank1/ukb_fieldid_scent.png
imageAlt: 'Cat following wafts of fresh kibble onto a table. The scent trails are labeled with UK Biobank Field IDs.'
imageDescription: 'When you finally know which field IDs you need and suddenly the whole dataset smells delicious.'
imagePosition: top
hideHeroImage: false
author: Samantha J. Klasfeld, Ph.D.
authorUrl: 'https://linkedin.com/in/samantha-klasfeld'
category: Tutorial
series: 'Biobank Intro Series'
tags:
  - biobank
  - sql
  - ukb-rap
  - ehr-data
  - phenotype-data
draft: false
seo:
  image:
    src: '/blog_images/biobank1/ukb_fieldid_scent.png'
    alt: 'Cat following wafts of fresh kibble onto a table. The scent trails are labeled with UK Biobank Field IDs.'
---

When you get approved for a UK Biobank project, you are gifted a VIP pass to a secure data wonderland called the UK Biobank Research Analysis Platform (UKB RAP). The UKB RAP is a cloud-based venue (built on DNAnexus infrastructure) where you can spin up coding environments (JupyterLab, RStudio, take your pick) and analyze data without the nightmare of downloading 500,000+ participant records to your poor laptop.

I enjoy working in JupyterLab, but the concepts transfer regardless of your preferred environment. Opening JupyterLab on UKB RAP is straightforward: click "Tools" in the navigation menu and hit the teal "+ New JupyterLab" button. This opens a setup GUI where you can configure your compute specs. For most analyses, the defaults work just fine (see my tips on hardware in a [previous post](../02-hardwareonukbandaou)).

In this post, we'll focus on the phenotype data. Here I define phenotype data as essentially anything that is not genetic data. This includes not just questionnaire responses, physical measurements, and hospital records, but proteomics data as well.

To query the data, you need a fully qualified dataset reference combining your project ID (your workspace on RAP) and your dispensed dataset ID (the UK Biobank data object provisioned to that project). Below I have cheat codes for finding these values in Python and Bash.

Python:

```python
import dxpy
import glob  # used later for locating the extracted dictionary CSVs

# Find the dispensed dataset object in the project root
dispensed_dataset_id = dxpy.find_one_data_object(
    typename='Dataset',
    name='app*.dataset',
    folder='/',
    name_mode='glob'
)['id']

# Combine the project ID and dataset ID into a fully qualified reference
project_id = dxpy.find_one_project()["id"]
dataset = f"{project_id}:{dispensed_dataset_id}"
```

Bash:

```bash
# Get the project ID from the current dx environment
project_id=$(dx env --json | jq -r '.project')

# Get the dispensed dataset ID
dispensed_dataset_id=$(dx find data --name "app*.dataset" --brief)

# Combine them into a fully qualified dataset reference
dataset="${project_id}:${dispensed_dataset_id}"
```

With that in hand, let's get some data.

## Step 1: Find the field names for your data

<figure class="my-8 !max-w-none">
  <img src="/blog_images/biobank1/notebook_v_commandline.png" alt="Cat peering over a table in delight at two plates of kibble: one labeled 'Jupyter Notebook' and the other labeled 'Command Line'." />
  <figcaption class="text-center text-sm opacity-80 mt-2">
    <em>Two bowls, same feast: whether you import via Jupyter or the command line, you’re still getting the data you want.</em>
  </figcaption>
</figure>

With the list of field IDs you gathered from the UKB Showcase, your next step is to figure out their exact field names on the RAP (which aren't always identical to what the Showcase shows) and then extract the actual participant data for those fields. There are two different methods for this.

### Quick lookup via terminal

When I'm in "just get it working" mode (which, let's be honest, is most of research), I find the command-line approach faster for quick lookups. I simply list all the field names in the terminal and grep for the ones I need.

```bash
# List all participant field names and grep for the ones matching Field ID 22420
dx extract_dataset ${dataset} --entities participant --list-fields | grep "22420"
```

### Dictionary approach

The UKB RAP documentation will steer you toward extracting dictionary files (`*.dataset.data_dictionary.csv`), which map field IDs to field names, describe data coding schemes, and generally serve as the Rosetta Stone for the dataset. This approach is considered more "proper," and for good reason: it's reproducible, documentable, and plays nicely with notebooks.

The dictionary approach requires more setup: extracting CSVs, loading them into pandas, and writing filter logic. When you're exploring or just need a quick answer, I recommend the quick-and-dirty command-line approach. That said, there are times when the dictionaries are nice to have: they make your code easier to comprehend, and they are also useful for making sense of the data tables after extraction.

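As a sketch of that filter logic, here is roughly what looking up the RAP field names for a Showcase field ID looks like in pandas. The stand-in dataframe below mimics a dispensed data dictionary; the column names (`entity`, `name`, `title`) and the example rows are assumptions for illustration, so check them against your own `*.dataset.data_dictionary.csv`:

```python
import pandas as pd

# Stand-in for a dispensed data dictionary; in practice you would load it with
# pd.read_csv(glob.glob("*.dataset.data_dictionary.csv")[0])
dict_df = pd.DataFrame({
    "entity": ["participant", "participant", "participant"],
    "name": ["eid", "p22420_i2", "p4080_i0_a0"],
    "title": ["Participant ID", "Field 22420 | Instance 2", "Systolic blood pressure"],
})

# Keep participant fields whose RAP name contains the Showcase field ID
lvef_fields = dict_df[
    (dict_df["entity"] == "participant")
    & dict_df["name"].str.contains("22420", na=False)
]["name"].tolist()
# → ['p22420_i2']
```

The `na=False` guard matters because dictionary CSVs can contain empty cells, and `str.contains` raises on NaN otherwise.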
## Step 2: Extract the dataset

To extract the actual dataset values, use DNAnexus' `extract_dataset` command with the `--fields` flag set to the relevant field names:

```bash
# Pull participant ID plus the two instances of Field 22420 into a CSV,
# using the ${dataset} reference assembled earlier
dx extract_dataset "${dataset}" \
  --fields participant.eid,participant.p22420_i2,participant.p22420_i3 \
  --delimiter "," \
  --output lvef_pheno.csv
```

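Once the CSV lands, a typical first move is to combine the instances. A minimal sketch, assuming the `entity.field` column naming that matches the `--fields` list above (the stand-in dataframe below substitutes for `pd.read_csv("lvef_pheno.csv")`, and the fallback rule is my own illustration, not a UKB recommendation):

```python
import pandas as pd

# Stand-in for the extracted table; in practice: pd.read_csv("lvef_pheno.csv")
pheno = pd.DataFrame({
    "participant.eid": [1000001, 1000002],
    "participant.p22420_i2": [58.2, None],
    "participant.p22420_i3": [57.0, 61.5],
})

# Prefer the instance-2 measurement, falling back to instance 3 when it is missing
pheno["lvef"] = pheno["participant.p22420_i2"].fillna(pheno["participant.p22420_i3"])
# → [58.2, 61.5]
```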
## Step 3: Translate the dataset values

The data you just extracted is often coded, meaning the raw values or numbers are not immediately interpretable. Therefore, to make human analysis easier, it is sometimes helpful to translate these codes to their definitions.

**Example: Filtering cardiomyopathy diagnoses from ICD10 codes**

Working on a project to understand the genetic architecture of cardiomyopathy, I needed to identify participants with cardiomyopathy diagnoses. The Showcase told me International Classification of Disease version 10 (ICD-10) diagnosis codes were Field ID 41270.

I extracted the field names easily enough with `dx` commands. However, the raw data returned value codes like "I42" or "I420", which are meaningless without context.

Fortunately, the UKB Showcase maintains data coding tables for each of its coded data fields. Specifically, ICD-10 diagnosis codes use data coding 19 in UKB; similarly, ICD-9 codes are found in coding 87.

You could also extract all the coding dictionaries once and have them ready as searchable dataframes. For example:

```python
import glob
import subprocess

import pandas as pd

# Extract all dictionary files once with the -ddd flag
cmd = ["dx", "extract_dataset", dataset, "-ddd", "--delimiter", ","]
subprocess.check_call(cmd)

# Load the codings dictionary
codings_df = pd.read_csv(glob.glob("*.codings.csv")[0])

# Find which ICD-10 codes (data coding 19) mean "cardiomyopathy"
icd10_coding = codings_df[codings_df['coding_name'] == "data_coding_19"]
cardiomyopathy_codes = icd10_coding[
    icd10_coding['meaning'].str.contains('cardiomyopathy', case=False, na=False)
][["meaning", "code"]]
```

Now you can filter your extracted data to only participants with those specific codes, without tab-switching back to the Showcase every time you need to decode something.

## The Gotchas Nobody Tells You

**Authentication weirdness:** Sometimes `dx` commands work fine in the terminal but throw mysterious errors when called through `subprocess` in Jupyter. I've never pinned down exactly why. It's possible that the Jupyter notebook does not inherit the same environment variables as your interactive terminal session. Either way, when you hit this, just run the command in the terminal instead.

**Spark is not optional for big extractions:** If you're pulling more than ~30 fields, you'll need a Spark cluster. The [UKB RAP documentation](https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-data/accessing-phenotypic-data) covers this, but fair warning: Spark uses lazy evaluation, which means errors can show up way downstream from where they actually originated. Fun times.

## Wrapping Up

That's the full pipeline: find your field names, extract the data, and decode any coded values. Once you've done it a couple of times, the whole process takes minutes.

In the next post, we'll look at how to do the same thing on the All of Us Researcher Workbench.
