This page describes the steps we applied to get from the raw data provided by UK Biobank to the data used to run the GWAS.
- Create a new VM on google cloud, install R and git.
- Clone the PHESANT repository.
- Run `reengineering_phenofile_pharma_exomes.r`:
  - Remove cancer variables. These have up to 31 visits and should be considered separately.
  - Restrict to the first visit (we made the assumption that the first visit contained the least missing data).
  - Prepend each column name with `x` and replace all instances of `-` and `.` with `_`, for compatibility with PHESANT.
  - Restrict to the subset of samples that are output by the genetic data QC pipeline.
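
The column handling in the steps above can be sketched as follows. This is a minimal Python illustration, not the R script itself. It assumes raw column names follow the pattern `<field>-<visit>.<array>` (for example `50-0.0`) and that the first visit has index `0`. The function names and the field ID `99999` are hypothetical placeholders introduced only for this example.

```python
import re

# Assumed raw column format: "<field>-<visit>.<array>", e.g. "50-0.0".
RAW_COLUMN = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def phesant_name(column):
    """Prepend 'x' and replace every '-' and '.' with '_'."""
    return "x" + re.sub(r"[-.]", "_", column)


def reengineer_header(columns, cancer_fields=frozenset()):
    """Drop cancer fields, keep first-visit columns, then rename the rest.

    Columns that do not match the assumed pattern are kept and renamed.
    `cancer_fields` is a hypothetical set of field IDs to drop.
    """
    kept = []
    for column in columns:
        match = RAW_COLUMN.match(column)
        if match:
            field, visit, _array = match.groups()
            if field in cancer_fields:
                continue  # cancer variables are handled separately
            if visit != "0":
                continue  # keep the first visit only (assumed index 0)
        kept.append(phesant_name(column))
    return kept


columns = ["50-0.0", "50-1.0", "99999-0.0", "20002-0.12"]
print(reengineer_header(columns, cancer_fields={"99999"}))
# prints ['x50_0_0', 'x20002_0_12']
```

The sample restriction is not shown here; it amounts to keeping only the rows whose sample ID appears in the output of the genetic data QC pipeline.
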
- The resultant phenotype file, restricted to the individuals for whom we have clean genetic data, is then passed to PHESANT. A summary of how PHESANT parses the raw phenotype data is shown here.
- PHESANT inputs
  - `variablelistfile`: `outcome_info_final_round2.tsv`.
    - As described in the README for PHESANT, see the Variable information file heading.
      - The most important column is `EXCLUDED`. If there is any text in a cell in this column, the corresponding variable is excluded. `CAT_MULT_INDICATOR_FIELDS` is also important: it describes who to include as controls for this phenotype, according to the rules explained in the README for PHESANT.
  - `datacodingfile`: `data-coding-ordinal-info.txt`. See the README for PHESANT for details of its format.
  - Our default settings for filters:
    - `--catmultcutoff` `50`. The cutoff for exclusion when creating dichotomous variables for CAT-MULTIPLE.
    - `--catordnacutoff` `500`. The cutoff for exclusion for number of non-NAs in ordered categorical variables.
    - `--catunordnacutoff` `5000`. The cutoff for exclusion for number of non-NAs in unordered categorical variables.
    - `--contnacutoff` `5000`. The cutoff for exclusion for number of non-NAs in continuous variables.
    - `--binnacutoff` `5000`. The cutoff for exclusion for number of non-NAs in binary variables.
    - `--bintruecutoff` `100`. The cutoff for exclusion for numbers of members of a category in binary variables.
    - `--mincategorysize` `10`. The minimum number of samples in a category for categorical single, integer, and continuous variables.
    - `--maxunorderedcategories` `1000`. The maximum number of categories in an unordered categorical variable.
    - `--propforcontinuous` `0.2`. The cutoff for proportion of samples with the same value for the variable to not be considered continuous.
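
As a rough illustration of how the two binary-variable cutoffs might act together, the sketch below applies the default values of `--binnacutoff` (5000) and `--bintruecutoff` (100) to a single variable. It is written in Python and is not PHESANT's own code. It assumes that a variable is excluded when its number of non-missing values, or the size of its smaller category, falls below the respective cutoff. The function name and the use of `None` for missing values are choices made for this example.

```python
from collections import Counter

# Default values quoted from the filter settings above.
BIN_NA_CUTOFF = 5000    # --binnacutoff
BIN_TRUE_CUTOFF = 100   # --bintruecutoff


def binary_variable_passes(values, na_cutoff=BIN_NA_CUTOFF, true_cutoff=BIN_TRUE_CUTOFF):
    """Return True if a binary variable survives both cutoffs.

    `values` holds one entry per sample; None marks a missing value.
    Assumption: exclusion happens when a count is below its cutoff.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < na_cutoff:
        return False  # too few non-missing samples
    counts = Counter(observed)
    if len(counts) < 2:
        return False  # only one category observed, so the other is empty
    return min(counts.values()) >= true_cutoff


# 5000 non-missing samples, but only 50 in the smaller category.
print(binary_variable_passes([1] * 4950 + [0] * 50 + [None] * 1000))  # prints False
# 5000 non-missing samples with 100 in the smaller category.
print(binary_variable_passes([1] * 4900 + [0] * 100))  # prints True
```

The other `--*nacutoff` settings can be read the same way under the same assumption, each applied to the non-missing count of the corresponding variable type.
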
- We run PHESANT in chunks on the cloud.
- We then run the `summarise_phenotypes.r` functions to obtain a summary file of the phenotypes that made it into the final file.