Skip to content

Excel hell

Steve Harris edited this page Mar 18, 2016 · 21 revisions

Start here

Computers are stupid (and colour blind)

  • do not store information by highlighting the cells, rows, or columns in colours
  • ditto for comments
  • keep a 'readme' sheet for notes to yourself
    • you might want to list each variable here with an explanatory note
  • row 1 is for column names (do not add extra rows to explain your column headings or categorise them)
  • every other row is for a unique entity (a patient, a day, an observation, a ward, a hospital, a bed)
  • each column is for a characteristics of that entity e.g.
    • a patient will have age, a gender, a systolic blood pressure etc
    • a hospital will have a name, a postcode, a number of beds etc
  • column names will be used in your code so they need to follow certain rules
    • white space is not allowed (maxHeight is OK, max Height is not) because when you write code the computer will assume new words are new names
    • use names that will make sense to you when you read it in your work (x is quick to write but could mean anything, maxHeight is pretty obvious)
    • concatenating words can make them impossible to read
      • lowestheartrateweek1 is an 'eyefull';
      • try lowest_heart_rate_week1 (uses an underscore, aka snake case)
      • or lowestHeartRateWeek1 (capitalise the first letter of words, aka camel case)

Others

  • one table per validation
  • data validation in excel
  • especially when creating factors
  • case sensitivity
  • don't merge cells
  • don't mix letters and numbers (beware adding units to numbers)
  • be clear about missing values and zeroes
    • blank is missing
  • multiple entries in one cell
  • dates and times in excel and conversion
    • recommendation for separate columns for each component
    • need part of the data tidying lecture to include handling dates and times following this recommendatino

Basic principles

  • define your unit of analysis (the row)
  • define characteristics of that unit (the column)
  • if something seems repetitive then consider a separate table (concept of relational data and keys); but for simple projects try hard to keep everything in a simple, square table
  • simple formatting principles
    • computers are colour blind (highlighting cells/rows is only good for you)
    • variable names
      • camel case vs snake case
      • avoid spaces/hyphens/funny chars
      • unique names
      • a naming system (i.e. bps, bpd, bpm might be easier to search/organise than sysbp, diastolicbp, meanbp) esp if bp1, bp2, bp3 ... which is amenable to tools for automatic translation from wide to long
    • row identifiers
    • standardise where possible (not y, YES, yes, true, 1, positive: they are all the same thing, choose any and stick with it)

Need some examples of badly done data with specific lessons

  • ragged wide data not long: before/after BP in a row is OK, but day 1-3 BP for one patient, and day 1-7 BP for the next isn't

Simple test

-can you save it to github as a CSV file and still make sense?

Clone this wiki locally