-
Notifications
You must be signed in to change notification settings - Fork 6
Excel hell
Steve Harris edited this page Mar 18, 2016
·
21 revisions
- do not store information by highlighting the cells, rows, or columns in colours
- ditto for comments
- keep a 'readme' sheet for notes to yourself
- you might want to list each variable here with an explanatory note
- row 1 is for column names (do not add extra rows to explain your column headings or categorise them)
- every other row is for a unique entity (a patient, a day, an observation, a ward, a hospital, a bed)
- each column is for a characteristics of that entity e.g.
- a patient will have age, a gender, a systolic blood pressure etc
- a hospital will have a name, a postcode, a number of beds etc
- column names will be used in your code so they need to follow certain rules
- white space is not allowed (
maxHeightis OK,max Heightis not) because when you write code the computer will assume new words are new names - use names that will make sense to you when you read it in your work (
xis quick to write but could mean anything,maxHeightis pretty obvious) - concatenating words can make them impossible to read
-
lowestheartrateweek1is an 'eyefull'; - try
lowest_heart_rate_week1(uses an underscore, aka snake case) - or
lowestHeartRateWeek1(capitalise the first letter of words, aka camel case)
-
- white space is not allowed (
- one table per validation
- data validation in excel
- especially when creating factors
- case sensitivity
- don't merge cells
- don't mix letters and numbers (beware adding units to numbers)
- be clear about missing values and zeroes
- blank is missing
- multiple entries in one cell
- dates and times in excel and conversion
- recommendation for separate columns for each component
- need part of the data tidying lecture to include handling dates and times following this recommendatino
- define your unit of analysis (the row)
- define characteristics of that unit (the column)
- if something seems repetitive then consider a separate table (concept of relational data and keys); but for simple projects try hard to keep everything in a simple, square table
- simple formatting principles
- computers are colour blind (highlighting cells/rows is only good for you)
- variable names
- camel case vs snake case
- avoid spaces/hyphens/funny chars
- unique names
- a naming system (i.e. bps, bpd, bpm might be easier to search/organise than sysbp, diastolicbp, meanbp) esp if bp1, bp2, bp3 ... which is amenable to tools for automatic translation from wide to long
- row identifiers
- standardise where possible (not y, YES, yes, true, 1, positive: they are all the same thing, choose any and stick with it)
Need some examples of badly done data with specific lessons
- ragged wide data not long: before/after BP in a row is OK, but day 1-3 BP for one patient, and day 1-7 BP for the next isn't
Simple test
-can you save it to github as a CSV file and still make sense?
Please contact Steve Harris if you have any questions.