-
Notifications
You must be signed in to change notification settings - Fork 6
Excel hell
#“Friends don’t let friends enter bad data”
##Introduction Spread sheets offer a simple and effective method to collect data. However, their ease of use often leads to a sloppy approach to data entry. Good data discipline is an essential foundation to data analysis and a small investment at this stage can prevent you running into trouble when trying to analyse and publish your study. This tutorial will take you through the essential dos and don’ts of managing your data in a spreadsheet.
As a word of caution, don’t forget, that any data collection should be inline with the Data Protection Act. Please read up here and ensure you manage your data appropriately.
##Which spreadsheet to use? There are many different spreadsheets to use. Almost everyone will be familiar with excel, although it’s worth considering if this is actually the right choice for you. If you are planning a major collaboration, where many people need to enter data simultaneously, then an excel spreadsheet is probably not for you. Have a look at google sheets as an alternative. As your requirements grow, you may want to explore a relational database. But for now, we’ll assume you are using excel or google sheets.
##Data types In computing terms, not all data is equal. Whilst excel will make attempts to guess what kind of data you are entering, it isn’t particularly good at it. The following are some of the main data types, and why they are important.
###Integers & Decimals Integers are any whole number. Decimals include any number with a decimal point. Computers apply maths differently to these two numerical data types, so its best practice to be consistent.
###Strings A string is any sequence of characters. Literally anything you can type can be represented as a string. This is the default data type in excel. This becomes really important when analysing your data. Take the example of logging the dose of analgesia given. If we write “180 mg”, excel will store this value as a string. It has no choice, since the “mg” adds data that cannot possibly be interpreted numerically. When dealing with data that is numerical in nature, it’s best practice to keep all units elsewhere and keep the data as an integer or decimal.
###Date/Time objects This is essentially a special case of the decimal (sometimes referred to as a double). Most dates are simply numbers counted up from a point in time (usually Jan 1st 1970). Time is given as a division of a day. When storing date and time information, it is often easier to separate out the two. I.e store date in a single column and time in another.
###Booleans These are True or False statements. Usually also described as 1 or 0. False is ALWAYS 0 or NULL (which means no data is present). Any non 0 value, could be interpreted as True in this context, but the convention is to use 1.
##Designing your Spreadsheet Spend some time to develop your spreadsheet before you start collecting data. Here are some simple rules to follow to help get it right:
- Define your unit of analysis (the row). What is the most basic event that we are reviewing? A common mistake is use each row as a specific patient, and keep extending into more columns are you acquire more data. Each row should represent an “episode” in your data, this is not necessarily a patient. -- Define characteristics of that unit (the column) -- Naming columns. When you import data into R, it’s useful to have your naming conventions already sorted. “Blood Pressure” may seem intuitive and easy to read, however it can come unstuck later in the analysis as computers don’t tend to like interpreting spaces. The accepted standards are: —- Camel case: BloodPressure, PatientNumber, HospitalNumber —- Snake case: blood_pressure, patient_number, hospital_number If you can avoid spaces/hyphens/funny characters, you will be better served later.
- Make sure all your column names are completely unique. If you start recording more of the same metric (i.e. blood_pressure_1, blood_pressure_2 etc.) have you considered whether or not this should be an entirely new row?
- Don’t use formatting to store data. Computers are colour blind, so any formatting you apply serves no function other than to highlight things to you. This is fine if it helps you interpret your data, but if you are storing information by turning certain cells red, you need to rethink your structure.
a naming system (i.e. bps, bpd, bam might be easier to search/organise than sysop, diastolicbp, meanbp) esp if bp1, bp2, bp3 … which is amenable to tools for automatic translation from wide to long
row identifiers
In many audits, data tends to be binary. Something is true or not true, a standard it met or not. Try to apply this consistently. It doesn’t necessarily matter what you choose, but stick with it all the way through: — y / n — Y / N — YES / NO — yes / no — Yes / No — TRUE / FALSE — 1 / 0 These are all the same thing. Our advice would be to use 1 and 0, or TRUE and FALSE.
Create data validation.
Keep units in your column titles.
Need some examples of badly done data with specific lessons
ragged wide data not long: before/after BP in a row is OK, but day 1-3 BP for one patient, and day 1-7 BP for the next isn’t Simple test
if something seems repetitive then consider a separate table (concept of relational data and keys)
-can you save it to github as a CSV file and still make sense?
Things not to do
one table per validation data validation in excel especially when creating factors case sensitivity don’t merge cells don’t mix letters and numbers (beware adding units to numbers) be clear about missing values and zeroes blank is missing multiple entries in one cell dates and times in excel and conversion recommendation for separate columns for each component need part of the data tidying lecture to include handling dates and times following this recommendation Data out from excel
Option 1. google sheet 2. text file (css or tab): highlight issues with commas
Computers are stupid (and colour blind)
do not store information by highlighting the cells, rows, or columns in colours ditto for comments keep a ‘ready’ sheet for notes to yourself you might want to list each variable here with an explanatory note row 1 is for column names (do not add extra rows to explain your column headings or categorise them) every other row is for a unique entity (a patient, a day, an observation, a ward, a hospital, a bed) each column is for a characteristics of that entity e.g. a patient will have age, a gender, a systolic blood pressure etc a hospital will have a name, a postcode, a number of beds etc column names will be used in your code so they need to follow certain rules white space is not allowed (maxHeight is OK, max Height is not) because when you write code the computer will assume new words are new names use names that will make sense to you when you read it in your work (x is quick to write but could mean anything, maxHeight is pretty obvious) concatenating words can make them impossible to read lowestheartrateweek1 is an ‘eyeful’; try lowest_heart_rate_week1 (uses an underscore, aka snake case) or lowestHeartRateWeek1 (capitalise the first letter of words, aka camel case)
Be more prescriptive in our use of 1 and 0 as bools Summary characteristics Take care of writing formulas when exporting to CSV Don't start your data analysis early - i.e. no totals etc. Construct course feedback excel sheet - handover to Ahmed Index / Key Many examples of "rows" as entities. Relational data ?TidyR
Please contact Steve Harris if you have any questions.