Skip to content

Excel hell

DocEd edited this page Apr 24, 2016 · 21 revisions

#“Friends don’t let friends enter bad data”

##Introduction Spread sheets offer a simple and effective method to collect data. However, their ease of use often leads to a sloppy approach to data entry. Good data discipline is an essential foundation to data analysis and a small investment at this stage can prevent you running into trouble when trying to analyse and publish your study. This tutorial will take you through the essential dos and don’ts of managing your data in a spreadsheet.

As a word of caution, don’t forget, that any data collection should be inline with the Data Protection Act. Please ensure you manage your data appropriately.

##Which application to use? There are many different spreadsheets to use. Almost everyone will be familiar with excel, although it’s worth considering if this is actually the right choice for you. If you are planning a major collaboration, where many people need to enter data simultaneously, then an excel spreadsheet is probably not for you. Have a look at google sheets as an alternative. As your requirements grow, you may want to explore a relational database. But for now, we’ll assume you are using excel.

##Data types In computing terms, not all data is equal. It is important to tell excel what kind of data you are storing in each cell. Whilst excel will make attempts to guess what kind of data you are entering, it isn’t particularly good at it. The following are some of the main data types, and why they are important.

###Integers & Decimals Integers are any whole number. Decimals include any number with a decimal point. Computers apply maths differently to these two numerical data types, so its best practice to be consistent.

###Strings A string is any sequence of characters. Literally anything you can type can be represented as a string. This is the default data type in excel. This becomes really important when analysing your data. Take the example of recording the dose of analgesia given for postoperative pain relief. If we write “180 mg”, excel will store this value as a string. It has no choice, since the “mg” adds data that cannot possibly be interpreted numerically. When dealing with data that is numerical in nature, it’s best practice to keep all units elsewhere and keep the data as an integer or decimal.

###Date/Time objects This is essentially a special case of the decimal (sometimes also referred to as a double). Most dates are simply numbers counted up from a point in time (usually Jan 1st 1970). Time is given as a decimal of a single day. When storing date and time information, it is often easier to separate out the two. I.e store date in a single column and time in another.

###Booleans These are True or False statements. Usually also described as 1 or 0. False is ALWAYS 0 or NULL (which means no data is present). Any non 0 value, could be interpreted as True in this context, but the convention is to use 1. When storing any true or false data, we recommend using 1 and 0 as a convention.

##Designing your Spreadsheet Spend some time to develop your spreadsheet before you start collecting data. Here are some simple rules to follow to help get it right:

  • Define your unit of analysis (the row). What is the most basic event that we are reviewing? A common mistake is use each row as a specific patient, and keep extending into more columns are you acquire more data. Each row should represent an “episode” in your data, this is not necessarily a patient.
  • Define characteristics of that unit (the column)
  • Naming columns. When you import data into R, it’s useful to have your naming conventions already sorted. “Blood Pressure” may seem intuitive and easy to read, however it can come unstuck later in the analysis as computers don’t tend to like interpreting spaces. The accepted standards are:
  • Camel case: BloodPressure, PatientNumber, HospitalNumber — Snake case: blood_pressure, patient_number, hospital_number If you can avoid spaces/hyphens/funny characters, you will be better served later.
  • Make sure all your column names are completely unique. If you start recording more of the same metric (i.e. blood_pressure_1, blood_pressure_2 etc.) have you considered whether or not this should be an entirely new row?
  • Don’t use formatting to store data. Computers are colour blind, so any formatting you apply serves no function other than to highlight things to you. This is fine if it helps you interpret your data, but if you are storing information by turning certain cells red, you need to rethink your structure.
  • Develop a naming system (i.e. bps, bpd, bam might be easier to search/organise than sysop, diastolicbp, meanbp)
  • In many audits, data tends to be binary. Something is true or not true, a standard it met or not. We would advocate using 1 and 0 to represent True and False respectively.
  • Use data validation. Data validation restricts the options you have when entering data into a cell. Typing "C-Section" and "Caesarian Section" and "c-section" are all considered different values, when we want them to be interpreted as the same event. Data validation keeps data entry consistent and avoids this issue.
  • Keep units in your column titles. Any units used in your cell will automatically convert a numerical value into a string. It also creates a lot of duplication in typing which is unnecessary.
  • Don't start using excel for your data analysis. Keep it purely for gathering your raw data. If you start adding rows for summary counts, averages or basic statistics, it will make it harder to unpick raw data from statistic when importing to R.
  • Can you save it to github as a CSV file and still make sense?
  • Watch your case. Using data validation will help keep everything consistent
  • Don’t merge cells under any circumstances.
  • Be clear about missing values and zeroes
  • Do not enter multiple entries in one cell

##Exporting to CSV Comma separated values, or .CSV files are a very basic type of file that stores information in a universally accessible way. .xls files are proprietary to microsoft and can only be opened correctly by a particular version of excel. Being able to convert this file into a .CSV file is a vital part of working with your data in a different application.

row 1 is for column names (do not add extra rows to explain your column headings or categorise them) every other row is for a unique entity (a patient, a day, an observation, a ward, a hospital, a bed) each column is for a characteristics of that entity e.g. a patient will have age, a gender, a systolic blood pressure etc a hospital will have a name, a postcode, a number of beds etc column names will be used in your code so they need to follow certain rules white space is not allowed (maxHeight is OK, max Height is not) because when you write code the computer will assume new words are new names use names that will make sense to you when you read it in your work (x is quick to write but could mean anything, maxHeight is pretty obvious) concatenating words can make them impossible to read lowestheartrateweek1 is an ‘eyeful’; try lowest_heart_rate_week1 (uses an underscore, aka snake case) or lowestHeartRateWeek1 (capitalise the first letter of words, aka camel case)

Issues

Be more prescriptive in our use of 1 and 0 as bools Summary characteristics Take care of writing formulas when exporting to CSV Don't start your data analysis early - i.e. no totals etc. Construct course feedback excel sheet - handover to Ahmed Index / Key Many examples of "rows" as entities. Relational data ?TidyR

Clone this wiki locally