Commit ed2a30c

committed
source commit: 479096c
0 parents  commit ed2a30c

57 files changed

+1785
-0
lines changed

00-intro.md

Lines changed: 215 additions & 0 deletions
@@ -0,0 +1,215 @@
1+
---
2+
title: Using spreadsheet programs for data organisation
3+
teaching: 10
4+
exercises: 5
5+
authors:
6+
- Jez Cope
7+
- Christie Bahlai
8+
- Aleksandra Pawlik
9+
contributors:
10+
- Jennifer Bryan
11+
- Alexander Duryee
12+
- Jeffrey Hollister
13+
- Daisie Huang
14+
- Owen Jones
15+
- Clare Sloggett
16+
- Harriet Dashnow
17+
- Ben Marwick
18+
- Sherry Lake
19+
---
20+
21+
::::::::::::::::::::::::::::::::::::::: objectives
22+
23+
- Understand some drawbacks and advantages of using spreadsheet programs
24+
- Distinguish machine-readable tidy data from data that is easy for humans to read
25+
26+
::::::::::::::::::::::::::::::::::::::::::::::::::
27+
28+
:::::::::::::::::::::::::::::::::::::::: questions
29+
30+
- What are good data practices for using spreadsheets for organising data?
31+
32+
::::::::::::::::::::::::::::::::::::::::::::::::::
33+
34+
:::::::::::::::::::::::::::::::::::::::: instructor
35+
36+
### Narrative Guidance
37+
38+
- Introduce that we're teaching data organisation, and that we're using
39+
spreadsheets, because most people do data entry in spreadsheets or
40+
have data in spreadsheets.
41+
- Emphasize that we are teaching good practice in data organisation and that
42+
this is the foundation of their research practice. Without organised and clean
43+
data, it will be difficult for them to apply the things we're teaching in the
44+
rest of the workshop to their data.
45+
- Much of their time as researchers will be spent on this 'data wrangling' stage, but
46+
some of it can be prevented with good strategies for data collection up front.
47+
- Explain that we're not teaching data analysis or plotting in spreadsheets, because it's
48+
very manual and also not reproducible. That's why we're teaching SQL, R, Python!
49+
- Now let's talk about spreadsheets, and when we say spreadsheets, we mean any program that
50+
does spreadsheets like Excel, LibreOffice, OpenOffice. Most learners are probably using Excel.
51+
- Ask the audience about things they've accidentally done in spreadsheets. Talk about an example of your own, such as accidentally sorting only a single column and not the rest
52+
  of the data in the spreadsheet. What are the pain points?
53+
- As people answer, highlight some of these issues with spreadsheets.
54+
55+
56+
:::::::::::::::::::::::::::::::::::::::::::::::::::
57+
58+
Good **data organisation** is the foundation of much of our day-to-day
59+
work in libraries. Most **librarians** have data or do data entry in
60+
spreadsheets. Spreadsheet programs are very **useful graphical
61+
interfaces** for designing data tables and handling very basic data
62+
quality control functions.
63+
64+
Spreadsheets encompass a lot of the things we need
65+
to be able to do as librarians. We can use them for:
66+
67+
- Data entry
68+
- Organising data
69+
- Subsetting and sorting data
70+
- Statistics
71+
- Plotting
72+
73+
::::::::::::::::::::::::::::::::::::::::: callout
74+
75+
## Jargon busting (Optional, not included in timing)
76+
The [Jargon Busting exercise](jargon_busting.md) is a helpful way to begin to explore terms, phrases, and ideas related to code and software development.
77+
78+
:::::::::::::::::::::::::::::::::::::::: instructor
79+
This exercise can be useful when you teach Tidy Data as the introduction to a full LC workshop, especially if you want learners to have an opportunity to meet each other and interact. It can take anywhere from 10 to 45 minutes, depending on your approach.
80+
::::::::::::::::::::::::::::::::::::::::::::::::::
81+
82+
::::::::::::::::::::::::::::::::::::::::::::::::::
83+
84+
### Spreadsheet outline
85+
86+
In this lesson, we will look at:
87+
88+
- Good data entry practices - formatting data tables in spreadsheets
89+
- How to avoid common formatting mistakes
90+
- Dates as data - beware!
91+
- Basic quality control and data manipulation in spreadsheets
92+
- Exporting data from spreadsheets
93+
94+
**Much of your time when you're producing a report will be spent in
95+
this 'data wrangling' stage.** It's not the most fun, but it's
96+
necessary. We'll teach you how to think about data organisation and
97+
some practices for more effective data wrangling.
98+
99+
***
100+
101+
### What this lesson will not teach you
102+
103+
- How to do *statistics* in a spreadsheet
104+
- How to do *plotting* in a spreadsheet
105+
- How to *write code* in spreadsheet programs
106+
107+
If you're looking to do this, a good reference is
108+
[Microsoft Excel 365 Bible](https://search.worldcat.org/en/title/1263023438).
109+
110+
***
111+
112+
### Why aren't we teaching data analysis in spreadsheets?
113+
114+
- Data analysis in spreadsheets usually requires **a lot of manual
115+
work**. If you want to change a parameter or run an analysis with a
116+
new dataset, you usually have to redo everything by hand. (We do
117+
know that you can create macros, but see the next point.)
118+
119+
- It is also difficult to **track or reproduce statistical or plotting
120+
analyses** done in spreadsheet programs when you want to go back to
121+
your work or someone asks for details of your analysis.
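
To make the contrast concrete, here is a minimal sketch of the same idea in a script rather than a spreadsheet. The records, field names, and the `mean_loans` function are all made up for illustration; they are not part of this lesson's data. The point is only that changing a parameter or swapping in a new dataset means changing one line and re-running, and that the script itself is a record of what was done.

```python
from statistics import mean


def mean_loans(records, min_year):
    """Mean number of loans for items published in or after min_year."""
    loans = [row["loans"] for row in records if row["year"] >= min_year]
    return mean(loans) if loans else None


# Made-up catalogue records, for illustration only.
catalogue_a = [
    {"title": "Item 1", "year": 2015, "loans": 4},
    {"title": "Item 2", "year": 2019, "loans": 10},
    {"title": "Item 3", "year": 2021, "loans": 7},
]
catalogue_b = [
    {"title": "Item 4", "year": 2020, "loans": 2},
    {"title": "Item 5", "year": 2022, "loans": 6},
]

# Changing the parameter or the dataset is a one-line change, not a manual redo.
print(mean_loans(catalogue_a, min_year=2018))
print(mean_loans(catalogue_a, min_year=2010))
print(mean_loans(catalogue_b, min_year=2018))
```

Because every step is written down, anyone who asks how a number was produced can read or re-run the script.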
122+
123+
### Spreadsheet programs
124+
125+
There are a number of spreadsheet programs available for use on the desktop or in a web browser:
126+
127+
- LibreOffice Calc
128+
- Microsoft Excel
129+
- Apple Numbers
130+
- Google Sheets
131+
- Gnumeric
132+
- Apache OpenOffice Calc
133+
134+
Commands may differ a bit between programs, but the general idea
135+
is the same. In this lesson, we will assume that you are most likely using Excel as
136+
your primary spreadsheet program, since it is the package you're most likely to have available on your work computer. The other programs listed above offer similar functionality.
137+
138+
***
139+
140+
::::::::::::::::::::::::::::::::::::::: challenge
141+
142+
## Questions:
143+
144+
- How many people have used spreadsheets in their work?
145+
- What kind of operations do you do in spreadsheets?
146+
- Which ones do you think spreadsheets are good for?
147+
148+
149+
::::::::::::::::::::::::::::::::::::::::::::::::::
150+
151+
***
152+
153+
::::::::::::::::::::::::::::::::::::::: challenge
154+
155+
## Question
156+
157+
- Spreadsheets can be very useful, but they can also be frustrating and sometimes even give us incorrect results. What are some things that you've accidentally done in a spreadsheet, or been frustrated that you couldn't do easily?
158+
159+
160+
::::::::::::::::::::::::::::::::::::::::::::::::::
161+
162+
***
163+
164+
## Problems with Spreadsheets
165+
166+
Spreadsheets are **good for data entry**, but in reality we **tend to
167+
use spreadsheet programs for much more** than data entry. We use them
168+
to create data tables for publications, to generate summary
169+
statistics, and to make figures.
170+
171+
Generating **tables for reports** in a spreadsheet is not optimal -
172+
often, when formatting a data table for publication, we're reporting
173+
key summary statistics in a way that is **not really meant to be read
174+
as data**, and often involves **special formatting** (merging cells,
175+
creating borders, making it pretty). We advise you to do this sort of
176+
operation within your document editing software.
177+
178+
The latter two applications, **generating statistics and figures**, should
179+
be used with caution: because of the graphical, drag and drop nature of
180+
spreadsheet programs, it can be very difficult, if not impossible, to
181+
replicate your steps (much less retrace anyone else's), particularly if your
182+
stats or figures require you to do more complex calculations. Furthermore,
183+
in doing calculations in a spreadsheet, it's easy to accidentally apply a
184+
slightly different formula to multiple adjacent cells. When using a
185+
command-line based statistics program like R or SAS, it's practically
186+
impossible to accidentally apply a calculation to one observation in your
187+
dataset but not another unless you're doing it on purpose.
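
The lesson names R and SAS here; the sketch below uses Python only to illustrate the same point, and the overdue-fee rule, rate, and loan records are hypothetical. The formula is defined once and applied to every row by the same line of code, so no row can quietly end up with a slightly different calculation.

```python
DAILY_RATE = 0.25  # hypothetical fee per overdue day


def overdue_fee(days_overdue):
    """Fee for one loan; the same formula is used for every row."""
    return days_overdue * DAILY_RATE


# Made-up loan records, for illustration only.
loans = [
    {"item": "b1001", "days_overdue": 3},
    {"item": "b1002", "days_overdue": 0},
    {"item": "b1003", "days_overdue": 10},
]

# One expression covers all observations; there is no per-cell formula to mistype.
fees = [overdue_fee(row["days_overdue"]) for row in loans]

for row, fee in zip(loans, fees):
    print(row["item"], fee)
```

In a spreadsheet, the equivalent would be a column of individually editable formulas, any one of which could differ from its neighbours without being noticed.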
188+
189+
### Using Spreadsheets for Data Entry and Cleaning
190+
191+
**HOWEVER**, there are circumstances where you might want to use a
192+
spreadsheet program to produce "quick and dirty" calculations or
193+
figures, and some of these features can be used in **data cleaning**,
194+
prior to importation into a statistical analysis program. We will show
195+
you how to use some features of spreadsheet programs to check your
196+
data quality along the way and produce preliminary summary statistics.
197+
198+
In this lesson, we're going to talk about:
199+
200+
1. [Formatting data tables in spreadsheets](01-format-data.md)
201+
2. [Formatting problems](02-common-mistakes.md)
202+
3. [Dates as data](03-dates-as-data.md)
203+
4. [Basic quality control and data manipulation in spreadsheets](04-quality-control.md)
204+
5. [Exporting data from spreadsheets](05-exporting-data.md)
205+
6. [Data export formats caveats](06-data-formats-caveats.md)
206+
207+
208+
:::::::::::::::::::::::::::::::::::::::: keypoints
209+
210+
- We will discuss good practices for data entry and formatting
211+
- We will not discuss analysis or visualisation
212+
213+
::::::::::::::::::::::::::::::::::::::::::::::::::
214+
215+
