Commit 06640a5

chennesy authored and jt14den committed
add tidy exercises
1 parent d65809f commit 06640a5

1 file changed

+93
-0
lines changed

episodes/tidy.md

Lines changed: 93 additions & 0 deletions
@@ -56,6 +56,7 @@ To address this we can reshape our data in a long format. This is sometimes call
## Tidy Data

Tidy data is a standard way of organizing data values within a dataset, making it easier to work with. Here are the key principles of tidy data:

1. Every column holds a single variable, like "month" or "temperature."
2. Every row represents a single observation, like circulation counts by branch and month.
3. Every cell contains a single value.
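
As a rough illustration of these three principles, the sketch below reshapes a small wide table into a long one with `DataFrame.melt()`. The branch names and numbers are made up for the example and are not taken from the lesson's dataset.

```python
import pandas as pd

# Hypothetical wide table: one row per branch, one column per month.
wide = pd.DataFrame({
    "branch": ["Branch A", "Branch B"],
    "january": [120, 80],
    "february": [150, 95],
})

# melt() keeps "branch" as an identifier and stacks the month columns
# into two new columns: the month name and the circulation value.
tidy = wide.melt(id_vars="branch", var_name="month", value_name="circulation")
print(tidy)
```

After reshaping, each row holds one observation (one branch in one month), each column holds one variable, and each cell holds one value.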
@@ -69,6 +70,7 @@ R for Data Science [12.1](https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structu
### Benefits of Tidy Data

Transforming our data into a tidy data format provides several advantages:

- Python libraries for visualization, filtering, and statistical analysis work better with data in a tidy format.
- Tidy data makes transforming, summarizing, and visualizing information easier. For instance, comparing monthly trends or calculating annual averages becomes more straightforward.
- As datasets grow, a tidy structure helps keep them manageable and makes analyses less error-prone.
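
To make these benefits concrete, here is a minimal sketch using a small, invented tidy table (the values are placeholders, not the lesson's data). Because every variable has its own column, an annual average or a filter is a single expression.

```python
import pandas as pd

# Hypothetical tidy table: one row per branch, year, and month.
tidy = pd.DataFrame({
    "branch": ["A", "A", "B", "B"],
    "year": [2021, 2022, 2021, 2022],
    "month": ["january", "january", "january", "january"],
    "circulation": [100, 140, 60, 80],
})

# Average circulation per year across all branches.
annual_mean = tidy.groupby("year")["circulation"].mean()

# Rows where circulation is above 90.
high = tidy[tidy["circulation"] > 90]

print(annual_mean)
print(high)
```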
@@ -337,6 +339,97 @@ Let's save `df_long` to use in the next episode.
df_long.to_pickle('data/df_long.pkl')
```

::::::::::::::::::::::::::::::::::::::: challenge

## Tidy Data Principles

How would you reorganize the following table about research data workshops to follow the three tidy data principles?

1. Every column holds a single variable.
2. Every row represents a single observation.
3. Every cell contains a single value.

| Date       | Length  | Content     | Instructor |
|------------|---------|-------------|------------|
| 2023-01-15 | 30 min  | RDM, DMP    | CH         |
| 2023-02-02 | 2 hours | Python, RDM | CH, TD     |
| 2023-02-03 | 90 min  | Python      | SP         |

You can use each content unit (e.g., RDM, DMP, Python) as an observation, and break down the length of time or instructor initials to match the content unit however you like.

::::::::::::::: solution

## Solution

| Year | Month | Day | Length (min) | Content | Instructor |
|------|-------|-----|--------------|---------|------------|
| 2023 | 01    | 15  | 20           | RDM     | CH         |
| 2023 | 01    | 15  | 10           | DMP     | CH         |
| 2023 | 02    | 02  | 100          | Python  | TD         |
| 2023 | 02    | 02  | 20           | RDM     | CH         |
| 2023 | 02    | 03  | 90           | Python  | SP         |

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
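
If you want to try part of this reorganization in pandas, the sketch below shows one possible approach. It assumes the workshop table has been typed in by hand as a DataFrame, and `to_minutes` is a helper written only for this example. It converts the lengths to minutes and gives each content unit its own row. Dividing a session's time and instructors among its content units still requires knowledge that is not in the table, so the code leaves those values at the session level.

```python
import pandas as pd

# The untidy workshop table from the challenge, entered by hand.
workshops = pd.DataFrame({
    "Date": ["2023-01-15", "2023-02-02", "2023-02-03"],
    "Length": ["30 min", "2 hours", "90 min"],
    "Content": ["RDM, DMP", "Python, RDM", "Python"],
    "Instructor": ["CH", "CH, TD", "SP"],
})

def to_minutes(length):
    """Convert strings such as '30 min' or '2 hours' to integer minutes."""
    number, unit = length.split()
    return int(number) * 60 if unit.startswith("hour") else int(number)

# One unit of measurement for every value in the column.
workshops["Length (min)"] = workshops["Length"].apply(to_minutes)

# Turn "RDM, DMP" into ["RDM", "DMP"], then give each item its own row.
workshops["Content"] = workshops["Content"].str.split(", ")
by_content = (
    workshops.explode("Content")
    .drop(columns="Length")
    .reset_index(drop=True)
)
print(by_content)
```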

::::::::::::::::::::::::::::::::::::::: challenge

## Subsetting df_long

Using `df_long`, create a new DataFrame, `low_circ`, that only includes rows where a branch's circulation is lower than 500 for the month. When you create a subset DataFrame, show the following columns: branch, circulation, month, and year. Next, eliminate the rows where the circulation is equal to 0.

```python
low_circ = df_long[_________[_________] __ 500]
low_circ = _________[_________[_________] != __]
low_circ.sort_values(by='circulation', ascending=False)
```

::::::::::::::: solution

## Solution

```python
low_circ = df_long[df_long['circulation'] < 500]
low_circ = low_circ[low_circ['circulation'] != 0]
low_circ.sort_values(by='circulation', ascending=False)
```

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
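
As an alternative sketch, both conditions can be combined into one boolean mask, and `.loc` can select the four requested columns at the same time. The small `df_long` below is a stand-in with invented values and an extra `city` column, so the block runs on its own. In the lesson you would use the real `df_long` instead.

```python
import pandas as pd

# Stand-in for df_long with made-up values (assumption for illustration).
df_long = pd.DataFrame({
    "branch": ["A", "B", "C", "D"],
    "city": ["Chicago", "Chicago", "Chicago", "Chicago"],
    "year": [2022, 2022, 2022, 2022],
    "month": ["january", "january", "january", "january"],
    "circulation": [0, 250, 499, 1200],
})

# Parentheses are required around each comparison when combining with &.
mask = (df_long["circulation"] < 500) & (df_long["circulation"] != 0)
columns = ["branch", "circulation", "month", "year"]

low_circ = df_long.loc[mask, columns].sort_values(by="circulation", ascending=False)
print(low_circ)
```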

::::::::::::::::::::::::::::::::::::::: challenge

## Group and aggregate for circulation by year

How would you create a DataFrame that sums up the circulation by year across all branches? In other words, you want a DataFrame that includes one row for each year, and columns for 'year' and 'sum', the latter of which is the sum of all circulation figures for the entire year.

::::::::::::::: solution

## Solution

```python
df_long.groupby(['year'])['circulation'].agg(['sum'])
```

| year | sum     |
|------|---------|
| 2011 | 7774198 |
| 2012 | 7598080 |
| 2013 | 6894958 |
| 2014 | 6406512 |
| 2015 | 5953920 |
| 2016 | 5696456 |
| 2017 | 5305624 |
| 2018 | 4989239 |
| 2019 | 4785108 |
| 2020 | 2726156 |
| 2021 | 3184327 |
| 2022 | 3342472 |

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
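
Note that `groupby()` places `year` in the index of the result rather than in a regular column. If you want `year` and `sum` as two ordinary columns, one option is to call `reset_index()` afterwards. The sketch below uses a tiny invented `df_long` so it can run on its own; the numbers are placeholders.

```python
import pandas as pd

# Stand-in for df_long with made-up values (assumption for illustration).
df_long = pd.DataFrame({
    "branch": ["A", "B", "A", "B"],
    "year": [2021, 2021, 2022, 2022],
    "circulation": [100, 50, 120, 40],
})

# Sum circulation per year, then move 'year' from the index into a column.
yearly = df_long.groupby("year")["circulation"].agg(["sum"]).reset_index()
print(yearly)
```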

:::::::::::::::::::::::::::::::::::::::: keypoints

- In tidy data each variable forms a column, each observation forms a row, and each type of observational unit forms a table.