episodes/tidy.md: 93 additions & 0 deletions
@@ -56,6 +56,7 @@ To address this we can reshape our data in a long format. This is sometimes call
## Tidy Data

Tidy data is a standard way of organizing data values within a dataset, making it easier to work with. Here are the key principles of tidy data:

1. Every column holds a single variable, like "month" or "temperature."
2. Every row represents a single observation, like circulation counts by branch and month.
3. Every cell contains a single value.
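As a rough illustration of the three principles, the sketch below reshapes a small wide table into a long (tidy) one with `pandas.DataFrame.melt`. The branch names, month columns, and numbers are made up for this example and are not the lesson's dataset.

```python
import pandas as pd

# Hypothetical wide table: one row per branch, one column per month.
df_wide = pd.DataFrame({
    "branch": ["Central", "Eastside"],
    "year": [2020, 2020],
    "january": [120, 80],
    "february": [95, 0],
})

# melt() keeps the identifier columns and stacks the month columns
# into two new columns: one for the month name, one for the count.
df_tidy = df_wide.melt(
    id_vars=["branch", "year"],
    var_name="month",
    value_name="circulation",
)

print(df_tidy)
```

After reshaping, each column holds one variable (`branch`, `year`, `month`, `circulation`), each row is one branch-month observation, and each cell holds one value.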
@@ -69,6 +70,7 @@ R for Data Science [12.1](https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structu
### Benefits of Tidy Data

Transforming our data into a tidy data format provides several advantages:

- Python libraries for visualization, filtering, and statistical analysis work better with data in a tidy format.
- Tidy data makes transforming, summarizing, and visualizing information easier. For instance, comparing monthly trends or calculating annual averages becomes more straightforward.
- As datasets grow, tidy data ensures that they remain manageable and analyses remain accurate.
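To make the second point concrete, here is a minimal sketch of an annual average and a monthly comparison on a tidy table. The DataFrame is a made-up stand-in with the column names used in this lesson's challenges (`branch`, `year`, `month`, `circulation`); the values are invented.

```python
import pandas as pd

# Hypothetical tidy table: one row per branch-month observation.
df_tidy = pd.DataFrame({
    "branch": ["Central", "Central", "Central", "Central"],
    "year": [2020, 2020, 2021, 2021],
    "month": ["january", "february", "january", "february"],
    "circulation": [100, 200, 300, 500],
})

# Annual average: group by the year column and average the counts.
annual_avg = df_tidy.groupby("year")["circulation"].mean()

# Monthly comparison: filter on the month column to compare
# the same month across years.
january_only = df_tidy[df_tidy["month"] == "january"]

print(annual_avg)
print(january_only)
```

Because month and year are ordinary columns rather than column headers, both operations are single expressions on the same table.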
@@ -337,6 +339,97 @@ Let's save `df_long` to use in the next episode.
```
df.to_pickle('data/df_long.pkl')
```
::::::::::::::::::::::::::::::::::::::: challenge

## Tidy Data Principles

How would you reorganize the following table about research data workshops to follow the three tidy data principles?
You can use each content unit (e.g., RDM, DMP, Python) as an observation, and break down the length of time or instructor initials to match the content unit however you like.

::::::::::::::: solution

## Solution

| Year | Month | Day | Length (min) | Content | Instructor |
Using `df_long`, create a new DataFrame, `low_circ`, that only includes branches with circulation numbers lower than 500 per month. When you create the subset DataFrame, show the following columns: branch, circulation, month, and year. Next, remove the rows where the circulation is equal to 0.
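One possible approach, sketched on a small made-up stand-in for `df_long` (the `address` column and all values are hypothetical). It reads "lower than 500 per month" as a row-wise filter on monthly counts, which is an assumption about the intended meaning.

```python
import pandas as pd

# Made-up stand-in for df_long; the real data has more rows and columns.
df_long = pd.DataFrame({
    "branch": ["Central", "Central", "Eastside", "Eastside"],
    "address": ["1 Main St", "1 Main St", "2 Oak Ave", "2 Oak Ave"],
    "year": [2020, 2020, 2020, 2020],
    "month": ["january", "february", "january", "february"],
    "circulation": [750, 480, 0, 120],
})

# Keep rows with fewer than 500 checkouts and only the requested columns.
cols = ["branch", "circulation", "month", "year"]
low_circ = df_long.loc[df_long["circulation"] < 500, cols]

# Drop the rows where circulation is exactly 0.
low_circ = low_circ[low_circ["circulation"] != 0]

print(low_circ)
```

Selecting rows and columns in one `.loc` call keeps the subset explicit; the second step could also be folded into the first condition with `&`.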
How would you create a DataFrame that sums up the circulation by year across all branches? In other words, you want a DataFrame that includes one row for each year, and columns for 'year' and 'sum', the latter of which is the sum of all circulation figures for the entire year.
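A minimal sketch of one way to do this with `groupby`, again on a made-up stand-in for `df_long` (values are invented for illustration):

```python
import pandas as pd

# Made-up stand-in for df_long with two branches and two years.
df_long = pd.DataFrame({
    "branch": ["Central", "Eastside", "Central", "Eastside"],
    "year": [2020, 2020, 2021, 2021],
    "month": ["january", "january", "january", "january"],
    "circulation": [100, 200, 50, 25],
})

# Group by year, add up circulation across all branches and months,
# then turn the result back into a DataFrame with a 'sum' column.
yearly = (
    df_long.groupby("year")["circulation"]
    .sum()
    .reset_index(name="sum")
)

print(yearly)
```

`reset_index(name="sum")` converts the grouped Series into a DataFrame with one row per year and the two requested columns, `year` and `sum`.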