topics/digital-humanities/tutorials/open-refine-tutorial/tutorial.md
Lines changed: 15 additions & 10 deletions
Original file line number
Diff line number
Diff line change
@@ -75,7 +75,7 @@ In practice, you can iterate on a workflow in a familiar Graphic-User-Interface
75
75
76
76
# Hands on: Get the data
77
77
78
-
We will work with a slightly adapted dataset from the **[Powerhouse Museum](https://powerhouse.com.au/)** (Australia’s largest museum group) containing a metadata collection. The museum shared the dataset online before giving API access to its collection. We slightly adapted the dataset and uploaded it to Zenodo for long-term reuse. The tabular file (**36.4 MB**) includes **14 columns** for **75,811** objects, released under a **[Creative Commons Attribution Share Alike (CCASA) license](http://creativecommons.org/licenses/by-nc/2.5/au/)**. We will answer two questions: *From what year does the museum have the most objects?* And *what objects does the museum have from that year?*
78
+
We will work with a slightly adapted dataset from the **[Powerhouse Museum](https://powerhouse.com.au/)** (Australia’s largest museum group) containing a metadata collection. The museum shared the dataset online before giving API access to its collection. We slightly adapted the dataset and uploaded it to Zenodo for long-term reuse. The tabular file (**36.4 MB**) includes **14 columns** for **75,811** objects, released under a **[Creative Commons Attribution Share Alike (CCASA) license](http://creativecommons.org/licenses/by-nc/2.5/au/)**. We will answer three questions: From **which category** does the museum have the most objects? From **what year** does the museum have the most objects? And **what objects does the museum have from that year?**
79
79
80
80
**Why this dataset?** It is credible, openly published, and realistically messy, which makes it ideal for practising problems scholars encounter at scale. Records include a **Categories** field populated from the **Powerhouse Museum Object Names Thesaurus (PONT)**, a controlled vocabulary reflecting Australian usage. The tutorial deliberately surfaces common quality issues: blank values that are actually stray whitespace, duplicate rows, and multi-valued cells separated by the pipe character `|` (including edge cases where **double pipes** `||` inflate row counts). This lets us practice systematic inspection before any analysis. During cleaning, you will compute sanity checks (after de-duplication, the dataset drops to **XXXX** unique records; a facet reveals **XXXX** distinct categories and **XXXX** items with no category). Without careful atomization and clustering, these irregularities would bias statistics, visualizations, and downstream reconciliation.
81
81
@@ -224,7 +224,7 @@ Take a look at the `Categories` column of your dataset. Most objects were attrib
224
224
> Many different categories describe the object. You may notice duplicates categorising the same object twice.
225
225
> We also want to remove those to ensure we only have unique categories that describe a single object.
226
226
>
227
-
> 6. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Transform...`.
227
+
> 6. Click on the triangle on the left of `Categories`, hover over `Edit cells`, and click on `Transform...`.
@@ -254,14 +254,12 @@ Each entry can be assigned to more than one category. To leverage those keywords
254
254
255
255
> <hands-on-title>Atomization</hands-on-title>
256
256
>
257
-
> 1. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Split multi-valued cells...`.
257
+
> 1. Click on the triangle on the left of `Categories`, hover over `Edit cells`, and click on `Split multi-valued cells...`.
258
258
>
259
259
> 
260
260
>
261
261
> 2. Define the `Separator` as `\|` (pipe). Click on `OK`.
262
262
>
263
-
> 
264
-
>
265
263
{: .hands_on}
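
To make the atomization step more concrete, here is a minimal Python sketch of what splitting a multi-valued cell on the pipe character amounts to. It is not OpenRefine's implementation: the `split_multi_valued` helper, the `Record ID` column, and the sample rows are hypothetical and only illustrate the idea of one output row per separated value.

```python
def split_multi_valued(rows, column, separator="|"):
    """Return one output row per separated value found in `column`."""
    atomized = []
    for row in rows:
        for value in row[column].split(separator):
            # Copy the row and keep a single value in the split column.
            atomized.append({**row, column: value})
    return atomized


# Hypothetical sample rows; the real dataset has 14 columns.
rows = [
    {"Record ID": "1", "Categories": "Numismatics|Medals"},
    {"Record ID": "2", "Categories": "Ceramics||Tableware"},  # double pipe
    {"Record ID": "3", "Categories": ""},                     # no category
]

atomized = split_multi_valued(rows, "Categories")
print(len(atomized))  # 6
```

In this sketch the double pipe in the second row produces an empty value between `Ceramics` and `Tableware`, so three input rows become six output rows. This is the kind of row-count inflation that is worth checking for after the split.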
266
264
267
265
Are you ready for a little challenge? Let's investigate the categories column of the museum items.
@@ -275,17 +273,19 @@ Are you ready for a little challenge? Let's investigate the categories column of
275
273
> >
276
274
> > 1. 168,476
277
275
> > 2. Click on the triangle on the left of `Categories`, hover over `Facet`, then over `Customized facets`, and click on `Facet by blank (null or empty string)`. The `true` value for blank entries is 447.
276
+
> >
277
+
> > 
278
278
> >
279
279
> {: .solution}
280
280
{: .question}
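
The blank facet used in the solution can be sketched in a few lines of Python. This assumes "blank" means exactly what the facet's label says (null or empty string); the `facet_by_blank` helper and the sample rows are hypothetical and not part of the tutorial's dataset.

```python
def facet_by_blank(rows, column):
    """Count rows whose value in `column` is blank (None or "") versus not blank."""
    counts = {True: 0, False: 0}
    for row in rows:
        value = row.get(column)
        is_blank = value is None or value == ""
        counts[is_blank] += 1
    return counts


# Hypothetical sample rows for illustration only.
sample_rows = [
    {"Categories": "Numismatics"},
    {"Categories": ""},
    {"Categories": None},
    {"Categories": "   "},  # whitespace only: not blank under this definition
]

blank_counts = facet_by_blank(sample_rows, "Categories")
print(blank_counts)  # {True: 2, False: 2}
```

Under this assumed definition, a cell that contains only whitespace is counted as not blank, so stray whitespace would need to be trimmed before such a count is reliable.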
281
281
282
-
Now, let's use faceting based on text.
283
-
284
282
## Faceting
285
283
286
-
> <hands-on-title>Atomization</hands-on-title>
284
+
Now that the `Categories` field is cleaned, we can check the occurrence of categories with various facets.
285
+
286
+
> <hands-on-title>Faceting</hands-on-title>
287
287
>
288
-
> 1. Click on the triangle on the left of `Categories`, hover over `facet`, and click on`Text facet`.
288
+
> 1. Click on the triangle on the left of `Categories`, hover over `Facet`, and click on `Text facet`.
289
289
> 2. The left panel shows the total number of choices. The default value of `count limit` is too low for this dataset, so we should increase it. Click on `Set choice count limit`.
290
290
>
291
291
> 
@@ -300,13 +300,18 @@ Now, let's use faceting based on text.
300
300
>
301
301
{: .hands_on}
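
Conceptually, a text facet counts how often each distinct value occurs in a column. The following Python sketch shows that idea with the standard library's `collections.Counter`. The `text_facet` helper and the sample counts are hypothetical; they are not the museum's real figures.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how many rows carry each distinct value in `column`."""
    return Counter(row[column] for row in rows)


# Hypothetical atomized rows: one category per row.
sample_rows = (
    [{"Categories": "Numismatics"}] * 4
    + [{"Categories": "Ceramics"}] * 3
    + [{"Categories": "Clothing and Dress"}] * 2
    + [{"Categories": "Tableware"}]
)

facet = text_facet(sample_rows, "Categories")
top_three = facet.most_common(3)
print(top_three)
# [('Numismatics', 4), ('Ceramics', 3), ('Clothing and Dress', 2)]
```

Sorting the facet by count in the OpenRefine panel plays the same role as `most_common` here: it brings the most frequent categories to the top.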
302
302
303
+
You can now see which category the museum has the most objects from, which answers one of our initial questions about the dataset.
304
+
303
305
> <question-title></question-title>
304
306
>
305
307
> 1. What are the top 3 categories? How many items are associated with each of them?
306
308
>
307
309
> > <solution-title></solution-title>
308
310
> >
309
-
> > 1. Numismatics (8011), Ceramics (7389), and Clothing and Dress (7279)
311
+
> > 1. Numismatics (8011), Ceramics (7389), and Clothing and Dress (7279).
312
+
> > Congratulations, you have just answered our first question: from which category does the museum have the most objects?
313
+
> > The answer is numismatic objects, that is, coins and related items such as medals. This makes a lot of sense: coins have a long history and convey a lot of information, which makes them very interesting for researchers.
314
+
> > Moreover, they are robust and compact, making them durable and relatively easy for museums to store.