Commit 3eb321d

authored
Update tutorial.md
1 parent 69ee03c commit 3eb321d

File tree

1 file changed

+15
-10
lines changed
  • topics/digital-humanities/tutorials/open-refine-tutorial

topics/digital-humanities/tutorials/open-refine-tutorial/tutorial.md

Lines changed: 15 additions & 10 deletions
@@ -75,7 +75,7 @@ In practice, you can iterate on a workflow in a familiar Graphic-User-Interface
7575

7676
# Hands on: Get the data
7777

78-
We will work with a slightly adapted dataset from the **[Powerhouse Museum](https://powerhouse.com.au/)** (Australia’s largest museum group) containing a metadata collection. The museum shared the dataset online before giving API access to its collection. We slightly adapted the dataset and uploaded it to Zenodo for long-term reuse. The tabular file (**36.4 MB**) includes **14 columns** for **75,811** objects, released under a **[Creative Commons Attribution Share Alike (CCASA) license](http://creativecommons.org/licenses/by-nc/2.5/au/)**. We will answer two questions: *From what year does the museum have the most objects?* And *what objects does the museum have from that year?*
78+
We will work with a slightly adapted dataset from the **[Powerhouse Museum](https://powerhouse.com.au/)** (Australia’s largest museum group) containing a metadata collection. The museum shared the dataset online before giving API access to its collection. We slightly adapted the dataset and uploaded it to Zenodo for long-term reuse. The tabular file (**36.4 MB**) includes **14 columns** for **75,811** objects, released under a **[Creative Commons Attribution Share Alike (CCASA) license](http://creativecommons.org/licenses/by-nc/2.5/au/)**. We will answer three questions: From **which category** does the museum have the most objects? From **what year** does the museum have the most objects? And **what objects does the museum have from that year?**
7979

8080
**Why this dataset?** It is credible, openly published, and realistically messy—ideal for practising problems scholars encounter at scale. Records include a **Categories** field populated from the **Powerhouse Museum Object Names Thesaurus (PONT)**, a controlled vocabulary reflecting Australian usage. The tutorial deliberately surfaces common quality issues—blank values that are actually stray whitespace, duplicate rows, and multi-valued cells separated by the pipe character `|` (including edge cases where **double pipes** `||` inflate row counts)—so we can practice systematic inspection before any analysis. During cleaning, you will compute sanity checks (after de-duplication, the dataset drops to **XXXX** unique records; a facet reveals **XXXX** distinct categories and **XXXX** items with no category). Without careful atomization and clustering, these irregularities would bias statistics, visualizations, and downstream reconciliation.
8181

@@ -224,7 +224,7 @@ Take a look at the `Categories` column of your dataset. Most objects were attrib
224224
> Many different categories describe the object. You may notice duplicates categorising the same object twice.
225225
> We also want to remove those to ensure we only have unique categories that describe a single object.
226226
>
227-
> 6. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Transform...`.
227+
> 6. Click on the triangle on the left of `Categories`, hover over `Edit cells`, and click on `Transform...`.
228228
>
229229
> ![Edit cells Categories](images/filter_grel.png)
230230
>
@@ -254,14 +254,12 @@ Each entry can be assigned to more than one category. To leverage those keywords
254254
255255
> <hands-on-title>Atomization</hands-on-title>
256256
>
257-
> 1. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Split multi-valued cells...`.
257+
> 1. Click on the triangle on the left of `Categories`, hover over `Edit cells`, and click on `Split multi-valued cells...`.
258258
>
259259
> ![Atomization of Categories](images/split_multi_valued_cells.png)
260260
>
261261
> 2. Define the `Separator` as `\|` (pipe). Click on `OK`.
262262
>
263-
> ![Facet Blank of atomized Categories](images/facet_categories_blank.png)
264-
>
265263
{: .hands_on}
266264
267265
Are you ready for a little challenge? Let's investigate the categories column of the museum items.
@@ -275,17 +273,19 @@ Are you ready for a little challenge? Let's investigate the categories column of
275273
> >
276274
> > 1. 168,476
277275
> > 2. Click on the triangle on the left of `Categories` and hover over `facet` and move your mouse over `Customized facets`, and click on `Facet by blank (null or empty string)`. The `true` value for blank entries is 447.
276+
> >
277+
> > ![Facet Blank of atomized Categories](images/facet_categories_blank.png)
278278
> >
279279
> {: .solution}
280280
{: .question}
281281
282-
Now, let's use faceting based on text.
283-
284282
## Faceting
285283
286-
> <hands-on-title>Atomization</hands-on-title>
284+
Now that the `Categories` field is cleaned, we can check the occurrence of categories with various facets.
285+
286+
> <hands-on-title>Faceting</hands-on-title>
287287
>
288-
> 1. Click on the triangle on the left of `Categories`, hover over `facet`, and click on`Text facet`.
288+
> 1. Click on the triangle on the left of `Categories`, hover over `Facet`, and click on `Text facet`.
289289
> 2. On the left panel, it mentions the total number of choices. The default value of `count limit` is low for this dataset, and we should increase it. Click on `Set choice count limit`.
290290
>
291291
> ![Text faceting of atomized Categories](images/text_facet.png)
@@ -300,13 +300,18 @@ Now, let's use faceting based on text.
300300
>
301301
{: .hands_on}
302302
303+
You can now see which category the museum has the most objects from, which answers one of our initial questions about the dataset.
304+
303305
> <question-title></question-title>
304306
>
305307
> 1. What are the top 3 categories? How many items are associated with each of them?
306308
>
307309
> > <solution-title></solution-title>
308310
> >
309-
> > 1. Numismatics (8011), Ceramics (7389), and Clothing and Dress (7279)
311+
> > 1. Numismatics (8011), Ceramics (7389), and Clothing and Dress (7279).
312+
> > Congratulations, you have just answered our first question: from which category does the museum have the most objects?
313+
> > It is numismatic objects, meaning coins. This makes a lot of sense; coins have a long history and convey a lot of information. They are therefore very interesting for researchers.
314+
> > Moreover, they are robust and compact, making them durable and relatively easy for museums to store.
310315
> >
311316
> {: .solution}
312317
{: .question}
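
For readers who want to see the logic of the atomization and text-facet steps outside the GUI, the following is a minimal Python sketch. It is not how OpenRefine implements these operations; the helper names `atomize` and `text_facet` and the four sample cells are hypothetical and invented for illustration. It splits each `Categories` cell on the pipe character, trims whitespace, drops empty fragments (such as those produced by a double pipe `||`) and repeated values, and then counts how often each category occurs.

```python
from collections import Counter


def atomize(cell, separator="|"):
    """Split a multi-valued cell into unique, trimmed, non-empty values.

    Hypothetical helper: mimics the effect of trimming whitespace,
    splitting on the pipe, and removing repeated categories.
    """
    values = []
    for part in cell.split(separator):
        value = part.strip()
        # Skip empty fragments (e.g. from "||") and repeats within one cell.
        if value and value not in values:
            values.append(value)
    return values


def text_facet(cells, separator="|"):
    """Count category occurrences and cells that hold no category at all."""
    counts = Counter()
    blank_cells = 0
    for cell in cells:
        values = atomize(cell, separator)
        if not values:
            blank_cells += 1
        counts.update(values)
    return counts, blank_cells


# Made-up sample data, not taken from the Powerhouse dataset.
sample_cells = [
    "Numismatics|Medals",
    "Ceramics||Ceramics",   # double pipe plus a repeated category
    "  ",                   # looks blank, but is stray whitespace
    "Numismatics",
]

counts, blank_cells = text_facet(sample_cells)
for category, count in counts.most_common(3):
    print(category, count)
print("cells without a category:", blank_cells)
```

Note that this sketch counts blank cells per original record, whereas the `Facet by blank` step in the tutorial counts blank rows after splitting, so the two numbers are not directly comparable.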
