- How to use OpenRefine in Galaxy to clean your data?
- How to use a workflow in Galaxy to extract and visualise information from your data?
objectives:
- Start OpenRefine as an Interactive Tool in Galaxy
date: 2025-09-19
---
This tutorial shows how to use **OpenRefine** in Galaxy to clean and visualize data from the **humanities and social sciences**. It has two parts:

- **Introduction to OpenRefine**, based on {% cite Hooland_2013 %} and adapted for Galaxy.
- **Introduction to running Galaxy workflows** to visualize cleaned data and extract specific information.

## What is OpenRefine?

**OpenRefine** is a free, open-source “data wrangler” built for messy, heterogeneous, evolving datasets. It imports common formats (CSV/TSV, Excel, JSON, XML) and domain-specific ones used across GLAM (Galleries, Libraries, Archives and Museums) and official statistics (MARC, RDF serializations, PC-Axis).

It is **non-destructive** — OpenRefine does not alter your source files, but works on copies and saves projects locally. Facets and filters let you audit categories, surface outliers, and triage inconsistencies without code. Its **clustering** tools consolidate near-duplicates using both key-collision methods (fingerprint, n-gram, phonetic) and edit-distance/nearest-neighbour methods (Levenshtein, PPM) so you can standardize names and places at scale while keeping human oversight.
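
To get a feel for how key collision works, here is a small, illustrative Python sketch of the fingerprint and n-gram fingerprint keying functions. It is a simplification of OpenRefine's actual implementation (which, for example, also transliterates accented characters); values that produce the same key end up in the same cluster.

```python
import re
import string

PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def fingerprint(value):
    """Lowercase, strip punctuation, split into tokens, sort, deduplicate, rejoin."""
    tokens = PUNCT.sub("", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))

def ngram_fingerprint(value, n=3):
    """Same idea on the character n-grams of the whitespace-free, lowercased value."""
    chars = PUNCT.sub("", re.sub(r"\s+", "", value.strip().lower()))
    return "".join(sorted({chars[i:i + n] for i in range(max(len(chars) - n + 1, 0))}))

# These pairs produce identical keys, so OpenRefine would offer to merge them:
print(fingerprint("Numismatics "), fingerprint("numismatics"))
print(ngram_fingerprint("glass plate negative"), ngram_fingerprint("Glass-Plate Negative"))
```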

For enrichment, OpenRefine speaks the **Reconciliation API** to match local values to external authorities (e.g. **Wikidata**, **ROR**) and optionally pull back richer metadata. Transformations—both point-and-click and **GREL** formulas—are recorded as a stepwise, undoable history that you can export as JSON and re-apply to other datasets, enabling reproducible cleaning and easy peer review. Finished tables export cleanly to **CSV/TSV**, ODS/XLS(X), SQL statements, templated JSON, Google Sheets, or can be exported back to Galaxy.
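
If you are curious what a reconciliation request looks like under the hood, the sketch below sends a single query to the public Wikidata reconciliation service. The endpoint URL, the example type `Q33506` (museum), and the response layout reflect the service at the time of writing and may change; inside OpenRefine you would simply use the *Reconcile* menu instead.

```python
import json
import requests

# Public Wikidata reconciliation endpoint (assumption: may change over time).
ENDPOINT = "https://wikidata.reconci.link/en/api"

# One reconciliation query: find Wikidata items of type "museum" matching the label.
queries = {"q0": {"query": "Powerhouse Museum", "type": "Q33506", "limit": 3}}

response = requests.get(ENDPOINT, params={"queries": json.dumps(queries)}, timeout=30)
response.raise_for_status()

for candidate in response.json()["q0"]["result"]:
    print(candidate["id"], candidate["name"], candidate.get("score"))
```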

## From Cleaning to Analysis in Galaxy

Once your dataset has been cleaned with OpenRefine, you often want to analyze it further or visualize specific aspects. This is where **Galaxy Workflows** become essential: they let you build reproducible pipelines that operate on your curated data, moving from one-off cleaning to structured analysis.

## What are Galaxy Workflows?

**Galaxy Workflows** are structured, stepwise pipelines you build and run entirely in the browser—either extracted from a recorded analysis *history* or assembled in the visual editor. They can be annotated, shared, published, imported, and rerun, making them ideal for teaching, collaboration, and reproducible research.

A captured analysis is easy to share: export the workflow as JSON (**`.ga`**: tools, parameters, and Input/Output) or export a provenance-rich run as a **[Workflow Run RO-Crate](https://www.researchobject.org/workflow-run-crate/)** bundling the definition with inputs, outputs, and invocation metadata. This lowers the barrier to entry (no local installs; web UI with pre-installed tools and substantial compute) while preserving best practices (histories track tool versions and parameters; workflows are easily re-applied to new data).
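
As an illustration of how such an export can be scripted, here is a minimal sketch using BioBlend, Galaxy's Python API client. The server URL, API key, and workflow name are placeholders, not values from this tutorial; in practice you can simply use the workflow menu in the web interface.

```python
import json

from bioblend.galaxy import GalaxyInstance

# Placeholders: point these at your own Galaxy server, API key, and workflow.
gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")

workflow = gi.workflows.get_workflows(name="PHM analysis")[0]   # look the workflow up by name
definition = gi.workflows.export_workflow_dict(workflow["id"])  # the .ga definition as a dict

with open("phm-analysis.ga", "w") as handle:
    json.dump(definition, handle, indent=2)
```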

For findability and credit, the community uses **[WorkflowHub](https://workflowhub.eu/)**—a curated registry that supports multiple workflow technologies (including Galaxy) and promotes **FAIR** principles; it offers Spaces/Teams, permissions, versioning, and **DOIs via DataCite**, with metadata linking to identifiers like **[ORCID](https://orcid.org/)** so contributions enter scholarly knowledge graphs and are properly acknowledged.

In practice, you can iterate on a workflow in a familiar GUI, export the exact definition or a run package, and deposit it where peers can discover, reuse, review, and cite it—closing the loop between simple authoring and robust scholarly dissemination.

> <agenda-title></agenda-title>
> <hands-on-title>Opening the dataset with OpenRefine</hands-on-title>
>
> 1. Open the {% tool [OpenRefine](interactive_tool_openrefine) %}: Working with messy data
>    - *"Input file in tabular format"*: `openrefine-phm-collection.tsv`
>
> 2. Click on "Run Tool".
>
> 
>
> 3. After around 30 seconds, using the interactive tools section on the left panel, you can open OpenRefine by clicking on its name. Make sure to wait until you see the symbol with an arrow pointing outside the box that allows you to start OpenRefine in a new tab.
>
> 
>
> 4. Here, you can see the OpenRefine GUI. Click on `Open Project`.
> 5. Click on `Galaxy file`. If the file does not appear, you may have started OpenRefine before it was fully loaded. Retry steps 3 and 4, and the file should be visible.
>
> 
{: .hands_on}

Great, now that the dataset is in OpenRefine, we can start cleaning it.

> <hands-on-title>Removing the blank rows</hands-on-title>
>
> 1. Click on the triangle on the left of `Record ID`.
>
> 
>
> 2. Click on `Sort...`.
>
> 3. Select `numbers` and click on `OK`.
>
> 
>
> 4. Above the table, click on `Sort` and select `Reorder rows permanently`.
>
> 
>
> 5. Click on the triangle left of the `Record ID` column. Hover over `Edit cells` and select `Blank down`.
>
> 
>
> 6. Click on the triangle left of the `Record ID` column. Hover over `Facet`, then move your mouse to `Customized facets` and select `Facet by blank (null or empty string)`.
>
> 
>
> 7. On the left, a new option appears under `Facet/Filter` with the title `Record ID`. Click on `true`.
>
> 
>
> 8. Click on the triangle to the left of the column called `All`. Hover over `Edit rows` and select `Remove matching rows`.
>
> 
>
> 9. Close the `Facet` by clicking on the cross (x) to see all rows.
>
{: .hands_on}
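
If you prefer scripting, the steps above boil down to keeping only the first row of every `Record ID`. The following pandas sketch is purely illustrative (it assumes the column is literally named `Record ID` and that the TSV has a header row); the tutorial itself stays entirely in OpenRefine.

```python
import pandas as pd

df = pd.read_csv("openrefine-phm-collection.tsv", sep="\t")

# Steps 1-4: sort the records numerically by their identifier.
df["Record ID"] = pd.to_numeric(df["Record ID"], errors="coerce")
df = df.sort_values("Record ID")

# Steps 5-9: "Blank down", "Facet by blank", and "Remove matching rows"
# together keep only the first row of each Record ID.
df = df.drop_duplicates(subset="Record ID", keep="first")
print(len(df), "rows remain")
```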

Are you ready for a little challenge?

> 1. How many rows do you have after atomizing the categories column?
> 2. How many entries do not have any category?
>
> > <solution-title></solution-title>
> >
> > 1. 168,476

The clustering allows you to solve issues regarding case inconsistencies, incoherent use of either the singular or plural form, and simple spelling mistakes.

> <hands-on-title>Clustering of similar categories</hands-on-title>
>
> 1. Click on the `Cluster` button on the left in the `Facet/Filter` tab.
> 2. Use `Key collision` as clustering method. Change the Keying function to `n-Gram fingerprint` and change the n-Gram size to `3`.
>
> 
>
> 5. Now, you can close the clustering window by clicking on `close`.
>
> Be careful! Some methods are too aggressive, so you might end up clustering values that do not belong together. Now that the values have been clustered individually, we can put them back together in a single cell.
>
> 1. Click the Categories triangle, hover over `Edit cells`, and click on `Join multi-valued cells`.
> 2. Choose the pipe character (`|`) as a separator and click on `OK`.
>
> The rows now look like before, with a multi-valued Categories field.
>
{: .hands_on}
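
The atomize and join steps can also be pictured in code. The pandas sketch below splits the pipe-separated `Categories` column into one value per row and joins it back afterwards; the column names are assumptions about the dataset, and the actual clustering happens in OpenRefine as described above.

```python
import pandas as pd

df = pd.read_csv("openrefine-phm-collection.tsv", sep="\t")

# "Split multi-valued cells": one category per row, using the pipe separator.
atomized = df.assign(Categories=df["Categories"].str.split("|")).explode("Categories")

# ... this is where the faceting and clustering clean-up happens in OpenRefine ...

# "Join multi-valued cells": put the (cleaned) categories back into one cell per record.
joined = (
    atomized.groupby("Record ID", sort=False)["Categories"]
    .apply(lambda values: "|".join(dict.fromkeys(values.dropna())))
    .reset_index()
)
```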

When you’re happy with your analysis results, choose whether to export the dataset into your Galaxy history or download it directly onto your computer.

## Exporting your data back to Galaxy
# Run a Galaxy Workflow on your cleaned data

Congratulations, you have successfully cleaned your data and improved its quality!
But what can you do with it now?
This depends on your aims as a researcher. For us, it is interesting to extract further information from the data.
To make it easy for you, we created a so-called workflow, which links all the tools needed to do this analysis.
We wanted to know from what year the museum had the most objects and what they were.
You can follow along and answer those questions with us, or explore the Galaxy tools on your own to adapt the analysis to your needs.
In this case, be sure to check out our other tutorials, particularly the introductory ones.
## How to find and run existing workflows
{: .hands_on}
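
If you would rather launch the workflow programmatically than through the web interface, the sketch below shows how this could look with BioBlend. The URL, API key, workflow name, and dataset ID are placeholders, not values from this tutorial.

```python
from bioblend.galaxy import GalaxyInstance

# Placeholders: your Galaxy server, API key, workflow name, and the ID of the
# cleaned dataset that already sits in one of your histories.
gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")

workflow = gi.workflows.get_workflows(name="PHM analysis")[0]
history = gi.histories.create_history(name="PHM workflow run")

# Map the workflow's first input step to the cleaned dataset.
inputs = {"0": {"src": "hda", "id": "CLEANED_DATASET_ID"}}
invocation = gi.workflows.invoke_workflow(workflow["id"], inputs=inputs, history_id=history["id"])

print("Invocation:", invocation["id"])  # progress can then be followed in the Galaxy interface
```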
What can you see here? To let you follow along, we made all substeps of the task available as outputs. To answer our question of which year most objects in the museum derive from, we first cut the production date column from the table and keep only dates that refer to a specific year rather than a year range. Regular expressions help clean the remaining inconsistencies in the dataset. Sorting the production dates in descending order reveals one faulty record supposedly created in 2041, which we remove. Datamash then sums up how many objects derive from each year, and we visualise the counts in ascending order as a bar chart. To find out which year most objects derive from, we sort the counts the other way around. We then use a conditional filter to extract the object descriptions of the objects from that year, which in our case is 1969. From all object descriptions from 1969, we create a word cloud using the offered stop word list.

As a result, we find that most of the objects from that year are negatives by David Mist, which he created in 1969 and later gave to the museum.
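
For comparison, the core of this analysis could also be expressed in a few lines of pandas. The sketch is only illustrative: the column names and the year cut-off are assumptions about the cleaned table, and the Galaxy workflow uses its own tools (Cut, Filter, Sort, Datamash, and the word cloud tool) instead.

```python
import pandas as pd

df = pd.read_csv("phm-collection-cleaned.tsv", sep="\t")

# Keep only entries whose production date is a single four-digit year (no ranges).
years = df["Production date"].astype(str).str.extract(r"^\s*(\d{4})\s*$")[0].dropna().astype(int)
years = years[years <= 2025]   # drop implausible future dates such as the faulty 2041 record

counts = years.value_counts()  # number of objects per production year
top_year = counts.idxmax()     # 1969 in our run

# Object descriptions for the most frequent year, ready for a word cloud.
top_descriptions = df.loc[years[years == top_year].index, "Description"]
print(top_year, len(top_descriptions), "objects")
```
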
