topics/digital-humanities/tutorials/open-refine-tutorial/tutorial.md
Lines changed: 31 additions & 30 deletions
Original file line number
Diff line number
Diff line change
@@ -151,33 +151,33 @@ Great, now that the dataset is in OpenRefine, we can start cleaning it.
151
151
>
152
152
> 1. Click on the triangle on the left of `Record ID`.
153
153
>
154
-
> 
154
+
> 
155
155
>
156
156
> 2. Click on `Sort...`.
157
157
>
158
158
> 3. Select `numbers` and click on `OK`.
159
159
>
160
-
> 
160
+
> 
161
161
>
162
162
> 4. Above the table, click on `Sort` and select `Reorder rows permanently`.
163
163
>
164
-
> 
164
+
> 
165
165
>
166
166
> 5. Click on the triangle left of the `Record ID` column. Hover over `Edit cells` and select `Blank down`.
167
167
>
168
-
> 
168
+
> 
169
169
>
170
170
> 6. Click on the triangle left of the `Record ID` column. Hover over `Facet`, then move your mouse to `Customized facets` and select `Facet by blank (null or empty string)`.
171
171
>
172
-
> 
172
+
> 
173
173
>
174
174
> 7. On the left, a new option appears under `Facet/Filter` with the title `Record ID`. Click on `true`.
175
175
>
176
-
> 
176
+
> 
177
177
>
178
178
> 8. Click on the triangle to the left of the column called `All`. Hover over `Edit rows`, and select `remove matching rows`.
179
179
>
180
-
> 
180
+
> 
181
181
>
182
182
> 9. Close the `Facet` by clicking on the cross (x) to see all rows.
183
183
>
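The steps in this hunk (sort, blank down, facet by blank, remove matching rows) amount to dropping every row whose `Record ID` repeats the row above it. A minimal Python sketch of that logic, assuming the rows are dictionaries with a numeric `Record ID` stored as text; the function names and sample data are hypothetical and are not part of OpenRefine:

```python
def blank_down(values):
    """Blank every value that repeats the value directly above it."""
    result = []
    previous = object()  # sentinel that is equal to nothing
    for v in values:
        result.append(None if v == previous else v)
        previous = v
    return result


def remove_duplicate_records(rows, key="Record ID"):
    # Steps 1-4: sort numerically so duplicates end up next to each other.
    ordered = sorted(rows, key=lambda r: int(r[key]))
    # Step 5: blank down the key column.
    blanked = blank_down([r[key] for r in ordered])
    # Steps 6-8: keep only the rows whose key was not blanked.
    return [r for r, k in zip(ordered, blanked) if k is not None]


# Hypothetical sample data for illustration only.
rows = [
    {"Record ID": "3", "Title": "c"},
    {"Record ID": "1", "Title": "a"},
    {"Record ID": "3", "Title": "c"},
    {"Record ID": "2", "Title": "b"},
]
cleaned = remove_duplicate_records(rows)
print([r["Record ID"] for r in cleaned])
```

Sorting first matters: blank down only compares neighbouring rows, so unsorted duplicates would survive.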
@@ -206,16 +206,17 @@ The dataset does not contain any more blank rows now. But we need to do more cle
206
206
> 3. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Transform...`.
207
207
> 4. In the new window, use the following text `value.replace('||', '|')` as "Expression" and click on `OK`.
208
208
>
209
-
> 
209
+
> 
210
+
>
211
+
> We can also remove duplicate occurrences of the same category within an entry as follows:
210
212
>
211
-
> We can also remove duplicate occurrences of the same category within an entry as follows:
212
213
> 5. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Transform...`.
> 2. In the new window, use the following text `value.split('|').uniques().join('|')` as "Expression" and click on `OK`.
219
+
> 6. In the new window, use the following text `value.split('|').uniques().join('|')` as "Expression" and click on `OK`.
219
220
>
220
221
{: .hands_on}
221
222
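For readers who do not know GREL, the two `Transform...` expressions in this hunk can be approximated in Python. This is a rough sketch, not OpenRefine code: it assumes the second expression is applied to the cell `value`, and it assumes that keeping the first occurrence of each category is acceptable. The function names and the sample cell are hypothetical.

```python
def collapse_double_pipes(value: str) -> str:
    """Rough equivalent of value.replace('||', '|')."""
    return value.replace("||", "|")


def unique_categories(value: str) -> str:
    """Rough equivalent of value.split('|').uniques().join('|').

    dict.fromkeys drops repeats and keeps first-seen order (an assumption
    about the desired result, not a statement about GREL internals).
    """
    return "|".join(dict.fromkeys(value.split("|")))


# Hypothetical sample cell for illustration only.
cell = "Glass|Glass||Ceramics"
step1 = collapse_double_pipes(cell)
step2 = unique_categories(step1)
print(step1)
print(step2)
```

Running the replacement first avoids an empty category between the two pipes when the cell is split.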
@@ -238,11 +239,11 @@ The dataset does not contain any more blank rows now. But we need to do more cle
238
239
> than one category. In order to analyze in detail the use of the keywords, the values of the Categories column need to be split up into individual cells on the basis of the pipe character.
239
240
> 1. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Split multi-valued cells...`.
240
241
>
241
-
> 
242
+
> 
242
243
>
243
244
> 2. Define the `Separator` as `\|` (pipe). Click on `OK`.
244
245
>
245
-
> 
246
+
> 
246
247
>
247
248
{: .hands_on}
248
249
@@ -270,15 +271,15 @@ Now, let's use faceting based on text.
270
271
> 1. Click on the triangle on the left of `Categories`, hover over `facet`, and click on `Text facet`.
271
272
> 2. On the left panel, it mentions the total number of choices. The default value of `count limit` is low for this dataset, and we should increase it. Click on `Set choice count limit`.
272
273
>
273
-
> 
274
+
> 
274
275
>
275
276
> 3. Enter `5000` as the new limit and click on `Ok`.
276
277
>
277
-
> 
278
+
> 
278
279
>
279
280
> 4. Now, you see all categories. Click on `count` to see the categories sorted in descending order.
280
281
>
281
-
> 
282
+
> 
282
283
>
283
284
{: .hands_on}
284
285
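Conceptually, the text facet sorted by `count` is a frequency table of the individual category values. A small Python sketch under that assumption, using the standard library `collections.Counter`; the function name and the sample cells are hypothetical:

```python
from collections import Counter


def text_facet(cells, separator="|"):
    """Count each category across all cells, most frequent first."""
    counts = Counter()
    for cell in cells:
        # Skip empty fragments left by stray separators.
        counts.update(part for part in cell.split(separator) if part)
    return counts.most_common()


# Hypothetical sample cells for illustration only.
cells = ["Glass|Ceramics", "Glass", "Textiles|Glass"]
facet = text_facet(cells)
for category, count in facet:
    print(count, category)
```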
@@ -303,21 +304,21 @@ The clustering allows you to solve issues regarding case inconsistencies, incohe
303
304
> 1. Click on the `Cluster` button on the left in the `Facet/Filter` tab.
304
305
> 2. Use `Key collision` as clustering method. Change the Keying function to `n-Gram fingerprint` and change the n-Gram size to `3`.
305
306
>
306
-
> 
307
+
> 
307
308
>
308
309
> 3. Click on the `cluster` button in the middle window.
309
310
>
310
-
> 
311
+
> 
311
312
>
312
313
> 4. Here, you can see different suggestions from OpenRefine to cluster different categories and merge them into one. In our tutorial, we merge all of the suggestions by clicking on `select > all` and then clicking on `Merge selected and re-cluster`.
313
314
>
314
-
> 
315
+
> 
315
316
>
316
317
> 5. Now, you can close the clustering window by clicking on `close`.
317
318
>
318
-
> Be careful! Some methods are too aggressive, so you might end up clustering values that do not belong together. Now that the values have been clustered individually, we can put them back together in a single cell.
319
-
> 1. Click the Categories triangle and hover over the `Edit cells` and click on `Join multi-valued cells`.
320
-
> 2. Choose the pipe character (`\|`) as a separator and click on `OK`.
319
+
> Be careful! Some methods are too aggressive, so you might end up clustering values that do not belong together. Now that the values have been clustered individually, we can put them back together in a single cell.
320
+
> 6. Click the Categories triangle and hover over the `Edit cells` and click on `Join multi-valued cells`.
321
+
> 7. Choose the pipe character (`\|`) as a separator and click on `OK`.
321
322
> The rows now look like before, with a multi-valued Categories field.
322
323
>
323
324
{: .hands_on}
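To give an intuition for `Key collision` clustering with an n-gram fingerprint of size `3`, here is a simplified Python sketch. It is an assumption-laden approximation, not OpenRefine's implementation: it lowercases the value, drops ASCII punctuation and whitespace, collects the distinct 3-grams, sorts them, and joins them into a key. Values that share a key land in the same cluster. OpenRefine may normalize further (for example, accented characters), and the fallback for very short strings is a choice made only for this sketch.

```python
import string
from collections import defaultdict


def ngram_fingerprint(value: str, n: int = 3) -> str:
    """Simplified n-gram fingerprint key (illustrative only)."""
    cleaned = "".join(
        ch for ch in value.lower()
        if ch not in string.punctuation and not ch.isspace()
    )
    if len(cleaned) < n:
        # Sketch-only fallback: too short for any n-gram, use the text itself.
        return cleaned
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))


def cluster(values, n: int = 3):
    """Group values whose fingerprints collide; keep groups of two or more."""
    buckets = defaultdict(set)
    for v in values:
        buckets[ngram_fingerprint(v, n)].add(v)
    return [sorted(group) for group in buckets.values() if len(group) > 1]


# Hypothetical category values for illustration only.
values = ["Glass ware", "glassware", "Glass-ware", "Ceramics"]
clusters = cluster(values)
print(clusters)
```

Because the key ignores case, spacing, and punctuation, it can also merge values that merely look similar, which is why the suggestions are worth reviewing before merging.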
@@ -331,12 +332,12 @@ When you’re happy with your analysis results, choose whether to export the dat
331
332
> 1. Click on `Export` at the top of the table.
332
333
> 2. Select `Galaxy exporter`. Wait a few seconds. On a new page, you will see the following text: "Dataset has been exported to Galaxy, please close this tab". When you see this, you can close that tab. Alternatively, you can download your cleaned dataset in various formats such as CSV, TSV, and Excel. You can also close the extra tab that contains OpenRefine and click on the orange item `OpenRefine on data [and a number]`. You do not need it for your next steps.
333
334
>
334
-
> 
335
+
> 
335
336
>
336
337
> 3. You can find a new dataset in your Galaxy History (with a green background) that contains your cleaned dataset for further analysis.
337
338
> 4. You can click on the eye icon ({% icon galaxy-eye %}) and investigate the table.
338
339
>
339
-
> 
340
+
> 
340
341
>
341
342
{: .hands_on}
342
343
@@ -345,7 +346,7 @@ When you’re happy with your analysis results, choose whether to export the dat
345
346
> 1. Click on `Undo/Redo` on the left panel.
346
347
> 2. Click on `Extract...`.
347
348
>
348
-
> 
349
+
> 
349
350
>
350
351
> 3. Click on the steps that you want to extract. Here, we selected everything.
351
352
> 4. Click on `Export`. Give your file a name to save it on your computer.
@@ -379,22 +380,22 @@ In this case, be sure to check out our other tutorials, particularly the introdu
379
380
> Let's assume that you have imported a workflow to your Galaxy account.
380
381
> 1. You can find all workflows available to you by clicking on the Workflows Icon ({% icon galaxy-workflows-activity %}) on the left panel.
381
382
>
382
-
> 
383
+
> 
383
384
>
384
385
> 2. Then, you can select and run different workflows (if you have any workflows in your account). Here, let's click on the Run button ({% icon workflow-run %}) of the workflow we provided to you in this tutorial.
385
386
>
386
-
> 
387
+
> 
387
388
>
388
389
> 3. Determine the inputs as follows:
389
390
> Input: `openrefine-Galaxy file.tsv`
390
391
> stop_words_english: `stop_words_english.txt`, which is the file we provided to you in this tutorial.
391
392
>
392
-
> 
393
+
> 
393
394
>
394
395
> 5. Click on the `Run Workflow` button at the top.
395
396
> 6. You can follow the stages of different jobs (computational tasks). They will be created, scheduled, executed, and completed. When everything is green, your workflow has run fully and the results are ready.
396
397
>
397
-
> 
398
+
> 