topics/digital-humanities/tutorials/open-refine-tutorial/tutorial.md
12 additions & 6 deletions
@@ -207,30 +207,36 @@ The dataset no longer contains duplicates based on the Record ID. However, we ne
207
207
208
208
There are many ways to manipulate your dataset in OpenRefine. One of them is the General Refine Expression Language (GREL), formerly known as the Google Refine Expression Language. With the help of GREL, you can, for example, create custom facets or add columns by fetching URLs. We will use it to find and replace errors. For more information, refer to the [GREL documentation](https://openrefine.org/docs/manual/expressions).
209
209
210
-
Take a look at the `Categories` column of your dataset. Most objects were attributed to various categories, separated by "\|". However, several fields contain "\|\|" instead of "\|". We want to unify those.
210
+
Take a look at the `Categories` column of your dataset. Most objects were attributed to various categories, separated by "\|". However, several fields contain "\|\|" instead of "\|" as a separator. We want to unify those.
211
211
212
212
> <hands-on-title>Find and replace typos using GREL</hands-on-title>
213
213
>
214
214
> To remove the occurrences of the double pipe "\|\|" from the file, we can do the following:
215
215
> 1. Click on the triangle on the left of `Categories` and select `Text filter`.
216
-
> 2. On the left, using the `Facet/Filter` section, search for the occurrence of "\|" and "\|\|". There are 71061 rows with "\|" and 9 rows with "\|\|". We want to remove these nine lines as they were added by mistake.
217
-
> 3. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Transform...`.
216
+
> 2. On the left, using the `Facet/Filter` section, search for occurrences of "\|" and "\|\|". There are 71061 rows with "\|" and 9 rows with "\|\|". We would like to correct these nine rows, as the double pipes were added by mistake.
217
+
> 3. Click on the triangle on the left of `Categories`, hover over `Edit cells`, and click on `Transform...`.
218
218
> 4. In the new window, use the following text `value.replace('||', '|')` as "Expression" and click on `OK`.
219
219
>
220
220
> 
221
221
>
222
-
> We can also remove the double occurrence of the same for different entries as follows:
222
+
> The expression replaces \|\| with \|. If you search for the occurrence of \|\| again, you will no longer get any results.
223
223
>
224
-
> 5. Click on the triangle on the left of `Categories`, hover over `edit cells`, and click on `Transform...`.
224
+
> Each cell currently holds many different categories, which makes the data hard to work with.
225
+
> We therefore split the values of the `Categories` column into individual cells, using the pipe character as the separator.
226
+
> That way, we can also remove double occurrences of the same categories for one object.
227
+
>
228
+
> 6. Click on the triangle on the left of `Categories`, hover over `Edit cells`, and click on `Transform...`.
-
> 6. In the new window, use the following text `split('|').uniques().join('|')` as "Expression" and click on `OK`.
234
+
> 7. In the new window, use the following text `value.split('|').uniques().join('|')` as "Expression" and click on `OK`.
231
235
>
232
236
{: .hands_on}
233
237
238
+
These expressions split categories at the pipe separator and join the unique ones within this column. As a result, duplicate categories for one object are deleted.
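
To make the effect of the two GREL expressions easier to follow outside OpenRefine, here is a minimal Python sketch that mimics them on a single cell value. The function names and the sample cell are made up for illustration and are not part of the tutorial dataset. The sketch keeps the first occurrence of each category in its original order; OpenRefine's own handling of ordering and of empty segments may differ.

```python
def fix_double_pipes(value: str) -> str:
    """Rough equivalent of the GREL expression value.replace('||', '|')."""
    return value.replace("||", "|")


def unique_categories(value: str, sep: str = "|") -> str:
    """Rough equivalent of value.split('|').uniques().join('|')."""
    parts = value.split(sep)
    # dict.fromkeys drops duplicates while preserving first-occurrence order
    return sep.join(dict.fromkeys(parts))


# Hypothetical cell content, not taken from the real dataset
cell = "Glass||Ceramics|Glass|Textiles"

cleaned = fix_double_pipes(cell)
deduplicated = unique_categories(cleaned)

print(cleaned)
print(deduplicated)
```

Running the separator fix first matters in this sketch: if the double pipe were still present, splitting on a single pipe would produce an empty entry between the two pipes.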