topics/digital-humanities/tutorials/open-refine-tutorial/tutorial.md
Lines changed: 14 additions & 13 deletions
@@ -445,39 +445,40 @@ What can you see here? To follow along, we made all substeps of the task availab
 To answer our question of which year most elements in the museum derive from, we first cut the column of production time from the table.
 You can see this in the file `Cut on Data (Number)`.
 From this, we filter only the dates that derive from specific years, not year ranges. (See `Filter Tabular on Data (Number)`.)
-You can click on the arrow in a cirlce button of this dataset (`Run job again`) to see what exact input was used to exclude year ranges.
+You can click on the arrow in the circle button of this dataset (`Run job again`) to see what exact input was used to exclude year ranges.
 Regular expressions help clean remaining inconsistencies in the dataset. (Dataset: `Column Regex Find And Replace on Data (Number)`)
-Sorting the production date in descending order, as done in dataset `Sort on data (lowest Number)`, reveals that one faulty dataset, which is supposed to have been created in 2041, is part of the table.
+Sorting the production date in descending order, as done in the dataset `Sort on data (lowest Number)`, reveals that one faulty dataset, which is supposed to have been created in 2041, is part of the table.
 We remove it in the next step with the tool `Remove beginning`.
 
 The tool **Datamash** allows for summarising how many elements arrived at the museum in each year. (Dataset: `Datamash on data (Number)`.)
 After we apply this tool, the dataset is no longer 7738 lines long, but only 259.
-This is because the amount of times, each year appeared in the table was summed up in a second column.
-Sorting in ascending order (`Sort on data (Number)`) shows a chronological dataset with the earliest enties in the beginning and the most recent entries at the end of the table.
-This, we can easily visualise in a (particularly crowded) bar chart directly within Galaxy. (`Bar chart on data (Number)`)
-But this is not the most optimal view to show us, what year most objects derive from.
+This is because the number of times each year appeared in the table was summed up in a second column.
+Sorting in ascending order (`Sort on data (Number)`) shows a chronological dataset with the earliest entries in the beginning and the most recent entries at the end of the table.
+We can easily visualise this in a (particularly crowded) bar chart directly within Galaxy. (`Bar chart on data (Number)`)
+But this is not the most optimal view to show us which year most objects derive from.
 
 To determine from which year most objects originate, we use another sorting order (`Sort on Data (highest number)`).
 
 > <question-title></question-title>
 >
-> 1. From what year does the museum have most objects?
+> 1. From what year does the museum have the most objects?
 >
 > > <solution-title></solution-title>
 > >
-> > 1. The dataset `Sort on data (highest number)` shows the amount of objexts by year, sorted from most to least. 288 items are noted for the year 1969 in the first row. This is the year from which the museum has most (clearly datable) objects.
+> > 1. The dataset `Sort on data (highest number)` shows the number of objects by year, sorted from most to least. 288 items are noted for the year 1969 in the first row. This is the year from which the museum has the most (clearly datable) objects.
 > >
 > {: .solution}
 {: .question}
 
 The next four steps parse this year as a conditional statement step by step. (`Select first on data (Number)`, `Cut on data (Number)`, `Parse parameter value on data (Number)` and `Compose text parameter value`.)
 This means, even if you upload another dataset, the highest number is always selected and taken as an input for the next steps.
 
-Based on this input, which is determined by the year with the highest input, ` Search in textfiles on data (Number)` now searches for object descriptions from the 288 objects of the most prominent year.
-
-From all object descriptions from that year, we create a word cloud of the object descriptions by using the offered stop word list.
-This helps us quickly determine, what kinds of objects the museum has from this popular year.
-The dataset `Word cloud image` shows that most objects from the museum are negatives from Davis Mist, a famous Australian photographer, which he created that year and donated to the museum.
+Based on this input, which is determined by the year with the highest input, `Search in textfiles on data (Number)` searches for object descriptions from the 288 objects of the most prominent year.
+The table is very rich in information, but not that easy to digest.
+To make the table more accessible, we create a word cloud of the object descriptions with the offered stop word list.
+If you click on the stop word list we provided, you see what "fill words" are excluded from the word cloud. In essence, only words conveying meaning remain.
+This helps us quickly determine what kinds of objects the museum has from this popular year.
+The dataset `Word cloud image` shows that most objects from the museum are negatives from Davis Mist, a famous Australian photographer, who created them that year and donated them to the museum.
 
 
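
The counting logic described in the changed passage (keep only specific years, count objects per year, pick the most frequent year, then count the words in that year's descriptions minus stop words) can be approximated outside Galaxy for a quick sanity check. The following is a minimal sketch in plain Python, not a reproduction of the Galaxy tools: the sample records, the function names and the stop word set are all hypothetical and are not taken from the museum dataset.

```python
import re
from collections import Counter

# Hypothetical sample rows: (production date, object description).
records = [
    ("1969", "Negative of a street scene"),
    ("1969", "Negative of a harbour"),
    ("1950-1960", "Wooden chair"),          # year range, will be excluded
    ("1985", "Poster for an exhibition"),
]

# Hypothetical stop word set standing in for the tutorial's stop word list.
stop_words = {"of", "a", "an", "for"}


def count_by_year(rows):
    """Count objects per year, keeping only dates that are a single four-digit year."""
    years = [date for date, _ in rows if re.fullmatch(r"\d{4}", date)]
    return Counter(years)


def top_year(counts):
    """Return (year, count) for the year with the most objects."""
    return counts.most_common(1)[0]


def word_frequencies(rows, year, stops):
    """Count words in the descriptions of one year, skipping stop words."""
    words = []
    for date, description in rows:
        if date == year:
            tokens = re.findall(r"[a-z]+", description.lower())
            words.extend(t for t in tokens if t not in stops)
    return Counter(words)


counts = count_by_year(records)
year, n_objects = top_year(counts)
frequencies = word_frequencies(records, year, stop_words)
print(year, n_objects, frequencies.most_common(3))
```

The word frequencies are the kind of input a word cloud is typically drawn from; the year with the most objects is computed from the data rather than hard-coded, which mirrors the tutorial's point that the workflow still works when another dataset is uploaded.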
