- Create Project > Web Addresses (URLs) > https://raw.githubusercontent.com/colinrobinsonuib/openrefine-workshop/refs/heads/main/data/bicycle-collection.tsv
- Columns are seperated by tabs (TSV)
- Uncheck - "Quotation marks are used to enclose" (bottom)
- Project Name to "bicycle categories"
- Notice there are 72 rows
- Notice there are two views: rows and records
- The "records" view is for advanced manipulation
Description> View > CollapseMarks> View > Collapse- You can
- "Show" 5-50 rows at a time
- Navigate screens "< Previous" & "next >"
Categories > Facet > Text facet
- Notice mashup of "pipe delimited" text facets
- Switch between the "name" and "count" sorting options.
Categories > edit cells > split multi-valued cells > separator = |
- Now how many cateogry choices exist?
- Now how many rows exist?
- Click records to switch to the records view
- Click count What is the most popular Category term?
- Click name Limit to the "Juvenilia" facet; How many matching records?
new_link> Edit column > Add column by fetching URLs- New column name = Web Data
- Notice Throttle delay
- 2000 milliseconds is good. Less than 2 seconds and you MIGHT get booted, blocked, cast out!
- Click OK and wait
- When done, you’ve got all the source data for each page associated with the link in the "Persistent Link" column
Typically, after fetching data, you’ll need to parse it.
For example, you can transform the data in the retrieved cells with a regular expression to gather only the title of each object.
Web Data> Edit column > Add column based on this column- New Column Name = Web Page Title
- Expression =
value.parseHtml().select("h2")[0].htmlText()
- remove faceting: click Remove All in the facet sidebar
Production Date> Edit column > Split into several columns…- Seperator = - (i.e. "space dash space")
- Height > Edit column > Add column based on this column
- New column name: Dimension
- GREL Expression =
value + " x " + cells["Width"].value
- Dimension > Edit cells > Transform
- GREL Expression = value.replace("mm","")
- Dimension > Facet > Customized facet > facet by blank
- Clear all facets
- Object Title > Text filter > bmx
- Production date 1
- Facet > custom text facet
value[0,2]orvalue.match(/(\d\d).*/)[0]
- Load university-data.csv
- Facet by country
- Cluster by fingerprint
- Cluster by Daitch-Mokotoff
- Manually rename
- Get countries down to 46
We just cleaned all the country values for 5500 rows of data!