searching data dumps with grep and jq
A great way to examine large amounts of Open Library data at once is to use the monthly data dumps. The dumps can be daunting if you have not worked with them before, or if you are new to the standard command line tools for processing text files and JSON. This guide provides a brief introduction with examples. The tools are widely used, and many resources are available online to learn more.
For those familiar with command line tools, the general approach is to use zgrep (or similar) to select only relevant lines in the compressed data without decompressing it. Pipe the output to cut to extract the JSON data from the end of the line, then use jq to query the data in a structured way. This guide explains the process in more detail and provides example commands.
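As a rough sketch of that pipeline, the commands below build a tiny made-up sample file and run zgrep, cut, and jq over it. The file name sample.txt.gz, the second record, and the revision and timestamp values are invented for illustration. The five-column layout (type, key, revision, last_modified, JSON) is an assumption about the dump format that you should check against your own download.

```shell
# Made-up two-line sample in the dump's assumed tab-separated layout:
# type, key, revision, last_modified, JSON record.
printf '%s\t%s\t%s\t%s\t%s\n' \
  /type/author /authors/OL366346A 1 2020-01-01T00:00:00 '{"key": "/authors/OL366346A", "name": "Maurice Sendak"}' \
  /type/author /authors/OL2A 1 2020-01-01T00:00:00 '{"key": "/authors/OL2A", "name": "Example Author"}' \
  | gzip > sample.txt.gz

# 1. zgrep selects matching lines straight from the compressed file.
# 2. cut keeps the 5th tab-separated field, which holds the JSON record.
# 3. jq pulls one value out of each record.
names=$(zgrep 'Sendak' sample.txt.gz | cut -f5 | jq '.name')
echo "$names"
```

Against the real dump, the same three steps apply; only the file name and the search phrase change.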
Start by downloading the latest authors dump, which is the smallest file available from https://openlibrary.org/developers/dumps. The same principles apply to other dumps, but this guide uses the authors dump as an example.
The authors dump is approximately 0.4 GB compressed and would expand to about 3 GB if decompressed. The file is typically named ol_dump_authors_latest.txt.gz, but for the following examples it is renamed to a.txt.gz for brevity.
There is no need to decompress the file; keep it as a .txt.gz file. Each entry is on its own line, so you can extract the lines of interest directly from the compressed data. The standard tool for this is zgrep, but these examples use ripgrep (usually invoked as rg), which is simpler and faster, especially for Windows and macOS users. You will need to install ripgrep and jq; Windows users may need to install additional dependencies.
The basic search command is rg -z 'Maurice Sendak' a.txt.gz, which tells rg to search the compressed file for the quoted phrase. Every matching line in a.txt.gz is output to the terminal.
The file currently contains roughly 8.5 million lines (8,487,789 at the time of writing) and continues to grow. Even a search for a relatively uncommon name (such as Kardashian, Cumberbatch, or Humperdinck) will return many matches. Consider using full names and the -m5 option to limit results to the first 5 matches.
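To see what the -m option does without touching the full dump, here is a minimal sketch on a made-up sample. The file name sample.txt.gz and the eight invented records are placeholders, and the example assumes rg and gzip are installed.

```shell
# Made-up sample: eight records that all contain the same surname.
for i in 1 2 3 4 5 6 7 8; do
  printf '/type/author\t/authors/OL%sA\t1\t2020-01-01T00:00:00\t{"name": "Person %s Kardashian"}\n' "$i" "$i"
done | gzip > sample.txt.gz

# -z searches inside the compressed file; -m5 stops after 5 matching lines.
first_five=$(rg -z -m5 'Kardashian' sample.txt.gz)

# Show only the key column of the lines that were returned.
printf '%s\n' "$first_five" | cut -f2
```

On the real file, the same flag stops the search early, so the command can return long before the whole dump has been scanned.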
If you search for items that appear on every line (such as key, type, author, or revision), the entire 3 GB file will be output. Use -c to get a count of matching lines instead.
rg -z -c type a.txt.gz
This takes about 5 seconds to complete. Most useful queries return far more data than you can read, so you will want to narrow the results further. If you ran the Maurice Sendak search above, you may have noticed that it returned every line containing that name, including the entry with his Open Library author ID (OL366346A), Wikipedia page data, bios of people influenced by him, and authors who collaborated with him. The usual pattern is to let the first search cut the data down to a manageable size, then refine that shortlist:
rg -z 'Maurice Sendak' a.txt.gz | cut -f5 | jq .name
This extracts the 5th tab-separated field on each line (the record in JSON format), then jq prints only the name from each entry.
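One way to refine the shortlist further is to let jq filter on a field value, so that records that merely mention the name elsewhere are dropped. The sketch below uses a made-up sample file; the second record and its bio text are invented, and the name and key fields are assumed to be present in the records.

```shell
# Made-up sample: the name appears as one author's name and in another's bio.
printf '%s\t%s\t%s\t%s\t%s\n' \
  /type/author /authors/OL366346A 1 2020-01-01T00:00:00 '{"key": "/authors/OL366346A", "name": "Maurice Sendak"}' \
  /type/author /authors/OL2A 1 2020-01-01T00:00:00 '{"key": "/authors/OL2A", "name": "Example Author", "bio": "Influenced by Maurice Sendak."}' \
  | gzip > sample.txt.gz

# rg returns both lines; jq's select() keeps only the records whose
# name field is an exact match, then prints the key as raw text (-r).
keys=$(rg -z 'Maurice Sendak' sample.txt.gz | cut -f5 \
  | jq -r 'select(.name == "Maurice Sendak") | .key')
echo "$keys"
```

The text search stays cheap and broad, while jq applies the precise, field-aware test only to the handful of lines that survive it.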
These are the basics. There are many additional capabilities—for example, jq can convert JSON to CSV if needed for your workflows. For a detailed introduction, see this Programming Historian guide on JSON and jq.
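As an illustration of the CSV conversion, the sketch below uses jq's @csv formatter on a made-up sample. The file names sample.txt.gz and authors.csv and the second record are invented for the example; choose whichever fields suit your own workflow.

```shell
# Made-up sample with two matching records.
printf '%s\t%s\t%s\t%s\t%s\n' \
  /type/author /authors/OL366346A 1 2020-01-01T00:00:00 '{"key": "/authors/OL366346A", "name": "Maurice Sendak"}' \
  /type/author /authors/OL2A 1 2020-01-01T00:00:00 '{"key": "/authors/OL2A", "name": "Example Sendak"}' \
  | gzip > sample.txt.gz

# Build a [key, name] array per record and format it as one CSV row.
# -r prints the row as raw text rather than as a JSON string.
rg -z 'Sendak' sample.txt.gz | cut -f5 \
  | jq -r '[.key, .name] | @csv' > authors.csv
cat authors.csv
```

The resulting file has one quoted row per record and can be opened in a spreadsheet or loaded by other tools.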
The jqplay tool mentioned in the guide is particularly useful for learning how jq works. Pasting a few lines of Open Library data into jqplay is faster than experimenting on the command line with a large file, and you can share snippets:
https://jqplay.org/s/NIq_Aku18p
This page can be expanded with worked examples for tasks useful to the Open Library community. Future topics may include:
https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md#common-options
Case-insensitive searches with -i:
rg -z -i 'irvine welsh' a.txt.gz
Useful regular expressions for matching names that vary slightly in spelling, and using -o to output only the part of each line that matches
Outputting only the lines that do not match, with -v
Using -w to match whole words only
Searching for unusual typographical characters as literal text with -F
Sorting, counting and doing set operations on the output (e.g. how many people have Wikidata IDs but no VIAF ID)
Advanced use of the jq query language.
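Until those worked examples are written, here is a small sketch of the ripgrep flags listed above, run against a made-up sample so the effect of each flag is easy to see. The four records and the file name sample.txt.gz are invented for illustration; -c is added to most commands so each prints a count of lines rather than the lines themselves.

```shell
# Made-up sample: four author records with deliberately awkward names.
printf '/type/author\t/authors/%s\t1\t2020-01-01T00:00:00\t{"name": "%s"}\n' \
  OL1A 'Irvine Welsh' \
  OL2A 'IRVINE WELSH' \
  OL3A 'Welshman Press' \
  OL4A 'A. Author (1950-?)' \
  | gzip > sample.txt.gz

# -i ignores case, so both spellings of the name are counted.
any_case=$(rg -z -c -i 'irvine welsh' sample.txt.gz)

# -w matches whole words only, so "Welshman" is not counted.
whole_word=$(rg -z -c -w 'Welsh' sample.txt.gz)

# -v inverts the match: count the lines that do NOT contain "Welsh".
non_matching=$(rg -z -c -v 'Welsh' sample.txt.gz)

# -F treats the pattern as literal text, so ( ) and ? are not regex syntax.
literal=$(rg -z -c -F '(1950-?)' sample.txt.gz)

# -o prints only the matched text, here every author ID, one per line.
ids=$(rg -z -o 'OL[0-9]+A' sample.txt.gz)

echo "$any_case $whole_word $non_matching $literal"
echo "$ids"
```

The same flags can be combined with the cut and jq steps shown earlier.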