A common issue with matching works from different sources / publishers is the integrity of the data. An author name from one source may come in as 'Toni Morrison', while another has 'Morrison, Toni', or 'Morrison, Toni, 1931-2019', or a myriad of other inputs. The goal of this project is to provide a simple proof of concept service to make resolving such discrepancies easier. The final product of this is a webservice that lets you query by an author name that you have and returns a set of potential matching 'established' names. The 'established' part of this is based on the Library of Congress Name Authority.
The general idea is as follows:
- Download the aforementioned named authority file
- Ingest each record from the file into a PostgreSQL DB
- Tokenize each name into a set of tokens representing (theoretically) each semantic part of the name
- Provide a webservice that uses these tokens as a quick filter before using fuzzy matching to provide a set of potentially matching names
Note, this is very much a proof-of-concept, and not a production-ready app. There are a number of obvious bugs here, and this app misses out on some very basic things like connection pooling. There is also some further learning to do in terms of processing these authority files from the Library of Congress, and this is simply a first stab at that.
This whole thing is a set of Python scripts and a very simple local webserver all running Python and managed by poetry.
- Install poetry: https://python-poetry.org/docs/#installation
- Run
poetry installto install dependencies. - Run
docker compose upto spin up the local database - Run
poetry run python loc_authority_tools/seed_db.pyto create the DB tables - Run
poetry run python loc_authority_tools/full_download.pyto download and unzip the LOC authority file - Run
poetry run python loc_authority_tools/ingest_loc_names.pyto ingest the file into a local DB - Run
poetry run python loc_authority_tools/tokenize_names.pyto ingest all author names and tokenize them for author matching- Note, this will take a long time (around 10 hours). A TODO is to add a
limitparam to the script OR to provide a dump of the data in S3
- Note, this will take a long time (around 10 hours). A TODO is to add a
- Run
poetry run flask --app loc_authority_tools.app:app runto start the webserver
Returns a list of LOC author names potentially matching the search term, with confidence numbers
Example: GET http://localhost:5000/author-match?author=morrison,toni
{
"data": {
"match": [
{
"confidence": 99,
"label": "Morrison, Toni",
"loc_url": "http://id.loc.gov/authorities/names/n80131379",
"uuid": "0537a214-4050-4185-a206-12bb7edac0fb"
},
{
"confidence": 92,
"label": "Morrison, Voni",
"loc_url": "http://id.loc.gov/authorities/names/no2018084424",
"uuid": "361b2258-012d-4ead-b434-f98576ff7700"
},
{
"confidence": 92,
"label": "Morrison, Tony",
"loc_url": "http://id.loc.gov/authorities/names/n79055204",
"uuid": "8ea7df62-4ee9-4a9a-93b7-03139b8b15d6"
},
{
"confidence": 88,
"label": "Morrison, Tom",
"loc_url": "http://id.loc.gov/authorities/names/n90716160",
"uuid": "69a3d747-ef47-445e-b66f-120cb1cfe739"
},
{
"confidence": 88,
"label": "Morrison, Tim",
"loc_url": "http://id.loc.gov/authorities/names/nb2010019150",
"uuid": "b31d2dd5-7fbd-450d-8516-37d96b7bf33f"
},
{
"confidence": 87,
"label": "Morrison, T.",
"loc_url": "http://id.loc.gov/authorities/names/n79025611",
"uuid": "e5b2253f-6b7e-4cbf-a9f3-071dc836905f"
}
]
}
}Note, there are duplicates here omitted for brevity, but that is a TODO to fix. There is also definitely some tuning required for the fuzzy matching.
Returns the corresponding author
{
"data": {
"author": {
"authoritative_label": "Morrison, Toni",
"loc_url": "http://id.loc.gov/authorities/names/n80131379",
"record_source": "84912d62-b264-4bf8-aaa9-7239c3b3be7b",
"uuid": "0537a214-4050-4185-a206-12bb7edac0fb"
}
}
}