LOC Authority Tools

A common issue with matching works from different sources / publishers is the integrity of the data. An author name from one source may come in as 'Toni Morrison', while another has 'Morrison, Toni', or 'Morrison, Toni, 1931-2019', or a myriad of other inputs. The goal of this project is to provide a simple proof of concept service to make resolving such discrepancies easier. The final product of this is a webservice that lets you query by an author name that you have and returns a set of potential matching 'established' names. The 'established' part of this is based on the Library of Congress Name Authority.

The general idea is as follows:

Download the aforementioned named authority file
Ingest each record from the file into a PostgreSQL DB
Tokenize each name into a set of tokens representing (theoretically) each semantic part of the name
Provide a webservice that uses these tokens as a quick filter before using fuzzy matching to provide a set of potentially matching names

Note, this is very much a proof-of-concept, and not a production-ready app. There are a number of obvious bugs here, and this app misses out on some very basic things like connection pooling. There is also some further learning to do in terms of processing these authority files from the Library of Congress, and this is simply a first stab at that.

Getting Started

This whole thing is a set of Python scripts and a very simple local webserver all running Python and managed by poetry.

Setup / Installation

Install poetry: https://python-poetry.org/docs/#installation
Run poetry install to install dependencies.
Run docker compose up to spin up the local database
Run poetry run python loc_authority_tools/seed_db.py to create the DB tables
Run poetry run python loc_authority_tools/full_download.py to download and unzip the LOC authority file
Run poetry run python loc_authority_tools/ingest_loc_names.py to ingest the file into a local DB
Run poetry run python loc_authority_tools/tokenize_names.py to ingest all author names and tokenize them for author matching
- Note, this will take a long time (around 10 hours). A TODO is to add a limit param to the script OR to provide a dump of the data in S3
Run poetry run flask --app loc_authority_tools.app:app run to start the webserver

Endpoints

GET '/authors-match?author={query}'

Returns a list of LOC author names potentially matching the search term, with confidence numbers

Example: GET http://localhost:5000/author-match?author=morrison,toni

{
  "data": {
    "match": [
      {
        "confidence": 99,
        "label": "Morrison, Toni",
        "loc_url": "http://id.loc.gov/authorities/names/n80131379",
        "uuid": "0537a214-4050-4185-a206-12bb7edac0fb"
      },
      {
        "confidence": 92,
        "label": "Morrison, Voni",
        "loc_url": "http://id.loc.gov/authorities/names/no2018084424",
        "uuid": "361b2258-012d-4ead-b434-f98576ff7700"
      },
      {
        "confidence": 92,
        "label": "Morrison, Tony",
        "loc_url": "http://id.loc.gov/authorities/names/n79055204",
        "uuid": "8ea7df62-4ee9-4a9a-93b7-03139b8b15d6"
      },
      {
        "confidence": 88,
        "label": "Morrison, Tom",
        "loc_url": "http://id.loc.gov/authorities/names/n90716160",
        "uuid": "69a3d747-ef47-445e-b66f-120cb1cfe739"
      },
      {
        "confidence": 88,
        "label": "Morrison, Tim",
        "loc_url": "http://id.loc.gov/authorities/names/nb2010019150",
        "uuid": "b31d2dd5-7fbd-450d-8516-37d96b7bf33f"
      },
      {
        "confidence": 87,
        "label": "Morrison, T.",
        "loc_url": "http://id.loc.gov/authorities/names/n79025611",
        "uuid": "e5b2253f-6b7e-4cbf-a9f3-071dc836905f"
      }
    ]
  }
}

Note, there are duplicates here omitted for brevity, but that is a TODO to fix. There is also definitely some tuning required for the fuzzy matching.

GET '/authors/{uuid}'

Returns the corresponding author

{
  "data": {
    "author": {
      "authoritative_label": "Morrison, Toni",
      "loc_url": "http://id.loc.gov/authorities/names/n80131379",
      "record_source": "84912d62-b264-4bf8-aaa9-7239c3b3be7b",
      "uuid": "0537a214-4050-4185-a206-12bb7edac0fb"
    }
  }
}

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
loc_authority_tools		loc_authority_tools
tests		tests
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LOC Authority Tools

Getting Started

Setup / Installation

Endpoints

GET '/authors-match?author={query}'

GET '/authors/{uuid}'

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LOC Authority Tools

Getting Started

Setup / Installation

Endpoints

GET '/authors-match?author={query}'

GET '/authors/{uuid}'

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages