Add dataset: chronicling_america · #85

@davanstrien

Description

A URL for this dataset

https://chroniclingamerica.loc.gov/about/api/#bulk-data

Dataset description

Chronicling America is a Library of Congress project to digitise historic newspapers. The collection is mostly in English but also includes other languages. Breakdown by language: https://public.tableau.com/app/profile/chronicling.america#!/vizhome/ChroniclingAmericaLanguageCoverageBubble/All_Lang

There are several ways of accessing this data, including bulk downloads and an API. The API may be the most helpful route (via a dataset loading script) because this dataset is not static: more titles are digitised and added on a rolling basis.

The 'newspapers' API (https://chroniclingamerica.loc.gov/newspapers.json) is probably the best starting point. It returns a list of newspaper titles for which digital content is held. The record for a title, e.g. https://chroniclingamerica.loc.gov/lccn/sn86072192.json, contains metadata about that title.

Screenshot 2022-09-27 at 16 32 26.
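As a rough illustration of starting from the 'newspapers' API, here is a minimal sketch that maps each title to the URL of its record. The field names `newspapers`, `lccn` and `url` are assumptions about the JSON layout and should be checked against a live response; `fetch_json` and `title_urls` are hypothetical helper names, and the sample payload is hand-written so the parsing can run offline.

```python
import json
from urllib.request import urlopen

NEWSPAPERS_URL = "https://chroniclingamerica.loc.gov/newspapers.json"


def fetch_json(url, timeout=30):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def title_urls(payload):
    """Map each LCCN to the URL of its title record.

    Assumes a payload shaped like
    {"newspapers": [{"lccn": ..., "url": ...}, ...]}.
    """
    return {entry["lccn"]: entry["url"] for entry in payload.get("newspapers", [])}


# Hand-written sample in the assumed shape (not a real API response).
sample_payload = {
    "newspapers": [
        {
            "lccn": "sn86072192",
            "url": "https://chroniclingamerica.loc.gov/lccn/sn86072192.json",
        }
    ]
}
titles = title_urls(sample_payload)
```

Against the live service, the same parsing would be applied as `title_urls(fetch_json(NEWSPAPERS_URL))`, provided the response really has the assumed layout.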

This API also contains all the issues for that title. For each issue, you get a set of pages. Each page contains the plain text generated from the OCR for that page, e.g. https://chroniclingamerica.loc.gov/lccn/sn82014726/1888-04-07/ed-1/seq-1/ocr.txt, and a link to the image of that page, e.g. https://chroniclingamerica.loc.gov/lccn/sn82014726/1888-04-07/ed-1/seq-1.jp2.
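The two example URLs above appear to share one pattern: LCCN, issue date, edition number and page sequence. The sketch below rebuilds both from those parts. The pattern is inferred only from these examples, so treat it as an assumption; in a real loading script it would be safer to read the links from the API responses. `page_urls` is a hypothetical helper name.

```python
from datetime import date

BASE_URL = "https://chroniclingamerica.loc.gov"


def page_urls(lccn, issue_date, edition, sequence):
    """Build the OCR text and image URLs for a single page.

    The URL layout is inferred from the example URLs in this issue.
    """
    page = f"{BASE_URL}/lccn/{lccn}/{issue_date.isoformat()}/ed-{edition}/seq-{sequence}"
    return {"text": f"{page}/ocr.txt", "image": f"{page}.jp2"}


urls = page_urls("sn82014726", date(1888, 4, 7), edition=1, sequence=1)
```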

My suggested approach to loading this dataset would be to call https://chroniclingamerica.loc.gov/newspapers.json at the start of the script and, depending on some filters defined in the loading script, e.g. a start/end date of interest, build up a list of relevant URLs for the text/images for each page.
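The date filter in this approach could look something like the sketch below. It assumes each title record carries an `issues` list whose entries have an ISO-formatted `date_issued` and a `url`; those field names are assumptions to verify against a real title record. `filter_issue_urls` is a hypothetical helper, and the sample record is made up for illustration.

```python
from datetime import date


def filter_issue_urls(title_record, start, end):
    """Return the URLs of issues published between start and end, inclusive."""
    selected = []
    for issue in title_record.get("issues", []):
        issued = date.fromisoformat(issue["date_issued"])
        if start <= issued <= end:
            selected.append(issue["url"])
    return selected


# Made-up title record in the assumed shape (not a real API response).
sample_title = {
    "issues": [
        {"date_issued": "1887-12-31", "url": "https://example.org/issue-a.json"},
        {"date_issued": "1888-04-07", "url": "https://example.org/issue-b.json"},
        {"date_issued": "1889-01-05", "url": "https://example.org/issue-c.json"},
    ]
}
issue_urls = filter_issue_urls(sample_title, date(1888, 1, 1), date(1888, 12, 31))
```

Filtering at the issue level keeps the number of follow-up requests small, since only the issues inside the date window need to be fetched to list their pages.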

If you want to work on this dataset, please cc @davanstrien and @albertvillanova!

Dataset modality

Mixed

Dataset licence

Other license

Other licence

https://chroniclingamerica.loc.gov/about/#rights

How can you access this data

Via an open API

size of dataset

10GB

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response
