Date handling for archival object records #7

@DaltonAlves

Archival object records that describe content in web archive formats should probably allow for at least two types of date subrecords:

  • Date of Creation
    This is the date the content was created. It can be inferred from the content itself, e.g. a webpage that states "Last updated", a file name (meetingminutes_2024.pdf), or the visual/textual content.
  • Date of Crawl/Capture
    This is the date or date range during which the web content was crawled. It does not reflect the actual creation date of the content, only the date the content was collected in a web archive format.
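
As a rough illustration of the two date types above, here is a minimal sketch of how one archival object might carry both. The field names assume an ArchivesSpace-style date subrecord, the "captured" label is hypothetical (it may need to be added to the controlled value list first), and all values are made up for the example.

```python
# Hypothetical shape for the two date subrecords on one archival object.
# Field names assume an ArchivesSpace-style date subrecord; the "captured"
# label and every value below are illustrative only.
archival_object = {
    "title": "Meeting minutes",
    "dates": [
        {
            # Supplied by an archivist, inferred from the content or file name.
            "label": "creation",
            "date_type": "single",
            "begin": "2024",
        },
        {
            # Derived from crawl timestamps in the CDX/C index.
            "label": "captured",
            "date_type": "inclusive",
            "begin": "2024-03-01",
            "end": "2024-06-15",
        },
    ],
}

labels = [date["label"] for date in archival_object["dates"]]
```

Keeping the two as separate subrecords means the crawl dates can be refreshed from the index without touching the archivist's date of creation.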

This integration should allow an archivist to populate date subrecords with a date of creation. Hypothetically, we could try to infer dates of creation computationally by inspecting HTML elements, modification dates, etc., but this is out of scope for this integration. This type of date information will require human inference.

This integration should continue to pull crawl/capture dates from the CDX/C index.

Presently, the integration searches for the first date subrecord of an archival object to check the date/date range against the CDX/C. The update_dates function will have to be updated to better handle multiple date subrecords.
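
One possible direction, sketched under assumptions: select the date subrecord by its label instead of taking the first one. The helper name `find_date_by_label` and the dict layout are hypothetical and are not taken from the current update_dates code.

```python
from typing import Optional


def find_date_by_label(archival_object: dict, label: str) -> Optional[dict]:
    """Return the first date subrecord with the given label, or None.

    Assumes date subrecords live in a "dates" list and carry a "label" key.
    """
    for date in archival_object.get("dates", []):
        if date.get("label") == label:
            return date
    return None


# Example: the creation date comes first, but we still find the crawl date.
record = {
    "dates": [
        {"label": "creation", "begin": "2024"},
        {"label": "captured", "begin": "2024-03-01", "end": "2024-06-15"},
    ]
}
captured = find_date_by_label(record, "captured")
```

With a lookup like this, update_dates could compare only the crawl/capture subrecord against the CDX/C and leave any archivist-supplied creation date alone.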

In addition, the date information pulled from the CDX/C should probably be in a date subrecord with a label like "crawled" or "captured" instead of "creation."
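
A minimal sketch of building such a subrecord from index timestamps, assuming the usual 14-digit CDX timestamp form (YYYYMMDDhhmmss). The function name, the "captured" label, and the output field names are assumptions for illustration, not the integration's existing behavior.

```python
from datetime import datetime
from typing import Iterable, Optional


def capture_date_from_timestamps(
    timestamps: Iterable[str], label: str = "captured"
) -> Optional[dict]:
    """Build a crawl/capture date subrecord from CDX-style timestamps.

    Returns None when there are no timestamps. A single crawl day yields a
    "single" date; several days yield an "inclusive" range.
    """
    parsed = sorted(
        datetime.strptime(ts[:14], "%Y%m%d%H%M%S") for ts in timestamps
    )
    if not parsed:
        return None

    begin = parsed[0].date().isoformat()
    end = parsed[-1].date().isoformat()
    if begin == end:
        return {"label": label, "date_type": "single", "begin": begin}
    return {"label": label, "date_type": "inclusive", "begin": begin, "end": end}


# Illustrative timestamps only; real values would come from the CDX/C index.
example = capture_date_from_timestamps(["20240615083000", "20240301120000"])
```

The label is a parameter so that "crawled" or "captured" can be chosen once the preferred term is settled.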
