Description
Archival object records that describe content in web archives format should probably allow for at least two types of date sub records:
- Date of Creation
This is the date the content itself was created. We can infer it from the content: for example, from a webpage that states "Last updated", from file names (meetingminutes_2024.pdf), or from the visual/textual content.
- Date of Crawl/Capture
This is the date or date range during which the web content was crawled. It does not reflect the actual creation date of the content, only the date the content was collected in web archives format.
This integration should allow an archivist to populate date subrecords with a date of creation. Hypothetically, we could try to infer dates of creation computationally by inspecting HTML elements, modification dates, etc., but this is out of scope for this integration. This type of date information will require human inference.
This integration should continue to pull crawl/capture dates from the CDX/C index.
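As a rough sketch of what pulling a crawl date range could look like: CDX/C timestamps are 14-digit strings (YYYYMMDDhhmmss), so a list of captures can be reduced to a begin/end pair. The function name `crawl_date_range` and the ISO date output format are assumptions for illustration, not the integration's current code.

```python
from datetime import datetime


def crawl_date_range(timestamps):
    """Return (begin, end) ISO dates covering a list of 14-digit CDX timestamps.

    Hypothetical helper: returns None when there are no captures.
    """
    if not timestamps:
        return None
    # Only the date part (first 8 digits) matters for a date subrecord.
    days = sorted(datetime.strptime(ts[:8], "%Y%m%d").date() for ts in timestamps)
    return days[0].isoformat(), days[-1].isoformat()
```

A single capture yields the same value for begin and end, which the caller could treat as a single date rather than a range.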
Presently, the integration searches for the first date subrecord of an archival object to check the date/date range against the CDX/C. The update_dates function will have to be updated to better handle multiple date subrecords.
In addition, the date information pulled from the CDX/C should probably be stored in a date subrecord with a label like "crawled" or "captured" instead of "creation".
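One possible shape for the change, sketched under assumptions: date subrecords are treated as plain dicts with `label`, `date_type`, `begin`, and `end` keys, and the capture date is matched by label rather than by position. The helper name `upsert_capture_date` and the label value "captured" are hypothetical; the label would need to exist as an allowed value in the target system, and the real update_dates function may look quite different.

```python
CAPTURE_LABEL = "captured"  # assumption: not necessarily a valid label today


def upsert_capture_date(dates, begin, end, label=CAPTURE_LABEL):
    """Return (new_dates, changed) with the capture date subrecord set.

    Subrecords with other labels (e.g. an archivist's "creation" date)
    are passed through untouched.
    """
    new_date = {"jsonmodel_type": "date", "label": label, "begin": begin}
    if end != begin:
        new_date.update(date_type="inclusive", end=end)
    else:
        new_date["date_type"] = "single"

    updated, found, changed = [], False, False
    for date in dates:
        if date.get("label") == label and not found:
            found = True
            merged = {**date, **new_date}
            if "end" not in new_date:
                merged.pop("end", None)  # a single date has no end
            changed = merged != date
            updated.append(merged)
        else:
            updated.append(date)
    if not found:
        updated.append(new_date)
        changed = True
    return updated, changed
```

Returning a `changed` flag lets the caller skip writing the record back when the CDX/C data already matches the stored capture date.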