New WARC header to provide hint where to find original records for revisit records. #111

@ikreymer

The revisit record generally includes WARC-Refers-To-Target-URI and WARC-Refers-To-Target-Date to indicate the URL / timestamp
of the original record that the revisit refers to.
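As a rough illustration of how a reader uses these two existing headers, the sketch below pulls the (URI, date) lookup key out of a revisit record's header block. The sample record, the function names, and the choice to lower-case field names are assumptions made for this example, not part of any particular WARC library.

```python
# Hypothetical sample: the header block of a revisit record (values are made up).
SAMPLE_REVISIT = """WARC/1.1
WARC-Type: revisit
WARC-Target-URI: https://example.com/
WARC-Date: 2024-02-01T00:00:00Z
WARC-Refers-To-Target-URI: https://example.com/
WARC-Refers-To-Target-Date: 2024-01-01T00:00:00Z
"""


def parse_warc_headers(block: str) -> dict[str, str]:
    """Parse 'Name: value' lines into a dict keyed by lower-cased field name."""
    headers = {}
    for line in block.splitlines():
        if ":" not in line:
            continue  # skips the 'WARC/1.1' version line and blank lines
        # Split on the first colon only, since URIs and dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def original_lookup_key(headers: dict[str, str]):
    """Return (uri, date) of the original record, or None if unavailable."""
    if headers.get("warc-type") != "revisit":
        return None
    uri = headers.get("warc-refers-to-target-uri")
    date = headers.get("warc-refers-to-target-date")
    if uri is None or date is None:
        return None
    return (uri, date)


key = original_lookup_key(parse_warc_headers(SAMPLE_REVISIT))
print(key)
```

The key says which capture to look for, but not which file or collection holds it.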

It would also be useful to provide an additional hint as to where this data may be found, through some sort of user-defined identifier
which, if provided, could be used to help find the original records. The header value can be a URI.

Some name ideas for this header:

  • WARC-Revisit-Original-Source
  • WARC-Revisit-Original-Location
  • WARC-Revisit-Original-Identifier

For our use case, we would use this to store the name of a WACZ file that contains the original, but not necessarily its location, as that may change, e.g.:

WARC-Revisit-Original-Source: file://somefile.wacz
WARC-Revisit-Original-Source: urn:wacz:somefile.wacz

Other identifiers:

WARC-Revisit-Original-Source: https://example.com/path/to/known.warc.gz
WARC-Revisit-Original-Source: urn:sha-256:<...>
WARC-Revisit-Original-Source: urn:customurn:...

The idea is to make this generic so that it could be used for a variety of identifiers, though we could also go with something more specific (like WARC-Revisit-Original-WACZ-Filename instead).
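To make the generic option more concrete, here is a minimal sketch of how a reader might interpret such a value. It assumes the proposed (not yet standardized) WARC-Revisit-Original-Source header and the example values above. The function names, the kind labels, and the `wacz_locations` catalog are hypothetical, introduced only for illustration.

```python
from urllib.parse import urlsplit


def describe_source_hint(value: str) -> tuple[str, str]:
    """Classify a hint value as (kind, identifier). Kind labels are made up here."""
    value = value.strip()
    parts = urlsplit(value)
    scheme = parts.scheme.lower()
    if scheme == "urn":
        # urn:<namespace>:<rest>, e.g. urn:wacz:somefile.wacz
        namespace, _, rest = parts.path.partition(":")
        return (f"urn:{namespace.lower()}", rest)
    if scheme == "file":
        # file://somefile.wacz puts the name in netloc; file:///dir/x.wacz in path
        return ("file", parts.netloc + parts.path)
    if scheme in ("http", "https"):
        return ("url", value)
    return ("unknown", value)


def locate_original(value: str, wacz_locations: dict[str, str]):
    """Map a hint to a current location, using a caller-supplied name catalog."""
    kind, ident = describe_source_hint(value)
    if kind in ("file", "urn:wacz"):
        # The hint carries a name only; the location comes from the catalog.
        return wacz_locations.get(ident)
    if kind == "url":
        return ident
    return None  # unrecognized hints are ignored, since the header is only a hint


# Hypothetical catalog mapping WACZ file names to wherever they live today.
catalog = {"somefile.wacz": "/archive/2024/somefile.wacz"}
print(locate_original("urn:wacz:somefile.wacz", catalog))
```

Keeping the name separate from the location means the catalog can change when files move, without rewriting the revisit records.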

Would anyone else use such a header if it existed? Thoughts on the name, e.g. "source" vs "location"?
