WarcRecordWriter to write and index WAT/WET files #9

@sebastian-nagel

Currently, the fetcher writes only WARC and CDX files, while WAT and WET files are generated afterwards from the WARC files using CC's fork of the webarchive-commons library. Inlining the WAT/WET generation would make it possible to add the WAT/WET record offsets to the CDX index. This is a frequent request from Common Crawl users (e.g. 1, 2, 3).

Running the WAT/WET generation in the fetcher requires it to be sufficiently fast and absolutely robust; otherwise crawled data is lost.

  • profile Fetcher reducer and WARC writer, improve performance, see WarcRecordWriter performance improvements #8
  • profile the WAT/WET extractor and improve performance, see /WAT/WET generator performance improvements ia-web-commons#15. Note: if data structures already held in memory are used instead of re-reading WARC records (see next point), the WAT/WET extraction should become faster without any further changes.
  • make WAT/WET extraction (WEATGenerator, ResourceFactory implementations, see mapper) callable without passing the WARC record as an argument:
    • avoid decompressing and parsing of the WARC record
    • use objects already in memory instead: payload byte[], HTTP headers
    • detect charset once, use it for language detection and WAT/WET extraction
    • make use of objects not present in WARC response records (e.g. store the detected language in WET files)
    • (in the long term) add non-HTML documents (PDF, office) to WET (WAT?)
  • push improvements upstream from ia-web-commons to webarchive-commons
  • add WAT/WET record offsets and lengths to CDX
    • WAT files also contain records for WARC request and metadata records; skip these?
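
To illustrate the last point, here is a minimal sketch of how record offsets and lengths could be captured while writing, assuming each record is compressed as its own gzip member (the usual layout for WARC files). The class `OffsetTrackingGzipWriter` and its nested `RecordLocation` are hypothetical names for illustration only; this is not the fetcher's actual WarcRecordWriter, and the in-memory output stands in for the real WAT/WET file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: tracks offset and compressed length of every record written. */
final class OffsetTrackingGzipWriter {

    /** Where one record landed in the output; these two numbers would go into the CDX entry. */
    static final class RecordLocation {
        final long offset;
        final long length;

        RecordLocation(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Stand-in for the real output file (assumption: in-memory buffer for the demo).
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    /** Compresses one record as a separate gzip member and returns its location. */
    RecordLocation write(byte[] record) throws IOException {
        long offset = out.size();
        GZIPOutputStream gzip = new GZIPOutputStream(out);
        gzip.write(record);
        gzip.finish(); // ends this gzip member but leaves the underlying stream open
        return new RecordLocation(offset, out.size() - offset);
    }

    byte[] toByteArray() {
        return out.toByteArray();
    }

    /** Reads a single record back using only its offset and length. */
    static byte[] read(byte[] file, RecordLocation loc) throws IOException {
        ByteArrayInputStream slice =
                new ByteArrayInputStream(file, (int) loc.offset, (int) loc.length);
        try (GZIPInputStream in = new GZIPInputStream(slice)) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        OffsetTrackingGzipWriter writer = new OffsetTrackingGzipWriter();
        RecordLocation first = writer.write("first record".getBytes(StandardCharsets.UTF_8));
        RecordLocation second = writer.write("second record".getBytes(StandardCharsets.UTF_8));
        byte[] restored = read(writer.toByteArray(), second);
        System.out.println(second.offset == first.length);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

Because the writer already knows the position before and after each record, the offsets come essentially for free when WAT/WET records are written in the same process that writes the CDX entries; a post-processing job would have to re-scan the files to recover them.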
