WarcRecordWriter to write and index WAT/WET files #9

Currently, the fetcher writes only WARC and CDX files, while WAT and WET files are generated from the WARC files using [CC's fork](/commoncrawl/ia-web-commons) of the [webarchive-commons](/iipc/webarchive-commons) library. In-lining the WAT/WET generation would make it possible to add the WAT/WET record offsets to the CDX index. This is a frequent request from Common Crawl users (e.g. [1](https://groups.google.com/d/topic/common-crawl/TOQDqmzfKM8/discussion), [2](https://groups.google.com/d/topic/common-crawl/xTER7EJ5kuc/discussion), [3](https://groups.google.com/d/topic/common-crawl/Fk8ISx013xs/discussion)).
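To illustrate why in-lining helps: if each record is written as its own gzip member, the writer knows the record's offset and compressed length at write time and can hand both to the CDX index. The following is only a minimal sketch under that assumption; `OffsetTrackingWriter` and `Location` are hypothetical names, not classes from the fetcher or from ia-web-commons.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical writer: one gzip member per record, offsets tracked while writing. */
class OffsetTrackingWriter {

    /** Hypothetical holder for what a CDX line would need to locate a record. */
    static final class Location {
        final long offset; // byte position of the gzip member in the output file
        final long length; // compressed length of the gzip member

        Location(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final OutputStream out;
    private long position = 0; // bytes written to 'out' so far

    OffsetTrackingWriter(OutputStream out) {
        this.out = out;
    }

    /** Compresses one serialized record as a separate gzip member and appends it. */
    Location write(byte[] record) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(record);
        }
        byte[] compressed = buffer.toByteArray();
        long offset = position;
        out.write(compressed);
        position += compressed.length;
        return new Location(offset, compressed.length);
    }
}
```

A reader can then fetch `length` bytes starting at `offset` and decompress them on their own, without scanning the rest of the file.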

Running the WAT/WET generation in the fetcher requires that it be sufficiently fast and absolutely robust; otherwise crawled data is lost.

- [ ] profile Fetcher reducer and WARC writer, improve performance, see #8 
- [ ] profile WAT/WET extractor and improve performance, see /commoncrawl/ia-web-commons#15. Note: if ready-made data structures are used instead of re-reading WARC records (see next point), the WAT/WET extraction should be faster without any other changes.
- [ ] make WAT/WET extraction ([WEATGenerator](/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java), ResourceFactory implementations, see [mapper](/commoncrawl/ia-web-commons/blob/master/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java)) callable without passing the WARC record as an argument:
  - avoid decompressing and parsing the WARC record
  - use objects that are already in memory instead: payload `byte[]`, HTTP headers
  - detect charset once, use it for language detection **and** WAT/WET extraction
  - make use of objects not present in WARC response records (e.g. store the detected language in WET files)
  - (in the long term) add non-HTML documents (PDF, office) to WET (WAT?)
- [ ] push improvements upstream from [ia-web-commons](/commoncrawl/ia-web-commons) to [webarchive-commons](/iipc/webarchive-commons)
- [ ] add WAT/WET record offsets and lengths to CDX
  - WAT files also contain records for WARC request and metadata records: skip these?
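
The "ready objects" idea from the task list could look roughly like the sketch below: the fetcher hands over the payload and HTTP headers it already holds, and the charset is resolved once and reused by every consumer (language detection, WAT/WET extraction). All names here (`FetchedDocument`, `charset()`, `decodedContent()`) are hypothetical and not part of ia-web-commons. The charset lookup is a deliberately naive stand-in that only reads the `Content-Type` header and falls back to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for data the fetcher already has; no WARC record involved. */
class FetchedDocument {

    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    final String url;
    final Map<String, String> httpHeaders; // assumed to be keyed by canonical header name
    final byte[] payload;

    private Charset charset;        // resolved lazily, then reused
    private String decodedContent;  // decoded lazily, then reused

    FetchedDocument(String url, Map<String, String> httpHeaders, byte[] payload) {
        this.url = url;
        this.httpHeaders = httpHeaders;
        this.payload = payload;
    }

    /** Resolves the charset on first call; later calls return the cached value. */
    Charset charset() {
        if (charset == null) {
            charset = detectCharset();
        }
        return charset;
    }

    /** Payload decoded with the shared charset, for all downstream consumers. */
    String decodedContent() {
        if (decodedContent == null) {
            decodedContent = new String(payload, charset());
        }
        return decodedContent;
    }

    // Naive placeholder: a real detector would also inspect the payload itself.
    private Charset detectCharset() {
        String contentType = httpHeaders.get("Content-Type");
        if (contentType != null) {
            Matcher m = CHARSET_PARAM.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // illegal or unsupported charset name: use the fallback below
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

Fields not present in the WARC response record, such as a detected language, could be added to such a holder and written into the WET record from there.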

