forked from Aloisius/nutch
Open
Description
Currently, the fetcher writes only WARC and CDX files, while WAT and WET files are generated from the WARC files using CC's fork of the webarchive-commons library. Inlining the WAT/WET generation would make it possible to add the WAT/WET record offsets to the CDX index. This is a frequent request from Common Crawl users (e.g. 1, 2, 3).
Running the WAT/WET generation in the fetcher requires that it run sufficiently fast and be absolutely robust; otherwise crawled data is lost.
- profile Fetcher reducer and WARC writer, improve performance, see WarcRecordWriter performance improvements #8
- profile WAT/WET extractor and improve performance, see WAT/WET generator performance improvements ia-web-commons#15. Note: if already-available data structures are used instead of re-reading WARC records (see next point), the WAT/WET extraction should become faster without any further changes.
- make WAT/WET extraction (WEATGenerator, ResourceFactory implementations, see mapper) callable without the need to pass the WARC record as an argument:
  - avoid decompressing and parsing the WARC record
  - use already-available objects instead: payload byte[], HTTP headers
  - detect the charset once, use it for language detection and WAT/WET extraction
- make use of objects not present in WARC response records (e.g. store the detected language in WET files)
- (in the long term) add non-HTML documents (PDF, office) to WET (WAT?)
- push improvements upstream from ia-web-commons to webarchive-commons
- add WAT/WET record offsets and lengths to CDX
- WAT files also contain records for WARC request and metadata records; skip these?
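
To make the list above more concrete, here is a minimal Java sketch of two of the ideas: passing already-available objects (payload `byte[]`, HTTP headers, a charset detected once) to the extractor instead of a WARC record, and tracking the offset and length of each written record so they could later be added to the CDX. All names (`InlineExtractSketch`, `FetchedDoc`, `RecordLocation`, `OffsetTrackingWriter`, `toWetRecord`) are hypothetical and exist neither in Nutch nor in ia-web-commons. The WET record is a simplified placeholder, and the sketch assumes each record is written as its own gzip member.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch; not part of Nutch or ia-web-commons. */
final class InlineExtractSketch {

    /** Data the fetcher already holds in memory, so no WARC record must be re-read. */
    static final class FetchedDoc {
        final String url;
        final Map<String, String> httpHeaders;
        final byte[] payload;
        final Charset charset; // detected once, reused by all extractors

        FetchedDoc(String url, Map<String, String> httpHeaders,
                   byte[] payload, Charset charset) {
            this.url = url;
            this.httpHeaders = httpHeaders;
            this.payload = payload;
            this.charset = charset;
        }
    }

    /** Offset and compressed length of one record in an output file. */
    static final class RecordLocation {
        final long offset;
        final long length;

        RecordLocation(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Writes each record as a separate gzip member and remembers where it landed. */
    static final class OffsetTrackingWriter {
        private final OutputStream out;
        private long position = 0;

        OffsetTrackingWriter(OutputStream out) {
            this.out = out;
        }

        RecordLocation append(byte[] record) throws IOException {
            // Compress into a buffer first so the member length is known.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(record);
            }
            byte[] member = buf.toByteArray();
            out.write(member);
            RecordLocation loc = new RecordLocation(position, member.length);
            position += member.length;
            return loc;
        }
    }

    /**
     * Placeholder for WET extraction: decodes the payload with the charset
     * detected earlier. A real WET record needs more headers and real
     * text extraction; this only illustrates the call shape.
     */
    static byte[] toWetRecord(FetchedDoc doc) {
        String text = new String(doc.payload, doc.charset);
        String header = "WARC/1.0\r\n"
                + "WARC-Type: conversion\r\n"
                + "WARC-Target-URI: " + doc.url + "\r\n\r\n";
        return (header + text).getBytes(StandardCharsets.UTF_8);
    }
}
```

With a writer like this for each of the WARC, WAT and WET outputs, the fetcher would have the offset and length of all three records at hand when it emits the CDX line for a capture. How these values would be named in the CDX is left open here.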