Commit eb87c7e

tooryx authored and copybara-github committed
Add a type field to the crawl results.
Currently, Tsunami stores the HTTP response directly in the crawl results. This new field supports several storage strategies. More specifically, in Goonami we only store a hash of the page for efficiency.

PiperOrigin-RevId: 869109670
Change-Id: I1cc03dc60006090c5c60319e505e941aba4da697
1 parent 6a9b6f4 commit eb87c7e

1 file changed, +12 -0 lines changed

proto/web_crawl.proto

Lines changed: 12 additions & 0 deletions
@@ -82,6 +82,13 @@ message HttpHeader {
   string value = 2;
 }
 
+// The type of content stored in the CrawlResult.
+enum CrawlContentType {
+  CONTENT_TYPE_UNSPECIFIED = 0;
+  CONTENT_TYPE_RAW = 1;
+  CONTENT_TYPE_HASH = 2;
+}
+
 message CrawlResult {
   // The target visited by the crawler.
   CrawlTarget crawl_target = 1;
@@ -100,4 +107,9 @@ message CrawlResult {
 
   // Http headers of the response
   repeated HttpHeader response_headers = 6;
+
+  // The type of content stored in the crawl_results. By default, the whole
+  // response body is stored (RAW). But some configuration can request storing
+  // only a hash of the response body (HASH).
+  CrawlContentType crawl_content_type = 7;
 }
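
The two storage strategies described in the commit message could be handled by a producer roughly as follows. This is a minimal sketch, not code from Tsunami or Goonami: the enum values mirror the proto above, but the helper name `content_to_store`, the choice of SHA-256, and the treatment of `CONTENT_TYPE_UNSPECIFIED` as raw storage are all assumptions made for illustration.

```python
import hashlib
from enum import IntEnum


class CrawlContentType(IntEnum):
    # Values mirror the enum added in proto/web_crawl.proto.
    CONTENT_TYPE_UNSPECIFIED = 0
    CONTENT_TYPE_RAW = 1
    CONTENT_TYPE_HASH = 2


def content_to_store(body: bytes, content_type: CrawlContentType) -> bytes:
    """Hypothetical helper: return the bytes a crawler would persist."""
    if content_type == CrawlContentType.CONTENT_TYPE_HASH:
        # Assumption: SHA-256. The commit does not name the hash function.
        return hashlib.sha256(body).digest()
    # Assumption: UNSPECIFIED falls back to the documented default (RAW),
    # so the whole response body is kept.
    return body
```

A consumer reading a CrawlResult would check `crawl_content_type` before interpreting the stored bytes, since a hash cannot be parsed as a response body.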
