Commit d55fb33

http collector: Add Chunking parameters
Large files in the queue need to be split. Chunking was previously only available for the file and mail URL collectors; this commit adds it to the HTTP collector.
1 parent a18d184 commit d55fb33

3 files changed: +22 -4 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ Please refer to the [NEWS](NEWS.md) for a list of changes which have an affect o
 
 ### Bots
 #### Collectors
+- `intelmq.bots.collectors.http.collector_http`: Add Chunking parameters to handle big files (PR#2684 by Sebastian Wagner).
 
 #### Parsers
 
docs/user/bots.md

Lines changed: 8 additions & 0 deletions
@@ -321,6 +321,14 @@ This requires the [python-gnupg](https://pypi.org/project/python-gnupg/) library
 (optional, string) If specified, the string represents path to keyring file. Otherwise the PGP keyring file of the
 current `intelmq` user is used.
 
+**Chunking**
+
+For line-based inputs the bot can split up large reports into smaller chunks. This is particularly important for setups
+that use Redis as a message queue which has a per-message size limitation of 512 MB. To configure chunking,
+set `chunk_size` to a value in bytes. `chunk_replicate_header` determines whether the header line should be repeated for
+each chunk that is passed on to a parser bot. Specifically, to configure a large file input to work around Redis size
+limitation set `chunk_size` to something like 384000000 (~384 MB).
+
 ---
 
 ### Generic URL Stream Fetcher <div id="intelmq.bots.collectors.http.collector_http_stream" />
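
The chunking behaviour described in the added documentation can be illustrated with a minimal, self-contained sketch. It is not IntelMQ's implementation: the function name `split_lines_into_chunks` and its exact cutting rule are assumptions made for illustration. It cuts only at line ends, optionally repeats the first line as a header in every chunk, and passes the data through unsplit when `chunk_size` is `None`. For scale, the suggested value of 384000000 bytes stays below 512 MB whether that limit is read as 512,000,000 or as 536,870,912 bytes.

```python
from typing import Iterator, Optional


def split_lines_into_chunks(data: bytes, chunk_size: Optional[int],
                            replicate_header: bool = True) -> Iterator[bytes]:
    """Hypothetical helper: yield pieces of `data`, cut at line boundaries.

    A chunk may exceed `chunk_size` only if a single line (plus header)
    is already larger than the limit.
    """
    if chunk_size is None:
        # Chunking disabled: pass the payload through unchanged.
        if data:
            yield data
        return

    lines = data.splitlines(keepends=True)
    header = b""
    if replicate_header and lines:
        # Treat the first line as a header that every chunk starts with.
        header, lines = lines[0], lines[1:]

    current = header
    for line in lines:
        has_body = len(current) > len(header)
        # Close the chunk when the next line would push it over the limit.
        if has_body and len(current) + len(line) > chunk_size:
            yield current
            current = header
        current += line
    if current:
        yield current


csv_data = b"h\n1\n2\n3\n"
chunks = list(split_lines_into_chunks(csv_data, chunk_size=5))
# chunks == [b"h\n1\n", b"h\n2\n", b"h\n3\n"]
```

With `replicate_header=False` the first line is treated like any other line, so only the first chunk carries it; a parser that needs column names would then fail on later chunks, which is presumably why the bot defaults `chunk_replicate_header` to `True`.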

intelmq/bots/collectors/http/collector_http.py

Lines changed: 13 additions & 4 deletions
@@ -27,11 +27,14 @@
 gpg_keyring: none (defaults to user's GPG keyring) or string (path to keyring file)
 """
 from datetime import datetime, timedelta
+from typing import Optional
+from io import BytesIO
 
 from intelmq.lib.bot import CollectorBot
 from intelmq.lib.mixins import HttpMixin
 from intelmq.lib.utils import unzip
 from intelmq.lib.exceptions import MissingDependencyError
+from intelmq.lib.splitreports import generate_reports
 
 try:
     import gnupg
@@ -64,6 +67,9 @@ class HTTPCollectorBot(CollectorBot, HttpMixin):
     signature_url_formatting: bool = False
     ssl_client_certificate: str = None  # TODO: pathlib.Path
     verify_pgp_signatures: bool = False
+    # splitreports
+    chunk_replicate_header: bool = True
+    chunk_size: Optional[int] = None
 
     def init(self):
         self.use_gpg = self.verify_pgp_signatures
@@ -130,12 +136,15 @@ def process(self):
                                 return_names=True, logger=self.logger)
 
         for file_name, raw_report in raw_reports:
-            report = self.new_report()
-            report.add("raw", raw_report)
-            report.add("feed.url", http_url)
+            template = self.new_report()
+            template.add("raw", raw_report)
+            template.add("feed.url", http_url)
             if file_name:
                 report.add("extra.file_name", file_name)
-            self.send_message(report)
+            for report in generate_reports(template, BytesIO(resp.content),
+                                           self.chunk_size,
+                                           self.chunk_replicate_header):
+                self.send_message(report)
 
     def format_url(self, url: str, formatting) -> str:
         try:
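
Two details in the last hunk look unintended: the unchanged context line still calls `report.add("extra.file_name", ...)` although the variable was renamed to `template`, and the chunker is fed `BytesIO(resp.content)` rather than the per-file `raw_report`. The sketch below shows one way the loop could be arranged so that the file name lands on the template and each extracted file is chunked separately. It is an assumption about the intent, not the project's code: reports are plain dicts, payloads are assumed to be bytes, and `generate_reports_stub` is a hypothetical stand-in that only mirrors the argument order used in the diff.

```python
from io import BytesIO
from typing import BinaryIO, Dict, Iterator, List, Optional, Tuple

Report = Dict[str, object]  # stand-in for an IntelMQ report object


def generate_reports_stub(template: Report, infile: BinaryIO,
                          chunk_size: Optional[int],
                          copy_header_line: bool) -> Iterator[Report]:
    """Hypothetical stand-in: copy the template once per line-based chunk."""
    lines = infile.read().splitlines(keepends=True)
    if not lines:
        return
    chunking = chunk_size is not None
    header = lines.pop(0) if (chunking and copy_header_line) else b""
    chunk = header
    for line in lines:
        full = chunking and len(chunk) > len(header) \
            and len(chunk) + len(line) > chunk_size
        if full:
            yield {**template, "raw": chunk}
            chunk = header
        chunk += line
    yield {**template, "raw": chunk}


def collect(raw_reports: List[Tuple[Optional[str], bytes]], http_url: str,
            chunk_size: Optional[int],
            chunk_replicate_header: bool) -> List[Report]:
    """Build one template per file, then emit one report per chunk."""
    sent: List[Report] = []
    for file_name, raw_report in raw_reports:
        template: Report = {"feed.url": http_url}
        if file_name:
            # Set on the template so every chunk inherits the file name.
            template["extra.file_name"] = file_name
        for report in generate_reports_stub(template, BytesIO(raw_report),
                                            chunk_size,
                                            chunk_replicate_header):
            sent.append(report)  # the bot would call self.send_message(report)
    return sent


reports = collect([("a.csv", b"h\n1\n2\n"), (None, b"x\ny\n")],
                  "https://example.invalid/feed", chunk_size=4,
                  chunk_replicate_header=True)
```

Because the template is copied for every chunk, fields such as `feed.url` and `extra.file_name` only need to be set once per file.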
