Description
The Shadowserver connector accumulates all STIX objects for all reports of a single day in memory before yielding a bundle. For customers subscribed to many Shadowserver report types, a single day can produce millions of STIX objects, causing the connector to OOM before completing even one day of processing.
The root cause is in connector.py _collect_intelligence(): the per-day processing loop collects all report data via ThreadPoolExecutor, extends a single stix_objects list with every report's transformed output, then calls remove_duplicates() (which creates a second copy of the full list), and only then yields to the bundle sender.
Environment
- On-prem 7.260317.0
Reproducible Steps
Steps to create the smallest reproducible scenario:
1. Deploy the Shadowserver connector with a valid API key and secret.
2. Subscribe to a large number of Shadowserver report types (or leave SHADOWSERVER_REPORT_TYPES empty to receive all available reports).
3. Set SHADOWSERVER_INITIAL_LOOKBACK to 1 (a single day).
4. Start the connector.
5. Observe memory consumption during the first collection cycle.
Expected Output
The connector should process and yield STIX bundles incrementally (per-report or in configurable batch sizes) so that memory usage remains bounded regardless of the number of reports or rows per day.
Actual Output
The connector accumulates all STIX objects for all reports of a single day in a single in-memory list before yielding. The processing flow is:
1. For each day, a ThreadPoolExecutor(max_workers=8) downloads all reports in parallel.
2. Each report's CSV is parsed row by row; ShadowserverStixTransformation generates multiple STIX objects per row (Identity, Artifact with the base64-encoded CSV, ObservedData, Notes, IPs, ASNs, hostnames, etc.).
3. All results are .extend()-ed into a single stix_objects list.
4. remove_duplicates() creates a second copy of the full list, doubling peak memory.
5. Only then does the method yield the bundle to the sender.
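The flow above can be sketched in simplified form. This is not the connector's actual code; transform_report and the single-bundle yield are illustrative stand-ins for the accumulation pattern described:

```python
from concurrent.futures import ThreadPoolExecutor

def transform_report(report):
    # Stand-in for download + ShadowserverStixTransformation:
    # several STIX objects are produced per CSV row.
    return [f"{report}-obj-{i}" for i in range(3)]

def remove_duplicates(objects):
    # Builds a second full copy of the list, doubling peak memory.
    seen, unique = set(), []
    for obj in objects:
        if obj not in seen:
            seen.add(obj)
            unique.append(obj)
    return unique

def collect_intelligence(reports_for_day):
    stix_objects = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for result in pool.map(transform_report, reports_for_day):
            stix_objects.extend(result)  # every report held in memory
    stix_objects = remove_duplicates(stix_objects)  # second full copy
    yield stix_objects  # one giant bundle for the whole day
```

Memory grows with the total object count for the day, which is why the climb is linear until the OOM kill.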
Additional information
For high-volume subscriptions, this results in OOM before a single day completes processing. The connector restarts and retries the same day, creating an infinite OOM loop.
Observed in a Kubernetes deployment with a 16 GiB memory limit on the connector pod:
- Memory pattern: sawtooth; a linear climb from ~2 GiB to ~14 GiB over ~80 minutes, then an OOM kill and restart
- Pod restarts: three OOM kills visible in a ~90-minute window
- CPU: pegged at ~1 core (Python GIL bound) with 4 cores allocated; CPU is not the bottleneck
- Outcome: the connector never completes processing a single day's reports
The ThreadPoolExecutor improves download speed but does not reduce peak memory, since all downloaded and transformed data is accumulated before yielding.
Suggested Fix
Option A (minimal change): Yield per-report instead of per-day. After each report is downloaded and transformed, yield its STIX objects as a bundle immediately rather than accumulating into the day-level list. This bounds memory to the size of a single report's output.
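A minimal sketch of Option A, assuming a hypothetical fetch_and_transform helper (not the connector's real function names):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_and_transform(report):
    # Illustrative stand-in for download + STIX transformation.
    return [f"{report}-obj-{i}" for i in range(3)]

def collect_intelligence(reports_for_day):
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch_and_transform, r)
                   for r in reports_for_day]
        for future in as_completed(futures):
            # Yield each report's objects as soon as they are ready;
            # peak memory is bounded by the largest single report,
            # provided the consumer sends each bundle promptly.
            yield future.result()
```

This keeps the parallel-download speedup while letting the bundle sender drain results as they complete.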
Option B (more robust): Implement chunked bundle sending with a configurable batch size (e.g., SHADOWSERVER_BATCH_SIZE). Accumulate STIX objects and yield a bundle every N objects, regardless of report boundaries. This provides predictable memory usage.
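A possible shape for Option B, where the batch size would come from the proposed SHADOWSERVER_BATCH_SIZE setting (a sketch, not the actual implementation):

```python
def chunked_bundles(stix_objects, batch_size):
    # Flush a bundle every `batch_size` objects, regardless of which
    # report each object came from.
    batch = []
    for obj in stix_objects:
        batch.append(obj)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

Peak memory then scales with batch_size rather than with the day's total row count.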
Both options should also address the remove_duplicates() copy: consider in-place deduplication or deduplication per-chunk rather than on the full day's output.
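One way to do per-chunk deduplication without a full-day copy, assuming STIX objects are dicts keyed by "id" (a sketch; the real remove_duplicates may compare differently):

```python
def dedup_chunk(chunk, seen_ids):
    # Drop objects whose STIX id was already emitted; `seen_ids`
    # persists across chunks, so duplicates spanning chunk
    # boundaries are still caught, and only one chunk is ever
    # copied at a time.
    unique = []
    for obj in chunk:
        if obj["id"] not in seen_ids:
            seen_ids.add(obj["id"])
            unique.append(obj)
    return unique
```

The seen_ids set grows with the number of unique ids, which is far smaller than holding two copies of every full object.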