fix: Avoid mutating Scrapy request userData during conversion #978

Draft
vdusek wants to merge 2 commits into master from fix/scrapy-request-user-data-mutation

Conversation

vdusek (Contributor) commented Jun 12, 2026
`to_apify_request` serialized the Scrapy request after `Request.from_url()` had already mutated the shared `meta['userData']` dict (same bug class as #832).

`Request.from_url()` injects a live `CrawleeRequestData` object under `__crawlee` into the very `user_data` dict it receives, which in this case was the spider's own `meta['userData']`. Because `scrapy_request.to_dict()` ran afterward, two things went wrong:

  • the spider's own request `meta['userData']` was mutated in place, and
  • the serialized `scrapy_request` blob stored on the platform embedded redundant Crawlee internals in every request.
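
The two symptoms above can be reproduced without Scrapy or Crawlee installed. The sketch below is only an illustration: `FakeCrawleeData`, `fake_from_url`, and `convert_buggy` are hypothetical stand-ins invented here, and `copy.deepcopy(meta)` stands in for `scrapy_request.to_dict()`. It is not the actual `to_apify_request` code.

```python
import copy


class FakeCrawleeData:
    """Hypothetical stand-in for the live object injected under '__crawlee'."""


def fake_from_url(url, user_data):
    # Hypothetical stand-in: like the behavior described above, it writes
    # into the very dict it receives instead of a private copy.
    user_data["__crawlee"] = FakeCrawleeData()
    return {"url": url, "user_data": user_data}


def convert_buggy(url, meta):
    user_data = meta["userData"]              # same object as the spider's dict
    request = fake_from_url(url, user_data)   # mutates meta["userData"] in place
    blob = copy.deepcopy(meta)                # stand-in for to_dict(), taken too late
    return request, blob


spider_meta = {"userData": {"label": "detail"}}
request, blob = convert_buggy("https://example.com", spider_meta)

# Both symptoms show up: the spider's dict and the serialized snapshot
# each carry the injected key.
leaked_into_meta = "__crawlee" in spider_meta["userData"]
leaked_into_blob = "__crawlee" in blob["userData"]
print(leaked_into_meta, leaked_into_blob)
```

Both flags come out true because the serialized snapshot is taken only after the shared dict has been written to.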

Fix: capture `scrapy_request.to_dict()` before calling `from_url()`, and pass `from_url()` a copy of `user_data` (`dict(user_data)` for the plain-dict branch; `model_dump()` already returns a fresh dict). The spider's request stays untouched and the stored blob is free of injected Crawlee data.
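
A minimal sketch of that ordering, using hypothetical stand-ins rather than the real SDK code: `fake_from_url` mimics a constructor that writes into the dict it is given, `convert_fixed` is an invented name, and `copy.deepcopy(meta)` stands in for `scrapy_request.to_dict()`.

```python
import copy


def fake_from_url(url, user_data):
    # Hypothetical stand-in for a constructor that injects a key into
    # whatever dict it is handed.
    user_data["__crawlee"] = object()
    return {"url": url, "user_data": user_data}


def convert_fixed(url, meta):
    # 1. Take the serialized snapshot first, while the meta is still clean.
    blob = copy.deepcopy(meta)
    # 2. Hand the constructor a copy, so the injection lands on the copy
    #    and not on the spider's own dict.
    user_data = dict(meta["userData"])
    request = fake_from_url(url, user_data)
    return request, blob


spider_meta = {"userData": {"label": "detail"}}
request, blob = convert_fixed("https://example.com", spider_meta)

meta_is_clean = "__crawlee" not in spider_meta["userData"]
blob_is_clean = "__crawlee" not in blob["userData"]
request_has_internals = "__crawlee" in request["user_data"]
print(meta_is_clean, blob_is_clean, request_has_internals)
```

Note that `dict(...)` is a shallow copy. That is enough to isolate a top-level key insertion like the one described here, but nested values are still shared between the copy and the original.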

Added two regression tests covering both the no-mutation guarantee and the clean serialized blob.

`Request.from_url()` injects a live `CrawleeRequestData` into the `user_data`
dict it receives, which was the spider's own `meta['userData']`. Serialize the
Scrapy request before `from_url()` and pass it a copy so the spider's request
stays untouched and the stored blob is free of redundant Crawlee internals.
@vdusek vdusek added adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. labels Jun 12, 2026
@vdusek vdusek self-assigned this Jun 12, 2026
@github-actions github-actions Bot added this to the 142nd sprint - Tooling team milestone Jun 12, 2026
@github-actions github-actions Bot added the tested Temporary label used only programatically for some analytics. label Jun 12, 2026
codecov Bot commented Jun 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.91%. Comparing base (0daca28) to head (1188131).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #978   +/-   ##
=======================================
  Coverage   89.90%   89.91%           
=======================================
  Files          49       49           
  Lines        3091     3093    +2     
=======================================
+ Hits         2779     2781    +2     
  Misses        312      312           
Flag Coverage Δ
e2e 35.88% <0.00%> (-0.03%) ⬇️
integration 56.83% <0.00%> (-0.04%) ⬇️
unit 78.75% <100.00%> (+0.01%) ⬆️
