GitHub Issue for Crawlee Team
Title
Dataset get_data() returns fewer items than successfully pushed with context.push_data() in PlaywrightCrawler
Labels
t-tooling, bug
Description
Problem
When using context.push_data() in a PlaywrightCrawler's default request handler, all push_data() calls complete successfully (no errors thrown), but subsequent calls to crawler.get_data() or Dataset.get_data() return only a subset of the pushed items.
This appears to be related to the storage buffering mechanism mentioned in issue #1532, but affects success handlers rather than error handlers.
Expected Behavior
- Push 4 items with context.push_data() → All 4 succeed
- Query with crawler.get_data() → Should return all 4 items
Actual Behavior
- Push 4 items with context.push_data() → All 4 succeed
- Query with crawler.get_data() → Returns only 2 items (non-deterministic which ones)
- Crawler statistics show requests_finished: 4
- No errors or warnings logged
Impact
- Integration tests become flaky and unreliable
- Cannot verify all crawled data was persisted
- Affects production data quality assurance
- Makes it impossible to know if data loss occurred
Steps to Reproduce
Minimal Reproduction
import asyncio
import tempfile
from pathlib import Path

from crawlee.configuration import Configuration
from crawlee.crawlers import PlaywrightCrawler


async def main():
    # Create a unique storage directory
    with tempfile.TemporaryDirectory(prefix="crawlee_test_") as tmpdir:
        storage_path = Path(tmpdir)

        # Configure the crawler with isolated storage
        config = Configuration(storage_dir=str(storage_path))
        crawler = PlaywrightCrawler(configuration=config)

        # Simple handler that pushes one item per request
        @crawler.router.default_handler
        async def handler(context):
            await context.push_data({
                "url": context.request.url,
                "title": "Test",
                "timestamp": str(context.request.loaded_url),
            })

        # Crawl 4 URLs (same page with different query params to avoid deduplication)
        test_url = "https://crawlee.dev"
        await crawler.run([
            f"{test_url}?test=1",
            f"{test_url}?test=2",
            f"{test_url}?test=3",
            f"{test_url}?test=4",
        ])

        # Allow time for persistence before querying
        await asyncio.sleep(2.0)

        # Query the dataset
        dataset_items = await crawler.get_data()
        print("Pushed: 4 items")
        print(f"Retrieved: {len(dataset_items.items)} items")
        print(f"URLs in dataset: {[item.get('url') for item in dataset_items.items]}")

        # Check statistics
        stats = await crawler.get_statistics()
        print(f"Requests finished: {stats['requests_finished']}")


if __name__ == "__main__":
    asyncio.run(main())

Expected Output
Pushed: 4 items
Retrieved: 4 items
URLs in dataset: [all 4 URLs]
Requests finished: 4
Actual Output
Pushed: 4 items
Retrieved: 1-2 items (varies)
URLs in dataset: [only 1-2 URLs, non-deterministic which ones]
Requests finished: 4
Evidence from Diagnostic Testing
We ran extensive diagnostic tests to isolate the issue:
Test Setup
- Crawled 4 identical pages (same content, different query params)
- Used unique query params to bypass URL deduplication
- All pages produce identical content fingerprints
- Isolated storage directory per test
Results
Crawlee Request Statistics:
┌─────────────────────┬─────┐
│ requests_finished │ 4 │
│ requests_failed │ 0 │
│ requests_total │ 4 │
└─────────────────────┴─────┘
Dataset Query Results:
- Items returned: 2
- URLs present: #2 and #4 (non-sequential)
- Missing URLs: #1 and #3
Key Findings
- All push_data() calls succeed - No exceptions thrown
- All requests complete successfully - Statistics confirm 4/4 finished
- Only ~50% appear in dataset - get_data() returns 2/4 items
- Non-deterministic selection - Which items appear varies between runs
- Not content-related - Same page/content still exhibits bug
- Polling doesn't help - Waiting 10+ seconds shows no improvement
Attempted Workarounds
1. Using Dataset.open() directly
from crawlee.storages import Dataset
dataset = await Dataset.open()
items = await dataset.get_data()

Result: Same behavior
2. Using crawler.get_data()
items = await crawler.get_data()

Result: Same behavior (both methods call Dataset.open() internally)
3. Specifying dataset_id explicitly
items = await crawler.get_data(dataset_id="default")

Result: Same behavior
4. Extended polling with delays
for _ in range(50):
    await asyncio.sleep(0.2)
    items = await dataset.get_data()
    if len(items.items) >= 4:
        break

Result: No improvement even after 10 seconds
5. Processing URLs individually vs batch
# Test 1: Process single URL - WORKS
await crawler.run([url1])
data = await crawler.get_data()
# Result: ✅ 100% of items returned
# Test 2: Process multiple URLs - FAILS
await crawler.run([url1, url2, url3])
data = await crawler.get_data()
# Result: ❌ Only ~66% of items returned

Result: Bug ONLY manifests during batch/concurrent processing.
Key Finding: Single URL processing works perfectly (100% success rate).
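Based on this finding, the interim mitigation we are evaluating is to process URLs one at a time instead of as a batch. A minimal sketch of our own code follows (it assumes the same crawler/handler setup as the minimal reproduction, and that crawler.run() can be invoked repeatedly on the same instance; it trades away concurrency):

# Sequential-run mitigation sketch (our code, not a recommended Crawlee pattern).
# Assumes the crawler and handler from the minimal reproduction above.
urls = [f"https://crawlee.dev?test={i}" for i in range(1, 5)]

for url in urls:
    # Process a single URL per run; single-URL runs returned 100% of items in our tests.
    await crawler.run([url])

# The default dataset persists across runs within the same configuration,
# so we query once at the end.
data = await crawler.get_data()
print(f"Retrieved: {len(data.items)} items")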
Related Issues
Issue #1532: "failed_request_handler runs and logs but context.push_data(...) does not write to dataset"
- Status: Closed (fixed in PR #1570, "fix: Make context helpers work in FailedRequestHandler and ErrorHandler")
- Scope: Fixed buffering for failed_request_handler and error_handler
- Relevance: Confirmed that Crawlee buffers storage operations
Our issue differs:
- #1532 ("failed_request_handler runs and logs but context.push_data(...) does not write to dataset (PlaywrightCrawler)"): Error handlers (now fixed)
- This issue: Success handlers (still broken)
- Same root cause: Crawlee's storage buffering mechanism
The maintainer's comment in #1532 mentioned:
"we buffer storage-related calls to prevent partial results from failing requests from being stored"
It appears this buffering mechanism has a bug that affects all handlers, not just error handlers.
Environment
- crawlee: 1.0.3
- Python: 3.12.11
- playwright: 1.57.0
- OS: macOS 14.x / Linux
Impact Assessment
Severity: High
- Data Loss: Cannot verify all crawled data is persisted
- Test Reliability: Integration tests become flaky
- Production Risk: Silent data loss in production crawls
- Debugging Difficulty: No errors or warnings logged
Workaround
Currently using conditional validations in tests:
# Validate whatever items are available
for item in dataset_items.items:
    # Validate each available item
    assert validate_item(item)

This is not acceptable for production, as it doesn't guarantee all data is persisted.
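A stricter variant we may adopt instead is a small helper of our own (not a Crawlee API) that polls for the expected item count and fails loudly, so data loss is at least detected rather than silently tolerated:

import asyncio

async def assert_item_count(dataset, expected: int, timeout: float = 10.0) -> None:
    """Poll the dataset until `expected` items are present, else raise."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        page = await dataset.get_data()
        if len(page.items) >= expected:
            return
        if loop.time() >= deadline:
            raise AssertionError(
                f"Dataset has {len(page.items)} items, expected {expected}"
            )
        await asyncio.sleep(0.2)

In our tests this would be called as await assert_item_count(dataset, expected=4), which currently fails for the reasons described above.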
Proposed Fix
- Ensure context.push_data() buffers are flushed before crawler.run() completes
- Add an explicit flush() or commit() method to Dataset/Context (see the sketch after this list)
- Make buffering configurable (or disable it for success handlers)
- Add warning logs when buffered data is dropped
- Document buffering behavior in API docs
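For the flush/commit proposal, here is a hypothetical usage sketch. To our knowledge no flush() method exists on Dataset in crawlee 1.0.3; this only illustrates the shape of the API we are asking for:

# Hypothetical API sketch for the proposed explicit flush/commit step.
# Dataset.flush() is NOT a real crawlee 1.0.3 method; it illustrates the proposal only.
from crawlee.storages import Dataset

dataset = await Dataset.open()
await dataset.flush()          # proposed: force buffered push_data() items to storage
page = await dataset.get_data()
assert len(page.items) == 4    # all pushed items are now queryable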
Request
Could you please:
- Confirm whether this is a known issue, or investigate it
- Provide a workaround until it is fixed
- Add a fix to the roadmap
- Consider adding integration tests that verify all pushed items are queryable
Additional Context
Happy to provide additional information or help test fixes!
Related: #1532
Component: Dataset, Storage Client, Context Pipeline
Crawlee Version: 1.0.3