
Dataset get_data() returns fewer items than successfully pushed with context.push_data() in PlaywrightCrawler #1621

@mabreuortega


Labels

  • t-tooling
  • bug

Description

Problem

When using context.push_data() in a PlaywrightCrawler's default request handler, all push_data() calls complete successfully (no errors thrown), but subsequent calls to crawler.get_data() or Dataset.get_data() return only a subset of the pushed items.

This appears to be related to the storage buffering mechanism mentioned in issue #1532, but affects success handlers rather than error handlers.

Expected Behavior

  • Push 4 items with context.push_data() → All 4 succeed
  • Query with crawler.get_data() → Should return all 4 items

Actual Behavior

  • Push 4 items with context.push_data() → All 4 succeed
  • Query with crawler.get_data() → Returns only 2 items (non-deterministic which ones)
  • Crawler statistics show requests_finished: 4
  • No errors or warnings logged

Impact

  • Integration tests become flaky and unreliable
  • There is no way to verify that all crawled data was persisted
  • Undermines production data quality assurance
  • Data loss is silent: nothing in the logs indicates that it occurred

Steps to Reproduce

Minimal Reproduction

import asyncio
from pathlib import Path
import tempfile
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.configuration import Configuration

async def main():
    # Create unique storage directory
    with tempfile.TemporaryDirectory(prefix="crawlee_test_") as tmpdir:
        storage_path = Path(tmpdir)

        # Configure crawler with isolated storage
        config = Configuration(storage_dir=str(storage_path))
        crawler = PlaywrightCrawler(configuration=config)

        # Simple handler that pushes one item per request
        @crawler.router.default_handler
        async def handler(context: PlaywrightCrawlingContext) -> None:
            await context.push_data({
                "url": context.request.url,
                "title": "Test",
                "loaded_url": str(context.request.loaded_url),
            })

        # Crawl 4 URLs (same page with different query params to avoid deduplication)
        test_url = "https://crawlee.dev"
        await crawler.run([
            f"{test_url}?test=1",
            f"{test_url}?test=2",
            f"{test_url}?test=3",
            f"{test_url}?test=4",
        ])

        # Give any deferred persistence a moment to complete
        await asyncio.sleep(2.0)

        # Query dataset
        dataset_items = await crawler.get_data()

        print(f"Pushed: 4 items")
        print(f"Retrieved: {len(dataset_items.items)} items")
        print(f"URLs in dataset: {[item.get('url') for item in dataset_items.items]}")

        # Check statistics
        stats = await crawler.get_statistics()
        print(f"Requests finished: {stats['requests_finished']}")

if __name__ == "__main__":
    asyncio.run(main())

Expected Output

Pushed: 4 items
Retrieved: 4 items
URLs in dataset: [all 4 URLs]
Requests finished: 4

Actual Output

Pushed: 4 items
Retrieved: 1-2 items (varies)
URLs in dataset: [only 1-2 URLs, non-deterministic which ones]
Requests finished: 4

Evidence from Diagnostic Testing

We ran extensive diagnostic tests to isolate the issue:

Test Setup

  • Crawled 4 identical pages (same content, different query params)
  • Used unique query params to bypass URL deduplication (see the sketch after this list)
  • All pages produce identical content fingerprints
  • Isolated storage directory per test
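
For reference, the distinct test URLs were generated roughly like this (standard library only; the base URL is the same test page used in the reproduction above):

from urllib.parse import urlencode

base_url = "https://crawlee.dev"  # same target page for every request

# A unique query parameter gives each request a distinct URL, so Crawlee's
# default URL-based deduplication does not collapse the four requests into one.
test_urls = [f"{base_url}?{urlencode({'test': i})}" for i in range(1, 5)]
# -> ['https://crawlee.dev?test=1', ..., 'https://crawlee.dev?test=4']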

Results

Crawlee Request Statistics:
┌─────────────────────┬─────┐
│ requests_finished   │ 4   │
│ requests_failed     │ 0   │
│ requests_total      │ 4   │
└─────────────────────┴─────┘

Dataset Query Results:
- Items returned: 2
- URLs present: #2 and #4 (non-sequential)
- Missing URLs: #1 and #3

Key Findings

  1. All push_data() calls succeed - No exceptions thrown
  2. All requests complete successfully - Statistics confirm 4/4 finished
  3. Only ~50% appear in dataset - get_data() returns 2/4 items
  4. Non-deterministic selection - Which items appear varies between runs
  5. Not content-related - Same page/content still exhibits bug
  6. Polling doesn't help - Waiting 10+ seconds shows no improvement

Attempted Workarounds

1. Using Dataset.open() directly

from crawlee.storages import Dataset
dataset = await Dataset.open()
items = await dataset.get_data()

Result: Same behavior

2. Using crawler.get_data()

items = await crawler.get_data()

Result: Same behavior (both methods call Dataset.open() internally)

3. Specifying dataset_id explicitly

items = await crawler.get_data(dataset_id="default")

Result: Same behavior

4. Extended polling with delays

for _ in range(50):
    await asyncio.sleep(0.2)
    items = await dataset.get_data()
    if len(items.items) >= 4:
        break

Result: No improvement even after 10 seconds

5. Processing URLs individually vs batch

# Test 1: Process single URL - WORKS
await crawler.run([url1])
data = await crawler.get_data()
# Result: ✅ 100% of items returned

# Test 2: Process multiple URLs - FAILS
await crawler.run([url1, url2, url3])
data = await crawler.get_data()
# Result: ❌ Only ~66% of items returned

Result: Bug ONLY manifests during batch/concurrent processing.
Key Finding: Single URL processing works perfectly (100% success rate).

Related Issues

Issue #1532: "failed_request_handler runs and logs but context.push_data(...) does not write to dataset"

Our issue differs: the data loss happens in the default (success) request handler, where every request finishes successfully, rather than in failed_request_handler.

The maintainer's comment in #1532 mentioned:

"we buffer storage-related calls to prevent partial results from failing requests from being stored"

It appears this buffering mechanism has a bug that affects all handlers, not just error handlers.
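
To make the suspected failure mode concrete, here is a purely illustrative sketch of how a per-request buffer that is only committed on success could silently drop items under concurrency. The class and method names (RequestBuffer, commit) are invented for illustration and do not correspond to Crawlee internals:

import asyncio

dataset: list[dict] = []  # stands in for the persisted dataset


class RequestBuffer:
    """Hypothetical per-request buffer, loosely modelled on the quote above."""

    def __init__(self) -> None:
        self._pending: list[dict] = []

    async def push_data(self, item: dict) -> None:
        # push_data() only appends to an in-memory buffer, so it never raises here.
        self._pending.append(item)

    async def commit(self) -> None:
        # Only a commit moves buffered items into the dataset.
        dataset.extend(self._pending)
        self._pending.clear()


async def handle_request(i: int, *, commit: bool) -> None:
    buffer = RequestBuffer()
    await buffer.push_data({"request": i})  # always "succeeds"
    if commit:
        await buffer.commit()
    # If the commit is skipped, raced, or lost, the pushed item silently disappears.


async def main() -> None:
    # Simulate 4 concurrent requests where 2 commits are lost.
    await asyncio.gather(*(handle_request(i, commit=i % 2 == 0) for i in range(1, 5)))
    print(f"Pushed: 4, persisted: {len(dataset)}")  # -> Pushed: 4, persisted: 2


asyncio.run(main())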

Environment

- crawlee: 1.0.3
- Python: 3.12.11
- playwright: 1.57.0
- OS: macOS 14.x / Linux

Impact Assessment

Severity: High

  • Data Loss: Cannot verify all crawled data is persisted
  • Test Reliability: Integration tests become flaky
  • Production Risk: Silent data loss in production crawls
  • Debugging Difficulty: No errors or warnings logged

Workaround

Currently using conditional validations in tests:

# Validate whatever items happen to be available
for item in dataset_items.items:
    assert validate_item(item)

This is not acceptable for production as it doesn't guarantee all data is persisted.
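
A slightly stricter interim check, built only on the crawler.get_data() call shown above, at least turns the silent loss into a visible test failure. It does not recover the missing items, and per workaround 4 the extra polling rarely helps:

import asyncio
import time

async def assert_dataset_count(crawler, expected: int, *, timeout: float = 10.0) -> None:
    """Poll the dataset and fail loudly if fewer than `expected` items appear in time."""
    deadline = time.monotonic() + timeout
    while True:
        result = await crawler.get_data()
        if len(result.items) >= expected:
            return
        if time.monotonic() >= deadline:
            raise AssertionError(
                f"expected {expected} dataset items, got {len(result.items)}: "
                f"{[item.get('url') for item in result.items]}"
            )
        await asyncio.sleep(0.2)

# In a test, after crawler.run(urls):
#     await assert_dataset_count(crawler, expected=len(urls))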

Proposed Fix

  1. Ensure context.push_data() buffers are flushed before crawler.run() completes
  2. Add explicit flush() or commit() method to Dataset/Context (see the sketch after this list)
  3. Make buffering configurable (or disable for success handlers)
  4. Add warning logs when buffered data is dropped
  5. Document buffering behavior in API docs
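
To make item 2 concrete, the intended usage would look roughly like this. This is a hypothetical sketch of a proposed API; Dataset.flush() does not exist in Crawlee 1.0.3:

from crawlee.storages import Dataset

async def verify_after_run(expected: int) -> None:
    dataset = await Dataset.open()
    await dataset.flush()  # proposed: force any buffered push_data() items to be persisted
    items = await dataset.get_data()
    assert len(items.items) == expected, f"expected {expected}, got {len(items.items)}"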

Request

Could you please:

  1. Confirm whether this is a known issue, or investigate it
  2. Suggest a workaround until it is fixed
  3. Add a fix to the roadmap
  4. Consider adding integration tests that verify all pushed items are queryable

Additional Context

Happy to provide additional information or help test fixes!


Related: #1532
Component: Dataset, Storage Client, Context Pipeline
Crawlee Version: 1.0.3
