
Dataset get_data() returns fewer items than successfully pushed with context.push_data() in PlaywrightCrawler #1621

@mabreuortega


Labels

  • t-tooling
  • bug

Description

Problem

When using context.push_data() in a PlaywrightCrawler's default request handler, all push_data() calls complete successfully (no errors thrown), but subsequent calls to crawler.get_data() or Dataset.get_data() return only a subset of the pushed items.

This appears to be related to the storage buffering mechanism mentioned in issue #1532, but affects success handlers rather than error handlers.

Expected Behavior

  • Push 4 items with context.push_data() → All 4 succeed
  • Query with crawler.get_data() → Should return all 4 items

Actual Behavior

  • Push 4 items with context.push_data() → All 4 succeed
  • Query with crawler.get_data() → Returns only 2 items (non-deterministic which ones)
  • Crawler statistics show requests_finished: 4
  • No errors or warnings logged

Impact

  • Integration tests become flaky and unreliable
  • There is no way to verify that all crawled data was persisted
  • Undermines production data quality assurance
  • Data loss is silent: nothing in the logs indicates that it occurred

Steps to Reproduce

Minimal Reproduction

import asyncio
from pathlib import Path
import tempfile
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.configuration import Configuration

async def main():
    # Create unique storage directory
    with tempfile.TemporaryDirectory(prefix="crawlee_test_") as tmpdir:
        storage_path = Path(tmpdir)

        # Configure crawler with isolated storage
        config = Configuration(storage_dir=str(storage_path))
        crawler = PlaywrightCrawler(configuration=config)

        # Simple handler that pushes one item per request
        @crawler.router.default_handler
        async def handler(context: PlaywrightCrawlingContext) -> None:
            await context.push_data({
                "url": context.request.url,
                "title": "Test",
                "loaded_url": str(context.request.loaded_url),
            })

        # Crawl 4 URLs (same page with different query params to avoid deduplication)
        test_url = "https://crawlee.dev"
        await crawler.run([
            f"{test_url}?test=1",
            f"{test_url}?test=2",
            f"{test_url}?test=3",
            f"{test_url}?test=4",
        ])

        # Give any deferred persistence a moment to complete
        await asyncio.sleep(2.0)

        # Query dataset
        dataset_items = await crawler.get_data()

        print(f"Pushed: 4 items")
        print(f"Retrieved: {len(dataset_items.items)} items")
        print(f"URLs in dataset: {[item.get('url') for item in dataset_items.items]}")

        # Check statistics
        stats = await crawler.get_statistics()
        print(f"Requests finished: {stats['requests_finished']}")

if __name__ == "__main__":
    asyncio.run(main())

Expected Output

Pushed: 4 items
Retrieved: 4 items
URLs in dataset: [all 4 URLs]
Requests finished: 4

Actual Output

Pushed: 4 items
Retrieved: 1-2 items (varies)
URLs in dataset: [only 1-2 URLs, non-deterministic which ones]
Requests finished: 4

Evidence from Diagnostic Testing

We ran extensive diagnostic tests to isolate the issue:

Test Setup

  • Crawled 4 identical pages (same content, different query params)
  • Used unique query params to bypass URL deduplication (see the sketch after this list)
  • All pages produce identical content fingerprints
  • Isolated storage directory per test
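
For reference, the distinct test URLs were generated roughly like this (standard library only; the base URL is the same test page used in the reproduction above):

from urllib.parse import urlencode

base_url = "https://crawlee.dev"  # same target page for every request

# A unique query parameter gives each request a distinct URL, so Crawlee's
# default URL-based deduplication does not collapse the four requests into one.
test_urls = [f"{base_url}?{urlencode({'test': i})}" for i in range(1, 5)]
# -> ['https://crawlee.dev?test=1', ..., 'https://crawlee.dev?test=4']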

Results

Crawlee Request Statistics:
┌─────────────────────┬─────┐
│ requests_finished   │ 4   │
│ requests_failed     │ 0   │
│ requests_total      │ 4   │
└─────────────────────┴─────┘

Dataset Query Results:
- Items returned: 2
- URLs present: #2 and #4 (non-sequential)
- Missing URLs: #1 and #3

Key Findings

  1. All push_data() calls succeed - No exceptions thrown
  2. All requests complete successfully - Statistics confirm 4/4 finished
  3. Only ~50% appear in dataset - get_data() returns 2/4 items
  4. Non-deterministic selection - Which items appear varies between runs
  5. Not content-related - Same page/content still exhibits bug
  6. Polling doesn't help - Waiting 10+ seconds shows no improvement

Attempted Workarounds

1. Using Dataset.open() directly

from crawlee.storages import Dataset
dataset = await Dataset.open()
items = await dataset.get_data()

Result: Same behavior

2. Using crawler.get_data()

items = await crawler.get_data()

Result: Same behavior (both methods call Dataset.open() internally)

3. Specifying dataset_id explicitly

items = await crawler.get_data(dataset_id="default")

Result: Same behavior

4. Extended polling with delays

for _ in range(50):
    await asyncio.sleep(0.2)
    items = await dataset.get_data()
    if len(items.items) >= 4:
        break

Result: No improvement even after 10 seconds

5. Processing URLs individually vs batch

# Test 1: Process single URL - WORKS
await crawler.run([url1])
data = await crawler.get_data()
# Result: ✅ 100% of items returned

# Test 2: Process multiple URLs - FAILS
await crawler.run([url1, url2, url3])
data = await crawler.get_data()
# Result: ❌ Only ~66% of items returned

Result: Bug ONLY manifests during batch/concurrent processing.
Key Finding: Single URL processing works perfectly (100% success rate).

Related Issues

Issue #1532: "failed_request_handler runs and logs but context.push_data(...) does not write to dataset"

Our issue differs: the data loss happens in the default (success) request handler, where every request finishes successfully, rather than in failed_request_handler.

The maintainer's comment in #1532 mentioned:

"we buffer storage-related calls to prevent partial results from failing requests from being stored"

It appears this buffering mechanism has a bug that affects all handlers, not just error handlers.
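
To make the suspected failure mode concrete, here is a purely illustrative sketch of how a per-request buffer that is only committed on success could silently drop items under concurrency. The class and method names (RequestBuffer, commit) are invented for illustration and do not correspond to Crawlee internals:

import asyncio

dataset: list[dict] = []  # stands in for the persisted dataset


class RequestBuffer:
    """Hypothetical per-request buffer, loosely modelled on the quote above."""

    def __init__(self) -> None:
        self._pending: list[dict] = []

    async def push_data(self, item: dict) -> None:
        # push_data() only appends to an in-memory buffer, so it never raises here.
        self._pending.append(item)

    async def commit(self) -> None:
        # Only a commit moves buffered items into the dataset.
        dataset.extend(self._pending)
        self._pending.clear()


async def handle_request(i: int, *, commit: bool) -> None:
    buffer = RequestBuffer()
    await buffer.push_data({"request": i})  # always "succeeds"
    if commit:
        await buffer.commit()
    # If the commit is skipped, raced, or lost, the pushed item silently disappears.


async def main() -> None:
    # Simulate 4 concurrent requests where 2 commits are lost.
    await asyncio.gather(*(handle_request(i, commit=i % 2 == 0) for i in range(1, 5)))
    print(f"Pushed: 4, persisted: {len(dataset)}")  # -> Pushed: 4, persisted: 2


asyncio.run(main())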

Environment

- crawlee: 1.0.3
- Python: 3.12.11
- playwright: 1.57.0
- OS: macOS 14.x / Linux

Impact Assessment

Severity: High

  • Data Loss: Cannot verify all crawled data is persisted
  • Test Reliability: Integration tests become flaky
  • Production Risk: Silent data loss in production crawls
  • Debugging Difficulty: No errors or warnings logged

Workaround

Currently using conditional validations in tests:

# Validate whatever items happen to be available
for item in dataset_items.items:
    assert validate_item(item)

This is not acceptable for production as it doesn't guarantee all data is persisted.
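
A slightly stricter interim check, built only on the crawler.get_data() call shown above, at least turns the silent loss into a visible test failure. It does not recover the missing items, and per workaround 4 the extra polling rarely helps:

import asyncio
import time

async def assert_dataset_count(crawler, expected: int, *, timeout: float = 10.0) -> None:
    """Poll the dataset and fail loudly if fewer than `expected` items appear in time."""
    deadline = time.monotonic() + timeout
    while True:
        result = await crawler.get_data()
        if len(result.items) >= expected:
            return
        if time.monotonic() >= deadline:
            raise AssertionError(
                f"expected {expected} dataset items, got {len(result.items)}: "
                f"{[item.get('url') for item in result.items]}"
            )
        await asyncio.sleep(0.2)

# In a test, after crawler.run(urls):
#     await assert_dataset_count(crawler, expected=len(urls))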

Proposed Fix

  1. Ensure context.push_data() buffers are flushed before crawler.run() completes
  2. Add explicit flush() or commit() method to Dataset/Context (see the sketch after this list)
  3. Make buffering configurable (or disable for success handlers)
  4. Add warning logs when buffered data is dropped
  5. Document buffering behavior in API docs
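
To make item 2 concrete, the intended usage would look roughly like this. This is a hypothetical sketch of a proposed API; Dataset.flush() does not exist in Crawlee 1.0.3:

from crawlee.storages import Dataset

async def verify_after_run(expected: int) -> None:
    dataset = await Dataset.open()
    await dataset.flush()  # proposed: force any buffered push_data() items to be persisted
    items = await dataset.get_data()
    assert len(items.items) == expected, f"expected {expected}, got {len(items.items)}"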

Request

Could you please:

  1. Confirm whether this is a known issue, or investigate it
  2. Suggest a workaround until it is fixed
  3. Add a fix to the roadmap
  4. Consider adding integration tests that verify all pushed items are queryable

Additional Context

Happy to provide additional information or help test fixes!


Related: #1532
Component: Dataset, Storage Client, Context Pipeline
Crawlee Version: 1.0.3
