feat: add streaming mode for large files #58

Merged
alvinreal merged 1 commit into main from fix/issue-29
Feb 22, 2026

@alvinreal
Owner

Adds a --stream CLI flag for element-by-element processing of large files, enabling morph to handle datasets that would otherwise require loading everything into memory.

New CLI Option

| Flag | Description |
| --- | --- |
| --stream | Enable streaming mode (process elements one at a time) |

Supported Streaming Pipelines

| Input | Output | Streaming Type |
| --- | --- | --- |
| JSONL | JSONL/JSON/CSV | True line-by-line (constant memory) |
| CSV | JSONL/JSON/CSV | True row-by-row (constant memory) |
| JSON array | JSONL/JSON/CSV | Parse + stream output |

For unsupported combinations (e.g., YAML→JSON with --stream), the flag is silently ignored and the normal pipeline runs.
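
The fallback rule can be pictured as a small predicate over the input and output formats. The following is a minimal sketch, not the code from this PR: the `Format` enum and the `supports_streaming` and `use_streaming` functions are hypothetical names, and the supported set is taken only from the table above.

```rust
/// Hypothetical format enum; morph's real type may look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Json,
    Jsonl,
    Csv,
    Yaml,
}

/// True when the pair is one of the streaming pipelines in the table:
/// JSONL, CSV, or JSON input combined with JSONL, JSON, or CSV output.
fn supports_streaming(input: Format, output: Format) -> bool {
    let streamable = |f: Format| matches!(f, Format::Json | Format::Jsonl | Format::Csv);
    streamable(input) && streamable(output)
}

/// Streaming runs only if it was requested and the pair is supported.
/// Otherwise the normal pipeline runs and the flag is silently ignored.
fn use_streaming(stream_flag: bool, input: Format, output: Format) -> bool {
    stream_flag && supports_streaming(input, output)
}
```

Under these assumptions, `use_streaming(true, Format::Yaml, Format::Json)` evaluates to `false`, so a YAML→JSON run with --stream would take the normal pipeline without raising an error.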

Implementation

  • New src/streaming.rs module with:
    • StreamWriter — incremental output writer for JSONL, JSON arrays, and CSV
    • stream_jsonl() — line-by-line JSONL processing
    • stream_csv() — row-by-row CSV processing
    • stream_json_array() — JSON array streaming output
    • run_streaming() — full pipeline orchestrator
  • Mapping support: per-element transforms work with -e/-m flags
  • Format-specific options respected (e.g., --csv-delimiter in streaming mode)
  • Output buffered through BufWriter and flushed periodically for performance
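
To illustrate the idea behind an incremental writer combined with line-by-line input, here is a rough, std-only sketch. It is not the contents of src/streaming.rs: `SketchWriter`, `OutputMode`, and `stream_lines` are hypothetical names, CSV output is left out for brevity, and each element is treated as an already-serialized JSON string rather than a parsed value.

```rust
use std::io::{self, BufRead, BufWriter, Write};

/// Hypothetical output modes (CSV omitted to keep the sketch short).
#[derive(Clone, Copy)]
enum OutputMode {
    Jsonl,
    JsonArray,
}

/// Writes elements one at a time so the whole dataset is never buffered.
struct SketchWriter<W: Write> {
    out: BufWriter<W>,
    mode: OutputMode,
    count: usize,
}

impl<W: Write> SketchWriter<W> {
    fn new(inner: W, mode: OutputMode) -> Self {
        SketchWriter { out: BufWriter::new(inner), mode, count: 0 }
    }

    /// Writes one already-serialized JSON element (assumed valid JSON).
    fn write_element(&mut self, json: &str) -> io::Result<()> {
        match self.mode {
            OutputMode::Jsonl => writeln!(self.out, "{}", json)?,
            OutputMode::JsonArray => {
                // Open the array before the first element; comma-separate the rest.
                let prefix = if self.count == 0 { "[" } else { "," };
                write!(self.out, "{}{}", prefix, json)?;
            }
        }
        self.count += 1;
        Ok(())
    }

    /// Closes the array if needed, flushes, and returns the inner writer.
    fn finish(mut self) -> io::Result<W> {
        if let OutputMode::JsonArray = self.mode {
            let closing = if self.count == 0 { "[]" } else { "]" };
            write!(self.out, "{}", closing)?;
        }
        self.out.flush()?;
        self.out.into_inner().map_err(|e| e.into_error())
    }
}

/// Reads input line by line, applies a per-element transform, and writes
/// each result immediately. Blank lines are skipped. Returns the element count.
fn stream_lines<R: BufRead, W: Write>(
    input: R,
    writer: &mut SketchWriter<W>,
    transform: impl Fn(&str) -> String,
) -> io::Result<usize> {
    let mut written = 0;
    for line in input.lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        writer.write_element(&transform(trimmed))?;
        written += 1;
    }
    Ok(written)
}
```

In this sketch only one input line and the BufWriter's buffer are held at a time, which is the sense in which the JSONL path can run in constant memory. The array brackets and commas are emitted as elements arrive, so JSON array output does not need the full element list up front.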

Tests

  • 24 unit tests in src/streaming.rs: JSONL/CSV/JSON streaming, mappings, edge cases, tab delimiters
  • 7 integration tests in tests/cross_format.rs: CLI --stream flag with various format combinations

Fixes #29

Adds --stream flag for element-by-element processing of JSON arrays,
JSONL, and CSV input. In streaming mode, each element is parsed,
optionally transformed via mappings, and written to the output format
immediately rather than buffering the entire dataset in memory.

Supported streaming pipelines:
- JSONL → JSONL/JSON/CSV (true line-by-line streaming)
- CSV → JSONL/JSON/CSV (true row-by-row streaming)
- JSON array → JSONL/JSON/CSV (parse + stream output)

Features:
- StreamWriter abstraction for incremental output
- Mapping support (per-element transforms via -e/-m)
- CSV delimiter support in streaming mode
- Graceful fallback: --stream is silently ignored for unsupported
  format combinations (e.g. YAML→JSON)
- Periodic flushing via BufWriter

Includes 24 unit tests and 7 integration tests covering all streaming
combinations, edge cases, and mapping integration.

Fixes #29
alvinreal merged commit 51411af into main on Feb 22, 2026
6 checks passed
alvinreal deleted the fix/issue-29 branch on February 22, 2026 at 12:39
github-actions bot mentioned this pull request on Mar 20, 2026
