Summary
Following the successful implementation of streaming support for JSON generation/dumping (#524, #686), I propose adding similar streaming capabilities for JSON parsing/loading operations.
Background
The JSON library recently added support for streaming JSON generation, allowing direct writing to IO objects without intermediate buffering (#686). This significantly improved memory efficiency for large JSON dumps by eliminating the need to buffer the entire JSON structure in memory.
However, the parsing side still requires loading the entire JSON content into memory before parsing, which can be problematic for:
- Large JSON files that exceed available memory
- Stream processing scenarios where data arrives incrementally
- API responses that could be processed as they arrive
- Log file processing where JSON entries could be parsed one at a time
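For contrast, the only option today is to buffer the full document before parsing; a minimal illustration (the file name is just an example):

require 'json'

# Today the whole document and the resulting object tree must be held
# in memory at the same time:
json = File.read('large.json')  # buffers the entire file as one String
data = JSON.parse(json)         # builds the complete object graph before returning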
Proposed Enhancement
Add streaming support for JSON parsing through one or more of these approaches:
Option 1: Enhance existing JSON.parse/JSON.load
Allow these methods to accept IO objects and parse incrementally:
File.open('large.json', 'r') do |io|
  JSON.parse(io, stream: true) do |object|
    # Process each top-level object as it's parsed
    process_object(object)
  end
end
Option 2: New streaming-specific methods
Introduce dedicated methods like JSON.parse_stream or JSON.each:
JSON.parse_stream(io) do |object|
  # Process each object
end

# Or for line-delimited JSON (JSONL/NDJSON)
JSON.each_line(io) do |object|
  # Process each line as a separate JSON object
end
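For the line-delimited case, behavior close to the JSON.each_line sketch above can already be approximated with a thin wrapper over IO#each_line; a minimal sketch (process_object is a hypothetical application callback):

require 'json'

File.open('events.jsonl') do |io|
  io.each_line(chomp: true) do |line|
    next if line.strip.empty?
    process_object(JSON.parse(line)) # each line is a complete JSON document
  end
end

The question is whether such a convenience belongs in the library itself, which is what this option proposes.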
Option 3: SAX-style parsing API
Provide an event-driven parsing interface for maximum flexibility:
parser = JSON::StreamParser.new

parser.on_object do |obj|
  # Handle complete objects
end

parser.on_array do |arr|
  # Handle arrays
end

parser.parse(io)
Use Cases
- Large file processing: Parse multi-gigabyte JSON files without loading them entirely into memory
- Real-time data streams: Process JSON data from network streams as it arrives
- Line-delimited JSON (JSONL/NDJSON): Efficiently process log files and data exports that use newline-delimited JSON format
- Memory-constrained environments: Enable JSON processing in environments with limited memory
- Progressive web APIs: Parse streaming JSON responses from APIs that support chunked transfer encoding
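To make the last use case concrete: with Net::HTTP today, every chunk of a streamed response has to be accumulated before JSON.parse can run, which is exactly the buffering a streaming parser would remove (the URL below is a placeholder):

require 'net/http'
require 'json'

uri = URI('https://api.example.com/export.json')
Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    body = +''
    response.read_body { |chunk| body << chunk } # every chunk must be buffered first
    data = JSON.parse(body)                      # parsing can only begin after the last chunk
  end
end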
Benefits
- Memory efficiency: memory use bounded by the largest single value being parsed rather than by the total input size
- Improved performance: Start processing data before the entire document is received
- Better scalability: Handle arbitrarily large JSON documents
- Consistency: Mirrors the streaming generation capability already implemented
Technical Considerations
- Would need implementation across all three backends (pure Ruby, C extension, Java)
- Should handle both single large JSON objects and streams of multiple JSON objects
- Error handling for partial/malformed JSON in streams (see the sketch after this list)
- Backward compatibility with existing parsing APIs
- Support for common streaming JSON formats (JSONL, JSON Text Sequences RFC 7464)
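On the error-handling point, the API also needs a policy for a malformed record in an otherwise healthy stream: raise immediately, skip and report, or yield the error to the caller. A sketch of the skip-and-report policy for line-delimited input, using only the existing JSON.parse and JSON::ParserError (each_json_line is a hypothetical helper):

require 'json'

# Hypothetical helper illustrating a "skip and report" policy for bad records
def each_json_line(io)
  io.each_line(chomp: true).with_index(1) do |line, lineno|
    next if line.strip.empty?
    begin
      yield JSON.parse(line)
    rescue JSON::ParserError => e
      warn "skipping malformed record on line #{lineno}: #{e.message}"
    end
  end
end

Which policy the built-in API should default to is itself a design question.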
Related Work
- Streaming enhancements for dumping #524: Original issue for streaming JSON generation
- JSON.dump: write directly into the provided IO #686: Implementation of streaming JSON generation
- Similar functionality exists in other JSON libraries:
- yajl-ruby: streaming JSON parsing
- oj: ScHandler for SAX-style parsing (see the sketch after this list)
- Node.js: stream-json
- Python: ijson
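For reference, oj's ScHandler interface has roughly the shape Option 3 proposes: a handler object whose callbacks fire as the document is parsed. The callback names below are written from memory and should be checked against oj's documentation (process_object is a hypothetical application callback):

require 'oj'

# NOTE: callback names written from memory; verify against oj's ScHandler docs.
class StreamHandler < Oj::ScHandler
  def hash_start
    {}
  end

  def hash_set(hash, key, value)
    hash[key] = value
  end

  def array_start
    []
  end

  def array_append(array, value)
    array << value
  end

  def add_value(value)
    process_object(value) # called with each completed top-level value
  end
end

File.open('large.json') { |io| Oj.sc_parse(StreamHandler.new, io) }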
Questions for Discussion
- Which API approach would be most consistent with Ruby conventions and the existing JSON library?
- Should we support both SAX-style and object-stream patterns?
- How should we handle different streaming JSON formats (single large object vs. line-delimited)?
- What would be the performance targets compared to the current implementation?
Would love to hear thoughts from maintainers and the community on this enhancement!