io: introduce URI-based IO layer with optional s3 backend #4040
PR Description
Feature: URI-based I/O helpers with optional S3 backend
This PR introduces a URI-based I/O utility layer to support local and S3 storage in a reusable way, and makes the S3 backend an optional, lazily imported dependency. This is a pure infrastructure enhancement and does not change any FastAPI endpoints.
What Changed
New Files
mineru/data/utils/uri_io.py – Core URI-based helpers:
read_bytes_from_uri (local path + s3:// / s3a://)
prepare_output_dir
upload_parse_dir_to_s3
cleanup_temp_dir
tests/test_uri_io.py – Unit tests for local/S3 URI handling and output directory selection
Modified Files
pyproject.toml
Moved boto3 from core dependencies to an optional extra: mineru[s3]
Ensured mineru[core] pulls in mineru[s3] so the “full install” still has S3 support
mineru/data/io/s3.py
Wrapped boto3 import in a guarded lazy import with a clear error message:
If S3 is used without installing mineru[s3], users get a precise ImportError telling them how to enable it
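A minimal sketch of the guarded lazy import described above, not the actual code in mineru/data/io/s3.py. The helper name require_s3_backend and its module_name parameter are hypothetical and exist only so the failure path can be exercised without uninstalling boto3.

```python
import importlib


def require_s3_backend(module_name: str = "boto3"):
    """Import the S3 client library on first use.

    Hypothetical helper: the real guard in mineru/data/io/s3.py may be
    structured differently. Only the install hint is taken from the PR text.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable message instead of a bare "No module named ...".
        raise ImportError(
            "S3 support requires the optional S3 backend. "
            'Install it with: pip install "mineru[s3]"'
        ) from exc
```

Deferring the import to call time means that importing the package never fails on a core-only install; the error surfaces only when an S3 code path is actually reached.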
URI I/O Logic
How It Works
URI-based reading (read_bytes_from_uri)
Local paths (no ://): use existing read_fn(Path(...)) to handle PDF/images as before
S3 URIs (s3:// / s3a://):
Validate S3 backend availability (_require_s3_backend)
Read bytes via S3Reader using env-based config (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_ENDPOINT_URL, optional S3_ADDRESSING_STYLE)
Other schemes (http://, https://, file://, etc.):
Explicitly rejected with a ValueError("Unsupported URI scheme ... Only local paths and s3:// are supported.")
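The dispatch rules above can be sketched as follows. This is an illustration under stated assumptions, not the body of read_bytes_from_uri: classify_uri and read_bytes_sketch are hypothetical names, and the local and S3 readers are injected as plain callables in place of read_fn and S3Reader.

```python
from pathlib import Path
from typing import Callable, Optional

S3_SCHEMES = ("s3://", "s3a://")


def classify_uri(uri: str) -> str:
    """Return "s3" or "local"; reject every other scheme."""
    if uri.startswith(S3_SCHEMES):
        return "s3"
    if "://" in uri:
        scheme = uri.split("://", 1)[0]
        raise ValueError(
            f"Unsupported URI scheme {scheme!r}. "
            "Only local paths and s3:// are supported."
        )
    # No "://" at all: treat the string as a local filesystem path.
    return "local"


def read_bytes_sketch(
    uri: str,
    read_local: Callable[[Path], bytes] = lambda p: p.read_bytes(),
    read_s3: Optional[Callable[[str], bytes]] = None,
) -> bytes:
    """Route a URI to the matching reader (both readers are stand-ins)."""
    if classify_uri(uri) == "local":
        return read_local(Path(uri))
    if read_s3 is None:
        # Stand-in for the backend availability check.
        raise ImportError('S3 backend not available. pip install "mineru[s3]"')
    return read_s3(uri)
```

Checking the S3 prefixes before the generic "://" test keeps the rejection branch simple: anything that still contains "://" at that point is unsupported by definition.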
Output directory selection (prepare_output_dir)
S3 output (output_uri starts with s3:// / s3a://):
Allocate a temporary local directory with tempfile.mkdtemp
Return (temp_dir, is_s3_output=True, normalized_output_uri=output_uri)
Local output / no URI:
Ensure fallback_local_dir exists
Return (fallback_local_dir, is_s3_output=False, normalized_output_uri)
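A rough sketch of the selection logic above, assuming the three-element return value listed in the description. The function name prepare_output_dir_sketch and the temp directory prefix are hypothetical, and passing output_uri through unchanged in the local case is an assumption; the real helper may normalize it differently.

```python
import os
import tempfile
from typing import Optional, Tuple


def prepare_output_dir_sketch(
    output_uri: Optional[str], fallback_local_dir: str
) -> Tuple[str, bool, Optional[str]]:
    """Return (local_dir, is_s3_output, normalized_output_uri)."""
    if output_uri and output_uri.startswith(("s3://", "s3a://")):
        # S3 output: write locally first, upload afterwards.
        temp_dir = tempfile.mkdtemp(prefix="mineru_s3_")  # prefix is illustrative
        return temp_dir, True, output_uri
    # Local output or no URI: make sure the fallback directory exists.
    os.makedirs(fallback_local_dir, exist_ok=True)
    return fallback_local_dir, False, output_uri
```

Returning the is_s3_output flag alongside the directory lets the caller decide later whether an upload and a temp directory cleanup are needed, without parsing the URI a second time.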
Uploading parse results to S3 (upload_parse_dir_to_s3)
Requires S3 backend (mineru[s3] installed)
Uses S3DataWriter to mirror the full subtree under local_parse_dir into the target S3 prefix
Returns the final S3 “parse_dir” URI (s3://bucket/prefix/) for use in API responses
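The mirroring step might look roughly like this. It is a sketch only: mirror_dir_to_prefix is a hypothetical name, and the write callable stands in for S3DataWriter, whose actual interface is not shown in this description.

```python
from pathlib import Path
from typing import Callable


def mirror_dir_to_prefix(
    local_dir: str,
    output_uri: str,
    write: Callable[[str, bytes], None],
) -> str:
    """Copy every file under local_dir to output_uri, preserving relative paths.

    `write(target_uri, data)` is an injected stand-in for the real S3 writer.
    """
    base = output_uri.rstrip("/") + "/"  # final URI always ends with "/"
    root = Path(local_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # as_posix() keeps "/" separators in keys, also on Windows.
            rel = path.relative_to(root).as_posix()
            write(base + rel, path.read_bytes())
    return base
```

Injecting the writer keeps the directory walk testable without network access; a dict-backed callable is enough to check which keys would be created.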
Temp directory cleanup (cleanup_temp_dir)
Best-effort shutil.rmtree with warning logs on failure
Designed to be safe to run as a FastAPI background task
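A best-effort cleanup of this kind can be sketched in a few lines. The name cleanup_temp_dir_sketch and the log message are illustrative; the actual helper may log or filter errors differently.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def cleanup_temp_dir_sketch(path: str) -> None:
    """Remove a temp directory; log a warning instead of raising on failure."""
    try:
        shutil.rmtree(path)
    except Exception as exc:
        # A failure here must not surface to the caller: the response has
        # typically already been sent when a background task runs.
        logger.warning("Failed to remove temp dir %s: %s", path, exc)
```

Swallowing the exception is deliberate in this sketch: a leftover temp directory is a housekeeping problem, not a request failure.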
Key Components
mineru.data.utils.uri_io.read_bytes_from_uri:
Normalizes input handling for local paths and S3 URIs
Provides clear error messages for unsupported schemes and missing S3 backend
mineru.data.utils.uri_io.prepare_output_dir:
Centralizes the decision between writing to a real local directory and writing to a temp dir for later S3 upload
mineru.data.utils.uri_io.upload_parse_dir_to_s3:
Generic “upload a whole parse directory to S3” helper, independent of API layer
Optional S3 backend (mineru[s3]):
pyproject.toml defines s3 extra with boto3
mineru/data/io/s3.py guards imports and gives a direct hint: pip install "mineru[s3]"
Checklist
Code Quality
[x] Code follows existing project style and layout conventions
[x] Self-review performed for uri_io.py, s3.py, and pyproject.toml changes
[x] New helpers and edge cases are documented in code comments
[x] Changes introduce no new linter errors in touched files
Testing
[x] pytest tests/test_uri_io.py passes locally
[x] Verified local-path reading works against tests/unittest/pdfs/test.pdf
[x] Verified unsupported schemes (http://...) raise clear ValueError
[x] Verified S3 access without backend raises clear ImportError pointing to mineru[s3]
[x] Verified prepare_output_dir behavior for both local and S3 output modes
Documentation
[ ] (To be done in a follow-up PR) Update public docs for the new URI-based behavior once it is integrated into /file_parse
[x] Internal behavior and expectations are documented in uri_io.py docstrings and tests
🧪 Testing Guide
Basic Tests:
Ensure mineru is installed in editable mode with test extras:
pip install -e ".[test]"
Run the new tests:
pytest tests/test_uri_io.py
Scenarios Covered:
Local read:
read_bytes_from_uri("tests/unittest/pdfs/test.pdf") returns non-empty bytes
Unsupported scheme:
read_bytes_from_uri("http://example.com/foo.pdf") raises ValueError with “Unsupported URI scheme” and “Only local paths and s3://”
Missing S3 backend:
With S3 backend disabled/missing, calling read_bytes_from_uri("s3://bucket/key") raises ImportError mentioning pip install "mineru[s3]".
Output dir selection:
prepare_output_dir(None, "./output") returns a real local directory
prepare_output_dir("s3://my-bucket/prefix", "./output") returns a temp dir with is_s3_output=True and normalized_output_uri unchanged
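For the "Missing S3 backend" scenario, one way to simulate an absent dependency without uninstalling it is to block the module in sys.modules for the duration of a test. The sketch below shows only that general technique; it is not taken from tests/test_uri_io.py, and import_fails_when_blocked is a hypothetical helper.

```python
import importlib
import sys
from unittest import mock


def import_fails_when_blocked(module_name: str) -> bool:
    """Return True if importing module_name raises ImportError while blocked."""
    # A None entry in sys.modules makes the import machinery raise
    # ImportError for that name; patch.dict restores the mapping on exit.
    with mock.patch.dict(sys.modules, {module_name: None}):
        try:
            importlib.import_module(module_name)
        except ImportError:
            return True
    return False
```

In a real test, the same patch would wrap the call under test, for example blocking "boto3" and asserting that the raised ImportError mentions pip install "mineru[s3]".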