@eric-ozim

PR Description
Feature: URI-based I/O helpers with optional S3 backend
This PR introduces a URI-based I/O utility layer to support local and S3 storage in a reusable way, and makes the S3 backend an optional, lazily imported dependency. This is a pure infrastructure enhancement and does not change any FastAPI endpoints.
What Changed
New Files
mineru/data/utils/uri_io.py – Core URI-based helpers:
read_bytes_from_uri (local path + s3:// / s3a://)
prepare_output_dir
upload_parse_dir_to_s3
cleanup_temp_dir
tests/test_uri_io.py – Unit tests for local/S3 URI handling and output directory selection
Modified Files
pyproject.toml
Moved boto3 from core dependencies to an optional extra: mineru[s3]
Ensured mineru[core] pulls in mineru[s3] so the “full install” still has S3 support
mineru/data/io/s3.py
Wrapped boto3 import in a guarded lazy import with a clear error message:
If S3 is used without installing mineru[s3], users get a precise ImportError telling them how to enable it
URI I/O Logic
How It Works
URI-based reading (read_bytes_from_uri)
Local paths (no ://): use existing read_fn(Path(...)) to handle PDF/images as before
S3 URIs (s3:// / s3a://):
Validate S3 backend availability (_require_s3_backend)
Read bytes via S3Reader using env-based config (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_ENDPOINT_URL, optional S3_ADDRESSING_STYLE)
Other schemes (http://, https://, file://, etc.):
Explicitly rejected with a ValueError("Unsupported URI scheme ... Only local paths and s3:// are supported.")
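The scheme dispatch above could look roughly like the sketch below. The function name `read_bytes_from_uri_sketch` and the injected `read_local` / `read_s3` callables are hypothetical stand-ins (for read_fn and the S3Reader path) so that the example runs without MinerU or boto3 installed.

```python
from pathlib import Path
from typing import Callable

S3_SCHEMES = ("s3", "s3a")


def read_bytes_from_uri_sketch(
    uri: str,
    read_local: Callable[[Path], bytes],
    read_s3: Callable[[str], bytes],
) -> bytes:
    """Route a path or URI to the matching reader, rejecting unknown schemes."""
    if "://" not in uri:
        # No scheme separator: treat the input as a local filesystem path.
        return read_local(Path(uri))
    scheme = uri.split("://", 1)[0].lower()
    if scheme in S3_SCHEMES:
        return read_s3(uri)
    raise ValueError(
        f"Unsupported URI scheme {scheme!r} in {uri!r}. "
        "Only local paths and s3:// are supported."
    )
```

Checking for `://` before anything else keeps plain relative and absolute paths on the existing local code path, which matches the "as before" behavior described above.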
Output directory selection (prepare_output_dir)
S3 output (output_uri starts with s3:// / s3a://):
Allocate a temporary local directory with tempfile.mkdtemp
Return (temp_dir, is_s3_output=True, normalized_output_uri=output_uri)
Local output / no URI:
Ensure fallback_local_dir exists
Return (fallback_local_dir, is_s3_output=False, normalized_output_uri)
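A minimal sketch of this selection logic, under stated assumptions: the name `prepare_output_dir_sketch` and the temp-dir prefix are made up for the example, and in the local case the third element simply passes `output_uri` through, since the description does not spell out what normalization is applied there.

```python
import os
import tempfile
from typing import Optional, Tuple


def prepare_output_dir_sketch(
    output_uri: Optional[str], fallback_local_dir: str
) -> Tuple[str, bool, Optional[str]]:
    """Return (local_dir, is_s3_output, normalized_output_uri)."""
    if output_uri and output_uri.lower().startswith(("s3://", "s3a://")):
        # S3 output: write to a throwaway local directory first, upload later.
        temp_dir = tempfile.mkdtemp(prefix="mineru_out_")
        return temp_dir, True, output_uri
    # Local output or no URI: make sure the fallback directory exists.
    os.makedirs(fallback_local_dir, exist_ok=True)
    return fallback_local_dir, False, output_uri
```

Returning the flag alongside the directory lets the caller decide later whether an upload and a temp-dir cleanup are needed, without re-parsing the URI.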
Uploading parse results to S3 (upload_parse_dir_to_s3)
Requires S3 backend (mineru[s3] installed)
Uses S3DataWriter to mirror the full subtree under local_parse_dir into the target S3 prefix
Returns the final S3 “parse_dir” URI (s3://bucket/prefix/) for use in API responses
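The "mirror the full subtree" step comes down to mapping every file under local_parse_dir to an object key under the target prefix. The sketch below shows only that mapping and the final URI, without touching S3DataWriter or boto3. The helper names `split_s3_uri`, `plan_uploads`, and `final_parse_dir_uri` are hypothetical, and normalizing the returned URI to the `s3://` scheme is an assumption.

```python
from pathlib import Path
from typing import List, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix')."""
    without_scheme = uri.split("://", 1)[1]
    bucket, _, prefix = without_scheme.partition("/")
    return bucket, prefix.strip("/")


def plan_uploads(local_parse_dir: str, output_uri: str) -> List[Tuple[Path, str]]:
    """List (local_file, object_key) pairs that mirror the local subtree."""
    root = Path(local_parse_dir)
    _, prefix = split_s3_uri(output_uri)
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Use POSIX separators so keys look the same on every platform.
            rel = path.relative_to(root).as_posix()
            key = f"{prefix}/{rel}" if prefix else rel
            plan.append((path, key))
    return plan


def final_parse_dir_uri(output_uri: str) -> str:
    """Build the trailing-slash 'parse_dir' URI reported to the caller."""
    bucket, prefix = split_s3_uri(output_uri)
    return f"s3://{bucket}/{prefix}/" if prefix else f"s3://{bucket}/"
```

An uploader would then iterate over the plan and write each file's bytes under its key; keeping the planning step pure makes it easy to unit-test without credentials.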
Temp directory cleanup (cleanup_temp_dir)
Best-effort shutil.rmtree with warning logs on failure
Designed to be safe for use in FastAPI BackgroundTask
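A best-effort cleanup of this kind might look like the sketch below. The function name `cleanup_temp_dir_sketch` and the log message are illustrative; the point is that the function never raises, which is what makes it reasonable to schedule after a response has been sent.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def cleanup_temp_dir_sketch(path: str) -> None:
    """Remove a temporary directory, logging instead of raising on failure."""
    try:
        shutil.rmtree(path)
    except Exception as exc:
        # Cleanup is best-effort: a leftover temp dir should not fail the request.
        logger.warning("Failed to remove temp dir %s: %s", path, exc)
```

In a FastAPI handler this would typically be scheduled with something like `background_tasks.add_task(cleanup_temp_dir_sketch, temp_dir)`, so the directory is removed only after the response and any upload have completed.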
Key Components
mineru.data.utils.uri_io.read_bytes_from_uri:
Normalizes input handling for local paths and S3 URIs
Provides clear error messages for unsupported schemes and missing S3 backend
mineru.data.utils.uri_io.prepare_output_dir:
Centralizes the decision between writing to a real local directory and writing to a temp dir that is later uploaded to S3
mineru.data.utils.uri_io.upload_parse_dir_to_s3:
Generic “upload a whole parse directory to S3” helper, independent of API layer
Optional S3 backend (mineru[s3]):
pyproject.toml defines s3 extra with boto3
mineru/data/io/s3.py guards imports and gives a direct hint: pip install "mineru[s3]"

Checklist
Code Quality
[x] Code follows existing project style and layout conventions
[x] Self-review performed for uri_io.py, s3.py, and pyproject.toml changes
[x] New helpers and edge cases are documented in code comments
[x] Changes introduce no new linter errors in touched files
Testing
[x] pytest tests/test_uri_io.py passes locally
[x] Verified local-path reading works against tests/unittest/pdfs/test.pdf
[x] Verified unsupported schemes (http://...) raise clear ValueError
[x] Verified S3 access without backend raises clear ImportError pointing to mineru[s3]
[x] Verified prepare_output_dir behavior for both local and S3 output modes
Documentation
[ ] (To be done in a follow-up PR) Update public docs for the new URI-based behavior once it is integrated into /file_parse
[x] Internal behavior and expectations are documented in uri_io.py docstrings and tests
🧪 Testing Guide
Basic Tests:
Ensure mineru is installed in editable mode with test extras:
pip install -e ".[test]"
Run the new tests:
pytest tests/test_uri_io.py
Scenarios Covered:
Local read:
read_bytes_from_uri("tests/unittest/pdfs/test.pdf") returns non-empty bytes
Unsupported scheme:
read_bytes_from_uri("http://example.com/foo.pdf") raises ValueError with “Unsupported URI scheme” and “Only local paths and s3://”
Missing S3 backend:
With S3 backend disabled/missing, calling read_bytes_from_uri("s3://bucket/key") raises ImportError mentioning pip install "mineru[s3]".
Output dir selection:
prepare_output_dir(None, "./output") returns a real local directory
prepare_output_dir("s3://my-bucket/prefix", "./output") returns a temp dir with is_s3_output=True and normalized_output_uri unchanged
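One way to reproduce the "Missing S3 backend" scenario on a machine where boto3 is installed is to block the import temporarily. The sketch below is an assumption about how such a test could be set up, not a description of what tests/test_uri_io.py does; the helper name `module_unavailable` is hypothetical. It relies on documented import-system behavior: a `None` entry in `sys.modules` makes importing that name raise `ModuleNotFoundError`, a subclass of `ImportError`.

```python
import importlib
import sys
from contextlib import contextmanager


@contextmanager
def module_unavailable(name: str):
    """Temporarily make `import name` fail, even if the package is installed."""
    missing = object()
    saved = sys.modules.get(name, missing)
    sys.modules[name] = None  # a None entry makes the import system raise
    try:
        yield
    finally:
        # Restore the previous state so later imports behave normally.
        if saved is missing:
            del sys.modules[name]
        else:
            sys.modules[name] = saved
```

Inside `with module_unavailable("boto3"):` a lazily importing S3 helper should hit its ImportError branch, which lets a test assert on the install hint without changing the environment.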

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Nov 21, 2025
@github-actions
Contributor

github-actions bot commented Nov 21, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@dosubot dosubot bot added the enhancement New feature or request label Nov 21, 2025
@eric-ozim
Author

I have read the CLA Document and I hereby sign the CLA

@eric-ozim eric-ozim force-pushed the feature/parse_file_enhancement branch from a3020a0 to 67d2dcb Compare November 21, 2025 11:59
github-actions bot added a commit that referenced this pull request Nov 21, 2025
