feat(parser): add remote URL support for DoclingParser #195
bueno12223 wants to merge 2 commits into HKUDS:main
Conversation
- Implemented URL detection and secure downloading in Parser base class.
- Added temporary file handling with automatic cleanup in DoclingParser.
- Added user-agent headers to prevent 403 errors during document retrieval.
- Included verification script for automated URL parsing tests.
- Closes HKUDS#183
- add docling v2.72.0 to requirements
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 836b5e452c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
except Exception as e:
    self.logger.error(f"Failed to download file from {url}: {e}")
    raise RuntimeError(f"Failed to download file from {url}: {e}")
```
Remove temp file when URL download fails
The new URL path creates a temp file before the network read, but on urlopen/copy errors this except path re-raises without deleting that file. Because parse_document only tracks downloaded_temp_file after _download_file returns, failed downloads leak files in /tmp (reproducible with an unreachable URL), which can accumulate and eventually impact long-running ingestion workers.
The `_download_file` method now properly cleans up the temporary file in the exception handler:
- Initialize `tmp_path = None` before the try block
- In the except block, check if `tmp_path` exists and delete it before re-raising
- Added a `finally` block to close the HTTP response

This ensures that even if `urlopen` or `copyfileobj` fail, the temporary file is removed and won't leak in `/tmp`.
```python
# Parse URL to get path and extension
parsed_url = urllib.parse.urlparse(url)
path = Path(parsed_url.path)
suffix = path.suffix if path.suffix else ""
```
Support extensionless URLs in format detection
The downloaded temp file suffix is derived only from Path(parsed_url.path).suffix, so URLs like /download?id=123 that return a valid PDF get saved without an extension. parse_document then dispatches by file_path.suffix and raises Unsupported file format, so the new remote-URL feature fails for many common signed/download endpoints even though the payload is parseable.
The method now supports extensionless URLs by inspecting the `Content-Type` header:
- Open the HTTP connection first to access response headers
- If no extension is found in the URL path, check the `Content-Type` header
- Use `mimetypes.guess_extension()` to infer the correct extension (e.g., `application/pdf` → `.pdf`)
- Create the temporary file with the inferred extension

This allows URLs like `/download?id=123` to work correctly as long as they return a valid `Content-Type` header.
- Fixed temp file leak by adding cleanup in exception handler when download fails.
- Added Content-Type header inspection to support extensionless URLs.
- Used `mimetypes.guess_extension()` to infer file extensions from MIME types.
- Applied linting and formatting fixes with ruff and pre-commit.
- Ensures temp files are properly deleted even on `urlopen`/`copyfileobj` errors.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 758ea1131a
| ) | ||
|
|
||
| # Open connection to get headers | ||
| response = urllib.request.urlopen(req) |
Set an explicit timeout for URL downloads
_download_file calls urllib.request.urlopen(req) without a timeout, which uses Python’s default socket timeout (None) and can block forever when an endpoint stalls or a connection hangs mid-transfer; in URL-based parsing this can wedge worker processes and halt ingestion pipelines until manually restarted. Please pass a bounded timeout (ideally configurable) so remote fetch failures terminate predictably.
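A bounded timeout along the lines the reviewer suggests could look like this. The default value and the helper name are assumptions; ideally the timeout would come from the parser's configuration.

```python
import socket
import urllib.request

# hypothetical default; ideally read from parser config rather than hard-coded
DOWNLOAD_TIMEOUT_SECONDS = 30.0


def open_url(url: str, timeout: float = DOWNLOAD_TIMEOUT_SECONDS):
    """Open `url` with a bounded timeout so a stalled endpoint can't hang forever."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        return urllib.request.urlopen(req, timeout=timeout)
    except socket.timeout as e:
        raise RuntimeError(f"Timed out after {timeout}s fetching {url}") from e
```

Note that `urlopen`'s `timeout` bounds the connect and each individual socket read, not the total transfer time, so a very slow drip-feed endpoint would still need a separate overall deadline.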
There are some lint errors that need to be fixed by running:
Description

This PR implements the ability to parse documents directly from a URL using the `DoclingParser`. It allows the RAG pipeline to ingest remote resources seamlessly by handling the download and cleanup process automatically.

Related Issues

Closes #183

Changes Made

- `Parser` base class: Added `_is_url()` for detection and `_download_file()` to handle retrieval with custom User-Agent headers.
- `DoclingParser` class: Integrated the URL workflow into the `parse_document` method, using `try...finally` to ensure disk cleanup of temporary files.

Checklist
Additional Notes
The implementation mimics a browser User-Agent to avoid `403 Forbidden` errors from common document hosts (like S3 or GitHub).