feat(parser): add remote URL support for DoclingParser#195

Open
bueno12223 wants to merge 2 commits into HKUDS:main from bueno12223:feat/parser-url-support

Conversation

@bueno12223

  • Implemented URL detection and secure downloading in Parser base class.
  • Added temporary file handling with automatic cleanup in DoclingParser.
  • Added user-agent headers to prevent 403 errors during document retrieval.
  • Included verification script for automated URL parsing tests.
  • Added docling v2.72.0 to requirements.
  • Closes [Question]: Does Parsers (Docling) support processing files from URL #183

Description

This PR implements the ability to parse documents directly from a URL using the DoclingParser. It allows the RAG pipeline to ingest remote resources seamlessly by handling the download and cleanup process automatically.

Related Issues

Closes #183

Changes Made

  • Parser base class: Added _is_url() for detection and _download_file() to handle retrieval with custom User-Agent headers.
  • DoclingParser class: Integrated the URL workflow into the parse_document method, using try...finally to ensure disk cleanup of temporary files.
  • Verification script: Added scripts/test_url_parsing.py for automated URL parsing tests. I'm happy to remove it if you prefer to keep the scripts folder strictly for core utilities.

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

The implementation mimics a browser User-Agent to avoid 403 Forbidden errors from common document hosts (like S3 or GitHub).
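As a rough illustration of the header trick (the exact User-Agent string and request construction used in the PR are not shown here, so the values below are assumptions):

```python
import urllib.request

# Hypothetical header set; any mainstream browser UA string works for
# hosts that reject Python's default "Python-urllib/3.x" agent.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}


def make_request(url: str) -> urllib.request.Request:
    """Build a Request that presents a browser User-Agent."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)
```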

@LarFii
Collaborator

LarFii commented Feb 10, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 836b5e452c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 110 to 112
except Exception as e:
self.logger.error(f"Failed to download file from {url}: {e}")
raise RuntimeError(f"Failed to download file from {url}: {e}")


P2 Badge Remove temp file when URL download fails

The new URL path creates a temp file before the network read, but on urlopen/copy errors this except path re-raises without deleting that file. Because parse_document only tracks downloaded_temp_file after _download_file returns, failed downloads leak files in /tmp (reproducible with an unreachable URL), which can accumulate and eventually impact long-running ingestion workers.


@bueno12223
Author

The _download_file method now properly cleans up the temporary file in the exception handler:

  • Initialize tmp_path = None before the try block
  • In the except block, check if tmp_path exists and delete it before re-raising
  • Added a finally block to close the HTTP response

This ensures that even if urlopen or copyfileobj fails, the temporary file is removed and won't leak in /tmp.
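The cleanup pattern described above can be sketched like this. It is a self-contained sketch, not the merged `_download_file` (whose User-Agent handling and signature are omitted); `download_with_cleanup` is a hypothetical name:

```python
import os
import shutil
import tempfile
import urllib.request


def download_with_cleanup(url: str, timeout: float = 30.0) -> str:
    """Download `url` to a temp file; delete the file if anything fails."""
    tmp_path = None  # initialized before the try block
    response = None
    try:
        tmp = tempfile.NamedTemporaryFile(delete=False)
        tmp_path = tmp.name
        with tmp:
            response = urllib.request.urlopen(url, timeout=timeout)
            shutil.copyfileobj(response, tmp)
        return tmp_path
    except Exception as e:
        # Delete the partially written temp file before re-raising.
        if tmp_path is not None and os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise RuntimeError(f"Failed to download file from {url}: {e}") from e
    finally:
        # Close the HTTP response even when copyfileobj raises.
        if response is not None:
            response.close()
```

The key points are that `tmp_path` exists before any network call, and the `except` branch unlinks it before re-raising, so a failed download leaves nothing behind in /tmp.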

# Parse URL to get path and extension
parsed_url = urllib.parse.urlparse(url)
path = Path(parsed_url.path)
suffix = path.suffix if path.suffix else ""


P2 Badge Support extensionless URLs in format detection

The downloaded temp file suffix is derived only from Path(parsed_url.path).suffix, so URLs like /download?id=123 that return a valid PDF get saved without an extension. parse_document then dispatches by file_path.suffix and raises Unsupported file format, so the new remote-URL feature fails for many common signed/download endpoints even though the payload is parseable.


@bueno12223
Author

The method now supports extensionless URLs by inspecting the Content-Type header:

  1. Open the HTTP connection first to access response headers
  2. If no extension is found in the URL path, check the Content-Type header
  3. Use mimetypes.guess_extension() to infer the correct extension (e.g., application/pdf → .pdf)
  4. Create the temporary file with the inferred extension

This allows URLs like /download?id=123 to work correctly as long as they return a valid Content-Type header.

- Fixed temp file leak by adding cleanup in exception handler when download fails.
- Added Content-Type header inspection to support extensionless URLs.
- Used mimetypes.guess_extension() to infer file extensions from MIME types.
- Applied linting and formatting fixes with ruff and pre-commit.
- Ensures temp files are properly deleted even on urlopen/copyfileobj errors.
@LarFii
Collaborator

LarFii commented Feb 17, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 758ea1131a


)

# Open connection to get headers
response = urllib.request.urlopen(req)


P1 Badge Set an explicit timeout for URL downloads

_download_file calls urllib.request.urlopen(req) without a timeout, which uses Python’s default socket timeout (None) and can block forever when an endpoint stalls or a connection hangs mid-transfer; in URL-based parsing this can wedge worker processes and halt ingestion pipelines until manually restarted. Please pass a bounded timeout (ideally configurable) so remote fetch failures terminate predictably.
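A bounded, configurable timeout along the lines the reviewer suggests might look like this. `DEFAULT_DOWNLOAD_TIMEOUT` and `open_url` are hypothetical names, not from the PR:

```python
import urllib.request

DEFAULT_DOWNLOAD_TIMEOUT = 30.0  # seconds; hypothetical default


def open_url(url: str, timeout: float = DEFAULT_DOWNLOAD_TIMEOUT):
    """Open `url` with a bounded timeout so a stalled endpoint raises
    (socket.timeout / URLError) instead of blocking the worker forever."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    return urllib.request.urlopen(req, timeout=timeout)
```

Note that urlopen's timeout bounds each blocking socket operation, not the total transfer time, so a host that drips bytes slowly can still take a while; it does, however, prevent the indefinite hang described above.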


@LarFii
Collaborator

LarFii commented Feb 20, 2026

There are some lint errors that need to be fixed by running:

pre-commit run --all-files


Development

Successfully merging this pull request may close these issues.

[Question]: Does Parsers (Docling) support processing files from URL

2 participants