feat(parser): add remote URL support for DoclingParser #195
bueno12223 wants to merge 2 commits into HKUDS:main
Conversation
- Implemented URL detection and secure downloading in Parser base class.
- Added temporary file handling with automatic cleanup in DoclingParser.
- Added user-agent headers to prevent 403 errors during document retrieval.
- Included verification script for automated URL parsing tests.
- Closes HKUDS#183
- add docling v2.72.0 to requirements
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 836b5e452c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
except Exception as e:
    self.logger.error(f"Failed to download file from {url}: {e}")
    raise RuntimeError(f"Failed to download file from {url}: {e}")
```
Remove temp file when URL download fails
The new URL path creates a temp file before the network read, but on urlopen/copy errors this except path re-raises without deleting that file. Because parse_document only tracks downloaded_temp_file after _download_file returns, failed downloads leak files in /tmp (reproducible with an unreachable URL), which can accumulate and eventually impact long-running ingestion workers.
The `_download_file` method now properly cleans up the temporary file in the exception handler:
- Initialize `tmp_path = None` before the try block
- In the except block, check if `tmp_path` exists and delete it before re-raising
- Added a `finally` block to close the HTTP response

This ensures that even if `urlopen` or `copyfileobj` fail, the temporary file is removed and won't leak in `/tmp`.
```python
# Parse URL to get path and extension
parsed_url = urllib.parse.urlparse(url)
path = Path(parsed_url.path)
suffix = path.suffix if path.suffix else ""
```
Support extensionless URLs in format detection
The downloaded temp file suffix is derived only from Path(parsed_url.path).suffix, so URLs like /download?id=123 that return a valid PDF get saved without an extension. parse_document then dispatches by file_path.suffix and raises Unsupported file format, so the new remote-URL feature fails for many common signed/download endpoints even though the payload is parseable.
The method now supports extensionless URLs by inspecting the `Content-Type` header:
- Open the HTTP connection first to access response headers
- If no extension is found in the URL path, check the `Content-Type` header
- Use `mimetypes.guess_extension()` to infer the correct extension (e.g., `application/pdf` → `.pdf`)
- Create the temporary file with the inferred extension

This allows URLs like `/download?id=123` to work correctly as long as they return a valid `Content-Type` header.
- Fixed temp file leak by adding cleanup in exception handler when download fails.
- Added Content-Type header inspection to support extensionless URLs.
- Used `mimetypes.guess_extension()` to infer file extensions from MIME types.
- Applied linting and formatting fixes with ruff and pre-commit.
- Ensures temp files are properly deleted even on `urlopen`/`copyfileobj` errors.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 758ea1131a
| ) | ||
|
|
||
| # Open connection to get headers | ||
| response = urllib.request.urlopen(req) |
Set an explicit timeout for URL downloads
_download_file calls urllib.request.urlopen(req) without a timeout, which uses Python’s default socket timeout (None) and can block forever when an endpoint stalls or a connection hangs mid-transfer; in URL-based parsing this can wedge worker processes and halt ingestion pipelines until manually restarted. Please pass a bounded timeout (ideally configurable) so remote fetch failures terminate predictably.
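A bounded timeout along the lines the reviewer suggests could look like this. The default value and the helper name are assumptions; ideally the timeout would come from the parser's configuration.

```python
import socket
import urllib.request

# hypothetical default; ideally read from parser config rather than hard-coded
DOWNLOAD_TIMEOUT_SECONDS = 30.0


def open_url(url: str, timeout: float = DOWNLOAD_TIMEOUT_SECONDS):
    """Open `url` with a bounded timeout so a stalled endpoint can't hang forever."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        return urllib.request.urlopen(req, timeout=timeout)
    except socket.timeout as e:
        raise RuntimeError(f"Timed out after {timeout}s fetching {url}") from e
```

Note that `urlopen`'s `timeout` bounds the connect and each individual socket read, not the total transfer time, so a very slow drip-feed endpoint would still need a separate overall deadline.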
There are some lint errors that need to be fixed by running:
Description

This PR implements the ability to parse documents directly from a URL using the `DoclingParser`. It allows the RAG pipeline to ingest remote resources seamlessly by handling the download and cleanup process automatically.

Related Issues

Closes #183

Changes Made

- `Parser` base class: Added `_is_url()` for detection and `_download_file()` to handle retrieval with custom User-Agent headers.
- `DoclingParser` class: Integrated the URL workflow into the `parse_document` method, using `try...finally` to ensure disk cleanup of temporary files.

Checklist
Additional Notes
The implementation mimics a browser User-Agent to avoid `403 Forbidden` errors from common document hosts (like S3 or GitHub).