feat(rag): enhance url loader with Jina Reader for better HTML parsing #225
Merged
moonpyt merged 3 commits into XSpoonAi:inntegration/rag on Dec 24, 2025
I pushed a small follow-up commit to refine |
Summary
This PR improves the RAG system's URL loader by integrating Jina Reader. Instead of relying only on a simple regex-based HTML stripper, the loader now uses an extraction service that converts web pages into structured Markdown and filters out noise such as sidebars and ads.
Key Changes
1. Hybrid Loading Strategy
The loader now distinguishes between two target types:
Direct Path: Raw-content hosts (raw.githubusercontent.com) and pure text files (.py, .md, .json, etc.) are still downloaded directly, which keeps code byte-accurate and adds no extra latency.
Smart Path: Other web pages are fetched through Jina Reader (https://r.jina.ai/) to extract high-quality, LLM-friendly content.
2. Fallback Mechanism
If the Jina Reader service times out or fails, the system automatically falls back to the original method (direct download + regex stripping) to ensure no disruption in service.
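The routing and fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `is_direct_target`, `strip_html`, `load_url`, and the injected `fetch` callable are hypothetical names, the suffix list covers only the three extensions named above, and the regex stripper is a stand-in for the original method.

```python
import re
from typing import Callable
from urllib.parse import urlparse

JINA_READER_PREFIX = "https://r.jina.ai/"  # reader endpoint named in the PR description
RAW_HOSTS = {"raw.githubusercontent.com"}
TEXT_SUFFIXES = (".py", ".md", ".json")  # only the extensions listed above; "etc." is not guessed


def is_direct_target(url: str) -> bool:
    """Return True for raw-content hosts and plain-text file URLs."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return host in RAW_HOSTS or parsed.path.lower().endswith(TEXT_SUFFIXES)


def strip_html(html: str) -> str:
    """Crude regex tag stripper (hypothetical stand-in for the original method)."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)  # drop script/style bodies
    text = re.sub(r"(?s)<[^>]+>", " ", text)  # drop remaining tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


def load_url(url: str, fetch: Callable[[str], str]) -> str:
    """Load a URL via the direct path, the reader path, or the fallback.

    `fetch` is an assumed helper that takes a URL and returns the response
    body as text, raising OSError (which includes TimeoutError) on failure.
    """
    if is_direct_target(url):
        return fetch(url)  # direct path: content returned unmodified
    try:
        return fetch(JINA_READER_PREFIX + url)  # smart path: Markdown from the reader
    except OSError:
        return strip_html(fetch(url))  # fallback: direct download + regex stripping
```

Passing `fetch` in as a parameter keeps the routing logic testable without network access; a real loader would presumably wrap its HTTP client with a timeout here.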
3. Configuration
Added support for an optional JINA_API_KEY environment variable; the loader works anonymously by default.
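As a minimal sketch of how the optional key might be picked up, the helper below builds request headers from the environment. The function name `jina_headers` and the `env` parameter are hypothetical, and the `Authorization: Bearer` header format is an assumption about how the key is sent; the PR description does not spell this out.

```python
import os
from typing import Dict, Mapping, Optional


def jina_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return auth headers when JINA_API_KEY is set, else an empty dict."""
    env = os.environ if env is None else env
    key = env.get("JINA_API_KEY", "").strip()
    # Assumption: the key is sent as a bearer token; with no key the
    # request goes out anonymously, with no extra headers.
    return {"Authorization": f"Bearer {key}"} if key else {}
```

Treating a blank or whitespace-only value as unset avoids sending an empty bearer token when the variable is defined but empty.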