feat(rag): enhance url loader with Jina Reader for better HTML parsing #225

Merged
moonpyt merged 3 commits into XSpoonAi:inntegration/rag from yueliao11:feat/rag-framework
Dec 24, 2025

Conversation

@yueliao11
Contributor

Summary

This PR improves the RAG system's URL loader by integrating Jina Reader. Instead of relying solely on a simple regex-based HTML stripper, the loader now uses an extraction service that converts web pages into structured Markdown and filters out noise such as sidebars and ads.

Key Changes

1. Hybrid Loading Strategy

The loader now routes each URL according to its target type:

  • Direct Path

    • GitHub Raw URLs (raw.githubusercontent.com) and plain-text files (.py, .md, .json, etc.) are still downloaded directly, which keeps code byte-for-byte accurate and avoids the extra round trip through Jina Reader.
  • Smart Path

    • General web pages are routed through Jina Reader (https://r.jina.ai/) to extract high-quality, LLM-friendly content.
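
The routing described above could be sketched roughly as follows. This is an illustration, not the code from this PR: the function name `resolve_fetch_url`, the constant names, and the exact extension list are assumptions (the PR only says ".py, .md, .json, etc."). It relies on the documented Jina Reader convention of prefixing the target URL with https://r.jina.ai/.

```python
from urllib.parse import urlparse

# Hypothetical names; the PR does not show the loader's real identifiers.
JINA_READER_PREFIX = "https://r.jina.ai/"
# Assumed extension list; the PR only names .py, .md, .json "etc.".
TEXT_EXTENSIONS = (".py", ".md", ".json", ".txt", ".yaml", ".yml")


def resolve_fetch_url(url: str) -> str:
    """Return the URL the loader should actually request."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    # Direct Path: raw GitHub content and plain-text files are fetched as-is.
    if host == "raw.githubusercontent.com" or path.endswith(TEXT_EXTENSIONS):
        return url

    # Smart Path: everything else is routed through Jina Reader.
    return JINA_READER_PREFIX + url
```

Under these assumptions, `resolve_fetch_url("https://example.com/blog/post")` yields a Jina Reader URL, while a raw.githubusercontent.com link is returned unchanged.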

2. Fallback Mechanism

If the Jina Reader service times out or fails, the loader automatically falls back to the original method (direct download + regex stripping), so loading continues without interruption.
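
A minimal sketch of this fallback, assuming the two fetch steps are passed in as callables. `load_with_fallback`, `strip_html`, `fetch_reader`, and `fetch_direct` are hypothetical names, and the regexes are a stand-in for the original stripper, whose actual patterns are not shown in this PR.

```python
import re
from typing import Callable


def strip_html(html: str) -> str:
    """Crude regex-based stripper, standing in for the original method."""
    # Drop script/style blocks entirely, then remove the remaining tags.
    html = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    # Collapse the whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", text).strip()


def load_with_fallback(
    url: str,
    fetch_reader: Callable[[str], str],
    fetch_direct: Callable[[str], str],
) -> str:
    """Try the Jina Reader path first; on any failure use the direct path."""
    try:
        return fetch_reader(url)
    except Exception:
        # Broad catch is an assumption: timeouts and HTTP errors are both
        # treated as "reader unavailable" in this sketch.
        return strip_html(fetch_direct(url))
```

Passing the fetchers in as arguments keeps the sketch independent of any particular HTTP client and makes the fallback branch easy to exercise in tests.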

3. Configuration

Added support for an optional JINA_API_KEY environment variable. The loader works anonymously when it is not set.
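
One way the optional key might be picked up is sketched below. The helper name `jina_headers` is hypothetical, and sending the key as a Bearer token in the Authorization header is an assumption based on Jina's public documentation rather than on code shown in this PR.

```python
import os
from typing import Mapping, Optional


def jina_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build request headers for Jina Reader (hypothetical helper)."""
    env = os.environ if env is None else env
    headers = {}
    api_key = env.get("JINA_API_KEY", "").strip()
    if api_key:
        # Assumed header format: Bearer token in Authorization.
        headers["Authorization"] = f"Bearer {api_key}"
    # No key set: return no auth header, i.e. anonymous access.
    return headers
```

With no JINA_API_KEY in the environment this returns an empty dict, which matches the anonymous default described above.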

@chatgpt-codex-connector
Copy link
Copy Markdown

The account who enabled Codex for this repo no longer has access to Codex. Please contact the admins of this repo to enable Codex again.

@yueliao11
Contributor Author

I pushed a small follow-up commit to refine spoon_ai/rag/qa.py after local testing. No other files are included.

@moonpyt moonpyt changed the base branch from main to inntegration/rag December 24, 2025 03:44
@moonpyt moonpyt merged commit ae5892b into XSpoonAi:inntegration/rag Dec 24, 2025
1 check passed