feat(rag): enhance url loader with Jina Reader for better HTML parsing by yueliao11 · Pull Request #225 · XSpoonAi/spoon-core

yueliao11 · 2025-12-19T12:51:53Z

Summary

This PR significantly improves the capabilities of the RAG system's URL loader by integrating Jina Reader. It shifts from a simple regex-based HTML stripper to a smart extraction engine that converts web pages into structured Markdown, filtering out noise like sidebars and ads.

Key Changes

1. Hybrid Loading Strategy

The loader now intelligently distinguishes between target types:

Direct Path
- GitHub Raw URLs (raw.githubusercontent.com) and plain-text files (.py, .md, .json, etc.) are still downloaded directly, so code is preserved exactly and no extra latency is added.
Smart Path
- General web pages are routed through Jina Reader (https://r.jina.ai/) to extract high-quality, LLM-friendly content.
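
The routing decision described above could look roughly like the sketch below. It is not the code from this PR: the function names, the host set, and the extension list are assumptions for illustration (the description only names raw.githubusercontent.com and ".py, .md, .json, etc."). The one detail taken from the description is the Jina Reader prefix https://r.jina.ai/, which is prepended to the target URL.

```python
from urllib.parse import urlparse

# Hypothetical values: the PR names only raw.githubusercontent.com and
# ".py, .md, .json, etc.", so the exact sets used by the loader are unknown.
DIRECT_HOSTS = {"raw.githubusercontent.com"}
TEXT_EXTENSIONS = (".py", ".md", ".json", ".txt", ".yaml", ".yml")
JINA_READER_PREFIX = "https://r.jina.ai/"


def is_direct_target(url: str) -> bool:
    """Return True when the URL should be downloaded as-is (Direct Path)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # str.endswith accepts a tuple, so one call checks every extension.
    return host in DIRECT_HOSTS or parsed.path.lower().endswith(TEXT_EXTENSIONS)


def resolve_fetch_url(url: str) -> str:
    """Map a target URL to the URL that would actually be requested."""
    if is_direct_target(url):
        return url
    # Smart Path: let Jina Reader fetch the page and return Markdown.
    return JINA_READER_PREFIX + url
```

Keeping the decision in a pure function means the routing rules can be tested without any network access.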

2. Fallback Mechanism

If the Jina Reader service times out or fails, the system automatically falls back to the original method (direct download + regex stripping) to ensure no disruption in service.
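
A minimal sketch of this fallback, assuming the two fetchers are passed in as callables. `load_with_fallback`, `fetch_reader`, `fetch_direct`, and `strip_html` are hypothetical names, and the regex stripper only stands in for the original method, whose actual patterns are not shown in this PR.

```python
import re
from typing import Callable


def strip_html(html: str) -> str:
    """Crude regex-based tag stripper standing in for the original method."""
    # Drop <script> and <style> blocks together with their contents.
    no_blocks = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    # Drop every remaining tag, then collapse runs of whitespace.
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_blocks)
    return re.sub(r"\s+", " ", no_tags).strip()


def load_with_fallback(
    url: str,
    fetch_reader: Callable[[str], str],
    fetch_direct: Callable[[str], str],
) -> str:
    """Try the reader first; on any failure, download directly and strip tags."""
    try:
        return fetch_reader(url)
    except Exception:
        # Broad catch for illustration only: timeouts, HTTP errors, and
        # connection failures all lead to the same fallback path here.
        return strip_html(fetch_direct(url))
```

Injecting the fetchers keeps the fallback logic independent of any particular HTTP client. A real loader would probably catch narrower exception types and log the failure before falling back.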

3. Configuration

Added support for an optional JINA_API_KEY environment variable. Without it, the loader uses Jina Reader anonymously.
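
One way the optional key could be wired into requests is sketched below. `build_reader_headers` is a hypothetical helper, and sending the key as a Bearer token in the Authorization header is an assumption based on Jina's public documentation. This PR description does not show how the loader actually passes the key.

```python
import os
from typing import Mapping, Optional


def build_reader_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return request headers for Jina Reader, authenticated only if a key is set."""
    env = os.environ if env is None else env
    api_key = (env.get("JINA_API_KEY") or "").strip()
    if api_key:
        # Assumed header format; not confirmed by the PR itself.
        return {"Authorization": f"Bearer {api_key}"}
    # No key configured: anonymous access, no extra headers.
    return {}
```

Accepting the environment as a parameter lets the anonymous and authenticated cases be checked without modifying the real process environment.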

chatgpt-codex-connector · 2025-12-19T12:52:03Z

The account who enabled Codex for this repo no longer has access to Codex. Please contact the admins of this repo to enable Codex again.

yueliao11 · 2025-12-23T15:05:41Z

I pushed a small follow-up commit to refine spoon_ai/rag/qa.py after local testing. No other files are included.

feat(rag): enhance url loader with Jina Reader for better HTML parsing

d61cfd9

yueliao11 added 2 commits December 19, 2025 21:00

Translate RAG Loader Comments to English

d49f7db

fix(rag): improve QA behavior and robustness

9d11c7a

moonpyt changed the base branch from main to inntegration/rag December 24, 2025 03:44

moonpyt merged commit ae5892b into XSpoonAi:inntegration/rag Dec 24, 2025
1 check passed

moonpyt merged 3 commits into XSpoonAi:inntegration/rag from yueliao11:feat/rag-framework
