
feat(fetch-url): support configurable reader fallback and a higher Jina quota #31

Merged
DCjanus merged 6 commits into master from feat/fetch-url-reader-fallback
Mar 9, 2026

@DCjanus DCjanus commented Mar 9, 2026

Why

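  • Currently, when the server does not support Markdown, fetch-url falls back to local Playwright early, which easily triggers site rate limiting.
  • Some readers return content successfully but have actually fetched a rate-limit or CAPTCHA notice page, so the agent needs a way to switch explicitly to a more robust fallback fetch method.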

What

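  • fetch_url.py adds a Jina Reader fallback and accepts an API token through the JINA_API_KEY environment variable to raise the Reader quota.
  • Adds a --fetch-strategy option that lets callers explicitly choose the Markdown fetch path among auto, agent, jina, and browser.
  • Adds heuristic detection of common rate-limit, CAPTCHA, and block-notice content; in auto mode a match makes the script continue falling back to a more robust method.
  • Updates the fetch-url skill documentation with Jina Reader configuration, rate-limit notes, and instructions for switching fetch strategies manually.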
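
As a rough illustration of the JINA_API_KEY point above, the sketch below builds (but does not send) a Jina Reader request. It assumes the public `https://r.jina.ai/<url>` endpoint and Bearer-token authentication; `build_jina_request` is a hypothetical helper name, not a function from fetch_url.py.

```python
import os
import urllib.request
from typing import Optional

# Assumption: the public Jina Reader endpoint, which takes the target URL as a path suffix.
JINA_READER_PREFIX = "https://r.jina.ai/"


def build_jina_request(target_url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build, without sending, a Jina Reader request for target_url.

    Hypothetical helper: when api_key is None it falls back to the
    JINA_API_KEY environment variable, mirroring the PR description.
    """
    key = api_key if api_key is not None else os.environ.get("JINA_API_KEY")
    headers = {}
    if key:
        # An API token is optional; without it the request is unauthenticated.
        headers["Authorization"] = f"Bearer {key}"
    return urllib.request.Request(JINA_READER_PREFIX + target_url, headers=headers)


req = build_jina_request("https://example.com/post", api_key="demo-token")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the request to `urllib.request.urlopen` would perform the actual fetch; that step is left out so the sketch runs offline.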

Testing

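  • Ran ./scripts/fetch_url.py --help to confirm the new option is exposed and the CLI starts normally.
  • Used rg to verify that JINA_API_KEY, --fetch-strategy, and the fallback descriptions are consistent between code and docs.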

Co-authored-by: OpenAI Codex <codex@openai.com>

coderabbitai bot commented Mar 9, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added a Markdown-only --fetch-strategy option (auto, agent, jina, browser); default auto uses sensible fallbacks.
    • Integrated Jina Reader support with optional JINA_API_KEY to improve Markdown fetching and handle rate-limit responses.
  • Chores

    • Removed the --disable-twitter-api flag; use --fetch-strategy to control fetch behavior and fallbacks.
  • Documentation

    • Updated docs and examples to show fetch-strategy usage, JINA_API_KEY examples, and browser mode.

Walkthrough

Adds a Markdown-only CLI option --fetch-strategy (auto|agent|jina|browser), removes --disable-twitter-api, implements Jina Reader fetching with optional JINA_API_KEY, and updates fetch-url logic to route and fallback between FxTwitter, Jina Reader, agent, and browser strategies.
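
The routing described in the walkthrough can be sketched roughly as follows. This is a simplified illustration, not the code in fetch_url.py: the `auto` order (agent, then Jina, then browser) is an assumption inferred from this thread, the FxTwitter branch is omitted, and `fetch_markdown` and the fetcher callables are hypothetical names.

```python
from typing import Callable, Dict, Optional, Tuple

# A fetcher returns Markdown, or None when its result is unusable.
Fetcher = Callable[[str], Optional[str]]

# Assumed fallback order for "auto"; the real script may differ.
AUTO_ORDER = ("agent", "jina", "browser")


def fetch_markdown(url: str, strategy: str, fetchers: Dict[str, Fetcher]) -> Tuple[str, str]:
    """Return (strategy_used, markdown), or raise if nothing usable came back."""
    # An explicit strategy is tried alone; only "auto" walks the fallback chain.
    order = AUTO_ORDER if strategy == "auto" else (strategy,)
    for name in order:
        markdown = fetchers[name](url)
        if markdown:
            return name, markdown
    raise RuntimeError(f"no usable Markdown via {', '.join(order)}")


# Stub fetchers standing in for the real network calls.
fetchers = {
    "agent": lambda url: None,  # origin site did not serve Markdown
    "jina": lambda url: "# Title from reader",
    "browser": lambda url: "# Title from browser",
}
print(fetch_markdown("https://example.com", "auto", fetchers))
# → ('jina', '# Title from reader')
```

With an explicit strategy such as `agent`, the same stubs raise `RuntimeError` instead of falling through, which matches the "Error (agent-only)" branch in the diagram below.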

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation: skills/fetch-url/SKILL.md | Replace --disable-twitter-api with --fetch-strategy docs (values: auto/agent/jina/browser); add JINA_API_KEY examples; update markdown fetch order and guidance. |
| Core implementation: skills/fetch-url/scripts/fetch_url.py | Add FetchStrategy type and fetch_strategy CLI param; remove disable_twitter_api; add Jina-related constants and fetch_jina_reader_markdown(); implement strategy-based routing and sequential fallbacks among FxTwitter, agent, Jina, and browser; update messages and error guidance. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant FetchCmd as Fetch Command
    participant Site as Original Site
    participant FxTwitter as FxTwitter API
    participant Jina as Jina Reader API
    participant Browser as Playwright Browser

    User->>FetchCmd: Request markdown (URL, --fetch-strategy)
    alt URL is Twitter/X and strategy == auto
        FetchCmd->>FxTwitter: Fetch thread via FxTwitter API
        FxTwitter-->>FetchCmd: Thread Markdown / Error
        alt FxTwitter success
            FetchCmd-->>User: Return Markdown (FxTwitter)
        else FxTwitter error
            FetchCmd->>Jina: (fallback) Request Markdown
            Jina-->>FetchCmd: Markdown / rate-limit
            alt Jina success
                FetchCmd-->>User: Return Markdown (Jina)
            else Jina blocked
                FetchCmd->>Browser: Render page
                Browser-->>FetchCmd: Rendered Markdown
                FetchCmd-->>User: Return Markdown (Browser)
            end
        end
    else strategy == agent
        FetchCmd->>Site: Request Accept: text/markdown
        Site-->>FetchCmd: Markdown / unusable
        alt usable
            FetchCmd-->>User: Return Markdown (agent)
        else unusable
            FetchCmd-->>User: Error (agent-only)
        end
    else strategy == jina
        FetchCmd->>Jina: Request Markdown (include JINA_API_KEY if set)
        Jina-->>FetchCmd: Markdown / unusable / rate-limit
        alt usable
            FetchCmd-->>User: Return Markdown (Jina)
        else unusable
            FetchCmd-->>User: Error (jina-only)
        end
    else strategy == browser
        FetchCmd->>Browser: Render page via Playwright
        Browser-->>FetchCmd: Rendered Markdown
        FetchCmd-->>User: Return Markdown (Browser)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title in Chinese describes the main feature: configurable reader fallback and Jina quota enhancement, which directly aligns with the PR's core changes of adding --fetch-strategy parameter and Jina Reader support. |
| Description check | ✅ Passed | The description clearly explains the motivation (early fallback to Playwright causing rate limiting), the implementation (Jina Reader with API key support and --fetch-strategy parameter), and heuristic detection of rate-limit pages, all matching the actual code changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
skills/fetch-url/scripts/fetch_url.py (1)

543-559: ⚠️ Potential issue | 🟠 Major

Respect explicit --fetch-strategy values for Twitter/X URLs.

Lines 543-559 run the FxTwitter path before any strategy handling, so --fetch-strategy browser, agent, or jina is ignored for tweet URLs unless the caller also remembers --disable-twitter-api. That makes the new strategy flag non-authoritative on one of the main markdown flows.

♻️ One straightforward way to let explicit strategies win
-        if output_format == "markdown" and not disable_twitter_api:
+        if output_format == "markdown" and not disable_twitter_api and fetch_strategy == "auto":

Also applies to: 596-597

skills/fetch-url/SKILL.md (1)

51-55: ⚠️ Potential issue | 🟡 Minor

Describe the current FxTwitter markdown header correctly.

Line 54 says the first line is a comment marking FxTwitter provenance, but the renderer currently starts with a heading and emits the source note later in the metadata list. Please update the doc or the renderer so downstream parsers are not told to expect a marker that never appears.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ba3754c6-626c-4430-bbbd-ac18147f4bb7

📥 Commits

Reviewing files that changed from the base of the PR and between 54e3470 and 84c4d8a.

📒 Files selected for processing (2)
  • skills/fetch-url/SKILL.md
  • skills/fetch-url/scripts/fetch_url.py

Co-authored-by: OpenAI Codex <codex@openai.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
skills/fetch-url/scripts/fetch_url.py (2)

240-250: ⚠️ Potential issue | 🟠 Major

Don't treat any non-empty Jina body as a successful Markdown fetch.

A 200 interstitial or rate-limit page also satisfies markdown.strip(), so auto will stop here and never reach the browser fallback. Reject unexpected content types and obvious challenge/limit pages before returning success.

Run this read-only check to confirm the helper currently returns on "non-empty body" alone:

#!/bin/bash
set -euo pipefail
sed -n '216,257p' skills/fetch-url/scripts/fetch_url.py

Expected result: the function decodes the response and returns it after markdown.strip() without a content-type or interstitial check.
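
One way to reject obvious interstitials before returning success is a small, conservative marker check. The sketch below is only an illustration of that idea: the marker strings, the length cutoff, and the name `looks_like_block_page` are hypothetical, and the actual detection in fetch_url.py is not shown in this thread.

```python
# Hypothetical high-confidence markers; the real list in fetch_url.py is not shown here.
BLOCK_MARKERS = (
    "too many requests",
    "rate limit exceeded",
    "verify you are human",
)


def looks_like_block_page(markdown: str, max_chars: int = 2000) -> bool:
    """Flag short bodies that contain an obvious rate-limit or challenge phrase."""
    text = markdown.strip().lower()
    if not text:
        return True  # an empty body is never usable
    if len(text) > max_chars:
        return False  # long bodies are assumed to be real content
    return any(marker in text for marker in BLOCK_MARKERS)


print(looks_like_block_page("429 Too Many Requests"))           # → True
print(looks_like_block_page("# Hello\n\nA normal article."))    # → False
```

The length cutoff is a deliberate bias toward false negatives: a long article that merely mentions rate limiting is passed through rather than discarded.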


517-524: ⚠️ Potential issue | 🟠 Major

Keep the Jina fallback opt-in for non-public targets.

When agent negotiation misses, auto still forwards the original URL to the Jina reader service. That can leak localhost/intranet hosts, embedded credentials, or signed query strings to a third party without an explicit opt-in. Gate this fallback to clearly public URLs, otherwise require --fetch-strategy jina.

Run this read-only check to confirm the auto path calls the Jina helper directly and that the helper builds the upstream URL from the raw target URL:

#!/bin/bash
set -euo pipefail
sed -n '216,238p' skills/fetch-url/scripts/fetch_url.py
printf '\n---\n'
sed -n '499,524p' skills/fetch-url/scripts/fetch_url.py

Expected result: there is no public/private target guard between the auto branch and the Jina call.
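
A guard of the kind suggested here could look roughly like the sketch below. It is an assumption-laden illustration rather than a proposal for the real script: `is_clearly_public_url` is a hypothetical name, hostnames are not resolved through DNS, and signed query strings are not detected at all.

```python
import ipaddress
from urllib.parse import urlsplit


def is_clearly_public_url(url: str) -> bool:
    """Conservatively decide whether url is safe to hand to a third-party reader."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    if parts.username or parts.password:
        return False  # embedded credentials must not leave the machine
    host = parts.hostname  # urlsplit already lowercases the hostname
    if host == "localhost" or host.endswith((".localhost", ".local", ".internal")):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: accept only dotted DNS names (no DNS lookup is done here).
        return "." in host
    return ip.is_global  # rejects loopback, private, and link-local ranges


for candidate in (
    "https://example.com/post",
    "http://localhost:8080/admin",
    "http://10.0.0.5/report",
    "https://user:secret@example.com/",
):
    print(candidate, is_clearly_public_url(candidate))
```

In an `auto` path, a `False` result would skip the Jina step and go straight to the local browser, while `--fetch-strategy jina` would remain an explicit opt-in.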


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e2d0aa2a-54e0-4a42-abf2-0662b02f9768

📥 Commits

Reviewing files that changed from the base of the PR and between 84c4d8a and 8a0f98f.

📒 Files selected for processing (2)
  • skills/fetch-url/SKILL.md
  • skills/fetch-url/scripts/fetch_url.py


DCjanus commented Mar 9, 2026

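A note on how I weighed CodeRabbit's comments this time:

  1. On treating any returned Jina content as success: this round includes a more conservative fix. A Jina result is treated as unusable, and fallback continues, only when it matches a small set of high-confidence, obvious rate-limit or challenge-page signals. Broad keyword matching is deliberately avoided; some rate-limit pages may go undetected and be returned as-is, in exchange for fewer false positives on normal page content.

  2. On auto mode sending the target URL to Jina Reader: we choose to accept this behavior and add no extra restrictions. Typical usage in this repository is public web pages, and we generally do not send sensitive URLs through this path, so for now we prioritize fetch success rate and browserless fallback capability in auto mode.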

@coderabbitai

Co-authored-by: OpenAI Codex <codex@openai.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
skills/fetch-url/scripts/fetch_url.py (1)

539-546: ⚠️ Potential issue | 🟠 Major

auto still forwards arbitrary target URLs to Jina by default.

After agent negotiation misses, this path sends the original URL verbatim to Jina Reader. That includes localhost/private hosts, embedded credentials, signed query strings, and even #fragment data that would not be sent in a normal page request. Please gate this fallback to clearly public targets or require explicit --fetch-strategy jina opt-in for non-public URLs.
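
The fragment part of this concern is the easiest to illustrate: a browser never sends `#fragment` data to the server, so a reader proxy arguably should not receive it either. A minimal sketch using the standard library follows; `strip_fragment` is a hypothetical helper, not something from fetch_url.py.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Drop the #fragment, which a normal page request would never transmit."""
    return urldefrag(url).url


print(strip_fragment("https://example.com/doc?page=2#private-note"))
# → https://example.com/doc?page=2
```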

🧹 Nitpick comments (1)
skills/fetch-url/SKILL.md (1)

52-53: Link to Jina Reader's official rate-limit documentation instead of hard-coding specific RPM quotas.

The numbers in lines 52–53 (20 RPM, 500 RPM, 5000 RPM) are currently accurate, but vendor quotas are subject to change. Reference the official Jina Reader API documentation for rate limits and keep only the qualitative guidance (unauthenticated, standard API key, premium) in this troubleshooting section.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdb192b5-fe2d-4cf3-afb1-250320284629

📥 Commits

Reviewing files that changed from the base of the PR and between 8a0f98f and 5572ea4.

📒 Files selected for processing (2)
  • skills/fetch-url/SKILL.md
  • skills/fetch-url/scripts/fetch_url.py

Co-authored-by: OpenAI Codex <codex@openai.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
skills/fetch-url/scripts/fetch_url.py (1)

253-259: ⚠️ Potential issue | 🟡 Minor

Keep this Jina log message strategy-neutral.

fetch_jina_reader_markdown() now serves both auto and explicit jina mode, so “continue fallback” is only true for one caller. In --fetch-strategy jina, this path turns into a hard error, which makes verbose diagnostics misleading again.

🪵 Proposed log tweak
-                        "[yellow]Jina Reader returned a probable rate-limit page, continue fallback[/yellow]",
+                        "[yellow]Jina Reader returned a probable rate-limit page[/yellow]",

Also applies to: 554-561

🧹 Nitpick comments (1)
skills/fetch-url/scripts/fetch_url.py (1)

49-56: Add a few mocked tests around the new strategy matrix.

This path now decides whether auto keeps going, whether --fetch-strategy jina exits, and when browser fallback is used, but the reported checks only cover --help and doc/code string matching. A small set of mocked cases around is_obvious_jina_block_page(), agent miss → Jina hit, and Jina miss → browser fallback would make future changes here much safer.

Also applies to: 275-279, 521-563
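
The mocked cases suggested here might look something like the sketch below. Because fetch_url.py itself is not shown in this thread, the tests run against `fetch_auto`, a hypothetical stand-in for the auto path (agent, then Jina, then browser); real tests would patch the script's own helpers instead.

```python
import unittest
from unittest import mock


def fetch_auto(url, agent, jina, browser, is_block_page):
    """Stand-in for the auto path: agent, then Jina, then browser."""
    markdown = agent(url)
    if markdown:
        return "agent", markdown
    markdown = jina(url)
    if markdown and not is_block_page(markdown):
        return "jina", markdown
    return "browser", browser(url)


class AutoStrategyTests(unittest.TestCase):
    def test_agent_miss_then_jina_hit(self):
        agent = mock.Mock(return_value=None)
        jina = mock.Mock(return_value="# Article")
        browser = mock.Mock()
        used, _ = fetch_auto("https://example.com", agent, jina, browser, lambda md: False)
        self.assertEqual(used, "jina")
        browser.assert_not_called()  # no browser launch when Jina succeeds

    def test_jina_block_page_falls_back_to_browser(self):
        agent = mock.Mock(return_value=None)
        jina = mock.Mock(return_value="Too Many Requests")
        browser = mock.Mock(return_value="# Rendered")
        used, markdown = fetch_auto("https://example.com", agent, jina, browser, lambda md: True)
        self.assertEqual((used, markdown), ("browser", "# Rendered"))
```

Saved as a test module, these cases can be run with `python -m unittest`; no network access or Playwright installation is needed because every fetcher is a mock.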


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c96055cf-43ac-480a-8b69-b21f80a95ee5

📥 Commits

Reviewing files that changed from the base of the PR and between 5572ea4 and 357f775.

📒 Files selected for processing (2)
  • skills/fetch-url/SKILL.md
  • skills/fetch-url/scripts/fetch_url.py

Co-authored-by: OpenAI Codex <codex@openai.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 48059c19-3017-4f9b-b924-8d47b0b4e62d

📥 Commits

Reviewing files that changed from the base of the PR and between 357f775 and 4de94dc.

📒 Files selected for processing (1)
  • skills/fetch-url/SKILL.md

Comment on lines +32 to 40
`--fetch-strategy` common values:
- `auto`: the default.
- `agent`: prefer Markdown negotiation with the origin site.
- `jina`: prefer Jina Reader.
- `browser`: use local Playwright directly.

Environment variables:
- Set `JINA_API_KEY` to raise the Jina Reader rate limit: `JINA_API_KEY=your-token ./scripts/fetch_url.py ...`


⚠️ Potential issue | 🟠 Major

Document that auto may send the URL to Jina Reader before local fallback.

The default strategy now changes the network/privacy boundary, but the doc only says auto is the default. Users should be told that auto may forward the target URL to Jina Reader and only falls back when the response looks like a block/challenge page; otherwise the default behavior is easy to misread as purely local fetching.

✍️ Suggested wording
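 `--fetch-strategy` common values:
-- `auto`: the default.
+- `auto`: the default. It first tries fetch paths that do not need a local browser; these may send the target URL to Jina Reader. If the returned content matches obvious rate-limit / CAPTCHA signals, it continues to fall back to a more robust method.
 - `agent`: prefer Markdown negotiation with the origin site.
 - `jina`: prefer Jina Reader.
 - `browser`: use local Playwright directly.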


coderabbitai bot commented Mar 9, 2026


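@DCjanus Thanks for the detailed explanation. Both trade-offs are clear; I have recorded them in the long-term knowledge base and will not raise these two points again in future reviews.

  1. Jina rate-limit detection strategy: being conservative, and preferring to miss a block page over misflagging normal content, is reasonable. More high-confidence signals can be added gradually as more rate-limit page samples accumulate.
  2. auto mode forwarding URLs to Jina: improving fetch success rate and browserless fallback capability is a reasonable trade-off for public web pages. Noted.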

(づ。◕‿‿◕。)づ


✏️ Learnings added
Learnt from: DCjanus
URL: https://github.com/DCjanus/prompts/pull/31

Timestamp: 2026-03-09T10:25:10.822Z
Learning: In DCjanus/prompts (skills/fetch-url/scripts/fetch_url.py), the Jina block-page detection (`is_obvious_jina_block_page`) intentionally uses only a small set of high-confidence rate-limit/challenge signals. It deliberately avoids broad keyword matching to minimise false positives on normal page content — some undetected rate-limit pages are acceptable as a trade-off.

Learnt from: DCjanus
URL: https://github.com/DCjanus/prompts/pull/31

Timestamp: 2026-03-09T10:25:10.822Z
Learning: In DCjanus/prompts (skills/fetch-url/scripts/fetch_url.py), the `auto` fetch strategy forwards the target URL to Jina Reader without additional restrictions. This is an accepted design decision: the repository's typical use case is public web pages, so improving `auto`-mode fetch success rate and browserless fallback capability takes priority over URL-privacy concerns.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: DCjanus
Repo: DCjanus/prompts PR: 0
File: :0-0
Timestamp: 2026-02-27T11:54:38.945Z
Learning: In the DCjanus/prompts repository, scripts under skills/fetch-url target newer Python runtime environments and use requires-python >=3.14 as a deliberate policy to maintain a modern baseline and avoid compatibility complexity.

Co-authored-by: OpenAI Codex <codex@openai.com>
@DCjanus DCjanus merged commit 3489bfb into master Mar 9, 2026
1 of 2 checks passed
@DCjanus DCjanus deleted the feat/fetch-url-reader-fallback branch March 9, 2026 10:49