feat(fetch-url): 支持可配置 reader 回退与 Jina 配额提升#31
Conversation
Co-authored-by: OpenAI Codex <codex@openai.com>
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a Markdown-only CLI option Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant FetchCmd as Fetch Command
participant Site as Original Site
participant FxTwitter as FxTwitter API
participant Jina as Jina Reader API
participant Browser as Playwright Browser
User->>FetchCmd: Request markdown (URL, --fetch-strategy)
alt URL is Twitter/X and strategy == auto
FetchCmd->>FxTwitter: Fetch thread via FxTwitter API
FxTwitter-->>FetchCmd: Thread Markdown / Error
alt FxTwitter success
FetchCmd-->>User: Return Markdown (FxTwitter)
else FxTwitter error
FetchCmd->>Jina: (fallback) Request Markdown
Jina-->>FetchCmd: Markdown / rate-limit
alt Jina success
FetchCmd-->>User: Return Markdown (Jina)
else Jina blocked
FetchCmd->>Browser: Render page
Browser-->>FetchCmd: Rendered Markdown
FetchCmd-->>User: Return Markdown (Browser)
end
end
else strategy == agent
FetchCmd->>Site: Request Accept: text/markdown
Site-->>FetchCmd: Markdown / unusable
alt usable
FetchCmd-->>User: Return Markdown (agent)
else unusable
FetchCmd-->>User: Error (agent-only)
end
else strategy == jina
FetchCmd->>Jina: Request Markdown (include JINA_API_KEY if set)
Jina-->>FetchCmd: Markdown / unusable / rate-limit
alt usable
FetchCmd-->>User: Return Markdown (Jina)
else unusable
FetchCmd-->>User: Error (jina-only)
end
else strategy == browser
FetchCmd->>Browser: Render page via Playwright
Browser-->>FetchCmd: Rendered Markdown
FetchCmd-->>User: Return Markdown (Browser)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
skills/fetch-url/scripts/fetch_url.py (1)
543-559:⚠️ Potential issue | 🟠 MajorRespect explicit
--fetch-strategyvalues for Twitter/X URLs.Lines 543-559 run the FxTwitter path before any strategy handling, so
--fetch-strategy browser,agent, orjinais ignored for tweet URLs unless the caller also remembers--disable-twitter-api. That makes the new strategy flag non-authoritative on one of the main markdown flows.♻️ One straightforward way to let explicit strategies win
- if output_format == "markdown" and not disable_twitter_api: + if output_format == "markdown" and not disable_twitter_api and fetch_strategy == "auto":Also applies to: 596-597
skills/fetch-url/SKILL.md (1)
51-55:⚠️ Potential issue | 🟡 MinorDescribe the current FxTwitter markdown header correctly.
Line 54 says the first line is a comment marking FxTwitter provenance, but the renderer currently starts with a heading and emits the source note later in the metadata list. Please update the doc or the renderer so downstream parsers are not told to expect a marker that never appears.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ba3754c6-626c-4430-bbbd-ac18147f4bb7
📒 Files selected for processing (2)
skills/fetch-url/SKILL.mdskills/fetch-url/scripts/fetch_url.py
Co-authored-by: OpenAI Codex <codex@openai.com>
There was a problem hiding this comment.
♻️ Duplicate comments (2)
skills/fetch-url/scripts/fetch_url.py (2)
240-250:⚠️ Potential issue | 🟠 MajorDon't treat any non-empty Jina body as a successful Markdown fetch.
A
200interstitial or rate-limit page also satisfiesmarkdown.strip(), soautowill stop here and never reach the browser fallback. Reject unexpected content types and obvious challenge/limit pages before returning success.Run this read-only check to confirm the helper currently returns on "non-empty body" alone:
#!/bin/bash set -euo pipefail sed -n '216,257p' skills/fetch-url/scripts/fetch_url.pyExpected result: the function decodes the response and returns it after
markdown.strip()without a content-type or interstitial check.
517-524:⚠️ Potential issue | 🟠 MajorKeep the Jina fallback opt-in for non-public targets.
When agent negotiation misses,
autostill forwards the original URL to the Jina reader service. That can leak localhost/intranet hosts, embedded credentials, or signed query strings to a third party without an explicit opt-in. Gate this fallback to clearly public URLs, otherwise require--fetch-strategy jina.Run this read-only check to confirm the
autopath calls the Jina helper directly and that the helper builds the upstream URL from the raw target URL:#!/bin/bash set -euo pipefail sed -n '216,238p' skills/fetch-url/scripts/fetch_url.py printf '\n---\n' sed -n '499,524p' skills/fetch-url/scripts/fetch_url.pyExpected result: there is no public/private target guard between the
autobranch and the Jina call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e2d0aa2a-54e0-4a42-abf2-0662b02f9768
📒 Files selected for processing (2)
skills/fetch-url/SKILL.mdskills/fetch-url/scripts/fetch_url.py
|
补充说明一下这次对 CodeRabbit 意见的取舍:
|
Co-authored-by: OpenAI Codex <codex@openai.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
skills/fetch-url/scripts/fetch_url.py (1)
539-546:⚠️ Potential issue | 🟠 Major
autostill forwards arbitrary target URLs to Jina by default.After agent negotiation misses, this path sends the original URL verbatim to Jina Reader. That includes localhost/private hosts, embedded credentials, signed query strings, and even
#fragmentdata that would not be sent in a normal page request. Please gate this fallback to clearly public targets or require explicit--fetch-strategy jinaopt-in for non-public URLs.
🧹 Nitpick comments (1)
skills/fetch-url/SKILL.md (1)
52-53: Link to Jina Reader's official rate-limit documentation instead of hard-coding specific RPM quotas.The numbers in lines 52–53 (
20 RPM,500 RPM,5000 RPM) are currently accurate, but vendor quotas are subject to change. Reference the official Jina Reader API documentation for rate limits and keep only the qualitative guidance (unauthenticated, standard API key, premium) in this troubleshooting section.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bdb192b5-fe2d-4cf3-afb1-250320284629
📒 Files selected for processing (2)
skills/fetch-url/SKILL.mdskills/fetch-url/scripts/fetch_url.py
Co-authored-by: OpenAI Codex <codex@openai.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
skills/fetch-url/scripts/fetch_url.py (1)
253-259:⚠️ Potential issue | 🟡 MinorKeep this Jina log message strategy-neutral.
fetch_jina_reader_markdown()now serves bothautoand explicitjinamode, so “continue fallback” is only true for one caller. In--fetch-strategy jina, this path turns into a hard error, which makes verbose diagnostics misleading again.🪵 Proposed log tweak
- "[yellow]Jina Reader returned a probable rate-limit page, continue fallback[/yellow]", + "[yellow]Jina Reader returned a probable rate-limit page[/yellow]",Also applies to: 554-561
🧹 Nitpick comments (1)
skills/fetch-url/scripts/fetch_url.py (1)
49-56: Add a few mocked tests around the new strategy matrix.This path now decides whether
autokeeps going, whether--fetch-strategy jinaexits, and when browser fallback is used, but the reported checks only cover--helpand doc/code string matching. A small set of mocked cases aroundis_obvious_jina_block_page(), agent miss → Jina hit, and Jina miss → browser fallback would make future changes here much safer.Also applies to: 275-279, 521-563
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c96055cf-43ac-480a-8b69-b21f80a95ee5
📒 Files selected for processing (2)
skills/fetch-url/SKILL.mdskills/fetch-url/scripts/fetch_url.py
Co-authored-by: OpenAI Codex <codex@openai.com>
| `--fetch-strategy` 常用值: | ||
| - `auto`:默认选择。 | ||
| - `agent`:优先用原站 Markdown 协商。 | ||
| - `jina`:优先用 Jina Reader。 | ||
| - `browser`:直接用本地 Playwright。 | ||
|
|
||
| 环境变量: | ||
| - 可设置 `JINA_API_KEY` 提升 Jina Reader 限流:`JINA_API_KEY=your-token ./scripts/fetch_url.py ...` | ||
|
|
There was a problem hiding this comment.
Document that auto may send the URL to Jina Reader before local fallback.
The default strategy now changes the network/privacy boundary, but the doc only says auto is the default. Users should be told that auto may forward the target URL to Jina Reader and only falls back when the response looks like a block/challenge page; otherwise the default behavior is easy to misread as purely local fetching.
✍️ Suggested wording
`--fetch-strategy` 常用值:
-- `auto`:默认选择。
+- `auto`:默认选择。会先尝试无需本地浏览器的抓取路径;其中可能会把目标 URL 发送给 Jina Reader。若返回内容命中明显限流 / 验证码特征,再继续 fallback 到更兜底的方式。
- `agent`:优先用原站 Markdown 协商。
- `jina`:优先用 Jina Reader。
- `browser`:直接用本地 Playwright。📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| `--fetch-strategy` 常用值: | |
| - `auto`:默认选择。 | |
| - `agent`:优先用原站 Markdown 协商。 | |
| - `jina`:优先用 Jina Reader。 | |
| - `browser`:直接用本地 Playwright。 | |
| 环境变量: | |
| - 可设置 `JINA_API_KEY` 提升 Jina Reader 限流:`JINA_API_KEY=your-token ./scripts/fetch_url.py ...` | |
| `--fetch-strategy` 常用值: | |
| - `auto`:默认选择。会先尝试无需本地浏览器的抓取路径;其中可能会把目标 URL 发送给 Jina Reader。若返回内容命中明显限流 / 验证码特征,再继续 fallback 到更兜底的方式。 | |
| - `agent`:优先用原站 Markdown 协商。 | |
| - `jina`:优先用 Jina Reader。 | |
| - `browser`:直接用本地 Playwright。 | |
| 环境变量: | |
| - 可设置 `JINA_API_KEY` 提升 Jina Reader 限流:`JINA_API_KEY=your-token ./scripts/fetch_url.py ...` |
|
Tip For best results, initiate chat on the files or code changes.
(づ。◕‿‿◕。)づ ✏️ Learnings added
🧠 Learnings used |
Co-authored-by: OpenAI Codex <codex@openai.com>
Why
fetch-url的 Markdown 获取在服务端不支持时会较早落到本地 Playwright,容易触发站点限流。What
fetch_url.py增加 Jina Reader fallback,并支持通过环境变量JINA_API_KEY传入 API Token 提升 Reader 配额。--fetch-strategy参数,允许在auto、agent、jina、browser之间显式选择 Markdown 抓取路径。auto模式下继续回退到更兜底的方式。fetch-urlskill 文档,补充 Jina Reader 配置、限流说明和手工切换抓取策略的用法。Testing
./scripts/fetch_url.py --help,确认新参数已暴露且 CLI 可正常启动。rg核对代码与文档中的JINA_API_KEY、--fetch-strategy和 fallback 说明保持一致。