Merged
20 changes: 13 additions & 7 deletions skills/fetch-url/SKILL.md
@@ -27,20 +27,26 @@ uv run playwright install chromium
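 - `--timeout-ms`: Playwright navigation timeout (milliseconds, default 60000).
 - `--browser-path`: Path to a local Chromium-based browser (auto-detected by default).
 - `--output-format`: Output format (default `markdown`); supports `csv`, `html`, `json`, `markdown`, `raw-html`, `txt`, `xml`, `xmltei`; `raw-html` outputs the rendered HTML directly (bypassing trafilatura).
-- `--disable-twitter-api`: Disable the FxTwitter API optimization path for Twitter/X.
+- `--fetch-strategy`: Only available with `markdown`; supports `auto`, `agent`, `jina`, `browser`. Default: `auto`.

-Twitter/X specialization (`markdown` only):
-- When the URL matches an `x.com`/`twitter.com` tweet link and `--disable-twitter-api` is not set, the script first calls `https://api.fxtwitter.com/2/status/{id}`.
-- When FxTwitter returns `thread` data, the Markdown gains a `## Thread` section that lists the other tweets in the thread in order (the main tweet is deduplicated automatically).
-- The first line of the Markdown output contains a comment stating that the content came from the FxTwitter API rather than from visiting the page directly.
-- If the FxTwitter API request fails, the command errors out (it does not fall back to page scraping); to skip this logic, pass `--disable-twitter-api` explicitly.
+`--fetch-strategy` common values:
+- `auto`: The default choice.
+- `agent`: Prefer Markdown content negotiation with the origin site.
+- `jina`: Prefer Jina Reader.
+- `browser`: Use local Playwright directly.
+
+Environment variables:
+- Set `JINA_API_KEY` to raise the Jina Reader rate limit: `JINA_API_KEY=your-token ./scripts/fetch_url.py ...`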

Comment on lines +32 to 40

⚠️ Potential issue | 🟠 Major

Document that auto may send the URL to Jina Reader before local fallback.

The default strategy now changes the network/privacy boundary, but the doc only says auto is the default. Users should be told that auto may forward the target URL to Jina Reader and only falls back when the response looks like a block/challenge page; otherwise the default behavior is easy to misread as purely local fetching.
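
To make the boundary concrete, here is a minimal sketch of how a Jina Reader request would be assembled, assuming the `JINA_READER_API_ROOT` prefix and `JINA_API_KEY` variable shown in this PR's diff. `build_jina_request` is a hypothetical helper written for illustration; it is not a function in the script.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

JINA_READER_API_ROOT = "https://r.jina.ai/"  # value taken from the diff
JINA_API_KEY_ENV = "JINA_API_KEY"  # value taken from the diff


def build_jina_request(
    target_url: str, env: Mapping[str, str] | None = None
) -> tuple[str, dict[str, str]]:
    """Hypothetical helper: return the (reader_url, headers) a reader call would use."""
    env = os.environ if env is None else env
    api_key = env.get(JINA_API_KEY_ENV, "").strip()
    headers = {"Accept": "text/markdown, text/plain;q=0.9, */*;q=0.1"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    # The complete target URL, query string included, becomes part of the
    # request sent to the third-party reader service.
    return f"{JINA_READER_API_ROOT}{target_url}", headers


reader_url, headers = build_jina_request("https://example.com/doc?id=1", env={})
print(reader_url)  # https://r.jina.ai/https://example.com/doc?id=1
```

Because the target URL is embedded verbatim in the outgoing request, any path or query string it carries is visible to the reader service, which is why the default deserves an explicit note in the docs.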

✍️ Suggested wording
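 `--fetch-strategy` common values:
-- `auto`: The default choice.
+- `auto`: The default choice. It first tries fetch paths that do not need a local browser, which may send the target URL to Jina Reader. If the returned content matches obvious rate-limit / CAPTCHA signals, it falls back to a more conservative path.
 - `agent`: Prefer Markdown content negotiation with the origin site.
 - `jina`: Prefer Jina Reader.
 - `browser`: Use local Playwright directly.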

 Examples:

```bash
 ./scripts/fetch_url.py https://example.com --output ./page.md --timeout-ms 60000
+./scripts/fetch_url.py https://example.com --fetch-strategy jina
+JINA_API_KEY=your-token ./scripts/fetch_url.py https://example.com --fetch-strategy jina
+./scripts/fetch_url.py https://example.com --fetch-strategy browser
 ./scripts/fetch_url.py https://x.com/jack/status/20 --output-format markdown
-./scripts/fetch_url.py https://x.com/jack/status/20 --output-format markdown --disable-twitter-api
+./scripts/fetch_url.py https://x.com/jack/status/20 --output-format markdown --fetch-strategy browser
```

 Reference: [`scripts/fetch_url.py`](scripts/fetch_url.py)
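
The option rule for `--fetch-strategy` can be sketched as a small check. This mirrors the validation visible in this PR's `fetch_url.py` diff, with two assumptions: `validate_fetch_options` is a hypothetical name, and it raises `ValueError` instead of `typer.BadParameter` so the sketch stays standard-library only.

```python
from __future__ import annotations

VALID_STRATEGIES = ("auto", "agent", "jina", "browser")


def validate_fetch_options(output_format: str, fetch_strategy: str = "auto") -> None:
    """Hypothetical helper: reject option combinations the CLI would refuse."""
    if fetch_strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown fetch strategy: {fetch_strategy!r}")
    # A non-default strategy only makes sense for markdown output;
    # "auto" is accepted for every output format.
    if output_format != "markdown" and fetch_strategy != "auto":
        raise ValueError("Custom fetch strategy is only supported with markdown output.")


validate_fetch_options("markdown", "jina")  # accepted
validate_fetch_options("json")              # accepted: strategy defaults to "auto"
try:
    validate_fetch_options("json", "browser")
except ValueError as exc:
    print(exc)
```

Note that under this rule `auto` passes with any output format, so "only available with `markdown`" in practice means that only the non-default values are rejected for other formats.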
109 changes: 102 additions & 7 deletions skills/fetch-url/scripts/fetch_url.py
@@ -13,6 +13,7 @@
 from __future__ import annotations

 import json
+import os
 from pathlib import Path
 import re
 from typing import Any, Literal
@@ -33,6 +34,7 @@
 CONSOLE = Console()

 OutputFormat = Literal["csv", "html", "json", "markdown", "raw-html", "txt", "xml", "xmltei"]
+FetchStrategy = Literal["auto", "agent", "jina", "browser"]
 TWITTER_HOSTS = {
     "x.com",
     "www.x.com",
@@ -42,6 +44,16 @@
     "mobile.twitter.com",
 }
 FXTWITTER_API_ROOT = "https://api.fxtwitter.com/2/status"
+JINA_READER_API_ROOT = "https://r.jina.ai/"
+JINA_API_KEY_ENV = "JINA_API_KEY"
+JINA_BLOCK_PAGE_SIGNALS = (
+    ("rate limit", "jina"),
+    ("too many requests", "jina"),
+    ("rate limit exceeded", "jina"),
+    ("retry after", "jina"),
+    ("request limit reached", "jina"),
+    ("security verification", "jina"),
+)
 # FxTwitter source repository: https://github.com/allnodes/FxTwitter


@@ -209,6 +221,64 @@ def fetch_agent_markdown(url: str, timeout_ms: int, verbose: bool) -> str | None
return None


+def fetch_jina_reader_markdown(url: str, timeout_ms: int, verbose: bool) -> str | None:
+    """Fetch Markdown via Jina Reader and return it directly on a hit."""
+
+    reader_url = f"{JINA_READER_API_ROOT}{url}"
+    api_key = os.getenv(JINA_API_KEY_ENV, "").strip()
+    headers = {
+        "Accept": "text/markdown, text/plain;q=0.9, */*;q=0.1",
+        "User-Agent": "fetch-url/1.0 (+https://github.com/DCjanus/prompts/tree/master/skills/fetch-url)",
+    }
+    if api_key:
+        headers["Authorization"] = f"Bearer {api_key}"
+
+    if verbose:
+        auth_mode = f"with {JINA_API_KEY_ENV}" if api_key else "without API key"
+        CONSOLE.print(
+            f"[cyan]Trying Jina Reader[/cyan] {auth_mode}",
+            highlight=False,
+        )
+
+    request = Request(  # noqa: S310 - trusted local CLI input; the URL is supplied explicitly by the user
+        reader_url,
+        headers=headers,
+    )
+    try:
+        with urlopen(request, timeout=max(timeout_ms / 1000.0, 1.0)) as response:  # noqa: S310
+            charset = response.headers.get_content_charset() or "utf-8"
+            markdown = response.read().decode(charset, errors="replace")
+        if not markdown.strip():
+            return None
+        if is_obvious_jina_block_page(markdown):
+            if verbose:
+                CONSOLE.print(
+                    "[yellow]Jina Reader returned a probable rate-limit page[/yellow]",
+                    highlight=False,
+                )
+            return None
+        if verbose:
+            CONSOLE.print(
+                f"[green]Jina Reader hit[/green] {len(markdown)} chars",
+                highlight=False,
+            )
+        return markdown
+    except (URLError, OSError) as exc:
+        if verbose:
+            CONSOLE.print(
+                f"[yellow]Jina Reader failed[/yellow] ({exc})",
+                highlight=False,
+            )
+        return None
+
+
+def is_obvious_jina_block_page(content: str) -> bool:
+    """Detect a small set of very obvious Jina rate-limit or challenge pages."""
+
+    snippet = content[:4000].lower()
+    return any(all(part in snippet for part in signal) for signal in JINA_BLOCK_PAGE_SIGNALS)
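
The block-page check above can be exercised on its own. The sketch below copies `JINA_BLOCK_PAGE_SIGNALS` and `is_obvious_jina_block_page` from this diff; the sample strings are invented for illustration and are not real Jina Reader responses.

```python
JINA_BLOCK_PAGE_SIGNALS = (
    ("rate limit", "jina"),
    ("too many requests", "jina"),
    ("rate limit exceeded", "jina"),
    ("retry after", "jina"),
    ("request limit reached", "jina"),
    ("security verification", "jina"),
)


def is_obvious_jina_block_page(content: str) -> bool:
    # Only the first 4000 characters are inspected, case-insensitively.
    snippet = content[:4000].lower()
    # A signal matches when every one of its parts occurs in the snippet.
    return any(all(part in snippet for part in signal) for signal in JINA_BLOCK_PAGE_SIGNALS)


# Invented sample inputs (assumptions, not captured responses).
blocked = "Too Many Requests - Jina AI Reader"
article = "This post explains how a rate limit works."
late_hit = "x" * 4000 + " rate limit jina"

print(is_obvious_jina_block_page(blocked))   # True
print(is_obvious_jina_block_page(article))   # False: "jina" is missing
print(is_obvious_jina_block_page(late_hit))  # False: match lies past the 4000-char window
```

One consequence of this design is that a legitimate page mentioning both a phrase such as "rate limit" and the word "jina" near its top would also be classified as a block page and discarded.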


 def extract_twitter_status_id(url: str) -> str | None:
     """Extract the status id from an x.com/twitter.com tweet link."""

@@ -432,22 +502,23 @@ def fetch(
         "markdown",
         help="Output format: csv, html, json, markdown, raw-html, txt, xml, xmltei.",
     ),
-    disable_twitter_api: bool = typer.Option(
-        False,
-        "--disable-twitter-api",
-        help="Disable FxTwitter API optimization for x.com/twitter.com links in markdown mode.",
+    fetch_strategy: FetchStrategy = typer.Option(
+        "auto",
+        help="Fetch strategy for markdown: auto, agent, jina, browser.",
     ),
     verbose: bool = typer.Option(False, "--verbose", help="Print progress and diagnostic logs."),
 ) -> None:
     """Render the page with Playwright and extract content with trafilatura."""
     parsed = urlparse(url)
     if parsed.scheme not in {"http", "https"}:
         raise typer.BadParameter("Only http or https URLs are supported.")
+    if output_format != "markdown" and fetch_strategy != "auto":
+        raise typer.BadParameter("Custom fetch strategy is only supported with markdown output.")

     resolved_browser_path = str(browser_path) if browser_path else detect_browser_path()
     try:
         content: str | None = None
-        if output_format == "markdown" and not disable_twitter_api:
+        if output_format == "markdown" and fetch_strategy == "auto":
             twitter_status_id = extract_twitter_status_id(url)
             if twitter_status_id:
                 payload = fetch_fxtwitter_status(
@@ -458,14 +529,38 @@ def fetch(
                 if payload is None:
                     raise ValueError(
                         "FxTwitter API request failed for this Twitter/X URL. "
-                        "Use --disable-twitter-api to skip this path."
+                        "Use --fetch-strategy agent, jina, or browser to skip this path."
                     )
                 content = render_fxtwitter_markdown(payload, source_url=url)
                 if verbose:
                     CONSOLE.print("[green]Using FxTwitter API markdown path[/green]", highlight=False)
         if output_format == "markdown":
             if content is None:
-                content = fetch_agent_markdown(url, timeout_ms=timeout_ms, verbose=verbose)
+                if fetch_strategy == "auto":
+                    content = fetch_agent_markdown(url, timeout_ms=timeout_ms, verbose=verbose)
+                    if content is None:
+                        content = fetch_jina_reader_markdown(
+                            url,
+                            timeout_ms=timeout_ms,
+                            verbose=verbose,
+                        )
+                elif fetch_strategy == "agent":
+                    content = fetch_agent_markdown(url, timeout_ms=timeout_ms, verbose=verbose)
+                    if content is None:
+                        raise ValueError(
+                            "Markdown negotiation did not return usable content. "
+                            "Try --fetch-strategy jina or --fetch-strategy browser."
+                        )
+                elif fetch_strategy == "jina":
+                    content = fetch_jina_reader_markdown(url, timeout_ms=timeout_ms, verbose=verbose)
+                    if content is None:
+                        raise ValueError(
+                            "Jina Reader did not return usable content. "
+                            f"If this is rate limiting, configure {JINA_API_KEY_ENV} or try "
+                            "--fetch-strategy browser."
+                        )
+                elif fetch_strategy == "browser" and verbose:
+                    CONSOLE.print("[cyan]Skipping non-browser markdown readers[/cyan]", highlight=False)
         if content is None:
             html = render_html(
                 url,
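
The fallback behavior in this hunk can be summarized with a standalone sketch. It is a simplification under stated assumptions: the three fetchers are injected as plain callables (`agent`, `jina`, `browser` are hypothetical parameter names), the FxTwitter path is left out, and timeouts and logging are omitted.

```python
from __future__ import annotations

from typing import Callable, Optional

Reader = Callable[[str], Optional[str]]  # returns None on a miss


def fetch_markdown(
    url: str,
    strategy: str,
    agent: Reader,
    jina: Reader,
    browser: Callable[[str], str],
) -> str:
    """Sketch of the markdown branch; the FxTwitter path is left out."""
    content: Optional[str] = None
    if strategy == "auto":
        # auto: origin-site negotiation first, then Jina Reader.
        content = agent(url)
        if content is None:
            content = jina(url)
    elif strategy == "agent":
        content = agent(url)
        if content is None:
            raise ValueError("Markdown negotiation did not return usable content.")
    elif strategy == "jina":
        content = jina(url)
        if content is None:
            raise ValueError("Jina Reader did not return usable content.")
    # "browser", or "auto" with no reader hit, falls through to local rendering.
    if content is None:
        content = browser(url)
    return content


result = fetch_markdown(
    "https://example.com",
    "auto",
    agent=lambda u: None,            # simulate a negotiation miss
    jina=lambda u: "# from jina",    # simulate a reader hit
    browser=lambda u: "# from browser",
)
print(result)  # "# from jina"
```

The point worth noticing is the asymmetry: `auto` degrades silently down to the browser, while an explicit `agent` or `jina` choice raises an error instead of falling back.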