
Commit aca4dbb

Add browser-gated download mode
1 parent 379b8f0 commit aca4dbb


13 files changed: +1618 additions, −220 deletions


README.md

Lines changed: 60 additions & 1 deletion
@@ -56,6 +56,65 @@ Use `dependency_setup/setup_glossapi.sh` for the Docling environment, or `depend
`setup_glossapi.sh --mode deepseek` now delegates to the same uv-based installer. `setup_deepseek_uv.sh` uses `uv venv` + `uv sync`, installs the Rust extensions in editable mode, and can download `deepseek-ai/DeepSeek-OCR-2` with `huggingface_hub`.

If you want a guided install that asks which phases you plan to use, run:

```bash
python install_glossapi.py
```

That wizard keeps browser-gated download support (`playwright`) and the dedicated DeepSeek OCR runtime out of the main environment unless you explicitly select them.

## Browser-Gated Download Mode

`Corpus.download(...)` now supports three high-level routes for file acquisition:

- `download_mode="standard"`: direct HTTP downloader only
- `download_mode="auto"`: direct HTTP first, then browser-assisted recovery when the response is a recoverable browser-gated interstitial
- `download_mode="browser"`: go straight to browser-assisted acquisition for known browser-gated file endpoints

`browser_mode=True` remains supported as a legacy alias for `download_mode="browser"`.
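The escalation decision behind `auto` mode can be pictured with a simplified heuristic. This is only an illustration of the idea, not the downloader's actual detector; the function name and the 2 KiB sniff window are invented for the sketch:

```python
def looks_like_browser_interstitial(content_type: str, body: bytes) -> bool:
    """Rough sketch: a file URL that answers with an HTML page instead of
    file bytes is a candidate for browser-assisted recovery."""
    if body.startswith(b"%PDF") or content_type.lower().startswith("application/pdf"):
        return False  # real file bytes arrived; no escalation needed
    head = body[:2048].lower()
    return b"<html" in head or b"<!doctype html" in head
```

In `auto` mode the direct HTTP result is checked first, and only responses that look like recoverable interstitials are retried through a browser session.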
### Policy-driven routing

If you know which domains require a browser bootstrap, route them with a policy file instead of probing every URL:

```yaml
default:
  downloader: standard

rules:
  - match:
      domains: [eur-lex.europa.eu]
    downloader: browser

  - match:
      url_regex: "https://example.org/protected/.*"
    downloader: auto
```

```python
from glossapi import Corpus

corpus = Corpus(input_dir="out", output_dir="out")
corpus.download(
    input_parquet="input_urls.parquet",
    download_policy_file="download_policy.yml",
)
```
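Rules can also carry per-route browser tuning. The option keys below are the ones accepted by the policy loader (`ROUTE_OPTION_KEYS` in `src/glossapi/download_policy.py`); the domain and values are illustrative:

```yaml
rules:
  - match:
      domains: [example.org]
    downloader: browser
    browser_headless: true
    browser_timeout_ms: 45000
    browser_post_load_wait_ms: 2000
```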
### Operational notes

- Browser mode is for browser-gated file endpoints, not viewer-only sources.
- Browser sessions are cached per domain, so a successful bootstrap can be reused across multiple files.
- Successful downloads still land in `downloads/`; extraction continues to consume only real files from that directory.
- Viewer-style sources still fail cleanly in `download_results/*.parquet` and do not create fake files.

### Regression strategy

The checked-in browser download tests use mocked browser/session flows and fake PDF bytes rather than hard-coded live URLs.

For manual smoke checks against live browser-gated sources, build an ad hoc parquet locally and run it outside the committed test suite.
**DeepSeek runtime checklist**
- Run `python -m glossapi.ocr.deepseek.preflight` from the DeepSeek venv to fail fast before OCR.
- Export these to force the real runtime and avoid silent stub output:
@@ -93,7 +152,7 @@ Use this as the shortest path from a documentation concept to the public call th
 | Stage | Main call | Important parameters | Writes |
 | --- | --- | --- | --- |
-| Download | `Corpus.download(...)` | `input_parquet`, `links_column`, `parallelize_by`, downloader kwargs | `downloads/`, `download_results/*.parquet` |
+| Download | `Corpus.download(...)` | `input_parquet`, `links_column`, `parallelize_by`, `download_mode="standard"|"auto"|"browser"`, `download_policy_file`, downloader kwargs | `downloads/`, `download_results/*.parquet` |
 | Extract (Phase-1) | `Corpus.extract(...)` | `input_format`, `phase1_backend`, `force_ocr`, `use_gpus`, `export_doc_json`, `emit_formula_index` | `markdown/<stem>.md`, `json/<stem>.docling.json(.zst)`, `json/metrics/*.json` |
 | Clean | `Corpus.clean(...)` | `threshold`, `drop_bad`, `empty_char_threshold`, `empty_min_pages` | `clean_markdown/<stem>.md`, updated parquet metrics/flags |
 | OCR / math follow-up | `Corpus.ocr(...)` | `mode`, `fix_bad`, `math_enhance`, `use_gpus`, `devices` | refreshed `markdown/<stem>.md`, optional `json/<stem>.latex_map.jsonl` |

docs/api/corpus.md

Lines changed: 23 additions & 0 deletions
@@ -187,12 +187,35 @@ download(
- Important parameters:
  - `links_column`: override URL column name
  - `parallelize_by`: choose grouping for the scheduler
  - `download_mode`: one of `standard`, `auto`, or `browser`
  - `browser_mode=True`: legacy alias for `download_mode="browser"`
  - `download_policy_file`: route specific domains/URL patterns to `standard`, `auto`, or `browser`
  - downloader kwargs via `**kwargs` for concurrency, SSL, cookies, retries, checkpoints, etc.
- Main outputs:
  - downloaded files in `downloads/`
  - partial/final results in `download_results/`
  - returned `pd.DataFrame` with download status and metadata

Browser-capable download mode is intended for browser-gated file endpoints where a real file still exists behind session/bootstrap checks. It is not a general viewer extractor. Viewer-only sources should still fail cleanly, with a recorded error and no local file artifact.

Example:

```python
corpus.download(
    input_parquet="input_urls.parquet",
    download_mode="browser",
)
```

Policy-routed example:

```python
corpus.download(
    input_parquet="input_urls.parquet",
    download_policy_file="download_policy.yml",
)
```

## triage_math()

- Purpose: summarize per-page metrics and recommend Phase‑2 for math-dense docs.

docs/stages/download.md

Lines changed: 25 additions & 0 deletions
@@ -8,6 +8,7 @@ The download stage acquires source documents from parquet-based URL metadata and
- read URL-bearing parquet input
- download files concurrently
- route known browser-gated sources through browser-assisted acquisition when configured
- retain source metadata context
- avoid refetching previously successful downloads
- assign stable-enough local filenames for downstream processing
@@ -42,10 +43,34 @@ Typical issues include:
- transient network failures
- rate limiting
- browser-gated file endpoints that return HTML challenge/interstitial pages
- viewer-only sources that should fail cleanly instead of being recorded as successful downloads
- duplicate URLs
- filename collisions
- partially completed corpus fetches

## Browser-gated sources

The downloader now distinguishes between:

- direct file endpoints
- browser-gated file endpoints
- viewer-only/document-reader sources

For browser-gated file endpoints:

- `download_mode="auto"` probes with direct HTTP and escalates to a browser session when it detects a recoverable interstitial
- `download_mode="browser"` goes directly to the browser-assisted path
- `download_policy_file=...` can route known domains or URL patterns to the correct path without probing every file

Browser-assisted mode is designed for retrievable file endpoints, not for sources that only expose page images, tiles, HTML/SVG re-rendering, or DRM-wrapped readers.
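The domain routing used by the policy matcher reduces to an exact-or-subdomain hostname check. A condensed, dependency-free sketch of that rule (mirroring the `DownloadPolicyMatch` logic added in this commit):

```python
from urllib.parse import urlparse


def domain_matches(url: str, domains: list[str]) -> bool:
    # A rule domain matches the URL's hostname exactly or as a parent
    # domain, so eur-lex.europa.eu also covers its subdomains.
    hostname = (urlparse(url).hostname or "").lower()
    return any(
        hostname == d or hostname.endswith(f".{d}")
        for d in domains
    )
```

Note that the suffix check requires a leading dot, so a hostname like `evil-eur-lex.europa.eu.example.com` does not match the `eur-lex.europa.eu` rule.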
## Session reuse

Browser-assisted mode reuses cached browser session state per domain, so multiple files from the same protected source do not need a fresh browser bootstrap every time.

This keeps the browser as a session-bootstrap resource rather than the main downloader.

## Contributor note

Any change to filename assignment or result parquet structure can have downstream impact on:

install_glossapi.py

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
from __future__ import annotations

import sys
from pathlib import Path


def _bootstrap_repo_src() -> None:
    repo_root = Path(__file__).resolve().parent
    src_dir = repo_root / "src"
    src_str = str(src_dir)
    if src_str not in sys.path:
        sys.path.insert(0, src_str)


def main() -> int:
    _bootstrap_repo_src()
    from glossapi.scripts.install_glossapi import main as _main

    return int(_main())


if __name__ == "__main__":
    raise SystemExit(main())
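The bootstrap pattern above (prepend `<repo>/src` to `sys.path` exactly once, then import from the package) can be exercised standalone. `demo_root` here is an arbitrary path used only for illustration:

```python
import sys
from pathlib import Path


def bootstrap_src(repo_root: Path) -> str:
    # Same idea as _bootstrap_repo_src: prepend <repo>/src to sys.path,
    # but never add a duplicate entry on repeated calls.
    src_str = str(repo_root / "src")
    if src_str not in sys.path:
        sys.path.insert(0, src_str)
    return src_str


demo_root = Path("/tmp/demo_repo")  # hypothetical repo root
entry = bootstrap_src(demo_root)
bootstrap_src(demo_root)  # second call is a no-op
```

Guarding the insert keeps repeated invocations (e.g. re-running the wizard in one interpreter) from growing `sys.path`.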

pyproject.toml

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,10 @@ classifiers = [
]

[project.optional-dependencies]
# Browser automation fallback for browser-gated file endpoints
browser = [
    "playwright>=1.52,<2",
]
# Docling extraction/layout stack
docling = [
    "docling==2.48.0",

src/glossapi/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -9,6 +9,7 @@
    'Sampler',
    'Section',
    'GlossDownloader',
    'BrowserGlossDownloader',
]

def __getattr__(name: str):
@@ -31,6 +32,9 @@ def __getattr__(name: str):
    if name == 'GlossDownloader':
        from .gloss_downloader import GlossDownloader  # type: ignore
        return GlossDownloader
    if name == 'BrowserGlossDownloader':
        from .gloss_browser_downloader import BrowserGlossDownloader  # type: ignore
        return BrowserGlossDownloader
    raise AttributeError(name)

try:

src/glossapi/corpus/phase_download.py

Lines changed: 28 additions & 7 deletions
@@ -19,6 +19,7 @@
import pandas as pd

from .._naming import canonical_stem
from ..gloss_browser_downloader import BrowserGlossDownloader
from ..gloss_downloader import GlossDownloader
# Avoid importing section/classifier here; download phase does not use them.
from .corpus_skiplist import _SkiplistManager, _resolve_skiplist_path
@@ -212,6 +213,22 @@ def _looks_like_list(s: str) -> bool:
        # Initialize downloader configuration (kwargs take precedence)
        dl_cfg = dict(self.downloader_config)
        dl_cfg.update(kwargs)
        browser_mode = dl_cfg.pop('browser_mode', None)
        if browser_mode is not None and 'download_mode' not in dl_cfg:
            dl_cfg['download_mode'] = 'browser' if browser_mode else 'standard'
        download_mode = str(dl_cfg.pop('download_mode', 'standard')).strip().lower()
        policy_requested = bool(dl_cfg.get('download_policy_file') or dl_cfg.get('download_policy'))
        if download_mode in {'standard', 'default', 'http'} and not policy_requested:
            downloader_cls = GlossDownloader
            default_download_route = 'standard'
        elif download_mode in {'browser', 'browser_protected'} or policy_requested:
            downloader_cls = BrowserGlossDownloader
            default_download_route = 'browser' if download_mode in {'browser', 'browser_protected'} else 'standard'
        elif download_mode in {'auto', 'browser_fallback'}:
            downloader_cls = BrowserGlossDownloader
            default_download_route = 'auto'
        else:
            raise ValueError(f"Unsupported download_mode: {download_mode}")
        # Allow caller to override which column holds links
        if links_column:
            url_column = links_column
@@ -232,14 +249,18 @@ def _looks_like_list(s: str) -> bool:
-        downloader = GlossDownloader(
-            url_column=url_column,
-            output_dir=str(self.output_dir),
-            log_level=self.logger.level,
-            verbose=verbose if verbose is not None else self.verbose,
+        downloader_kwargs = {
+            "url_column": url_column,
+            "output_dir": str(self.output_dir),
+            "log_level": self.logger.level,
+            "verbose": verbose if verbose is not None else self.verbose,
             **{k: v for k, v in dl_cfg.items() if k not in {'input_parquet'}},
-            _used_filename_bases=used_bases
-        )
+            "_used_filename_bases": used_bases,
+        }
+        if downloader_cls is BrowserGlossDownloader:
+            downloader_kwargs["default_download_route"] = default_download_route
+
+        downloader = downloader_cls(**downloader_kwargs)

         # Download files
         self.logger.info(f"Downloading files from URLs in {input_parquet}...")
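The mode strings accepted by the routing in `phase_download.py` follow the same alias normalization as `_normalize_downloader` in `download_policy.py`. A standalone sketch of that mapping:

```python
def normalize_mode(value: str) -> str:
    # Collapse the accepted aliases onto the three canonical routes.
    mode = value.strip().lower()
    aliases = {
        "default": "standard",
        "http": "standard",
        "browser_fallback": "auto",
        "browser_protected": "browser",
    }
    mode = aliases.get(mode, mode)
    if mode not in {"standard", "auto", "browser"}:
        raise ValueError(f"Unsupported download_mode: {value}")
    return mode
```

Keeping one normalization for both the `download_mode` kwarg and policy-file routes means a policy can say `browser_protected` and the kwarg can say `browser` without diverging behavior.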

src/glossapi/download_policy.py

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
"""Policy routing for downloader selection."""

from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Iterable, Optional
from urllib.parse import urlparse

import yaml

VALID_DOWNLOADERS = {"standard", "browser", "auto"}
ROUTE_OPTION_KEYS = {
    "browser_timeout_ms",
    "browser_post_load_wait_ms",
    "browser_engine",
    "browser_headless",
    "browser_session_ttl_seconds",
}


def _normalize_downloader(value: Any, default: str = "standard") -> str:
    normalized = str(value or default).strip().lower()
    if normalized in {"default", "http"}:
        normalized = "standard"
    if normalized in {"browser_fallback"}:
        normalized = "auto"
    if normalized in {"browser_protected"}:
        normalized = "browser"
    if normalized not in VALID_DOWNLOADERS:
        raise ValueError(f"Unsupported downloader route: {value}")
    return normalized


@dataclass(frozen=True)
class DownloadPolicyMatch:
    domains: tuple[str, ...] = ()
    url_regex: Optional[re.Pattern[str]] = None

    def matches(self, url: str) -> bool:
        parsed = urlparse(url)
        hostname = (parsed.hostname or "").lower()
        if self.domains:
            matched_domain = any(
                hostname == domain or hostname.endswith(f".{domain}")
                for domain in self.domains
            )
            if not matched_domain:
                return False
        if self.url_regex and not self.url_regex.search(url):
            return False
        return True


@dataclass(frozen=True)
class DownloadPolicyRule:
    matcher: DownloadPolicyMatch
    downloader: str
    options: Dict[str, Any]

    def matches(self, url: str) -> bool:
        return self.matcher.matches(url)


@dataclass(frozen=True)
class DownloadPolicy:
    default_downloader: str = "standard"
    default_options: Dict[str, Any] | None = None
    rules: tuple[DownloadPolicyRule, ...] = ()

    def resolve(self, url: str) -> tuple[str, Dict[str, Any]]:
        for rule in self.rules:
            if rule.matches(url):
                return rule.downloader, dict(rule.options)
        return self.default_downloader, dict(self.default_options or {})


def _extract_route_options(data: Dict[str, Any]) -> Dict[str, Any]:
    return {key: value for key, value in data.items() if key in ROUTE_OPTION_KEYS}


def _build_matcher(raw: Dict[str, Any]) -> DownloadPolicyMatch:
    domains = tuple(str(item).strip().lower() for item in (raw.get("domains") or []) if str(item).strip())
    url_regex = raw.get("url_regex")
    compiled = re.compile(str(url_regex)) if url_regex else None
    return DownloadPolicyMatch(domains=domains, url_regex=compiled)


def build_download_policy(data: Dict[str, Any]) -> DownloadPolicy:
    default_block = dict(data.get("default") or {})
    default_downloader = _normalize_downloader(default_block.get("downloader"), default="standard")
    default_options = _extract_route_options(default_block)

    rules = []
    for raw_rule in data.get("rules") or []:
        raw_rule = dict(raw_rule or {})
        matcher = _build_matcher(dict(raw_rule.get("match") or {}))
        downloader = _normalize_downloader(raw_rule.get("downloader"), default=default_downloader)
        options = _extract_route_options(raw_rule)
        rules.append(DownloadPolicyRule(matcher=matcher, downloader=downloader, options=options))

    return DownloadPolicy(
        default_downloader=default_downloader,
        default_options=default_options,
        rules=tuple(rules),
    )


def load_download_policy(path: str | Path) -> DownloadPolicy:
    policy_path = Path(path).expanduser().resolve()
    payload = yaml.safe_load(policy_path.read_text(encoding="utf-8")) or {}
    if not isinstance(payload, dict):
        raise ValueError("Download policy file must define a mapping at the top level")
    return build_download_policy(payload)


__all__ = [
    "DownloadPolicy",
    "DownloadPolicyMatch",
    "DownloadPolicyRule",
    "VALID_DOWNLOADERS",
    "build_download_policy",
    "load_download_policy",
]
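The `resolve` semantics above (first matching rule wins, otherwise the default route) can be walked through with a condensed, dependency-free re-implementation; it is re-stated here only so the example runs without the package installed:

```python
import re
from urllib.parse import urlparse


def resolve_route(url: str, rules: list[dict], default: str = "standard") -> str:
    # First rule whose domain and/or regex constraints all pass wins;
    # otherwise fall back to the default downloader.
    hostname = (urlparse(url).hostname or "").lower()
    for rule in rules:
        domains = rule.get("domains", [])
        if domains and not any(hostname == d or hostname.endswith(f".{d}") for d in domains):
            continue
        pattern = rule.get("url_regex")
        if pattern and not re.search(pattern, url):
            continue
        return rule["downloader"]
    return default


rules = [
    {"domains": ["eur-lex.europa.eu"], "downloader": "browser"},
    {"url_regex": r"https://example\.org/protected/.*", "downloader": "auto"},
]
```

As in `DownloadPolicyMatch`, a rule with both `domains` and `url_regex` must satisfy both, and a rule with neither matches every URL.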
