Commit 525d04a

update docs
1 parent a93ffe5 commit 525d04a

2 files changed, with 62 additions and 6 deletions

docs/worai/commands/graph.md

Lines changed: 35 additions & 5 deletions
@@ -7,10 +7,20 @@ title: graph
 Run graph-specific workflows.
 
 ## Usage
-- `worai graph sync --profile <name> [--debug]`
-- `worai --config <path> graph sync --profile <name> [--debug]`
+- `worai graph sync run --profile <name> [--debug]`
+- `worai --config <path> graph sync run --profile <name> [--debug]`
+- `worai graph sync create <destination> [--template <src>] [--defaults] [--data-file <path>] [--vcs-ref <ref>] [--non-interactive] [--force]`
+- `worai graph property delete <predicate> [--dry-run] [--yes] [--workers <n>] [--retries <n>] [--rate-delay <s>] [--limit <n>]`
 
 ## Notes
+- `graph sync run` executes the graph sync workflow for a profile in `worai.toml`.
+- `graph sync create` bootstraps a new graph sync project from a Copier template.
+- `graph sync create` enables Copier trusted mode by default, so template `_tasks` run automatically.
+- `graph property delete` removes one predicate from all matching entities.
+  - accepts full IRI (`https://w3id.org/seovoc/html`) or CURIE (`seovoc:html`).
+  - includes private fields by default (`X-include-Private: true`) for both matching-entity discovery and deletion PATCH requests.
+  - `--dry-run` reports matching entities without patching.
+  - without `--yes`, the command asks for confirmation before patching.
 - Supported input sources:
   - `urls` (explicit URL list)
   - `sitemap_url` (+ optional `sitemap_url_pattern`)
@@ -24,6 +34,23 @@ Run graph-specific workflows.
   - profile value overrides global value.
   - default is `false` when unset.
   - mapped to SDK setting `GOOGLE_SEARCH_CONSOLE`.
+- SDK 5 ingestion settings are forwarded when configured:
+  - `ingest.source`: `auto|urls|sitemap|sheets|local`
+  - `ingest.loader`: `auto|simple|proxy|playwright|premium_scraper|web_scrape_api|passthrough`
+  - `ingest.passthrough_when_html`: default `true`
+- Loader default behavior is `web_scrape_api`.
+  - legacy `web_page_import_mode=default` maps to `web_scrape_api`
+  - legacy `proxy` and `premium_scraper` keep the same value
+Example ingestion config:
+```toml
+[profile.acme]
+api_key = "wl_..."
+sitemap_url = "https://example.com/sitemap.xml"
+ingest.source = "sitemap"
+ingest.loader = "web_scrape_api"
+ingest.passthrough_when_html = true
+web_page_import_timeout = "60s"
+```
 - `sheets_service_account` is required only when using Google Sheets source (`sheets_url` + `sheets_name`).
 - Failure cases for `sheets_service_account` (Sheets source only):
   - value is missing or empty
@@ -41,6 +68,9 @@ sheets_service_account = "./service-account.json"
 ```
 
 ## Examples
-- `worai graph sync --profile acme`
-- `worai --config ./worai.toml graph sync --profile acme`
-- `worai graph sync --profile acme --debug`
+- `worai graph sync run --profile acme`
+- `worai --config ./worai.toml graph sync run --profile acme`
+- `worai graph sync run --profile acme --debug`
+- `worai graph sync create ./acme-graph`
+- `worai graph property delete seovoc:html --dry-run`
+- `worai graph property delete https://w3id.org/seovoc/html --yes --workers 4`
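
The legacy loader mapping described in the `graph.md` notes above can be read as a small lookup table. The following is a minimal Python sketch of that reading, not worai's actual implementation: the helper name `loader_from_legacy`, the constant names, and the choice to raise `ValueError` on unrecognized values are all assumptions made for illustration.

```python
from typing import Optional

# Default loader, per the note "Loader default behavior is `web_scrape_api`".
DEFAULT_LOADER = "web_scrape_api"

# Legacy `web_page_import_mode` values and the `ingest.loader` value each maps to,
# as listed in the notes of this diff.
LEGACY_MODE_TO_LOADER = {
    "default": "web_scrape_api",
    "proxy": "proxy",
    "premium_scraper": "premium_scraper",
}


def loader_from_legacy(mode: Optional[str]) -> str:
    """Translate a legacy import mode into a loader name (hypothetical helper)."""
    if not mode:
        # Nothing configured: fall back to the documented default loader.
        return DEFAULT_LOADER
    if mode not in LEGACY_MODE_TO_LOADER:
        # Assumption: unknown legacy values are rejected rather than passed through.
        raise ValueError(f"unsupported legacy web_page_import_mode: {mode!r}")
    return LEGACY_MODE_TO_LOADER[mode]
```

Under these assumptions, `loader_from_legacy("default")` yields `web_scrape_api`, while `proxy` and `premium_scraper` come back unchanged.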

docs/worai/commands/structured-data.md

Lines changed: 27 additions & 1 deletion
@@ -113,6 +113,9 @@ Parse all URLs from a sitemap, extract JSON-LD from each page, and export a stru
 | `--timeout` | float | `30.0` | HTTP timeout in seconds for sitemap and page fetches. |
 | `--concurrency` | string | `auto` | Worker count or `auto` to adapt to fetch/parse responses. |
 | `--source-type` | string | none | Optional source parser override (e.g., `debug-cloud`). |
+| `--ingest-source` | string | none | SDK 5 source axis: `auto`, `urls`, `sitemap`, `sheets`, `local`. |
+| `--ingest-loader` | string | none | SDK 5 loader axis: `auto`, `simple`, `proxy`, `playwright`, `premium_scraper`, `web_scrape_api`, `passthrough`. |
+| `--ingest-passthrough-when-html / --no-ingest-passthrough-when-html` | bool | config/default | Prefer passthrough when source records include embedded HTML. |
 
 ### Output columns
 - `url`
@@ -128,11 +131,32 @@ Parse all URLs from a sitemap, extract JSON-LD from each page, and export a stru
 - Fetches page content with Playwright using the shared worai default User-Agent.
 - Shows a progress bar while processing source URLs.
 - Supports adaptive concurrency via `--concurrency auto`.
+- Ingestion precedence:
+  - new ingest settings (`--ingest-*` or `ingest.*` config) win over legacy when both are set
+  - legacy remains supported when new is unset
+  - disagreements emit a structured warning event
+- Loader defaults:
+  - default and `auto` loader resolve to `web_scrape_api`
+  - passthrough takes precedence when embedded HTML exists and passthrough-when-html is enabled
+- `worai.toml` examples:
+```toml
+[ingest]
+source = "auto"
+loader = "web_scrape_api"
+passthrough_when_html = true
+```
+
+```toml
+[profile.inventory_local]
+ingest.source = "local"
+ingest.loader = "passthrough"
+ingest.passthrough_when_html = true
+```
 - Local URL list file support:
   - `.txt`: one URL per line
   - `.csv`: requires `url` column
 - When using a Google Spreadsheet as source, `--sheet-name` is required.
-- `--source-type debug-cloud` supports `.ttl` debug artifacts by reading:
+- `--source-type debug-cloud` (legacy alias of `--ingest-source local`) supports `.ttl` debug artifacts by reading:
   - URL from `http://schema.org/url`
   - HTML from `https://w3id.org/seovoc/html`
 
@@ -143,3 +167,5 @@ Parse all URLs from a sitemap, extract JSON-LD from each page, and export a stru
 - `worai structured-data inventory https://example.com/sitemap.xml --destination-sheet-id 1AbCdEfGhIjKlMnOp --destination-sheet-name Inventory`
 - `worai structured-data inventory https://example.com/sitemap.xml --output ./structured-data-inventory.csv --concurrency auto`
 - `worai structured-data inventory /path/to/debug_cloud/us --source-type debug-cloud --output ./structured-data-inventory.csv`
+- `worai structured-data inventory /path/to/debug_cloud/us --ingest-source local --ingest-loader passthrough --output ./structured-data-inventory.csv`
+- `worai structured-data inventory https://example.com/sitemap.xml --ingest-loader web_scrape_api --output ./structured-data-inventory.csv`
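
The ingestion precedence and loader default rules added to `structured-data.md` above can be sketched as one resolution function. This is an illustrative Python sketch under stated assumptions, not the code worai or the SDK runs: `resolve_loader` and its parameters are hypothetical names, the legacy value is assumed to be already translated to a loader name, a "disagreement" is assumed to mean the new and legacy values differ, and passthrough is assumed to override any selected loader when embedded HTML is present.

```python
from typing import List, Optional, Tuple

DEFAULT_LOADER = "web_scrape_api"


def resolve_loader(
    new_loader: Optional[str],
    legacy_loader: Optional[str],
    has_embedded_html: bool,
    passthrough_when_html: bool = True,
) -> Tuple[str, List[str]]:
    """Return (effective loader, warnings) following the documented precedence."""
    warnings: List[str] = []

    # New ingest settings win over legacy when both are set;
    # legacy is used only when the new setting is unset.
    chosen = new_loader if new_loader else legacy_loader

    # Assumption: a disagreement is any difference between the two configured values.
    if new_loader and legacy_loader and new_loader != legacy_loader:
        warnings.append(
            f"ingest.loader={new_loader!r} overrides legacy value {legacy_loader!r}"
        )

    # An unset loader and `auto` both resolve to the default loader.
    if not chosen or chosen == "auto":
        chosen = DEFAULT_LOADER

    # Passthrough takes precedence when the record carries embedded HTML
    # and passthrough-when-html is enabled.
    if has_embedded_html and passthrough_when_html:
        chosen = "passthrough"

    return chosen, warnings
```

For instance, `resolve_loader("playwright", "proxy", False)` selects `playwright` and records one warning, while `resolve_loader(None, "proxy", False)` falls back to the legacy `proxy` value with no warning.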
