Commit dad04d0

feat: add Web Page Import developer guide (#26)
* feat: add Web Page Import developer guide
* docs: refine Web Page Import guide to focus on goal rather than implementation
1 parent f38da5a commit dad04d0

3 files changed: +101 -8 lines changed
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
---
sidebar_position: 25
toc_min_heading_level: 2
toc_max_heading_level: 5
---

# Web Page Import

The Web Page Import API allows you to transform unstructured web content into structured data within your Knowledge Graph. By providing a URL, the service automatically retrieves the page content, extracts the core information, and represents it as a Knowledge Graph entity.

## Overview

When you import a web page, WordLift performs several key actions:
1. **Retrieval**: Downloads the HTML content of the page using a smart fetching system.
2. **Extraction**: Identifies the main text, headline, and metadata while ignoring noise like sidebars or ads.
3. **Structuring**: Converts the extracted data into RDF, typically following the [Schema.org WebPage](https://schema.org/WebPage) vocabulary.
4. **Enrichment**: Optionally generates vector embeddings for semantic search and AI applications.

## Basic Import

To import a web page with default settings, send a `POST` request to the `/web-page-imports` endpoint.

```sh
curl -X "POST" "https://api.wordlift.io/web-page-imports" \
  -H 'Authorization: Key <YOUR_API_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/some-page"
  }'
```
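
The same request can also be assembled from a script. The following is a minimal Python sketch using only the standard library; the helper name `build_import_request` is hypothetical, and the request is built but not sent, since this guide does not describe the response format.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.wordlift.io/web-page-imports"


def build_import_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a basic import.

    Hypothetical helper for illustration; it mirrors the curl example:
    an `Authorization: Key ...` header and a JSON body with a `url` field.
    """
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Key {api_key}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_import_request("https://example.com/some-page", "<YOUR_API_KEY>")
    print(request.get_method(), request.full_url)
    # To actually send it, pass `request` to urllib.request.urlopen().
```

Keeping request construction separate from sending makes it easy to inspect the headers and body before spending an API call.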

## Advanced Fetching Options

Modern websites often use dynamic rendering or strict bot protections. You can tune how WordLift retrieves the page using the `fetch_options` object.

### Fetching Modes

The `mode` parameter determines the strategy used to download the page:

| Mode | Description |
| :--- | :--- |
| `default` | **Recommended.** Uses a smart fallback system. It attempts a standard fetch and automatically upgrades to an advanced rendering fetch if the target site blocks the initial request. |
| `proxy` | Uses the standard fetching service. |
| `scrapingbee` | Forces the use of an advanced rendering engine that can execute JavaScript and route requests through premium residential networks. |

### Configuration Parameters

When using the advanced rendering engine (either via `default` fallback or forced `scrapingbee` mode), the following options are available:

| Option | Type | Description |
| :--- | :--- | :--- |
| `render_js` | Boolean | Executes JavaScript on the page. Required for Single Page Applications (SPAs) or content hidden behind interactive elements like accordions. |
| `premium_proxy` | Boolean | Uses high-quality residential networks to bypass advanced security shields on enterprise websites. |
| `country_code` | String | Sets the geographical location for the request (e.g., `us`, `de`, `ch`), allowing you to import region-specific content. |
| `wait_for` | String | Instructs the fetcher to wait for a specific CSS selector to appear before capturing the content. |
| `block_ads` | Boolean | Prevents ads from loading, reducing bandwidth and ensuring a cleaner extraction. |

### Example: Importing a Dynamic Swiss Website

```json
{
  "url": "https://www.zurich.ch/de",
  "fetch_options": {
    "mode": "scrapingbee",
    "render_js": true,
    "country_code": "ch",
    "wait_for": ".main-content"
  }
}
```
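
A payload like the one above can be assembled with a small helper that rejects unknown modes and options before the request is made. This is a sketch under stated assumptions: `build_import_payload` is a hypothetical name, and the client-side checks only reflect the modes and option types listed in this guide, not validation that the API is known to perform.

```python
import json

# Modes and option types as listed in the tables of this guide.
ALLOWED_MODES = {"default", "proxy", "scrapingbee"}
OPTION_TYPES = {
    "render_js": bool,
    "premium_proxy": bool,
    "country_code": str,
    "wait_for": str,
    "block_ads": bool,
}


def build_import_payload(url: str, mode: str = "default", **options) -> dict:
    """Return the JSON body for an import with `fetch_options`.

    Hypothetical helper: the checks below are a local convenience and an
    assumption, not documented server behaviour.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    fetch_options = {"mode": mode}
    for name, value in options.items():
        expected = OPTION_TYPES.get(name)
        if expected is None:
            raise ValueError(f"unknown fetch option: {name!r}")
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")
        fetch_options[name] = value
    return {"url": url, "fetch_options": fetch_options}


if __name__ == "__main__":
    payload = build_import_payload(
        "https://www.zurich.ch/de",
        mode="scrapingbee",
        render_js=True,
        country_code="ch",
        wait_for=".main-content",
    )
    print(json.dumps(payload, indent=2))
```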

## Extracted Data

The resulting entity is stored in the Knowledge Graph and includes the following structured properties:
- **Headline**: The primary title of the content.
- **Text**: The full body text extracted from the page.
- **Abstract**: A concise summary of the page content.
- **URL**: The original source URL.
- **Types**: By default, the entity is typed as `http://schema.org/WebPage`.
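
To make the shape of such an entity concrete, here is a hypothetical JSON-LD view built in Python. It assumes the properties listed above map to the Schema.org properties `headline`, `text`, `abstract`, and `url`; the guide does not confirm the exact property IRIs or the serialization WordLift stores, so treat this as an illustration only.

```python
import json


def sketch_web_page_entity(url: str, headline: str, text: str, abstract: str) -> dict:
    """Return a hypothetical JSON-LD representation of an imported page.

    The property names are an assumption based on the Schema.org
    vocabulary; the actual stored entity may differ.
    """
    return {
        "@context": "http://schema.org",
        "@type": "WebPage",  # default type named in this guide
        "url": url,
        "headline": headline,
        "text": text,
        "abstract": abstract,
    }


if __name__ == "__main__":
    entity = sketch_web_page_entity(
        url="https://example.com/some-page",
        headline="Example headline",
        text="Full body text of the page.",
        abstract="A short summary.",
    )
    print(json.dumps(entity, indent=2))
```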

## Troubleshooting

- **Access Denied (403)**: If the standard fetch is blocked, the system usually retries automatically in `default` mode. If it still fails, try forcing `mode: scrapingbee` with `premium_proxy: true`.
- **Incomplete Content**: If the page relies heavily on client-side rendering, ensure `render_js` is set to `true`.
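
The 403 advice above can be sketched as a two-step escalation. In this sketch, `send` is a hypothetical callable that submits a payload and returns the HTTP status code; how the request is actually sent, and what the response contains, are assumptions left to the caller.

```python
from typing import Callable, Tuple


def import_with_escalation(url: str, send: Callable[[dict], int]) -> Tuple[int, dict]:
    """Try a default import first, then escalate if the fetch is denied.

    `send` is a hypothetical function supplied by the caller. Returns the
    last status code together with the payload that produced it.
    """
    attempts = [
        # Step 1: default settings, as in the basic import.
        {"url": url},
        # Step 2: force the advanced engine with premium proxies.
        {"url": url, "fetch_options": {"mode": "scrapingbee", "premium_proxy": True}},
    ]
    status, payload = 0, attempts[0]
    for payload in attempts:
        status = send(payload)
        if status != 403:
            break  # success or a different error: stop escalating
    return status, payload


if __name__ == "__main__":
    # Fake sender for demonstration: denies the first attempt only.
    responses = iter([403, 200])
    status, used = import_with_escalation("https://example.com/some-page", lambda p: next(responses))
    print(status, used)
```

Incomplete content cannot be detected from a status code alone, so enabling `render_js` remains a manual decision after inspecting the extracted text.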

docs/worai/commands/validate.md

Lines changed: 12 additions & 8 deletions
@@ -1,15 +1,19 @@
 # validate
 
-Validate RDF (Turtle or JSON-LD) against SHACL shapes. Required properties surface as errors; recommended properties surface as warnings.
+Validate JSON-LD against SHACL shapes. Required properties surface as errors; recommended properties surface as warnings.
+
+`worai validate` without a subcommand is deprecated. Use `worai validate jsonld` instead. For webpage URLs, use `worai structured-data validate page`.
 
 ## Usage
-- `worai validate [--list-shapes] <file.ttl|file.jsonld|url> [--shape <shape>] [--report-file <path>] [--format pretty|raw] [--color|--no-color]`
+- `worai validate jsonld [--list-shapes] <file.jsonld|url> [--shape <shape>] [--report-file <path>] [--format pretty|raw] [--color|--no-color]`
+- `worai structured-data validate page [--list-shapes] <url> [--shape <shape>] [--report-file <path>] [--format pretty|raw] [--color|--no-color]`
 
 ## Examples
-- `worai validate ./data.jsonld`
-- `worai validate ./data.jsonld --shape review-snippet`
-- `worai validate ./data.jsonld --shape ./custom-shape.ttl --report-file ./report.txt`
-- `worai validate https://api.wordlift.io/data/example.jsonld --shape review-snippet`
-- `worai validate ./data.jsonld --format raw`
-- `worai validate ./data.jsonld --format pretty --no-color`
+- `worai validate jsonld ./data.jsonld`
+- `worai validate jsonld ./data.jsonld --shape review-snippet`
+- `worai validate jsonld ./data.jsonld --shape ./custom-shape.ttl --report-file ./report.txt`
+- `worai validate jsonld https://api.wordlift.io/data/example.jsonld --shape review-snippet`
+- `worai validate jsonld ./data.jsonld --format raw`
+- `worai validate jsonld ./data.jsonld --format pretty --no-color`
 - `worai validate --list-shapes`
+- `worai structured-data validate page https://example.com/article --shape review-snippet`

sidebars.js

Lines changed: 5 additions & 0 deletions
@@ -37,6 +37,11 @@ const sidebars = {
   id: "knowledge-graph/sitemap-import",
   label: "Sitemap Import"
 },
+{
+  type: "doc",
+  id: "knowledge-graph/web-page-import",
+  label: "Web Page Import"
+},
 {
   type: "doc",
   id: "knowledge-graph/analytics-api",
