| 1 | +--- |
| 2 | +sidebar_position: 25 |
| 3 | +toc_min_heading_level: 2 |
| 4 | +toc_max_heading_level: 5 |
| 5 | +--- |
| 6 | + |
| 7 | +# Web Page Import |
| 8 | + |
| 9 | +The Web Page Import API allows you to transform unstructured web content into structured data within your Knowledge Graph. By providing a URL, the service automatically retrieves the page content, extracts the core information, and represents it as a Knowledge Graph entity. |
| 10 | + |
| 11 | +## Overview |
| 12 | + |
| 13 | +When you import a web page, WordLift performs several key actions: |
| 14 | +1. **Retrieval**: Downloads the HTML content of the page using a smart fetching system. |
| 15 | +2. **Extraction**: Identifies the main text, headline, and metadata while ignoring noise like sidebars or ads. |
| 16 | +3. **Structuring**: Converts the extracted data into RDF, typically following the [Schema.org WebPage](https://schema.org/WebPage) vocabulary. |
| 17 | +4. **Enrichment**: Optionally generates vector embeddings for semantic search and AI applications. |
| 18 | + |
| 19 | +## Basic Import |
| 20 | + |
| 21 | +To import a web page with default settings, send a `POST` request to the `/web-page-imports` endpoint. |
| 22 | + |
| 23 | +```sh |
| 24 | +curl -X "POST" "https://api.wordlift.io/web-page-imports" \ |
| 25 | + -H 'Authorization: Key <YOUR_API_KEY>' \ |
| 26 | + -H 'Content-Type: application/json' \ |
| 27 | + -d '{ |
| 28 | + "url": "https://example.com/some-page" |
| 29 | +}' |
| 30 | +``` |
| 31 | + |
| 32 | +## Advanced Fetching Options |
| 33 | + |
| 34 | +Modern websites often use dynamic rendering or strict bot protections. You can tune how WordLift retrieves the page using the `fetch_options` object. |
| 35 | + |
| 36 | +### Fetching Modes |
| 37 | + |
| 38 | +The `mode` parameter determines the strategy used to download the page: |
| 39 | + |
| 40 | +| Mode | Description | |
| 41 | +| :--- | :--- | |
| 42 | +| `default` | **Recommended.** Uses a smart fallback system. It attempts a standard fetch and automatically upgrades to an advanced rendering fetch if the target site blocks the initial request. | |
| 43 | +| `proxy` | Uses the standard fetching service. | |
| 44 | +| `scrapingbee` | Forces the use of an advanced rendering engine capable of handling JavaScript and premium residential networks. | |
| 45 | + |
| 46 | +### Configuration Parameters |
| 47 | + |
| 48 | +When using the advanced rendering engine (either via `default` fallback or forced `scrapingbee` mode), the following options are available: |
| 49 | + |
| 50 | +| Option | Type | Description | |
| 51 | +| :--- | :--- | :--- | |
| 52 | +| `render_js` | Boolean | Executes JavaScript on the page. Required for Single Page Applications (SPAs) or content hidden behind interactive elements like accordions. | |
| 53 | +| `premium_proxy` | Boolean | Uses high-quality residential networks to bypass advanced security shields on enterprise websites. | |
| 54 | +| `country_code` | String | Sets the geographical location for the request (e.g., `us`, `de`, `ch`), allowing you to import region-specific content. | |
| 55 | +| `wait_for` | String | Instructs the fetcher to wait for a specific CSS selector to appear before capturing the content. | |
| 56 | +| `block_ads` | Boolean | Prevents ads from loading, reducing bandwidth and ensuring a cleaner extraction. | |
| 57 | + |
| 58 | +### Example: Importing a Dynamic Swiss Website |
| 59 | + |
| 60 | +```json |
| 61 | +{ |
| 62 | + "url": "https://www.zurich.ch/de", |
| 63 | + "fetch_options": { |
| 64 | + "mode": "scrapingbee", |
| 65 | + "render_js": true, |
| 66 | + "country_code": "ch", |
| 67 | + "wait_for": ".main-content" |
| 68 | + } |
| 69 | +} |
| 70 | +``` |
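The JSON body above can be sent with the same `curl` call used for the basic import. The following is a minimal sketch, not an official client: the helper names `build_payload` and `import_page` and the `WORDLIFT_KEY` environment variable are assumptions made for illustration, and the arguments are assumed not to contain characters that need JSON escaping.

```sh
# Hypothetical helper: prints a request body that forces the advanced
# rendering engine with JavaScript execution.
# $1 = page URL, $2 = country code, $3 = CSS selector to wait for
build_payload() {
  printf '{"url":"%s","fetch_options":{"mode":"scrapingbee","render_js":true,"country_code":"%s","wait_for":"%s"}}\n' "$1" "$2" "$3"
}

# Hypothetical helper: POSTs that body to the /web-page-imports endpoint.
# Reads the API key from WORDLIFT_KEY (an assumed variable name).
import_page() {
  curl -X POST "https://api.wordlift.io/web-page-imports" \
    -H "Authorization: Key ${WORDLIFT_KEY}" \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$@")"
}
```

With the key exported, `import_page "https://www.zurich.ch/de" ch ".main-content"` would submit the example shown above. Keeping the payload builder separate makes it easy to inspect the body before sending it.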
| 71 | + |
| 72 | +## Extracted Data |
| 73 | + |
| 74 | +The resulting entity is stored in the Knowledge Graph and includes the following structured properties: |
| 75 | +- **Headline**: The primary title of the content. |
| 76 | +- **Text**: The full body text extracted from the page. |
| 77 | +- **Abstract**: A concise summary of the page content. |
| 78 | +- **URL**: The original source URL. |
| 79 | +- **Types**: By default, the entity is typed as `http://schema.org/WebPage`. |
| 80 | + |
| 81 | +## Troubleshooting |
| 82 | + |
| 83 | +- **Access Denied (403)**: If the standard fetch is blocked, the system usually retries automatically in `default` mode. If it still fails, try forcing `mode: scrapingbee` with `premium_proxy: true`. |
| 84 | +- **Incomplete Content**: If the page relies heavily on client-side rendering, ensure `render_js` is set to `true`. |
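The escalation described for blocked requests could be scripted roughly as follows. This is a sketch under stated assumptions: it assumes the API itself responds with HTTP status 403 when the target site blocks the fetch, which this page does not confirm, and the helper names and the `WORDLIFT_KEY` variable are hypothetical.

```sh
# Hypothetical helper: picks the request body for a given attempt.
# $1 = page URL, $2 = attempt number (1 = default mode, 2 = forced advanced fetch)
payload_for_attempt() {
  if [ "$2" -le 1 ]; then
    printf '{"url":"%s","fetch_options":{"mode":"default"}}\n' "$1"
  else
    printf '{"url":"%s","fetch_options":{"mode":"scrapingbee","premium_proxy":true,"render_js":true}}\n' "$1"
  fi
}

# Hypothetical helper: tries default mode first, then retries once with the
# forced advanced fetch if the response status is 403 (assumed behavior).
# Prints the final HTTP status code.
import_with_escalation() {
  attempt=1
  while [ "$attempt" -le 2 ]; do
    status=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
      "https://api.wordlift.io/web-page-imports" \
      -H "Authorization: Key ${WORDLIFT_KEY}" \
      -H 'Content-Type: application/json' \
      -d "$(payload_for_attempt "$1" "$attempt")")
    # Stop as soon as the request is not rejected with 403.
    [ "$status" != "403" ] && break
    attempt=$((attempt + 1))
  done
  echo "$status"
}
```

Calling `import_with_escalation "https://example.com/some-page"` prints the last status code received, so a script can decide whether to report the failure or continue.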