-Pydoll automates Chromium-based browsers (Chrome, Edge) by connecting directly to the Chrome DevTools Protocol over WebSocket. No WebDriver binary, no `navigator.webdriver` flag, no compatibility issues.
+Pydoll automates Chromium-based browsers (Chrome, Edge) by connecting directly to the Chrome DevTools Protocol over WebSocket. **No WebDriver binary, no `navigator.webdriver` flag, no compatibility issues.**
-It combines a high-level API for common tasks with low-level CDP access for fine-grained control over network, fingerprinting, and browser behavior. The entire codebase is async-native and fully type-checked with mypy.
+It combines a high-level API for stealthy automation with low-level CDP access for fine-grained control over network, fingerprinting, and browser behavior. Its new **Pydantic-powered extraction engine** maps the DOM directly to structured, validated Python objects.
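"Connecting directly to CDP" means exchanging JSON messages over the browser's DevTools WebSocket endpoint. A minimal sketch of what one such command looks like on the wire (illustrative only; Pydoll assembles and dispatches these for you):

```python
import json

# Every CDP command is a JSON object with an id (used to match the
# browser's reply), a method name such as 'Page.navigate', and params.
def cdp_command(cmd_id: int, method: str, **params) -> str:
    return json.dumps({'id': cmd_id, 'method': method, 'params': params})

msg = cdp_command(1, 'Page.navigate', url='https://example.com')
```

The browser answers with a JSON message carrying the same `id`, which is how responses are paired with the commands that triggered them.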
### Top Sponsors
@@ -48,11 +48,11 @@ It combines a high-level API for common tasks with low-level CDP access for fine
### Why Pydoll
-- **Stealth-first**: Human-like mouse movement, realistic typing, and granular [browser preference](https://pydoll.tech/docs/features/configuration/browser-preferences/) control for fingerprint management.
+- **Structured extraction**: Define a [Pydantic](https://docs.pydantic.dev/) model, call `tab.extract()`, get typed and validated data back. No manual element-by-element querying.
- **Async and typed**: Built on `asyncio` from the ground up, 100% type-checked with `mypy`. Full IDE autocompletion and static error checking.
+- **Stealth built in**: Human-like mouse movement, realistic typing, and granular [browser preference](https://pydoll.tech/docs/features/configuration/browser-preferences/) control for fingerprint management.
- **Network control**: [Intercept](https://pydoll.tech/docs/features/network/interception/) requests to block ads/trackers, [monitor](https://pydoll.tech/docs/features/network/monitoring/) traffic for API discovery, and make [authenticated HTTP requests](https://pydoll.tech/docs/features/network/http-requests/) that inherit the browser session.
- **Shadow DOM and iframes**: Full support for [shadow roots](https://pydoll.tech/docs/deep-dive/architecture/shadow-dom/) (including closed) and cross-origin iframes. Discover, query, and interact with elements inside them using the same API.
-- **Ergonomic API**: `tab.find()` for most cases, `tab.query()` for complex [CSS/XPath selectors](https://pydoll.tech/docs/deep-dive/guides/selectors-guide/).
## Installation
@@ -62,55 +62,124 @@ pip install pydoll-python
No WebDriver binaries or external dependencies required.
-## What's New
+## Getting Started
-
-HAR Network Recording
-
+### 1. Stateful Automation & Evasion
-Record network activity during a browser session and export as HAR 1.2. Replay recorded requests to reproduce exact API sequences.
+When you need to navigate, bypass challenges, or interact with dynamic UI, Pydoll's imperative API handles it with humanized timing by default.
```python
-from pydoll.browser.chromium import Chrome
+import asyncio
+from pydoll.browser import Chrome
+from pydoll.constants import Key
-async with Chrome() as browser:
- tab = await browser.start()
+async def google_search(query: str):
+ async with Chrome() as browser:
+ tab = await browser.start()
+ await tab.go_to('https://www.google.com')
- async with tab.request.record() as capture:
- await tab.go_to('https://example.com')
+ # Find elements and interact with human-like timing
+ search_box = await tab.find(tag_name='textarea', name='q')
+ await search_box.insert_text(query)
+ await tab.keyboard.press(Key.ENTER)
- capture.save('flow.har')
- print(f'Captured {len(capture.entries)} requests')
+ first_result = await tab.find(
+ tag_name='h3',
+ text='autoscrape-labs/pydoll',
+ timeout=10,
+ )
+ await first_result.click()
+ print(f"Page loaded: {await tab.title}")
- responses = await tab.request.replay('flow.har')
+asyncio.run(google_search('pydoll site:github.com'))
```
-Filter by resource type:
+### 2. Structured Data Extraction
+
+Once you reach the target page, switch to the declarative engine. Define what you want with a model, and Pydoll extracts it — typed, validated, and ready to use.
```python
-from pydoll.protocol.network.types import ResourceType
+from pydoll.browser.chromium import Chrome
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='The quote text')
+ author: str = Field(selector='.author', description='Who said it')
+ tags: list[str] = Field(selector='.tag', description='Tags')
+ year: int | None = Field(selector='.year', description='Year', default=None)
-async with tab.request.record(
- resource_types=[ResourceType.FETCH, ResourceType.XHR]
-) as capture:
- await tab.go_to('https://example.com')
+async def extract_quotes():
+ async with Chrome() as browser:
+ tab = await browser.start()
+ await tab.go_to('https://quotes.toscrape.com')
+
+ quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
+
+ for q in quotes:
+ print(f'{q.author}: {q.text}') # fully typed, IDE autocomplete works
+ print(q.tags) # list[str], not a raw element
+ print(q.model_dump_json()) # pydantic serialization built-in
+
+asyncio.run(extract_quotes())
```
-[HAR Recording Docs](https://pydoll.tech/docs/features/network/network-recording/)
+Models support CSS/XPath auto-detection, HTML attribute targeting, custom transforms, and nested models.
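The CSS/XPath auto-detection can be pictured with a heuristic like the following (an illustrative sketch of the idea, not Pydoll's actual implementation):

```python
def looks_like_xpath(selector: str) -> bool:
    # XPath expressions typically begin with '/', './', or '(',
    # while CSS selectors start with a tag name, '.', '#', or '['.
    s = selector.strip()
    return s.startswith(('/', './', '('))

looks_like_xpath('//div[@class="quote"]')  # True: treated as XPath
looks_like_xpath('.quote')                 # False: treated as CSS
```

In practice this means a field can take either selector style and the engine picks the right query mechanism without an explicit flag.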
+
+
+Nested models, transforms, and attribute extraction
+
+
+```python
+from datetime import datetime
+from pydoll.extractor import ExtractionModel, Field
+
+def parse_date(raw: str) -> datetime:
+ return datetime.strptime(raw.strip(), '%B %d, %Y')
+
+class Author(ExtractionModel):
+ name: str = Field(selector='.author-title')
+ born: datetime = Field(
+ selector='.author-born-date',
+ transform=parse_date,
+ )
+
+class Article(ExtractionModel):
+ title: str = Field(selector='h1')
+ url: str = Field(selector='.source-link', attribute='href')
+ author: Author = Field(selector='.author-card', description='Nested model')
+
+article = await tab.extract(Article, timeout=5)
+article.author.born.year # int — types are preserved all the way down
+```
+## Features
+
-Page Bundles
+Humanized Mouse Movement
-Save the current page and all its assets (CSS, JS, images, fonts) as a `.zip` bundle for offline viewing. Optionally inline everything into a single HTML file.
+Mouse operations produce human-like cursor movement by default:
+
+- **Bezier curve paths** with asymmetric control points
+- **Fitts's Law timing**: duration scales with distance
+- **Minimum-jerk velocity**: bell-shaped speed profile
+- **Physiological tremor**: Gaussian noise scaled with velocity
+- **Overshoot correction**: ~70% chance on fast movements, then corrects back
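The timing and velocity profiles above can be sketched in a few lines (an illustrative model of the math; the helper names are not Pydoll internals):

```python
import math

def minimum_jerk(t: float, duration: float, start: float, end: float) -> float:
    # The 10t^3 - 15t^4 + 6t^5 polynomial has zero velocity and
    # acceleration at both endpoints, producing the bell-shaped speed
    # profile of natural human reaches.
    tau = min(max(t / duration, 0.0), 1.0)  # normalized time in [0, 1]
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return start + (end - start) * s

def fitts_duration(distance: float, width: float = 16.0,
                   a: float = 0.1, b: float = 0.15) -> float:
    # Fitts's Law: movement time grows with log2(D/W + 1),
    # so farther targets take longer to reach.
    return a + b * math.log2(distance / width + 1)

minimum_jerk(0.5, 1.0, 0, 100)  # 50.0: by symmetry, halfway in time is halfway in space
```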
```python
-await tab.save_bundle('page.zip')
-await tab.save_bundle('page-inline.zip', inline_assets=True)
+await tab.mouse.move(500, 300)
+await tab.mouse.click(500, 300)
+await tab.mouse.drag(100, 200, 500, 400)
+
+button = await tab.find(id='submit')
+await button.click()
+
+# Opt out when speed matters
+await tab.mouse.click(500, 300, humanize=False)
```
-[Screenshots, PDFs & Bundles Docs](https://pydoll.tech/docs/features/automation/screenshots-and-pdfs/)
+[Mouse Control Docs](https://pydoll.tech/docs/features/automation/mouse-control/)
@@ -139,75 +208,46 @@ Highlights:
- `deep=True` traverses cross-origin iframes (OOPIFs)
- Standard `find()`, `query()`, `click()` API inside shadow roots
-```python
-# Cloudflare Turnstile inside a cross-origin iframe
-shadow_roots = await tab.find_shadow_roots(deep=True, timeout=10)
-for sr in shadow_roots:
- checkbox = await sr.query('input[type="checkbox"]', raise_exc=False)
- if checkbox:
- await checkbox.click()
-```
-
[Shadow DOM Docs](https://pydoll.tech/docs/deep-dive/architecture/shadow-dom/)
-Humanized Mouse Movement
+HAR Network Recording
-Mouse operations produce human-like cursor movement by default:
-
-- **Bezier curve paths** with asymmetric control points
-- **Fitts's Law timing**: duration scales with distance
-- **Minimum-jerk velocity**: bell-shaped speed profile
-- **Physiological tremor**: Gaussian noise scaled with velocity
-- **Overshoot correction**: ~70% chance on fast movements, then corrects back
+Record network activity during a browser session and export as HAR 1.2. Replay recorded requests to reproduce exact API sequences.
```python
-await tab.mouse.move(500, 300)
-await tab.mouse.click(500, 300)
-await tab.mouse.drag(100, 200, 500, 400)
-
-button = await tab.find(id='submit')
-await button.click()
-
-# Opt out when speed matters
-await tab.mouse.click(500, 300, humanize=False)
-```
+from pydoll.browser.chromium import Chrome
-[Mouse Control Docs](https://pydoll.tech/docs/features/automation/mouse-control/)
-
+async with Chrome() as browser:
+ tab = await browser.start()
-## Getting Started
+ async with tab.request.record() as capture:
+ await tab.go_to('https://example.com')
-```python
-import asyncio
-from pydoll.browser import Chrome
-from pydoll.constants import Key
+ capture.save('flow.har')
+ print(f'Captured {len(capture.entries)} requests')
-async def google_search(query: str):
- async with Chrome() as browser:
- tab = await browser.start()
- await tab.go_to('https://www.google.com')
+ responses = await tab.request.replay('flow.har')
+```
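For orientation, the exported file is plain JSON following the HAR 1.2 shape: a top-level `log` with a `version`, a `creator`, and one entry per captured request. An abridged sketch (real exports carry many more fields per entry):

```python
import json

# Abridged HAR 1.2 skeleton; real entries also include headers,
# timings, cache info, and body metadata.
har = {
    'log': {
        'version': '1.2',
        'creator': {'name': 'example-recorder', 'version': '0.1'},
        'entries': [
            {
                'startedDateTime': '2026-01-01T00:00:00.000Z',
                'time': 42.0,
                'request': {'method': 'GET', 'url': 'https://example.com/'},
                'response': {'status': 200, 'statusText': 'OK'},
            }
        ],
    }
}

serialized = json.dumps(har, indent=2)
```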
- search_box = await tab.find(tag_name='textarea', name='q')
- await search_box.insert_text(query)
- await tab.keyboard.press(Key.ENTER)
+[HAR Recording Docs](https://pydoll.tech/docs/features/network/network-recording/)
+
- first_result = await tab.find(
- tag_name='h3',
- text='autoscrape-labs/pydoll',
- timeout=10,
- )
- await first_result.click()
+
+Page Bundles
+
- await tab.find(id='repository-container-header', timeout=10)
- print(f"Page loaded: {await tab.title}")
+Save the current page and all its assets (CSS, JS, images, fonts) as a `.zip` bundle for offline viewing. Optionally inline everything into a single HTML file.
-asyncio.run(google_search('pydoll site:github.com'))
+```python
+await tab.save_bundle('page.zip')
+await tab.save_bundle('page-inline.zip', inline_assets=True)
```
-## Features
+[Screenshots, PDFs & Bundles Docs](https://pydoll.tech/docs/features/automation/screenshots-and-pdfs/)
+Hybrid Automation (UI + API)
From af7cc3cf3d80c8bb2eb3852a5bfd9867b174e33b Mon Sep 17 00:00:00 2001
From: Thalison Fernandes
Date: Sun, 22 Mar 2026 17:54:47 -0300
Subject: [PATCH 17/21] docs: update landing pages with extractor examples in
all languages
---
docs/en/index.md | 177 ++++++++++++++++++++++++++--------------------
docs/pt/index.md | 164 ++++++++++++++++++++++++------------------
docs/zh/index.md | 180 +++++++++++++++++++++++++++--------------------
3 files changed, 301 insertions(+), 220 deletions(-)
diff --git a/docs/en/index.md b/docs/en/index.md
index dd8e94f4..18c20454 100644
--- a/docs/en/index.md
+++ b/docs/en/index.md
@@ -50,6 +50,7 @@ $ pip install git+https://github.com/autoscrape-labs/pydoll.git
- **Powerful Network Monitoring**: Intercept, modify, and analyze all network traffic with ease, giving you complete control over requests.
- **Event-Driven Architecture**: React to page events, network requests, and user interactions in real-time.
- **Intuitive Element Finding**: Modern `find()` and `query()` methods that make sense and work as you'd expect.
+- **Structured Extraction**: Define a [Pydantic](https://docs.pydantic.dev/) model, call `tab.extract()`, get typed and validated data back. No manual element-by-element querying.
- **Robust Type Safety**: Comprehensive type system for better IDE support and error prevention.
@@ -57,9 +58,11 @@ Ready to dive in? The following pages will guide you through installation, basic
Let's start automating the web, the right way! 🚀
-## Quick Start Guide: A simple example
+## Quick Start Guide
-Let's start with a practical example. The following script will open the Pydoll GitHub repository and star it:
+### 1. Stateful Automation & Evasion
+
+When you need to navigate, bypass challenges, or interact with dynamic UI, Pydoll's imperative API handles it with humanized timing by default.
```python
import asyncio
@@ -69,7 +72,8 @@ async def main():
async with Chrome() as browser:
tab = await browser.start()
await tab.go_to('https://github.com/autoscrape-labs/pydoll')
-
+
+ # Find elements and interact with human-like timing
star_button = await tab.find(
tag_name='button',
timeout=5,
@@ -85,101 +89,123 @@ async def main():
asyncio.run(main())
```
-This example demonstrates how to navigate to a website, wait for an element to appear, and interact with it. You can adapt this pattern to automate many different web tasks.
+### 2. Structured Data Extraction
-??? note "Or use without context manager..."
- If you prefer not to use the context manager pattern, you can manually manage the browser instance:
-
- ```python
- import asyncio
- from pydoll.browser.chromium import Chrome
-
- async def main():
- browser = Chrome()
+Once you reach the target page, switch to the declarative engine. Define what you want with a model, and Pydoll extracts it — typed, validated, and ready to use.
+
+```python
+import asyncio
+from pydoll.browser.chromium import Chrome
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='The quote text')
+ author: str = Field(selector='.author', description='Who said it')
+ tags: list[str] = Field(selector='.tag', description='Tags')
+ year: int | None = Field(selector='.year', description='Year', default=None)
+
+async def extract_quotes():
+ async with Chrome() as browser:
tab = await browser.start()
- await tab.go_to('https://github.com/autoscrape-labs/pydoll')
-
- star_button = await tab.find(
- tag_name='button',
- timeout=5,
- raise_exc=False
- )
- if not star_button:
- print("Ops! The button was not found.")
- return
+ await tab.go_to('https://quotes.toscrape.com')
- await star_button.click()
- await asyncio.sleep(3)
- await browser.stop()
-
- asyncio.run(main())
- ```
-
- Note that when not using the context manager, you'll need to explicitly call `browser.stop()` to release resources.
+ quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
+
+ for q in quotes:
+ print(f'{q.author}: {q.text}') # fully typed, IDE autocomplete works
+ print(q.tags) # list[str], not a raw element
+ print(q.model_dump_json()) # pydantic serialization built-in
-## Extended Example: Custom Browser Configuration
+asyncio.run(extract_quotes())
+```
+
+Models support CSS/XPath auto-detection, HTML attribute targeting, custom transforms, and nested models.
+
+??? note "Nested models, transforms, and attribute extraction"
+ ```python
+ from datetime import datetime
+ from pydoll.extractor import ExtractionModel, Field
+
+ def parse_date(raw: str) -> datetime:
+ return datetime.strptime(raw.strip(), '%B %d, %Y')
+
+ class Author(ExtractionModel):
+ name: str = Field(selector='.author-title')
+ born: datetime = Field(
+ selector='.author-born-date',
+ transform=parse_date,
+ )
-For more advanced usage scenarios, Pydoll allows you to customize your browser configuration using the `ChromiumOptions` class. This is useful when you need to:
+ class Article(ExtractionModel):
+ title: str = Field(selector='h1')
+ url: str = Field(selector='.source-link', attribute='href')
+ author: Author = Field(selector='.author-card', description='Nested model')
-- Run in headless mode (no visible browser window)
-- Specify a custom browser executable path
-- Configure proxies, user agents, or other browser settings
-- Set window dimensions or startup arguments
+ article = await tab.extract(Article, timeout=5)
+ article.author.born.year # int — types are preserved all the way down
+ ```
+
+## Extended Example: Combining Both Approaches
-Here's an example showing how to use custom options for Chrome:
+A real-world scraping task typically combines both approaches: imperative automation to navigate and bypass challenges, then declarative extraction to collect structured data.
-```python hl_lines="8-12 30-32 34-38"
+```python
import asyncio
-import os
+from typing import Optional
+
from pydoll.browser.chromium import Chrome
from pydoll.browser.options import ChromiumOptions
+from pydoll.extractor import ExtractionModel, Field
+
+
+class GitHubRepo(ExtractionModel):
+ name: str = Field(
+ selector='[itemprop="name"] a',
+ description='Repository name',
+ )
+ description: Optional[str] = Field(
+ selector='[itemprop="description"]',
+ description='Repository description',
+ default=None,
+ )
+ language: Optional[str] = Field(
+ selector='[itemprop="programmingLanguage"]',
+ description='Primary programming language',
+ default=None,
+ )
+
async def main():
options = ChromiumOptions()
- options.binary_location = '/usr/bin/google-chrome-stable'
options.add_argument('--headless=new')
- options.add_argument('--start-maximized')
- options.add_argument('--disable-notifications')
-
+
async with Chrome(options=options) as browser:
tab = await browser.start()
- await tab.go_to('https://github.com/autoscrape-labs/pydoll')
-
- star_button = await tab.find(
- tag_name='button',
- timeout=5,
- raise_exc=False
- )
- if not star_button:
- print("Ops! The button was not found.")
- return
- await star_button.click()
- await asyncio.sleep(3)
+ # 1. Navigate and interact (imperative)
+ await tab.go_to('https://github.com/autoscrape-labs')
- screenshot_path = os.path.join(os.getcwd(), 'pydoll_repo.png')
- await tab.take_screenshot(path=screenshot_path)
- print(f"Screenshot saved to: {screenshot_path}")
-
- base64_screenshot = await tab.take_screenshot(as_base64=True)
-
- repo_description_element = await tab.find(
- class_name='f4.my-3'
+ # 2. Extract structured data (declarative)
+ repos = await tab.extract_all(
+ GitHubRepo,
+ scope='article.Box-row',
+ timeout=10,
)
- repo_description = await repo_description_element.text
- print(f"Repository description: {repo_description}")
+
+ for repo in repos:
+ print(f'{repo.name} ({repo.language}): {repo.description}')
+ print(repo.model_dump_json())
if __name__ == "__main__":
asyncio.run(main())
```
-This extended example demonstrates:
+This example demonstrates:
-1. Creating and configuring browser options
-2. Setting a custom Chrome binary path
-3. Enabling headless mode for invisible operation
-4. Setting additional browser flags
-5. Taking screenshots (especially useful in headless mode)
+1. Defining a typed model for GitHub repository data
+2. Configuring headless mode for invisible operation
+3. Using `extract_all` to collect multiple repositories at once
+4. Getting fully typed objects with IDE autocomplete and pydantic serialization
??? info "About Chromium Options"
The `options.add_argument()` method allows you to pass any Chromium command-line argument to customize browser behavior. There are hundreds of available options to control everything from networking to rendering behavior.
@@ -232,10 +258,11 @@ Pydoll relies on just a few carefully selected packages:
```
python = "^3.10"
-websockets = "^13.1"
+websockets = "^14"
aiohttp = "^3.9.5"
-aiofiles = "^23.2.1"
-bs4 = "^0.0.2"
+aiofiles = "^25.1.0"
+pydantic = "^2.0"
+typing_extensions = "^4.14.0"
```
That's it! This minimal dependency approach means:
diff --git a/docs/pt/index.md b/docs/pt/index.md
index 048de490..58a8e44e 100644
--- a/docs/pt/index.md
+++ b/docs/pt/index.md
@@ -50,6 +50,7 @@ $ pip install git+https://github.com/autoscrape-labs/pydoll.git
- **Monitoramento de Rede Poderoso**: Intercepte, modifique e analise todo o tráfego de rede com facilidade, dando a você controle total sobre as requisições.
- **Arquitetura Orientada a Eventos**: Reaja a eventos da página, requisições de rede e interações do usuário em tempo real.
- **Localização de Elementos Intuitiva**: Métodos modernos `find()` e `query()` que fazem sentido e funcionam como você esperaria.
+- **Extração Estruturada**: Defina um modelo [Pydantic](https://docs.pydantic.dev/), chame `tab.extract()` e receba dados tipados e validados. Sem consulta manual elemento por elemento.
- **Segurança de Tipos Robusta**: Sistema de tipos abrangente para melhor suporte da IDE e prevenção de erros.
@@ -57,9 +58,11 @@ Pronto para começar? As páginas a seguir guiarão você pela instalação, uso
Vamos começar a automatizar a web, da maneira certa! 🚀
-## Guia de Início Rápido: Um exemplo simples
+## Guia de Início Rápido
-Vamos começar com um exemplo prático. O script a seguir abrirá o repositório Pydoll no GitHub e o marcará como favorito:
+### 1. Automação Stateful e Evasão
+
+Quando você precisa navegar, contornar desafios ou interagir com UIs dinâmicas, a API imperativa do Pydoll cuida de tudo com timing humanizado por padrão.
```python
import asyncio
@@ -70,6 +73,7 @@ async def main():
tab = await browser.start()
await tab.go_to('https://github.com/autoscrape-labs/pydoll')
+ # Encontra elementos e interage com timing humano
star_button = await tab.find(
tag_name='button',
timeout=5,
@@ -85,100 +89,123 @@ async def main():
asyncio.run(main())
```
-Este exemplo demonstra como navegar até um site, esperar que um elemento apareça e interagir com ele. Você pode adaptar esse padrão para automatizar diversas tarefas web.
+### 2. Extração Estruturada de Dados
-??? note "Ou use sem o gerenciador de contexto..."
- Se preferir não usar o padrão de gerenciador de contexto, você pode gerenciar a instância do navegador manualmente:
- ```python
- import asyncio
- from pydoll.browser.chromium import Chrome
+Ao chegar na página alvo, mude para o motor declarativo. Defina o que você quer com um modelo, e o Pydoll extrai — tipado, validado e pronto para uso.
+
+```python
+import asyncio
+from pydoll.browser.chromium import Chrome
+from pydoll.extractor import ExtractionModel, Field
- async def main():
- browser = Chrome()
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='O texto da citação')
+ author: str = Field(selector='.author', description='Quem disse')
+ tags: list[str] = Field(selector='.tag', description='Tags')
+ year: int | None = Field(selector='.year', description='Ano', default=None)
+
+async def extract_quotes():
+ async with Chrome() as browser:
tab = await browser.start()
- await tab.go_to('https://github.com/autoscrape-labs/pydoll')
+ await tab.go_to('https://quotes.toscrape.com')
- star_button = await tab.find(
- tag_name='button',
- timeout=5,
- raise_exc=False
- )
- if not star_button:
- print("Ops! O botão não foi encontrado.")
- return
+ quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
- await star_button.click()
- await asyncio.sleep(3)
- await browser.stop()
+ for q in quotes:
+ print(f'{q.author}: {q.text}') # totalmente tipado, autocomplete da IDE funciona
+ print(q.tags) # list[str], não um elemento bruto
+ print(q.model_dump_json()) # serialização pydantic embutida
- asyncio.run(main())
- ```
- Observe que, ao não usar o gerenciador de contexto, você precisará chamar explicitamente `browser.stop()` para liberar os recursos.
+asyncio.run(extract_quotes())
+```
+
+Modelos suportam auto-detecção CSS/XPath, extração de atributos HTML, transforms customizados e modelos aninhados.
+??? note "Modelos aninhados, transforms e extração de atributos"
+ ```python
+ from datetime import datetime
+ from pydoll.extractor import ExtractionModel, Field
+
+ def parse_date(raw: str) -> datetime:
+ return datetime.strptime(raw.strip(), '%B %d, %Y')
+
+ class Author(ExtractionModel):
+ name: str = Field(selector='.author-title')
+ born: datetime = Field(
+ selector='.author-born-date',
+ transform=parse_date,
+ )
-## Exemplo Estendido: Configuração personalizada do navegador
+ class Article(ExtractionModel):
+ title: str = Field(selector='h1')
+ url: str = Field(selector='.source-link', attribute='href')
+ author: Author = Field(selector='.author-card', description='Modelo aninhado')
-Para cenários de uso mais avançados, o Pydoll permite personalizar a configuração do seu navegador usando a classe `ChromiumOptions`. Isso é útil quando você precisa:
+ article = await tab.extract(Article, timeout=5)
+ article.author.born.year # int — tipos preservados em toda a cadeia
+ ```
-- Executar em modo headless (sem janela do navegador visível)
-- Especificar um caminho personalizado para o executável do navegador
-- Configurar proxies, user agents ou outras configurações do navegador
-- Definir as dimensões da janela ou argumentos de inicialização
+## Exemplo Estendido: Combinando as Duas Abordagens
-Aqui está um exemplo mostrando como usar opções personalizadas para o Chrome:
+Uma tarefa real de scraping tipicamente combina as duas abordagens: automação imperativa para navegar e contornar desafios, depois extração declarativa para coletar dados estruturados.
-```python hl_lines="8-12 30-32 34-38"
+```python
import asyncio
-import os
+from typing import Optional
+
from pydoll.browser.chromium import Chrome
from pydoll.browser.options import ChromiumOptions
+from pydoll.extractor import ExtractionModel, Field
+
+
+class GitHubRepo(ExtractionModel):
+ name: str = Field(
+ selector='[itemprop="name"] a',
+ description='Nome do repositório',
+ )
+ description: Optional[str] = Field(
+ selector='[itemprop="description"]',
+ description='Descrição do repositório',
+ default=None,
+ )
+ language: Optional[str] = Field(
+ selector='[itemprop="programmingLanguage"]',
+ description='Linguagem de programação principal',
+ default=None,
+ )
+
async def main():
options = ChromiumOptions()
- options.binary_location = '/usr/bin/google-chrome-stable'
options.add_argument('--headless=new')
- options.add_argument('--start-maximized')
- options.add_argument('--disable-notifications')
async with Chrome(options=options) as browser:
tab = await browser.start()
- await tab.go_to('https://github.com/autoscrape-labs/pydoll')
- star_button = await tab.find(
- tag_name='button',
- timeout=5,
- raise_exc=False
- )
- if not star_button:
- print("Ops! O botão não foi encontrado.")
- return
+ # 1. Navegar e interagir (imperativo)
+ await tab.go_to('https://github.com/autoscrape-labs')
- await star_button.click()
- await asyncio.sleep(3)
-
- screenshot_path = os.path.join(os.getcwd(), 'pydoll_repo.png')
- await tab.take_screenshot(path=screenshot_path)
- print(f"Captura de tela salva em: {screenshot_path}")
-
- base64_screenshot = await tab.take_screenshot(as_base64=True)
-
- repo_description_element = await tab.find(
- class_name='f4.my-3'
+ # 2. Extrair dados estruturados (declarativo)
+ repos = await tab.extract_all(
+ GitHubRepo,
+ scope='article.Box-row',
+ timeout=10,
)
- repo_description = await repo_description_element.text
- print(f"Descrição do repositório: {repo_description}")
+
+ for repo in repos:
+ print(f'{repo.name} ({repo.language}): {repo.description}')
+ print(repo.model_dump_json())
if __name__ == "__main__":
asyncio.run(main())
```
-Este exemplo estendido demonstra:
+Este exemplo demonstra:
-1. Criação e configuração de opções do navegador
-2. Definição de um caminho personalizado para o binário do Chrome
-3. Habilitação do modo headless para operação invisível
-4. Definição de sinalizadores adicionais do navegador
-5. Captura de tela (especialmente útil em modo headless) modo)
+1. Definição de um modelo tipado para dados de repositórios GitHub
+2. Configuração do modo headless para operação invisível
+3. Uso de `extract_all` para coletar múltiplos repositórios de uma vez
+4. Objetos totalmente tipados com autocomplete da IDE e serialização pydantic
??? info "Sobre as Opções do Chromium"
O método `options.add_argument()` permite que você passe qualquer argumento de linha de comando do Chromium para personalizar o comportamento do navegador. Existem centenas de opções disponíveis para controlar tudo, desde rede até comportamento de renderização.
@@ -231,10 +258,11 @@ O Pydoll depende de apenas alguns pacotes cuidadosamente selecionados:
```
python = "^3.10"
-websockets = "^13.1"
+websockets = "^14"
aiohttp = "^3.9.5"
-aiofiles = "^23.2.1"
-bs4 = "^0.0.2"
+aiofiles = "^25.1.0"
+pydantic = "^2.0"
+typing_extensions = "^4.14.0"
```
É só isso! Essa dependência mínima do Pydoll significa:
diff --git a/docs/zh/index.md b/docs/zh/index.md
index e1ba73f4..323fcf80 100644
--- a/docs/zh/index.md
+++ b/docs/zh/index.md
@@ -50,6 +50,7 @@ $ pip install git+https://github.com/autoscrape-labs/pydoll.git
- **强大的网络监控**: 轻松实现请求拦截、流量篡改与响应分析,完整掌控网络通信链路,轻松突破层层防护体系。
- **事件驱动架构**: 实时响应页面事件、网络请求与用户交互,构建能动态适应防护系统的智能自动化流。
- **直观的元素定位**: 使用符合人类直觉的定位方法 `find()` 和 `query()` ,面对动态加载的防护内容,定位依然精准。
+- **结构化提取**: 定义 [Pydantic](https://docs.pydantic.dev/) 模型,调用 `tab.extract()`,获取类型化和验证过的数据。无需逐元素手动查询。
- **强类型安全**: 完备的类型系统为复杂自动化场景提供更优IDE支持和更好地预防运行时报错。
@@ -57,9 +58,11 @@ $ pip install git+https://github.com/autoscrape-labs/pydoll.git
让我们以最优雅的方式,开启您的网页自动化之旅!🚀
-## 简单的例子上手
+## 快速入门
-让我们从一个实际案例开始。以下脚本将打开 Pydoll 的 GitHub 仓库并star:
+### 1. 有状态自动化与规避
+
+当您需要导航、绕过挑战或与动态UI交互时,Pydoll的命令式API默认以人性化的时序处理一切。
```python
import asyncio
@@ -69,14 +72,15 @@ async def main():
async with Chrome() as browser:
tab = await browser.start()
await tab.go_to('https://github.com/autoscrape-labs/pydoll')
-
+
+ # 查找元素并以人类般的时序进行交互
star_button = await tab.find(
tag_name='button',
timeout=5,
raise_exc=False
)
if not star_button:
- print("Ops! The button was not found.")
+ print("按钮未找到。")
return
await star_button.click()
@@ -85,102 +89,123 @@ async def main():
asyncio.run(main())
```
-此示例演示了如何导航到网站、等待元素出现并与之交互。您可以使用这样的模式来自动执行许多不同的 Web 任务。
+### 2. 结构化数据提取
-??? note "或者使用不带上下文管理器的..."
- 如果你不想要使用上下文管理器模式,你可以手动管理浏览器实例:
-
- ```python
- import asyncio
- from pydoll.browser.chromium import Chrome
-
- async def main():
- browser = Chrome()
+到达目标页面后,切换到声明式引擎。用模型定义您想要的数据,Pydoll会提取它——类型化、验证过、随时可用。
+
+```python
+import asyncio
+from pydoll.browser.chromium import Chrome
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='引用文本')
+ author: str = Field(selector='.author', description='作者')
+ tags: list[str] = Field(selector='.tag', description='标签')
+ year: int | None = Field(selector='.year', description='年份', default=None)
+
+async def extract_quotes():
+ async with Chrome() as browser:
tab = await browser.start()
- await tab.go_to('https://github.com/autoscrape-labs/pydoll')
-
- star_button = await tab.find(
- tag_name='button',
- timeout=5,
- raise_exc=False
- )
- if not star_button:
- print("Ops! The button was not found.")
- return
+ await tab.go_to('https://quotes.toscrape.com')
- await star_button.click()
- await asyncio.sleep(3)
- await browser.stop()
-
- asyncio.run(main())
- ```
-
- Note that when not using the context manager, you'll need to explicitly call `browser.stop()` to release resources.
+ quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
-## 补充例子: 自定义浏览器配置
+ for q in quotes:
+ print(f'{q.author}: {q.text}') # 完全类型化,IDE自动补全
+ print(q.tags) # list[str],不是原始元素
+ print(q.model_dump_json()) # 内置pydantic序列化
-对于更高级的使用场景,Pydoll 允许您使用 `ChromiumOptions` 类自定义浏览器配置。此功能在您需要执行以下操作时非常有用:
+asyncio.run(extract_quotes())
+```
-- 在无头模式下运行(无可见浏览器窗口)
-- 指定自定义浏览器可执行文件路径
-- 配置代理、用户代理或其他浏览器设置
-- 设置窗口尺寸或启动参数
+模型支持CSS/XPath自动检测、HTML属性提取、自定义转换函数和嵌套模型。
-以下示例展示了如何使用 Chrome 的自定义选项:
+??? note "嵌套模型、转换函数和属性提取"
+ ```python
+ from datetime import datetime
+ from pydoll.extractor import ExtractionModel, Field
-```python hl_lines="8-12 30-32 34-38"
+ def parse_date(raw: str) -> datetime:
+ return datetime.strptime(raw.strip(), '%B %d, %Y')
+
+ class Author(ExtractionModel):
+ name: str = Field(selector='.author-title')
+ born: datetime = Field(
+ selector='.author-born-date',
+ transform=parse_date,
+ )
+
+ class Article(ExtractionModel):
+ title: str = Field(selector='h1')
+ url: str = Field(selector='.source-link', attribute='href')
+ author: Author = Field(selector='.author-card', description='嵌套模型')
+
+ article = await tab.extract(Article, timeout=5)
+ article.author.born.year # int — 类型在整个链中保持一致
+ ```
+
+## 扩展示例:结合两种方式
+
+实际的抓取任务通常结合两种方式:命令式自动化用于导航和绕过挑战,然后声明式提取用于收集结构化数据。
+
+```python
import asyncio
-import os
+from typing import Optional
+
from pydoll.browser.chromium import Chrome
from pydoll.browser.options import ChromiumOptions
+from pydoll.extractor import ExtractionModel, Field
+
+
+class GitHubRepo(ExtractionModel):
+ name: str = Field(
+ selector='[itemprop="name"] a',
+ description='仓库名称',
+ )
+ description: Optional[str] = Field(
+ selector='[itemprop="description"]',
+ description='仓库描述',
+ default=None,
+ )
+ language: Optional[str] = Field(
+ selector='[itemprop="programmingLanguage"]',
+ description='主要编程语言',
+ default=None,
+ )
+
async def main():
options = ChromiumOptions()
- options.binary_location = '/usr/bin/google-chrome-stable'
options.add_argument('--headless=new')
- options.add_argument('--start-maximized')
- options.add_argument('--disable-notifications')
-
+
async with Chrome(options=options) as browser:
tab = await browser.start()
- await tab.go_to('https://github.com/autoscrape-labs/pydoll')
-
- star_button = await tab.find(
- tag_name='button',
- timeout=5,
- raise_exc=False
- )
- if not star_button:
- print("Ops! The button was not found.")
- return
- await star_button.click()
- await asyncio.sleep(3)
-
- screenshot_path = os.path.join(os.getcwd(), 'pydoll_repo.png')
- await tab.take_screenshot(path=screenshot_path)
- print(f"Screenshot saved to: {screenshot_path}")
+ # 1. 导航和交互(命令式)
+ await tab.go_to('https://github.com/autoscrape-labs')
- base64_screenshot = await tab.take_screenshot(as_base64=True)
-
- repo_description_element = await tab.find(
- class_name='f4.my-3'
+ # 2. 提取结构化数据(声明式)
+ repos = await tab.extract_all(
+ GitHubRepo,
+ scope='article.Box-row',
+ timeout=10,
)
- repo_description = await repo_description_element.text
- print(f"Repository description: {repo_description}")
+
+ for repo in repos:
+ print(f'{repo.name} ({repo.language}): {repo.description}')
+ print(repo.model_dump_json())
if __name__ == "__main__":
asyncio.run(main())
```
+此示例演示了:
-此扩展示例演示了:
-
-1. 创建和配置浏览器选项
-2. 设置自定义Chrome可执行程序路径
-3. 启用无头模式以实现无痕操作
-4. 设置其他浏览器命令行flags
-5. 屏幕截图(在无头模式下尤其有用)
+1. 为GitHub仓库数据定义类型化模型
+2. 配置无头模式以实现无痕操作
+3. 使用 `extract_all` 一次性收集多个仓库
+4. 获取完全类型化的对象,支持IDE自动补全和pydantic序列化
??? info "关于Chrome配置选项"
    `options.add_argument()` 方法允许您传递任何 Chromium 命令行参数来自定义浏览器行为。有数百个可用选项,可控制从网络到渲染行为的所有内容。
@@ -233,10 +258,11 @@ Pydoll仅依赖少量的核心库:
```
python = "^3.10"
-websockets = "^13.1"
+websockets = "^14"
aiohttp = "^3.9.5"
-aiofiles = "^23.2.1"
-bs4 = "^0.0.2"
+aiofiles = "^25.1.0"
+pydantic = "^2.0"
+typing_extensions = "^4.14.0"
```
这种极简依赖策略带来五大核心优势:
From 35f489840faecac84a4f1e9a1dc2ae8a2a6cd322 Mon Sep 17 00:00:00 2001
From: Thalison Fernandes
Date: Sun, 22 Mar 2026 17:54:52 -0300
Subject: [PATCH 18/21] docs(extractor): add structured extraction guide in en,
pt, zh
---
.../extraction/structured-extraction.md | 329 ++++++++++++++++++
docs/en/features/index.md | 6 +
.../extraction/structured-extraction.md | 329 ++++++++++++++++++
docs/pt/features/index.md | 6 +
.../extraction/structured-extraction.md | 329 ++++++++++++++++++
docs/zh/features/index.md | 6 +
mkdocs.yml | 2 +
7 files changed, 1007 insertions(+)
create mode 100644 docs/en/features/extraction/structured-extraction.md
create mode 100644 docs/pt/features/extraction/structured-extraction.md
create mode 100644 docs/zh/features/extraction/structured-extraction.md
diff --git a/docs/en/features/extraction/structured-extraction.md b/docs/en/features/extraction/structured-extraction.md
new file mode 100644
index 00000000..8cb43d50
--- /dev/null
+++ b/docs/en/features/extraction/structured-extraction.md
@@ -0,0 +1,329 @@
+# Structured Data Extraction
+
+Pydoll's extraction engine lets you define **what** you want from a page using typed models, and handles the **how** automatically. Instead of manually querying elements one by one, you declare a model with selectors and call `tab.extract()`. The result is a fully typed, validated Python object powered by [Pydantic](https://docs.pydantic.dev/).
+
+## Why Use Structured Extraction?
+
+Traditional scraping code tends to grow into a tangled mess of `find()` calls, `await element.text`, attribute reads, and manual type conversions scattered across dozens of lines. When the page changes, you hunt through that code to find which selector broke.
+
+With structured extraction, all your selectors live in one place (the model), the types are enforced automatically, and the output is a clean Pydantic object with IDE autocomplete and serialization built in.
+
+## Basic Usage
+
+### Defining a Model
+
+An extraction model is a class that inherits from `ExtractionModel`. Each field uses `Field()` to declare a CSS or XPath selector.
+
+```python
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='The quote text')
+ author: str = Field(selector='.author', description='Who said it')
+ tags: list[str] = Field(selector='.tag', description='Associated tags')
+```
+
+The `selector` parameter accepts both CSS selectors and XPath expressions. Pydoll auto-detects the type, exactly like `tab.query()`.
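+
+The auto-detection rule is simple enough to sketch. The following is a hypothetical illustration of the heuristic described later in this guide (XPath expressions start with `/` or `./`), not Pydoll's actual implementation:
+
```python
# Hypothetical sketch of the selector auto-detection heuristic described
# in this guide: expressions starting with '/' or './' are treated as
# XPath, everything else as CSS. The real engine may perform extra checks.
def looks_like_xpath(selector: str) -> bool:
    return selector.startswith(('/', './'))

print(looks_like_xpath('//h3[@class="title"]'))  # True  -> XPath
print(looks_like_xpath('.quote'))                # False -> CSS
```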
+
+### Extracting a Single Item
+
+Use `tab.extract()` to populate one model instance from the page:
+
+```python
+from pydoll.browser.chromium import Chrome
+
+async with Chrome() as browser:
+ tab = await browser.start()
+ await tab.go_to('https://example.com/article')
+
+ article = await tab.extract(Article)
+ print(article.title) # str, fully typed
+ print(article.model_dump()) # dict via pydantic
+```
+
+### Extracting Multiple Items
+
+Use `tab.extract_all()` with a `scope` selector that identifies the repeating container. Each match generates one model instance, with fields resolved relative to that container.
+
+```python
+quotes = await tab.extract_all(Quote, scope='.quote')
+
+for q in quotes:
+ print(f'{q.author}: {q.text}')
+ print(q.tags)
+```
+
+You can limit the number of results:
+
+```python
+top_5 = await tab.extract_all(Quote, scope='.quote', limit=5)
+```
+
+## Field Options
+
+The `Field()` function accepts the following parameters:
+
+| Parameter | Type | Description |
+|---------------|-------------------------|--------------------------------------------------------------|
+| `selector` | `str` or `None` | CSS or XPath selector (auto-detected) |
+| `attribute` | `str` or `None` | HTML attribute to read instead of inner text |
+| `description` | `str` or `None` | Semantic description of the field |
+| `default` | any value | Default value when the element is not found |
+| `transform` | callable or `None` | Post-processing function applied to the raw string |
+
+At least one of `selector` or `description` must be provided. Fields with only `description` (no selector) are reserved for future LLM-based extraction and are skipped by the current CSS engine.
+
+## Attribute Extraction
+
+By default, the engine reads the element's visible text (`innerText`). To read an HTML attribute instead, use the `attribute` parameter:
+
+```python
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ published: str = Field(
+ selector='time.date',
+ attribute='datetime',
+ description='ISO publication date',
+ )
+ image_url: str = Field(
+ selector='.hero img',
+ attribute='src',
+ description='Hero image URL',
+ )
+ link: str = Field(
+ selector='a.source',
+ attribute='href',
+ description='Source link',
+ )
+ image_id: str = Field(
+ selector='.hero img',
+ attribute='data-id',
+ description='Custom data attribute',
+ )
+```
+
+Any HTML attribute works, including `data-*`, `aria-*`, `href`, `src`, `alt`, and custom attributes.
+
+## Transforms
+
+The `transform` parameter takes a callable that receives the raw string from the DOM and returns the desired type. This is where you convert strings to numbers, parse dates, or clean up formatting.
+
+```python
+from datetime import datetime
+
+def parse_price(raw: str) -> float:
+ return float(raw.replace('R$', '').replace('.', '').replace(',', '.').strip())
+
+def parse_date(raw: str) -> datetime:
+ return datetime.strptime(raw.strip(), '%B %d, %Y')
+
+class Product(ExtractionModel):
+ name: str = Field(selector='.name', description='Product name')
+ price: float = Field(
+ selector='.price',
+ description='Price in BRL',
+ transform=parse_price,
+ )
+ release: datetime = Field(
+ selector='.release-date',
+ description='Release date',
+ transform=parse_date,
+ )
+```
+
+The transform runs **before** Pydantic validation, so the field type should match what the transform returns.
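+
+Because a transform is just a plain function, you can unit-test it without launching a browser. The helpers from the example above run standalone:
+
```python
from datetime import datetime

# The transform helpers from the example above, runnable on their own.
def parse_price(raw: str) -> float:
    return float(raw.replace('R$', '').replace('.', '').replace(',', '.').strip())

def parse_date(raw: str) -> datetime:
    return datetime.strptime(raw.strip(), '%B %d, %Y')

print(parse_price('R$ 1.234,56'))         # 1234.56
print(parse_date('March 22, 2026').year)  # 2026
```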
+
+## Nested Models
+
+When a field's type is another `ExtractionModel`, the engine uses the field's selector to find a scope element, then extracts the nested model's fields within that scope.
+
+```python
+class Author(ExtractionModel):
+ name: str = Field(selector='.name', description='Author name')
+ avatar: str = Field(
+ selector='img.avatar',
+ attribute='src',
+ description='Avatar URL',
+ )
+ bio: str = Field(selector='.bio', description='Short bio')
+
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ author: Author = Field(
+ selector='.author-card',
+ description='Author information',
+ )
+```
+
+The `.author-card` selector defines the scope. The `Author` fields (`.name`, `img.avatar`, `.bio`) are resolved **inside** that element, not from the full page. This prevents selector collisions when the page has multiple `.name` elements in different sections.
+
+### Lists of Nested Models
+
+You can also extract a list of nested models:
+
+```python
+class Contributor(ExtractionModel):
+ name: str = Field(selector='.name', description='Contributor name')
+ role: str = Field(selector='.role', description='Role')
+
+class Project(ExtractionModel):
+ title: str = Field(selector='h1', description='Project title')
+ contributors: list[Contributor] = Field(
+ selector='.contributor',
+ description='Project contributors',
+ )
+```
+
+Each `.contributor` element becomes the scope for one `Contributor` instance.
+
+## Optional Fields and Defaults
+
+Fields that might not be present on every page should use `Optional` with a `default`:
+
+```python
+from typing import Optional
+
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ subtitle: Optional[str] = Field(
+ selector='.subtitle',
+ description='Optional subtitle',
+ default=None,
+ )
+ category: str = Field(
+ selector='.category',
+ description='Category with fallback',
+ default='uncategorized',
+ )
+```
+
+When the element is not found:
+
+- Fields **with** a default silently use that default value.
+- Fields **without** a default (required) raise `FieldExtractionFailed`.
+
+Both `typing.Optional[str]` and the PEP 604 syntax `str | None` are supported.
+
+## Timeout and Waiting
+
+The `timeout` parameter controls how long the engine waits for elements to appear, in seconds. This is propagated to every internal query, including nested models and list fields.
+
+```python
+# Wait up to 10 seconds for elements to appear
+article = await tab.extract(Article, timeout=10)
+
+# No waiting (default), elements must already be in the DOM
+article = await tab.extract(Article)
+
+# Also works with extract_all
+quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
+```
+
+This uses the same polling mechanism as `tab.query(timeout=...)`, so there is no need for manual `asyncio.sleep()` calls between navigation and extraction.
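+
+Conceptually, such a poll loop looks like the sketch below. This is a simplified illustration, assuming a `check` callable that returns the element or `None`; Pydoll's internal mechanism may differ in detail:
+
```python
import asyncio

# Simplified sketch of a poll-until-found loop: call `check` repeatedly
# until it returns a value or the deadline passes.
async def poll(check, timeout: float, interval: float = 0.1):
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if asyncio.get_running_loop().time() >= deadline:
            return None
        await asyncio.sleep(interval)

# Simulate an element that appears on the third poll.
attempts = iter([None, None, 'element'])
print(asyncio.run(poll(lambda: next(attempts), timeout=2.0)))  # element
```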
+
+## Scoped Extraction
+
+The `scope` parameter limits extraction to a specific region of the page:
+
+```python
+# Extract only from the main article, ignoring sidebar/footer
+article = await tab.extract(Article, scope='#main-article')
+
+# extract_all requires scope (it defines the repeating container)
+quotes = await tab.extract_all(Quote, scope='.quote')
+```
+
+## XPath Selectors
+
+XPath expressions are auto-detected (they start with `/` or `./`) and work everywhere CSS selectors work:
+
+```python
+class SearchResult(ExtractionModel):
+ title: str = Field(
+ selector='//h3[@class="title"]',
+ description='Result title via XPath',
+ )
+ url: str = Field(
+ selector='.//a',
+ attribute='href',
+ description='Result URL',
+ )
+```
+
+## Error Handling
+
+The extraction engine raises specific exceptions that you can catch and handle:
+
+```python
+from pydoll.extractor import FieldExtractionFailed, InvalidExtractionModel
+
+# InvalidExtractionModel: raised at model definition time
+# when a Field has neither selector nor description
+try:
+ class BadModel(ExtractionModel):
+ field: str = Field() # no selector, no description
+except InvalidExtractionModel:
+ print('Invalid model definition')
+
+# FieldExtractionFailed: raised at extraction time
+# when a required field's element is not found
+try:
+ result = await tab.extract(MyModel)
+except FieldExtractionFailed as e:
+ print(f'Extraction failed: {e}')
+```
+
+For optional fields, extraction failures are silently handled and the default value is used. Only required fields (those without a `default`) raise exceptions.
+
+## Pydantic Integration
+
+`ExtractionModel` inherits from `pydantic.BaseModel`, so all Pydantic features work out of the box:
+
+```python
+article = await tab.extract(Article)
+
+# Serialization
+article.model_dump() # dict
+article.model_dump_json() # JSON string
+
+# JSON Schema (useful for API docs or LLM prompts)
+Article.model_json_schema()
+
+# Validation happens automatically
+# If a transform returns the wrong type, Pydantic raises ValidationError
+```
+
+You can use any Pydantic feature in your models: validators, field aliases, model configuration, and more. The extraction engine adds the selector/transform layer on top without interfering with Pydantic's behavior.
+
+## Complete Example
+
+Here is a complete, runnable example that extracts quotes from [quotes.toscrape.com](https://quotes.toscrape.com):
+
+```python
+import asyncio
+from pydoll.browser.chromium import Chrome
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='The quote text')
+ author: str = Field(selector='.author', description='Who said the quote')
+ tags: list[str] = Field(selector='.tag', description='Associated tags')
+
+async def main():
+ async with Chrome() as browser:
+ tab = await browser.start()
+ await tab.go_to('https://quotes.toscrape.com')
+
+ quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
+
+ print(f'Extracted {len(quotes)} quotes\n')
+ for q in quotes:
+ print(f'"{q.text}"')
+ print(f' by {q.author} | tags: {", ".join(q.tags)}\n')
+
+ # Pydantic serialization
+ for q in quotes:
+ print(q.model_dump_json())
+
+asyncio.run(main())
+```
diff --git a/docs/en/features/index.md b/docs/en/features/index.md
index 04101d9b..3f517714 100644
--- a/docs/en/features/index.md
+++ b/docs/en/features/index.md
@@ -18,6 +18,12 @@ Finding and interacting with page elements is the bread and butter of automation
**[Element Finding](element-finding.md)**: Master Pydoll's element location strategies, from the intuitive `find()` method that uses natural HTML attributes, to the powerful `query()` method for CSS selectors and XPath. You'll also learn about DOM traversal helpers that let you navigate the page structure efficiently.
+## Data Extraction
+
+Turn web pages into structured Python objects with typed models, automatic validation, and Pydantic serialization.
+
+**[Structured Extraction](extraction/structured-extraction.md)**: Define a Pydantic model with CSS/XPath selectors, call `tab.extract()`, and get a fully typed object back. Supports nested models, list fields, attribute extraction, custom transforms, optional fields with defaults, and configurable timeouts. No manual element-by-element querying required.
+
## Automation Capabilities
These are the features that bring your automation to life: simulating user interactions, keyboard control, handling file operations, working with iframes, and capturing visual content.
diff --git a/docs/pt/features/extraction/structured-extraction.md b/docs/pt/features/extraction/structured-extraction.md
new file mode 100644
index 00000000..ba51cc84
--- /dev/null
+++ b/docs/pt/features/extraction/structured-extraction.md
@@ -0,0 +1,329 @@
+# Extração Estruturada de Dados
+
+O motor de extração do Pydoll permite que você defina **o que** deseja de uma página usando modelos tipados, e cuida do **como** automaticamente. Em vez de consultar elementos manualmente um a um, você declara um modelo com seletores e chama `tab.extract()`. O resultado é um objeto Python totalmente tipado e validado, alimentado pelo [Pydantic](https://docs.pydantic.dev/).
+
+## Por Que Usar Extração Estruturada?
+
+Código de scraping tradicional tende a crescer em uma confusão de chamadas `find()`, `await element.text`, leitura de atributos e conversões manuais de tipo espalhadas por dezenas de linhas. Quando a página muda, você precisa caçar no código para encontrar qual seletor quebrou.
+
+Com extração estruturada, todos os seus seletores ficam em um único lugar (o modelo), os tipos são garantidos automaticamente, e a saída é um objeto Pydantic limpo com autocomplete da IDE e serialização embutida.
+
+## Uso Básico
+
+### Definindo um Modelo
+
+Um modelo de extração é uma classe que herda de `ExtractionModel`. Cada campo usa `Field()` para declarar um seletor CSS ou XPath.
+
+```python
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='The quote text')
+ author: str = Field(selector='.author', description='Who said it')
+ tags: list[str] = Field(selector='.tag', description='Associated tags')
+```
+
+O parâmetro `selector` aceita tanto seletores CSS quanto expressões XPath. O Pydoll auto-detecta o tipo, exatamente como o `tab.query()`.
+
+### Extraindo um Único Item
+
+Use `tab.extract()` para preencher uma instância do modelo a partir da página:
+
+```python
+from pydoll.browser.chromium import Chrome
+
+async with Chrome() as browser:
+ tab = await browser.start()
+ await tab.go_to('https://example.com/article')
+
+ article = await tab.extract(Article)
+ print(article.title) # str, fully typed
+ print(article.model_dump()) # dict via pydantic
+```
+
+### Extraindo Múltiplos Itens
+
+Use `tab.extract_all()` com um seletor `scope` que identifica o container repetido. Cada match gera uma instância do modelo, com os campos resolvidos relativamente àquele container.
+
+```python
+quotes = await tab.extract_all(Quote, scope='.quote')
+
+for q in quotes:
+ print(f'{q.author}: {q.text}')
+ print(q.tags)
+```
+
+Você pode limitar o número de resultados:
+
+```python
+top_5 = await tab.extract_all(Quote, scope='.quote', limit=5)
+```
+
+## Opções do Field
+
+A função `Field()` aceita os seguintes parâmetros:
+
+| Parâmetro | Tipo | Descrição |
+|---------------|-------------------------|--------------------------------------------------------------|
+| `selector` | `str` ou `None` | Seletor CSS ou XPath (auto-detectado) |
+| `attribute` | `str` ou `None` | Atributo HTML a ler em vez do texto interno |
+| `description` | `str` ou `None` | Descrição semântica do campo |
+| `default` | qualquer valor | Valor padrão quando o elemento não é encontrado |
+| `transform` | callable ou `None` | Função de pós-processamento aplicada à string bruta |
+
+Pelo menos um entre `selector` ou `description` deve ser fornecido. Campos com apenas `description` (sem selector) são reservados para futura extração baseada em LLM e são ignorados pelo motor CSS atual.
+
+## Extração de Atributos
+
+Por padrão, o motor lê o texto visível do elemento (`innerText`). Para ler um atributo HTML em vez disso, use o parâmetro `attribute`:
+
+```python
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ published: str = Field(
+ selector='time.date',
+ attribute='datetime',
+ description='ISO publication date',
+ )
+ image_url: str = Field(
+ selector='.hero img',
+ attribute='src',
+ description='Hero image URL',
+ )
+ link: str = Field(
+ selector='a.source',
+ attribute='href',
+ description='Source link',
+ )
+ image_id: str = Field(
+ selector='.hero img',
+ attribute='data-id',
+ description='Custom data attribute',
+ )
+```
+
+Qualquer atributo HTML funciona, incluindo `data-*`, `aria-*`, `href`, `src`, `alt` e atributos customizados.
+
+## Transforms
+
+O parâmetro `transform` aceita um callable que recebe a string bruta do DOM e retorna o tipo desejado. É aqui que você converte strings para números, parseia datas ou limpa formatação.
+
+```python
+from datetime import datetime
+
+def parse_price(raw: str) -> float:
+ return float(raw.replace('R$', '').replace('.', '').replace(',', '.').strip())
+
+def parse_date(raw: str) -> datetime:
+ return datetime.strptime(raw.strip(), '%B %d, %Y')
+
+class Product(ExtractionModel):
+ name: str = Field(selector='.name', description='Product name')
+ price: float = Field(
+ selector='.price',
+ description='Price in BRL',
+ transform=parse_price,
+ )
+ release: datetime = Field(
+ selector='.release-date',
+ description='Release date',
+ transform=parse_date,
+ )
+```
+
+O transform executa **antes** da validação do Pydantic, então o tipo do campo deve corresponder ao que o transform retorna.
+
+## Modelos Aninhados
+
+Quando o tipo de um campo é outro `ExtractionModel`, o motor usa o seletor do campo para encontrar um elemento de escopo, e então extrai os campos do modelo aninhado dentro daquele escopo.
+
+```python
+class Author(ExtractionModel):
+ name: str = Field(selector='.name', description='Author name')
+ avatar: str = Field(
+ selector='img.avatar',
+ attribute='src',
+ description='Avatar URL',
+ )
+ bio: str = Field(selector='.bio', description='Short bio')
+
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ author: Author = Field(
+ selector='.author-card',
+ description='Author information',
+ )
+```
+
+O seletor `.author-card` define o escopo. Os campos do `Author` (`.name`, `img.avatar`, `.bio`) são resolvidos **dentro** daquele elemento, não da página inteira. Isso previne colisões de seletores quando a página tem múltiplos elementos `.name` em seções diferentes.
+
+### Listas de Modelos Aninhados
+
+Você também pode extrair uma lista de modelos aninhados:
+
+```python
+class Contributor(ExtractionModel):
+ name: str = Field(selector='.name', description='Contributor name')
+ role: str = Field(selector='.role', description='Role')
+
+class Project(ExtractionModel):
+ title: str = Field(selector='h1', description='Project title')
+ contributors: list[Contributor] = Field(
+ selector='.contributor',
+ description='Project contributors',
+ )
+```
+
+Cada elemento `.contributor` se torna o escopo para uma instância de `Contributor`.
+
+## Campos Opcionais e Defaults
+
+Campos que podem não estar presentes em toda página devem usar `Optional` com um `default`:
+
+```python
+from typing import Optional
+
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ subtitle: Optional[str] = Field(
+ selector='.subtitle',
+ description='Optional subtitle',
+ default=None,
+ )
+ category: str = Field(
+ selector='.category',
+ description='Category with fallback',
+ default='uncategorized',
+ )
+```
+
+Quando o elemento não é encontrado:
+
+- Campos **com** default usam silenciosamente aquele valor padrão.
+- Campos **sem** default (obrigatórios) levantam `FieldExtractionFailed`.
+
+Tanto `typing.Optional[str]` quanto a sintaxe PEP 604 `str | None` são suportados.
+
+## Timeout e Espera
+
+O parâmetro `timeout` controla quanto tempo o motor espera até os elementos aparecerem, em segundos. Ele é propagado para toda query interna, incluindo modelos aninhados e campos lista.
+
+```python
+# Wait up to 10 seconds for elements to appear
+article = await tab.extract(Article, timeout=10)
+
+# No waiting (default), elements must already be in the DOM
+article = await tab.extract(Article)
+
+# Also works with extract_all
+quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
+```
+
+Isso usa o mesmo mecanismo de polling que `tab.query(timeout=...)`, então não há necessidade de chamadas manuais `asyncio.sleep()` entre navegação e extração.
+
+## Extração com Escopo
+
+O parâmetro `scope` limita a extração a uma região específica da página:
+
+```python
+# Extract only from the main article, ignoring sidebar/footer
+article = await tab.extract(Article, scope='#main-article')
+
+# extract_all requires scope (it defines the repeating container)
+quotes = await tab.extract_all(Quote, scope='.quote')
+```
+
+## Seletores XPath
+
+Expressões XPath são auto-detectadas (começam com `/` ou `./`) e funcionam em todo lugar que seletores CSS funcionam:
+
+```python
+class SearchResult(ExtractionModel):
+ title: str = Field(
+ selector='//h3[@class="title"]',
+ description='Result title via XPath',
+ )
+ url: str = Field(
+ selector='.//a',
+ attribute='href',
+ description='Result URL',
+ )
+```
+
+## Tratamento de Erros
+
+O motor de extração levanta exceções específicas que você pode capturar e tratar:
+
+```python
+from pydoll.extractor import FieldExtractionFailed, InvalidExtractionModel
+
+# InvalidExtractionModel: raised at model definition time
+# when a Field has neither selector nor description
+try:
+ class BadModel(ExtractionModel):
+ field: str = Field() # no selector, no description
+except InvalidExtractionModel:
+ print('Invalid model definition')
+
+# FieldExtractionFailed: raised at extraction time
+# when a required field's element is not found
+try:
+ result = await tab.extract(MyModel)
+except FieldExtractionFailed as e:
+ print(f'Extraction failed: {e}')
+```
+
+Para campos opcionais, falhas de extração são tratadas silenciosamente e o valor default é utilizado. Apenas campos obrigatórios (aqueles sem `default`) levantam exceções.
+
+## Integração com Pydantic
+
+`ExtractionModel` herda de `pydantic.BaseModel`, então todas as funcionalidades do Pydantic funcionam imediatamente:
+
+```python
+article = await tab.extract(Article)
+
+# Serialization
+article.model_dump() # dict
+article.model_dump_json() # JSON string
+
+# JSON Schema (useful for API docs or LLM prompts)
+Article.model_json_schema()
+
+# Validation happens automatically
+# If a transform returns the wrong type, Pydantic raises ValidationError
+```
+
+Você pode usar qualquer funcionalidade do Pydantic nos seus modelos: validadores, aliases de campos, configuração de modelo e mais. O motor de extração adiciona a camada de seletor/transform por cima sem interferir no comportamento do Pydantic.
+
+## Exemplo Completo
+
+Aqui está um exemplo completo e executável que extrai citações do [quotes.toscrape.com](https://quotes.toscrape.com):
+
+```python
+import asyncio
+from pydoll.browser.chromium import Chrome
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='The quote text')
+ author: str = Field(selector='.author', description='Who said the quote')
+ tags: list[str] = Field(selector='.tag', description='Associated tags')
+
+async def main():
+ async with Chrome() as browser:
+ tab = await browser.start()
+ await tab.go_to('https://quotes.toscrape.com')
+
+ quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
+
+ print(f'Extracted {len(quotes)} quotes\n')
+ for q in quotes:
+ print(f'"{q.text}"')
+ print(f' by {q.author} | tags: {", ".join(q.tags)}\n')
+
+ # Pydantic serialization
+ for q in quotes:
+ print(q.model_dump_json())
+
+asyncio.run(main())
+```
diff --git a/docs/pt/features/index.md b/docs/pt/features/index.md
index 9f0a488d..8c9c6b79 100644
--- a/docs/pt/features/index.md
+++ b/docs/pt/features/index.md
@@ -18,6 +18,12 @@ Encontrar e interagir com elementos da página é o pão com manteiga da automa
**[Localização de Elementos](element-finding.md)**: Domine as estratégias de localização de elementos do Pydoll, desde o intuitivo método `find()` que usa atributos HTML naturais, até o poderoso método `query()` para seletores CSS e XPath. Você também aprenderá sobre auxiliares de travessia do DOM que permitem navegar pela estrutura da página eficientemente.
+## Extração de Dados
+
+Transforme páginas web em objetos Python estruturados com modelos tipados, validação automática e serialização Pydantic.
+
+**[Extração Estruturada](extraction/structured-extraction.md)**: Defina um modelo Pydantic com seletores CSS/XPath, chame `tab.extract()` e receba um objeto totalmente tipado. Suporta modelos aninhados, campos lista, extração de atributos, transforms customizados, campos opcionais com defaults e timeouts configuráveis. Sem necessidade de consulta manual elemento por elemento.
+
## Capacidades de Automação
Estas são as funcionalidades que dão vida à sua automação: simular interações do usuário, controle de teclado, lidar com operações de arquivo, trabalhar com iframes e capturar conteúdo visual.
diff --git a/docs/zh/features/extraction/structured-extraction.md b/docs/zh/features/extraction/structured-extraction.md
new file mode 100644
index 00000000..17442348
--- /dev/null
+++ b/docs/zh/features/extraction/structured-extraction.md
@@ -0,0 +1,329 @@
+# 结构化数据提取
+
+Pydoll 的提取引擎让您使用类型化模型定义想要从页面获取**什么**数据,并自动处理**如何**获取。无需逐个手动查询元素,您只需声明一个带有选择器的模型并调用 `tab.extract()`。结果是一个由 [Pydantic](https://docs.pydantic.dev/) 驱动的、完全类型化和验证过的 Python 对象。
+
+## 为什么使用结构化提取?
+
+传统的抓取代码往往会变成一堆散落在数十行中的 `find()` 调用、`await element.text`、属性读取和手动类型转换。当页面发生变化时,您需要在代码中逐行排查哪个选择器出了问题。
+
+使用结构化提取后,所有选择器都集中在一个地方(模型),类型会自动强制执行,输出是一个干净的 Pydantic 对象,内置 IDE 自动补全和序列化功能。
+
+## 基本用法
+
+### 定义模型
+
+提取模型是一个继承自 `ExtractionModel` 的类。每个字段使用 `Field()` 来声明 CSS 或 XPath 选择器。
+
+```python
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='The quote text')
+ author: str = Field(selector='.author', description='Who said it')
+ tags: list[str] = Field(selector='.tag', description='Associated tags')
+```
+
+`selector` 参数同时接受 CSS 选择器和 XPath 表达式。Pydoll 会自动检测类型,与 `tab.query()` 的行为完全一致。
+
+### 提取单个项目
+
+使用 `tab.extract()` 从页面填充一个模型实例:
+
+```python
+from pydoll.browser.chromium import Chrome
+
+async with Chrome() as browser:
+ tab = await browser.start()
+ await tab.go_to('https://example.com/article')
+
+ article = await tab.extract(Article)
+ print(article.title) # str, fully typed
+ print(article.model_dump()) # dict via pydantic
+```
+
+### 提取多个项目
+
+使用 `tab.extract_all()` 并配合 `scope` 选择器来标识重复的容器。每个匹配项生成一个模型实例,字段相对于该容器解析。
+
+```python
+quotes = await tab.extract_all(Quote, scope='.quote')
+
+for q in quotes:
+ print(f'{q.author}: {q.text}')
+ print(q.tags)
+```
+
+您可以限制结果数量:
+
+```python
+top_5 = await tab.extract_all(Quote, scope='.quote', limit=5)
+```
+
+## Field 选项
+
+`Field()` 函数接受以下参数:
+
+| 参数 | 类型 | 描述 |
+|---------------|-------------------------|--------------------------------------------------------------|
+| `selector` | `str` 或 `None` | CSS 或 XPath 选择器(自动检测) |
+| `attribute` | `str` 或 `None` | 要读取的 HTML 属性,而非内部文本 |
+| `description` | `str` 或 `None` | 字段的语义描述 |
+| `default` | 任意值 | 未找到元素时的默认值 |
+| `transform` | callable 或 `None` | 应用于原始字符串的后处理函数 |
+
+必须提供 `selector` 或 `description` 中的至少一个。仅有 `description`(无 selector)的字段保留用于未来基于 LLM 的提取,当前 CSS 引擎会跳过这些字段。
+
+## 属性提取
+
+默认情况下,引擎读取元素的可见文本(`innerText`)。要读取 HTML 属性,请使用 `attribute` 参数:
+
+```python
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ published: str = Field(
+ selector='time.date',
+ attribute='datetime',
+ description='ISO publication date',
+ )
+ image_url: str = Field(
+ selector='.hero img',
+ attribute='src',
+ description='Hero image URL',
+ )
+ link: str = Field(
+ selector='a.source',
+ attribute='href',
+ description='Source link',
+ )
+ image_id: str = Field(
+ selector='.hero img',
+ attribute='data-id',
+ description='Custom data attribute',
+ )
+```
+
+Any HTML attribute works, including `data-*`, `aria-*`, `href`, `src`, `alt`, and custom attributes.
+
+## Transform functions
+
+The `transform` parameter accepts a callable that receives the raw string from the DOM and returns the desired type. This is where you convert strings to numbers, parse dates, or clean up formatting.
+
+```python
+from datetime import datetime
+
+def parse_price(raw: str) -> float:
+ return float(raw.replace('R$', '').replace('.', '').replace(',', '.').strip())
+
+def parse_date(raw: str) -> datetime:
+ return datetime.strptime(raw.strip(), '%B %d, %Y')
+
+class Product(ExtractionModel):
+ name: str = Field(selector='.name', description='Product name')
+ price: float = Field(
+ selector='.price',
+ description='Price in BRL',
+ transform=parse_price,
+ )
+ release: datetime = Field(
+ selector='.release-date',
+ description='Release date',
+ transform=parse_date,
+ )
+```
+
+Transform functions run **before** Pydantic validation, so the field's type annotation should match the transform's return value.
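Because transforms are plain callables with no dependency on the browser, they can be unit-tested in isolation. For instance, the `parse_price` function from the example above works standalone:

```python
def parse_price(raw: str) -> float:
    # Strip Brazilian currency formatting: 'R$ 1.234,56' -> 1234.56
    return float(raw.replace('R$', '').replace('.', '').replace(',', '.').strip())

assert parse_price('R$ 1.234,56') == 1234.56
assert parse_price('R$ 99,90') == 99.9
```

Testing transforms this way catches formatting edge cases without launching a browser.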
+
+## Nested models
+
+When a field's type is another `ExtractionModel`, the engine uses that field's selector to locate a scope element, then extracts the nested model's fields within that scope.
+
+```python
+class Author(ExtractionModel):
+ name: str = Field(selector='.name', description='Author name')
+ avatar: str = Field(
+ selector='img.avatar',
+ attribute='src',
+ description='Avatar URL',
+ )
+ bio: str = Field(selector='.bio', description='Short bio')
+
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ author: Author = Field(
+ selector='.author-card',
+ description='Author information',
+ )
+```
+
+The `.author-card` selector defines the scope. The fields of `Author` (`.name`, `img.avatar`, `.bio`) are resolved **inside** that element, not against the whole page. This prevents selector collisions when the page has multiple `.name` elements in different regions.
+
+### Lists of nested models
+
+You can also extract lists of nested models:
+
+```python
+class Contributor(ExtractionModel):
+ name: str = Field(selector='.name', description='Contributor name')
+ role: str = Field(selector='.role', description='Role')
+
+class Project(ExtractionModel):
+ title: str = Field(selector='h1', description='Project title')
+ contributors: list[Contributor] = Field(
+ selector='.contributor',
+ description='Project contributors',
+ )
+```
+
+Each `.contributor` element becomes the scope for one `Contributor` instance.
+
+## Optional fields and defaults
+
+Fields that may not be present on every page should use `Optional` and a `default`:
+
+```python
+from typing import Optional
+
+class Article(ExtractionModel):
+ title: str = Field(selector='h1', description='Title')
+ subtitle: Optional[str] = Field(
+ selector='.subtitle',
+ description='Optional subtitle',
+ default=None,
+ )
+ category: str = Field(
+ selector='.category',
+ description='Category with fallback',
+ default='uncategorized',
+ )
+```
+
+When an element is not found:
+
+- Fields **with** a default silently use that default.
+- Fields **without** a default (required fields) raise `FieldExtractionFailed`.
+
+Both `typing.Optional[str]` and the PEP 604 syntax `str | None` are supported.
+
+## Timeouts and waiting
+
+The `timeout` parameter controls how long the engine waits for elements to appear, in seconds. It propagates to every internal query, including nested models and list fields.
+
+```python
+# Wait up to 10 seconds for elements to appear
+article = await tab.extract(Article, timeout=10)
+
+# No waiting (default), elements must already be in the DOM
+article = await tab.extract(Article)
+
+# Also works with extract_all
+quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
+```
+
+This uses the same polling mechanism as `tab.query(timeout=...)`, so no manual `asyncio.sleep()` calls are needed between navigation and extraction.
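The polling idea can be sketched in a few lines. This is an illustrative simplification (the `poll_until` name and structure are hypothetical), not Pydoll's internal implementation:

```python
import asyncio
import time

async def poll_until(get, timeout: float, interval: float = 0.1):
    # Re-check the DOM until a result appears or the deadline passes
    deadline = time.monotonic() + timeout
    while True:
        result = get()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('element did not appear in time')
        await asyncio.sleep(interval)

# Example: the "element" appears on the third check
attempts = {'n': 0}

def fake_query():
    attempts['n'] += 1
    return 'element' if attempts['n'] >= 3 else None

print(asyncio.run(poll_until(fake_query, timeout=2)))  # element
```

The key point is that waiting happens inside the query itself, so callers never need to sleep explicitly.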
+
+## Scoped extraction
+
+The `scope` parameter restricts extraction to a specific region of the page:
+
+```python
+# Extract only from the main article, ignoring sidebar/footer
+article = await tab.extract(Article, scope='#main-article')
+
+# extract_all requires scope (it defines the repeating container)
+quotes = await tab.extract_all(Quote, scope='.quote')
+```
+
+## XPath selectors
+
+XPath expressions are detected automatically (they start with `/` or `./`) and work everywhere a CSS selector does:
+
+```python
+class SearchResult(ExtractionModel):
+ title: str = Field(
+ selector='//h3[@class="title"]',
+ description='Result title via XPath',
+ )
+ url: str = Field(
+ selector='.//a',
+ attribute='href',
+ description='Result URL',
+ )
+```
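The detection rule is simple enough to sketch in plain Python. The helper below is an illustrative heuristic based on the documented rule, not Pydoll's actual implementation:

```python
def looks_like_xpath(selector: str) -> bool:
    # Documented rule: XPath starts with '/' (absolute) or './' (relative)
    return selector.startswith('/') or selector.startswith('./')

assert looks_like_xpath('//h3[@class="title"]')
assert looks_like_xpath('./a')
assert looks_like_xpath('/html/body/div')
assert not looks_like_xpath('.quote')     # CSS class selector
assert not looks_like_xpath('a.source')   # CSS element selector
```

Because the two syntaxes never overlap under this rule, no explicit flag is needed to choose between them.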
+
+## Error handling
+
+The extraction engine raises specific exceptions that you can catch and handle:
+
+```python
+from pydoll.extractor import FieldExtractionFailed, InvalidExtractionModel
+
+# InvalidExtractionModel: raised at model definition time
+# when a Field has neither selector nor description
+try:
+ class BadModel(ExtractionModel):
+ field: str = Field() # no selector, no description
+except InvalidExtractionModel:
+ print('Invalid model definition')
+
+# FieldExtractionFailed: raised at extraction time
+# when a required field's element is not found
+try:
+ result = await tab.extract(MyModel)
+except FieldExtractionFailed as e:
+ print(f'Extraction failed: {e}')
+```
+
+For optional fields, extraction failures are handled silently and the default is used. Only required fields (those without a `default`) raise.
+
+## Pydantic integration
+
+`ExtractionModel` inherits from `pydantic.BaseModel`, so all Pydantic features work out of the box:
+
+```python
+article = await tab.extract(Article)
+
+# Serialization
+article.model_dump() # dict
+article.model_dump_json() # JSON string
+
+# JSON Schema (useful for API docs or LLM prompts)
+Article.model_json_schema()
+
+# Validation happens automatically
+# If a transform returns the wrong type, Pydantic raises ValidationError
+```
+
+You can use any Pydantic feature in your models: validators, field aliases, model config, and so on. The extraction engine adds a selector/transform layer on top and does not interfere with Pydantic's behavior.
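For instance, a standard Pydantic v2 `field_validator` works as usual. The sketch below uses a plain `BaseModel` to keep it browser-free; since `ExtractionModel` subclasses `BaseModel`, the same pattern should apply there:

```python
from pydantic import BaseModel, field_validator

class Quote(BaseModel):
    text: str
    author: str

    @field_validator('text')
    @classmethod
    def strip_curly_quotes(cls, v: str) -> str:
        # Remove surrounding curly quotation marks often present in scraped text
        return v.strip('\u201c\u201d')

q = Quote(text='\u201cHello\u201d', author='Someone')
assert q.text == 'Hello'
```

Validators run after extraction and any `transform`, so they are a good place for model-level cleanup rules.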
+
+## Complete example
+
+Here is a complete, runnable example that extracts quotes from [quotes.toscrape.com](https://quotes.toscrape.com):
+
+```python
+import asyncio
+from pydoll.browser.chromium import Chrome
+from pydoll.extractor import ExtractionModel, Field
+
+class Quote(ExtractionModel):
+ text: str = Field(selector='.text', description='The quote text')
+ author: str = Field(selector='.author', description='Who said the quote')
+ tags: list[str] = Field(selector='.tag', description='Associated tags')
+
+async def main():
+ async with Chrome() as browser:
+ tab = await browser.start()
+ await tab.go_to('https://quotes.toscrape.com')
+
+ quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)
+
+ print(f'Extracted {len(quotes)} quotes\n')
+ for q in quotes:
+ print(f'"{q.text}"')
+ print(f' by {q.author} | tags: {", ".join(q.tags)}\n')
+
+ # Pydantic serialization
+ for q in quotes:
+ print(q.model_dump_json())
+
+asyncio.run(main())
+```
diff --git a/docs/zh/features/index.md b/docs/zh/features/index.md
index 1320bc3d..23ebac32 100644
--- a/docs/zh/features/index.md
+++ b/docs/zh/features/index.md
@@ -18,6 +18,12 @@
**[Element Finding](element-finding.md)**: Master Pydoll's element location strategies, from the intuitive `find()` method using natural HTML attributes to the powerful `query()` method for CSS selectors and XPath. You will also learn the DOM traversal helpers that let you navigate page structure efficiently.
+## Data Extraction
+
+Turn web pages into structured Python objects, with typed models, automatic validation, and Pydantic serialization.
+
+**[Structured Extraction](extraction/structured-extraction.md)**: Define a Pydantic model with CSS/XPath selectors, call `tab.extract()`, and get fully typed objects back. Supports nested models, list fields, attribute extraction, custom transform functions, optional fields with defaults, and configurable timeouts. No manual element-by-element querying.
+
## Automation Capabilities

These features bring your automation to life: simulating user interactions, keyboard control, handling file operations, working with iframes, and capturing visual content.
diff --git a/mkdocs.yml b/mkdocs.yml
index c2bef711..04e72e48 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -12,6 +12,8 @@ nav:
- Overview: features/index.md
- Core Concepts: features/core-concepts.md
- Element Finding: features/element-finding.md
+ - Data Extraction:
+ - Structured Extraction: features/extraction/structured-extraction.md
- Automation:
- Human-Like Interactions: features/automation/human-interactions.md
- Keyboard Control: features/automation/keyboard-control.md
From 16eb0efb37b681fce45ede3b9428290e545acce1 Mon Sep 17 00:00:00 2001
From: Thalison Fernandes
Date: Sun, 22 Mar 2026 17:57:53 -0300
Subject: [PATCH 19/21] fix(extractor): correct coroutine type annotation for
mypy
---
pydoll/extractor/engine.py | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/pydoll/extractor/engine.py b/pydoll/extractor/engine.py
index cd71394a..9b957675 100644
--- a/pydoll/extractor/engine.py
+++ b/pydoll/extractor/engine.py
@@ -5,6 +5,7 @@
import asyncio
import logging
import types
+from collections.abc import Coroutine
from typing import TYPE_CHECKING, Optional, TypeVar, Union, get_args, get_origin
from pydoll.elements.mixins.find_elements_mixin import FindElementsMixin
@@ -119,7 +120,9 @@ async def _extract_fields(
Dictionary of field name -> extracted value.
"""
field_names: list[str] = []
- tasks: list[asyncio.Task[Union[str, int, float, bool, list[str], object]]] = []
+ coroutines: list[
+ Coroutine[None, None, Union[str, int, float, bool, list[str], object]]
+ ] = []
for name, metadata in model.get_extraction_fields().items():
if not metadata.has_selector:
@@ -132,9 +135,9 @@ async def _extract_fields(
continue
field_names.append(name)
- tasks.append(self._extract_field(metadata, annotation, context, timeout))
+ coroutines.append(self._extract_field(metadata, annotation, context, timeout))
- results = await asyncio.gather(*tasks, return_exceptions=True)
+ results = await asyncio.gather(*coroutines, return_exceptions=True)
values: dict[str, Union[str, int, float, bool, list[str], object]] = {}
for name, result in zip(field_names, results):
From b4408b4403c31d0d5967971fcb5ada8a80e51b55 Mon Sep 17 00:00:00 2001
From: Thalison Fernandes
Date: Sun, 22 Mar 2026 18:28:08 -0300
Subject: [PATCH 20/21] fix(test): filter only DeprecationWarning in interval
deprecated test
---
tests/test_interactions/test_keyboard.py | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/tests/test_interactions/test_keyboard.py b/tests/test_interactions/test_keyboard.py
index 821b8561..d36e96e0 100644
--- a/tests/test_interactions/test_keyboard.py
+++ b/tests/test_interactions/test_keyboard.py
@@ -519,12 +519,14 @@ async def test_type_text_interval_deprecated_warning(self, keyboard_api):
import warnings
with warnings.catch_warnings(record=True) as w:
- warnings.simplefilter("always")
+ warnings.simplefilter("always", DeprecationWarning)
await keyboard_api.type_text("a", interval=0.1)
- assert len(w) == 1
- assert issubclass(w[0].category, DeprecationWarning)
- assert "interval" in str(w[0].message)
+ deprecation_warnings = [
+ x for x in w if issubclass(x.category, DeprecationWarning)
+ ]
+ assert len(deprecation_warnings) == 1
+ assert "interval" in str(deprecation_warnings[0].message)
@pytest.mark.asyncio
async def test_type_char_calls_focus(self, keyboard_api, mock_tab):
From 597a914622f0f569f898ce8abd09812f10cc83a1 Mon Sep 17 00:00:00 2001
From: Thalison Fernandes
Date: Sun, 22 Mar 2026 18:32:49 -0300
Subject: [PATCH 21/21] refactor(extractor): parallelize list field extraction
with asyncio.gather
---
pydoll/extractor/engine.py | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/pydoll/extractor/engine.py b/pydoll/extractor/engine.py
index 9b957675..0d33d86a 100644
--- a/pydoll/extractor/engine.py
+++ b/pydoll/extractor/engine.py
@@ -204,17 +204,13 @@ async def _extract_list_field(
inner_type = _get_inner_type(annotation)
if _is_extraction_model(inner_type):
- results: list[Union[str, int, float, bool, object]] = []
- for element in elements:
- field_values = await self._extract_fields(inner_type, element, timeout)
- results.append(_build_instance(inner_type, field_values))
- return results
-
- scalar_values: list[Union[str, int, float, bool, object]] = []
- for element in elements:
- raw = await _extract_value(element, metadata)
- scalar_values.append(_apply_transform(raw, metadata))
- return scalar_values
+ all_field_values = await asyncio.gather(
+ *(self._extract_fields(inner_type, el, timeout) for el in elements)
+ )
+ return [_build_instance(inner_type, fv) for fv in all_field_values]
+
+ all_raw = await asyncio.gather(*(_extract_value(el, metadata) for el in elements))
+ return [_apply_transform(raw, metadata) for raw in all_raw]
async def _extract_nested_model(
self,