21 commits, all by thalissonvs, Mar 22, 2026:

- `0b8fca9` feat: add pydantic as required dependency
- `a11380c` feat(extractor): add extraction exception hierarchy
- `e84cc1a` feat(extractor): add Field descriptor and ExtractionMetadata
- `0bb1bd3` feat(extractor): add ExtractionModel base class
- `c852fe4` feat(extractor): add extraction engine with CSS/XPath support
- `9ac2221` feat(extractor): add module public API exports
- `5f36762` feat(extractor): integrate extract and extract_all into Tab
- `e21b1a3` test(extractor): add integration tests with real browser
- `219267e` docs(extractor): add usage example with quotes.toscrape.com
- `a20858b` Revert "docs(extractor): add usage example with quotes.toscrape.com"
- `48e26ce` chore(deps): update dependencies and add new packages to poetry.lock
- `4c9ed4d` fix(extractor): resolve mypy type errors
- `510b461` style: apply ruff formatting
- `533603c` refactor(extractor): use asyncio.gather for concurrent field extraction
- `dffc2ac` test(extractor): add concurrent extraction tests
- `67a7421` docs: restructure README with extraction showcase and updated positio…
- `af7cc3c` docs: update landing pages with extractor examples in all languages
- `35f4898` docs(extractor): add structured extraction guide in en, pt, zh
- `16eb0ef` fix(extractor): correct coroutine type annotation for mypy
- `b4408b4` fix(test): filter only DeprecationWarning in interval deprecated test
- `597a914` refactor(extractor): parallelize list field extraction with asyncio.g…
`README.md`: 200 changes (120 additions, 80 deletions)
<a href="#support">Support</a>
</p>

Pydoll automates Chromium-based browsers (Chrome, Edge) by connecting directly to the Chrome DevTools Protocol over WebSocket. **No WebDriver binary, no `navigator.webdriver` flag, no compatibility issues.**

It combines a high-level API for stealthy automation with low-level CDP access for fine-grained control over network, fingerprinting, and browser behavior. And with its new **Pydantic-powered extraction engine**, it maps the DOM directly to structured Python objects, delivering an unmatched Developer Experience (DX).

### Top Sponsors


### Why Pydoll

- **Structured extraction**: Define a [Pydantic](https://docs.pydantic.dev/) model, call `tab.extract()`, get typed and validated data back. No manual element-by-element querying.
- **Async and typed**: Built on `asyncio` from the ground up, 100% type-checked with `mypy`. Full IDE autocompletion and static error checking.
- **Stealth built in**: Human-like mouse movement, realistic typing, and granular [browser preference](https://pydoll.tech/docs/features/configuration/browser-preferences/) control for fingerprint management.
- **Network control**: [Intercept](https://pydoll.tech/docs/features/network/interception/) requests to block ads/trackers, [monitor](https://pydoll.tech/docs/features/network/monitoring/) traffic for API discovery, and make [authenticated HTTP requests](https://pydoll.tech/docs/features/network/http-requests/) that inherit the browser session.
- **Shadow DOM and iframes**: Full support for [shadow roots](https://pydoll.tech/docs/deep-dive/architecture/shadow-dom/) (including closed) and cross-origin iframes. Discover, query, and interact with elements inside them using the same API.
- **Ergonomic API**: `tab.find()` for most cases, `tab.query()` for complex [CSS/XPath selectors](https://pydoll.tech/docs/deep-dive/guides/selectors-guide/).

## Installation

```bash
pip install pydoll-python
```

No WebDriver binaries or external dependencies required.

## Getting Started

### 1. Stateful Automation & Evasion

When you need to navigate, bypass challenges, or interact with dynamic UI, Pydoll's imperative API handles it with humanized timing by default.

```python
import asyncio

from pydoll.browser import Chrome
from pydoll.constants import Key

async def google_search(query: str):
    async with Chrome() as browser:
        tab = await browser.start()
        await tab.go_to('https://www.google.com')

        # Find elements and interact with human-like timing
        search_box = await tab.find(tag_name='textarea', name='q')
        await search_box.insert_text(query)
        await tab.keyboard.press(Key.ENTER)

        first_result = await tab.find(
            tag_name='h3',
            text='autoscrape-labs/pydoll',
            timeout=10,
        )
        await first_result.click()
        print(f'Page loaded: {await tab.title}')

asyncio.run(google_search('pydoll site:github.com'))
```

### 2. Structured Data Extraction

Once you reach the target page, switch to the declarative engine. Define what you want with a model, and Pydoll extracts it — typed, validated, and ready to use.

```python
import asyncio

from pydoll.browser.chromium import Chrome
from pydoll.extractor import ExtractionModel, Field

class Quote(ExtractionModel):
    text: str = Field(selector='.text', description='The quote text')
    author: str = Field(selector='.author', description='Who said it')
    tags: list[str] = Field(selector='.tag', description='Tags')
    year: int | None = Field(selector='.year', description='Year', default=None)

async def extract_quotes():
    async with Chrome() as browser:
        tab = await browser.start()
        await tab.go_to('https://quotes.toscrape.com')

        quotes = await tab.extract_all(Quote, scope='.quote', timeout=5)

        for q in quotes:
            print(f'{q.author}: {q.text}')  # fully typed, IDE autocomplete works
            print(q.tags)  # list[str], not a raw element
            print(q.model_dump_json())  # pydantic serialization built-in

asyncio.run(extract_quotes())
```
Models support CSS/XPath auto-detection, HTML attribute targeting, custom transforms, and nested models.
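The auto-detection mentioned above can be pictured as a small dispatch on the selector string. This is a hypothetical sketch of the idea only, not Pydoll's actual implementation:

```python
def looks_like_xpath(selector: str) -> bool:
    # Heuristic: XPath expressions typically begin with '/', '//', '(' or './'
    s = selector.strip()
    return s.startswith(('/', '(', './'))

print(looks_like_xpath('//div[@class="quote"]'))  # True
print(looks_like_xpath('.quote > .text'))         # False
```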

<details>
<summary><b>Nested models, transforms, and attribute extraction</b></summary>
<br>

```python
from datetime import datetime
from pydoll.extractor import ExtractionModel, Field

def parse_date(raw: str) -> datetime:
    return datetime.strptime(raw.strip(), '%B %d, %Y')

class Author(ExtractionModel):
    name: str = Field(selector='.author-title')
    born: datetime = Field(
        selector='.author-born-date',
        transform=parse_date,
    )

class Article(ExtractionModel):
    title: str = Field(selector='h1')
    url: str = Field(selector='.source-link', attribute='href')
    author: Author = Field(selector='.author-card', description='Nested model')

article = await tab.extract(Article, timeout=5)
article.author.born.year  # int — types are preserved all the way down
```
</details>
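Because `transform` hooks are plain callables, they can be checked in isolation before wiring them into a model. For instance, the `parse_date` helper above:

```python
from datetime import datetime

def parse_date(raw: str) -> datetime:
    # Same helper as in the model above: strips padding, then parses
    return datetime.strptime(raw.strip(), '%B %d, %Y')

born = parse_date('  March 14, 1879  ')
print(born.year)  # 1879
```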

## Features

<details>
<summary><b>Humanized Mouse Movement</b></summary>
<br>

Mouse operations produce human-like cursor movement by default:

- **Bezier curve paths** with asymmetric control points
- **Fitts's Law timing**: duration scales with distance
- **Minimum-jerk velocity**: bell-shaped speed profile
- **Physiological tremor**: Gaussian noise scaled with velocity
- **Overshoot correction**: ~70% chance on fast movements, then corrects back

```python
await tab.mouse.move(500, 300)
await tab.mouse.click(500, 300)
await tab.mouse.drag(100, 200, 500, 400)

button = await tab.find(id='submit')
await button.click()

# Opt out when speed matters
await tab.mouse.click(500, 300, humanize=False)
```

[Mouse Control Docs](https://pydoll.tech/docs/features/automation/mouse-control/)
</details>
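The first two timing ideas can be sketched in a few lines: Fitts's Law for duration, plus a minimum-jerk curve for progress along the path. This is a toy illustration, not Pydoll's implementation, and the constants are made up:

```python
import math

def fitts_duration(distance: float, target_width: float = 20.0,
                   a: float = 0.1, b: float = 0.12) -> float:
    # Fitts's Law: T = a + b * log2(D / W + 1); farther targets take longer
    return a + b * math.log2(distance / target_width + 1)

def minimum_jerk(t: float) -> float:
    # Normalized minimum-jerk position profile for t in [0, 1]:
    # starts and ends at zero velocity, with a bell-shaped speed peak in the middle
    return 10 * t**3 - 15 * t**4 + 6 * t**5

print(round(fitts_duration(800), 3))
print(minimum_jerk(0.5))  # 0.5, the symmetric midpoint of the curve
```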

<details>
<summary><b>Shadow DOM and iFrames</b></summary>
<br>

Highlights:
- `deep=True` traverses cross-origin iframes (OOPIFs)
- Standard `find()`, `query()`, `click()` API inside shadow roots

```python
# Cloudflare Turnstile inside a cross-origin iframe
shadow_roots = await tab.find_shadow_roots(deep=True, timeout=10)
for sr in shadow_roots:
checkbox = await sr.query('input[type="checkbox"]', raise_exc=False)
if checkbox:
await checkbox.click()
```

[Shadow DOM Docs](https://pydoll.tech/docs/deep-dive/architecture/shadow-dom/)
</details>

<details>
<summary><b>HAR Network Recording</b></summary>
<br>

Record network activity during a browser session and export as HAR 1.2. Replay recorded requests to reproduce exact API sequences.

```python
from pydoll.browser.chromium import Chrome

async with Chrome() as browser:
    tab = await browser.start()

    async with tab.request.record() as capture:
        await tab.go_to('https://example.com')

    capture.save('flow.har')
    print(f'Captured {len(capture.entries)} requests')

    responses = await tab.request.replay('flow.har')
```

[HAR Recording Docs](https://pydoll.tech/docs/features/network/network-recording/)
</details>

<details>
<summary><b>Page Bundles</b></summary>
<br>

Save the current page and all its assets (CSS, JS, images, fonts) as a `.zip` bundle for offline viewing. Optionally inline everything into a single HTML file.

```python
await tab.save_bundle('page.zip')
await tab.save_bundle('page-inline.zip', inline_assets=True)
```

[Screenshots, PDFs & Bundles Docs](https://pydoll.tech/docs/features/automation/screenshots-and-pdfs/)
</details>
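Since a saved HAR file is ordinary JSON, recorded flows can be inspected without a browser at all. A minimal sketch follows, with field names taken from the HAR 1.2 spec; the entry here is hand-built for illustration, not a real capture:

```python
import json

# Minimal HAR 1.2 skeleton
har = {
    'log': {
        'version': '1.2',
        'creator': {'name': 'pydoll', 'version': '0'},
        'entries': [
            {
                'request': {'method': 'GET', 'url': 'https://example.com/api'},
                'response': {'status': 200},
            }
        ],
    }
}

# Round-trip through JSON the way a .har file would be written and read back
data = json.loads(json.dumps(har))
urls = [e['request']['url'] for e in data['log']['entries']]
print(urls)  # ['https://example.com/api']
```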

<details>
<summary><b>Hybrid Automation (UI + API)</b></summary>