
Implement declarative extractors#387

Open
thalissonvs wants to merge 21 commits into main from feat/extractor

Conversation


@thalissonvs thalissonvs commented Mar 22, 2026

Summary by CodeRabbit

  • New Features
    • Model-driven structured extraction: typed, validated tab.extract() and tab.extract_all() APIs with Field-declared selectors, nested models, lists, attribute extraction, transforms, scoping, limits, and clear required/optional semantics; new extractor exceptions and public extraction primitives.
  • Documentation
    • New and updated docs and README examples for structured extraction (en/pt/zh) and updated quickstart.
  • Tests
    • New integration tests and HTML fixtures covering extraction scenarios, transforms, XPath, limits, concurrency, and error cases.

@thalissonvs thalissonvs self-assigned this Mar 22, 2026

coderabbitai bot commented Mar 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a structured DOM-to-model extraction subsystem: ExtractionModel and Field metadata, a new ExtractionEngine with async extract/extract_all, extractor exceptions and package exports, Tab-level integration, tests and HTML fixtures, multilingual documentation, and a pydantic runtime dependency.

Changes

  • Extraction model & field metadata — pydoll/extractor/model.py, pydoll/extractor/field.py: Adds ExtractionModel (pydantic-based) with per-subclass extraction-field caching and a Field() factory that registers ExtractionMetadata (selector, attribute, transform) into a registry consumed at model creation.
  • Extraction engine — pydoll/extractor/engine.py: New ExtractionEngine implementing extract() and extract_all(); orchestrates DOM queries and handles scalar fields, lists, nested models, transforms, concurrent field extraction, and instance building/validation, raising FieldExtractionFailed for required-field failures.
  • Exceptions & package API — pydoll/extractor/exceptions.py, pydoll/extractor/__init__.py: Adds extractor-specific exceptions (ExtractionException, FieldExtractionFailed, InvalidExtractionModel) and re-exports core types (ExtractionModel, Field, ExtractionMetadata) via the package __all__.
  • Tab integration — pydoll/browser/tab.py: Adds a generic T and async extract()/extract_all() methods on Tab, with a lazily instantiated _extraction_engine delegating extraction calls.
  • Tests & fixtures — tests/pages/test_extractor.html, tests/test_extractor/test_extractor_integration.py: Adds an HTML fixture and comprehensive integration tests covering single/batch extraction, attributes, lists, transforms, nested models, scoping, XPath, limits, optional/required semantics, validation errors, and concurrency.
  • Docs & site config — docs/**/features/extraction/structured-extraction.md, docs/**/features/index.md, docs/**/index.md, mkdocs.yml: Adds structured-extraction docs (EN/PT/ZH), updates the feature index and quickstart examples, and adds a navigation entry for Structured Extraction.
  • Build/dependency — pyproject.toml: Adds the runtime dependency pydantic = "^2.0".
  • Minor formatting tweaks — pydoll/commands/accessibility_commands.py, pydoll/interactions/iframe.py, pydoll/protocol/accessibility/methods.py: Small non-functional formatting/assignment changes; signatures and behavior unchanged.
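The Field()-factory pattern the model/field cohort describes — metadata recorded when Field() is called in the class body, then harvested when the model subclass is created — can be sketched without pydantic or pydoll. Everything below (the names ExtractionMetadata, Field, ExtractionModel, and the `__init_subclass__` hook) is an illustrative reconstruction of the pattern, not the PR's actual code, which layers this onto pydantic's FieldInfo.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass(frozen=True)
class ExtractionMetadata:
    # Hypothetical shape mirroring the walkthrough: selector, attribute, transform.
    selector: Optional[str] = None
    attribute: Optional[str] = None
    transform: Optional[Callable[[str], Any]] = None

def Field(selector: Optional[str] = None,
          attribute: Optional[str] = None,
          transform: Optional[Callable[[str], Any]] = None) -> Any:
    # At class-body evaluation time this just returns the metadata object;
    # the base-class hook below harvests it when the subclass is created.
    return ExtractionMetadata(selector, attribute, transform)

class ExtractionModel:
    _extraction_fields: dict[str, ExtractionMetadata] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        fields = {}
        for name, value in list(vars(cls).items()):
            if isinstance(value, ExtractionMetadata):
                fields[name] = value
                delattr(cls, name)  # the engine fills these in at extraction time
        cls._extraction_fields = fields  # per-subclass cache

class Quote(ExtractionModel):
    text: str = Field(selector='.quote .text')
    author: str = Field(selector='.quote .author', transform=str.strip)

print(sorted(Quote._extraction_fields))           # ['author', 'text']
print(Quote._extraction_fields['text'].selector)  # .quote .text
```

The key design point is that the registry is keyed per subclass, so two models declaring the same field name never collide.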

Sequence Diagram

sequenceDiagram
    participant Tab
    participant ExtractionEngine
    participant DOM_Page
    participant Pydantic_Model

    rect rgba(100, 150, 200, 0.5)
    Note over Tab,ExtractionEngine: extract(model) — single instance
    Tab->>Tab: access _extractor (lazy init)
    Tab->>ExtractionEngine: extract(model, scope, timeout)
    ExtractionEngine->>ExtractionEngine: get_extraction_fields()
    loop for each field
        ExtractionEngine->>DOM_Page: query selector / attribute
        DOM_Page-->>ExtractionEngine: element(s) / text / attribute
        ExtractionEngine->>ExtractionEngine: apply transform / recurse for nested
    end
    ExtractionEngine->>Pydantic_Model: construct & validate model
    Pydantic_Model-->>ExtractionEngine: validated instance
    ExtractionEngine-->>Tab: model instance
    end

    rect rgba(150, 200, 100, 0.5)
    Note over Tab,ExtractionEngine: extract_all(model) — repeated containers
    Tab->>ExtractionEngine: extract_all(model, scope, limit)
    ExtractionEngine->>DOM_Page: query container selector
    DOM_Page-->>ExtractionEngine: list of container elements
    loop per container (up to limit)
        ExtractionEngine->>DOM_Page: query fields scoped to container
        DOM_Page-->>ExtractionEngine: field values
        ExtractionEngine->>Pydantic_Model: construct & validate instance
    end
    ExtractionEngine-->>Tab: list of model instances
    end
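The extract_all flow in the second diagram pane — query containers, run field queries scoped to each container, build an instance per container, respect the limit — can be sketched against a fake DOM. The names query_all, query_scoped, and the dict-based "elements" are stand-ins, not pydoll's API.

```python
import asyncio

# Fake "DOM": each container is a dict mapping a field selector to its text.
PAGE = [
    {'.title': 'First post', '.author': 'ana'},
    {'.title': 'Second post', '.author': 'bo'},
    {'.title': 'Third post', '.author': 'cy'},
]

async def query_all(selector):
    # Stands in for the engine's container query against the page.
    return PAGE if selector == '.post' else []

async def query_scoped(container, selector):
    # Stands in for a field query scoped to one container element.
    return container.get(selector)

async def extract_all(container_selector, field_selectors, limit=None):
    containers = await query_all(container_selector)
    if limit is not None:
        containers = containers[:limit]
    instances = []
    for container in containers:
        # Field queries within one container run concurrently, mirroring
        # the concurrent field extraction the walkthrough mentions.
        values = await asyncio.gather(
            *(query_scoped(container, sel) for sel in field_selectors.values())
        )
        instances.append(dict(zip(field_selectors, values)))
    return instances

posts = asyncio.run(
    extract_all('.post', {'title': '.title', 'author': '.author'}, limit=2)
)
print(posts)  # two dicts, in page order
```

In the real engine the built dict would be validated through the pydantic model rather than returned raw.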

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
I hop through nodes at break of dawn,
Selectors in paw, transforms on,
I stitch small fields into tidy molds,
Nest and list — carrots turned to golds,
Validated crumbs in my burrow drawn.

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 warnings

❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning — The pull request has no description from the author, so it is impossible to assess whether the required template sections (Description, Related Issues, Type of Change, Testing) are present or adequately filled. Resolution: add a comprehensive PR description following the repository's template, including objectives, type of change, testing details, and API-change documentation.
  • Docstring coverage ⚠️ Warning — Docstring coverage is 72.13%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (1 passed)
  • Title check — The title 'Implement declarative extractors' clearly and concisely describes the main feature addition: a declarative extraction system with ExtractionModel, Field, and tab.extract()/extract_all() APIs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (1)
tests/test_extractor/test_extractor_integration.py (1)

219-222: Replace the repeated fixed sleep with a deterministic ready check.

Lines 219-222 add a hard-coded 500 ms delay, and the same setup block repeats across the async tests. If go_to() already waits for navigation, this is just dead time; if it does not, the tests are still timing-sensitive. A shared fixture/helper that waits on a stable sentinel element from tests/pages/test_extractor.html would make this suite faster and less flaky.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_extractor/test_extractor_integration.py` around lines 219 - 222,
Replace the hard-coded asyncio.sleep(0.5) after opening FILE_URL with a
deterministic readiness check: remove the fixed delay in the block using Chrome/
browser.start/ tab.go_to(FILE_URL) and instead call a shared test helper or
fixture that waits for a stable sentinel element in
tests/pages/test_extractor.html (e.g., a unique id or data-test attribute) to
appear/ be visible before proceeding; add this helper/fixture and use it across
the async tests to reduce flakiness and speed up the suite.
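The deterministic readiness check this nitpick asks for boils down to polling until a predicate succeeds instead of sleeping a fixed 500 ms. A generic helper might look like the sketch below; the helper name and the predicate it would wrap in the tests (something like a sentinel-element query) are hypothetical, not pydoll's API.

```python
import asyncio

async def wait_for(predicate, timeout=5.0, interval=0.05):
    """Poll an async predicate until it returns a truthy value or time runs out.

    In the test suite the predicate would be something like a query for a
    sentinel element in tests/pages/test_extractor.html (hypothetical usage).
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        result = await predicate()
        if result:
            return result
        if loop.time() >= deadline:
            raise TimeoutError('condition not met within timeout')
        await asyncio.sleep(interval)

# Demo: a condition that only becomes true on the third poll.
calls = {'n': 0}

async def appears_on_third_poll():
    calls['n'] += 1
    return 'ready' if calls['n'] >= 3 else None

result = asyncio.run(wait_for(appears_on_third_poll, timeout=1.0))
print(result)  # ready
```

Unlike a fixed sleep, this returns as soon as the page is actually ready and fails loudly when it never is.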
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pydoll/browser/tab.py`:
- Around line 130-131: Bind the TypeVar T to ExtractionModel to satisfy MyPy:
replace the unbound T defined in pydoll/browser/tab.py with a TypeVar that has
bound 'ExtractionModel', add a conditional import of ExtractionModel inside a
TYPE_CHECKING block to avoid circular imports (refer to the symbol
ExtractionModel), and keep the bound as a string annotation to prevent runtime
import issues; this will make T compatible with ExtractionEngine.extract()
usages in the file.
- Around line 90-91: The top-level import of ExtractionEngine causes
lint/circular-import issues; remove it and either add it under a TYPE_CHECKING
block or perform a runtime import inside the lazy _extractor property. Update
the _extractor property (and use the existing _extraction_engine attribute) to
import ExtractionEngine locally when _extraction_engine is None, instantiate it
with self, assign to _extraction_engine, and return it; alternatively add "from
typing import TYPE_CHECKING" and place "from pydoll.extractor.engine import
ExtractionEngine" under TYPE_CHECKING and keep the runtime import approach in
_extractor to avoid the Ruff I001 and potential circular imports.

In `@pydoll/extractor/engine.py`:
- Around line 118-122: field_info.annotation can be None but _extract_field
requires a type; add a guard before calling _extract_field in the block that
reads model.model_fields[name] and annotation: if annotation is None, set it to
typing.Any (import Any from typing) or another appropriate fallback type, then
call await self._extract_field(metadata, annotation, context, timeout) so the
call always receives a concrete type and MyPy errors are resolved.
- Around line 201-204: The call to context.query(...) returns a broad Union type
but at runtime with raise_exc=True it yields a single WebElement; update the
call to pass find_all=False (context.query(metadata.selector, find_all=False,
timeout=timeout, raise_exc=True)) and then narrow the type of scope_element
(e.g., use typing.cast(WebElement, scope_element) or an assert
isinstance(scope_element, WebElement)) before passing it to _extract_fields so
MyPy understands it's a single WebElement; reference symbols: context.query,
scope_element, _extract_fields, _build_instance.
- Around line 113-116: The code checks metadata.has_selector but MyPy doesn't
narrow metadata.selector from Optional[str] to str, causing type errors when
passing it to query(); after the existing has_selector check in the loop over
model.get_extraction_fields() and inside the helper methods _extract_list_field,
_extract_nested_model, and _extract_scalar_field, add an assertion like assert
metadata.selector is not None to guarantee metadata.selector is a str before
calling query() (or otherwise using it), so MyPy recognizes the narrowed type
and the query(metadata.selector) calls type-check cleanly.
- Around line 239-254: The return type of _apply_transform does not match the
declared return type of ExtractionMetadata.transform; update _apply_transform's
annotation to include object (i.e., Union[str, int, float, bool, object]) so it
matches metadata.transform, or alternatively change the
ExtractionMetadata.transform signature to exclude object and keep
_apply_transform as-is—locate the function _apply_transform and the
ExtractionMetadata.transform type declaration to ensure both signatures align
and MyPy errors are resolved.
- Around line 173-192: The variable elements returned from context.query is
typed as Union[WebElement, list[WebElement], None] but you iterate it as a list;
update the code in the extraction path (around where context.query(...) is
called and before the loops that use elements) to narrow/cast elements to
list[WebElement] (or assert isinstance(elements, list)) so MyPy knows it's a
list when iterating in the branches that call self._extract_fields,
_build_instance, _extract_value and _apply_transform; ensure the cast/assert is
applied before both the model-extraction loop (which calls self._extract_fields
and _build_instance) and the scalar loop (which calls _extract_value and
_apply_transform).
- Around line 207-216: In _extract_scalar_field: guard against metadata.selector
being None and narrow the type returned by context.query before passing it to
_extract_value; specifically, validate metadata.selector (from
ExtractionMetadata) and raise or return a clear error if it's None, call
context.query(metadata.selector, ...) only when non-None, then assert/handle the
optional return from FindElementsMixin.query so the value passed into
_extract_value is a concrete element type (not None) and _apply_transform
receives the correctly typed raw value; update function _extract_scalar_field
accordingly and keep references to _extract_value and _apply_transform for
downstream calls.

In `@pydoll/extractor/field.py`:
- Around line 56-102: The Field function's default parameter is typed as object
which makes MyPy infer the wrong overload for the PydanticField call; change the
signature's default parameter type from object to typing.Any (e.g., default: Any
= PydanticUndefined) and add the corresponding Any import so the default matches
Pydantic's overloaded Field signature and resolves the MyPy error in Field (the
function that constructs ExtractionMetadata and calls PydanticField).

In `@pydoll/extractor/model.py`:
- Around line 49-53: MyPy complains the method may return None because ClassVar
_extraction_fields_cache isn't narrowed; fix by ensuring a statically
non-optional return: after possibly assigning cls._extraction_fields_cache =
_collect_extraction_metadata(cls), return typing.cast(dict[str,
ExtractionMetadata], cls._extraction_fields_cache) (or assign the result to a
local non-optional variable and return that), and add the required import for
typing.cast; reference the symbols _extraction_fields_cache and
_collect_extraction_metadata when making the change.

In `@tests/test_extractor/test_extractor_integration.py`:
- Around line 19-20: Replace the manual file:// string construction for FILE_URL
with Path.as_uri() to produce a correct, platform-independent file URI: locate
the TEST_PAGE and FILE_URL definitions (symbols TEST_PAGE and FILE_URL) and set
FILE_URL using TEST_PAGE.as_uri() instead of f'file://{TEST_PAGE.absolute()}';
this ensures proper escaping and correct separators on Windows and POSIX.
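The cast-based fix described above for pydoll/extractor/model.py can be shown in miniature. The class below is a stand-in (the cache holds plain strings rather than ExtractionMetadata), but the narrowing problem and the `typing.cast` resolution are the same.

```python
from typing import ClassVar, Optional, cast

class Model:
    _extraction_fields_cache: ClassVar[Optional[dict[str, str]]] = None

    @classmethod
    def get_extraction_fields(cls) -> dict[str, str]:
        if cls._extraction_fields_cache is None:
            # Stand-in for _collect_extraction_metadata(cls) in the real code.
            cls._extraction_fields_cache = {'title': '.title'}
        # MyPy does not narrow the ClassVar across the assignment above,
        # so cast (or a local non-optional variable) makes the
        # non-optional return explicit.
        return cast(dict[str, str], cls._extraction_fields_cache)

print(Model.get_extraction_fields())  # {'title': '.title'}
```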

---

Nitpick comments:
In `@tests/test_extractor/test_extractor_integration.py`:
- Around line 219-222: Replace the hard-coded asyncio.sleep(0.5) after opening
FILE_URL with a deterministic readiness check: remove the fixed delay in the
block using Chrome/ browser.start/ tab.go_to(FILE_URL) and instead call a shared
test helper or fixture that waits for a stable sentinel element in
tests/pages/test_extractor.html (e.g., a unique id or data-test attribute) to
appear/ be visible before proceeding; add this helper/fixture and use it across
the async tests to reduce flakiness and speed up the suite.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7fdf2cd9-243b-4302-8416-7bc9dbb82890

📥 Commits

Reviewing files that changed from the base of the PR and between f6c202a and 48e26ce.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • pydoll/browser/tab.py
  • pydoll/extractor/__init__.py
  • pydoll/extractor/engine.py
  • pydoll/extractor/exceptions.py
  • pydoll/extractor/field.py
  • pydoll/extractor/model.py
  • pyproject.toml
  • tests/pages/test_extractor.html
  • tests/test_extractor/__init__.py
  • tests/test_extractor/test_extractor_integration.py

Comment on lines +19 to +20
TEST_PAGE = Path(__file__).parent.parent / 'pages' / 'test_extractor.html'
FILE_URL = f'file://{TEST_PAGE.absolute()}'

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n tests/test_extractor/test_extractor_integration.py | head -25

Repository: autoscrape-labs/pydoll

Length of output: 893


🏁 Script executed:

# Verify the pathlib.Path.as_uri() method and behavior
python3 << 'EOF'
from pathlib import Path
import tempfile
import os

# Test 1: Check if as_uri() exists and works
test_path = Path("/tmp/test file with spaces.html")
print("Test 1: as_uri() method exists and works")
print(f"  Path: {test_path}")
print(f"  as_uri(): {test_path.as_uri()}")
print()

# Test 2: Compare resolve().as_uri() vs as_uri()
cwd_path = Path(".") / "test.html"
print("Test 2: resolve().as_uri() vs as_uri()")
print(f"  as_uri(): {cwd_path.as_uri()}")
print(f"  resolve().as_uri(): {cwd_path.resolve().as_uri()}")
print()

# Test 3: Demonstrate non-ASCII character handling
non_ascii_path = Path("/tmp/test_café.html")
print("Test 3: Non-ASCII character handling")
print(f"  as_uri(): {non_ascii_path.as_uri()}")
print()

# Test 4: Demonstrate the problematic manual concatenation
print("Test 4: Manual concatenation vs as_uri()")
manual = f'file://{cwd_path.resolve().absolute()}'
correct = cwd_path.resolve().as_uri()
print(f"  Manual concatenation: {manual}")
print(f"  as_uri(): {correct}")
print(f"  Are they the same? {manual == correct}")
EOF

Repository: autoscrape-labs/pydoll

Length of output: 525


🏁 Script executed:

python3 << 'EOF'
from pathlib import Path

# Test the actual code pattern from the file
test_html_path = Path(__file__).parent / 'pages' / 'test_extractor.html'
print("Test 1: Code from test_extractor_integration.py pattern")
print(f"  TEST_PAGE: {test_html_path}")
print(f"  TEST_PAGE.absolute(): {test_html_path.absolute()}")

# Test manual concatenation (current code)
manual_uri = f'file://{test_html_path.absolute()}'
print(f"  Manual URI: {manual_uri}")
print()

# Test with as_uri() - absolute path
absolute_uri = test_html_path.absolute().as_uri()
print("Test 2: Using .absolute().as_uri()")
print(f"  as_uri() result: {absolute_uri}")
print(f"  Match manual? {manual_uri == absolute_uri}")
print()

# Test with resolve().as_uri()
resolved_uri = test_html_path.resolve().as_uri()
print("Test 3: Using .resolve().as_uri()")
print(f"  resolve().as_uri() result: {resolved_uri}")
print(f"  Match manual? {manual_uri == resolved_uri}")
print()

# Test problematic case: path with spaces
problematic_path = Path("/home/user/test dir/file with spaces.html")
manual_problematic = f'file://{problematic_path.absolute()}'
uri_problematic = problematic_path.absolute().as_uri()
print("Test 4: Path with spaces (problematic case)")
print(f"  Manual: {manual_problematic}")
print(f"  as_uri(): {uri_problematic}")
print(f"  Spaces escaped? as_uri has %20: {'%20' in uri_problematic}")
print()

# Test cross-platform: simulate Windows path behavior
print("Test 5: Windows path note")
print(f"  On Windows, absolute() returns: C:\\Users\\...\\ style paths")
print(f"  Manual f'file://{{win_path}}' would produce: file://C:\\Users\\... (WRONG)")
print(f"  as_uri() would properly produce: file:///C:/Users/... (CORRECT)")
EOF

Repository: autoscrape-labs/pydoll

Length of output: 996


🏁 Script executed:

# Also verify if the test actually runs and passes with the current code
cd tests/test_extractor && ls -la | head -20

Repository: autoscrape-labs/pydoll

Length of output: 321


Build the local page URL with Path.as_uri().

Line 20 manually concatenates a file: URI. This breaks on Windows (uses backslashes instead of forward slashes in the URI) and fails to escape special characters like spaces, even on POSIX systems.

Suggested fix
 TEST_PAGE = Path(__file__).parent.parent / 'pages' / 'test_extractor.html'
-FILE_URL = f'file://{TEST_PAGE.absolute()}'
+FILE_URL = TEST_PAGE.absolute().as_uri()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_extractor/test_extractor_integration.py` around lines 19 - 20,
Replace the manual file:// string construction for FILE_URL with Path.as_uri()
to produce a correct, platform-independent file URI: locate the TEST_PAGE and
FILE_URL definitions (symbols TEST_PAGE and FILE_URL) and set FILE_URL using
TEST_PAGE.as_uri() instead of f'file://{TEST_PAGE.absolute()}'; this ensures
proper escaping and correct separators on Windows and POSIX.
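To make the behavioral difference concrete, here is a small demo of what as_uri() escapes that manual concatenation misses (a POSIX path is assumed; on Windows, as_uri() additionally fixes the backslash separators).

```python
from pathlib import Path

page = Path('/tmp/test dir/page with spaces.html')

manual = f'file://{page}'  # the pattern the test currently uses
proper = page.as_uri()     # what the review suggests

print(manual)  # file:///tmp/test dir/page with spaces.html  (spaces unescaped)
print(proper)  # file:///tmp/test%20dir/page%20with%20spaces.html
```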


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pydoll/extractor/engine.py`:
- Around line 53-55: The scoped query result assigned to context (type
FindElementsMixin = self._tab) needs explicit type narrowing because
self._tab.query(...) returns Union[WebElement, list[WebElement], None] and MyPy
can't infer the runtime behavior; after calling self._tab.query(scope,
timeout=timeout) add an assertion that narrows context to a WebElement (e.g.,
assert isinstance(context, WebElement)) before using it as a single element, and
ensure WebElement is available at runtime (move its import out of TYPE_CHECKING
or use a string-based isinstance check) so the type checker accepts the narrowed
type; also handle the None/list cases if they can occur by raising or converting
to the expected single-element form.
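The narrowing this comment asks for is the standard assert-isinstance pattern that MyPy understands. A self-contained sketch with a stand-in Element class (the real code would use WebElement and the Tab's query()):

```python
from typing import Union

class Element:
    # Stand-in for WebElement; just enough for the narrowing demo.
    def __init__(self, text: str) -> None:
        self.text = text

def query(selector: str, find_all: bool = False) -> Union[Element, list[Element], None]:
    # Fake query with the same broad return type the review describes.
    return [Element('a'), Element('b')] if find_all else Element('only')

def first_text(selector: str) -> str:
    result = query(selector)
    # assert isinstance narrows the Union so the type checker (and readers)
    # know a single Element flows onward; it also fails fast at runtime if
    # the single-element assumption is ever violated.
    assert isinstance(result, Element)
    return result.text

print(first_text('.scope'))  # only
```

As the comment notes, this only works at runtime if the class is importable outside a TYPE_CHECKING block.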

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 13b5678f-304f-40b2-8e18-df8653277bd8

📥 Commits

Reviewing files that changed from the base of the PR and between 48e26ce and 510b461.

📒 Files selected for processing (7)
  • pydoll/browser/tab.py
  • pydoll/commands/accessibility_commands.py
  • pydoll/extractor/engine.py
  • pydoll/extractor/field.py
  • pydoll/extractor/model.py
  • pydoll/interactions/iframe.py
  • pydoll/protocol/accessibility/methods.py
✅ Files skipped from review due to trivial changes (3)
  • pydoll/commands/accessibility_commands.py
  • pydoll/interactions/iframe.py
  • pydoll/protocol/accessibility/methods.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • pydoll/browser/tab.py
  • pydoll/extractor/model.py
  • pydoll/extractor/field.py


codecov bot commented Mar 22, 2026

Codecov Report

❌ Patch coverage is 91.92825% with 18 lines in your changes missing coverage. Please review.

Files with missing lines:
  • pydoll/extractor/engine.py — 90.22% patch coverage (13 lines missing) ⚠️
  • pydoll/extractor/model.py — 87.50% patch coverage (4 lines missing) ⚠️
  • pydoll/browser/tab.py — 92.30% patch coverage (1 line missing) ⚠️



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pydoll/extractor/engine.py (1)

129-132: Consider logging when skipping fields with None annotation.

Silently skipping fields where annotation is None may make debugging difficult. A debug log similar to the one for missing selectors would improve observability.

💡 Suggested improvement
             annotation = field_info.annotation
             if annotation is None:
+                logger.debug(f'Skipping field "{name}" (no type annotation)')
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pydoll/extractor/engine.py` around lines 129 - 132, Add a debug log when a
field has annotation None before the continue so skipping is observable; locate
the loop that reads model.model_fields[name] and checks annotation (the block
with field_info = model.model_fields[name]; annotation = field_info.annotation;
if annotation is None: continue) and emit a debug message (e.g., logger.debug or
process_logger.debug) including the model name/class and the field name variable
`name` to mirror the existing missing-selector debug logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pydoll/extractor/engine.py`:
- Around line 122-137: The type annotation for tasks is wrong: tasks is
collecting coroutines returned by self._extract_field but is annotated as
list[asyncio.Task[...]], causing a type mismatch when passed to asyncio.gather;
change the annotation to an awaitable/coroutine type (e.g.,
list[collections.abc.Coroutine[...] ] or list[collections.abc.Awaitable[...]])
to reflect that self._extract_field(...) returns a coroutine, and add the
appropriate import for Coroutine/Awaitable at the top of the file so the new
annotation resolves.

In `@README.md`:
- Around line 101-124: The example uses asyncio.run(...) but missing the import;
add "import asyncio" at the top of the snippet so extract_quotes() can be
executed—update the code example that imports Chrome (from
pydoll.browser.chromium import Chrome) and ExtractionModel/Field to also import
asyncio before defining class Quote and the async function extract_quotes().
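The coroutine-vs-Task annotation issue flagged for engine.py above comes down to this: calling an async def returns a coroutine object, not a Task, and asyncio.gather accepts any awaitables. A minimal sketch of the corrected annotation (function names here are illustrative, not the engine's):

```python
import asyncio
from collections.abc import Coroutine
from typing import Any

async def extract_one(name: str) -> tuple[str, str]:
    return name, f'value-of-{name}'

async def extract_fields(names: list[str]) -> dict[str, str]:
    # A bare call to an async def yields a coroutine; annotating this list
    # as list[asyncio.Task[...]] is what triggered the MyPy complaint.
    tasks: list[Coroutine[Any, Any, tuple[str, str]]] = [
        extract_one(n) for n in names
    ]
    return dict(await asyncio.gather(*tasks))

print(asyncio.run(extract_fields(['title', 'author'])))
# {'title': 'value-of-title', 'author': 'value-of-author'}
```

Annotating with Awaitable[...] would work just as well, since gather only needs awaitables.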

---

Nitpick comments:
In `@pydoll/extractor/engine.py`:
- Around line 129-132: Add a debug log when a field has annotation None before
the continue so skipping is observable; locate the loop that reads
model.model_fields[name] and checks annotation (the block with field_info =
model.model_fields[name]; annotation = field_info.annotation; if annotation is
None: continue) and emit a debug message (e.g., logger.debug or
process_logger.debug) including the model name/class and the field name variable
`name` to mirror the existing missing-selector debug logging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ca59d12b-bbc6-4f2e-a331-80c2acf37981

📥 Commits

Reviewing files that changed from the base of the PR and between 510b461 and 35f4898.

📒 Files selected for processing (13)
  • README.md
  • docs/en/features/extraction/structured-extraction.md
  • docs/en/features/index.md
  • docs/en/index.md
  • docs/pt/features/extraction/structured-extraction.md
  • docs/pt/features/index.md
  • docs/pt/index.md
  • docs/zh/features/extraction/structured-extraction.md
  • docs/zh/features/index.md
  • docs/zh/index.md
  • mkdocs.yml
  • pydoll/extractor/engine.py
  • tests/test_extractor/test_extractor_integration.py
✅ Files skipped from review due to trivial changes (4)
  • mkdocs.yml
  • docs/en/features/index.md
  • docs/zh/features/index.md
  • docs/en/features/extraction/structured-extraction.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_extractor/test_extractor_integration.py


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
pydoll/extractor/engine.py (2)

206-217: Consider parallelizing element extraction for better performance.

The sequential loops call async functions (_extract_fields, _extract_value) one element at a time. For large lists, concurrent extraction would improve performance, consistent with the pattern used in extract_all() and _extract_fields().

♻️ Suggested parallel approach for nested models
         if _is_extraction_model(inner_type):
-            results: list[Union[str, int, float, bool, object]] = []
-            for element in elements:
-                field_values = await self._extract_fields(inner_type, element, timeout)
-                results.append(_build_instance(inner_type, field_values))
-            return results
+            extraction_tasks = [
+                self._extract_fields(inner_type, element, timeout)
+                for element in elements
+            ]
+            all_values = await asyncio.gather(*extraction_tasks)
+            return [_build_instance(inner_type, vals) for vals in all_values]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pydoll/extractor/engine.py` around lines 206 - 217, The loops in engine.py
are running asynchronous extraction sequentially; change both branches to run
per-element work concurrently: for extraction models, create tasks calling
self._extract_fields(inner_type, element, timeout) for each element, await
asyncio.gather to get all field_values, then map those results through
_build_instance(inner_type, field_values); for scalar paths, create tasks
calling _extract_value(element, metadata) for each element, await asyncio.gather
to get all raw values, then map those through _apply_transform(raw, metadata).
Use asyncio.create_task or gather directly and preserve ordering of elements;
reference _is_extraction_model, self._extract_fields, _build_instance,
_extract_value, and _apply_transform when locating code to change.
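The suggested gather-based rewrite is safe because asyncio.gather returns results in the order the awaitables were passed in, regardless of completion order. A tiny demo with randomly staggered finishes:

```python
import asyncio
import random

async def extract(i: int) -> str:
    # Finish in random order to show gather still returns input order.
    await asyncio.sleep(random.uniform(0, 0.02))
    return f'item-{i}'

async def main() -> list[str]:
    return await asyncio.gather(*(extract(i) for i in range(5)))

print(asyncio.run(main()))  # ['item-0', 'item-1', 'item-2', 'item-3', 'item-4']
```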

134-135: Consider logging when skipping fields with no type annotation.

Fields without annotations are silently skipped. Adding a debug log here (similar to line 129) would help with troubleshooting when fields unexpectedly produce no value.

♻️ Suggested improvement
             if annotation is None:
+                logger.debug(f'Skipping field "{name}" (no type annotation)')
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pydoll/extractor/engine.py` around lines 134 - 135, When skipping a field
because "annotation is None" add a debug log (like the one at the nearby log on
line 129) before the "continue" so skipped fields are visible during
troubleshooting; use the same logger instance used in that scope and include the
field identifier (e.g., the local "name" or "field" variable) and that the type
annotation is missing to give context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pydoll/extractor/engine.py`:
- Around line 206-217: The loops in engine.py are running asynchronous
extraction sequentially; change both branches to run per-element work
concurrently: for extraction models, create tasks calling
self._extract_fields(inner_type, element, timeout) for each element, await
asyncio.gather to get all field_values, then map those results through
_build_instance(inner_type, field_values); for scalar paths, create tasks
calling _extract_value(element, metadata) for each element, await asyncio.gather
to get all raw values, then map those through _apply_transform(raw, metadata).
Use asyncio.create_task or gather directly and preserve ordering of elements;
reference _is_extraction_model, self._extract_fields, _build_instance,
_extract_value, and _apply_transform when locating code to change.
- Around line 134-135: When skipping a field because "annotation is None" add a
debug log (like the one at the nearby log on line 129) before the "continue" so
skipped fields are visible during troubleshooting; use the same logger instance
used in that scope and include the field identifier (e.g., the local "name" or
"field" variable) and that the type annotation is missing to give context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eb0aa412-0b55-4777-8028-24a2c169427c

📥 Commits

Reviewing files that changed from the base of the PR and between 35f4898 and 16eb0ef.

📒 Files selected for processing (1)
  • pydoll/extractor/engine.py
