This reverts commit 219267e.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds a structured DOM-to-model extraction subsystem: ExtractionModel and Field metadata, a new ExtractionEngine with async extract/extract_all, extractor exceptions and package exports, Tab-level integration, tests and HTML fixtures, multilingual documentation, and a pydantic runtime dependency.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Tab
    participant ExtractionEngine
    participant DOM_Page
    participant Pydantic_Model
    rect rgba(100, 150, 200, 0.5)
        Note over Tab,ExtractionEngine: extract(model) — single instance
        Tab->>Tab: access _extractor (lazy init)
        Tab->>ExtractionEngine: extract(model, scope, timeout)
        ExtractionEngine->>ExtractionEngine: get_extraction_fields()
        loop for each field
            ExtractionEngine->>DOM_Page: query selector / attribute
            DOM_Page-->>ExtractionEngine: element(s) / text / attribute
            ExtractionEngine->>ExtractionEngine: apply transform / recurse for nested
        end
        ExtractionEngine->>Pydantic_Model: construct & validate model
        Pydantic_Model-->>ExtractionEngine: validated instance
        ExtractionEngine-->>Tab: model instance
    end
    rect rgba(150, 200, 100, 0.5)
        Note over Tab,ExtractionEngine: extract_all(model) — repeated containers
        Tab->>ExtractionEngine: extract_all(model, scope, limit)
        ExtractionEngine->>DOM_Page: query container selector
        DOM_Page-->>ExtractionEngine: list of container elements
        loop per container (up to limit)
            ExtractionEngine->>DOM_Page: query fields scoped to container
            DOM_Page-->>ExtractionEngine: field values
            ExtractionEngine->>Pydantic_Model: construct & validate instance
        end
        ExtractionEngine-->>Tab: list of model instances
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 11
🧹 Nitpick comments (1)
tests/test_extractor/test_extractor_integration.py (1)
219-222: Replace the repeated fixed sleep with a deterministic ready check.

Lines 219-222 add a hard-coded 500 ms delay, and the same setup block repeats across the async tests. If `go_to()` already waits for navigation, this is just dead time; if it does not, the tests are still timing-sensitive. A shared fixture/helper that waits on a stable sentinel element from `tests/pages/test_extractor.html` would make this suite faster and less flaky.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_extractor/test_extractor_integration.py` around lines 219 - 222, Replace the hard-coded asyncio.sleep(0.5) after opening FILE_URL with a deterministic readiness check: remove the fixed delay in the block using Chrome/ browser.start/ tab.go_to(FILE_URL) and instead call a shared test helper or fixture that waits for a stable sentinel element in tests/pages/test_extractor.html (e.g., a unique id or data-test attribute) to appear/ be visible before proceeding; add this helper/fixture and use it across the async tests to reduce flakiness and speed up the suite.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pydoll/browser/tab.py`:
- Around line 130-131: Bind the TypeVar T to ExtractionModel to satisfy MyPy:
replace the unbound T defined in pydoll/browser/tab.py with a TypeVar that has
bound 'ExtractionModel', add a conditional import of ExtractionModel inside a
TYPE_CHECKING block to avoid circular imports (refer to the symbol
ExtractionModel), and keep the bound as a string annotation to prevent runtime
import issues; this will make T compatible with ExtractionEngine.extract()
usages in the file.
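The suggested TypeVar fix can be sketched as follows. This is a minimal, self-contained illustration under the review's assumptions: `pydoll.extractor.model.ExtractionModel` is the real symbol, stubbed here behind `TYPE_CHECKING`, while `first_of` is a hypothetical helper added only to show the bound TypeVar flowing through a signature.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, TypeVar

if TYPE_CHECKING:
    # Resolved only by the type checker, so no circular import at runtime.
    from pydoll.extractor.model import ExtractionModel

# The bound is a string annotation, evaluated lazily by MyPy, so the
# module imports cleanly even though ExtractionModel is absent at runtime.
T = TypeVar('T', bound='ExtractionModel')


def first_of(model: type[T], instances: list[T]) -> T:
    """Hypothetical helper showing the bound TypeVar in a signature."""
    assert all(isinstance(i, model) for i in instances)
    return instances[0]
```

With this bound, MyPy accepts passing `T` into APIs that expect an `ExtractionModel` subclass, such as `ExtractionEngine.extract()`.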
- Around line 90-91: The top-level import of ExtractionEngine causes
lint/circular-import issues; remove it and either add it under a TYPE_CHECKING
block or perform a runtime import inside the lazy _extractor property. Update
the _extractor property (and use the existing _extraction_engine attribute) to
import ExtractionEngine locally when _extraction_engine is None, instantiate it
with self, assign to _extraction_engine, and return it; alternatively add "from
typing import TYPE_CHECKING" and place "from pydoll.extractor.engine import
ExtractionEngine" under TYPE_CHECKING and keep the runtime import approach in
_extractor to avoid the Ruff I001 and potential circular imports.
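The lazy-import pattern being suggested looks roughly like this sketch. Class and attribute names (`Tab`, `_extraction_engine`, `_extractor`) follow the review; the `ExtractionEngine(self)` constructor call is an assumption about its signature.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Static-analysis-only import; avoids the tab.py <-> engine.py cycle.
    from pydoll.extractor.engine import ExtractionEngine


class Tab:
    def __init__(self) -> None:
        self._extraction_engine: ExtractionEngine | None = None

    @property
    def _extractor(self) -> ExtractionEngine:
        if self._extraction_engine is None:
            # Runtime import happens only on first use, after both modules
            # are fully initialized, so the circular import never triggers.
            from pydoll.extractor.engine import ExtractionEngine

            self._extraction_engine = ExtractionEngine(self)
        return self._extraction_engine
```

Ruff's I001 is satisfied because the only top-level import of the engine sits inside the `TYPE_CHECKING` block.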
In `@pydoll/extractor/engine.py`:
- Around line 118-122: field_info.annotation can be None but _extract_field
requires a type; add a guard before calling _extract_field in the block that
reads model.model_fields[name] and annotation: if annotation is None, set it to
typing.Any (import Any from typing) or another appropriate fallback type, then
call await self._extract_field(metadata, annotation, context, timeout) so the
call always receives a concrete type and MyPy errors are resolved.
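The guard can be illustrated with a stand-in for pydantic's `FieldInfo`. `FakeFieldInfo` and `resolve_annotation` below are hypothetical names used only for this sketch; the real attribute read is `model.model_fields[name].annotation`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeFieldInfo:
    """Hypothetical stand-in for pydantic.fields.FieldInfo."""

    annotation: Optional[type]


def resolve_annotation(field_info: FakeFieldInfo) -> Any:
    # Fall back to typing.Any so _extract_field always receives a
    # concrete type instead of None.
    annotation = field_info.annotation
    if annotation is None:
        return Any
    return annotation
```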
- Around line 201-204: The call to context.query(...) returns a broad Union type
but at runtime with raise_exc=True it yields a single WebElement; update the
call to pass find_all=False (context.query(metadata.selector, find_all=False,
timeout=timeout, raise_exc=True)) and then narrow the type of scope_element
(e.g., use typing.cast(WebElement, scope_element) or an assert
isinstance(scope_element, WebElement)) before passing it to _extract_fields so
MyPy understands it's a single WebElement; reference symbols: context.query,
scope_element, _extract_fields, _build_instance.
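The `typing.cast` narrowing can be sketched like this. `WebElement` and `query` here are hypothetical stand-ins mimicking the union return type described in the review, not pydoll's real classes.

```python
from typing import Union, cast


class WebElement:
    """Hypothetical stand-in for pydoll's WebElement."""

    def __init__(self, tag: str) -> None:
        self.tag = tag


def query(
    selector: str, find_all: bool = False, raise_exc: bool = True
) -> Union[WebElement, list[WebElement], None]:
    # With find_all=False and raise_exc=True the described API
    # returns exactly one element at runtime.
    return WebElement(selector)


# cast() has zero runtime cost; it only narrows the type for MyPy so the
# result can be passed where a single WebElement is required.
scope_element = cast(WebElement, query('div.container', find_all=False))
```

An `assert isinstance(scope_element, WebElement)` would achieve the same narrowing while also validating the runtime invariant.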
- Around line 113-116: The code checks metadata.has_selector but MyPy doesn't
narrow metadata.selector from Optional[str] to str, causing type errors when
passing it to query(); after the existing has_selector check in the loop over
model.get_extraction_fields() and inside the helper methods _extract_list_field,
_extract_nested_model, and _extract_scalar_field, add an assertion like assert
metadata.selector is not None to guarantee metadata.selector is a str before
calling query() (or otherwise using it), so MyPy recognizes the narrowed type
and the query(metadata.selector) calls type-check cleanly.
- Around line 239-254: The return type of _apply_transform does not match the
declared return type of ExtractionMetadata.transform; update _apply_transform's
annotation to include object (i.e., Union[str, int, float, bool, object]) so it
matches metadata.transform, or alternatively change the
ExtractionMetadata.transform signature to exclude object and keep
_apply_transform as-is—locate the function _apply_transform and the
ExtractionMetadata.transform type declaration to ensure both signatures align
and MyPy errors are resolved.
- Around line 173-192: The variable elements returned from context.query is
typed as Union[WebElement, list[WebElement], None] but you iterate it as a list;
update the code in the extraction path (around where context.query(...) is
called and before the loops that use elements) to narrow/cast elements to
list[WebElement] (or assert isinstance(elements, list)) so MyPy knows it's a
list when iterating in the branches that call self._extract_fields,
_build_instance, _extract_value and _apply_transform; ensure the cast/assert is
applied before both the model-extraction loop (which calls self._extract_fields
and _build_instance) and the scalar loop (which calls _extract_value and
_apply_transform).
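The list-narrowing advice reduces to the pattern below. Again `WebElement` and `query` are hypothetical stand-ins reproducing the union return type; the `assert isinstance` both guards the other union members at runtime and narrows the type for MyPy before iteration.

```python
from typing import Union


class WebElement:
    """Hypothetical stand-in for pydoll's WebElement."""


def query(
    selector: str, find_all: bool = True
) -> Union[WebElement, list[WebElement], None]:
    # With find_all=True the described call returns a list at runtime.
    return [WebElement(), WebElement()]


elements = query('li.item')
# After this assert, MyPy treats elements as list[WebElement] in the
# loops that follow (field extraction, transforms, instance building).
assert isinstance(elements, list)
names = [type(el).__name__ for el in elements]
```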
- Around line 207-216: In _extract_scalar_field: guard against metadata.selector
being None and narrow the type returned by context.query before passing it to
_extract_value; specifically, validate metadata.selector (from
ExtractionMetadata) and raise or return a clear error if it's None, call
context.query(metadata.selector, ...) only when non-None, then assert/handle the
optional return from FindElementsMixin.query so the value passed into
_extract_value is a concrete element type (not None) and _apply_transform
receives the correctly typed raw value; update function _extract_scalar_field
accordingly and keep references to _extract_value and _apply_transform for
downstream calls.
In `@pydoll/extractor/field.py`:
- Around line 56-102: The Field function's default parameter is typed as object
which makes MyPy infer the wrong overload for the PydanticField call; change the
signature's default parameter type from object to typing.Any (e.g., default: Any
= PydanticUndefined) and add the corresponding Any import so the default matches
Pydantic's overloaded Field signature and resolves the MyPy error in Field (the
function that constructs ExtractionMetadata and calls PydanticField).
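A reduced sketch of why `Any` matters for the default parameter. The `_UNDEFINED` sentinel mimics pydantic's `PydanticUndefined`, and this toy `Field` returns a plain dict instead of calling `pydantic.Field`; the point is only that annotating `default` as `Any` (rather than `object`) lets MyPy match pydantic's overloads.

```python
from typing import Any

_UNDEFINED = object()  # stand-in for pydantic_core.PydanticUndefined


def Field(selector: str, default: Any = _UNDEFINED) -> dict[str, Any]:
    """Toy Field: typing `default` as Any keeps overload resolution
    permissive; typing it as `object` forces the wrong overload."""
    info: dict[str, Any] = {'selector': selector}
    if default is not _UNDEFINED:
        info['default'] = default
    return info
```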
In `@pydoll/extractor/model.py`:
- Around line 49-53: MyPy complains the method may return None because ClassVar
_extraction_fields_cache isn't narrowed; fix by ensuring a statically
non-optional return: after possibly assigning cls._extraction_fields_cache =
_collect_extraction_metadata(cls), return typing.cast(dict[str,
ExtractionMetadata], cls._extraction_fields_cache) (or assign the result to a
local non-optional variable and return that), and add the required import for
typing.cast; reference the symbols _extraction_fields_cache and
_collect_extraction_metadata when making the change.
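The `typing.cast` fix for the ClassVar cache can be sketched as below. The metadata collector is hypothetical (the real `_collect_extraction_metadata` inspects pydantic field metadata), but the caching and narrowing pattern matches the review's suggestion.

```python
from typing import ClassVar, Optional, cast


def _collect_extraction_metadata(cls: type) -> dict[str, str]:
    # Hypothetical collector; returns field-name -> selector pairs.
    return {'title': 'h1', 'price': '.price'}


class ExtractionModel:
    _extraction_fields_cache: ClassVar[Optional[dict[str, str]]] = None

    @classmethod
    def get_extraction_fields(cls) -> dict[str, str]:
        if cls._extraction_fields_cache is None:
            cls._extraction_fields_cache = _collect_extraction_metadata(cls)
        # MyPy does not narrow ClassVars across statements; cast() makes
        # the non-Optional return type explicit at zero runtime cost.
        return cast(dict[str, str], cls._extraction_fields_cache)
```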
In `@tests/test_extractor/test_extractor_integration.py`:
- Around line 19-20: Replace the manual file:// string construction for FILE_URL
with Path.as_uri() to produce a correct, platform-independent file URI: locate
the TEST_PAGE and FILE_URL definitions (symbols TEST_PAGE and FILE_URL) and set
FILE_URL using TEST_PAGE.as_uri() instead of f'file://{TEST_PAGE.absolute()}';
this ensures proper escaping and correct separators on Windows and POSIX.
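The difference is easy to demonstrate. The sketch below uses `PurePosixPath` so it behaves the same on every platform; the path itself is made up for illustration.

```python
from pathlib import PurePosixPath

page = PurePosixPath('/tmp/test dir/test_extractor.html')

# Manual concatenation leaves spaces unescaped (and on Windows would
# embed backslashes); as_uri() percent-escapes and normalises separators.
manual = f'file://{page}'
correct = page.as_uri()
```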
---
Nitpick comments:
In `@tests/test_extractor/test_extractor_integration.py`:
- Around line 219-222: Replace the hard-coded asyncio.sleep(0.5) after opening
FILE_URL with a deterministic readiness check: remove the fixed delay in the
block using Chrome/ browser.start/ tab.go_to(FILE_URL) and instead call a shared
test helper or fixture that waits for a stable sentinel element in
tests/pages/test_extractor.html (e.g., a unique id or data-test attribute) to
appear/ be visible before proceeding; add this helper/fixture and use it across
the async tests to reduce flakiness and speed up the suite.
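A sentinel-wait helper along the suggested lines might look like this sketch. `FakeTab`, `wait_for_sentinel`, and the `#extractor-ready` selector are all hypothetical; the real helper would poll the tab's actual `query()` API against a sentinel element in `tests/pages/test_extractor.html`.

```python
import asyncio


class FakeTab:
    """Hypothetical tab: query() finds the sentinel after a few polls."""

    def __init__(self, ready_after: int) -> None:
        self._calls = 0
        self._ready_after = ready_after

    async def query(self, selector: str):
        self._calls += 1
        return 'element' if self._calls >= self._ready_after else None


async def wait_for_sentinel(
    tab, selector: str = '#extractor-ready',
    timeout: float = 5.0, interval: float = 0.01,
):
    """Poll for a sentinel element instead of sleeping a fixed 500 ms."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        element = await tab.query(selector)
        if element is not None:
            return element
        await asyncio.sleep(interval)
    raise TimeoutError(f'sentinel {selector!r} never appeared')


result = asyncio.run(wait_for_sentinel(FakeTab(ready_after=3)))
```

The helper returns as soon as the page is ready, so fast runs pay only a few polling intervals instead of a fixed half-second per test.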
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7fdf2cd9-243b-4302-8416-7bc9dbb82890
⛔ Files ignored due to path filters (1)
`poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- pydoll/browser/tab.py
- pydoll/extractor/__init__.py
- pydoll/extractor/engine.py
- pydoll/extractor/exceptions.py
- pydoll/extractor/field.py
- pydoll/extractor/model.py
- pyproject.toml
- tests/pages/test_extractor.html
- tests/test_extractor/__init__.py
- tests/test_extractor/test_extractor_integration.py
```python
TEST_PAGE = Path(__file__).parent.parent / 'pages' / 'test_extractor.html'
FILE_URL = f'file://{TEST_PAGE.absolute()}'
```
🧩 Analysis chain
🏁 Script executed:
```shell
cat -n tests/test_extractor/test_extractor_integration.py | head -25
```

Repository: autoscrape-labs/pydoll
Length of output: 893
🏁 Script executed:
```shell
# Verify the pathlib.Path.as_uri() method and behavior
python3 << 'EOF'
from pathlib import Path
import tempfile
import os

# Test 1: Check if as_uri() exists and works
test_path = Path("/tmp/test file with spaces.html")
print("Test 1: as_uri() method exists and works")
print(f"  Path: {test_path}")
print(f"  as_uri(): {test_path.as_uri()}")
print()

# Test 2: Compare resolve().as_uri() vs as_uri()
cwd_path = Path(".") / "test.html"
print("Test 2: resolve().as_uri() vs as_uri()")
print(f"  as_uri(): {cwd_path.as_uri()}")
print(f"  resolve().as_uri(): {cwd_path.resolve().as_uri()}")
print()

# Test 3: Demonstrate non-ASCII character handling
non_ascii_path = Path("/tmp/test_café.html")
print("Test 3: Non-ASCII character handling")
print(f"  as_uri(): {non_ascii_path.as_uri()}")
print()

# Test 4: Demonstrate the problematic manual concatenation
print("Test 4: Manual concatenation vs as_uri()")
manual = f'file://{cwd_path.resolve().absolute()}'
correct = cwd_path.resolve().as_uri()
print(f"  Manual concatenation: {manual}")
print(f"  as_uri(): {correct}")
print(f"  Are they the same? {manual == correct}")
EOF
```

Repository: autoscrape-labs/pydoll
Length of output: 525
🏁 Script executed:
```shell
python3 << 'EOF'
from pathlib import Path

# Test the actual code pattern from the file
test_html_path = Path(__file__).parent / 'pages' / 'test_extractor.html'
print("Test 1: Code from test_extractor_integration.py pattern")
print(f"  TEST_PAGE: {test_html_path}")
print(f"  TEST_PAGE.absolute(): {test_html_path.absolute()}")

# Test manual concatenation (current code)
manual_uri = f'file://{test_html_path.absolute()}'
print(f"  Manual URI: {manual_uri}")
print()

# Test with as_uri() - absolute path
absolute_uri = test_html_path.absolute().as_uri()
print("Test 2: Using .absolute().as_uri()")
print(f"  as_uri() result: {absolute_uri}")
print(f"  Match manual? {manual_uri == absolute_uri}")
print()

# Test with resolve().as_uri()
resolved_uri = test_html_path.resolve().as_uri()
print("Test 3: Using .resolve().as_uri()")
print(f"  resolve().as_uri() result: {resolved_uri}")
print(f"  Match manual? {manual_uri == resolved_uri}")
print()

# Test problematic case: path with spaces
problematic_path = Path("/home/user/test dir/file with spaces.html")
manual_problematic = f'file://{problematic_path.absolute()}'
uri_problematic = problematic_path.absolute().as_uri()
print("Test 4: Path with spaces (problematic case)")
print(f"  Manual: {manual_problematic}")
print(f"  as_uri(): {uri_problematic}")
print(f"  Spaces escaped? as_uri has %20: {'%20' in uri_problematic}")
print()

# Test cross-platform: simulate Windows path behavior
print("Test 5: Windows path note")
print(f"  On Windows, absolute() returns: C:\\Users\\...\\ style paths")
print(f"  Manual f'file://{{win_path}}' would produce: file://C:\\Users\\... (WRONG)")
print(f"  as_uri() would properly produce: file:///C:/Users/... (CORRECT)")
EOF
```

Repository: autoscrape-labs/pydoll
Length of output: 996
🏁 Script executed:
```shell
# Also verify if the test actually runs and passes with the current code
cd tests/test_extractor && ls -la | head -20
```

Repository: autoscrape-labs/pydoll
Length of output: 321
Build the local page URL with Path.as_uri().
Line 20 manually concatenates a file: URI. This breaks on Windows (uses backslashes instead of forward slashes in the URI) and fails to escape special characters like spaces, even on POSIX systems.
Suggested fix
```diff
 TEST_PAGE = Path(__file__).parent.parent / 'pages' / 'test_extractor.html'
-FILE_URL = f'file://{TEST_PAGE.absolute()}'
+FILE_URL = TEST_PAGE.absolute().as_uri()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_extractor/test_extractor_integration.py` around lines 19 - 20,
Replace the manual file:// string construction for FILE_URL with Path.as_uri()
to produce a correct, platform-independent file URI: locate the TEST_PAGE and
FILE_URL definitions (symbols TEST_PAGE and FILE_URL) and set FILE_URL using
TEST_PAGE.as_uri() instead of f'file://{TEST_PAGE.absolute()}'; this ensures
proper escaping and correct separators on Windows and POSIX.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pydoll/extractor/engine.py`:
- Around line 53-55: The scoped query result assigned to context (type
FindElementsMixin = self._tab) needs explicit type narrowing because
self._tab.query(...) returns Union[WebElement, list[WebElement], None] and MyPy
can't infer the runtime behavior; after calling self._tab.query(scope,
timeout=timeout) add an assertion that narrows context to a WebElement (e.g.,
assert isinstance(context, WebElement)) before using it as a single element, and
ensure WebElement is available at runtime (move its import out of TYPE_CHECKING
or use a string-based isinstance check) so the type checker accepts the narrowed
type; also handle the None/list cases if they can occur by raising or converting
to the expected single-element form.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 13b5678f-304f-40b2-8e18-df8653277bd8
📒 Files selected for processing (7)
- pydoll/browser/tab.py
- pydoll/commands/accessibility_commands.py
- pydoll/extractor/engine.py
- pydoll/extractor/field.py
- pydoll/extractor/model.py
- pydoll/interactions/iframe.py
- pydoll/protocol/accessibility/methods.py
✅ Files skipped from review due to trivial changes (3)
- pydoll/commands/accessibility_commands.py
- pydoll/interactions/iframe.py
- pydoll/protocol/accessibility/methods.py
🚧 Files skipped from review as they are similar to previous changes (3)
- pydoll/browser/tab.py
- pydoll/extractor/model.py
- pydoll/extractor/field.py
Codecov Report
❌ Patch coverage is
📢 Thoughts on this report? Let us know!
Actionable comments posted: 2
🧹 Nitpick comments (1)
pydoll/extractor/engine.py (1)
129-132: Consider logging when skipping fields with `None` annotation.

Silently skipping fields where `annotation is None` may make debugging difficult. A debug log similar to the one for missing selectors would improve observability.

💡 Suggested improvement

```diff
 annotation = field_info.annotation
 if annotation is None:
+    logger.debug(f'Skipping field "{name}" (no type annotation)')
     continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pydoll/extractor/engine.py` around lines 129 - 132, Add a debug log when a field has annotation None before the continue so skipping is observable; locate the loop that reads model.model_fields[name] and checks annotation (the block with field_info = model.model_fields[name]; annotation = field_info.annotation; if annotation is None: continue) and emit a debug message (e.g., logger.debug or process_logger.debug) including the model name/class and the field name variable `name` to mirror the existing missing-selector debug logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pydoll/extractor/engine.py`:
- Around line 122-137: The type annotation for tasks is wrong: tasks is
collecting coroutines returned by self._extract_field but is annotated as
list[asyncio.Task[...]], causing a type mismatch when passed to asyncio.gather;
change the annotation to an awaitable/coroutine type (e.g.,
list[collections.abc.Coroutine[...] ] or list[collections.abc.Awaitable[...]])
to reflect that self._extract_field(...) returns a coroutine, and add the
appropriate import for Coroutine/Awaitable at the top of the file so the new
annotation resolves.
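The annotation fix can be sketched with a self-contained example. `extract_field` and `extract_all_fields` are hypothetical stand-ins for the engine's `_extract_field` gathering pattern; the point is that a list comprehension over async calls produces coroutines, not `asyncio.Task` objects.

```python
import asyncio
from collections.abc import Coroutine
from typing import Any


async def extract_field(name: str) -> str:
    await asyncio.sleep(0)  # stand-in for a DOM query
    return f'{name}-value'


async def extract_all_fields(names: list[str]) -> list[str]:
    # These are plain, un-awaited coroutines, so the annotation must say
    # Coroutine (or Awaitable), not asyncio.Task, to satisfy MyPy when
    # they are handed to asyncio.gather.
    tasks: list[Coroutine[Any, Any, str]] = [extract_field(n) for n in names]
    return list(await asyncio.gather(*tasks))


values = asyncio.run(extract_all_fields(['title', 'price']))
```

`asyncio.gather` accepts both coroutines and Tasks, so only the annotation needs to change, not the runtime behavior.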
In `@README.md`:
- Around line 101-124: The example uses asyncio.run(...) but missing the import;
add "import asyncio" at the top of the snippet so extract_quotes() can be
executed—update the code example that imports Chrome (from
pydoll.browser.chromium import Chrome) and ExtractionModel/Field to also import
asyncio before defining class Quote and the async function extract_quotes().
---
Nitpick comments:
In `@pydoll/extractor/engine.py`:
- Around line 129-132: Add a debug log when a field has annotation None before
the continue so skipping is observable; locate the loop that reads
model.model_fields[name] and checks annotation (the block with field_info =
model.model_fields[name]; annotation = field_info.annotation; if annotation is
None: continue) and emit a debug message (e.g., logger.debug or
process_logger.debug) including the model name/class and the field name variable
`name` to mirror the existing missing-selector debug logging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ca59d12b-bbc6-4f2e-a331-80c2acf37981
📒 Files selected for processing (13)
- README.md
- docs/en/features/extraction/structured-extraction.md
- docs/en/features/index.md
- docs/en/index.md
- docs/pt/features/extraction/structured-extraction.md
- docs/pt/features/index.md
- docs/pt/index.md
- docs/zh/features/extraction/structured-extraction.md
- docs/zh/features/index.md
- docs/zh/index.md
- mkdocs.yml
- pydoll/extractor/engine.py
- tests/test_extractor/test_extractor_integration.py
✅ Files skipped from review due to trivial changes (4)
- mkdocs.yml
- docs/en/features/index.md
- docs/zh/features/index.md
- docs/en/features/extraction/structured-extraction.md
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_extractor/test_extractor_integration.py
🧹 Nitpick comments (2)
pydoll/extractor/engine.py (2)
206-217: Consider parallelizing element extraction for better performance.

The sequential loops call async functions (`_extract_fields`, `_extract_value`) one element at a time. For large lists, concurrent extraction would improve performance, consistent with the pattern used in `extract_all()` and `_extract_fields()`.

♻️ Suggested parallel approach for nested models

```diff
 if _is_extraction_model(inner_type):
-    results: list[Union[str, int, float, bool, object]] = []
-    for element in elements:
-        field_values = await self._extract_fields(inner_type, element, timeout)
-        results.append(_build_instance(inner_type, field_values))
-    return results
+    extraction_tasks = [
+        self._extract_fields(inner_type, element, timeout)
+        for element in elements
+    ]
+    all_values = await asyncio.gather(*extraction_tasks)
+    return [_build_instance(inner_type, vals) for vals in all_values]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pydoll/extractor/engine.py` around lines 206 - 217, The loops in engine.py are running asynchronous extraction sequentially; change both branches to run per-element work concurrently: for extraction models, create tasks calling self._extract_fields(inner_type, element, timeout) for each element, await asyncio.gather to get all field_values, then map those results through _build_instance(inner_type, field_values); for scalar paths, create tasks calling _extract_value(element, metadata) for each element, await asyncio.gather to get all raw values, then map those through _apply_transform(raw, metadata). Use asyncio.create_task or gather directly and preserve ordering of elements; reference _is_extraction_model, self._extract_fields, _build_instance, _extract_value, and _apply_transform when locating code to change.
134-135: Consider logging when skipping fields with no type annotation.

Fields without annotations are silently skipped. Adding a debug log here (similar to line 129) would help with troubleshooting when fields unexpectedly produce no value.

♻️ Suggested improvement

```diff
 if annotation is None:
+    logger.debug(f'Skipping field "{name}" (no type annotation)')
     continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pydoll/extractor/engine.py` around lines 134 - 135, When skipping a field because "annotation is None" add a debug log (like the one at the nearby log on line 129) before the "continue" so skipped fields are visible during troubleshooting; use the same logger instance used in that scope and include the field identifier (e.g., the local "name" or "field" variable) and that the type annotation is missing to give context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pydoll/extractor/engine.py`:
- Around line 206-217: The loops in engine.py are running asynchronous
extraction sequentially; change both branches to run per-element work
concurrently: for extraction models, create tasks calling
self._extract_fields(inner_type, element, timeout) for each element, await
asyncio.gather to get all field_values, then map those results through
_build_instance(inner_type, field_values); for scalar paths, create tasks
calling _extract_value(element, metadata) for each element, await asyncio.gather
to get all raw values, then map those through _apply_transform(raw, metadata).
Use asyncio.create_task or gather directly and preserve ordering of elements;
reference _is_extraction_model, self._extract_fields, _build_instance,
_extract_value, and _apply_transform when locating code to change.
- Around line 134-135: When skipping a field because "annotation is None" add a
debug log (like the one at the nearby log on line 129) before the "continue" so
skipped fields are visible during troubleshooting; use the same logger instance
used in that scope and include the field identifier (e.g., the local "name" or
"field" variable) and that the type annotation is missing to give context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: eb0aa412-0b55-4777-8028-24a2c169427c
📒 Files selected for processing (1)
pydoll/extractor/engine.py
Summary by CodeRabbit