Skip to content
This repository was archived by the owner on Nov 10, 2025. It is now read-only.

Fix FirecrawlScrapeWebsiteTool#298

Merged
lucasgomide merged 4 commits intocrewAIInc:mainfrom
nicoferdi96:patch-1
May 7, 2025
Merged

Fix FirecrawlScrapeWebsiteTool#298
lucasgomide merged 4 commits intocrewAIInc:mainfrom
nicoferdi96:patch-1

Conversation

@nicoferdi96
Copy link
Contributor

Add missing config parameter and correct Dict type annotation

  • Add required config parameter when creating the tool
  • Change type hint from dict to Dict to resolve Pydantic validation issues

…ect Dict type annotation

- Add required config parameter when creating the tool
- Change type hint from `dict` to `Dict` to resolve Pydantic validation issues
@joaomdmoura
Copy link
Collaborator

Disclaimer: This review was made by a crew of AI Agents.

Code Review Comment for PR #298: FirecrawlScrapeWebsiteTool Changes

Overview

The changes in this pull request aim to enhance the implementation of the FirecrawlScrapeWebsiteTool by correcting type annotations and adding necessary configuration parameters. Below is a detailed analysis of the code along with recommendations for further improvement.

Positive Aspects

  • The import of Dict from the typing module has been correctly implemented.
  • A config parameter has been added to the __init__ method, providing enhanced flexibility for user configurations.
  • The type annotation for the config field has improved, leading to better static type checking and readability.
  • Documentation for new parameters in the class docstring has been added, improving usability for future developers.

Issues and Recommendations

1. Type Hint Consistency

The timeout parameter in the _run method should have a more appropriate default value that reflects expected usage.

Current Implementation:

def _run(self, url: str, timeout: Optional[int] = 30000):
    return self._firecrawl.scrape_url(url, **self.config)

Recommended Implementation:

def _run(self, url: str, timeout: Optional[int] = None) -> str:
    config = self.config.copy()
    if timeout:
        config['timeout'] = timeout
    return self._firecrawl.scrape_url(url, **config)

2. Configuration Management

The current implementation does not merge user-defined config values with default values.

Current Implementation:

if config:
    kwargs["config"] = config

Recommended Implementation:

def __init__(self, api_key: Optional[str] = None, config: Optional[Dict[str, Any]] = None, **kwargs):
    default_config = {
        "formats": ["markdown"],
        "only_main_content": True,
        "include_tags": [],
        "exclude_tags": [],
        "headers": {},
        "wait_for": 0,
        "json_options": None
    }
    if config:
        default_config.update(config)
    kwargs["config"] = default_config
    super().__init__(**kwargs)

3. Enhanced Error Handling

Improving error messages during imports can enhance debugging experience.

Current Implementation:

try:
    from firecrawl import FirecrawlApp  # type: ignore
except ImportError:
    raise ImportError("Firecrawl is not installed. Please install it with `pip install firecrawl`")

Recommended Implementation:

try:
    from firecrawl import FirecrawlApp  # type: ignore
except ImportError as e:
    raise ImportError(
        "Firecrawl package is required but not installed. "
        "Please install it with `pip install firecrawl`. "
        f"Original error: {str(e)}"
    ) from e

4. Documentation Clarity

Adding type hints in the docstring can clarify expected input and output types.

Recommended Docstring enhancement:

class FirecrawlScrapeWebsiteTool(BaseTool):
    """Tool for scraping websites using Firecrawl.
    
    Args:
        api_key (Optional[str]): Firecrawl API key
        config (Optional[Dict[str, Any]]): Configuration dictionary with the following options:
            formats (List[str]): Output formats, e.g., ["markdown"]
            only_main_content (bool): Extract only main content. Default: True
            include_tags (List[str]): Tags to include. Default: []
            exclude_tags (List[str]): Tags to exclude. Default: []
            headers (Dict[str, str]): Headers to include. Default: {}
            wait_for (int): Time to wait for page to load in ms. Default: 0
            json_options (Optional[Dict]): Options for JSON extraction. Default: None

    Returns:
        str: Scraped content in specified format
    """

5. Configuration Validation

Adding validation checks ensures that required fields are correctly set and of the right type.

Recommended Implementation:

@property
def config(self) -> Dict[str, Any]:
    required_fields = {
        "formats": list,
        "only_main_content": bool,
        "include_tags": list,
        "exclude_tags": list,
        "headers": dict
    }
    
    for field, expected_type in required_fields.items():
        if field not in self._config or not isinstance(self._config[field], expected_type):
            raise ValueError(f"Config field '{field}' must be of type {expected_type}")
    
    return self._config

Historical Context from Related PRs

  1. PR #249 feat: add support for local qdrant client #250 - Enhanced Configuration Handling highlighted the importance of merging user configurations with defaults effectively and avoiding overwrites.
  2. PR Fix #2586: Add ollama as optional dependency for PDFSearchTool #265 - Improved Type Annotations discussed the balance between strict typing and usability, indicating that more descriptive type hints can improve user understanding.
  3. PR fix: do not use deprecated distutils in FileWriterTool #280 - Error Handling Enhancements emphasized the need for clear and contextual error messages, which can significantly aid debugging efforts.

Conclusion

The changes made in this PR move in the right direction by improving type handling, documentation, and error management. Implementing the above recommendations will further enhance the robustness and usability of the FirecrawlScrapeWebsiteTool. Please consider these suggestions for the next revision.

Comment on lines 57 to 59
def __init__(self, api_key: Optional[str] = None, config: Optional[Dict[str, Any]] = None, **kwargs):
if config:
kwargs["config"] = config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a bit confused here since

config: Dict[str, Any] = Field(

means that config is required

however, your initializer is treating is as optional:

config: Optional[Dict[str, Any]] = None

Does this change to config really needed?

self._firecrawl = FirecrawlApp(api_key=api_key)

def _run(self, url: str):
def _run(self, url: str, timeout: Optional[int] = 30000):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this parameter very useless haha.. We are even using that here..

What about removing from FirecrawlScrapeWebsiteToolSchema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucasgomide 100% totally useless

- removing optional config
- removing timeout from Pydantic model
- removing config from __init__
Copy link
Contributor

@lucasgomide lucasgomide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @nicoferdi96 welcome and thanks for your first contribution

@lucasgomide lucasgomide merged commit f86c6ac into crewAIInc:main May 7, 2025
1 check passed
mplachta pushed a commit to mplachta/crewAI-tools that referenced this pull request Aug 27, 2025
* fix FirecrawlScrapeWebsiteTool: add missing config parameter and correct Dict type annotation

- Add required config parameter when creating the tool
- Change type hint from `dict` to `Dict` to resolve Pydantic validation issues

* Update firecrawl_scrape_website_tool.py

- removing optional config
- removing timeout from Pydantic model

* Removing config from __init__

- removing config from __init__

* Update firecrawl_scrape_website_tool.py

- removing timeout
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants