This repository was archived by the owner on Nov 10, 2025. It is now read-only.

arxiv_paper_tool.py#310

Merged
lucasgomide merged 14 commits into crewAIInc:main from
HarikrishnanK9:main
Jul 29, 2025

Conversation

@HarikrishnanK9
Contributor

Custom tool for fetching data from arXiv. It will help the CrewAI community use this data in their projects and experiments. This integration is crucial for learners, enthusiasts, and researchers who want to identify and explore advanced research outcomes from arXiv papers on various topics with the help of CrewAI.

@joaomdmoura
Collaborator

Disclaimer: This review was made by a crew of AI Agents.

Code Review Comment for arxiv_paper_tool.py

Overview

The implementation of the ArxivPaperTool for fetching and downloading ArXiv papers is commendable, showcasing a well-structured approach to utilizing the ArXiv API. However, several key improvements can enhance code quality, maintainability, and overall functionality.

Positive Aspects

  1. Type Annotations and Data Validation: The integration of type hints and Pydantic models aids in clarity and error prevention.
  2. Effective Logging and Error Handling: The focus on logging enhances the tool's debuggability during both development and production use.
  3. Politeness to API: The inclusion of rate limiting via sleep intervals exhibits a responsible approach to API usage.

Identified Issues and Recommendations

1. Duplicate Imports

The os module is imported twice. This should be corrected to improve code cleanliness:

import os  # Remove duplicate occurrences

2. Unused Imports

Eliminate the imports from crewai since they are not utilized in this context:

# Remove these lines
from crewai import Agent, Task, Crew, LLM
from crewai.process import Process

3. Use of Constants

Hardcoded values should be defined as constants within the class for better maintainability:

class ArxivPaperTool(BaseTool):
    BASE_API_URL = "http://export.arxiv.org/api/query"
    SLEEP_DURATION = 1

This allows for easier modifications and encourages clarity regarding the purpose of these values.

4. Improvement in XML Namespace Handling

Reduce redundancy with XML namespace handling by employing a single method:

def _get_element_text(self, entry, element_name):
    # Look the element up once; an instance method is needed because ATOM_NAMESPACE is read via self.
    element = entry.find(f'{self.ATOM_NAMESPACE}{element_name}')
    return element.text.strip() if element is not None and element.text else None
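As a rough illustration of recommendation 4, the sketch below runs such a helper against a hand-written Atom entry. The sample XML, the module-level ATOM_NAMESPACE constant, and the function name get_element_text are assumptions made for this example; they are not taken from the PR.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed constant: the Atom namespace in ElementTree's "{uri}" prefix form.
ATOM_NAMESPACE = "{http://www.w3.org/2005/Atom}"

# Hand-written sample entry, for illustration only (not real API output).
SAMPLE_ENTRY = """
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>  An Example Paper  </title>
  <summary></summary>
</entry>
"""


def get_element_text(entry: ET.Element, element_name: str) -> Optional[str]:
    """Return the stripped text of a child element, or None if absent or empty."""
    element = entry.find(f"{ATOM_NAMESPACE}{element_name}")
    if element is None or element.text is None:
        return None
    return element.text.strip()


entry = ET.fromstring(SAMPLE_ENTRY)
title = get_element_text(entry, "title")          # present, padded with spaces
summary = get_element_text(entry, "summary")      # present but empty
published = get_element_text(entry, "published")  # missing from the sample
```

Checking element.text as well as the element itself avoids an AttributeError on empty tags such as the summary in the sample.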

5. Refined PDF URL Extraction

Optimize the function for extracting PDF links:

def _extract_pdf_url(self, entry) -> Optional[str]:
    return next((link.attrib['href'] for link in entry.findall(f'{self.ATOM_NAMESPACE}link') if 'pdf' in link.attrib.get('title', '').lower()), None)

This cleanly uses the next generator pattern to find the desired link.
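A self-contained sketch of the same pattern, assuming a link is identified by a title attribute containing "pdf". The sample entries, the example.org URLs, and the standalone function name extract_pdf_url are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed constant: the Atom namespace in ElementTree's "{uri}" prefix form.
ATOM_NAMESPACE = "{http://www.w3.org/2005/Atom}"

# Hand-written entries with placeholder URLs, for illustration only.
ENTRY_WITH_PDF = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<link href="https://example.org/abs/placeholder" rel="alternate"/>'
    '<link title="pdf" href="https://example.org/pdf/placeholder" rel="related"/>'
    "</entry>"
)
ENTRY_WITHOUT_PDF = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<link href="https://example.org/abs/placeholder" rel="alternate"/>'
    "</entry>"
)


def extract_pdf_url(entry: ET.Element) -> Optional[str]:
    """Return the href of the first link whose title mentions 'pdf', else None."""
    links = entry.findall(f"{ATOM_NAMESPACE}link")
    return next(
        (
            link.attrib["href"]
            for link in links
            if "pdf" in link.attrib.get("title", "").lower() and "href" in link.attrib
        ),
        None,  # default returned when no link matches
    )


pdf_url = extract_pdf_url(ENTRY_WITH_PDF)
no_pdf = extract_pdf_url(ENTRY_WITHOUT_PDF)
```

The second argument to next() is what keeps the expression from raising StopIteration when an entry has no PDF link.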

6. Separating Formatting Logic

Create a dedicated method for generating formatted output from paper metadata:

def _format_paper_result(self, paper: dict) -> str:
    summary = (paper['summary'][:self.SUMMARY_TRUNCATE_LENGTH] + '...') if len(paper['summary']) > self.SUMMARY_TRUNCATE_LENGTH else paper['summary']
    return f"Title: {paper['title']}\nAuthors: {', '.join(paper['authors'])}\nPublished: {paper['published_date']}\nPDF: {paper['pdf_url'] or 'N/A'}\nSummary: {summary}"

7. Input Validation

Enhance validation on max_results and other inputs:

class ArxivToolInput(BaseModel):
    max_results: int = Field(5, ge=1, le=100, description="Max results to fetch; must be between 1 and 100")

8. File Safety Enhancements

Ensure the validity of file paths to mitigate security risks:

def _validate_save_path(self, save_dir: str) -> str:
    # Resolve to an absolute path, create the directory if needed,
    # and return the resolved path so the caller actually uses it.
    save_dir = os.path.abspath(save_dir)
    os.makedirs(save_dir, exist_ok=True)
    return save_dir

9. Documentation Improvements

Enhance the code with detailed docstrings for classes and methods, improving user comprehension and aiding future maintainers.

10. Enhanced Error Messages

Make error handling more informative for troubleshooting:

def download_pdf(self, pdf_url: str, save_path: str):
    try:
        urllib.request.urlretrieve(pdf_url, save_path)
    except urllib.error.URLError as e:
        logger.error(f"Network error occurred while downloading {pdf_url}: {e}")
        raise
    except OSError as e:
        logger.error(f"File save error for {save_path}: {e}")
        raise

Security Considerations

  1. URL Validation: Ensure that all URLs being fetched are validated to prevent malicious downloads.
  2. Timeouts: Implement timeouts in network requests to improve robustness against connectivity issues.
  3. File Path Validation: Mitigate risks of directory traversal attacks by validating paths provided for saving files.
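The three points above could be sketched roughly as follows. The host allowlist, the function names, and the default timeout are assumptions made for this example; the PR's actual checks may differ.

```python
import os
import urllib.request
from urllib.parse import urlparse

# Assumption: only these hosts are considered acceptable download sources.
ALLOWED_HOSTS = {"arxiv.org", "export.arxiv.org"}


def is_allowed_pdf_url(url: str) -> bool:
    """Accept only http(s) URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS


def resolve_inside(base_dir: str, filename: str) -> str:
    """Resolve filename under base_dir, rejecting paths that escape it."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, filename))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"{filename!r} resolves outside {base_dir!r}")
    return target


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download a validated URL, giving up after `timeout` seconds."""
    if not is_allowed_pdf_url(url):
        raise ValueError(f"Refusing to fetch {url!r}")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

For instance, resolve_inside("papers", "../secrets.txt") raises ValueError, while resolve_inside("papers", "paper.pdf") returns an absolute path under the papers directory.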

Performance Considerations

  1. Asynchronous Handling: Consider adopting async methodologies for trying multiple downloads concurrently to improve performance for heavy workloads.
  2. Caching Mechanisms: Implement caching for repeated requests to enhance speed and reduce load on the ArXiv server.
  3. Retry Logic: Introduce retry mechanisms for failed requests for better fault tolerance.
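A minimal sketch of the retry and caching suggestions, assuming exponential backoff and an in-memory cache keyed by URL. All names here (fetch_with_retry, CachedFetcher, flaky_fetch) are hypothetical, and the network call is replaced by a fake that fails twice so the example runs offline. Concurrent downloads are not covered.

```python
import time
from typing import Callable, Dict, List, Tuple, Type


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying with exponential backoff on the given errors."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be at least 1")


class CachedFetcher:
    """Remember successful responses so repeated URLs are fetched only once."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}

    def __call__(self, url: str) -> bytes:
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]


# Fake fetcher standing in for a network call: fails twice, then succeeds.
calls: List[str] = []


def flaky_fetch(url: str) -> bytes:
    calls.append(url)
    if len(calls) < 3:
        raise OSError("simulated network failure")
    return b"payload"


delays: List[float] = []  # records requested sleeps instead of really sleeping
cached = CachedFetcher(lambda u: fetch_with_retry(flaky_fetch, u, sleep=delays.append))
first = cached("https://example.org/query")
second = cached("https://example.org/query")  # served from the cache
```

Retrying on OSError also covers urllib.error.URLError, which is a subclass of it. Passing sleep as a parameter keeps the backoff testable without waiting.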

Conclusion

The arxiv_paper_tool.py serves as a valuable tool for users accessing academic papers. Implementing the recommendations outlined above will substantially enhance functionality, security, and maintainability. Future modifications should also include documentation and consistent code review practices to uphold quality.

Please review these suggestions and consider implementing the proposed changes to ensure a robust and reliable tool for the community.

@HarikrishnanK9
Contributor Author

@joaomdmoura Updated as per your review

@HarikrishnanK9
Contributor Author

HarikrishnanK9 commented May 31, 2025

Hi @joaomdmoura, I’ve updated the PR based on the feedback. Let me know when you are available to review it. @lucasgomide tagging you as well in case you're available to help review. Thanks, Crewai Team, for your valuable time and consideration.

@HarikrishnanK9
Contributor Author

@joaomdmoura I am waiting for review and approval. Please let me know if any further changes are required from my side. Thank you.

Contributor

@lucasgomide lucasgomide left a comment


@HarikrishnanK9 thank you for your collaboration. I just dropped some comments

Would you mind sharing a loom video showing this tool working within Crew?

I'm also missing tests for this tool. Can you add some?

@HarikrishnanK9
Contributor Author

@lucasgomide Apologies for the delayed response. Thank you for your valuable time and for the detailed review and instructions. I will complete the changes as soon as possible. Once everything is finalised on my end, I will inform you for the final review.

@lucasgomide
Contributor

@HarikrishnanK9 hey! is it ready for review?

@HarikrishnanK9
Contributor Author

@lucasgomide I've addressed the feedback and made the necessary updates. I didn’t inform you earlier as I hadn’t included any examples using the tool. Could you please review the updated version? If it’s not fully aligned with CrewAI requirements, I’m happy to revise it accordingly. Apologies for the delay — I was unavailable due to some personal commitments. I'm currently available and also working in parallel to add a few more tools and examples

@HarikrishnanK9
Contributor Author

@lucasgomide it's ready for review! If you find anything that’s not fully aligned with the CrewAI standards or workflows, I’m happy to revise it further.

@HarikrishnanK9
Contributor Author

Hi @lucasgomide, just following up on this PR. Let me know if any changes are needed.

@lucasgomide
Contributor

The test is still failing 🤔

We fixed this one (sqlite3.OperationalError: no such module: fts5) last week. Can you try syncing your branch?

@HarikrishnanK9
Contributor Author

@lucasgomide syncing the branch resolved those issues.

@HarikrishnanK9
Contributor Author

Hi @lucasgomide, I just wanted to follow up one more time on this PR. All feedback has been addressed. Thanks again for your time

@lucasgomide lucasgomide merged commit a4d5b3b into crewAIInc:main Jul 29, 2025
4 checks passed
mplachta pushed a commit to mplachta/crewAI-tools that referenced this pull request Aug 27, 2025
* arxiv_paper_tool.py

* Updating as per the review

* Update __init__.py

* Update __init__.py

* Update arxiv_paper_tool.py

* added test cases

* Create README.md

* Create Examples.md

* Update Examples.md

* Updated logger

* Updated with package_dependencies,env_vars