arxiv_paper_tool.py by HarikrishnanK9 · Pull Request #310 · crewAIInc/crewAI-tools

HarikrishnanK9 · 2025-05-23T08:51:33Z

Custom tool for fetching data from arXiv. It will help the CrewAI community use this data in their projects and experiments. This integration is valuable for learners, enthusiasts, and researchers who want to find and explore advanced research results in arXiv papers on various topics with the help of CrewAI.

joaomdmoura · 2025-05-23T08:54:05Z

Disclaimer: This review was made by a crew of AI Agents.

Code Review Comment for `arxiv_paper_tool.py`

Overview

The implementation of the ArxivPaperTool for fetching and downloading ArXiv papers is commendable, showcasing a well-structured approach to utilizing the ArXiv API. However, several key improvements can enhance code quality, maintainability, and overall functionality.

Positive Aspects

Type Annotations and Data Validation: The integration of type hints and Pydantic models aids in clarity and error prevention.
Effective Logging and Error Handling: The focus on logging enhances the tool's debuggability during both development and production use.
Politeness to API: The inclusion of rate limiting via sleep intervals exhibits a responsible approach to API usage.

Identified Issues and Recommendations

1. Duplicate Imports

The os module is imported twice. This should be corrected to improve code cleanliness:

import os  # Remove duplicate occurrences

2. Unused Imports

Eliminate the imports from crewai since they are not utilized in this context:

# Remove these lines
from crewai import Agent, Task, Crew, LLM
from crewai.process import Process

3. Use of Constants

Hardcoded values should be defined as constants within the class for better maintainability:

class ArxivPaperTool(BaseTool):
    BASE_API_URL = "http://export.arxiv.org/api/query"
    SLEEP_DURATION = 1

This allows for easier modifications and encourages clarity regarding the purpose of these values.

4. Improvement in XML Namespace Handling

Reduce redundancy with XML namespace handling by employing a single method:

def _get_element_text(self, entry, element_name):
    # Instance method: a @staticmethod has no self, so self.ATOM_NAMESPACE would raise NameError
    element = entry.find(f'{self.ATOM_NAMESPACE}{element_name}')
    return element.text.strip() if element is not None and element.text else None

5. Refined PDF URL Extraction

Optimize the function for extracting PDF links:

def _extract_pdf_url(self, entry) -> Optional[str]:
    return next((link.attrib['href'] for link in entry.findall(f'{self.ATOM_NAMESPACE}link') if 'pdf' in link.attrib.get('title', '').lower()), None)

This uses next() over a generator expression to return the first matching link, or None when no link has a title containing "pdf".
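The helpers from items 4 and 5 can be tried outside the tool. The following is a minimal sketch, not the PR's implementation: it uses module-level functions instead of methods, a hand-written sample entry rather than real API output, and assumes `ATOM_NAMESPACE` holds the Atom namespace in ElementTree's `{uri}` form.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumption: the Atom namespace in ElementTree's "{uri}" form.
ATOM_NAMESPACE = "{http://www.w3.org/2005/Atom}"

# Hand-written sample entry for illustration only (placeholder IDs, not real API output).
SAMPLE_ENTRY = """
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>  An Example Paper  </title>
  <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
  <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
</entry>
""".strip()


def get_element_text(entry: ET.Element, element_name: str) -> Optional[str]:
    """Return the stripped text of a namespaced child, or None if missing or empty."""
    element = entry.find(f"{ATOM_NAMESPACE}{element_name}")
    if element is None or element.text is None:
        return None
    return element.text.strip()


def extract_pdf_url(entry: ET.Element) -> Optional[str]:
    """Return the href of the first link whose title mentions 'pdf', else None."""
    links = entry.findall(f"{ATOM_NAMESPACE}link")
    return next(
        (
            link.attrib["href"]
            for link in links
            if "pdf" in link.attrib.get("title", "").lower() and "href" in link.attrib
        ),
        None,
    )


entry = ET.fromstring(SAMPLE_ENTRY)
title = get_element_text(entry, "title")      # whitespace around the title is stripped
missing = get_element_text(entry, "summary")  # the sample has no <summary> element
pdf_url = extract_pdf_url(entry)
print(title, missing, pdf_url)
```

Looking the element up once and checking for `None` before calling `.strip()` avoids both the duplicated `find` call and an `AttributeError` on empty elements.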

6. Separating Formatting Logic

Create a dedicated method for generating formatted output from paper metadata:

def _format_paper_result(self, paper: dict) -> str:
    summary = (paper['summary'][:self.SUMMARY_TRUNCATE_LENGTH] + '...') if len(paper['summary']) > self.SUMMARY_TRUNCATE_LENGTH else paper['summary']
    return f"Title: {paper['title']}\nAuthors: {', '.join(paper['authors'])}\nPublished: {paper['published_date']}\nPDF: {paper['pdf_url'] or 'N/A'}\nSummary: {summary}"

7. Input Validation

Enhance validation on max_results and other inputs:

class ArxivToolInput(BaseModel):
    max_results: int = Field(5, ge=1, le=100, description="Max results to fetch; must be between 1 and 100")

8. File Safety Enhancements

Ensure the validity of file paths to mitigate security risks:

def _validate_save_path(self, save_dir: str) -> str:
    # Return the resolved path; reassigning the argument alone would discard it
    save_dir = os.path.abspath(save_dir)
    os.makedirs(save_dir, exist_ok=True)
    return save_dir

9. Documentation Improvements

Enhance the code with detailed docstrings for classes and methods, improving user comprehension and aiding future maintainers.

10. Enhanced Error Messages

Make error handling more informative for troubleshooting:

def download_pdf(self, pdf_url: str, save_path: str):
    try:
        urllib.request.urlretrieve(pdf_url, save_path)
    except urllib.error.URLError as e:
        logger.error(f"Network error occurred while downloading {pdf_url}: {e}")
        raise
    except OSError as e:
        logger.error(f"File save error for {save_path}: {e}")
        raise

Security Considerations

URL Validation: Ensure that all URLs being fetched are validated to prevent malicious downloads.
Timeouts: Implement timeouts in network requests to improve robustness against connectivity issues.
File Path Validation: Mitigate risks of directory traversal attacks by validating paths provided for saving files.
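The three security points above could be combined roughly as follows. This is a sketch under stated assumptions, not code from the PR: `ALLOWED_HOSTS`, `REQUEST_TIMEOUT_SECONDS`, `is_allowed_url`, and `resolve_inside` are hypothetical names, and the set of hosts a real tool should trust is a project decision.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse

# Hypothetical allow-list and timeout; both values are assumptions for illustration.
ALLOWED_HOSTS = {"arxiv.org", "export.arxiv.org"}
REQUEST_TIMEOUT_SECONDS = 30


def is_allowed_url(url: str) -> bool:
    """Accept only http(s) URLs whose host is exactly one of the allowed hosts."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS


def resolve_inside(base_dir: str, filename: str) -> str:
    """Return the absolute path of filename under base_dir, rejecting traversal."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, filename))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"{filename!r} escapes {base_dir!r}")
    return target


def download_pdf(pdf_url: str, base_dir: str, filename: str) -> str:
    """Download pdf_url into base_dir after validating the URL and the target path."""
    if not is_allowed_url(pdf_url):
        raise ValueError(f"Refusing to download from {pdf_url!r}")
    target = resolve_inside(base_dir, filename)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # urlretrieve takes no timeout argument, so urlopen with timeout= is used instead.
    with urllib.request.urlopen(pdf_url, timeout=REQUEST_TIMEOUT_SECONDS) as response:
        with open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

Comparing the parsed hostname against an exact allow-list rejects look-alike hosts such as `arxiv.org.evil.example`, which a substring check on the raw URL would accept.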

Performance Considerations

Asynchronous Handling: Consider running multiple downloads concurrently (for example, with async I/O) to improve performance under heavy workloads.
Caching Mechanisms: Implement caching for repeated requests to enhance speed and reduce load on the ArXiv server.
Retry Logic: Introduce retry mechanisms for failed requests for better fault tolerance.
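The caching and retry suggestions above can be illustrated without touching the network. The sketch below is an assumption-laden example rather than the tool's code: `fetch_with_retry`, `CachedFetcher`, and `flaky_fetch` are hypothetical names, the cache is a plain in-memory dict, and the fetch function is simulated. Concurrency is left out.

```python
import time
from typing import Callable, Dict


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on OSError with exponential backoff.

    urllib.error.URLError is a subclass of OSError, so it is covered as well.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... base_delay
    raise AssertionError("unreachable")


class CachedFetcher:
    """Wrap a fetch function with retries and an in-memory cache keyed by URL."""

    def __init__(self, fetch: Callable[[str], str],
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._fetch = fetch
        self._sleep = sleep
        self._cache: Dict[str, str] = {}

    def __call__(self, url: str) -> str:
        if url not in self._cache:
            self._cache[url] = fetch_with_retry(self._fetch, url, sleep=self._sleep)
        return self._cache[url]


# Simulated fetch that fails twice before succeeding (stands in for a network call).
calls = {"count": 0}


def flaky_fetch(url: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary failure")
    return f"response for {url}"


fetcher = CachedFetcher(flaky_fetch, sleep=lambda seconds: None)  # skip real sleeping
query_url = "http://export.arxiv.org/api/query?search_query=all:electron"
first = fetcher(query_url)   # two failures, then success on the third attempt
second = fetcher(query_url)  # served from the cache, no further fetch calls
```

Injecting `sleep` keeps the backoff testable; production code would leave the default `time.sleep` in place and might bound the cache size or add expiry.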

Conclusion

The arxiv_paper_tool.py serves as a valuable tool for users accessing academic papers. Implementing the recommendations outlined above will substantially enhance functionality, security, and maintainability. Future modifications should also include documentation and consistent code review practices to uphold quality.

Please review these suggestions and consider implementing the proposed changes to ensure a robust and reliable tool for the community.

HarikrishnanK9 · 2025-05-23T09:55:46Z

@joaomdmoura Updated as per your review

HarikrishnanK9 · 2025-05-31T17:11:15Z

Hi @joaomdmoura, I’ve updated the PR based on the feedback. Let me know when you are available to review it. @lucasgomide tagging you as well in case you're available to help review. Thanks, Crewai Team, for your valuable time and consideration.

HarikrishnanK9 · 2025-06-03T06:30:20Z

@joaomdmoura I am waiting for review and approval. Please let me know if any further changes are required from my side. Thank You

crewai_tools/tools/arxiv_paper_tool/arxiv_paper_tool.py

lucasgomide

@HarikrishnanK9 thank you for your collaboration. I just dropped some comments

Would you mind sharing a loom video showing this tool working within Crew?

I'm also missing tests for this tool. Can you add some?

HarikrishnanK9 · 2025-06-07T16:29:46Z

@lucasgomide Apologies for the delayed response. Thank you for your valuable time and for the detailed review and instructions. I will complete the changes as soon as possible. Once everything is finalised on my end, I will inform you for the final review.

lucasgomide · 2025-07-02T17:19:50Z

@HarikrishnanK9 hey! is it ready for review?

HarikrishnanK9 · 2025-07-03T03:07:44Z

@lucasgomide I've addressed the feedback and made the necessary updates. I didn’t inform you earlier as I hadn’t included any examples using the tool. Could you please review the updated version? If it’s not fully aligned with CrewAI requirements, I’m happy to revise it accordingly. Apologies for the delay — I was unavailable due to some personal commitments. I'm currently available and also working in parallel to add a few more tools and examples

HarikrishnanK9 · 2025-07-10T04:43:33Z

@lucasgomide it's ready for review! If you find anything that’s not fully aligned with the CrewAI standards or workflows, I’m happy to revise it further.

HarikrishnanK9 · 2025-07-15T09:26:21Z

Hi @lucasgomide, just following up on this PR. Let me know if any changes are needed.

lucasgomide · 2025-07-15T16:54:30Z

The test is still failing 🤔

We fixed this one - sqlite3.OperationalError: no such module: fts5 - last week. Can you try syncing your branch?

HarikrishnanK9 · 2025-07-17T03:31:42Z

@lucasgomide syncing the branch resolved those issues.


HarikrishnanK9 · 2025-07-28T02:48:52Z

Hi @lucasgomide, I just wanted to follow up one more time on this PR. All feedback has been addressed. Thanks again for your time

* arxiv_paper_tool.py
* Updating as per the review
* Update __init__.py
* Update __init__.py
* Update arxiv_paper_tool.py
* added test cases
* Create README.md
* Create Examples.md
* Update Examples.md
* Updated logger
* Updated with package_dependencies,env_vars

arxiv_paper_tool.py

149dc54

Updating as per the review

df9dd0d

HarikrishnanK9 added 3 commits May 29, 2025 09:10

Merge branch 'crewAIInc:main' into main

98d4228

Update __init__.py

b9928ad

Update __init__.py

8334446