SujalRajpt/ResearchPaperCrawler

🧠 Research Paper Crawler & Summarizer

A Python tool that searches Semantic Scholar for research papers matching a given query, then builds a concise literature review by summarizing each paper's abstract with either a local Pegasus model or Google's Gemini API.


🚀 Features

  • 🔍 Searches for relevant papers using Semantic Scholar's API.
  • 🧾 Summarizes each abstract automatically using:
    • Pegasus (local, transformer-based)
    • or Google's Gemini API (if enabled)
  • 📚 Generates an overall literature summary with numbered references to each paper.
  • 🧪 Supports a debug mode to work offline using pre-downloaded mock papers.
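
As a rough illustration of the search step, the sketch below builds a request against the public Semantic Scholar Graph API paper search endpoint and keeps only results that have an abstract. It uses the standard library only; the function names `build_search_url`, `parse_papers`, and `fetch_papers` are hypothetical and are not taken from this repo's `semantic_scholar.py`, whose `search_papers()` may work differently.

```python
import json
import urllib.parse
import urllib.request

# Public Semantic Scholar Graph API search endpoint.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query, limit=10):
    """Return the request URL for a keyword search (title and abstract only)."""
    params = {"query": query, "limit": limit, "fields": "title,abstract"}
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_papers(payload):
    """Keep only results that actually carry an abstract to summarize."""
    return [
        {"title": p["title"], "abstract": p["abstract"]}
        for p in payload.get("data", [])
        if p.get("abstract")
    ]


def fetch_papers(query, limit=10):
    """Hypothetical stand-in for search_papers(); requires internet access."""
    url = build_search_url(query, limit)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_papers(json.load(resp))
```

Splitting URL building and response parsing from the network call keeps both pieces easy to check offline, which fits the debug-mode workflow.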

🧱 Project Structure

.
├── main.py                         # Entry point of the app
├── semantic_scholar.py            # Contains `search_papers()` logic
├── literature_summary.py          # Summarization utilities
├── json_raw_data/
│   └── human genome sequencing variation.json.json  # Mock paper data

⚙️ Installation

# Clone the repo
git clone https://github.com/SujalRajpt/ResearchPaperCrawler.git
cd ResearchPaperCrawler

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

📦 Dependencies

Some key packages include:

  • transformers
  • torch
  • requests
  • tqdm (optional, for progress bars)
  • semantic-scholar (or your wrapper for querying papers)

If requirements.txt is missing or out of date, you can regenerate it from your active environment using:

pip freeze > requirements.txt

🧪 Usage

1. Run in Debug Mode (with mock data)

python main.py

2. Run with Live Semantic Scholar Search

Edit main.py:

DEBUG_MODE = False

Then set the search query inside main.py, or modify the script to accept the query as a command-line argument.
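
One possible way to accept the query from the command line is sketched below with `argparse`. The `--live` flag, the `parse_args` helper, and the default query (borrowed from the mock data file name) are assumptions for illustration; main.py does not necessarily expose any of them.

```python
import argparse


def parse_args(argv=None):
    """Parse a search query and a live-mode switch (both hypothetical options)."""
    parser = argparse.ArgumentParser(
        description="Search papers and summarize their abstracts."
    )
    parser.add_argument(
        "query",
        nargs="?",
        default="human genome sequencing variation",  # assumed default
        help="search query sent to Semantic Scholar",
    )
    parser.add_argument(
        "--live",
        action="store_true",
        help="query Semantic Scholar instead of using the mock data",
    )
    return parser.parse_args(argv)
```

In main.py this could be wired up as `args = parse_args()` followed by `DEBUG_MODE = not args.live`, so that running without flags keeps the offline behavior.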


🧠 Example Output

=== Literature Review Summary ===
Paper A shows how genome sequencing improves diagnosis [1]. 
Another study highlights challenges in capturing variation [2]. 
...

References:
[1] Title of Paper A
[2] Title of Paper B
...
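
A minimal sketch of how output in this shape could be assembled from per-paper summaries follows. The `build_review` function and the `title`/`summary` dictionary keys are assumptions, not the actual interface of `literature_summary.py`.

```python
def build_review(papers):
    """Join per-paper summaries into one review with numbered references.

    `papers` is assumed to be a list of dicts with 'title' and 'summary' keys.
    """
    body = []
    references = []
    for number, paper in enumerate(papers, start=1):
        # Drop a trailing period so the citation marker sits before it.
        sentence = paper["summary"].strip().rstrip(".")
        body.append(f"{sentence} [{number}].")
        references.append(f"[{number}] {paper['title']}")
    lines = ["=== Literature Review Summary ===", *body, "", "References:", *references]
    return "\n".join(lines)
```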

🔐 Gemini API (Optional)

To use Gemini for summarization, update:

USE_GEMINI_API = True

And insert your API key into summarize_abstract_gemini():

api_key = "YOUR_API_KEY"
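
If you would rather not keep the key in source code, one alternative is to read it from an environment variable, as in the sketch below. The variable name `GEMINI_API_KEY` and the helper `get_gemini_api_key` are assumptions; the repo itself expects the key to be set inside `summarize_abstract_gemini()`.

```python
import os


def get_gemini_api_key(env=os.environ):
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get("GEMINI_API_KEY", "").strip()  # assumed variable name
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY before enabling USE_GEMINI_API.")
    return key
```

With this helper, the assignment above could become `api_key = get_gemini_api_key()`, which keeps the key out of version control.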

📌 Notes

  • Pegasus may truncate long abstracts to fit its input limit, so the end of a very long abstract may not be reflected in its summary.
  • Gemini is useful for faster or higher-quality summaries, but requires a valid API key and internet access.
  • Abstracts with job-related content (e.g., "apply", "email") are filtered out to avoid noise.
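
The job-content filter mentioned above could look roughly like the following. Only the two keywords given as examples are used; the real keyword list, the whole-word matching, and the function names are assumptions rather than the repo's actual logic.

```python
import re

# Only the two example keywords from the note; the real list may be longer.
JOB_KEYWORDS = ("apply", "email")

# Whole-word, case-insensitive match (an assumed matching rule).
_JOB_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, JOB_KEYWORDS)) + r")\b", re.IGNORECASE
)


def looks_job_related(abstract):
    """Return True if the text contains any job-posting keyword."""
    return bool(_JOB_PATTERN.search(abstract))


def filter_abstracts(papers):
    """Drop papers whose abstract looks like a job advertisement."""
    return [p for p in papers if not looks_job_related(p["abstract"])]
```

A keyword filter like this is deliberately blunt: a genuine abstract containing a phrase such as "we apply" would also be dropped, which is the trade-off for removing noise cheaply.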

📜 License

MIT License. See LICENSE for details.

