Automate the extraction of structured data from scientific PDFs using Google's Gemini 2.5 Flash. This tool transforms a chaotic folder of research papers into a structured, publication-quality Excel database, complete with an interactive Knowledge Graph and automated Citation Management.
- AI-Powered Deep Analysis: Extracts complex scientific metadata, including:
  - Central Problem, Hypothesis, and Objectives.
  - Independent/Dependent Variables (X/Y).
  - Methodology & Tools.
  - Key Results and Conclusions.
  - Short Summaries and Glossaries of technical terms.
- Pub-Quality Knowledge Graph:
  - Builds a dynamic NetworkX graph linking papers to technical concepts.
  - Uses an adjustText physics simulation to prevent label overlap.
  - Auto-embedded into the Excel workbook with customizable Obsidian-style dark aesthetics.
- Smart Excel Dashboard:
  - Dashboard: Clickable Table of Contents with summaries and glossary previews.
  - Individual Sheets: Dedicated pages for each paper with structured data.
  - Strict Deduplication: Automatically updates existing sheets instead of creating duplicates.
- Robust File Management:
  - Content-Based Deduplication: Uses MD5 hashing to move identical PDFs to a `duplicates/` folder, even if filenames differ.
  - Automated Renaming: Standardizes files to `Author-Year-ShortTitle.pdf`.
  - Safe Grounding: Uses Google Search to verify citations (DOI, journal, etc.) without losing the original paper's identity.
  - Garbage Collection: Automatically removes "zombie" log entries if files are deleted from the disk.
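The content-based deduplication described above can be sketched in a few lines. This is an illustrative stand-in, not the project's actual `verify_pdfs.py`: the function names and the keep-first-seen policy are assumptions.

```python
import hashlib
import shutil
from pathlib import Path

def md5_of(path: Path, chunk_size: int = 8192) -> str:
    """Hash a file's contents in chunks so large PDFs never load fully into memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def quarantine_duplicates(folder: Path) -> list[Path]:
    """Move byte-identical PDFs (by MD5) into folder/duplicates, keeping the first seen."""
    seen: dict[str, Path] = {}
    moved: list[Path] = []
    dup_dir = folder / "duplicates"
    for pdf in sorted(folder.glob("*.pdf")):
        digest = md5_of(pdf)
        if digest in seen:
            dup_dir.mkdir(exist_ok=True)
            target = dup_dir / pdf.name
            shutil.move(str(pdf), str(target))
            moved.append(target)
        else:
            seen[digest] = pdf
    return moved
```

Because the comparison is on file contents, `paper.pdf` and `paper (1).pdf` with identical bytes are caught even though their names differ.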
- Python 3.9+
- A Google Cloud API Key with access to Gemini 2.5 Flash.
The application uses a modular architecture with strict atomic logging to ensure data safety.
```mermaid
flowchart TD
    Start([Start]) --> Scan[Scan Input Directory]
    Scan --> StateCheck{In Processed Log?}
    StateCheck -- Yes --> Skip[Skip File]
    StateCheck -- No --> Upload[Upload to Gemini]
    Upload --> Analyze[AI Analysis & Metadata]
    Analyze --> Excel[Write to Excel]
    Excel --> AtomicLog[ATOMIC: Update Log & Registry]
    AtomicLog --> RenameTry{Try Rename?}
    RenameTry -- Success --> UpdateLog[Update Log with New Name]
    RenameTry -- Fail --> Warn[Log Warning & Continue]
    UpdateLog --> Loop[Next File]
    Warn --> Loop
    Skip --> Loop
    Loop --> Graph[Generate Knowledge Graph]
    Graph --> End([End])
```
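The flowchart's per-file loop can be sketched roughly as follows. This is a hypothetical outline, not the project's `main.py`; the callables passed in stand for the real modules. The key detail it illustrates is that the log is updated *before* the risky rename, so a rename failure never loses the analysis.

```python
from pathlib import Path
from typing import Callable

def process_folder(
    folder: Path,
    processed: set[str],
    analyze: Callable[[Path], dict],
    write_sheet: Callable[[dict], None],
    rename: Callable[[Path, dict], str],
) -> list[str]:
    """Walk the flowchart: skip logged files, analyze, write, log, then try to rename."""
    warnings: list[str] = []
    for pdf in sorted(folder.glob("*.pdf")):
        if pdf.name in processed:          # "In Processed Log?" -> skip
            continue
        meta = analyze(pdf)                # upload + AI analysis
        write_sheet(meta)                  # write to Excel
        processed.add(pdf.name)           # ATOMIC: log success *before* renaming
        try:
            new_name = rename(pdf, meta)   # may fail (locked file, bad chars, ...)
            processed.discard(pdf.name)
            processed.add(new_name)        # update log with the new name
        except OSError as err:
            warnings.append(f"{pdf.name}: {err}")  # log a warning and continue
    return warnings
```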
- `main.py`: The central orchestrator that manages the high-level flow.
- `state_manager.py`: Encapsulates all state tracking and JSON logging (`processed_log.json` and `processed_hashes.json`). Ensures atomic updates across runs.
- `file_utils.py`: Handles file system operations safely, including path normalization, unique filename generation, and robust renaming.
- `analyzer.py`: Manages interaction with the Google Gen AI SDK (Gemini 2.5 Flash).
- `excel_writer.py`: Manages the construction and formatting of the Excel database.
- `reference_manager.py`: Handles citation enrichment and BibTeX/APA/APS exports.
- `graph_builder.py`: Builds and visualizes the Knowledge Graph using NetworkX and Matplotlib.
- `verify_pdfs.py`: Provides MD5 hashing and integrity checks for PDF files.
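The "atomic updates" that `state_manager.py` promises are typically achieved with a write-to-temp-then-rename pattern. A minimal sketch, assuming that pattern (not the module's actual code):

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then os.replace() it
    into place. os.replace is atomic on both POSIX and Windows, so a crash
    mid-write can never leave processed_log.json half-written."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            # ensure_ascii=False keeps names like "Wójcik" readable in the log
            json.dump(data, f, ensure_ascii=False, indent=2)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

Readers of the log always see either the old complete state or the new complete state, which is what makes safe re-runs possible.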
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/papers-to-xlsx.git
  cd papers-to-xlsx
  ```

- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configuration: Create a `.env` file in the project root:

  ```
  GOOGLE_API_KEY=your_actual_api_key_here
  ```
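The `.env` file is presumably read with a package such as python-dotenv; purely for illustration, a stdlib-only stand-in that loads `GOOGLE_API_KEY` into the environment could look like this (the `load_env` helper is hypothetical):

```python
import os
from pathlib import Path

def load_env(path: Path = Path(".env")) -> None:
    """Tiny stand-in for python-dotenv: read KEY=VALUE lines into os.environ,
    skipping blank lines and # comments, without overriding existing values."""
    if not path.exists():
        return
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Whichever loader is used, the tool only needs `GOOGLE_API_KEY` to end up in the process environment before the Gemini client is created.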
Run the tool by providing the path to your PDF folder:

```bash
python3 main.py /path/to/your/pdf_folder
```

If you want to perform a fresh rebuild without re-downloading PDFs, use the cleanup tool:

```bash
python3 clean_state.py /path/to/your/pdf_folder
```

- `main.py`: Entry point and orchestrator.
- `state_manager.py`: Manages `processed_log.json` and `processed_hashes.json`.
- `file_utils.py`: Safe file operations and path manipulation.
- `analyzer.py`: Gemini API interface (file uploads, prompts, and JSON parsing).
- `excel_writer.py`: Handles formatting, the Dashboard, and Excel logic.
- `graph_builder.py`: Generates and embeds the NetworkX Knowledge Graph.
- `reference_manager.py`: Citation enrichment and BibTeX exports.
- `verify_pdfs.py`: MD5 hashing and file integrity.
- `clean_state.py`: Utility to reset the analysis state.
- `tests/`: Contains integration tests to verify the architecture.
  - `integration_test.py`: Validates clean runs, idempotency, and duplicate handling.
The script creates an outputs/ folder inside your target directory:
```
Target-Folder/
├── outputs/
│   ├── Paper_Analysis_Results.xlsx   # The main database
│   ├── processed_log.json            # Progress tracker
│   ├── processed_hashes.json         # Duplicate-prevention registry
│   ├── error_log.txt                 # Log of any failed attempts
│   └── citations/                    # BibTeX, APA, and APS exports
├── duplicates/                       # Identical files moved here
└── [Renamed-Papers].pdf              # Cleanly organized PDF files
```
The tool includes an automated integration test suite to ensure the stability of the processing pipeline.
To run the tests:
```bash
python3 tests/integration_test.py
```

The tests verify:
- Clean Run: Full process from empty state to Excel output.
- Idempotency: Ensuring re-runs safely skip already-analyzed files.
- Deduplication Resilience: Handling collisions and duplicate content detection.
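The idempotency check reduces to "a second run processes zero files." A toy sketch of that assertion, with `run_pipeline` standing in for the real pipeline (the actual suite exercises the full Excel and logging path):

```python
from pathlib import Path

def run_pipeline(folder: Path, processed: set[str]) -> int:
    """Toy stand-in for the pipeline: 'analyzes' each un-logged PDF,
    records it in the log, and returns how many files were processed."""
    count = 0
    for pdf in sorted(folder.glob("*.pdf")):
        if pdf.name not in processed:
            processed.add(pdf.name)
            count += 1
    return count

def check_idempotency(folder: Path) -> None:
    processed: set[str] = set()
    run_pipeline(folder, processed)            # clean run: everything processed
    second = run_pipeline(folder, processed)   # re-run: everything skipped
    assert second == 0, "re-run must not reprocess already-logged files"
```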
The tool includes several "Self-Healing" features:
- Atomic Success: Logs are updated before risky operations (like renaming) to prevent data loss or infinite loops.
- Collision Protection: If two papers result in the same standardized filename, the script automatically handles it with versioning (`_v2`, `_v3`).
- Unicode Safety: Handles complex characters (e.g., Wójcik) across all modules to prevent path-related crashes.
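The `_v2`/`_v3` versioning behavior can be illustrated with a small helper. This is a hypothetical sketch of the collision-handling idea, not the project's actual `file_utils.py` code:

```python
from pathlib import Path

def unique_name(folder: Path, stem: str, suffix: str = ".pdf") -> Path:
    """Return folder/stem.pdf, or stem_v2.pdf, stem_v3.pdf, ... on collision."""
    candidate = folder / f"{stem}{suffix}"
    version = 2
    while candidate.exists():
        candidate = folder / f"{stem}_v{version}{suffix}"
        version += 1
    return candidate
```

So two different papers that both standardize to `Smith-2021-Topic.pdf` end up side by side as `Smith-2021-Topic.pdf` and `Smith-2021-Topic_v2.pdf` instead of one overwriting the other.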