An intelligent, AI-driven research assistant designed to accelerate pharmaceutical R&D by providing evidence-backed answers to complex biomedical questions. Built with Retrieval-Augmented Generation (RAG), this project leverages Large Language Models to synthesize information from PubMed and ClinicalTrials.gov, helping researchers identify drug targets, discover repurposing opportunities, and streamline literature reviews.
- Project Overview
- Features
- Architecture
- Technology Stack
- Project Structure
- Prerequisites
- Installation
- Configuration
- How to Run
- Project Phases
- Usage Examples
- Troubleshooting
- Future Enhancements
- Resources
- License
Pharma-Intellect solves a critical bottleneck in pharmaceutical research: the overwhelming volume of scientific literature and clinical trial data. Instead of manually reviewing thousands of documents, researchers can ask complex questions and receive synthesized, citation-backed answers in seconds.
- Time Savings: Reduces literature review time from days to minutes
- Risk Reduction: Helps uncover novel therapeutic pathways and drug repurposing opportunities
- Evidence-Based: Every answer is backed by citations to original sources
- Scalable: Can process thousands of research papers and clinical trials
Query: "Summarize recent phase 2 and phase 3 clinical trials that investigated the efficacy of risperidone or its derivatives for treating autoimmune disorders, and list any reported adverse events."
Result: A synthesized answer with direct links to relevant PubMed articles and ClinicalTrials.gov entries.
- Data collection from PubMed and ClinicalTrials.gov APIs
- Semantic search using vector embeddings (Sentence-Transformers)
- Fast vector database storage (ChromaDB)
- LLM-powered answer generation (OpenAI GPT-3.5)
- Interactive web interface (Streamlit)
- Citation tracking and source linking
- Query history management
- Example queries for quick start
- Error handling and logging
- Environment-based configuration
User Query
|
v
[Embedding] -> Convert query to vector using Sentence-Transformers
|
v
[Retrieval] -> Search ChromaDB for top-k similar documents
|
v
[Augmentation] -> Combine query + retrieved context
|
v
[Generation] -> Send to LLM (OpenAI GPT) for synthesis
|
v
[Response] -> Return answer + source citations
- Phase 1: Data Ingestion - Collect data from APIs and save to CSV
- Phase 2: Knowledge Base - Create embeddings and store in vector database
- Phase 3: RAG Engine - Build query engine with LLM integration
- Phase 4: Web Interface - Interactive Streamlit application
| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.13 | Core programming language |
| Web Framework | Streamlit | Interactive web interface |
| Data Sources | PubMed & ClinicalTrials.gov APIs | Biomedical literature and trial data |
| Embeddings | Sentence-Transformers (all-MiniLM-L6-v2) | Text-to-vector conversion |
| Vector Database | ChromaDB | Fast semantic search storage |
| LLM | OpenAI GPT-3.5-turbo | Answer generation |
| Data Processing | Pandas | ETL and data transformation |
| HTTP Requests | Requests library | API communication |
| Configuration | python-dotenv | Environment variable management |
pharma_intellect_project/
|-- README.md (This file)
|-- requirements.txt (Python dependencies)
|-- .gitignore (Git ignore rules)
|-- .env.example (Environment template)
|-- config.py (Configuration module)
|-- LICENSE (MIT License)
|
|-- Phase_1_Data_Collection/
| |-- data_collector.py (Fetch from APIs)
| |-- fetch_pubmed_alternative.py (Alternative PubMed source)
| |-- data/
| | |-- pubmed_data.csv
| | |-- clinical_trials_data.csv
| | |-- .gitkeep
|
|-- Phase_2_Knowledge_Base/
| |-- embeddings_creator.py (Generate embeddings)
| |-- quick_embeddings.py (Demo version)
| |-- chroma_db/ (Vector database)
|
|-- Phase_3_RAG_Engine/
| |-- rag_engine.py (RAG query engine)
|
|-- Phase_4_UI/
|-- app.py (Streamlit interface)
- Python 3.11 or higher (tested with 3.13)
- pip package manager
- Internet connection (for API calls)
- OpenAI API key (for LLM access)
- At least 2GB RAM
git clone https://github.com/Sharmaharshitnone/Pharma-Intellect.git
cd pharma_intellect_projectpython -m venv env
source env/bin/activate # On Windows: env\Scripts\activateYou should see (env) in your terminal prompt.
pip install -r requirements.txtpip list | grep -E "requests|pandas|sentence-transformers|streamlit|chromadb|openai"All packages should be listed.
Option 1: Export as environment variable
export OPENAI_API_KEY="sk-proj-your-actual-key"Option 2: Create .env file from template
cp .env.example .env
# Edit .env with your API key
nano .envThen add your key:
OPENAI_API_KEY=sk-proj-your-actual-key
Option 3: Load from Python code
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")Get your OpenAI API key from: https://platform.openai.com/api-keys
# 1. Activate environment
source env/bin/activate
# 2. Collect data
python Phase_1_Data_Collection/data_collector.py
# 3. Create embeddings
python Phase_2_Knowledge_Base/quick_embeddings.py
# 4. Launch web interface
streamlit run Phase_4_UI/app.pyThen open http://localhost:8501 in your browser.
python Phase_1_Data_Collection/data_collector.pyExpected output: CSV files with PubMed articles and clinical trials
python Phase_2_Knowledge_Base/quick_embeddings.pyExpected output: ChromaDB vector database initialized
python Phase_3_RAG_Engine/rag_engine.pyExpected output: Sample query result with answer and sources
streamlit run Phase_4_UI/app.pyOpens interactive web interface at http://localhost:8501
If data and embeddings already exist:
streamlit run Phase_4_UI/app.py- Enter your research question in the text area
- Adjust the number of sources (3-15 documents)
- Click "Get Answer"
- View the synthesized answer
- Expand sources to see detailed citations
- Click links to access original sources
Example questions:
- "What are the latest clinical trials for treating autoimmune disorders?"
- "Summarize recent findings on drug repurposing for oncology"
- "What adverse events have been reported for risperidone?"
Goal: Collect raw biomedical data from public APIs and transform into CSV files.
Data Sources:
- PubMed API: Biomedical literature abstracts
- ClinicalTrials.gov API v2: Clinical trial information
Key Functions:
- fetch_pubmed_data(): Query PubMed API
- fetch_clinical_trials_data(): Query ClinicalTrials.gov API
- clean_and_deduplicate(): Remove duplicates
Output: CSV files with structured data
- pubmed_data.csv: 100+ articles with title, abstract, PMID, URL
- clinical_trials_data.csv: 474+ trials with title, status, phase, outcomes
To Run:
python Phase_1_Data_Collection/data_collector.pyGoal: Convert text to vectors and store in ChromaDB for semantic search.
Process:
- Load CSV files from Phase 1
- Text chunking (300 chars with 50 char overlap)
- Generate embeddings using Sentence-Transformers
- Store in ChromaDB collections
Collections:
- pubmed_abstracts: PubMed articles
- clinical_trials: Clinical trial information
To Run:
python Phase_2_Knowledge_Base/quick_embeddings.pyGoal: Create query engine that retrieves documents and generates answers.
Process:
- Encode user query to vector
- Search ChromaDB collections
- Retrieve top-k most relevant documents
- Send query + context to OpenAI LLM
- Return synthesized answer with sources
Main Class: PharmaIntellectRAG
- retrieve_documents(query, num_results): Get relevant documents
- generate_answer(query, retrieved_docs): Generate LLM answer
- query(question, num_results): Complete query execution
To Run:
python Phase_3_RAG_Engine/rag_engine.pyGoal: Provide interactive user interface for queries.
Features:
- Text input for research questions
- Adjustable source count (3-15)
- Pre-loaded example queries
- Answer display with formatting
- Expandable source cards with metadata
- Query history tracking
To Run:
streamlit run Phase_4_UI/app.pyThen open http://localhost:8501
Input: "What clinical trials have investigated risperidone for treating autoimmune disorders?"
Output:
- Synthesized answer from relevant trials
- Links to each clinical trial
- Publication dates and trial status
- Reported efficacy and adverse events
Input: "Summarize the latest findings on oncology drug discovery targeting immune checkpoints."
Output:
- Summary of recent research
- Links to PubMed articles
- Publication dates
- Key findings and mechanisms
Input: "What adverse events have been reported in phase 3 trials for Valsartan in hypertension?"
Output:
- List of reported adverse events
- Frequency and severity
- Clinical trial references
- Direct links to trial data
Solution:
# Check if OPENAI_API_KEY is set
echo $OPENAI_API_KEY
# If empty, set it
export OPENAI_API_KEY="sk-proj-your-key"Solution:
# Regenerate embeddings
python Phase_2_Knowledge_Base/quick_embeddings.py
# Or check if chroma_db directory exists
ls -la chroma_db/Solution:
# Test API connectivity
curl -X GET "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=oncology&retmax=5&rettype=json"
# Check internet connection
ping 8.8.8.8Solution:
- Use quick_embeddings.py instead of full embeddings_creator.py
- Use GPU acceleration if available
- Reduce number of documents processed
Solution:
# Use different port
streamlit run Phase_4_UI/app.py --server.port 8502Solution:
- Close other applications
- Use quick_embeddings.py for demo
- Reduce NUM_PUBMED and NUM_TRIALS in quick_embeddings.py
- Increase system RAM or use cloud deployment
Solution:
- Check internet connection
- Wait a few minutes (API rate limiting)
- Use alternative data sources
- Reduce number of results requested
- Drug interaction checker for multiple compounds
- ML model for drug target prediction
- Google Patents API integration for prior art search
- SMILES parser for chemical structure queries
- Multi-language support for international papers
- R&D cost and timeline estimation
- Cloud deployment (AWS Lambda, Google Cloud Run)
- User authentication for enterprise use
- Advanced filtering and sorting
- Batch query processing
- Custom knowledge base upload
- Export results to PDF/Excel
- API endpoint for programmatic access
- PubMed API: https://www.ncbi.nlm.nih.gov/home/develop/api/
- ClinicalTrials.gov API: https://clinicaltrials.gov/api/gui
- Streamlit: https://docs.streamlit.io/
- ChromaDB: https://docs.trychroma.com/
- Sentence-Transformers: https://www.sbert.net/
- OpenAI API: https://platform.openai.com/docs/api-reference
This project is created for educational and research purposes. Licensed under MIT License.
Have improvements? Feel free to fork and submit PRs!
For questions or support, please open an issue on GitHub.
Created for Jubilant Pharmova AI Internship Application
Last Updated: October 2025