diff --git a/DEPLOYMENT_REQUIREMENTS.md b/DEPLOYMENT_REQUIREMENTS.md
new file mode 100644
index 0000000..eb2c7e3
--- /dev/null
+++ b/DEPLOYMENT_REQUIREMENTS.md
@@ -0,0 +1,179 @@
+# BioAnalyzer Deployment Requirements
+
+System requirements and deployment info for issue #25.
+
+## System Requirements
+
+### Hardware
+
+Minimum setup:
+- 2 CPU cores
+- 2GB RAM
+- 5GB disk space (for Docker image and dependencies)
+- Internet access for API calls
+
+Recommended for better performance:
+- 4+ CPU cores
+- 4GB+ RAM
+- 10GB+ disk space (for cache, logs, results)
+- Stable internet connection
+
+### Software
+
+You'll need:
+- Docker 20.0+ with Docker Compose 2.0+, or Python 3.8+ if not using Docker
+- Internet access for the NCBI E-utilities API and LLM provider APIs
+
+Optional but useful:
+- Redis for caching (SQLite is used by default)
+- A reverse proxy such as Nginx or Traefik for production
+- SSL/TLS certificates if you need HTTPS
+
+### Dependencies
+
+All dependencies are included in the Docker image or `requirements.txt`. The main ones are FastAPI, Uvicorn, LiteLLM, Paper-QA, and PyTorch (CPU version). See `config/requirements.txt` for the full list.
+
+## Deployment Options
+
+### API-Only Deployment
+
+Users don't need CLI access if you're only using the API. The CLI is optional and is useful mainly for command-line analysis, local development, and admin tasks.
+
+For API-only deployment, just run the FastAPI server; no CLI installation is needed. Everything works through REST API endpoints, so it can run on Shiny Server alongside other services such as metaharmonizer.
+
+### CLI Access
+
+You only need CLI access if users want to run analyses from the command line, or if you need it for admin tasks or local testing. If you do need it, you'll need a Python environment on the server and will have to run the `install.sh` script, which adds some complexity to the deployment.
+
+## Server Selection: Shiny Server vs Superstudio
+
+### Shiny Server
+
+A good choice for API-only deployment. You can run it alongside other services like metaharmonizer. Deployment is simpler since it's just the API service, and you don't need local LLM installations - it uses external APIs such as Gemini or OpenAI, which makes it easier to manage.
+
+Requirements:
+- Docker support or a Python 3.8+ environment
+- API keys for LLM providers (Gemini works well)
+- Port 8000 available (or change it in config)
+- Internet access for API calls
+
+Deployment is straightforward:
+```bash
+docker compose up -d
+# or
+python main.py --host 0.0.0.0 --port 8000
+```
+
+### Superstudio
+
+Use Superstudio if you have local LLM models installed and want to use them, need Ollama or Llamafile for local inference, want to avoid external API costs, or have specific requirements for on-premise LLM access.
+
+Requirements are the same as for Shiny Server, plus a local LLM setup (Ollama, Llamafile, etc.) and more resources if you're running local models.
+
+## API Key Requirements
+
+### Required API Keys
+
+You'll need an NCBI API key for PubMed/PMC data access. Get it from https://www.ncbi.nlm.nih.gov/account/settings/. It's free and raises the E-utilities rate limit to 10 requests/second (without a key you're limited to 3 requests/second).
+
+For the LLM, you need at least one API key. Gemini is recommended and has a free tier. Get it from https://makersuite.google.com/app/apikey. After the free tier it's pay-as-you-go.
+
+OpenAI and Anthropic are optional alternatives. OpenAI keys are at https://platform.openai.com/api-keys, Anthropic at https://console.anthropic.com/. Both are pay-per-use.
+
+### Creating a New API Key
+
+Yes, it's fine to create a new key for BioAnalyzer - in fact, a dedicated key is preferable. Store it in environment variables or a `.env` file, never commit it to version control, and keep an eye on usage and costs.
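Whichever provider you pick, it helps to validate keys once at startup rather than failing mid-analysis. A minimal sketch of such a check (the `check_api_keys` helper and its key lists are illustrative, not part of BioAnalyzer's codebase):

```python
import os

# Keys the deployment always needs (see the .env example below)
REQUIRED_KEYS = ["NCBI_API_KEY", "EMAIL"]
# At least one LLM provider key must be present
LLM_KEYS = ["GEMINI_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

def check_api_keys(env=os.environ):
    """Fail fast with a clear message if required keys are missing."""
    missing = [k for k in REQUIRED_KEYS if not env.get(k)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    if not any(env.get(k) for k in LLM_KEYS):
        raise RuntimeError(
            "Set at least one LLM provider key: " + ", ".join(LLM_KEYS)
        )

# Call check_api_keys() once during application startup,
# after loading the .env file (e.g. with python-dotenv).
```

Running this at startup turns a vague downstream API error into an immediate, actionable message.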
+
+### Environment Variables
+
+Create a `.env` file with:
+```bash
+# Required
+NCBI_API_KEY=your_ncbi_key_here
+EMAIL=your_email@example.com
+
+# At least one LLM provider
+GEMINI_API_KEY=your_gemini_key_here
+# OR
+OPENAI_API_KEY=your_openai_key_here
+# OR
+ANTHROPIC_API_KEY=your_anthropic_key_here
+
+# Optional: Local LLM (if using Ollama)
+OLLAMA_BASE_URL=http://localhost:11434
+LLM_PROVIDER=ollama
+```
+
+## Deployment Steps
+
+### Docker Deployment
+
+```bash
+git clone https://github.com/waldronlab/bioanalyzer-backend.git
+cd bioanalyzer-backend
+
+cp .env.example .env
+# Edit .env with your API keys
+
+docker compose build
+docker compose up -d
+
+curl http://localhost:8000/health
+```
+
+### Python Deployment
+
+```bash
+git clone https://github.com/waldronlab/bioanalyzer-backend.git
+cd bioanalyzer-backend
+
+python3 -m venv .venv
+source .venv/bin/activate
+
+# Install package from pyproject.toml (modern Python packaging, PEP 518/621)
+pip install -e .
+# This installs all dependencies defined in pyproject.toml
+
+# Create .env file with your API keys
+
+python main.py --host 0.0.0.0 --port 8000
+```
+
+## API Endpoints
+
+Once deployed, you can access:
+- Health check: `GET /health`
+- API docs: `GET /docs` (Swagger UI)
+- Analysis v1: `GET /api/v1/analyze/{pmid}`
+- Analysis v2 with RAG: `GET /api/v2/analyze/{pmid}`
+- Retrieval: `GET /api/v1/retrieve/{pmid}`
+- System status: `GET /api/v1/status`
+
+## Current Status
+
+As @lwaldron mentioned, the app isn't production-ready yet. It needs testing by a couple of people first.
+
+I'd suggest deploying to a staging environment first, having 2-3 users test the API endpoints, monitoring for issues, and fixing any problems before considering production. Shiny Server might be easier for testing since it's already set up there.
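For staging tests, the endpoint list above can be turned into a quick smoke-test helper that builds every documented URL for a given PMID (the `endpoint_urls` function is a hypothetical sketch; the base URL assumes the default port 8000):

```python
def endpoint_urls(base_url: str, pmid: str) -> dict:
    """Build the documented BioAnalyzer endpoint URLs for one PMID."""
    base = base_url.rstrip("/")  # tolerate a trailing slash
    return {
        "health": f"{base}/health",
        "docs": f"{base}/docs",
        "analyze_v1": f"{base}/api/v1/analyze/{pmid}",
        "analyze_v2": f"{base}/api/v2/analyze/{pmid}",
        "retrieve": f"{base}/api/v1/retrieve/{pmid}",
        "status": f"{base}/api/v1/status",
    }

urls = endpoint_urls("http://localhost:8000", "12345678")
print(urls["analyze_v2"])  # http://localhost:8000/api/v2/analyze/12345678
```

Feeding each URL to `curl` (or `requests.get`) and checking for a 200 response gives the 2-3 testers a repeatable checklist.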
+
+## Testing
+
+Before calling it done, make sure:
+- Health endpoint works
+- API docs are accessible at `/docs`
+- You can analyze a test PMID
+- You can retrieve paper data
+- Error handling works
+- API keys are configured correctly
+- Logs are being generated
+- Cache works (if enabled)
+- Rate limiting works (if enabled)
+
+## Troubleshooting
+
+If something goes wrong:
+- Check logs with `docker compose logs` or look in the `logs/` directory
+- Verify API keys are set correctly
+- Test the health endpoint: `curl http://localhost:8000/health`
+- Check network connectivity for API calls
+- See [PRODUCTION_DEPLOYMENT.md](docs/PRODUCTION_DEPLOYMENT.md) for more details
diff --git a/Dockerfile b/Dockerfile
index 4e1d6c0..5d2c48e 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -14,14 +14,16 @@ RUN apt-get update && apt-get install -y \
     git \
     && rm -rf /var/lib/apt/lists/*
 
-# Copy requirements first for better caching
-COPY config/requirements.txt .
+# Copy pyproject.toml and README.md first for better caching
+COPY pyproject.toml README.md ./
 
 # Upgrade pip and setuptools first
-RUN pip install --upgrade pip setuptools wheel
+RUN pip install --upgrade pip setuptools wheel build
 
 # ------------------------------------------------------------
 # Step 1: Install PyTorch CPU versions (fixed +cpu issue)
+# Note: PyTorch CPU versions require a special index URL, so we install them separately
+# before installing the package from pyproject.toml
 # ------------------------------------------------------------
 RUN pip install --no-cache-dir --default-timeout=600 --retries=10 \
     --extra-index-url https://download.pytorch.org/whl/cpu \
@@ -30,69 +32,23 @@ RUN pip install --no-cache-dir --default-timeout=600 --retries=10 \
     torch==2.1.0+cpu \
     torchvision==0.16.0+cpu \
     torchaudio==2.1.0+cpu
 
 # ------------------------------------------------------------
-# Step 2: Install ML and NLP packages
+# Step 2: Copy application code
 # ------------------------------------------------------------
-RUN pip install --no-cache-dir --default-timeout=300 --retries=5 \
-    transformers>=4.34.0 \
-    scikit-learn>=1.3.0 \
-    pandas>=2.1.1 \
-    numpy>=1.26.0 \
-    sentencepiece>=0.1.99 \
-    accelerate>=0.24.0 \
-    datasets>=2.14.0 \
-    tiktoken>=0.5.0 \
-    tokenizers>=0.14.1
-
-# ------------------------------------------------------------
-# Step 3: Install web framework and async packages
-# ------------------------------------------------------------
-RUN pip install --no-cache-dir --default-timeout=300 --retries=5 \
-    fastapi>=0.104.0 \
-    "uvicorn[standard]>=0.23.2" \
-    aiohttp>=3.8.6 \
-    websockets>=11.0.3 \
-    python-multipart>=0.0.5 \
-    aiofiles>=0.7.0 \
-    pydantic>=2.4.2 \
-    starlette>=0.31.1 \
-    httptools>=0.3.0 \
-    h11>=0.12.0 \
-    wsproto>=1.0.0
-
-# ------------------------------------------------------------
-# Step 4: Install utility packages
-# ------------------------------------------------------------
-RUN pip install --no-cache-dir --default-timeout=300 --retries=5 \
-    requests>=2.31.0 \
-    beautifulsoup4>=4.12.2 \
-    lxml>=4.9.0 \
-    openpyxl>=3.1.0 \
-    xlrd>=2.0.1 \
-    tqdm>=4.65.0 \
-    python-dotenv>=1.0.0 \
-    click>=8.0.1 \
-    PyYAML>=5.4.1 \
-    watchfiles>=1.0.0 \
-    typing-extensions>=3.10.0.2 \
-    pytz>=2023.3 \
-    biopython>=1.81 \
-    google-generativeai>=0.7.2
+COPY . .
 
 # ------------------------------------------------------------
-# Step 5: Install paper-qa from PyPI
+# Step 3: Install the package from pyproject.toml
+# This installs the package and all its dependencies from pyproject.toml
+# PyTorch is already installed above, so pip will skip it
+# Installing in editable mode (-e) ensures entry points are properly installed
 # ------------------------------------------------------------
-RUN pip install --no-cache-dir --default-timeout=300 --retries=5 paper-qa>=5.0.0
+RUN pip install --no-cache-dir --default-timeout=300 --retries=5 -e .
 # ------------------------------------------------------------
-# Step 6: Install testing dependencies
+# Step 4: Install testing dependencies (optional, for development)
 # ------------------------------------------------------------
 RUN pip install --no-cache-dir pytest>=7.4.0 pytest-cov>=4.1.0
 
-# ------------------------------------------------------------
-# Copy application code
-# ------------------------------------------------------------
-COPY . .
-
 # Create necessary directories
 RUN mkdir -p cache logs results
diff --git a/README.md b/README.md
index c626343..50ed0e8 100644
--- a/README.md
+++ b/README.md
@@ -178,9 +178,12 @@ cd bioanalyzer-backend
 python3 -m venv .venv
 source .venv/bin/activate
 
-# Install dependencies
-pip install -r config/requirements.txt
+# Install dependencies and package
+# The package uses pyproject.toml (PEP 518/621) for modern Python packaging
 pip install -e .
+# Or install with optional dependencies:
+# pip install -e .[dev]  # for development dependencies
+# pip install -e .[cli]  # for CLI enhancements
 
 # Set up environment (optional)
 cp .env.example .env
@@ -648,9 +651,10 @@ bioanalyzer-backend/
 │   ├── chunking.py           # Text chunking service
 │   └── performance_logger.py # Performance monitoring
 ├── config/                   # Configuration files
-│   ├── requirements.txt      # Python dependencies
-│   ├── setup.py              # Package configuration
+│   ├── requirements.txt      # Python dependencies (legacy)
 │   └── pytest.ini            # Test configuration
+├── pyproject.toml            # Modern Python packaging (PEP 518/621)
+├── setup.py                  # Legacy setup.py (kept for backward compatibility)
 ├── docs/                     # Documentation
 │   ├── README.md             # Main documentation
 │   ├── DOCKER_DEPLOYMENT.md  # Docker deployment guide
@@ -750,9 +754,9 @@ cd bioanalyzer-backend
 python -m venv .venv
 source .venv/bin/activate  # On Windows: .venv\Scripts\activate
 
-# Install dependencies
-pip install -r config/requirements.txt
-pip install -e .[dev]
+# Install dependencies and package
+# The package uses pyproject.toml (PEP 518/621) for modern Python packaging
+pip install -e .[dev]  # Installs package with development dependencies
 
 # Set up pre-commit hooks
 pre-commit install
diff --git a/docs/QUICKSTART.md b/docs/QUICKSTART.md
index 3ceca1b..392bcf3 100644
--- a/docs/QUICKSTART.md
+++ b/docs/QUICKSTART.md
@@ -45,8 +45,9 @@ sudo apt install python3.12-venv python3-full
 python3 -m venv .venv
 source .venv/bin/activate
 
-pip install -r config/requirements.txt
+# Install package from pyproject.toml (modern Python packaging)
 pip install -e .
+# Or with development dependencies: pip install -e .[dev]
 
 cat > .env << EOF
 NCBI_API_KEY=your_ncbi_key_here
diff --git a/pyproject.toml b/pyproject.toml
new file mode 100644
index 0000000..028fb64
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,118 @@
+[build-system]
+requires = ["setuptools>=61.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "bioanalyzer-backend"
+version = "1.0.0"
+description = "A specialized AI-powered tool for analyzing scientific papers for BugSigDB curation readiness"
+readme = "README.md"
+requires-python = ">=3.8"
+license = {text = "MIT"}
+authors = [
+    {name = "BioAnalyzer Team", email = "team@bioanalyzer.org"}
+]
+keywords = ["bioinformatics", "microbiome", "bugsigdb", "curation", "ai", "analysis"]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Science/Research",
+    "Topic :: Scientific/Engineering :: Bio-Informatics",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.8",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+]
+dependencies = [
+    # Core dependencies - PyTorch (standard versions)
+    # Note: For CPU-only versions, install separately with:
+    # pip install --extra-index-url https://download.pytorch.org/whl/cpu torch>=2.1.0+cpu torchvision>=0.16.0+cpu torchaudio>=2.1.0+cpu
+    "torch>=2.1.0",
+    "torchvision>=0.16.0",
+    "torchaudio>=2.1.0",
+    # Core ML dependencies
+    "transformers>=4.34.0",
+    "scikit-learn>=1.3.0",
+    "pandas>=2.1.1",
+    "numpy>=1.26.0",
+    "biopython>=1.81",
+    # NLP and LLM dependencies
+    "sentencepiece>=0.1.99",
+    "accelerate>=0.24.0",
+    "datasets>=2.14.0",
+    "google-generativeai>=0.7.2",
+    "tiktoken>=0.5.0",
+    "litellm>=1.50.0",
+    # Paper-QA agent integration
+    "paper-qa>=5.0.0",
+    # Web and data retrieval
+    "requests>=2.31.0",
+    "beautifulsoup4>=4.12.2",
+    "lxml>=4.9.0",
+    # Excel file processing
+    "openpyxl>=3.1.0",
+    "xlrd>=2.0.1",
+    # Utilities
+    "tqdm>=4.65.0",
+    "python-dotenv>=1.0.0",
+    "psutil>=5.9.0",
+    # WebSocket dependencies
+    "fastapi>=0.104.0",
+    "uvicorn[standard]>=0.23.2",
+    "aiohttp>=3.8.6",
+    "websockets>=11.0.3",
+    "python-multipart>=0.0.5",
+    "aiofiles>=0.7.0",
+    "pydantic>=2.4.2",
+    "typing-extensions>=3.10.0.2",
+    "starlette>=0.31.1",
+    "click>=8.0.1",
+    "h11>=0.12.0",
+    "httptools>=0.3.0",
+    "PyYAML>=5.4.1",
+    "watchfiles[watchdog]>=1.0.0",
+    "wsproto>=1.0.0",
+    "tokenizers>=0.14.1",
+    "pytz>=2023.3",
+    # Vector database (optional - for persistent storage)
+    "qdrant-client>=1.7.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=7.4.0",
+    "pytest-cov>=4.1.0",
+    "black>=23.0.0",
+    "flake8>=6.0.0",
+    "mypy>=1.5.0",
+]
+cli = [
+    "click>=8.0.1",
+    "rich>=13.0.0",
+    "tabulate>=0.9.0",
+]
+torch-cpu = [
+    # PyTorch CPU versions - install separately with: pip install -e ".[torch-cpu]"
+    # Note: These require --extra-index-url https://download.pytorch.org/whl/cpu
+    # Install with: pip install --extra-index-url https://download.pytorch.org/whl/cpu torch>=2.1.0+cpu torchvision>=0.16.0+cpu torchaudio>=2.1.0+cpu
+    # Or use the standard PyTorch versions from PyPI
+]
+
+[project.urls]
+Homepage = "https://github.com/your-repo/bioanalyzer-backend"
+Documentation = "https://github.com/your-repo/bioanalyzer-backend/docs"
+Repository = "https://github.com/your-repo/bioanalyzer-backend"
+"Bug Reports" = "https://github.com/your-repo/bioanalyzer-backend/issues"
+
+[project.scripts]
+bioanalyzer = "cli:main"
+bioanalyzer-cli = "cli:main"
+
+[tool.setuptools]
+packages = {find = {}}
+include-package-data = true
+
+[tool.setuptools.package-data]
+"*" = ["*.md", "*.txt", "*.yml", "*.yaml"]
diff --git a/setup.py b/setup.py
index 4ea6420..0d08709 100644
--- a/setup.py
+++ b/setup.py
@@ -1,3 +1,18 @@
+"""
+Legacy setup.py - kept for backward compatibility.
+
+This project uses pyproject.toml (PEP 518/621) as the primary source for package metadata.
+Modern tools (pip, build, etc.) will automatically use pyproject.toml when available.
+
+This setup.py is maintained for:
+- Backward compatibility with older tools
+- Fallback support if pyproject.toml is not available
+- Tools that haven't migrated to pyproject.toml yet
+
+For new installations, use: pip install -e .
+This will automatically use pyproject.toml.
+"""
+
 from setuptools import setup, find_packages
 
 # Read the README file