A FastAPI-based service for running inference with DeepSeek language models. This API provides a simple interface for text generation using DeepSeek's 7B model.
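The sketch below shows roughly what the generation endpoint could look like. It is illustrative only: the route name, request schema, and model ID are assumptions, and the actual implementation lives in src/main.py.

```python
# Illustrative sketch - endpoint name, schema, and model ID are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/deepseek-llm-7b-chat"  # assumed model repo; adjust as needed

app = FastAPI(title="DeepSeek Inference API")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    # Tokenize the prompt, run generation, and return the decoded completion
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}
```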
Why Cloud Run?
Cloud Run offers several advantages for deploying AI inference APIs, especially for early-stage projects and startups:
- Pay-per-use pricing: Only pay for actual compute time used, ideal for sporadic workloads
- Auto-scaling: Scales to zero when not in use, perfect for development and testing
- Cost efficiency: No need to maintain constantly running instances
- Serverless: Focus on code, not infrastructure
- GPU support: Access to serverless GPUs (NVIDIA L4) without long-term commitments
- Quick deployment: From code to production in minutes
Cost Optimization
- Zero cost when the service is idle
- Perfect for development and testing phases
- No minimum monthly commitments
Development Flexibility
- Easy A/B testing of different models
- Quick iteration and deployment
- Simple rollback capabilities
Security & Control
- Self-hosted solution reduces dependency on third-party services
- Protection against service disruptions
- Full control over model versions and updates
Scalability
- Handles traffic spikes automatically
- Scales down to zero during quiet periods
- No infrastructure management overhead
Features
- FastAPI-based REST API
- Support for DeepSeek models
- Environment-based configuration (see the sketch after this list)
- Token-based authentication with Hugging Face
- Docker support for containerization
- GPU acceleration support
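The environment-based configuration and Hugging Face token handling listed above might look roughly like this sketch (the variable names HF_TOKEN, MODEL_ID, and MAX_NEW_TOKENS are illustrative assumptions, not a fixed contract):

```python
# config.py - illustrative sketch; environment variable names are assumptions
import os
from dataclasses import dataclass

@dataclass
class Settings:
    hf_token: str | None = os.environ.get("HF_TOKEN")  # Hugging Face auth token
    model_id: str = os.environ.get("MODEL_ID", "deepseek-ai/deepseek-llm-7b-chat")
    max_new_tokens: int = int(os.environ.get("MAX_NEW_TOKENS", "256"))

settings = Settings()
```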
Prerequisites
- Python 3.11+
- Hugging Face account and API token
- GPU support (recommended)
- pip or another Python package manager
Set Concurrency
- Adjust request concurrency to optimize resource usage
- Example: --concurrency 80
Memory/CPU Allocation
- Start with minimal resources
- Scale up based on actual usage patterns
Monitoring
- Use Cloud Monitoring to track usage
- Set up alerts for unusual patterns
Build-Time Model Download
This approach (sketched below):
- Downloads model during build time
- Caches the model in the image
- Uses local files at runtime
- No need for token at runtime
- Faster container startup
Benefits:
- Faster cold starts
- No runtime downloads
- More reliable
- Works in airgapped environments
- Better for production
The tradeoff is a larger container image, but the runtime benefits usually outweigh this.
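One way to implement the build-time download is a small script run from the Dockerfile; the sketch below uses huggingface_hub and assumes the token arrives as an HF_TOKEN build argument. The script name, model ID, and target directory are assumptions.

```python
# download_model.py - hypothetical build-time script, e.g. invoked by
# `RUN python download_model.py` in the Dockerfile so the weights are baked into the image.
import os

from huggingface_hub import snapshot_download

MODEL_ID = os.environ.get("MODEL_ID", "deepseek-ai/deepseek-llm-7b-chat")  # assumed repo
LOCAL_DIR = os.environ.get("MODEL_DIR", "/app/models/deepseek-7b")         # assumed path

if __name__ == "__main__":
    # The token is only needed here, at build time; the runtime container loads local files.
    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=LOCAL_DIR,
        token=os.environ.get("HF_TOKEN"),
    )
    print(f"Downloaded {MODEL_ID} to {LOCAL_DIR}")
```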
Getting Started
- Clone or fork the repository
```bash
# Set up a virtual environment
python3 -m venv venv

# Activate the environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the FastAPI application (development)
uvicorn src.main:app --reload --port 8000
```
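With the dev server running, you can exercise it with a quick client call like the one below (the /generate path and payload shape are assumptions based on the sketch above; match them to the actual routes in src/main.py):

```python
# Quick local smoke test; endpoint path and payload shape are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Explain serverless GPUs in one sentence.", "max_new_tokens": 64},
    timeout=120,  # generation on a 7B model can take a while
)
resp.raise_for_status()
print(resp.json())
```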
```bash
# Build the image (assumes the Dockerfile accepts the Hugging Face token as a build argument)
docker build --build-arg HF_TOKEN=$HF_TOKEN -t deepseek-inference-api .

# Run the container (map the port your Dockerfile exposes)
docker run -p 8000:8000 deepseek-inference-api
```
Deploy to Cloud Run using one of these two methods:
```bash
# Option 1: build the image using Cloud Build
gcloud builds submit --config cloudbuild.yaml

# Then deploy the built image to Cloud Run
gcloud run deploy deepseek-service \
  --image gcr.io/$PROJECT_ID/deepseek-inference-api \
  --region us-central1 \
  --platform managed \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --memory 16Gi \
  --cpu 4 \
  --allow-unauthenticated
```
```bash
# Option 2: deploy directly from source (Cloud Build builds the image for you)
gcloud run deploy deepseek-service \
  --source . \
  --region us-central1 \
  --platform managed \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --memory 16Gi \
  --cpu 4 \
  --allow-unauthenticated
```
Alternatively, deploy from the Google Cloud console:
- Go to Cloud Run in the console
- Create a new Cloud Run service
- Connect it to Cloud Build and the GitHub repository containing this code
- Deploy with the Dockerfile
- Ensure the service uses the recommended serverless GPU configuration shown above (GPU, 16Gi memory, 4 CPUs)
Testing
Run pytest to execute the unit and integration tests:
```bash
pytest
```
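A test in this suite might look roughly like the following, assuming a /generate endpoint and FastAPI's TestClient; the real tests live in the repository and may mock the model so they run without GPU access or the full 7B weights.

```python
# test_api.py - illustrative sketch; endpoint name and response shape are assumptions
from fastapi.testclient import TestClient

from src.main import app

client = TestClient(app)

def test_generate_returns_text():
    # Hypothetical request/response shape; a real test would likely mock generation
    response = client.post("/generate", json={"prompt": "Hello", "max_new_tokens": 8})
    assert response.status_code == 200
    assert "text" in response.json()
```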