Conversation

@oded996 commented on Aug 25, 2025

This PR introduces a new tool, cloud_run_deploy_model, which simplifies the deployment of Large Language Models (LLMs) to Google Cloud Run.

Key Features:

  • New Tool: A cloud_run_deploy_model tool that supports deploying models using the Ollama and vLLM frameworks.
  • Writable GCS Mounts: The GCS volume is now mounted with write permissions, allowing Ollama to download and store models.
  • Improved Error Handling: The tool now provides clearer error messages for authentication issues (403) and service unavailability (503) during model download.
  • Retry Mechanism: A retry mechanism has been added to handle transient 503 errors, making the deployment process more robust (a minimal sketch follows this list).
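
For concreteness, the retry behavior described above could look roughly like the following sketch (the helper name, error shape, and backoff parameters are illustrative assumptions, not the tool's actual implementation):

```ts
// Sketch of retry-with-backoff for transient 503s during model download.
// All names and parameters here are illustrative, not the tool's real code.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      // Retry only transient 503s; 403s (authentication problems) fail
      // immediately with a clear error message, as described above.
      if (status !== 503 || attempt >= maxAttempts) throw err;
      // Exponential backoff between attempts.
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)),
      );
    }
  }
}
```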

Example

Prompt:

please deploy llama2 on cloud run, use ollama, project my-gcp-project, region europe-west1

Output:

Cloud Run service llama2 deployed in project my-gcp-project
Cloud Console: https://console.cloud.google.com/run/detail/europe-west1/llama2?project=my-gcp-project
Service URL: https://llama2-....a.run.app

@oded996 changed the title from "fix: Allow writable GCS mounts for Ollama models" to "feat: Add tool to deploy LLM models to Cloud Run" on Aug 25, 2025
@steren (Collaborator) left a comment

Thanks for the PR.
While I like the idea, I wonder if we should add it. What are the use cases for an Agent deploying a model?

Note that later, we will have presets that will cover this use case.


remove file


remove file

```ts
);

server.registerTool(
  'cloud_run_deploy_model',
```

remove the cloud_run_ prefix.

Suggestion: deploy_ai_model

```ts
framework: z.enum(['ollama', 'vllm']).describe('The framework to use for serving the model.'),
model: z.string().describe('The model to deploy from Ollama library or Hugging Face Hub.'),
```

Can you add more detail? Give examples of the accepted formats.
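
For example, the descriptions could spell out the accepted formats along these lines (one possible wording; the model IDs shown are illustrative):

```ts
framework: z
  .enum(['ollama', 'vllm'])
  .describe(
    'Serving framework: "ollama" pulls models from the Ollama library; "vllm" serves models from the Hugging Face Hub.',
  ),
model: z
  .string()
  .describe(
    'The model to deploy. For Ollama, a library tag such as "llama2" or "gemma:2b"; for vLLM, a Hugging Face repo ID such as "meta-llama/Llama-2-7b-chat-hf".',
  ),
```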

Refactors the vLLM deployment strategy to use a dedicated Cloud Function for streaming models from Hugging Face to GCS. This avoids slow local downloads and network bottlenecks.

Also includes:
- Hardening the vLLM container with --max-model-len and HF_HUB_OFFLINE.
- Correcting the container port to 8000.
- Cleaning up unused code in the deployment scripts.
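
For illustration, the hardened container settings mentioned above might look roughly like the following (the object shape loosely follows a Cloud Run container spec; names, paths, and values are assumptions, not the PR's actual code):

```ts
// Illustrative vLLM container settings; not the PR's actual deployment code.
const vllmContainer = {
  image: 'vllm/vllm-openai:latest',
  // vLLM's OpenAI-compatible server listens on port 8000 by default.
  ports: [{ containerPort: 8000 }],
  args: [
    '--model', '/mnt/models/my-model', // weights pre-staged via the GCS mount (path hypothetical)
    '--max-model-len', '4096',         // bound context length to cap GPU memory use
  ],
  env: [
    // With weights already streamed to GCS, force offline loading so vLLM
    // never contacts the Hugging Face Hub at startup.
    { name: 'HF_HUB_OFFLINE', value: '1' },
  ],
};
```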
