diff --git a/articles/gpt-oss/run-skypilot-vllm.md b/articles/gpt-oss/run-skypilot-vllm.md
new file mode 100644
index 0000000000..98c158a917
--- /dev/null
+++ b/articles/gpt-oss/run-skypilot-vllm.md
@@ -0,0 +1,202 @@
# How to run gpt-oss with SkyPilot and vLLM

[SkyPilot](https://skypilot.readthedocs.io/) is an open-source system that lets you run AI workloads on any infrastructure, cloud or Kubernetes, with unified execution and high GPU availability. Combined with [vLLM](https://docs.vllm.ai/en/latest/), you can deploy the **gpt-oss-20b** or **gpt-oss-120b** models across 16+ cloud providers, on-prem GPU workstations, and Kubernetes clusters, with automatic failover and cost optimization.

This guide shows you how to use SkyPilot's multi-cloud capabilities to serve gpt-oss models at scale, building on the vLLM serving setup covered in [our vLLM guide](https://cookbook.openai.com/articles/gpt-oss/run-vllm).

## Why SkyPilot + vLLM?

Using SkyPilot with vLLM provides several advantages:

- **Multi-cloud deployment**: Run on AWS, GCP, Azure, Lambda Labs, RunPod, and 10+ other clouds
- **On-premises and Kubernetes support**: Deploy on your own GPU workstations or existing K8s clusters
- **Reserved GPU optimization**: Efficiently utilize your pre-allocated cloud resources and reserved instances
- **Automatic GPU provisioning**: SkyPilot finds available GPUs across clouds and on-prem infrastructure
- **Cost optimization**: Automatically chooses the cheapest cloud/region or uses your existing hardware
- **Built-in autoscaling**: SkyServe provides production-ready scaling and load balancing across hybrid environments
- **Spot instance support**: Save up to 70% with automatic failover between spot, on-demand, and on-prem resources

## Prerequisites

Before starting, you'll need:

- **Python 3.11+** installed on your local machine
- **Cloud credentials** for at least one provider (AWS, GCP, Azure, Lambda Labs, Nebius, etc.)
- **GPU availability** in your chosen cloud(s)

## Quick Setup

1. **Install SkyPilot**
   Install SkyPilot with support for all cloud providers:

```shell
pip install 'skypilot[all]'
```

2. **Configure cloud credentials**
   SkyPilot will automatically detect credentials from your cloud CLI tools:

```shell
# Check which clouds are configured
sky check
```

3. **Create a SkyPilot configuration**
   Create a file named `gpt-oss-vllm.sky.yaml`:

```yaml
resources:
  accelerators: {H100:1, H200:1}
  disk_size: 512
  ports: 8000

envs:
  MODEL_NAME: openai/gpt-oss-20b

setup: |
  sudo apt-get update
  sudo apt-get install -y python3-dev build-essential
  uv venv --python 3.12 --seed
  source .venv/bin/activate
  uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
  uv pip install hf_xet openai==1.99.1
  hf download $MODEL_NAME

run: |
  source .venv/bin/activate
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 5.0
    upscale_delay_seconds: 300
    downscale_delay_seconds: 600
  readiness_probe: /v1/models
```
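The `accelerators: {H100:1, H200:1}` entry means either GPU is acceptable; SkyPilot picks whichever is cheapest and available at launch time. Before launching, you can preview availability and pricing for the candidate GPUs across your enabled clouds with SkyPilot's catalog command (exact output depends on which clouds `sky check` enabled for you):

```shell
# Compare on-demand and spot prices for each candidate GPU across clouds
sky show-gpus H100
sky show-gpus H200
```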
## Deploy to Any Cloud

### Basic deployment

Launch gpt-oss on the cheapest available GPU:

```shell
# For the 20B model
sky launch -c gpt-oss gpt-oss-vllm.sky.yaml

# For the 120B model
sky launch -c gpt-oss gpt-oss-vllm.sky.yaml --env MODEL_NAME=openai/gpt-oss-120b
```

SkyPilot will:

1. Find the cheapest cloud/region where one of the GPUs listed in `resources.accelerators` is available
2. Provision the instance
3. Install dependencies
4. Start the vLLM server
5. Expose the endpoint

### Get your endpoint

Once deployed, get the API endpoint:

```shell
ENDPOINT=$(sky status --endpoint 8000 gpt-oss)
echo "API available at: http://$ENDPOINT"
```

## Use the deployed model

The deployed model exposes the same OpenAI-compatible API as local vLLM.

Test with cURL:

```shell
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Explain what MXFP4 quantization is."
      }
    ]
  }' | jq .
```
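If the request fails, confirm the server is up before debugging further. The `/v1/models` route, the same one the `readiness_probe` in the YAML above points at, should return the deployed model:

```shell
# Expect a JSON model list that includes openai/gpt-oss-20b
curl http://$ENDPOINT/v1/models | jq .
```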
Test with Python:

```py
import os

from openai import OpenAI

# ENDPOINT was captured earlier with: sky status --endpoint 8000 gpt-oss
ENDPOINT = os.getenv("ENDPOINT")
client = OpenAI(
    base_url=f"http://{ENDPOINT}/v1",
    api_key="EMPTY",  # vLLM's server does not check the API key by default
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."},
    ],
)

print(result.choices[0].message.content)
```

## Production deployment with SkyServe

For production workloads, use [**SkyServe**](https://docs.skypilot.co/en/latest/serving/sky-serve.html) for automatic scaling, load balancing, and high availability:

```shell
sky serve up -n gpt-oss-service gpt-oss-vllm.sky.yaml
```

This provides:

- **Automatic scaling**: Scale replicas based on load
- **Load balancing**: Distribute requests across replicas
- **Health checks**: Automatic restart on failures
- **Rolling updates**: Zero-downtime deployments

Monitor your service:

```shell
sky serve status gpt-oss-service
```

## Integration with existing tools

The deployed endpoint is OpenAI-compatible, so it works with:

- [**LangChain**](https://www.langchain.com/): For building complex AI applications
- [**OpenAI Agents SDK**](https://openai.github.io/openai-agents-python/): For agentic workflows
- [**llm CLI tool**](https://github.com/simonw/llm): For command-line interactions
- **Any OpenAI-compatible client**: Drop-in replacement

## Cleanup

When you're done, clean up your resources:

```shell
# Fully terminate the cluster
sky down gpt-oss

# For SkyServe deployments
sky serve down gpt-oss-service

# Check all active resources
sky status
```
diff --git a/authors.yaml b/authors.yaml
index 592296d19d..0b51f22b7e 100644
--- a/authors.yaml
+++ b/authors.yaml
@@ -431,3 +431,8 @@ Vaibhavs10:
   name: "vb"
   website: "https://huggingface.co/reach-vb"
   avatar: "https://cdn-avatars.huggingface.co/v1/production/uploads/1655385361868-61b85ce86eb1f2c5e6233736.jpeg"
+
+alex000kim:
+  name: "Alex Kim"
+  website: "https://alex000kim.com"
+  avatar: "https://alex000kim.com/about/profile-pic.png"
diff --git a/registry.yaml b/registry.yaml
index 85f30787c4..e4d6c69594 100644
--- a/registry.yaml
+++ b/registry.yaml
@@ -4,6 +4,17 @@
 # should build pages for, and indicates metadata such as tags, creation date and
 # authors for each page.
 
+- title: How to run gpt-oss with SkyPilot and vLLM
+  path: articles/gpt-oss/run-skypilot-vllm.md
+  date: 2025-08-06
+  authors:
+    - alex000kim
+  tags:
+    - open-models
+    - gpt-oss
+    - deployment
+    - cloud
+
 - title: How to run gpt-oss-20b on Google Colab
   path: articles/gpt-oss/run-colab.ipynb
   date: 2025-08-06