# How to run gpt-oss locally with Ollama

Want to get [**OpenAI gpt-oss**](https://openai.com/open-models) running on your own hardware? This guide walks you through using [Ollama](https://ollama.ai) to set up **gpt-oss-20b** or **gpt-oss-120b** locally, chat with it offline, use it through an API, and even connect it to the Agents SDK.

Note that this guide is meant for consumer hardware, like running a model on a PC or Mac. For server applications with dedicated GPUs like NVIDIA’s H100s, [check out our vLLM guide](https://cookbook.openai.com/articles/gpt-oss/run-vllm).

## Pick your model

Ollama supports both model sizes of gpt-oss:

- **`gpt-oss-20b`**
  - The smaller model
  - Best with **≥16GB VRAM** or **unified memory**
  - Perfect for higher-end consumer GPUs or Apple Silicon Macs
- **`gpt-oss-120b`**
  - Our larger, full-sized model
  - Best with **≥60GB VRAM** or **unified memory**
  - Ideal for multi-GPU or beefy workstation setups

**A couple of notes:**

- These models ship **MXFP4 quantized** out of the box; no other quantization is currently available.
- You _can_ offload to CPU if you’re short on VRAM, but expect it to run slower.

## Quick setup

1. **Install Ollama** → [Get it here](https://ollama.com/download)
2. **Pull the model you want:**

```shell
# For 20B
ollama pull gpt-oss:20b

# For 120B
ollama pull gpt-oss:120b
```
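If you want to double-check the download, Ollama’s CLI can list your local models and show details for a specific one:

```shell
# Confirm the model is available locally
ollama list

# Inspect details such as architecture, parameters, and quantization
ollama show gpt-oss:20b
```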
## Chat with gpt-oss

Ready to talk to the model? You can fire up a chat in the app or the terminal:

```shell
ollama run gpt-oss:20b
```

Ollama applies a **chat template** out of the box that mimics the [OpenAI harmony format](https://cookbook.openai.com/articles/openai-harmony). Type your message and start the conversation.

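If you just want a quick one-off answer instead of an interactive session, `ollama run` also accepts the prompt directly on the command line:

```shell
# Single prompt, non-interactive
ollama run gpt-oss:20b "Explain what MXFP4 quantization is in one sentence."
```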
## Use the API

Ollama exposes a **Chat Completions-compatible API**, so you can use the OpenAI SDK without changing much. Here’s a Python example:

```py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Local Ollama API
    api_key="ollama"                       # Dummy key
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)

print(response.choices[0].message.content)
```

If you’ve used the OpenAI SDK before, this will feel instantly familiar.

Alternatively, you can use the Ollama SDKs in [Python](https://github.com/ollama/ollama-python) or [JavaScript](https://github.com/ollama/ollama-js) directly.

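As a rough sketch, the same chat through the Ollama Python SDK looks like this (assuming you’ve installed it with `pip install ollama`):

```py
import ollama

# Chat via Ollama's native Python SDK instead of the OpenAI client
response = ollama.chat(
    model="gpt-oss:20b",
    messages=[
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ],
)

print(response.message.content)
```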
## Using tools (function calling)

Ollama can:

- Call functions
- Use a **built-in browser tool** (in the app)

Example of invoking a function via Chat Completions:

```py
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)

print(response.choices[0].message)
```

Because the models can perform tool calls as part of their chain-of-thought (CoT), it’s important to pass the reasoning and tool calls returned by the API back into the follow-up request in which you supply the tool result, and to keep doing so until the model reaches a final answer.

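Here’s a minimal sketch of such a loop, reusing the `client` and `tools` objects defined above; `get_weather` below is a hypothetical local implementation of the declared tool:

```py
import json

def get_weather(city: str) -> str:
    # Hypothetical tool implementation; replace with a real weather lookup
    return f"The weather in {city} is sunny."

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]

while True:
    response = client.chat.completions.create(
        model="gpt-oss:20b",
        messages=messages,
        tools=tools,
    )
    message = response.choices[0].message

    if not message.tool_calls:
        # No further tool calls: the model has produced its final answer
        print(message.content)
        break

    # Return the assistant turn (its tool calls and any reasoning it carries)
    # to the conversation before adding the tool results
    messages.append(message)

    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = get_weather(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result,
        })
```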
## Responses API workarounds

Ollama doesn’t (yet) support the **Responses API** natively.

If you do want to use the Responses API, you can use [**Hugging Face’s `Responses.js` proxy**](https://github.com/huggingface/responses.js) to convert Chat Completions into the Responses API.

For basic use cases you can also [**run our example Python server with Ollama as the backend.**](https://github.com/openai/gpt-oss?tab=readme-ov-file#responses-api) Note that this is a basic example server, not a production-ready implementation.

```shell
pip install gpt-oss
python -m gpt_oss.responses_api.serve \
    --inference_backend=ollama \
    --checkpoint gpt-oss:20b
```
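Once the example server is running, you should be able to talk to it with the OpenAI SDK’s Responses methods. The snippet below is a sketch only: the `base_url` (host, port, and path) is an assumption, so point it at whatever address the server reports on startup:

```py
from openai import OpenAI

# Assumed address of the local example Responses server; adjust base_url to
# the host/port the server actually prints when it starts.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="ollama")

response = client.responses.create(
    model="gpt-oss:20b",
    input="Explain what MXFP4 quantization is.",
)

print(response.output_text)
```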
## Agents SDK integration

Want to use gpt-oss with OpenAI’s **Agents SDK**?

Both Agents SDKs (Python and TypeScript) let you override the OpenAI base client to point at Ollama via Chat Completions, or at your Responses.js proxy, for your local models (a sketch of this approach follows the LiteLLM example below). Alternatively, you can use the built-in integrations to point the Agents SDK at third-party models:

- **Python:** Use [LiteLLM](https://openai.github.io/openai-agents-python/models/litellm/) to proxy to Ollama through LiteLLM
- **TypeScript:** Use [AI SDK](https://openai.github.io/openai-agents-js/extensions/ai-sdk/) with the [ollama adapter](https://ai-sdk.dev/providers/community-providers/ollama)

Here’s a Python Agents SDK example using LiteLLM:

```py
import asyncio
from agents import Agent, Runner, function_tool, set_tracing_disabled
from agents.extensions.models.litellm_model import LitellmModel

set_tracing_disabled(True)

@function_tool
def get_weather(city: str):
    print(f"[debug] getting weather for {city}")
    return f"The weather in {city} is sunny."


async def main():
    agent = Agent(
        name="Assistant",
        instructions="You only respond in haikus.",
        # No API key needed for a local Ollama server
        model=LitellmModel(model="ollama/gpt-oss:120b"),
        tools=[get_weather],
    )

    result = await Runner.run(agent, "What's the weather in Tokyo?")
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())
```
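If you’d rather not go through LiteLLM, a rough sketch of the base-client override mentioned above could look like the following; it assumes the Agents SDK’s `OpenAIChatCompletionsModel` wrapper and Ollama’s Chat Completions endpoint on `localhost:11434`:

```py
import asyncio
from openai import AsyncOpenAI
from agents import Agent, Runner, OpenAIChatCompletionsModel, set_tracing_disabled

set_tracing_disabled(True)

# Point an AsyncOpenAI client at the local Ollama Chat Completions endpoint
ollama_client = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

agent = Agent(
    name="Assistant",
    instructions="You only respond in haikus.",
    model=OpenAIChatCompletionsModel(model="gpt-oss:20b", openai_client=ollama_client),
)

async def main():
    result = await Runner.run(agent, "What's the weather in Tokyo?")
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())
```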