diff --git a/docs/blog/2026-03-18-responses-api.md b/docs/blog/2026-03-18-responses-api.md new file mode 100644 index 0000000000..b39900274a --- /dev/null +++ b/docs/blog/2026-03-18-responses-api.md @@ -0,0 +1,91 @@ +--- +slug: responses-api +title: "Your Agent, Your Rules: Building Powerful Agents with the Responses API in Llama Stack" +authors: [jwm4] +tags: [responses-api, agents, rag, mcp, open-responses] +date: 2026-03-18 +--- + +The [Responses API](https://developers.openai.com/blog/responses-api) is rapidly emerging as one of the most influential interfaces for building AI agents. It handles multi-step reasoning, tool orchestration, and conversational state in a single interaction, which is a big improvement over the manual orchestration loops that developers had to build on top of chat completion APIs. Llama Stack's implementation of the Responses API brings these capabilities to the open source world, where you can choose your own models and run on your own infrastructure. + +This post covers why the Responses API matters, what Llama Stack's implementation enables, and how it connects to the broader move toward open agent standards like Open Responses. + +{/* truncate */} + +## Why the Responses API? + +Before the Responses API, building an agent that could use tools was a multi-step exercise in client-side orchestration. Your application had to call the model with a list of available tools, inspect the response for tool call requests, execute those tools, send the results back, and repeat until the model produced a final answer. All of the state management, error handling, and retry logic lived in your code. + +This approach put a real burden on application developers. The orchestration logic got duplicated across every application, and subtle mistakes in state management could lead to poor accuracy or unnecessary model calls. + +The Responses API moves this orchestration to the server. The client sends a question along with a set of available tools and documents, and the server handles the planning, tool execution, and synthesis internally. Your client code gets much simpler, and the behavior is more consistent because the orchestration logic is shared rather than reimplemented by every application. + +## What Llama Stack brings to the table + +Llama Stack is an open source server for building AI applications. It provides a unified set of APIs for inference, RAG, tool calling, safety, evaluation, and more, backed by a pluggable provider architecture that lets you swap components without changing application code. + +Llama Stack implements the Responses API with support for built-in RAG through `file_search`, automated multi-tool orchestration through the Model Context Protocol (MCP), conversation state management, and compatibility with the OpenAI client ecosystem. + +But the interesting part is what Llama Stack adds beyond the API surface itself. + +### Model freedom + +With a proprietary hosted service, the Responses API is tied to a specific set of models from a single provider. With Llama Stack, you can use any model accessible through its inference providers: open source models like the Llama family, fine-tuned models you've created yourself, or optimized models from the broader ecosystem. The same Responses API interface works regardless of which model backs it. You can start with a small model during development, scale up for production, or swap models entirely, and your application code stays the same. + +### Data sovereignty + +If you work in a regulated industry like finance, healthcare, or government, sending sensitive documents to a third-party cloud service is often a non-starter. Llama Stack lets you run the entire stack on your own infrastructure: the model, the vector store for RAG, and the tool execution environment. Documents stay within your security perimeter, and the agent's reasoning about those documents does too. + +### Open, extensible architecture + +Llama Stack's provider architecture means you are not locked into a single implementation for any component. Need FAISS for your vector store in development and Milvus in production? Change a configuration setting. Want to use Ollama locally and a cloud inference provider in production? Same application code, different distribution. This flexibility extends across the full Llama Stack API surface, not just inference. + +## Private RAG with `file_search` + +Retrieval-augmented generation (RAG) grounds a model's responses in authoritative documents, which reduces hallucination and enables accurate answers from private knowledge bases. + +The Responses API formalizes RAG with the `file_search` tool. You create a vector store, upload documents to it, and then include `file_search` as an available tool when calling the Responses API. The model generates search queries, retrieves relevant passages, and synthesizes them into a grounded answer, all in a single API call. + +With Llama Stack, this entire pipeline runs on your infrastructure. Document ingestion, embedding, storage, retrieval, and synthesis all happen locally. The response includes references to the source passages, so your application can provide citations for verification. + +This makes it practical to build RAG applications over sensitive internal documents like compliance policies, medical records, or proprietary research, with confidence that the data never leaves your environment. + +## Multi-tool orchestration with MCP + +The Responses API gets especially interesting when an agent needs to coordinate multiple tools to answer a complex question. Consider a question like: "What parks are in Rhode Island, and are there any upcoming events at them?" Answering this requires discovering available tools, searching for parks, querying events for each park found, and synthesizing all the results. + +With Llama Stack's Responses API and MCP integration, this entire workflow happens within a single API call. The model discovers available tools from a connected MCP server, plans and executes a sequence of tool calls, and produces a consolidated answer. The client application doesn't need to write any orchestration logic. + +MCP is an open standard for tool integration, so the ecosystem of available tools is broad and growing. Any MCP server can be connected to Llama Stack and used by the Responses API, whether it provides access to databases, internal services, or external data sources. + +Llama Stack also provides fine-grained control over tool access. You can restrict which tools are available for a given request, pass per-request authentication headers to MCP servers so that an agent can only access data for the current user, and configure tool behavior without modifying the agent's prompt. This matters a lot in production deployments where security and access control are real concerns. + +## Framework compatibility + +Llama Stack exposes OpenAI-compatible endpoints at `/v1`, so you can use the official OpenAI Python client, the Llama Stack client, or any other client that speaks the OpenAI API. They all work the same way. + +If you have existing code built with the OpenAI client, migrating to Llama Stack means pointing your client at your Llama Stack server. That's it. This also applies to frameworks like LangChain that build on top of OpenAI's API. Switching the inference backend to Llama Stack requires changing a constructor parameter, not rewriting your agent logic. + +This drop-in compatibility has practical implications beyond convenience. You can develop and test against a local Llama Stack server, deploy against a production Llama Stack distribution, or switch between Llama Stack and other OpenAI-compatible providers, all with the same application code. + +## Toward an open standard: Open Responses + +When Llama Stack first implemented the Responses API, the specification was proprietary. Llama Stack had to track a moving target, and there was always a gap between when OpenAI added a feature and when Llama Stack could implement it. + +The [Open Responses specification](https://www.openresponses.org/) changes this. Open Responses is an open source specification backed by a broad community including OpenAI, Hugging Face, and providers like Ollama, vLLM, and LM Studio. It formalizes the core concepts of the Responses API into an open standard: items as the atomic unit of context, semantic streaming events, and the agentic loop of reasoning and tool invocation. + +For Llama Stack, Open Responses provides a stable, community-governed specification to build against rather than a proprietary one. It also means that Llama Stack's Responses API implementation is part of a broader ecosystem of interoperable providers. Applications built against the Open Responses specification can run on Llama Stack, on OpenAI, on Hugging Face's infrastructure, or on local providers like Ollama, without code changes. + +The Open Responses specification also introduces concepts that matter for production deployments: + +- **Reasoning visibility:** The specification formalizes how models expose their reasoning process, which enables audit trails and governance workflows. +- **Internal vs. external tools:** A clear distinction between tools executed within the provider's infrastructure (like `file_search`) and tools executed by the client, so developers know exactly where computation happens. +- **Extensibility without fragmentation:** Providers can add custom capabilities while maintaining a stable, interoperable core. + +For the Llama Stack community, this means that investing in the Responses API is about more than compatibility with one vendor. It's about building on an open standard that the industry is starting to converge around. + +## Getting started + +If you're new to Llama Stack, the [Getting Started guide](https://llamastack.github.io/docs/getting_started/) will walk you through setting up a server with your preferred inference provider. From there, the [OpenAI Implementation Guide](https://llamastack.github.io/docs/providers/openai) has examples of using the Responses API for everything from simple text generation to multi-tool agentic workflows. + +The Responses API is still evolving, both in Llama Stack and in the Open Responses specification, and contributions are welcome. Whether it's implementing new features, improving test coverage, or reporting issues, the project benefits from developers who are building real applications and sharing what they learn.