
Commit 424f280 (1 parent: b9d368f)

Updated Specification and Docs to support Audio Modality.

File tree: 5 files changed (+107 -4 lines)

docs/specification/client/sampling.md
Lines changed: 11 additions & 1 deletion

@@ -8,7 +8,7 @@ weight: 40
 **Protocol Revision**: {{< param protocolRevision >}}
 {{< /callout >}}
 
-The Model Context Protocol (MCP) provides a standardized way for servers to request LLM sampling ("completions" or "generations") from language models via clients. This flow allows clients to maintain control over model access, selection, and permissions while enabling servers to leverage AI capabilities&mdash;with no server API keys necessary. Servers can request text or image-based interactions and optionally include context from MCP servers in their prompts.
+The Model Context Protocol (MCP) provides a standardized way for servers to request LLM sampling ("completions" or "generations") from language models via clients. This flow allows clients to maintain control over model access, selection, and permissions while enabling servers to leverage AI capabilities&mdash;with no server API keys necessary. Servers can request text, audio or image-based interactions and optionally include context from MCP servers in their prompts.
 
 ## User Interaction Model
 

@@ -142,6 +142,16 @@ Sampling messages can contain:
 }
 ```
 
+#### Audio Content
+```json
+{
+  "type": "audio",
+  "data": "base64-encoded-audio-data",
+  "mimeType": "audio/wav"
+}
+```
+
 ### Model Preferences
 
 Model selection in MCP requires careful abstraction since servers and clients may use different AI providers with distinct model offerings. A server cannot simply request a specific model by name since the client may not have access to that exact model or may prefer to use a different provider's equivalent model.
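The audio block added above mirrors the existing image block: raw bytes are base64-encoded and paired with a MIME type. A minimal sketch of how a server might build one, assuming a Node.js environment for `Buffer` (the `makeAudioContent` helper name and sample bytes are illustrative, not part of the spec):

```typescript
// Shape of the new sampling content block, as introduced in the diff above.
interface AudioContent {
  type: "audio";
  data: string;     // base64-encoded audio bytes
  mimeType: string; // e.g. "audio/wav"
}

// Hypothetical helper: wrap raw audio bytes for a sampling request.
function makeAudioContent(bytes: Uint8Array, mimeType: string): AudioContent {
  return {
    type: "audio",
    data: Buffer.from(bytes).toString("base64"), // Node.js base64 encoding
    mimeType,
  };
}

// The first four bytes of a WAV file are the ASCII marker "RIFF".
const clip = makeAudioContent(new Uint8Array([0x52, 0x49, 0x46, 0x46]), "audio/wav");
// clip.data === "UklGRg=="
```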

docs/specification/server/prompts.md
Lines changed: 11 additions & 0 deletions

@@ -189,6 +189,17 @@ Image content allows including visual information in messages:
 ```
 The image data MUST be base64-encoded and include a valid MIME type. This enables multi-modal interactions where visual context is important.
 
+#### Audio Content
+Audio content allows including audio information in messages:
+```json
+{
+  "type": "audio",
+  "data": "base64-encoded-audio-data",
+  "mimeType": "audio/wav"
+}
+```
+The audio data MUST be base64-encoded and include a valid MIME type. This enables multi-modal interactions where audio context is important.
+
 #### Embedded Resources
 Embedded resources allow referencing server-side resources directly in messages:
 ```json
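Because the audio data MUST be base64-encoded, a client consuming a prompt message has to decode it before playback or further processing. A hedged sketch of the receiving side, assuming Node.js for `Buffer` (the `decodeAudio` helper is hypothetical, not part of the spec):

```typescript
interface AudioContent {
  type: "audio";
  data: string;     // base64-encoded audio
  mimeType: string;
}

// Hypothetical helper: recover the raw bytes from a received audio block,
// rejecting content whose MIME type is not an audio type.
function decodeAudio(content: AudioContent): Uint8Array {
  if (!content.mimeType.startsWith("audio/")) {
    throw new Error(`unexpected MIME type: ${content.mimeType}`);
  }
  return new Uint8Array(Buffer.from(content.data, "base64"));
}

const bytes = decodeAudio({
  type: "audio",
  data: "UklGRg==", // base64 for the ASCII bytes "RIFF"
  mimeType: "audio/wav",
});
// bytes → Uint8Array [0x52, 0x49, 0x46, 0x46]
```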

docs/specification/server/tools.md
Lines changed: 9 additions & 0 deletions

@@ -188,6 +188,15 @@ Tool results can contain multiple content items of different types:
 }
 ```
 
+#### Audio Content
+```json
+{
+  "type": "audio",
+  "data": "base64-encoded-audio-data",
+  "mimeType": "audio/wav"
+}
+```
+
 #### Embedded Resources
 
 [Resources]({{< ref "/specification/server/resources" >}}) **MAY** be embedded, to provide additional context or data, behind a URI that can be subscribed to or fetched again by the client later:
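With audio in the mix, a tool result's content array can carry four content types, distinguished by the `type` field. A sketch of how a client might dispatch on that discriminant (the type aliases are trimmed-down stand-ins for the schema's interfaces, and `summarize` is a hypothetical helper):

```typescript
// Trimmed stand-ins for the schema's content interfaces.
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string };
type AudioContent = { type: "audio"; data: string; mimeType: string };
type EmbeddedResource = { type: "resource"; resource: { uri: string } };
type ToolContent = TextContent | ImageContent | AudioContent | EmbeddedResource;

// Exhaustive switch on the "type" discriminant; TypeScript narrows each case.
function summarize(item: ToolContent): string {
  switch (item.type) {
    case "text":
      return `text (${item.text.length} chars)`;
    case "image":
      return `image (${item.mimeType})`;
    case "audio":
      return `audio (${item.mimeType})`;
    case "resource":
      return `resource (${item.resource.uri})`;
  }
}

const summaries = [
  { type: "text", text: "done" } as ToolContent,
  { type: "audio", data: "UklGRg==", mimeType: "audio/wav" } as ToolContent,
].map(summarize);
// summaries → ["text (4 chars)", "audio (audio/wav)"]
```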

schema/schema.json
Lines changed: 54 additions & 0 deletions

@@ -25,6 +25,48 @@
       },
       "type": "object"
     },
+    "AudioContent": {
+      "description": "Audio provided to or from an LLM.",
+      "properties": {
+        "annotations": {
+          "properties": {
+            "audience": {
+              "description": "Describes who the intended customer of this object or data is.\n\nIt can include multiple entries to indicate content useful for multiple audiences (e.g., `[\"user\", \"assistant\"]`).",
+              "items": {
+                "$ref": "#/definitions/Role"
+              },
+              "type": "array"
+            },
+            "priority": {
+              "description": "Describes how important this data is for operating the server.\n\nA value of 1 means \"most important,\" and indicates that the data is\neffectively required, while 0 means \"least important,\" and indicates that\nthe data is entirely optional.",
+              "maximum": 1,
+              "minimum": 0,
+              "type": "number"
+            }
+          },
+          "type": "object"
+        },
+        "data": {
+          "description": "The base64-encoded audio data.",
+          "format": "byte",
+          "type": "string"
+        },
+        "mimeType": {
+          "description": "The MIME type of the audio. Different providers may support different audio types.",
+          "type": "string"
+        },
+        "type": {
+          "const": "audio",
+          "type": "string"
+        }
+      },
+      "required": [
+        "data",
+        "mimeType",
+        "type"
+      ],
+      "type": "object"
+    },
     "BlobResourceContents": {
       "properties": {
         "blob": {

@@ -94,6 +136,9 @@
         {
           "$ref": "#/definitions/ImageContent"
         },
+        {
+          "$ref": "#/definitions/AudioContent"
+        },
         {
           "$ref": "#/definitions/EmbeddedResource"
         }

@@ -409,6 +454,9 @@
         },
         {
           "$ref": "#/definitions/ImageContent"
+        },
+        {
+          "$ref": "#/definitions/AudioContent"
         }
       ]
     },

@@ -1349,6 +1397,9 @@
         {
           "$ref": "#/definitions/ImageContent"
         },
+        {
+          "$ref": "#/definitions/AudioContent"
+        },
         {
           "$ref": "#/definitions/EmbeddedResource"
         }

@@ -1718,6 +1769,9 @@
         },
         {
           "$ref": "#/definitions/ImageContent"
+        },
+        {
+          "$ref": "#/definitions/AudioContent"
         }
       ]
     },
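The schema above marks `data`, `mimeType`, and `type` as required and pins `type` to the constant `"audio"`. A minimal structural check mirroring just those constraints (a sketch, not a full JSON Schema validator; the `looksLikeAudioContent` name is illustrative):

```typescript
// Checks only the constraints visible in the AudioContent definition:
// required "data", "mimeType", "type" members and the "audio" const.
function looksLikeAudioContent(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.type === "audio" &&
    typeof v.data === "string" &&
    typeof v.mimeType === "string"
  );
}

const ok = looksLikeAudioContent({
  type: "audio",
  data: "UklGRg==",
  mimeType: "audio/wav",
});
const missingMime = looksLikeAudioContent({ type: "audio", data: "UklGRg==" });
// ok → true, missingMime → false
```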

schema/schema.ts
Lines changed: 22 additions & 3 deletions

@@ -600,7 +600,7 @@ export type Role = "user" | "assistant";
  */
 export interface PromptMessage {
   role: Role;
-  content: TextContent | ImageContent | EmbeddedResource;
+  content: TextContent | ImageContent | AudioContent | EmbeddedResource;
 }
 
 /**

@@ -649,7 +649,7 @@ export interface ListToolsResult extends PaginatedResult {
  * should be reported as an MCP error response.
  */
 export interface CallToolResult extends Result {
-  content: (TextContent | ImageContent | EmbeddedResource)[];
+  content: (TextContent | ImageContent | AudioContent | EmbeddedResource)[];
 
   /**
    * Whether the tool call ended in an error.

@@ -804,7 +804,7 @@ export interface CreateMessageResult extends Result, SamplingMessage {
  */
 export interface SamplingMessage {
   role: Role;
-  content: TextContent | ImageContent;
+  content: TextContent | ImageContent | AudioContent;
 }
 
 /**

@@ -862,6 +862,25 @@ export interface ImageContent extends Annotated {
   mimeType: string;
 }
 
+
+/**
+ * Audio provided to or from an LLM.
+ */
+export interface AudioContent extends Annotated {
+  type: "audio";
+  /**
+   * The base64-encoded audio data.
+   *
+   * @format byte
+   */
+  data: string;
+  /**
+   * The MIME type of the audio. Different providers may support different audio types.
+   */
+  mimeType: string;
+}
+
+
 /**
  * The server's preferences for model selection, requested of the client during sampling.
  *
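Because `AudioContent extends Annotated`, it also inherits the optional `annotations` object (audience and priority) seen in the schema.json hunk. A sketch of a fully annotated value under those definitions (the `Annotated` interface here is a trimmed local copy for illustration, not an import from the schema):

```typescript
type Role = "user" | "assistant";

// Trimmed-down copy of the Annotated base, for illustration only.
interface Annotated {
  annotations?: {
    audience?: Role[]; // who the content is intended for
    priority?: number; // 0 = entirely optional .. 1 = effectively required
  };
}

interface AudioContent extends Annotated {
  type: "audio";
  data: string;     // base64-encoded audio (@format byte)
  mimeType: string; // providers may support different audio types
}

const clip: AudioContent = {
  type: "audio",
  data: "UklGRg==",
  mimeType: "audio/wav",
  annotations: { audience: ["user"], priority: 0.5 },
};
```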
