Commit 6869123

rui-ren and ruiren_microsoft authored
Align live transcription response type with OpenAI Realtime ConversationItem pattern (#561)
### Description

Redesigns `LiveAudioTranscriptionResponse` to follow the OpenAI Realtime API's `ConversationItem` shape, enabling forward compatibility with a future WebSocket-based architecture.

**Motivation:**

- Customers using OpenAI's Realtime API access transcription via `result.content[0].transcript`
- By adopting this pattern now, customers who write `result.Content[0].Text` won't need to change their code when we migrate to WebSocket transport
- Aligns with the team's plan to move toward OpenAI Realtime API compatibility

**Before:**

```csharp
// Extended AudioCreateTranscriptionResponse from Betalgo
await foreach (var result in session.GetTranscriptionStream())
{
    Console.Write(result.Text);      // inherited from base
    bool final = result.IsFinal;     // custom field
    var segments = result.Segments;  // inherited from base
}
```

**After:**

```csharp
// Own type shaped like OpenAI Realtime ConversationItem
await foreach (var result in session.GetTranscriptionStream())
{
    Console.Write(result.Content[0].Text);        // ConversationItem pattern
    Console.Write(result.Content[0].Transcript);  // alias for Text (Realtime compat)
    bool final = result.IsFinal;
    double? start = result.StartTime;
}
```

**Changes:**

| File | Change |
|------|--------|
| LiveAudioTranscriptionTypes.cs | Removed `AudioCreateTranscriptionResponse` inheritance. New standalone `LiveAudioTranscriptionResponse` with `Content` list + new `TranscriptionContentPart` type |
| LiveAudioTranscriptionClient.cs | Updated text checks: `.Text` → `.Content?[0]?.Text` |
| JsonSerializationContext.cs | Registered `TranscriptionContentPart`, removed `AudioCreateTranscriptionResponse.Segment` |
| LiveAudioTranscriptionTests.cs | Updated assertions to match new type shape |
| Program.cs (sample) | Updated result reading to `result.Content?[0]?.Text` |
| README.md | Updated docs and output type table |

**Key design decisions:**

- `TranscriptionContentPart` has both `Text` and `Transcript` (set to the same value) for maximum compatibility with both Whisper and Realtime API patterns
- `StartTime`/`EndTime` are top-level on the response (not nested in Segments) — simpler access, maps to Realtime's `audio_start_ms`/`audio_end_ms`
- No dependency on Betalgo's `ConversationItem` — we own the type to avoid carrying unused chat/tool-calling fields
- `LiveAudioTranscriptionRaw` (Core JSON deserialization) is unchanged — this is purely an SDK presentation change, no Core/neutron-server impact

**No breaking changes to:** Core API, native interop, audio pipeline, session lifecycle

Co-authored-by: ruiren_microsoft <ruiren@microsoft.com>
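The field mapping described above (flat Core payload in, `ConversationItem`-shaped result out) can be sketched language-neutrally. Python is used here purely for illustration; the helper name is hypothetical and this is not SDK code, just a restatement of what `FromJson` does:

```python
def map_raw_to_conversation_item(raw: dict) -> dict:
    """Illustrative sketch: the Core's flat transcription payload
    (is_final, text, start_time, end_time) becomes a ConversationItem-shaped
    result whose Content[0] carries the text under both Text and Transcript."""
    text = raw.get("text")
    return {
        "IsFinal": raw.get("is_final", False),
        "StartTime": raw.get("start_time"),  # top-level, not nested in Segments
        "EndTime": raw.get("end_time"),
        # Same value under both keys, per the dual Text/Transcript design decision
        "Content": [{"Text": text, "Transcript": text}],
    }

result = map_raw_to_conversation_item(
    {"is_final": False, "text": "partial", "start_time": 1.5, "end_time": 3.0})
print(result["Content"][0]["Transcript"])  # prints: partial
```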
1 parent 3322120 · commit 6869123

File tree

6 files changed: +60 −62 lines

samples/cs/GettingStarted/src/LiveAudioTranscriptionExample/Program.cs
5 additions & 4 deletions

```diff
@@ -41,7 +41,7 @@ await model.DownloadAsync(progress =>
 
     var audioClient = await model.GetAudioClientAsync();
     var session = audioClient.CreateLiveTranscriptionSession();
-    session.Settings.SampleRate = 16000;
+    session.Settings.SampleRate = 16000; // Default is 16000; shown here to match the NAudio WaveFormat below
     session.Settings.Channels = 1;
     session.Settings.Language = "en";
 
@@ -54,16 +54,17 @@ await model.DownloadAsync(progress =>
 {
     await foreach (var result in session.GetTranscriptionStream())
     {
+        var text = result.Content?[0]?.Text;
         if (result.IsFinal)
         {
            Console.WriteLine();
-           Console.WriteLine($" [FINAL] {result.Text}");
+           Console.WriteLine($" [FINAL] {text}");
            Console.Out.Flush();
        }
-       else if (!string.IsNullOrEmpty(result.Text))
+       else if (!string.IsNullOrEmpty(text))
        {
            Console.ForegroundColor = ConsoleColor.Cyan;
-           Console.Write(result.Text);
+           Console.Write(text);
            Console.ResetColor();
            Console.Out.Flush();
        }
```

sdk/cs/README.md
11 additions & 12 deletions

```diff
@@ -259,12 +259,12 @@ waveIn.DataAvailable += (sender, e) =>
 // Read transcription results as they arrive
 await foreach (var result in session.GetTranscriptionStream())
 {
-    // result inherits from AudioCreateTranscriptionResponse
-    // - result.Text — incremental transcribed text (per chunk, not accumulated)
-    // - result.IsFinal — true for final results, false for interim hypotheses
-    // - result.Segments — segment-level timing data (Start/End in seconds)
-    // - result.Language — language code
-    Console.Write(result.Text);
+    // result follows the OpenAI Realtime ConversationItem pattern:
+    // - result.Content[0].Text — incremental transcribed text (per chunk, not accumulated)
+    // - result.Content[0].Transcript — alias for Text (OpenAI Realtime compatibility)
+    // - result.IsFinal — true for final results, false for interim hypotheses
+    // - result.StartTime / EndTime — segment timing in seconds
+    Console.Write(result.Content?[0]?.Text);
 }
 
 await session.StopAsync();
@@ -274,12 +274,11 @@
 
 | Field | Type | Description |
 |-------|------|-------------|
-| `Text` | `string` | Transcribed text from this audio chunk (inherited from `AudioCreateTranscriptionResponse`) |
+| `Content` | `List<TranscriptionContentPart>` | Content parts. Access text via `Content[0].Text` or `Content[0].Transcript`. |
 | `IsFinal` | `bool` | Whether this is a final or interim result. Nemotron always returns `true`. |
-| `Language` | `string` | Language code (inherited) |
-| `Duration` | `float` | Audio duration in seconds (inherited) |
-| `Segments` | `List<Segment>` | Segment timing with `Start`/`End` offsets (inherited) |
-| `Words` | `List<WordSegment>` | Word-level timing (inherited, when available) |
+| `StartTime` | `double?` | Start time offset in the audio stream (seconds). |
+| `EndTime` | `double?` | End time offset in the audio stream (seconds). |
+| `Id` | `string?` | Unique identifier for this result (if available). |
 
 #### Session Lifecycle
 
@@ -356,7 +355,7 @@ Key types:
 | [`OpenAIChatClient`](./docs/api/microsoft.ai.foundry.local.openaichatclient.md) | Chat completions (sync + streaming) |
 | [`OpenAIAudioClient`](./docs/api/microsoft.ai.foundry.local.openaiaudioclient.md) | Audio transcription (sync + streaming) |
 | [`LiveAudioTranscriptionSession`](./docs/api/microsoft.ai.foundry.local.openai.liveaudiotranscriptionsession.md) | Real-time audio streaming session |
-| [`LiveAudioTranscriptionResponse`](./docs/api/microsoft.ai.foundry.local.openai.liveaudiotranscriptionresponse.md) | Streaming transcription result (extends `AudioCreateTranscriptionResponse`) |
+| [`LiveAudioTranscriptionResponse`](./docs/api/microsoft.ai.foundry.local.openai.liveaudiotranscriptionresponse.md) | Streaming transcription result (ConversationItem-shaped) |
 | [`ModelInfo`](./docs/api/microsoft.ai.foundry.local.modelinfo.md) | Full model metadata record |
 
 ## Tests
```

sdk/cs/src/Detail/JsonSerializationContext.cs
2 additions & 3 deletions

```diff
@@ -33,11 +33,10 @@ namespace Microsoft.AI.Foundry.Local.Detail;
 [JsonSerializable(typeof(IList<FunctionDefinition>))]
 [JsonSerializable(typeof(PropertyDefinition))]
 [JsonSerializable(typeof(IList<PropertyDefinition>))]
-// --- Audio streaming types ---
-[JsonSerializable(typeof(LiveAudioTranscriptionResponse))]
+// --- Audio streaming types (LiveAudioTranscriptionResponse inherits ConversationItem
+// which has AOT-incompatible JsonConverters, so we only register the raw deserialization type) ---
 [JsonSerializable(typeof(LiveAudioTranscriptionRaw))]
 [JsonSerializable(typeof(CoreErrorResponse))]
-[JsonSerializable(typeof(AudioCreateTranscriptionResponse.Segment))]
 [JsonSourceGenerationOptions(DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
                              WriteIndented = false)]
 internal partial class JsonSerializationContext : JsonSerializerContext
```

sdk/cs/src/OpenAI/LiveAudioTranscriptionClient.cs
2 additions & 2 deletions

```diff
@@ -218,7 +218,7 @@ private async Task PushLoopAsync(CancellationToken ct)
     try
     {
         var transcription = LiveAudioTranscriptionResponse.FromJson(response.Data);
-        if (!string.IsNullOrEmpty(transcription.Text))
+        if (!string.IsNullOrEmpty(transcription.Content?[0]?.Text))
         {
             _outputChannel?.Writer.TryWrite(transcription);
         }
@@ -331,7 +331,7 @@ public async Task StopAsync(CancellationToken ct = default)
     try
     {
         var finalResult = LiveAudioTranscriptionResponse.FromJson(response.Data);
-        if (!string.IsNullOrEmpty(finalResult.Text))
+        if (!string.IsNullOrEmpty(finalResult.Content?[0]?.Text))
         {
             _outputChannel?.Writer.TryWrite(finalResult);
         }
```

sdk/cs/src/OpenAI/LiveAudioTranscriptionTypes.cs
23 additions & 22 deletions

```diff
@@ -2,16 +2,18 @@ namespace Microsoft.AI.Foundry.Local.OpenAI;
 
 using System.Text.Json;
 using System.Text.Json.Serialization;
-using Betalgo.Ranul.OpenAI.ObjectModels.ResponseModels;
+using Betalgo.Ranul.OpenAI.ObjectModels.RealtimeModels;
 using Microsoft.AI.Foundry.Local;
 using Microsoft.AI.Foundry.Local.Detail;
 
 /// <summary>
 /// Transcription result for real-time audio streaming sessions.
-/// Extends <see cref="AudioCreateTranscriptionResponse"/> to provide a consistent
-/// output format with file-based transcription, while adding streaming-specific fields.
+/// Extends the OpenAI Realtime API's <see cref="ConversationItem"/> so that
+/// customers access text via <c>result.Content[0].Text</c> or
+/// <c>result.Content[0].Transcript</c>, ensuring forward compatibility
+/// when the transport layer moves to WebSocket.
 /// </summary>
-public record LiveAudioTranscriptionResponse : AudioCreateTranscriptionResponse
+public class LiveAudioTranscriptionResponse : ConversationItem
 {
     /// <summary>
     /// Whether this is a final or partial (interim) result.
@@ -22,35 +24,34 @@ public record LiveAudioTranscriptionResponse : AudioCreateTranscriptionResponse
     [JsonPropertyName("is_final")]
     public bool IsFinal { get; init; }
 
+    /// <summary>Start time offset of this segment in the audio stream (seconds).</summary>
+    [JsonPropertyName("start_time")]
+    public double? StartTime { get; init; }
+
+    /// <summary>End time offset of this segment in the audio stream (seconds).</summary>
+    [JsonPropertyName("end_time")]
+    public double? EndTime { get; init; }
+
     internal static LiveAudioTranscriptionResponse FromJson(string json)
     {
-        // Deserialize the core's JSON (which has is_final, text, start_time, end_time)
-        // into an intermediate record, then map to the response type.
         var raw = JsonSerializer.Deserialize(json,
             JsonSerializationContext.Default.LiveAudioTranscriptionRaw)
             ?? throw new FoundryLocalException("Failed to deserialize live audio transcription result");
 
-        var response = new LiveAudioTranscriptionResponse
+        return new LiveAudioTranscriptionResponse
         {
-            Text = raw.Text,
             IsFinal = raw.IsFinal,
-        };
-
-        // Map start_time/end_time into a Segment for OpenAI-compatible output
-        if (raw.StartTime.HasValue || raw.EndTime.HasValue)
-        {
-            response.Segments =
+            StartTime = raw.StartTime,
+            EndTime = raw.EndTime,
+            Content =
             [
-                new Segment
+                new ContentPart
                 {
-                    Start = (float)(raw.StartTime ?? 0),
-                    End = (float)(raw.EndTime ?? 0),
-                    Text = raw.Text
+                    Text = raw.Text,
+                    Transcript = raw.Text
                 }
-            ];
-        }
-
-        return response;
+            ]
+        };
     }
 }
```
sdk/cs/test/FoundryLocal.Tests/LiveAudioTranscriptionTests.cs
17 additions & 19 deletions

```diff
@@ -21,25 +21,24 @@ public async Task FromJson_ParsesTextAndIsFinal()
 
     var result = LiveAudioTranscriptionResponse.FromJson(json);
 
-    await Assert.That(result.Text).IsEqualTo("hello world");
+    await Assert.That(result.Content).IsNotNull();
+    await Assert.That(result.Content!.Count).IsEqualTo(1);
+    await Assert.That(result.Content[0].Text).IsEqualTo("hello world");
+    await Assert.That(result.Content[0].Transcript).IsEqualTo("hello world");
     await Assert.That(result.IsFinal).IsTrue();
-    await Assert.That(result.Segments).IsNull();
 }
 
 [Test]
-public async Task FromJson_MapsTimingToSegments()
+public async Task FromJson_MapsTimingFields()
 {
     var json = """{"is_final":false,"text":"partial","start_time":1.5,"end_time":3.0}""";
 
     var result = LiveAudioTranscriptionResponse.FromJson(json);
 
-    await Assert.That(result.Text).IsEqualTo("partial");
+    await Assert.That(result.Content?[0]?.Text).IsEqualTo("partial");
     await Assert.That(result.IsFinal).IsFalse();
-    await Assert.That(result.Segments).IsNotNull();
-    await Assert.That(result.Segments!.Count).IsEqualTo(1);
-    await Assert.That(result.Segments[0].Start).IsEqualTo(1.5f);
-    await Assert.That(result.Segments[0].End).IsEqualTo(3.0f);
-    await Assert.That(result.Segments[0].Text).IsEqualTo("partial");
+    await Assert.That(result.StartTime).IsEqualTo(1.5);
+    await Assert.That(result.EndTime).IsEqualTo(3.0);
 }
 
 [Test]
@@ -49,21 +48,20 @@ public async Task FromJson_EmptyText_ParsesSuccessfully()
 
     var result = LiveAudioTranscriptionResponse.FromJson(json);
 
-    await Assert.That(result.Text).IsEqualTo("");
+    await Assert.That(result.Content?[0]?.Text).IsEqualTo("");
     await Assert.That(result.IsFinal).IsTrue();
 }
 
 [Test]
-public async Task FromJson_OnlyStartTime_CreatesSegment()
+public async Task FromJson_OnlyStartTime_SetsStartTime()
 {
     var json = """{"is_final":true,"text":"word","start_time":2.0,"end_time":null}""";
 
     var result = LiveAudioTranscriptionResponse.FromJson(json);
 
-    await Assert.That(result.Segments).IsNotNull();
-    await Assert.That(result.Segments!.Count).IsEqualTo(1);
-    await Assert.That(result.Segments[0].Start).IsEqualTo(2.0f);
-    await Assert.That(result.Segments[0].End).IsEqualTo(0f);
+    await Assert.That(result.StartTime).IsEqualTo(2.0);
+    await Assert.That(result.EndTime).IsNull();
+    await Assert.That(result.Content?[0]?.Text).IsEqualTo("word");
 }
 
 [Test]
@@ -75,15 +73,15 @@ public async Task FromJson_InvalidJson_Throws()
 }
 
 [Test]
-public async Task FromJson_InheritsFromAudioCreateTranscriptionResponse()
+public async Task FromJson_ContentHasTextAndTranscript()
 {
     var json = """{"is_final":true,"text":"test","start_time":null,"end_time":null}""";
 
     var result = LiveAudioTranscriptionResponse.FromJson(json);
 
-    // Verify it's assignable to the base type
-    Betalgo.Ranul.OpenAI.ObjectModels.ResponseModels.AudioCreateTranscriptionResponse baseRef = result;
-    await Assert.That(baseRef.Text).IsEqualTo("test");
+    // Both Text and Transcript should have the same value
+    await Assert.That(result.Content?[0]?.Text).IsEqualTo("test");
+    await Assert.That(result.Content?[0]?.Transcript).IsEqualTo("test");
 }
 
 // --- LiveAudioTranscriptionOptions tests ---
```
