Added trade-offs

minh-hoque · minh-hoque · commit a22e6cbbf6b6 · 2025-11-20T14:47:45.000-05:00
diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb
@@ -58,10 +58,40 @@
         "- **Greater Steerability**: The Realtime model is more steerable, can better follow custom instructions for higher transcription quality, and is not limited by a 1024-token input maximum.\n",
         "- **Session Context Awareness**: The model has access to the full session context, so, for example, if you mention your name multiple times, it will transcribe it correctly.\n",
         "\n",
+        "\n",
         "In terms of **trade-offs**:\n",
-        "- Different cost profile: the realtime model for transcription will take audio in and do text out $32.00\t$0.40. It is also important to note that the whole SESSION CONTEXT is passed in every transcription context, however it will be cached and be priced at $0.40 for cached tokens. The output text tokens is priced at $16.00/M Tokens.\n",
-        "- gpt-4o-transcription is \t$2.50 text in\t$10.00 text out and $6.00 audio input all per 1M tokens.\n",
-        "- Other tradfe-offs would be slightly mroe complex to imp-lement compared to simply using the built in transcription option with realtime api with a transcription model.\n",
+        "\n",
+        "- Realtime Model (for transcription):\n",
+        "    - Audio Input → Text Output: $32.00 per 1M audio input tokens + $16.00 per 1M text output tokens.\n",
+        "    - Cached Session Context: $0.40 per 1M cached tokens. The full session context is passed in with every transcription request, but it is cached, so this charge is typically negligible.\n",
+        "\n",
+        "    - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00\n",
+        "\n",
+        "- GPT-4o Transcription:\n",
+        "\n",
+        "    - Audio Input: $6.00 per 1M audio tokens\n",
+        "\n",
+        "    - Text Input: $2.50 per 1M tokens (capped at 1024 tokens, negligible)\n",
+        "\n",
+        "    - Text Output: $10.00 per 1M tokens\n",
+        "\n",
+        "    - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00\n",
+        "\n",
+        "- Direct Cost Comparison:\n",
+        "\n",
+        "    - Realtime Transcription: ~$48.00\n",
+        "\n",
+        "    - GPT-4o Transcription: ~$16.00\n",
+        "\n",
+        "    - Absolute Difference: $48.00 − $16.00 = $32.00\n",
+        "\n",
+        "    - Cost Ratio: $48.00 / $16.00 = 3×\n",
+        "\n",
+        "    Note: The cached session context ($0.40 per 1M tokens) and the GPT-4o text input ($2.50 per 1M tokens, capped at 1024 tokens) contribute negligibly and are excluded from the totals above.\n",
+        "\n",
+        "- Other Considerations:\n",
+        "\n",
+        "    - Transcribing with the realtime model is slightly more complex to implement than using the built-in GPT-4o transcription option through the Realtime API.\n",
         "\n",
         "> Note: Ouf-of-band responses using the realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.\n",
         "\n",