Implement OpenAI provider with multi-modal output (images, audio) #22

@djthorpe

Summary

Implement an OpenAI provider (pkg/provider/openai) supporting chat completions, embeddings, image generation, and audio output. This will likely require extending the response content model beyond text to support multi-modal outputs.

Requirements

Core Provider

  • Implement llm.Client interface for OpenAI API (chat completions, model listing)
  • API key via OPENAI_API_KEY environment variable
  • Support streaming and non-streaming chat completions
  • Support tool/function calling
  • Support thinking/reasoning (o1, o3 models)
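As a rough illustration of the API key and chat completion requirements above, here is a minimal sketch that builds (but does not send) a chat completions request. The helper `newChatRequest` and the `chatRequest`/`chatMessage` types are hypothetical names for this example and are not part of the existing `llm.Client` interface; the real provider would also need tool calling and stream parsing, which are omitted.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"os"
)

// chatMessage and chatRequest cover only the fields this sketch needs
// from the chat completions request body.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Stream   bool          `json:"stream,omitempty"`
}

// newChatRequest is a hypothetical helper: it builds an HTTP request for a
// single user prompt and fails early when no API key is supplied.
func newChatRequest(apiKey, model, prompt string, stream bool) (*http.Request, error) {
	if apiKey == "" {
		return nil, errors.New("OPENAI_API_KEY is not set")
	}
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
		Stream:   stream,
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		"https://api.openai.com/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// The key comes from the environment, as the requirements specify.
	req, err := newChatRequest(os.Getenv("OPENAI_API_KEY"), "gpt-4o-mini", "Hello", false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```

Keeping request construction separate from sending makes the same builder usable for both the streaming and non-streaming paths, and easy to test without network access.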

Image Output

  • Support DALL-E and GPT-image models for image generation
  • Responses may contain image data (base64 or URLs) alongside text
  • Extend the content block model to represent image outputs (not just text)
  • Images should be renderable in Telegram (send as photo) and CLI (save to file or display URL)

Audio Output

  • Support audio output from GPT-4o-audio and similar models
  • Responses may contain audio data (base64 WAV/MP3)
  • Extend the content block model to represent audio outputs
  • Audio should be sendable in Telegram (as voice/audio message) and CLI (save to file)

Content Model Changes

  • The current schema.Content type may need to support typed content blocks (text, image, audio, and so on)
  • Each block should carry MIME type and either inline data or a URL
  • Downstream consumers (Telegram bot, CLI, API responses) need to handle multi-modal content blocks
  • Consider how this interacts with session storage (storing large binary blobs vs references)
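To make the typed-block idea concrete, here is one possible shape for a content block, offered as a sketch rather than the final design. `ContentBlock`, `BlockType`, the field names, and the `Validate` rules are all assumptions for illustration and do not reflect the current `schema.Content` definition.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// BlockType identifies what kind of content a block carries.
type BlockType string

const (
	BlockText  BlockType = "text"
	BlockImage BlockType = "image"
	BlockAudio BlockType = "audio"
)

// ContentBlock is a hypothetical typed block. Media blocks carry a MIME
// type and either inline bytes (Data) or a reference (URL), never both.
type ContentBlock struct {
	Type     BlockType `json:"type"`
	Text     string    `json:"text,omitempty"`
	MIMEType string    `json:"mime_type,omitempty"`
	Data     []byte    `json:"data,omitempty"`
	URL      string    `json:"url,omitempty"`
}

// Validate checks the invariants assumed in this sketch.
func (b ContentBlock) Validate() error {
	switch b.Type {
	case BlockText:
		if len(b.Data) > 0 || b.URL != "" {
			return errors.New("text block must not carry data or a url")
		}
		return nil
	case BlockImage, BlockAudio:
		// The MIME type must match the block type, e.g. "image/png".
		if !strings.HasPrefix(b.MIMEType, string(b.Type)+"/") {
			return fmt.Errorf("%s block has MIME type %q", b.Type, b.MIMEType)
		}
		// Exactly one of inline data or URL must be present.
		if (len(b.Data) > 0) == (b.URL != "") {
			return errors.New("media block needs exactly one of data or url")
		}
		return nil
	default:
		return fmt.Errorf("unknown block type %q", b.Type)
	}
}

func main() {
	blocks := []ContentBlock{
		{Type: BlockText, Text: "Here is your image:"},
		{Type: BlockImage, MIMEType: "image/png", URL: "https://example.com/a.png"},
		{Type: BlockAudio, MIMEType: "audio/mpeg"}, // invalid: no data or url
	}
	for _, b := range blocks {
		fmt.Println(b.Type, "valid:", b.Validate() == nil)
	}
}
```

Allowing a URL in place of inline data is one way to address the session storage concern: large blobs could be written elsewhere and only the reference stored with the session.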

Models

  • GPT-4o, GPT-4o-mini, GPT-4.1, o1, o3, o4-mini (chat)
  • DALL-E 3, GPT-image (image generation)
  • GPT-4o-audio (audio output)
  • Embedding models (text-embedding-3-small, text-embedding-3-large)

Motivation

OpenAI is a major LLM provider, and supporting its multi-modal outputs (images, audio) will push the content model toward rich responses that every provider can use, which should improve the overall architecture.
