Replies: 7 comments 4 replies
-
Playing audio over network connections is a solved problem, I think. At least I have no issues playing audio over TCP/IP in applications ranging from phone calls to music players. The graphics protocol isn't really a good analog for this, as the graphics protocol is fundamentally about static images (yes, it gets abused for video, but that was not its original design intent). You'd basically want some kind of streaming protocol, not a static one. Unfortunately, I know very little about audio, so I am not the best person to design such a protocol. There would need to be framing and time stamping so that in case of network latency issues, the terminal can skip frames rather than get desynced. You would also want some mechanism to send audio data via shared memory/filesystem for local clients, falling back to escape codes just as the graphics protocol does.
Coming to less meta questions, I suggest starting with raw PCM samples. Keep it unidirectional; audio input is a separate issue with unrelated concerns like security when accessing the microphone. Once the basic protocol with PCM is implemented and working, we can investigate compressed formats. Since audio is more performance sensitive (in terms of latency and synchronisation) than static graphics, compression is probably more important here.
Solicit some feedback from developers of, say, terminal music players: would they be interested in such a thing, and what design would work for them, to move beyond the compositor use case. Although I believe there already are kitty based wayland compositors; I vaguely recall hearing about them a year or so ago. Though if you ask me, there are other problems you will need to solve for that use case as well, such as (and these are on my TODO list) proper pointer events with high res scroll and touch input, and drag and drop protocols.
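To make the "chunked escape codes as fallback" idea concrete, here is a sketch of how a client might split raw PCM into base64-encoded APC escape sequences, modeled loosely on the graphics protocol's chunked transmission. None of these key names (`a`, `r`, `c`, `m`) are part of any real specification; they are invented for illustration only.

```python
import base64

# Hypothetical APC-based chunked transmission, modeled on the kitty
# graphics protocol's m=0/m=1 chunking. All key names are invented.
def encode_pcm_chunks(pcm: bytes, rate=44100, channels=2, chunk=4096):
    """Split raw PCM into escape-code chunks; m=1 means more data follows."""
    out = []
    for i in range(0, len(pcm), chunk):
        piece = base64.standard_b64encode(pcm[i:i + chunk]).decode('ascii')
        more = 1 if i + chunk < len(pcm) else 0
        if i == 0:
            # The first chunk carries the stream parameters.
            out.append(f"\x1b_Aa=t,r={rate},c={channels},m={more};{piece}\x1b\\")
        else:
            out.append(f"\x1b_Am={more};{piece}\x1b\\")
    return out
```

A real protocol would add stream ids and timestamps on top of this; the point is only that the escape-code fallback composes naturally with chunking, just as it does for graphics.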
-
On Mon, Feb 16, 2026 at 03:44:03AM -0800, Sopy wrote:
As a side tangent, I have tried postmarketOS on a couple of devices. Anything in particular that needs work in regards to touch input? The current General Purpose Mouse protocol seemed serviceable even with a touch device. Is there anything we might want to do that we can't? I can think of a few things, mainly gestures, but that would not be exclusive to touch input and would apply to trackpads.
High res scroll input, touch input, and different types of pointer
devices, like pens, so we would need pressure and related events.
No need for gestures; that should be the province of the application IMO, not the terminal.
And the existing best mouse protocol, SGR_PIXEL, is woefully incomplete,
in that it can't represent co-ordinates outside the window and has no
leave/enter events.
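For readers unfamiliar with the format being criticized: SGR-pixel mouse reports (DEC private mode 1016) have the shape `ESC [ < Cb ; Px ; Py M` (press/motion) or `m` (release), with coordinates in pixels. A minimal parser sketch makes the limitations visible — there is simply no encoding for positions outside the window or for enter/leave transitions.

```python
import re

# Parser for SGR-pixel (mode 1016) mouse reports: ESC [ < b ; x ; y (M|m).
# Note there is no way to encode coordinates outside the window and no
# enter/leave events, which is the incompleteness discussed above.
SGR_RE = re.compile(r'\x1b\[<(\d+);(\d+);(\d+)([Mm])')

def parse_sgr_pixel(data: str):
    events = []
    for b, x, y, kind in SGR_RE.findall(data):
        b = int(b)
        events.append({
            'x': int(x), 'y': int(y),
            'button': b & 0b11,      # low two bits: button number
            'motion': bool(b & 32),  # bit 5 set: motion event
            'release': kind == 'm',
        })
    return events
```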
-
Hey @kovidgoyal, I've been working on the draft and made some last-minute changes over the past hour or so to a few things that didn't quite make sense on reflection. I've uploaded the markdown of the draft to a Gist: https://gist.github.com/sopyb/ec682c9dbb1899f70039b6e81b0b546c Looking forward to your feedback; let me know if there's anything you think I might have missed or gotten wrong. All the best,
-
After a quick read through:
1) Why no temp file transfer mode with terminal-side delete? The client
may not be long-lived; think of a command to play an audio file. It
could just send the audio data to the terminal and exit. Requiring the
client to manage the temp file lifetime means that such simple use cases
are not possible.
2) If you are using single-letter keys everywhere, then for consistency
use them in the capability response as well.
3) You can't query for shared filesystem and shared memory support
without actually having the client create a file/shared mem and sending
it to the terminal. This means your querying support needs to be
reworked. Even if the terminal supports them, it may not have a common
filesystem/shared mem namespace with the client; the only way to know
that is to test transmission, see graphics protocol querying.
4) At the end you talk about client exit. Terminals have no concept of
knowing when a client exits, unless you mean the owner of the pty device
exiting and closing the pty. When the pty is closed, audio must stop. I
think this is what you mean by session close?
5) What's the use case for global streams?
6) Your example script looks Linux specific. It needs to be POSIX sh and
work on any POSIX system to play a raw audio file. See the POSIX example
script used for the graphics protocol for inspiration.
I didn't look in detail at the audio specific parts as, again, I am no
expert there.
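Point 1 above can be sketched quickly: a short-lived client writes the samples to a temp file, emits a single escape code naming the file, and exits immediately, with the terminal responsible for deleting the file once it has read it. Every key name here (`t=t`, `d=1`) is hypothetical, and the `tty-audio-protocol` filename marker mirrors the restriction suggested later in this thread.

```python
import base64
import os
import tempfile

# Hypothetical temp-file transfer mode with terminal-side delete.
# The client can exit right after emitting the escape code; the terminal
# owns the file's lifetime from then on. All key names are invented.
def play_via_temp_file(pcm: bytes, rate=44100, channels=2):
    fd, path = tempfile.mkstemp(prefix='tty-audio-protocol-')
    with os.fdopen(fd, 'wb') as f:
        f.write(pcm)
    payload = base64.standard_b64encode(path.encode()).decode('ascii')
    # t=t: temp-file medium; d=1: terminal deletes the file when done.
    return f"\x1b_At=t,d=1,r={rate},c={channels};{payload}\x1b\\", path
```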
-
IMO clear *should not* stop audio streams. This is because clear is used
in full screen applications that are designed to redraw the full screen
on every frame. If such an application uses audio, it would be stopped
on every frame.
-
Hi again, @kovidgoyal! I've addressed the issues you brought up and spent a lot of time debating with myself on a few things: mainly the terminal-enforced minimum jitter buffer, which I ended up dropping after I couldn't find a use for it other than to cause annoyances for apps; which audio formats to list in the document; and which ones to actually recommend as worthwhile for implementation. In the end, designing a terminal protocol for my thesis seemed a lot easier before I started having to consider ten different things pulling me in different directions. I kept running into issues where I'd tweak one aspect I wasn't satisfied with, only to forget it was referenced elsewhere, leaving the document in a bit of a mess; I'd catch it several reads down the line with no idea how I missed it. Anyway, here is a summary of the changes I've implemented since you last looked at it:
The updated draft is accessible at the same link as the previous iteration: https://gist.github.com/sopyb/ec682c9dbb1899f70039b6e81b0b546c I appreciate you taking the time to give feedback. All the best,
-
Yes, designing protocols is harder than it first appears :)
Some more comments:
1) Clean up the language around chunking. Since you want to support
streaming data, as opposed to the graphics protocol, which supports only
displaying a frame after it is fully transmitted, chunking needs more
careful design. In particular, you need to consider the scenario where
the client is itself getting the audio data slowly, say over a network,
and then transmitting it to the terminal in chunks. In this case, how is
chunking supposed to work? Does the client send multiple padding bytes,
or does it only transmit data when it has audio data that is a multiple
of 3 in size?
2) You have some language about valid shm names; this varies across
platforms. You should just link to the POSIX and Windows specs, as is
done in the graphics protocol docs.
3) When using t=t you need to have a restriction on allowed file paths,
as in the graphics protocol, where names must contain
`tty-graphics-protocol`. This is because badly designed programs can
store files necessary for their operation in /tmp with fixed names
(usually sockets but other files as well) and we don't want attackers to
be able to DoS clients by having the terminal delete these files. So for
you it would be `tty-audio-protocol`.
4) You state the client MUST query terminals before sending audio data.
That should be SHOULD, not MUST.
5) In the storage quota section, simply state a minimum storage size
terminals must implement. Leave details like storage on disk etc. to
implementations.
6) Stream arbitration: you need to think a little about multiplexers
here. How are in-terminal multiplexers such as tmux (admittedly a
terrible concept, but as protocol designers we have to consider these
abominations) supposed to deal with this? In general, with a
query/response protocol, multiplexers become a very vexed issue.
7) With ids you have to worry about id collisions from unrelated
programs. With audio this is less of an issue since, unlike with
graphics, you can only really have one audio stream playing, so maybe
ids alone are enough and we don't need client numbers in the protocol
like we do in the graphics protocol.
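The frame-alignment question in point 1 has a standard client-side answer worth sketching: buffer incoming bytes and only transmit whole frames, carrying any partial frame over to the next send. This avoids both padding bytes and partial frames on the wire. The 3-byte frame size here just follows the "multiple of 3" example above; this is an illustration, not part of any spec.

```python
# Frame-aligned chunking sketch: the client buffers bytes arriving at an
# arbitrary rate (e.g. from a network) and emits only whole audio frames
# (here 3 bytes each), so the terminal never sees a partial frame.
class FrameAligner:
    def __init__(self, frame_size=3):
        self.frame_size = frame_size
        self.pending = b''  # partial frame carried over between feeds

    def feed(self, data: bytes) -> bytes:
        """Return the largest frame-aligned prefix; buffer the remainder."""
        buf = self.pending + data
        cut = len(buf) - (len(buf) % self.frame_size)
        self.pending = buf[cut:]
        return buf[:cut]
```

Whether this buffering belongs in the client or the terminal is exactly the design decision the comment asks the draft to spell out.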
-
Note
TLDR: Proposing an audio protocol for terminal emulators roughly following the initial development of the graphics protocol, starting with raw PCM frames. Looking for community feedback before drafting a full spec for v1.
Hey there,
I've been reading up on terminal emulators and terminal emulator protocols for the past year or so, thinking of something I could implement for my Bachelor's thesis. I have roughly 4-5 months until my thesis presentation, and I'm planning to maintain and expand this work afterward, especially as a potential theme for my Master's degree. I recently had the stupid idea of making a wayland compositor using the kitty graphics protocol, but quickly realized that would be incomplete without audio, since I couldn't find any fitting protocol (a buzzer with no pitch/duration control won't do). I looked up whether there's any protocol like the graphics protocol but for audio and didn't find anything. So I started reading on how the graphics protocol was designed in #33, then looked to see if anyone had proposed something for audio in Kitty and found #7722, which was closed with:
I'm interested in working on this and would love the community's input. I agree and find the major goals of the graphics protocol a great starting point, but I am not sure raw PCM frames would be the best way to go long term; the nature of TCP/IP would probably make audio choppy in anything but the most ideal scenarios. On the other hand, I wonder whether supporting an open standard like Ogg would be too high a burden on implementing terminals, since a basic implementation of Ogg decoding and Ogg Vorbis stream framing would greatly help with the issues raw PCM frames would have.
Currently I am thinking of proposing a v1 with raw PCM audio only, keeping the initial implementation simple to follow the philosophy of the graphics protocol: start small and expand afterwards. Here's what I have in mind so far:
Version 1 - Raw PCM Audio
This avoids:
The protocol would include a query mechanism (mostly inspired by the graphics protocol pattern) to detect whether a terminal supports audio playback before the client attempts to send audio data.
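The graphics protocol pattern referred to here is: send a minimal escape code in query mode, immediately followed by a Primary Device Attributes request (`ESC [ c`); since every terminal answers DA1, receiving the DA1 reply without a preceding protocol response means the feature is unsupported. A sketch of what that could look like for audio, with entirely hypothetical key names:

```python
import re

# Hypothetical capability query, modeled on the graphics protocol's
# query-then-DA1 handshake. A supporting terminal would answer with an
# APC response before the DA1 reply; otherwise only the DA1 reply comes.
def build_query(stream_id=1):
    # a=q: query action (invented key); ESC [ c is the real DA1 request.
    return f"\x1b_Aa=q,i={stream_id};\x1b\\" + "\x1b[c"

def supports_audio(reply: str) -> bool:
    """True if the terminal's reply contains an audio APC response."""
    return re.search(r'\x1b_A[^\x1b]*\x1b\\', reply) is not None
```

As the feedback above notes, shared-memory and shared-filesystem support cannot be detected this way alone; those require an actual test transmission.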
Future improvements
Questions for the Community (and my opinion as of now)
Does starting with PCM-only for v1 seem reasonable, or should compressed formats be in scope from day one?
What playback controls would be essential? (play/pause/seek/volume/stop)
Should this support bidirectional audio (microphone input) eventually, or stay output-only?
Any concerns about the proposed sample rates and bit depths?
Would this be useful beyond the compositor and file playback use cases of icat for audio (+over ssh) #7722? What applications would benefit?
I'm planning to draft a full specification document (similar to the graphics protocol spec) and would love feedback before going too far down any particular path. Also willing to contribute code for implementing a draft specification.
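As a concrete illustration of what the raw PCM payload of such a v1 would look like, here is a sketch generating one second of a 440 Hz tone as 16-bit little-endian signed mono PCM at 44100 Hz; the rate and bit depth are just common example values, not the proposal's final parameters.

```python
import math
import struct

# Generate 16-bit little-endian signed mono PCM: the simplest kind of
# payload a PCM-only v1 client would transmit to the terminal.
def sine_pcm(freq=440.0, rate=44100, seconds=1.0, amplitude=0.5):
    n = int(rate * seconds)
    samples = (int(32767 * amplitude * math.sin(2 * math.pi * freq * i / rate))
               for i in range(n))
    return struct.pack(f'<{n}h', *samples)
```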