Replies: 7 comments 4 replies
-
Playing audio over network connections is a solved problem, I think. At least I have no issues playing audio over TCP/IP in applications ranging from phone calls to music players. The graphics protocol isn't really a good analog for this, as the graphics protocol is fundamentally about static images (yes, it gets abused for video, but that was not its original design intent). You'd basically want some kind of streaming protocol, not a static one. Unfortunately, I know very little about audio, so I am not the best person to design such a protocol. There would need to be framing and time stamping so that in case of network latency issues, the terminal can skip frames rather than get desynced. You would also want some mechanism to send audio data via shared memory/filesystem for local clients, falling back to escape codes just as the graphics protocol does.
Coming to less meta questions, I suggest starting with raw PCM samples. Keep it unidirectional; audio input is a separate issue with unrelated concerns like security when accessing the microphone. Once the basic protocol with PCM is implemented and working, we can investigate compressed formats. Since audio is more performance sensitive (in terms of latency and synchronisation) than static graphics, compression is probably more important here.
Solicit some feedback from developers of, say, terminal music players: would they be interested in such a thing, and what design would work for them, to move beyond the compositor use case. Although I believe there already are kitty based wayland compositors; I vaguely recall hearing about them a year or so ago. Though if you ask me, there are other problems you will need to solve for that use case as well, such as (and these are on my TODO list) proper pointer events with high res scroll and touch input, and drag and drop protocols.
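To make the "chunked escape codes as fallback" idea concrete, here is a sketch of how a client might split raw PCM into base64-encoded APC escape sequences, modeled loosely on the graphics protocol's chunked transmission. None of these key names (`a`, `r`, `c`, `m`) are part of any real specification; they are invented for illustration only.

```python
import base64

# Hypothetical APC-based chunked transmission, modeled on the kitty
# graphics protocol's m=0/m=1 chunking. All key names are invented.
def encode_pcm_chunks(pcm: bytes, rate=44100, channels=2, chunk=4096):
    """Split raw PCM into escape-code chunks; m=1 means more data follows."""
    out = []
    for i in range(0, len(pcm), chunk):
        piece = base64.standard_b64encode(pcm[i:i + chunk]).decode('ascii')
        more = 1 if i + chunk < len(pcm) else 0
        if i == 0:
            # The first chunk carries the stream parameters.
            out.append(f"\x1b_Aa=t,r={rate},c={channels},m={more};{piece}\x1b\\")
        else:
            out.append(f"\x1b_Am={more};{piece}\x1b\\")
    return out
```

A real protocol would add stream ids and timestamps on top of this; the point is only that the escape-code fallback composes naturally with chunking, just as it does for graphics.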
-
On Mon, Feb 16, 2026 at 03:44:03AM -0800, Sopy wrote:
As a side tangent, I have tried postmarketOS on a couple of devices. Anything in particular that needs work in regards to touch input? The current General Purpose Mouse protocol seemed serviceable even with a touch device. Is there anything we might want to do that we can't? I can think of a few things, mainly gestures, but that would not be exclusive to touch input and would apply to trackpads.
High res scroll input, touch input, and different types of pointer
devices, like pens, so we would need pressure and related events.
No need for gestures; that should be the province of the application IMO, not the terminal.
And the existing best mouse protocol, SGR_PIXEL, is woefully incomplete,
in that it can't represent co-ordinates outside the window and has no
leave/enter events.
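For readers unfamiliar with the format being criticized: SGR-pixel mouse reports (DEC private mode 1016) have the shape `ESC [ < Cb ; Px ; Py M` (press/motion) or `m` (release), with coordinates in pixels. A minimal parser sketch makes the limitations visible — there is simply no encoding for positions outside the window or for enter/leave transitions.

```python
import re

# Parser for SGR-pixel (mode 1016) mouse reports: ESC [ < b ; x ; y (M|m).
# Note there is no way to encode coordinates outside the window and no
# enter/leave events, which is the incompleteness discussed above.
SGR_RE = re.compile(r'\x1b\[<(\d+);(\d+);(\d+)([Mm])')

def parse_sgr_pixel(data: str):
    events = []
    for b, x, y, kind in SGR_RE.findall(data):
        b = int(b)
        events.append({
            'x': int(x), 'y': int(y),
            'button': b & 0b11,      # low two bits: button number
            'motion': bool(b & 32),  # bit 5 set: motion event
            'release': kind == 'm',
        })
    return events
```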
-
Hey @kovidgoyal, I've been working on the draft and made some last-minute changes over the past hour or so to a few things that didn't quite make sense on reflection. I've uploaded the markdown of the draft to a Gist: https://gist.github.com/sopyb/ec682c9dbb1899f70039b6e81b0b546c Looking forward to your feedback; let me know if there's anything you think I might have missed or gotten wrong. All the best,
-
After a quick read through:
1) Why no temp file transfer mode with terminal-side delete? The client
may not be long-lived; think of a command to play an audio file. It
could just send the audio data to the terminal and exit. Requiring the
client to manage the temp file lifetime means that such simple use cases
are not possible.
2) If you are using single-letter keys everywhere, then for consistency
use them in the capability response as well.
3) You can't query for shared filesystem and shared memory support
without actually having the client create a file/shared mem and sending
it to the terminal. This means your querying support needs to be
reworked. Even if the terminal supports them, it may not have a common
filesystem/shared mem namespace with the client; the only way to know
that is to test transmission, see graphics protocol querying.
4) At the end you talk about client exit. Terminals have no concept of
knowing when a client exits, unless you mean the owner of the pty device
exiting and closing the pty. When the pty is closed, audio must stop. I
think this is what you mean by session close?
5) What's the use case for global streams?
6) Your example script looks Linux specific. It needs to be POSIX sh and
work on any POSIX system to play a raw audio file. See the POSIX example
script used for the graphics protocol for inspiration.
I didn't look in detail at the audio specific parts as, again, I am no
expert there.
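Point 1 above can be sketched quickly: a short-lived client writes the samples to a temp file, emits a single escape code naming the file, and exits immediately, with the terminal responsible for deleting the file once it has read it. Every key name here (`t=t`, `d=1`) is hypothetical, and the `tty-audio-protocol` filename marker mirrors the restriction suggested later in this thread.

```python
import base64
import os
import tempfile

# Hypothetical temp-file transfer mode with terminal-side delete.
# The client can exit right after emitting the escape code; the terminal
# owns the file's lifetime from then on. All key names are invented.
def play_via_temp_file(pcm: bytes, rate=44100, channels=2):
    fd, path = tempfile.mkstemp(prefix='tty-audio-protocol-')
    with os.fdopen(fd, 'wb') as f:
        f.write(pcm)
    payload = base64.standard_b64encode(path.encode()).decode('ascii')
    # t=t: temp-file medium; d=1: terminal deletes the file when done.
    return f"\x1b_At=t,d=1,r={rate},c={channels};{payload}\x1b\\", path
```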
-
IMO clear *should not* stop audio streams. This is because clear is used
in full screen applications that are designed to redraw the full screen
on every frame. If such an application uses audio, it would be stopped
on every frame.
-
Hi again, @kovidgoyal! I've addressed the issues you brought up and spent a lot of time debating with myself on a few things: mainly the terminal-enforced minimum jitter buffer, which I ended up dropping after I couldn't find a use for it other than to cause annoyances for apps; which audio formats to list in the document; and which ones to actually recommend as worthwhile for implementation. In the end, designing a terminal protocol for my thesis seemed a lot easier before I started having to consider ten different things pulling me in different directions. I kept running into issues where I'd tweak one aspect I wasn't satisfied with, only to forget it was referenced elsewhere, leaving the document in a bit of a mess; I'd catch it several reads down the line with no idea how I missed it. Anyway, here is a summary of the changes I've implemented since you last looked at it:
The updated draft is accessible at the same link as the previous iteration: https://gist.github.com/sopyb/ec682c9dbb1899f70039b6e81b0b546c I appreciate you taking the time to give feedback. All the best,
-
Yes, designing protocols is harder than it first appears :)
Some more comments:
1) Clean up the language around chunking. Since you want to support
streaming data, as opposed to the graphics protocol, which supports only
displaying a frame after it is fully transmitted, chunking needs more
careful design. In particular, you need to consider the scenario where
the client is itself getting the audio data slowly, say over a network,
and then transmitting it to the terminal in chunks. In this case, how is
chunking supposed to work? Does the client send multiple padding bytes,
or does it only transmit data when it has audio data that is a multiple
of 3 in size?
2) You have some language about valid shm names; this varies across
platforms. You should just link to the POSIX and Windows specs, as is
done in the graphics protocol docs.
3) When using t=t you need to have a restriction on allowed file paths,
as in the graphics protocol, where names must contain
`tty-graphics-protocol`. This is because badly designed programs can
store files necessary for their operation in /tmp with fixed names
(usually sockets but other files as well) and we don't want attackers to
be able to DoS clients by having the terminal delete these files. So for
you it would be `tty-audio-protocol`.
4) You state the client MUST query terminals before sending audio data.
That should be SHOULD, not MUST.
5) In the storage quota section, simply state a minimum storage size
terminals must implement. Leave details like storage on disk etc. to
implementations.
6) Stream arbitration: you need to think a little about multiplexers
here. How are in-terminal multiplexers such as tmux (admittedly a
terrible concept, but as protocol designers we have to consider these
abominations) supposed to deal with this? In general, with a
query/response protocol, multiplexers become a very vexed issue.
7) With ids you have to worry about id collisions from unrelated
programs. With audio this is less of an issue since, unlike with
graphics, you can only really have one audio stream playing, so maybe
ids alone are enough and we don't need client numbers in the protocol
like we do in the graphics protocol.
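The frame-alignment question in point 1 has a standard client-side answer worth sketching: buffer incoming bytes and only transmit whole frames, carrying any partial frame over to the next send. This avoids both padding bytes and partial frames on the wire. The 3-byte frame size here just follows the "multiple of 3" example above; this is an illustration, not part of any spec.

```python
# Frame-aligned chunking sketch: the client buffers bytes arriving at an
# arbitrary rate (e.g. from a network) and emits only whole audio frames
# (here 3 bytes each), so the terminal never sees a partial frame.
class FrameAligner:
    def __init__(self, frame_size=3):
        self.frame_size = frame_size
        self.pending = b''  # partial frame carried over between feeds

    def feed(self, data: bytes) -> bytes:
        """Return the largest frame-aligned prefix; buffer the remainder."""
        buf = self.pending + data
        cut = len(buf) - (len(buf) % self.frame_size)
        self.pending = buf[cut:]
        return buf[:cut]
```

Whether this buffering belongs in the client or the terminal is exactly the design decision the comment asks the draft to spell out.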
-
Note
TLDR: Proposing an audio protocol for terminal emulators roughly following the initial development of the graphics protocol, starting with raw PCM frames. Looking for community feedback before drafting a full spec for v1.
Hey there,
I've been reading up on terminal emulators and terminal emulator protocols for the past year or so, thinking of something I could implement for my Bachelor's thesis. I have roughly 4-5 months until my thesis presentation, and I'm planning to maintain and expand this work afterward, especially as a potential theme for my Master's degree. I recently had the stupid idea of making a wayland compositor using the kitty graphics protocol, but quickly realized that would be incomplete without audio, since I couldn't find any fitting protocol (a buzzer with no pitch/duration control won't do). I looked up whether there's any protocol like the graphics protocol but for audio and didn't find anything. So I started reading on how the graphics protocol was designed in #33, then looked to see if anyone had proposed something for audio in Kitty and found #7722, which was closed with:
I'm interested in working on this and would love the community's input. I agree and find the major goals of the graphics protocol a great starting point, but I am not sure raw PCM frames would be the best way to go long term; the nature of TCP/IP would probably make audio choppy in anything but the most ideal scenarios. On the other hand, I wonder whether supporting an open standard like Ogg would be too high a burden on implementing terminals, since a basic implementation of Ogg decoding and Ogg Vorbis stream framing would greatly help with the issues raw PCM frames would have.
Currently I am thinking of proposing a v1 with raw PCM audio only, keeping the initial implementation simple to follow the philosophy of the graphics protocol: start small and expand afterwards. Here's what I have in mind so far:
Version 1 - Raw PCM Audio
This avoids:
The protocol would include a query mechanism (mostly inspired by the graphics protocol pattern) to detect whether a terminal supports audio playback before the client attempts to send audio data.
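The graphics protocol pattern referred to here is: send a minimal escape code in query mode, immediately followed by a Primary Device Attributes request (`ESC [ c`); since every terminal answers DA1, receiving the DA1 reply without a preceding protocol response means the feature is unsupported. A sketch of what that could look like for audio, with entirely hypothetical key names:

```python
import re

# Hypothetical capability query, modeled on the graphics protocol's
# query-then-DA1 handshake. A supporting terminal would answer with an
# APC response before the DA1 reply; otherwise only the DA1 reply comes.
def build_query(stream_id=1):
    # a=q: query action (invented key); ESC [ c is the real DA1 request.
    return f"\x1b_Aa=q,i={stream_id};\x1b\\" + "\x1b[c"

def supports_audio(reply: str) -> bool:
    """True if the terminal's reply contains an audio APC response."""
    return re.search(r'\x1b_A[^\x1b]*\x1b\\', reply) is not None
```

As the feedback above notes, shared-memory and shared-filesystem support cannot be detected this way alone; those require an actual test transmission.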
Future improvements
Questions for the Community (and my opinion as of now)
Does starting with PCM-only for v1 seem reasonable, or should compressed formats be in scope from day one?
What playback controls would be essential? (play/pause/seek/volume/stop)
Should this support bidirectional audio (microphone input) eventually, or stay output-only?
Any concerns about the proposed sample rates and bit depths?
Would this be useful beyond the compositor and file playback use cases of icat for audio (+over ssh) #7722? What applications would benefit?
I'm planning to draft a full specification document (similar to the graphics protocol spec) and would love feedback before going too far down any particular path. Also willing to contribute code for implementing a draft specification.
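As a concrete illustration of what the raw PCM payload of such a v1 would look like, here is a sketch generating one second of a 440 Hz tone as 16-bit little-endian signed mono PCM at 44100 Hz; the rate and bit depth are just common example values, not the proposal's final parameters.

```python
import math
import struct

# Generate 16-bit little-endian signed mono PCM: the simplest kind of
# payload a PCM-only v1 client would transmit to the terminal.
def sine_pcm(freq=440.0, rate=44100, seconds=1.0, amplitude=0.5):
    n = int(rate * seconds)
    samples = (int(32767 * amplitude * math.sin(2 * math.pi * freq * i / rate))
               for i in range(n))
    return struct.pack(f'<{n}h', *samples)
```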