# Extra Timestamps for encoded RTC media frames

## Authors:

- Guido Urdaneta (Google)

## Participate
- https://github.com/w3c/webrtc-encoded-transform

## Introduction

The [WebRTC Encoded Transform](https://w3c.github.io/webrtc-encoded-transform/)
API allows applications to access encoded media flowing through a WebRTC
[RTCPeerConnection](https://w3c.github.io/webrtc-pc/#dom-rtcpeerconnection).
Video data is exposed as
[RTCEncodedVideoFrame](https://w3c.github.io/webrtc-encoded-transform/#rtcencodedvideoframe)s
and audio data is exposed as
[RTCEncodedAudioFrame](https://w3c.github.io/webrtc-encoded-transform/#rtcencodedaudioframe)s.
Both types of frames have a getMetadata() method that returns a number of
metadata fields with more information about the frame.

This proposal adds a number of metadata fields containing timestamps, in line
with recent additions to
[VideoFrameMetadata](https://w3c.github.io/webcodecs/video_frame_metadata_registry.html#videoframemetadata-members)
in [WebCodecs](https://w3c.github.io/webcodecs/) and
[requestVideoFrameCallback](https://wicg.github.io/video-rvfc/#video-frame-callback-metadata-attributes).

For the purposes of this proposal, we use the following definitions:
* The *capturer system* is the system that originally captures a media frame,
  typically from a local camera, microphone or screen-share session. This frame
  can be relayed through multiple systems before it reaches its final
  destination.
* The *receiver system* is the final destination of the captured frames. It
  receives the data via an
  [RTCPeerConnection](https://w3c.github.io/webrtc-pc/#dom-rtcpeerconnection)
  and uses the WebRTC Encoded Transform API with the changes included in this
  proposal.
* The *sender system* is the system that communicates directly with the
  *receiver system*. It may be the same as the capturer system, but not
  necessarily. It is the last hop before the captured frames reach the receiver
  system.

The proposed new metadata fields are:
* `receiveTime`: The time when the frame was received from the sender system.
* `captureTime`: The time when the frame was captured by the capturer system.
  This timestamp is set by the capturer system.
* `senderCaptureTimeOffset`: An estimate of the offset between the capturer
  system clock and the sender system clock. The receiver system can compute
  the clock offset between itself and the sender system, and the two offsets
  can be used to adjust the `captureTime` to the receiver system clock, as
  sketched after this list.
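
For illustration only, here is a minimal sketch of that adjustment, mirroring
the arithmetic used in the full example later in this document. The helper name
and the `senderReceiverClockOffset` argument are assumptions for this sketch;
the application is expected to estimate that offset itself (e.g. from WebRTC
stats), and all values are in milliseconds.

```js
// Sketch (not normative): express a frame's captureTime on the receiver clock.
// metadata: the result of encodedFrame.getMetadata() with the proposed fields.
// senderReceiverClockOffset: application-computed estimate of the offset
// between the sender system clock and the receiver system clock.
function captureTimeOnReceiverClock(metadata, senderReceiverClockOffset) {
  return metadata.captureTime +
      metadata.senderCaptureTimeOffset +
      senderReceiverClockOffset;
}
```

The end-to-end delay for a frame can then be computed as
`metadata.receiveTime - captureTimeOnReceiverClock(metadata, senderReceiverClockOffset)`,
as the full example below does.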

`captureTime` and `senderCaptureTimeOffset` are provided in WebRTC by the
[Absolute Capture Time header extension](https://webrtc.googlesource.com/src/+/refs/heads/main/docs/native-code/rtp-hdrext/abs-capture-time).

Note that the [RTCRtpContributingSource](https://www.w3.org/TR/webrtc/#dom-rtcrtpcontributingsource)
interface also exposes these timestamps
(see also the [extensions](https://w3c.github.io/webrtc-extensions/#rtcrtpcontributingsource-extensions)),
but in a way that is not suitable for applications using the WebRTC Encoded
Transform API. The reason is that encoded transforms operate per frame, while
the values in `RTCRtpContributingSource` are the most recent seen by the UA,
which makes it impossible to know whether the values provided by
`RTCRtpContributingSource` actually correspond to the frames being processed
by the application.


## User-Facing Problem

This API supports applications where it is useful to measure the delay between
the original capture of a media frame and its reception.

Some example use cases are:
1. Audio/video synchronization measurements
2. Performance measurements
3. Delay measurements

In all of these cases, the application can log the measurements for offline
analysis or A/B testing, but also adjust application parameters in real time.

### Goals

- Provide Web applications using WebRTC Encoded Transform access to receive and
  capture timestamps, in addition to the existing metadata already provided.
- Align encoded frame metadata with the
  [metadata provided for raw frames](https://w3c.github.io/webcodecs/video_frame_metadata_registry.html#videoframemetadata-members).

### Non-goals

- Provide mechanisms to improve WebRTC media transmission based on the
  information provided by these new metadata fields.

### Example

This is an example of an application that:
1. Computes the delay between audio and video (A/V sync).
2. Computes the end-to-end delay and the local processing time, and logs and/or
   updates remote parameters based on these measurements.

```js
// Code in a DedicatedWorker.
let lastVideoCaptureTime;
let lastAudioCaptureTime;
let lastVideoReceiveTime;
let lastVideoSenderCaptureTimeOffset;
let lastVideoProcessingTime;
let senderReceiverClockOffset = null;

function updateAVSync() {
  const avSyncDifference = lastVideoCaptureTime - lastAudioCaptureTime;
  doSomethingWithAVSync(avSyncDifference);
}

// Measures delay from original capture until reception by this system.
// Other forms of delay are also possible.
function updateEndToEndVideoDelay() {
  if (senderReceiverClockOffset == null) {
    return;
  }

  const adjustedCaptureTime =
      senderReceiverClockOffset + lastVideoSenderCaptureTimeOffset + lastVideoCaptureTime;
  const endToEndDelay = lastVideoReceiveTime - adjustedCaptureTime;
  doSomethingWithEndToEndDelay(endToEndDelay);
}

function updateVideoProcessingTime() {
  const processingTime = lastVideoProcessingTime - lastVideoReceiveTime;
  doSomethingWithProcessingTime(processingTime);
}

function createReceiverAudioTransform() {
  return new TransformStream({
    start() {},
    flush() {},
    async transform(encodedFrame, controller) {
      const metadata = encodedFrame.getMetadata();
      lastAudioCaptureTime = metadata.captureTime;
      updateAVSync();
      controller.enqueue(encodedFrame);
    }
  });
}

function createReceiverVideoTransform() {
  return new TransformStream({
    start() {},
    flush() {},
    async transform(encodedFrame, controller) {
      const metadata = encodedFrame.getMetadata();
      lastVideoCaptureTime = metadata.captureTime;
      updateAVSync();
      lastVideoReceiveTime = metadata.receiveTime;
      lastVideoSenderCaptureTimeOffset = metadata.senderCaptureTimeOffset;
      updateEndToEndVideoDelay();
      doSomeEncodedVideoProcessing(encodedFrame.data);
      lastVideoProcessingTime = performance.now();
      updateVideoProcessingTime();
      controller.enqueue(encodedFrame);
    }
  });
}

// Code to instantiate transforms and attach them to the receiver pipelines.
onrtctransform = (event) => {
  let transform;
  if (event.transformer.options.name == "receiverAudioTransform")
    transform = createReceiverAudioTransform();
  else if (event.transformer.options.name == "receiverVideoTransform")
    transform = createReceiverVideoTransform();
  else
    return;
  event.transformer.readable
      .pipeThrough(transform)
      .pipeTo(event.transformer.writable);
};

onmessage = (event) => {
  senderReceiverClockOffset = event.data;
};

// Code running on Window.
const worker = new Worker('worker.js');
const pc = new RTCPeerConnection();

// Do ICE and offer/answer exchange. Removed from this example for clarity.

// Configure transforms in the worker.
pc.ontrack = e => {
  if (e.track.kind == "video")
    e.receiver.transform = new RTCRtpScriptTransform(worker, { name: "receiverVideoTransform" });
  else // audio
    e.receiver.transform = new RTCRtpScriptTransform(worker, { name: "receiverAudioTransform" });
};

// Compute the clock offset between the sender and this system.
// getStats() returns a promise, so this part assumes an async context
// (e.g. a module with top-level await).
const stats = await pc.getStats();
const remoteOutboundRtpStats = getRequiredStats(stats, "remote-outbound-rtp");
const remoteInboundRtpStats = getRequiredStats(stats, "remote-inbound-rtp");
const senderReceiverTimeOffset =
    remoteOutboundRtpStats.timestamp -
    (remoteOutboundRtpStats.remoteTimestamp +
     // roundTripTime is reported in seconds; convert to milliseconds.
     remoteInboundRtpStats.roundTripTime * 1000 / 2);

worker.postMessage(senderReceiverTimeOffset);
```
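
The example assumes a `getRequiredStats` helper that picks the relevant entry
out of the stats report. A minimal sketch of such a helper is shown below; the
selection logic is an assumption, and a real application may need to match a
specific SSRC or media kind.

```js
// Sketch (not part of the proposal): return the first stats entry of the
// given type from an RTCStatsReport, throwing if none is present.
function getRequiredStats(report, type) {
  for (const stats of report.values()) {
    if (stats.type === type) {
      return stats;
    }
  }
  throw new Error(`No stats of type ${type}`);
}
```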

## Alternatives considered

### Alternative 1: Use the values already exposed in `RTCRtpContributingSource`

`RTCRtpContributingSource` already exposes the same timestamps as the ones in
this proposal. The problem with using those timestamps is that it is impossible
to reliably associate them with a specific encoded frame exposed by the WebRTC
Encoded Transform API.

This makes any of the computations in this proposal unreliable.

### Alternative 2: Expose only `captureTime` and `receiveTime`

`senderCaptureTimeOffset` is a value provided by the
[Absolute Capture Time](https://webrtc.googlesource.com/src/+/refs/heads/main/docs/native-code/rtp-hdrext/abs-capture-time#absolute-capture-time)
WebRTC header extension, but that extension updates the value only periodically,
since there is little value in computing the estimate for every packet, so it
is, strictly speaking, not a per-frame value. Arguably, an application could use
the `senderCaptureTimeOffset` already exposed in `RTCRtpContributingSource`.

However, given that this value is coupled with `captureTime` in the header
extension, it looks appropriate and more ergonomic to expose the pair in the
frame as well. While clock offsets do not usually change significantly
in a very short time, there is some extra accuracy in having the estimated
offset between the capturer system and the sender system for that particular
frame. This could be more visible, for example, if the set of relays that
frames go through from the capturer system to the sender system changes.

Exposing `senderCaptureTimeOffset` also makes it clearer that the `captureTime`
comes from the original capturer system, so it needs to be adjusted using the
corresponding clock offset.

### Alternative 3: Expose a `captureTime` already adjusted to the receiver system's clock

The problem with this option is that clock offsets are estimates. Using
estimates makes computing A/V sync more difficult and less accurate.

For example, if the UA uses a single estimate during the whole session,
the A/V sync computation will be accurate, but the capture times themselves
will become inaccurate, as the clock offset estimate is never updated. Any
other computation made with the `captureTime` and other local timestamps will
be inaccurate as well.

### Alternative 4: Expose a `localClockOffset` instead of a `senderCaptureTimeOffset`

This would certainly support the use cases presented here, but it would have the
following downsides:
* It would introduce an inconsistency with the values exposed in
  `RTCRtpContributingSource`. This can lead to confusion, as the
  `senderCaptureTimeOffset` is always paired with the `captureTime` in the
  header extension and developers expect this association.
* Applications can compute their own estimate of the offset between sender
  and receiver using WebRTC stats and can control how often to update it.
* Some applications might be interested in computing delays using the sender
  as reference.

In short, while this would be useful, the additional value is limited compared
with the clarity, consistency and extra possibilities offered by exposing the
`senderCaptureTimeOffset`.


## Accessibility, Privacy, and Security Considerations

These timestamps are already available, in a form less suitable for
applications using WebRTC Encoded Transform, as part of the
`RTCRtpContributingSource` API:

* The `captureTime` field is available via the
  [RTCRtpContributingSource.captureTimestamp](https://w3c.github.io/webrtc-extensions/#dom-rtcrtpcontributingsource-capturetimestamp) field.

* The `senderCaptureTimeOffset` field is available via the
  [RTCRtpContributingSource.senderCaptureTimeOffset](https://w3c.github.io/webrtc-extensions/#dom-rtcrtpcontributingsource-sendercapturetimeoffset) field.

* The `receiveTime` field is available via the
  [RTCRtpContributingSource.timestamp](https://w3c.github.io/webrtc-pc/#dom-rtcrtpcontributingsource-timestamp) field.

While these fields are not 100% equivalent to the fields in this proposal,
they have the same privacy characteristics. Therefore, we consider that the
privacy delta of this proposal is zero.

## References & acknowledgements

Many thanks for valuable feedback and advice from:
- Florent Castelli
- Harald Alvestrand
- Henrik Boström