Switch to fmp4 instead of mpegts #542

Draft
zoriya wants to merge 11 commits into master from feat/fmp4

Conversation

@zoriya
Owner

zoriya commented Jun 17, 2024

zoriya added the enhancement and transcoder labels Jun 17, 2024
zoriya added this to the v4.7.0 milestone Jun 17, 2024
Comment thread transcoder/src/stream.go
func (ts *Stream) run(start int32) error {
	// Start the transcode up to the 100th segment (or less)
	length, is_done := ts.file.Keyframes.Length()
	end := min(start+100, length)
Owner Author

Remember to revert this after testing

@zoriya
Owner Author

zoriya commented Jun 19, 2024

Maybe it would be wisest to switch to -f hls instead of -f segment to use its built-in fmp4 handling. Would need to check if it's possible to use segments from multiple transcode processes this way tho.
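
For reference, a minimal sketch of how the hls muxer's fmp4 flags could plug into our existing exec.Command call (flag values and filenames are placeholders, untested here):

```go
package sketch

import (
	"os/exec"
	"path/filepath"
)

// buildHlsFmp4Cmd sketches the fmp4-relevant flags of ffmpeg's hls muxer.
// The 4s target duration and filenames are placeholders, not tested values.
func buildHlsFmp4Cmd(path, outDir string) *exec.Cmd {
	return exec.Command("ffmpeg",
		"-i", path,
		"-map", "0:v:0",
		"-c:v", "copy",
		"-f", "hls",
		"-hls_segment_type", "fmp4", // .m4s fragments instead of .ts
		"-hls_fmp4_init_filename", "init.mp4", // codec config lives in a separate init segment
		"-hls_time", "4", // target (not exact) segment duration
		"-hls_segment_filename", filepath.Join(outDir, "segment-%d.m4s"),
		filepath.Join(outDir, "index.m3u8"),
	)
}
```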

@zoriya
Owner Author

zoriya commented Jun 30, 2024

I'm realising supporting this would allow av1 streaming.

This would be huge to reduce bandwidth usage.

Definitely want to work on this in the following days

zoriya added 3 commits July 1, 2024 08:08
It seems presentation times and durations are not consistent with previous segments, so the hls renderer can't keep up
Comment thread transcoder/src/stream.go Outdated
zoriya added 3 commits July 1, 2024 15:39
For now, disabled all audio variants since their handling will be entirely different.
Found out that audio and video segments don't need to line up (same number/duration). As long as the whole file stays long enough, it's fine.

Video handling now fails when there are too many keyframes close together (like 0.01, 0.3, 0.4, 2, 4). It would only output 3 segments instead of the 5 we would want. We might get around this by using fragments containing more than one keyframe if we handle things right.
@zoriya
Owner Author

zoriya commented Jul 3, 2024

Idea seems to work, need to:

@zoriya
Owner Author

zoriya commented Jul 3, 2024

We would need to fix the assumption that keyframes are a file-level thing. This should move to a stream-level thing.

It's legal to have audio/video with different segments. It's also legal to have two videos w/ different segments. The only constraint is that variants should have the same segments (aka video1/720p & video1/1080p need the same segments while video1/720p & video2/1080p can have different segments).

Doing this would also unlock video selection when the file contains multiple video streams.
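
A minimal sketch of what stream-level keying could look like (illustrative names, not the actual types):

```go
package sketch

// StreamKey identifies one elementary stream inside a media file.
// Hypothetical sketch; names are illustrative, not the real code.
type StreamKey struct {
	Sha     string // file identifier
	IsVideo bool   // audio and video may segment differently
	Index   int    // two video streams may also segment differently
}

// Keyframes would be looked up per stream instead of per file. All quality
// variants of one stream share an entry, so video1/720p & video1/1080p get
// identical segments while video1 & video2 can differ.
var keyframes = map[StreamKey][]float64{}
```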

@zoriya
Owner Author

zoriya commented Jul 3, 2024

I'm thinking of first fixing this & the data store in another PR before continuing this one.

@JeWe37

JeWe37 commented Aug 3, 2024

I'm realising supporting this would allow av1 streaming.

This would be huge to reduce bandwidth usage.

Definitely want to work on this in the following days

Given their similar encoding complexity and performance, plus currently still wider support, shouldn't HEVC be preferred here?

@zoriya
Owner Author

zoriya commented Aug 3, 2024

hevc would be supported too

this would allow avc1 & hevc transmuxing. we could also add an avc1 or hevc transcode mode, but i don't know how this could be negotiated. maybe an option on the server, and make them available in the manifest for the client to pick

@JeWe37

JeWe37 commented Aug 3, 2024

hevc would be supported too

this would allow avc1 & hevc transmuxing. we could also add an avc1 or hevc transcode mode, but i don't know how this could be negotiated. maybe an option on the server, and make them available in the manifest for the client to pick

From what I understand, Kyoo currently doesn't offer any API to tell it what codecs the client supports. Jellyfin has a whole system for this, see the DeviceProfile key in this API. Presumably in the future something similar will be required as more devices with different capabilities are supported. This would probably entail some changes to the transcoding API too, though; a session concept would likely make sense in that case as well.

@zoriya
Owner Author

zoriya commented Aug 4, 2024

The server is not the one responsible for knowing if a media file can be played on the client. The client itself does the check and requests a format it can support. This is done either using the mimeCodec field of the /info endpoint for direct streaming (pristine quality) or using the hls spec with multi-variants.

As long as all formats can be served under a single manifest there won't be a need for a custom protocol to negotiate formats: the hls spec already handles this.
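
For illustration, a single multi-variant master playlist could advertise both codecs and let the client pick (the codec strings and bandwidths below are made up):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
avc1/1080p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1800000,RESOLUTION=1920x1080,CODECS="hvc1.1.6.L120.90,mp4a.40.2"
hevc/1080p/index.m3u8
```

A client that can't decode hevc simply never selects the hvc1 variant.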

I heard that some devices play badly with fmp4 (but work well with mpegts), so we will probably need to at least support those two versions in 2 different manifests just for those devices.

PS: There is already a session concept; it's used to know if a user is going to need a segment or to kill old ffmpeg processes.

@zoriya
Owner Author

zoriya commented Apr 30, 2026

Nothing too important, just trying to remember the state of this. I tried to use opencode for that; not sure it was really useful, but whatever, here it is:

llm summary

fMP4 support for lazy HLS transcoder

Session ID: ses_224422841ffe2gPU3SJwXdL7y7
Created: 4/30/2026, 2:15:45 AM
Updated: 4/30/2026, 2:47:46 AM


User

The transcoder in @transcoder/src/stream.go is able to encode an arbitrary file to hls lazily. This means any hls player can play the file in any quality as if everything was pre-encoded on the server, but when a segment is requested, an ffmpeg process is spawned to serve just that chunk (and prepare the next ones).

You can find a small writeup about it here: https://zoriya.dev/blogs/transcoder/

We are currently using .ts files for segments and we would like to switch to fmp4 to support hevc or av1.

Can you explore the options for this and how feasible that looks? Can we continue to use -f segment or would -f hls be better suited? Is there room for simplification in the code if we do switch?

There was a previous attempt to switch to fmp4 that failed; some ideas from that attempt got implemented in the meantime.

Here are the draft PRs of the previous attempts:

Take those previous attempts into account when proposing a solution. Summarize all findings (from previous attempts, internet search, and your knowledge).


Assistant

Great question — I dug through current code, both draft PRs, FFmpeg docs, and related notes.

What matters most: your lazy architecture depends on deterministic segment numbering/timing across independent ffmpeg runs. That is the core constraint.

  • Current implementation is built around explicit split points (-segment_times) in transcoder/src/stream.go:295, manual playlist generation in transcoder/src/stream.go:391, and .ts parsing in transcoder/src/api/streams.go:308.
  • FFmpeg docs confirm segment is deterministic-oriented (explicit split points), while hls is HLS-native but hls_time is a target, not strict per-cut control.

From docs (key excerpts):

  • hls_time: “Set the target segment length… Segment will be cut on the next key frame after this time has passed.”
  • segment: “Specify a list of split points…” (segment_times) and “Every segment starts with a keyframe…”
  • Also docs call hls “a more specific implementation for HLS segmentation.”
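
To make the contrast concrete, here is a simplified sketch of the current deterministic invocation (argument building only; the real call in transcoder/src/stream.go has more flags):

```go
package sketch

import (
	"fmt"
	"os/exec"
	"strings"
)

// buildSegmentCmd sketches the current approach: every cut point is handed
// to ffmpeg explicitly via -segment_times, so two independent runs that get
// the same keyframe list produce the same segment boundaries.
func buildSegmentCmd(path string, keyframes []float64, startSegment int) *exec.Cmd {
	times := make([]string, 0, len(keyframes))
	for _, kf := range keyframes {
		times = append(times, fmt.Sprintf("%.6f", kf))
	}
	return exec.Command("ffmpeg",
		"-i", path,
		"-c:v", "copy",
		"-f", "segment",
		"-segment_format", "mpegts",
		"-segment_times", strings.Join(times, ","), // explicit split points
		"-segment_start_number", fmt.Sprint(startSegment),
		"segment-%d.ts",
	)
}
```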

What previous attempts tell us

PR #542 (fMP4 + -f hls) found real issues:

  • timestamp/presentation discontinuity between restarted encodes,
  • need for init segment handling (init.mp4 / #EXT-X-MAP),
  • more complexity around parsing ffmpeg HLS output.

PR #592 confirmed the big blocker:

  • -hls_time behavior is not strict slicing per requested keyframes (you observed 5.9s/3.9s alternation).
  • Workaround attempt was keyframe coarsening (e.g. ~5s cadence), which changes your current “every keyframe segment” model.

Can you keep -f segment and just switch to fMP4?

Technically possible, but not ideal for true HLS-fMP4 semantics:

  • segment can write MP4 segments (segment_format + segment_format_options), but docs do not provide native EXT-X-MAP / init-fragment flow like hls.
  • You’ll likely re-implement HLS-fMP4 semantics yourself (init lifecycle, map tag correctness, fragment flags, continuity), which is fragile.
  • So: feasible for “MP4 chunk files”, weaker for “proper HLS-fMP4 behavior at scale”.
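
As a sketch, the fragmentation options would look something like this (the movflags are the standard mp4-muxer ones; untested in this exact setup):

```go
package sketch

// segmentFmp4Args sketches swapping the mpegts chunks for fragmented MP4
// while keeping -f segment. Playlist generation, EXT-X-MAP, and the init
// segment lifecycle would all remain our responsibility.
func segmentFmp4Args() []string {
	return []string{
		"-f", "segment",
		"-segment_format", "mp4",
		"-segment_format_options", "movflags=frag_keyframe+empty_moov+default_base_moof",
		"segment-%d.m4s",
	}
}
```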

Is -f hls better suited?

For fMP4/HLS correctness: yes.
For your lazy deterministic model: not automatically.

It’s better for:

  • -hls_segment_type fmp4
  • -hls_fmp4_init_filename
  • native playlist semantics

But it conflicts with:

  • exact per-keyframe deterministic slicing you currently rely on with segment_times.

What got better since those attempts (important)

A few earlier blockers are now improved in main code:

  • per-stream keyframe handling is now stream-specific (GetKeyframes(... isVideo, idx) in transcoder/src/keyframes.go:105), matching your PR discussion.
  • audio drift handling improved (output_ts_offset path in transcoder/src/stream.go:275).
  • packet-aligned audio keyframe extraction exists now (transcoder/src/keyframes.go:277), better than old fixed-grid approaches.

So feasibility today is better than during #542/#592.

Practical recommendation

I’d recommend:

  1. Keep current TS/segment path as stable fallback.
  2. Add a new fMP4 path using -f hls (not segment) behind a flag.
  3. Accept/define a coarser deterministic cadence (e.g. 4–6s segment model), instead of “every source keyframe.”
  4. Add init-segment plumbing (init.mp4 route + #EXT-X-MAP) and strict timestamp continuity checks.
  5. Validate on target players (Safari/tvOS especially; AV1 support still uneven across clients).

So the answer to your core question:

  • If your goal is proper HLS fMP4 for HEVC/AV1, -f hls is the right muxer.
  • If your goal is preserving your current deterministic per-request segmentation model unchanged, -f segment is easier — but weaker/awkward for true HLS-fMP4 semantics.

If you want, I can draft a concrete migration plan with exact ffmpeg arg sets and code touchpoints (stream.go, videostream.go, audiostream.go, api/streams.go) for a low-risk phased rollout.


User

Some questions i'd like to clarify:

  • what's in the init.mp4? would it be the same init.mp4 if we transcode segments [0, 100[ and segments [200, 300[? if not, which should we save?
  • would pts be valid across different transcodes? i remember playing with hls_segment_options to try to mitigate this.
  • could we predict deterministically at which times the -f hls muxer will cut segments? if it's always the next keyframe after hls_time * [the number of segments written so far], we should be able to as long as -ss is reliable, but my last attempt talks about a +/- initial time that i don't know of.
  • could we have a reliable -ss when doing -c:v copy? i think combining input & output seeking was not working, can you confirm?

Assistant

Great set of questions. Short answers first:

  • init.mp4 is codec/track metadata, not media samples.
  • Across independent ffmpeg runs, init.mp4 is not guaranteed identical.
  • -f hls cut points are mostly predictable, but there is indeed a start-offset term.
  • -c copy + seeking is only reliable at keyframe granularity; mixed input/output seek is not robust for exact boundaries.

1) What is in init.mp4, and which one to keep?

  • It contains ftyp + moov (track definitions, codec extradata like avcC/hvcC, timescales, track IDs, defaults), no media payload.
  • .m4s segments carry moof+mdat (actual fragment metadata + samples).
  • If you encode [0,100[ and [200,300[ in separate ffmpeg processes, init can be:
    • byte-different but compatible, or
    • genuinely incompatible (track IDs/extradata differences).
  • Safest policy: treat each independent run as potentially having its own init and pair it with its own #EXT-X-MAP.
  • Reusing one global init is safe only if stream layout + codec config are strictly identical across runs.
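
A sketch of what that pairing could look like in a generated media playlist (the run-scoped init names are hypothetical):

```
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:4
#EXT-X-MAP:URI="init-run0.mp4"
#EXTINF:4.000,
segment-0.m4s
#EXTINF:4.000,
segment-1.m4s
#EXT-X-DISCONTINUITY
#EXT-X-MAP:URI="init-run1.mp4"
#EXTINF:4.000,
segment-200.m4s
```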

2) Are PTS valid across different transcodes?

  • Not automatically. Independent runs often rebase differently.
  • Your prior hls_segment_options experiments make sense: frag_discont helps signal a fragmentation discontinuity, but it does not magically enforce full timeline continuity.
  • For reliable cross-run timeline:
    • keep one timestamp policy consistently (-copyts, -copytb, -avoid_negative_ts choice fixed),
    • and if needed apply explicit offset (-output_ts_offset) to place each run on one absolute timeline.
  • If continuity is not strict, use #EXT-X-DISCONTINUITY at run boundaries.
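
A sketch of pinning one timestamp policy per run; the -output_ts_offset idea mirrors the path already mentioned for stream.go, the rest is illustrative:

```go
package sketch

import (
	"fmt"
	"os/exec"
)

// timestampCmd sketches one fixed policy for every independent run: without
// -copyts, ffmpeg rebases timestamps to ~0 after the input seek, so
// -output_ts_offset shifts them back onto one shared absolute timeline.
func timestampCmd(path string, runStart float64) *exec.Cmd {
	return exec.Command("ffmpeg",
		"-ss", fmt.Sprintf("%.6f", runStart), // input seek, keyframe-snapped
		"-i", path,
		"-output_ts_offset", fmt.Sprintf("%.6f", runStart), // re-anchor this run
		"-avoid_negative_ts", "disabled", // never let the muxer silently shift ts
		"-c:v", "copy",
		"-f", "hls",
		"index.m3u8",
	)
}
```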

3) Can -f hls cut times be predicted deterministically?

  • From hlsenc.c, split condition is effectively cumulative:
    • cut when (pkt_pts - start_pts) >= target.
    • target is recording_time * segment_index (with special first-phase behavior if hls_init_time is used).
  • So yes, there is an implicit initial offset via start_pts (this is likely your “unknown +/- initial time”).
  • Also, cut is gated by keyframe (unless split_by_time), so actual boundary is “first eligible keyframe after threshold”.
  • Bottom line: predictable if keyframes and start_pts are stable across runs; otherwise drift appears.
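
The rule above written out as code (a sketch under the stated assumptions: stable keyframes, a known startPts, and hls_init_time's special first phase ignored):

```go
package sketch

// predictCuts applies the split condition described above: cut at the first
// keyframe whose distance from startPts reaches the cumulative threshold
// hlsTime * segmentIndex. One cut per keyframe, matching the muxer's gating.
func predictCuts(keyframes []float64, startPts, hlsTime float64) []float64 {
	var cuts []float64
	segIdx := 1
	for _, kf := range keyframes {
		if kf-startPts >= hlsTime*float64(segIdx) {
			cuts = append(cuts, kf)
			segIdx++
		}
	}
	return cuts
}
```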

4) Reliable -ss with -c:v copy (and mixed input/output seek)?

  • Confirming your memory: exact boundary behavior is unreliable.
  • -c copy seeks at random access points / keyframes; not frame-accurate.
  • Input seek (-ss before -i) is fast/coarse; output seek (-ss after -i) can still be problematic with copy at boundaries.
  • Combining both usually increases edge-case weirdness rather than fixing it for copy mode.
  • So for lazy independent windows, -c:v copy is reliable only if you accept keyframe-snapped starts.
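
Given that, the only seek contract that stays deterministic in copy mode is feeding -ss an exact timestamp from your own keyframe index; a hypothetical sketch:

```go
package sketch

import (
	"fmt"
	"os/exec"
)

// seekToKeyframe sketches a deterministic copy-mode seek: -ss always gets
// an exact keyframe timestamp from the transcoder's index, so the
// "before or after the requested time" ambiguity stops mattering.
// Illustrative only; not the real transcoder code.
func seekToKeyframe(path string, keyframes []float64, segment int) *exec.Cmd {
	return exec.Command("ffmpeg",
		"-ss", fmt.Sprintf("%.6f", keyframes[segment]), // exact keyframe pts, input-side only
		"-i", path,
		"-c:v", "copy",
		"-f", "null", "-", // placeholder output for the sketch
	)
}
```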

If you want, next I can propose a concrete “safe fMP4 lazy mode” contract (init rotation rules + discontinuity rules + exact ffmpeg flags) that fits your current architecture.


User

Some confirmations for each point:

Reusing one global init is safe only if stream layout + codec config are strictly identical across runs

I'm encoding the same file in a different ffmpeg process, would this change?

what would the start_pts refer to? the pts of the first keyframe after the input seek? the output_ts_offset flag? you might need to make a table with different flag combinations

Confirm this behavior by reading the source code or searching the internet, from my understanding/experimentations:
-ss time -i file -c:v copy seeks to the keyframe after OR before time (with no good way to predict which one)

I don't really remember how it works if we combine with output seek tho


Assistant

Perfect, these are exactly the right things to pin down.

  1. Same file, different ffmpeg process: can init.mp4 change?
  • Usually stable, if you use same stream mapping/codec settings and the source stream config does not change mid-file.
  • Can still differ across runs in edge cases:
    • stream has in-band parameter changes (new SPS/PPS/VPS, resolution/profile change),
    • different selected stream/order/track-id behavior,
    • slightly different muxer metadata/box layout (byte-different but possibly still compatible).
  • Practical rule:
    • same file + same variant settings => often reusable,
    • but safest is to verify compatibility (codec extradata + track layout), not assume.
  2. What exactly is start_pts in -f hls splitting?
    From hlsenc.c, split check is effectively:
  • pkt_pts - start_pts >= end_pts (plus keyframe gating unless split_by_time)

So start_pts is the muxer’s baseline PTS for that run (first packet timeline anchor for that variant stream), not output_ts_offset itself. output_ts_offset can shift packet PTS and therefore indirectly shift that baseline.

Quick table (same input, independent runs):

| Flags | Baseline / cut reference effect |
| --- | --- |
| (none) | baseline depends on demuxer timestamps after seek; can vary run-to-run |
| -copyts | keeps original input ts (less rebasing), but muxer may still shift in some cases |
| -copyts -start_at_zero | preserves relative timeline but rebases origin consistently |
| -output_ts_offset X | shifts output timeline by X, therefore shifts effective split thresholds too |
| -avoid_negative_ts ... | muxer may shift timestamps, can break naive assumptions even with -copyts |

And yes, your “unknown initial +/-” intuition matches the source: the cumulative threshold is anchored to start_pts, not absolute media time.

  3. -ss time -i file -c:v copy: before or after keyframe?
  • Docs say: input -ss seeks to closest seek point before requested time.
  • In practice, observed first emitted packet can look before/after requested time depending on container indexing, timestamp quirks, stream start offsets, and copy-mode packet handling.
  • So your statement “before OR after with no good prediction” is fair from an ops perspective, even if the nominal behavior is “before”.
  • For combined input+output seek with -c copy: not robust for exact boundaries. Output seek discards by timestamp, but copy-mode still lacks frame-accurate trimming guarantees.

So, net: your previous conclusions are directionally correct; they align with the docs and explain why your earlier attempts hit timeline/discontinuity issues.


Labels

enhancement, transcoder

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Switch to fmp4 segments to support a broader range of codecs

2 participants