Support for `<img>` in SMIL by HadrienGardeur · Pull Request #2919 · w3c/epub-specs

HadrienGardeur · 2026-02-05T16:16:45Z

This PR closes #2883.

It's currently a draft PR and not ready to be reviewed yet.

Changes:

Added a new section for the img element
Added img to the content model for par (set to 0 or 1)
Added Media Fragment URI to normative references
Added a new paragraph in "Referencing document fragments"
Added new example for comics in "Structural semantics in overlays"
Added new entry in changelog

Links:

Preview | Diff

mattgarrish · 2026-02-05T17:43:50Z

epub34/authoring/index.html

 									<ul class="nomark">
 										<li>
 											<p> [^text^] <code>[exactly 1]</code>
 											</p>
 										</li>
 										<li>
 											<p> [^audio^] <code>[0 or 1]</code>
 											</p>
 										</li>
 									</ul>


The new content model for par needs to be specified. Is img + audio a separate combination, since relying on tts for images likely shouldn't be an option, or is img optionally allowed and text remains required so all three elements can be used together? (But is showing both an image and a text fragment realistic if the viewport is occupied by the fxl image?)

I guess the latter would be a new case of img and audio required and text optional, not img as an optional element in the current model. Probably doesn't make sense to always require text with img.

(Apologies if you're already working on this, but I assumed you were moving on to the RS aspects by opening the draft.)

No I'm not done with anything at this point, I just prefer to open a draft PR as early as possible.

For me, all combinations are valid:

img + text (textual alternatives for regions, for example description of a panel)

img + audio (audio narration for regions)

img + audio + text (this would allow someone to either listen to the pre-recorded audio or, for example, use a Braille tablet to consume the textual content)

Even img on its own could be a valid use case and result in a panel-by-panel navigation for example in comics.

I didn't go through the section that you highlighted yet, but IMO text, img or audio would all become 0 or 1 with at least one of them present.

I don't believe that's how smil expects it to work by default, though, if I understand you correctly (that you pick the applicable format to synchronize). I'll preface by saying I'm not the expert on this, but my understanding is that if you specify all three in a single par then all three are expected to be synchronized.

This is definitely meant to synchronize all three media together.

You can open the following Google Slides in presenter mode to see a demo of what this could feel like: https://docs.google.com/presentation/d/1LGHRIN_vHl-H-bgXsHkhqrL0owMCy8qx854x9YU3d8w/edit?usp=sharing

That said, even with Media Overlays today you're free to use what you want:

just text

just audio

or text and audio together

Multiple apps already offer the ability to consume EPUB with Media Overlays like an audiobook with just a player interface on screen.

In the specific case of comics and highly illustrated content, the ideal scenario would be to customize things to your needs:

for example, a dyslexic user could enable audio on captions and speech bubbles (either using <audio> or with TTS on the content of <text>), skip descriptions using skippability (this would need a specific role that could be identified), and display text below the image fragment (this is what my example in Google Slides illustrates)

a blind user could go full audio (once again using either <audio>, with TTS on the content of <text> or using a textual view with a screen reader) without skipping anything at all

a user reading on a small screen could just use these image fragments to read more easily on their device and turn audio on just for speech bubbles

This is potentially problematic as it means you could validly have only an image listed, but what happens then?

That's the last use case that I've described above, this would give you region by region navigation.

With text, the reading system is supposed to tts the content before moving to the next par, but if someone only lists an image does it load and unload instantaneously?

Frankly, there's more of a use case for <img> or <audio> on their own than having just <text>.

What's the point of a SMIL with just <text> when you can just use TTS? Skippability and escapability can be achieved without SMIL; the only use case I can think of is to guide you through places in the publication.

> This is definitely meant to synchronize all three media together.

Okay, I'll have to see the new model for rendering the content before I comment any more on this. Having text content synced with a roll image, or an image placed into reflowable text, seems complex to spec out.

> That said, even with Media Overlays today you're free to use what you want:

I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.

> What's the point of a SMIL with just <text> when you can just use TTS?

The only advantage is that you don't have to change playback modes. If the body is professionally narrated you could sync the backmatter for TTS without having to prerecord it.

But that's not the issue. There's still a timing sequence for text if you push the rendering out to TTS. The reading system will present/highlight the text for as long as it takes to render it as speech.

If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?

It's a little weird to use a synchronization markup language to not synchronize anything. An audio-only par element is equally strange from a synchronization standpoint, but at least the timing of the clip gives it a duration to play.

But, these are just my immediate concerns. I can wait until the draft is in a more complete state before commenting any more so we don't add a lot of noise to the pull request.

> I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.

I was mostly talking about the UX from a RS perspective, of course the current spec requires text.

Our current spec is really written in a way where we assume that:

the main way this will be handled is by displaying text on the screen with audio playback in the background

and the use of authored CSS is very much skewed towards FXL

In a reflowable EPUB where users are free to select different themes, using authored CSS for highlighting is a potential accessibility hazard, since you could end up with major contrast issues.

In practice, a user could just listen to an EPUB with Media Overlays without displaying anything on a screen, even if the SMIL includes text and/or img. This would be indistinguishable from an audiobook from a UX perspective.

> If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?

I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud). That's also why the dur attribute exists in SMIL but we don't have it in EPUB.

From a UX perspective, it's important to keep in mind that even with <text> and <audio> the playback can either be:

continuous, where you go through the entire publication

per page or per spread, where the current page/spread is read and the RS waits for the user to switch to another one before playback continues

or handled element by element

With this last one, you have something that makes perfect sense for img on its own. For example:

whole page displayed

then just the first panel

then the second panel

then a zoom into part of the second panel to showcase a character and a bubble

etc.

Once you sync text or audio to each of these image fragments, then playback can also become continuous or based on page/spread.
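As a rough illustration of the element-by-element sequence above, each step could be expressed as an image reference with an optional Media Fragment URI region. This is only a sketch: the file name `page1.jpg`, the region values, and the helper `parse_region` are hypothetical, and it assumes the integer `xywh` syntax (with an optional `pixel:` or `percent:` unit) from the W3C Media Fragments URI recommendation.

```python
from urllib.parse import parse_qs, urldefrag


def parse_region(src):
    """Split an image reference into (resource, unit, (x, y, w, h)).

    Returns (resource, None, None) when no xywh fragment is present,
    which stands for "display the whole image".
    """
    resource, fragment = urldefrag(src)
    values = parse_qs(fragment).get("xywh")
    if not values:
        return resource, None, None
    value = values[0]
    unit = "pixel"  # pixel is the default unit when none is given
    if ":" in value:
        unit, value = value.split(":", 1)
    x, y, w, h = (int(n) for n in value.split(","))
    return resource, unit, (x, y, w, h)


# Hypothetical sequence: whole page, first panel, second panel, zoom.
steps = [
    "page1.jpg",
    "page1.jpg#xywh=percent:0,0,50,50",
    "page1.jpg#xywh=percent:50,0,50,50",
    "page1.jpg#xywh=percent:60,10,30,30",
]
regions = [parse_region(step) for step in steps]
for region in regions:
    print(region)
```

A reading system could then move to the next region on user input, or advance automatically once text or audio supplies a duration.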

> I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud)

Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present. TTS provides the duration for text content. We took out the section on embedded media last revision and advise people that referencing that kind of content from text will have unpredictable results, but text was never meant to link to content for which no duration could be established.

If you go back to 3.2, before we took that section out, the embedded media that text referred to had to have an audio component that could be played back:

When a text element references embedded media that contains audio, the audio sibling element is OPTIONAL.

Referring to images was always problematic because the text content to TTS wasn't as straightforward as getting the text content of a typical html element, but there was still the possibility of using alt.

That's why having img reference images outside of an xhtml wrapper as the only element of a par contradicts all expectations we've ever had for media overlays to synchronize content with audio.

Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable:

img + text = duration through tts.

img + audio = duration through audio playback.

img + text + audio = duration through audio playback

But img on its own has no duration and no audio, so why do we even need it? It's like region-based nav but if you take away playback control from the user and expect the reading system to meaningfully automate it. If you drop that one case, the internal conflicts with what we have are greatly reduced.
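The duration argument above can be summarized in a few lines. This is a minimal sketch of the reasoning in this comment, not spec text or reading system behavior; the function name `duration_source` is hypothetical.

```python
def duration_source(has_text, has_audio, has_img):
    """Return what gives a <par> its duration, or None if nothing does."""
    if has_audio:
        # A pre-recorded clip always wins: its clip length is the duration.
        return "audio"
    if has_text:
        # Without audio, the time TTS takes to speak the text is the duration.
        return "tts"
    # An image on its own (or an empty par) has no inherent duration.
    return None


combinations = {
    "img + text": duration_source(True, False, True),
    "img + audio": duration_source(False, True, True),
    "img + text + audio": duration_source(True, True, True),
    "img only": duration_source(False, False, True),
}
for label, source in combinations.items():
    print(label, "->", source)
```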

I think that there's a use case for <img> on its own, but as you've correctly pointed out it can also be implemented using the lesser known region-based navigation.

Some resources related to this use case in the Kindle ecosystem include:

https://kdp.amazon.com/en_US/help/topic/G9GSTY4LTRT39D4Z

https://kdp.amazon.com/en_US/help/topic/GJMRD9F78MS9F43R

> Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present.

It feels very vague quite frankly, because it's impossible to estimate the duration of a SMIL that's <text> only. Based on the voice and speed that I use, the duration will be very different.

> Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable

I completely understand your point here and for the sake of maximizing compatibility, I'm willing to focus strictly on two use cases right now:

img + text

and img + text + audio

With this approach, text (1 exactly) and audio (0 or 1) would keep their current content model. img would become optional (0 or 1) just like audio.
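To make the proposed cardinalities concrete, here is a minimal sketch that checks a `par` against them. It is an assumption-laden illustration, not EPUBCheck's logic: the `img` element and its `src` attribute follow the proposal in this PR, and the file names, fragment identifiers, and the helper `check_par` are hypothetical.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"

# Proposed cardinalities from the comment above: (min, max) per child of <par>.
PAR_MODEL = {"text": (1, 1), "audio": (0, 1), "img": (0, 1)}


def check_par(par):
    """Return a list of content-model problems for one <par> element."""
    counts = {name: 0 for name in PAR_MODEL}
    for child in par:
        # ElementTree qualifies tags as "{namespace}local".
        local = child.tag.rsplit("}", 1)[-1]
        if local in counts:
            counts[local] += 1
    problems = []
    for name, (low, high) in PAR_MODEL.items():
        if not low <= counts[name] <= high:
            problems.append(f"{name}: found {counts[name]}, expected {low}..{high}")
    return problems


# Hypothetical overlay: one par that fits the model, one image-only par.
SAMPLE = f"""
<smil xmlns="{SMIL_NS}" version="3.0">
  <body>
    <par id="ok">
      <text src="page1-script.xhtml#panel1"/>
      <audio src="page1.mp3" clipBegin="0s" clipEnd="4s"/>
      <img src="page1.jpg#xywh=percent:0,0,50,50"/>
    </par>
    <par id="img-only">
      <img src="page1.jpg#xywh=percent:50,0,50,50"/>
    </par>
  </body>
</smil>
"""

root = ET.fromstring(SAMPLE)
results = {par.get("id"): check_par(par) for par in root.iter(f"{{{SMIL_NS}}}par")}
print(results)
```

Relaxing the model later (for example to allow img + audio) would only mean changing the `text` entry in `PAR_MODEL` and adding an "at least one child" rule.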

I think that this somewhat raises the bar for what we require from content creators, but given the focus on accessibility and specialized libraries, it's a trade-off that we can work with.

If this works out, we could always relax our approach in a future revision to allow img + audio or img on its own.

HadrienGardeur · 2026-02-06T14:32:20Z

I'd like to create an example for this PR as well. This will take a bit of work but I'd like to convert what I created for Readium that's currently available at: https://github.com/readium/guided-navigation/tree/main/examples/comics

This is a CC-licensed comic so it's a pretty good example to work with.

Here's what I have in mind:

images in spine with HTML fallbacks
script for the entire publication in its own HTML (outside of the spine) based on https://github.com/readium/guided-navigation/blob/main/examples/comics/textual.md#with-limited-context
one SMIL per page based on https://github.com/readium/guided-navigation/blob/main/examples/comics/guided.json
img, audio and text for each <par>

iherman · 2026-02-06T15:53:33Z

> I'd like to create an example for this PR as well. This will take a bit of work but I'd like to convert what I created for Readium that's currently available at:

Alternatively, or in addition, it would be great to have this example added to the test suite. I am happy to help convert it into a bona fide test when the time comes (there are some metadata requirements).

HadrienGardeur · 2026-02-06T17:55:02Z

@iherman I'd like to create a full example (an entire chapter) but it could be easily shortened to a page or two for the test suite.

HadrienGardeur · 2026-02-07T15:57:01Z

Here's the WIP for this example: https://github.com/HadrienGardeur/accessible-epub-comics

OPF: https://github.com/HadrienGardeur/accessible-epub-comics/blob/main/EPUB/package.opf
Images in spine (page{x}.jpg)
HTML Fallbacks (page{x}.xhtml)
SMIL per page (page{x}.smil)
HTML script per page (page{x}-script.xhtml)
Navigation Document: https://github.com/HadrienGardeur/accessible-epub-comics/blob/main/EPUB/nav.xhtml

[UPDATE]: @iherman I'm done with the first page so there's probably enough content for a test file now.

iherman · 2026-02-09T06:36:41Z

> [UPDATE]: @iherman I'm done with the first page so there's probably enough content for a test file now.

Thanks. I will look at this at some point, but I would prefer to wait until this PR reaches consensus, i.e., gets merged, before doing this.

HadrienGardeur · 2026-02-09T15:08:05Z

I'm done with a first version of a full publication with <img> in SMIL.

Of course, epubcheck is unhappy with this example:

it doesn't like that an image in the spine has a Media Overlay
it doesn't like that I'm using <img> in SMIL
it doesn't like that I'm using <seq epub:type="panel"> to group <par> for each panel without using epub:textref
and it doesn't like seeing multiple references to script.xhtml in different SMIL files using <text>

By the way, this example also illustrates how images in spine + Media Overlays can be more accessible for comics than just wrapping up images in XHTML.

iherman · 2026-02-10T14:47:15Z

> I'm done with a first version of a full publication with <img> in SMIL.
>
> Of course, epubcheck is unhappy with this example:
>
> it doesn't like that an image in the spine has a Media Overlay
>
> it doesn't like that I'm using <img> in SMIL
>
> it doesn't like that I'm using <seq epub:type="panel"> to group <par> for each panel without using epub:textref
>
> and it doesn't like seeing multiple references to script.xhtml in different SMIL files using <text>
>
> By the way, this example also illustrates how images in spine + Media Overlays can be more accessible for comics than just wrapping up images in XHTML.

I can understand the first two entries, obviously. The third sounds like something that must be specified in the spec if we go ahead with images. But the fourth entry is weird. Is it an epubcheck error?

cc @mattgarrish @rdeltour

rdeltour · 2026-02-10T14:55:01Z

> But the fourth entry is weird. Is it an epubcheck error?

I suppose it comes from Media overlay document requirements (section 9.3.2.1 in the current draft) stating

more than one media overlay document MUST NOT reference the same EPUB content document.

rdeltour · 2026-02-10T15:01:20Z

For clarification, EPUBCheck is likely wrong here. script.xhtml is not referred from the spine so it may not be considered a content document and the statement above would not apply.

But, from memory, we do not verify that a document is referenced from the spine before applying checks, so basically any XHTML document found in the container is considered an XHTML Content Document by EPUBCheck, and the constraints of content documents are applied.
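The difference between the two interpretations can be sketched as follows. This is not EPUBCheck's implementation; the function `shared_documents`, the overlay file names, and the fragment identifiers are hypothetical, and the `spine` parameter stands in for "documents that actually count as EPUB content documents".

```python
from collections import defaultdict


def shared_documents(overlay_refs, spine=None):
    """Find documents referenced from more than one media overlay.

    overlay_refs maps an overlay path to the src values of its <text> elements.
    When spine is given, only documents listed in it are reported.
    """
    seen = defaultdict(set)
    for overlay, refs in overlay_refs.items():
        for ref in refs:
            document = ref.split("#", 1)[0]  # drop the fragment identifier
            seen[document].add(overlay)
    return {
        document: sorted(overlays)
        for document, overlays in seen.items()
        if len(overlays) > 1 and (spine is None or document in spine)
    }


# Hypothetical layout: one out-of-spine script shared by two overlays.
overlays = {
    "page1.smil": ["script.xhtml#p1-panel1", "script.xhtml#p1-panel2"],
    "page2.smil": ["script.xhtml#p2-panel1"],
}
spine = {"page1.jpg", "page2.jpg"}

without_spine_check = shared_documents(overlays)
with_spine_check = shared_documents(overlays, spine)
print(without_spine_check)
print(with_spine_check)
```

Without the spine filter, `script.xhtml` is flagged; with it, nothing is reported because the script is not in the spine.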

mattgarrish · 2026-02-10T15:09:54Z

Ya, this is a pretty radical departure from current media overlays where the xhtml content document is the driver. You can have multiple content documents that refer to the same media overlay, but a single content document can't be referred to from multiple media overlay documents because you can only specify one in the media-overlay attribute. (It could become a real headache to figure out when media overlays are valid moving forward.)

I'm assuming the media overlays section will need a pretty radical rewrite to take focus off it being largely bound to xhtml with audio sync capabilities. I'm not even sure how syncing images and content documents works, beyond even the display in the viewport issue. It presumably makes the text content a top-level content document which will require these text documents to be in the spine, but how does that work with rolls?

I was also thinking media overlays may not belong under aural rendering since it sounds like audio sync may not be a requirement anymore. You might have to pull it back out and maybe make aural rendering informative with a trimmed down explanation of how media overlays work for audio sync.

(But I'm trying to focus on getting the accessibility metadata guide wrapped up, as we need to be able to reference it from the techniques document, so I haven't had the time to keep up.)

iherman · 2026-02-10T15:27:56Z

Hm. I am feeling more and more uncomfortable with the amount of change triggered by the introduction of the <img> element at this point. Honestly, I thought it would be a simple change, and I was obviously wrong (and Matt, who warned us on the call, was right...).

I wonder whether it is indeed a good idea to do this at this point in the game. I would prefer to re-discuss it on the call to be sure we are still o.k. with this.

Sorry @HadrienGardeur

cc @w3c/w3c-group-145018-members

HadrienGardeur · 2026-02-10T17:43:47Z

Some additional thoughts:

I've used a single out-of-spine HTML for the whole script because this felt like the right thing to do, but I could write a script per page instead
one limitation that I can think of is that, in some cases, you need to go back and forth between the document positioned on the left of the spread and the one on the right (that's an issue with our current approach for Media Overlays that extends beyond this PR)
we could keep the restriction of having a single text element, but this would require a textual alternative/script for such files, which might be fine
IMO, we should allow SMIL on images in the spine, I don't see a good reason to limit this to XHTML/SVG with the addition of <img>
I could drop the <seq> elements with no epub:textref but we would lose the possibility of navigating panel by panel (which is quite useful)

As you can see, I can potentially work around some of these epubcheck issues at the cost of some features.

HadrienGardeur · 2026-02-10T17:49:32Z

> I was also thinking media overlays may not belong under aural rendering since it sounds like audio sync may not be a requirement anymore.

That's a different matter, unrelated to this PR.

With our current model, we only require <text> while <audio> is optional, which means that labeling the whole section "Aural rendering" feels like a misnomer.

If we also allow 0 or 1 <img> element, this won't change the fact that you could have a SMIL with just <par> and <text>.

HadrienGardeur · 2026-02-10T23:32:15Z

Working on this example, seeing the current limitations with epubcheck and having these discussions all feel like a very fruitful exercise to me.

Based on my recent comment (#2919 (comment)), I'll update my example to use a script per page instead of a global one.

~~I'm not changing yet how I use <seq> because it's worth discussing IMO, but if it's too much of a hassle, it's easy enough to just flatten the structure.~~

For the other epubcheck errors, I think that they should be amended eventually:

having Media Overlays on images is IMO a good thing for compatibility: RS that can't work with images in spine will fall back to the HTML variant and won't be exposed to <img> in SMIL
errors related to <img> would naturally vanish if we update epubcheck to match the spec

[UPDATE]: and that's done. As expected I'm still receiving the three errors pointed out above, but I could easily get rid of the one related to <seq>.
[UPDATE 2]: Instead of flattening the structure of the SMIL, I've added IDs for each panel in the script of each page and added references to them using epub:textref. We're down to two errors in epubcheck.
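For readers who want to see the shape of the structure described in the second update, the following sketch builds one panel-level `seq` with an `epub:textref` pointing at a panel ID in the per-page script. It is an illustration under assumptions, not a copy of the example repository: the IDs, file names, region values, and the helper `panel_seq` are hypothetical, and `img` follows the element proposed in this PR.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"
EPUB_NS = "http://www.idpf.org/2007/ops"

# Serialize SMIL as the default namespace and use the usual "epub" prefix.
ET.register_namespace("", SMIL_NS)
ET.register_namespace("epub", EPUB_NS)


def panel_seq(script, panel_id, items):
    """Build <seq epub:type="panel"> whose pars pair script text with image regions."""
    seq = ET.Element(
        f"{{{SMIL_NS}}}seq",
        {
            f"{{{EPUB_NS}}}type": "panel",
            f"{{{EPUB_NS}}}textref": f"{script}#{panel_id}",
        },
    )
    for text_id, image in items:
        par = ET.SubElement(seq, f"{{{SMIL_NS}}}par")
        ET.SubElement(par, f"{{{SMIL_NS}}}text", {"src": f"{script}#{text_id}"})
        ET.SubElement(par, f"{{{SMIL_NS}}}img", {"src": image})
    return seq


# Hypothetical panel with a caption and a speech bubble.
seq = panel_seq(
    "page1-script.xhtml",
    "panel1",
    [
        ("panel1-caption", "page1.jpg#xywh=percent:0,0,50,50"),
        ("panel1-bubble1", "page1.jpg#xywh=percent:10,5,20,15"),
    ],
)
xml = ET.tostring(seq, encoding="unicode")
print(xml)
```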

HadrienGardeur · 2026-02-11T14:48:16Z

@iherman give me another week to continue working on this before we discuss it in a call again.

I'm done with the example for now, which means that I can go back to the PR.

HadrienGardeur · 2026-02-14T10:25:01Z

Just a heads up to say that we'll start implementing this feature in the Readium Swift toolkit next week:

we don't work with SMIL as-is, it gets converted to an internal model where we already support images and videos
which means that parsing <img> in SMIL is very easy for us
we'll use my example to support img + audio on images in spine
I'll most likely create a variant that drops audio references to also test our fallback to TTS using text
for now, we won't use image fragments, this will come later this year
we expect to have this ready by the end of March and available in our beta for Thorium Mobile
I'll most likely do a demo at an accessibility conference organized in Oslo in June

In terms of UX, this initial support will offer two options for users:

when you decide to "read", it will display the full page with audio playback in the background (either continuous or page by page)
or when you decide to "listen", it will provide an audiobook-like experience

For screen reader users, we might also open the script instead of images but I'm not 100% sure about this one yet.

First draft for <img> in SMIL

7c6b6a9

github-project-automation bot added this to PM/EPUB issues Feb 5, 2026

github-project-automation bot moved this to In review in PM/EPUB issues Feb 5, 2026

mattgarrish reviewed Feb 5, 2026

Added img to the par content model

0cc7491

iherman added Spec-EPUB3 The issue affects the core EPUB 3.X Recommendation Spec-ReadingSystems The issue affects the EPUB Reading Systems 3.X Recommendation labels Feb 6, 2026

HadrienGardeur mentioned this pull request Feb 16, 2026

Region-based navigation #2914

Open


4 participants
