Support for <img> in SMIL #2919

Draft
HadrienGardeur wants to merge 2 commits into main from img-smil

Conversation

@HadrienGardeur
Member

@HadrienGardeur HadrienGardeur commented Feb 5, 2026

This PR closes #2883.

It's currently a draft PR and not ready to be reviewed yet.

Changes:

  • Added a new section for the img element
  • Added img to the content model for par (set to 0 or 1)
  • Added Media Fragment URI to normative references
  • Added a new paragraph in "Referencing document fragments"
  • Added new example for comics in "Structural semantics in overlays"
  • Added new entry in changelog

Links:


Preview | Diff

Comment on lines 8319 to 8328
<ul class="nomark">
<li>
<p> [^text^] <code>[exactly 1]</code>
</p>
</li>
<li>
<p> [^audio^] <code>[0 or 1]</code>
</p>
</li>
</ul>
Member

The new content model for par needs to be specified. Is img + audio a separate combination, since relying on TTS for images likely shouldn't be an option, or is img optionally allowed while text remains required, so all three elements can be used together? (But is showing both an image and a text fragment realistic if the viewport is occupied by the FXL image?)

Member

I guess the latter would be a new case of img and audio required and text optional, not img as an optional element in the current model. Probably doesn't make sense to always require text with img.

Member

(Apologies if you're already working on this, but I assumed you were moving on to the RS aspects by opening the draft.)

Member Author

No, I'm not done with anything at this point; I just prefer to open a draft PR as early as possible.

For me, all combinations are valid:

  • img + text (textual alternatives for regions, for example description of a panel)
  • img + audio (audio narration for regions)
  • img + audio + text (this would allow someone to either listen to the pre-recorded audio or, for example, use a Braille tablet to consume the textual content)

Even img on its own could be a valid use case and result in a panel-by-panel navigation for example in comics.

Member Author

I didn't go through the section that you highlighted yet, but IMO text, img or audio would all become 0 or 1 with at least one of them present.

Member Author

@HadrienGardeur HadrienGardeur Feb 6, 2026

> I don't believe that's how smil expects it to work by default, though, if I understand you correctly (that you pick the applicable format to synchronize). I'll preface by saying I'm not the expert on this, but my understanding is that if you specify all three in a single par then all three are expected to be synchronized.

This is definitely meant to synchronize all three media together.

You can open the following Google Slides in presenter mode to see a demo of what this could feel like: https://docs.google.com/presentation/d/1LGHRIN_vHl-H-bgXsHkhqrL0owMCy8qx854x9YU3d8w/edit?usp=sharing

That said, even with Media Overlays today you're free to use what you want:

  • just text
  • just audio
  • or text and audio together

Multiple apps already offer the ability to consume EPUB with Media Overlays like an audiobook with just a player interface on screen.

In the specific case of comics and highly illustrated content, the ideal scenario would be to customize things to your needs:

  • for example, a dyslexic user could enable audio on captions and speech bubbles (either using <audio> or with TTS on the content of <text>), skip descriptions using skippability (this would need a specific role that could be identified), and display text below the image fragment (this is what my example in Google Slides illustrates)
  • a blind user could go full audio (once again using either <audio>, with TTS on the content of <text> or using a textual view with a screen reader) without skipping anything at all
  • a user reading on a small screen could just use these image fragments to read more easily on their device and turn audio on just for speech bubbles

> This is potentially problematic as it means you could validly have only an image listed, but what happens then?

That's the last use case that I've described above, this would give you region by region navigation.

> With text, the reading system is supposed to tts the content before moving to the next par, but if someone only lists an image does it load and unload instantaneously?

Frankly, there's more of a use case for <img> or <audio> on their own than having just <text>.

What's the point of a SMIL with just <text> when you can just use TTS? Skippability and escapability can be achieved without SMIL; the only use case I can think of is to guide you through places in the publication.

Member

> This is definitely meant to synchronize all three media together.

Okay, I'll have to see the new model for rendering the content before I comment any more on this. Having text content synced with a roll image, or an image placed into reflowable text, seems complex to spec out.

> That said, even with Media Overlays today you're free to use what you want:

I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.

> What's the point of a SMIL with just <text> when you can just use TTS?

The only advantage is that you don't have to change playback modes. If the body is professionally narrated you could sync the backmatter for TTS without having to prerecord it.

But that's not the issue. There's still a timing sequence for text if you push the rendering out to TTS. The reading system will present/highlight the text for as long as it takes to render it as speech.

If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?

It's a little weird to use a synchronization markup language to not synchronize anything. An audio-only par element is equally strange from a synchronization standpoint, but at least the timing of the clip gives it a duration to play.

But, these are just my immediate concerns. I can wait until the draft is in a more complete state before commenting any more so we don't add a lot of noise to the pull request.

Member Author

> I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.

I was mostly talking about the UX from an RS perspective; of course the current spec requires text.

Our current spec is really written in a way where we assume that:

  • the main way this will be handled is by displaying text on the screen with audio playback in the background
  • and the use of authored CSS is very much skewed towards FXL

In a reflowable EPUB where users are free to select different themes, using authored CSS for highlighting is a potential accessibility hazard, since you could end up with major contrast issues.

In practice, a user could just listen to an EPUB with Media Overlays without displaying anything on a screen, even if the SMIL includes text and/or img. This would be indistinguishable from an audiobook from a UX perspective.

> If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?

I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud). That's also why the dur attribute exists in SMIL but we don't have it in EPUB.

From a UX perspective, it's important to keep in mind that even with <text> and <audio> the playback can either be:

  • continuous, where you go through the entire publication
  • per page or per spread, where the current page/spread is read and the RS waits for the user to switch to another one before playback continues
  • or handled element by element

With this last one, you have something that makes perfect sense for img on its own. For example:

  • whole page displayed
  • then just the first panel
  • then the second panel
  • then a zoom into part of the second panel to showcase a character and a bubble
  • etc.

Once you sync text or audio to each of these image fragments, then playback can also become continuous or based on page/spread.
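
To make the element-by-element sequence above more concrete, here is a minimal Python sketch of how a reading system might turn image references into regions, assuming the fragments use the spatial dimension (`xywh`) of Media Fragments URI. The file name, the coordinates, and the names `Region`, `parse_region` and `STEPS` are hypothetical and only serve as an illustration.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import urldefrag


class Region(NamedTuple):
    unit: str  # "pixel" (the default) or "percent"
    x: int
    y: int
    width: int
    height: int


# Spatial dimension of a media fragment, e.g. xywh=0,0,800,600 or xywh=percent:50,50,25,25
_XYWH = re.compile(r"^xywh=(?:(pixel|percent):)?(\d+),(\d+),(\d+),(\d+)$")


def parse_region(src: str) -> Optional[Region]:
    """Return the region named by src, or None when src targets the whole image."""
    _, fragment = urldefrag(src)
    for part in fragment.split("&"):
        match = _XYWH.match(part)
        if match:
            unit = match.group(1) or "pixel"
            x, y, width, height = (int(value) for value in match.groups()[1:])
            return Region(unit, x, y, width, height)
    return None


# Hypothetical sequence: whole page, first panel, second panel, zoom on a detail.
STEPS = [
    "page1.jpg",
    "page1.jpg#xywh=0,0,800,600",
    "page1.jpg#xywh=0,600,800,600",
    "page1.jpg#xywh=percent:50,50,25,25",
]

for step in STEPS:
    print(step, "->", parse_region(step))
```

A reference without a fragment yields `None`, which this sketch treats as "display the whole page"; each following step narrows the viewport to one region.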

Member

@mattgarrish mattgarrish Feb 10, 2026

> I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud)

Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present. TTS provides the duration for text content. We took out the section on embedded media last revision and advise people that referencing that kind of content from text will have unpredictable results, but text was never meant to link to content for which no duration could be established.

If you go back to 3.2, before we took that section out, the embedded media that text referred to had to have an audio component that could be played back:

> When a text element references embedded media that contains audio, the audio sibling element is OPTIONAL.

Referring to images was always problematic because getting the text content to pass to TTS wasn't as straightforward as getting the text content of a typical HTML element, but there was still the possibility of using alt.

That's why having img reference images outside of an XHTML wrapper as the only element of a par contradicts all expectations we've ever had for media overlays to synchronize content with audio.

Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable:

  • img + text = duration through TTS
  • img + audio = duration through audio playback
  • img + text + audio = duration through audio playback

But img on its own has no duration and no audio, so why do we even need it? It's like region-based navigation, except that you take away playback control from the user and expect the reading system to meaningfully automate it. If you drop that one case, the internal conflicts with what we have are greatly reduced.
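
The combinations above can be summarized in a small Python sketch. The function name `duration_source` is hypothetical, and the precedence of audio over TTS only reflects the list in this comment, not any normative rule.

```python
from typing import Optional


def duration_source(has_text: bool, has_audio: bool, has_img: bool) -> Optional[str]:
    """Name what gives a par its duration, or None if nothing does."""
    if not (has_text or has_audio or has_img):
        raise ValueError("a par needs at least one child")
    if has_audio:
        return "audio playback"
    if has_text:
        return "TTS"
    return None  # img on its own: nothing establishes a duration


for combo in [(True, False, True), (False, True, True), (True, True, True), (False, False, True)]:
    print(combo, "->", duration_source(*combo))
```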

Member Author

I think that there's a use case for <img> on its own, but as you've correctly pointed out, it can also be implemented using the lesser-known region-based navigation.

Some resources related to this use case in the Kindle ecosystem include:

> Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present.

Quite frankly, it feels very vague, because it's impossible to estimate the duration of a SMIL that's <text> only. Depending on the voice and speed that I use, the duration will be very different.

> Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable

I completely understand your point here and for the sake of maximizing compatibility, I'm willing to focus strictly on two use cases right now:

  • img + text
  • and img + text + audio

With this approach, text (exactly 1) and audio (0 or 1) would keep their current content model. img would become optional (0 or 1), just like audio.
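
As a rough illustration of this content model, here is a minimal Python sketch that counts the children of each par. The function name `check_par_content_model`, the file names and the sample markup are hypothetical, and the `<img>` element and its `src` syntax are only what this PR proposes, not something the current spec defines.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"

# Proposed cardinalities from this comment: (min, max) per child of <par>.
PROPOSED_MODEL = {"text": (1, 1), "audio": (0, 1), "img": (0, 1)}


def check_par_content_model(smil_source: str) -> list[str]:
    """Return a list of problems; the list is empty if every par conforms."""
    root = ET.fromstring(smil_source)
    problems = []
    for index, par in enumerate(root.iter(f"{{{SMIL_NS}}}par"), start=1):
        for name, (low, high) in PROPOSED_MODEL.items():
            count = len(par.findall(f"{{{SMIL_NS}}}{name}"))
            if not low <= count <= high:
                problems.append(
                    f"par #{index}: found {count} <{name}>, expected {low} to {high}"
                )
    return problems


# Hypothetical overlay: the second par has img on its own, which this model rejects.
SAMPLE = """
<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par>
      <text src="script.xhtml#panel1"/>
      <img src="page1.jpg#xywh=0,0,800,600"/>
      <audio src="page1.mp3" clipBegin="0s" clipEnd="4s"/>
    </par>
    <par>
      <img src="page1.jpg#xywh=0,600,800,600"/>
    </par>
  </body>
</smil>
"""

print(check_par_content_model(SAMPLE))
```

Relaxing the model later (for example to allow img + audio) would only mean changing the bounds in `PROPOSED_MODEL` and adding an "at least one child" check.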

I think that this somewhat raises the bar for what we require from content creators, but given the focus on accessibility and specialized libraries, it's a trade-off that we can work with.

If this works out, we could always relax our approach in a future revision to allow img + audio or img on its own.

@HadrienGardeur
Member Author

HadrienGardeur commented Feb 6, 2026

I'd like to create an example for this PR as well. This will take a bit of work but I'd like to convert what I created for Readium that's currently available at: https://github.com/readium/guided-navigation/tree/main/examples/comics

This is a CC-licensed comic so it's a pretty good example to work with.

Here's what I have in mind:

@iherman
Member

iherman commented Feb 6, 2026

> I'd like to create an example for this PR as well. This will take a bit of work but I'd like to convert what I created for Readium that's currently available at:

Alternatively, or in addition, it would be great to have this example added to the test suite. I am happy to help convert it into a bona fide test when the time comes (there are some metadata requirements).

@iherman iherman added the Spec-EPUB3 (The issue affects the core EPUB 3.X Recommendation) and Spec-ReadingSystems (The issue affects the EPUB Reading Systems 3.X Recommendation) labels Feb 6, 2026
@HadrienGardeur
Member Author

@iherman I'd like to create a full example (an entire chapter) but it could be easily shortened to a page or two for the test suite.

@HadrienGardeur
Member Author

HadrienGardeur commented Feb 7, 2026

Here's the WIP for this example: https://github.com/HadrienGardeur/accessible-epub-comics

[UPDATE]: @iherman I'm done with the first page so there's probably enough content for a test file now.

@iherman
Member

iherman commented Feb 9, 2026

> [UPDATE]: @iherman I'm done with the first page so there's probably enough content for a test file now.

Thanks. I will look at this at some point, but I would prefer to wait until this PR reaches consensus, i.e., gets merged, before doing this.

@HadrienGardeur
Member Author

HadrienGardeur commented Feb 9, 2026

I'm done with a first version of a full publication with <img> in SMIL.

Of course, epubcheck is unhappy with this example:

  • it doesn't like that an image in the spine has a Media Overlay
  • it doesn't like that I'm using <img> in SMIL
  • it doesn't like that I'm using <seq epub:type="panel"> to group <par> for each panel without using epub:textref
  • and it doesn't like seeing multiple references to script.xhtml in different SMIL files using <text>

By the way, this example also illustrates how images in spine + Media Overlays can be more accessible for comics than just wrapping up images in XHTML.

@iherman
Member

iherman commented Feb 10, 2026

> I'm done with a first version of a full publication with <img> in SMIL.
>
> Of course, epubcheck is unhappy with this example:
>
>   • it doesn't like that an image in the spine has a Media Overlay
>   • it doesn't like that I'm using <img> in SMIL
>   • it doesn't like that I'm using <seq epub:type="panel"> to group <par> for each panel without using epub:textref
>   • and it doesn't like seeing multiple references to script.xhtml in different SMIL files using <text>
>
> By the way, this example also illustrates how images in spine + Media Overlays can be more accessible for comics than just wrapping up images in XHTML.

I can understand the first two entries, obviously. The third sounds like something that must be specified in the spec if we go ahead with images. But the fourth entry is weird. Is it an epubcheck error?

cc @mattgarrish @rdeltour

@rdeltour
Member

> But the fourth entry is weird. Is it an epubcheck error?

I suppose it comes from Media overlay document requirements (section 9.3.2.1 in the current draft) stating

> more than one media overlay document MUST NOT reference the same EPUB content document.

@rdeltour
Member

For clarification, EPUBCheck is likely wrong here. script.xhtml is not referenced from the spine, so it may not be considered a content document, and the statement above would not apply.

But from memory, we do not verify that a document is referenced from the spine before applying checks, so basically any XHTML document found in the container is considered an XHTML Content Document by EPUBCheck, and constraints of content documents are applied.
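
To illustrate the difference, here is a minimal Python sketch of the check with and without the spine filter. This is not EPUBCheck's actual code: the function name `find_shared_content_documents`, the data structures and the file names are hypothetical, and it assumes that only documents referenced from the spine count as content documents.

```python
from collections import defaultdict


def find_shared_content_documents(
    overlay_refs: dict[str, set[str]], spine: set[str]
) -> dict[str, list[str]]:
    """Return spine documents referenced by more than one media overlay document."""
    referenced_by = defaultdict(list)
    for overlay, documents in sorted(overlay_refs.items()):
        for document in sorted(documents):
            # Assumption: out-of-spine documents are not content documents.
            if document in spine:
                referenced_by[document].append(overlay)
    return {doc: overlays for doc, overlays in referenced_by.items() if len(overlays) > 1}


# Hypothetical publication: two overlays point at the same out-of-spine script.
overlay_refs = {
    "page1.smil": {"script.xhtml"},
    "page2.smil": {"script.xhtml"},
}

# With the spine filter, the shared script is not reported.
print(find_shared_content_documents(overlay_refs, spine={"page1.jpg", "page2.jpg"}))

# Treating every XHTML file as a content document reproduces the reported error.
print(find_shared_content_documents(overlay_refs, spine={"script.xhtml"}))
```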

@mattgarrish
Member

Ya, this is a pretty radical departure from current media overlays where the xhtml content document is the driver. You can have multiple content documents that refer to the same media overlay, but a single content document can't be referred to from multiple media overlay documents because you can only specify one in the media-overlay attribute. (It could become a real headache to figure out when media overlays are valid moving forward.)

I'm assuming the media overlays section will need a pretty radical rewrite to take focus off it being largely bound to xhtml with audio sync capabilities. I'm not even sure how syncing images and content documents works, beyond even the display in the viewport issue. It presumably makes the text content a top-level content document which will require these text documents to be in the spine, but how does that work with rolls?

I was also thinking media overlays may not belong under aural rendering since it sounds like audio sync may not be a requirement anymore. You might have to pull it back out and maybe make aural rendering informative with a trimmed down explanation of how media overlays work for audio sync.

(But I'm trying to focus on getting the accessibility metadata guide wrapped up, as we need to be able to reference it from the techniques document, so I haven't had the time to keep up.)

@iherman
Member

iherman commented Feb 10, 2026

Hm. I am feeling more and more uncomfortable with the amount of change triggered by the introduction of the <img> element at this point. Honestly, I thought it would be a simple change, and I was obviously wrong (and Matt, who warned us on the call, was right...).

I wonder whether it is indeed a good idea to do this at this point in the game. I would prefer to re-discuss it on the call to be sure we are still o.k. with this.

Sorry @HadrienGardeur

cc @w3c/w3c-group-145018-members

@HadrienGardeur
Member Author

HadrienGardeur commented Feb 10, 2026

Some additional thoughts:

  • I've used a single out-of-spine HTML for the whole script because this felt like the right thing to do, but I could write a script per page instead
  • one limitation that I can think of is that in some cases, you need to go back and forth between the document positioned on the left of the spread and the one on the right of the spread (that's an issue with our current approach for Media Overlays that extends beyond this PR)
  • we could keep the restriction of having a single text element, but this would require a textual alternative/script for such files, which might be fine
  • IMO, we should allow SMIL on images in the spine, I don't see a good reason to limit this to XHTML/SVG with the addition of <img>
  • I could drop the <seq> elements with no epub:textref but we would lose the possibility of navigating panel by panel (which is quite useful)

As you can see, I can potentially work around some of these epubcheck issues at the cost of some features.

@HadrienGardeur
Member Author

HadrienGardeur commented Feb 10, 2026

> I was also thinking media overlays may not belong under aural rendering since it sounds like audio sync may not be a requirement anymore.

That's a different matter, unrelated to this PR.

With our current model, we only require <text> while <audio> is optional, which means that labeling the whole section "Aural rendering" feels like a misnomer.

If we also allow 0 or 1 <img> element, this won't change the fact that you could have a SMIL with just <par> and <text>.

@HadrienGardeur
Member Author

HadrienGardeur commented Feb 10, 2026

Working on this example, seeing the current limitations with epubcheck and having these discussions all feel like a very fruitful exercise to me.

Based on my recent comment (#2919 (comment)), I'll update my example to use a script per page instead of a global one.

I'm not changing yet how I use <seq> because it's worth discussing IMO, but if it's too much of a hassle, it's easy enough to just flatten the structure.

For the other epubcheck errors, I think that they should be amended eventually:

  • having Media Overlays on images is IMO a good thing for compatibility: an RS that can't work with images in spine will fall back to the HTML variant and won't be exposed to <img> in SMIL
  • errors related to <img> would naturally vanish if we update epubcheck to match the spec

[UPDATE]: and that's done. As expected I'm still receiving the three errors pointed out above, but I could easily get rid of the one related to <seq>.
[UPDATE 2]: Instead of flattening the structure of the SMIL, I've added IDs for each panel in the script of each page and added references to them using epub:textref. We're down to two errors in epubcheck.

@HadrienGardeur
Member Author

@iherman give me another week to continue working on this before we discuss it in a call again.

I'm done with the example for now, which means that I can go back to the PR.

@HadrienGardeur
Member Author

Just a heads up to say that we'll start implementing this feature in the Readium Swift toolkit next week:

  • we don't work with SMIL as-is, it gets converted to an internal model where we already support images and videos
  • which means that parsing <img> in SMIL is very easy for us
  • we'll use my example to support img + audio on images in spine
  • I'll most likely create a variant that drops audio references to also test our fallback to TTS using text
  • for now, we won't use image fragments, this will come later this year
  • we expect to have this ready by the end of March and available in our beta for Thorium Mobile
  • I'll most likely do a demo at an accessibility conference organized in Oslo in June

In terms of UX, this initial support will offer two options for users:

  • when you decide to "read", it will display the full page with audio playback in the background (either continuous or page by page)
  • or when you decide to "listen", it will provide an audiobook-like experience

For screen reader users, we might also open the script instead of images but I'm not 100% sure about this one yet.

Labels

Spec-EPUB3: The issue affects the core EPUB 3.X Recommendation
Spec-ReadingSystems: The issue affects the EPUB Reading Systems 3.X Recommendation

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Support for <img> in SMIL

4 participants