Conversation
The current content model for par:

- text [exactly 1]
- audio [0 or 1]
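As a rough illustration of that current model, a minimal par pairs a required text reference with an optional audio clip. This is a sketch only; the file names and fragment identifiers are invented:

```xml
<par id="sentence1">
  <!-- text is required: points into an XHTML Content Document -->
  <text src="chapter_001.xhtml#c01s01"/>
  <!-- audio is optional: a clip of pre-recorded narration -->
  <audio src="audio/chapter_001.mp3" clipBegin="0:00:00.000" clipEnd="0:00:03.500"/>
</par>
```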
---
The new content model for par needs to be specified. Is img + audio a separate combination, since relying on tts for images likely shouldn't be an option, or is img optionally allowed and text remains required so all three elements can be used together? (But is showing both an image and a text fragment realistic if the viewport is occupied by the fxl image?)
---
I guess the latter would be a new case of img and audio required and text optional, not img as an optional element in the current model. Probably doesn't make sense to always require text with img.
---
(Apologies if you're already working on this, but I assumed you were moving on to the RS aspects by opening the draft.)
---
No I'm not done with anything at this point, I just prefer to open a draft PR as early as possible.
For me, all combinations are valid:
- img+text (textual alternatives for regions, for example a description of a panel)
- img+audio (audio narration for regions)
- img+audio+text (this would allow someone to either listen to the pre-recorded audio or, for example, use a Braille tablet consuming the textual content)
Even img on its own could be a valid use case and result in a panel-by-panel navigation for example in comics.
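To make these combinations concrete, here is a hedged sketch of what such pars could look like. The img element and its attributes are exactly what this PR is proposing (not yet specified), and the #xywh spatial media fragment is an assumption borrowed from the region-based style used in Readium's guided navigation:

```xml
<!-- Sketch only: the img element is the proposal under discussion,
     and the #xywh spatial fragment is an assumption. -->

<!-- img + audio: audio narration for a region of a page -->
<par>
  <img src="images/page_001.jpg#xywh=0,0,600,400"/>
  <audio src="audio/page_001.mp3" clipBegin="0:00:04.000" clipEnd="0:00:09.000"/>
</par>

<!-- img on its own: one step of panel-by-panel navigation -->
<par>
  <img src="images/page_001.jpg#xywh=600,0,600,400"/>
</par>
```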
---
I didn't go through the section that you highlighted yet, but IMO text, img or audio would all become 0 or 1 with at least one of them present.
---
I don't believe that's how SMIL expects it to work by default, though, if I understand you correctly (that you pick the applicable format to synchronize). I'll preface by saying I'm not the expert on this, but my understanding is that if you specify all three in a single par then all three are expected to be synchronized.
This is definitely meant to synchronize all three media together.
You can open the following Google Slides in presenter mode to see a demo of what this could feel like: https://docs.google.com/presentation/d/1LGHRIN_vHl-H-bgXsHkhqrL0owMCy8qx854x9YU3d8w/edit?usp=sharing
That said, even with Media Overlays today you're free to use what you want:
- just text
- just audio
- or text and audio together
Multiple apps already offer the ability to consume EPUB with Media Overlays like an audiobook with just a player interface on screen.
In the specific case of comics and highly illustrated content, the ideal scenario would be to customize things to your needs:
- for example, a dyslexic user could enable audio on captions and speech bubbles (either using <audio> or with TTS on the content of <text>) but skip descriptions using skippability (this would need a specific role that could be identified) while displaying text below the image fragment (this is what my example in Google Slides illustrates)
- a blind user could go full audio (once again using either <audio>, TTS on the content of <text>, or a textual view with a screen reader) without skipping anything at all
- a user reading on a small screen could just use these image fragments to read more easily on their device and turn audio on just for speech bubbles
This is potentially problematic as it means you could validly have only an image listed, but what happens then?
That's the last use case that I've described above, this would give you region by region navigation.
With text, the reading system is supposed to tts the content before moving to the next par, but if someone only lists an image does it load and unload instantaneously?
Frankly, there's more of a use case for <img> or <audio> on their own than having just <text>.
What's the point of a SMIL with just <text> when you can just use TTS? Skippability and escapability can be achieved without SMIL; the only use case I can think of is to guide you through places in the publication.
---
This is definitely meant to synchronize all three media together.
Okay, I'll have to see the new model for rendering the content before I comment any more on this. Having text content synced with a roll image, or an image placed into reflowable text, seems complex to spec out.
That said, even with Media Overlays today you're free to use what you want:
I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.
What's the point of a SMIL with just <text> when you can just use TTS?
The only advantage is that you don't have to change playback modes. If the body is professionally narrated you could sync the backmatter for TTS without having to prerecord it.
But that's not the issue. There's still a timing sequence for text if you push the rendering out to TTS. The reading system will present/highlight the text for as long as it takes to render it as speech.
If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?
It's a little weird to use a synchronization markup language to not synchronize anything. An audio-only par element is equally strange from a synchronization standpoint, but at least the timing of the clip gives it a duration to play.
But, these are just my immediate concerns. I can wait until the draft is in a more complete state before commenting any more so we don't add a lot of noise to the pull request.
---
I guess if you're okay with your media overlay being invalid. The content model requires a text element in all cases. There are hacks around how much text you have to provide, but you can't produce audio-only content without at least one element to synchronize with.
I was mostly talking about the UX from a RS perspective, of course the current spec requires text.
Our current spec is really written in a way where we assume that:
- the main way this will be handled is by displaying text on the screen with audio playback in the background
- and the use of authored CSS is very much skewed towards FXL
In a reflowable EPUB where users are free to select different themes, using authored CSS for highlighting is a potential accessibility hazard, since you could end up with major contrast issues.
In practice, a user could just listen to an EPUB with Media Overlays without displaying anything on a screen, even if the SMIL includes text and/or img. This would be indistinguishable from an audiobook from a UX perspective.
If all you have is an image, there is no synchronization and there is no timing information. So how does the author state how long the user should see the image? Will reading systems just assign some common amount of time per image regardless of their complexity?
I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud). That's also why the dur attribute exists in SMIL but we don't have it in EPUB.
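For reference, full W3C SMIL lets an author pin an explicit duration on a media element with dur, which is precisely the attribute the EPUB profile leaves out. A sketch of what that looks like in SMIL proper (this is not valid in EPUB Media Overlays):

```xml
<!-- W3C SMIL (not the EPUB profile): dur gives the text reference
     a fixed presentation time, independent of any TTS rendering -->
<par>
  <text src="chapter_001.xhtml#para05" dur="5s"/>
</par>
```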
From a UX perspective, it's important to keep in mind that even with <text> and <audio> the playback can either be:
- continuous, where you go through the entire publication
- per page or per spread, where the current page/spread is read and the RS waits for the user to switch to another one before playback continues
- or handled element by element
With this last one, you have something that makes perfect sense for img on its own. For example:
- whole page displayed
- then just the first panel
- then the second panel
- then a zoom into part of the second panel to showcase a character and a bubble
- etc.
Once you sync text or audio to each of these image fragments, then playback can also become continuous or based on page/spread.
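The panel-by-panel steps above could be sketched as a sequence of image-only pars. Again, the img element and the #xywh fragments are assumptions of this proposal, and the coordinates are invented:

```xml
<seq>
  <!-- whole page displayed -->
  <par><img src="images/page_001.jpg"/></par>
  <!-- just the first panel -->
  <par><img src="images/page_001.jpg#xywh=0,0,600,400"/></par>
  <!-- then the second panel -->
  <par><img src="images/page_001.jpg#xywh=600,0,600,400"/></par>
  <!-- zoom into part of the second panel (character and bubble) -->
  <par><img src="images/page_001.jpg#xywh=750,80,300,200"/></par>
</seq>
```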
---
I don't think that this is any different from <text> which doesn't have any inherent duration (it only has one when TTS is used to read the text aloud)
Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present. TTS provides the duration for text content. We took out the section on embedded media in the last revision and advise people that referencing that kind of content from text will have unpredictable results, but text was never meant to link to content for which no duration could be established.
If you go back to 3.2, before we took that section out, the embedded media that text referred to had to have an audio component that could be played back:
When a text element references embedded media that contains audio, the audio sibling element is OPTIONAL.
Referring to images was always problematic because the text content to TTS wasn't as straightforward as getting the text content of a typical html element, but there was still the possibility of using alt.
That's why having img reference images outside of an xhtml wrapper as the only element of a par contradicts all expectations we've ever had for media overlays to synchronize content with audio.
Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable:
- img+text = duration through TTS
- img+audio = duration through audio playback
- img+text+audio = duration through audio playback

But img on its own has no duration and no audio, so why do we even need it? It's like region-based nav, but with playback control taken away from the user and the reading system expected to meaningfully automate it. If you drop that one case, the internal conflicts with what we have are greatly reduced.
---
I think that there's a use case for <img> on its own, but as you've correctly pointed out it can also be implemented using the lesser known region-based navigation.
Some resources related to this use case in the Kindle ecosystem include:
- https://kdp.amazon.com/en_US/help/topic/G9GSTY4LTRT39D4Z
- https://kdp.amazon.com/en_US/help/topic/GJMRD9F78MS9F43R
Sure, and that's also why it was always required that text reference content with a possible duration when audio isn't present.
Quite frankly, it feels very vague, because it's impossible to estimate the duration of a SMIL that's <text> only. Depending on the voice and speed that I use, the duration will be very different.
Media overlays has never been about providing a non-auditory experience. The concern I have is not about the other content possibilities, which could all be viable
I completely understand your point here and for the sake of maximizing compatibility, I'm willing to focus strictly on two use cases right now:
- img+text
- img+text+audio
With this approach, text (1 exactly) and audio (0 or 1) would keep their current content model. img would become optional (0 or 1) just like audio.
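Under that compromise model, a fully populated par would look something like the sketch below. The img element and the #xywh fragment are still assumptions of this proposal; only text (required) and audio (optional) exist in the current spec:

```xml
<!-- Compromise model: text [exactly 1], audio [0 or 1], img [0 or 1] -->
<par>
  <text src="chapter_001.xhtml#panel01-desc"/>
  <img src="images/page_001.jpg#xywh=0,0,600,400"/>
  <audio src="audio/page_001.mp3" clipBegin="0:00:00.000" clipEnd="0:00:06.000"/>
</par>
```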
I think that this somewhat raises the bar for what we require from content creators, but given the focus on accessibility and specialized libraries, it's a trade-off that we can work with.
If this works out, we could always relax our approach in a future revision to allow img + audio or img on its own.
---
I'd like to create an example for this PR as well. This will take a bit of work but I'd like to convert what I created for Readium that's currently available at: https://github.com/readium/guided-navigation/tree/main/examples/comics This is a CC-licensed comic so it's a pretty good example to work with. Here's what I have in mind:
---
Alternatively, or in addition, it would be great to have this example added to the test suite. I am happy to help convert it into a bona fide test when the time comes (there are some metadata requirements).
---
@iherman I'd like to create a full example (an entire chapter) but it could easily be shortened to a page or two for the test suite.
---
Here's the WIP for this example: https://github.com/HadrienGardeur/accessible-epub-comics
[UPDATE]: @iherman I'm done with the first page so there's probably enough content for a test file now.
---
Thanks. I will look at this at some point, but I would prefer to wait until this PR indeed gets consensus, i.e., gets merged, before doing this.
---
I'm done with a first version of a full publication. Of course, epubcheck is unhappy with this example:
By the way, this example also illustrates how images in the spine + Media Overlays can be more accessible for comics than just wrapping images in XHTML.
I can understand the first two entries, obviously. The third sounds like something that must be specified in the spec if we go ahead with images. But the fourth entry is weird. Is it an epubcheck error?
---
I suppose it comes from Media overlay document requirements (section 9.3.2.1 in the current draft) stating
---
For clarification, EPUBCheck is likely wrong here. But from memory, we do not verify that a document is referenced from the spine before applying checks, so basically any XHTML document found in the container is considered an XHTML Content Document by EPUBCheck, and the constraints of Content Documents are applied.
---
Ya, this is a pretty radical departure from current media overlays where the xhtml content document is the driver. You can have multiple content documents that refer to the same media overlay, but a single content document can't be referred to from multiple media overlay documents because you can only specify one in the media-overlay attribute. (It could become a real headache to figure out when media overlays are valid moving forward.)

I'm assuming the media overlays section will need a pretty radical rewrite to take focus off it being largely bound to xhtml with audio sync capabilities. I'm not even sure how syncing images and content documents works, beyond even the display in the viewport issue. It presumably makes the text content a top-level content document which will require these text documents to be in the spine, but how does that work with rolls?

I was also thinking media overlays may not belong under aural rendering since it sounds like audio sync may not be a requirement anymore. You might have to pull it back out and maybe make aural rendering informative with a trimmed down explanation of how media overlays work for audio sync.

(But I'm trying to focus on getting the accessibility metadata guide wrapped up, as we need to be able to reference it from the techniques document, so I haven't had the time to keep up.)
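The one-overlay-per-document constraint mentioned above lives in the package document manifest: each Content Document's item carries a single media-overlay attribute, while one overlay can be shared by several documents. A sketch with invented file names and IDs:

```xml
<!-- Package document manifest: each XHTML item points to at most one
     Media Overlay via media-overlay, but two items may share the same
     overlay. File names and IDs are invented for the example. -->
<manifest>
  <item id="c01" href="chapter_001.xhtml"
        media-type="application/xhtml+xml" media-overlay="mo01"/>
  <item id="c02" href="chapter_002.xhtml"
        media-type="application/xhtml+xml" media-overlay="mo01"/>
  <item id="mo01" href="overlays/part_001.smil"
        media-type="application/smil+xml"/>
</manifest>
```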
---
Hm. I am feeling more and more uncomfortable with the amount of change triggered by the introduction of the img element. I wonder whether it is indeed a good idea to do this at this point in the game. I would prefer to re-discuss it on the call to be sure we are still o.k. with this. Sorry @HadrienGardeur

cc @w3c/w3c-group-145018-members
---
Some additional thoughts:
As you can see, I can potentially work around some of these epubcheck issues at the cost of some features.
---
That's a different matter, unrelated to this PR. With our current model, we only require If we also require 0 or 1
---
Working on this example, seeing the current limitations with epubcheck and having these discussions all feel like a very fruitful exercise to me. Based on my recent comment (#2919 (comment)), I'll update my example to use a script per page instead of a global one.
For the other epubcheck errors, I think that they should be amended eventually:
[UPDATE]: and that's done. As expected I'm still receiving the three errors pointed out above, but I could easily get rid of the one related to
---
@iherman give me another week to continue working on this before we discuss it in a call again. I'm done with the example for now, which means that I can go back to the PR.
---
Just a heads up to say that we'll start implementing this feature in the Readium Swift toolkit next week:
In terms of UX, this initial support will offer two options for users:
For screen reader users, we might also open the script instead of images but I'm not 100% sure about this one yet.
---
This PR closes #2883.
It's currently a draft PR and not ready to be reviewed yet.
Changes:
- new img element
- img added to the content model for par (set to 0 or 1)

Links:
Preview | Diff