In the Media contentType there is a ##timing or ##phrases section where you can put subtitles. Strictly speaking, it is a contentSubType.
In subtitles you can also specify a <Voice />, that is, who is speaking a particular phrase, the WebVTT way, like: <v Rob>.
In the same way, add another subType, avatars/voices, where you can specify the speakers' avatars in a YAML-like format:
## voices
Rob: https://facebook.com/rob3913/images/photo.jpg
Neil: images/avatars/Neil.png
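
To make the proposal concrete, here is a minimal sketch of how such a section could be read. It assumes the section is introduced by a `## voices` header and holds one `Name: avatar` pair per line, as in the example above; the function name `parse_voices` is hypothetical and not part of any existing code.

```python
def parse_voices(text: str) -> dict[str, str]:
    """Collect 'Name: avatar' pairs from a '## voices' section (hypothetical helper)."""
    voices: dict[str, str] = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("##"):
            # A new section header: we are inside only if it is 'voices'.
            in_section = line.lstrip("#").strip().lower() == "voices"
            continue
        if not in_section or not line:
            continue
        # Split on the first colon only, so URLs such as https://... stay intact.
        name, sep, avatar = line.partition(":")
        if sep and name.strip() and avatar.strip():
            voices[name.strip()] = avatar.strip()
    return voices


sample = """## voices
Rob: https://facebook.com/rob3913/images/photo.jpg
Neil: images/avatars/Neil.png
"""
print(parse_voices(sample))
```

The resulting mapping could then be keyed by the name in a `<v Rob>` tag to find the avatar for each phrase. Both absolute URLs and relative paths are kept as plain strings; resolving relative paths is left open here.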