The paper mentions that "A lightweight text decoder is trained jointly to generate new captions, further enriching the training signal." I'm wondering whether the text decoders are also included in the open-sourced models, since they could be of independent interest (e.g., for image captioning). It would be great if you could also provide a demo showing how to use the decoders alongside the released weights. Thanks!