Image captioning is a technique that combines LSTM-based text generation with the computer vision capabilities of a convolutional neural network (CNN). In this part, I used an LSTM and a CNN to create an image captioning system, applying transfer learning with two pretrained resources:
• InceptionV3
• GloVe embeddings
I used InceptionV3 to extract features from the images and GloVe embeddings as a set of pretrained Natural Language Processing (NLP) vectors for common words.
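As a concrete illustration, below is a minimal sketch of extracting the 2,048-dimensional InceptionV3 features for a single image with Keras. The example image path is hypothetical; only the standard Keras InceptionV3 API is assumed.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

# Load InceptionV3 pretrained on ImageNet and drop the classifier layer,
# keeping the 2,048-dimensional pooled feature vector underneath it.
base = InceptionV3(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(img_path):
    # InceptionV3 expects 299x299 RGB input.
    img = image.load_img(img_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return encoder.predict(x).flatten()  # shape: (2048,)

features = extract_features("images/example.jpg")  # hypothetical path
```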
The SemArt dataset can be downloaded using the following link. You will need to place all the images in a folder named 'images' in the same directory.
You also need to create a 'data' folder for storing the extracted features of the train and test images of the SemArt dataset. Alternatively, you can download these precomputed features using the links below (a sketch of how the features can be cached follows the links):
data/train
data/test
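Below is a minimal sketch of how the extracted features might be cached in the 'data' folder with pickle. The file names 'data/train.pkl' and 'data/test.pkl' and the feature dictionaries are assumptions for illustration, not the exact files linked above.

```python
import os
import pickle

def save_features(features_dict, path):
    # features_dict maps image filename -> 2,048-dim InceptionV3 feature vector
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(features_dict, f)

def load_features(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g. save_features(train_features, "data/train.pkl")  # hypothetical file names
#      save_features(test_features,  "data/test.pkl")
```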
The GloVe embeddings file can be downloaded from here.
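For reference, here is a minimal sketch of parsing the downloaded GloVe file into a word-to-vector dictionary. The filename 'glove.6B.200d.txt' is an assumption and should match whichever GloVe file you actually download.

```python
import numpy as np

def load_glove(path):
    # Each line of the GloVe file is: word v1 v2 ... vN
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            vector = np.asarray(parts[1:], dtype="float32")
            embeddings[word] = vector
    return embeddings

glove = load_glove("glove.6B.200d.txt")  # hypothetical path
print(glove["painting"].shape)           # e.g. (200,)
```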
In this example, I used GloVe for the text embeddings and InceptionV3 to extract features from the images. Both of these pretrained models serve to extract features, one from the raw text and one from the images. InceptionV3 provides 2,048 features below the classifier, whereas MobileNet provides over 50K. If the additional dimensions truly capture aspects of the images, they are worthwhile; however, 50K features increase both the processing required and the complexity of the neural network being constructed. The GloVe embeddings were then passed into an LSTM, after which the image and text features were combined and sent to a decoder network to generate the next word.
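To make this architecture concrete, here is a minimal Keras sketch of a merge-style captioning model of the kind described above. The vocabulary size, embedding dimension, and maximum caption length are placeholder assumptions, and in practice the Embedding layer's weights would be initialised from the GloVe matrix.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000      # assumed vocabulary size
embedding_dim = 200    # matches the GloVe vector dimension used
max_length = 34        # assumed maximum caption length

# Image branch: the 2,048-dim InceptionV3 features are projected to 256 dims.
img_input = Input(shape=(2048,))
img_dense = Dense(256, activation="relu")(Dropout(0.5)(img_input))

# Text branch: word embeddings (initialised from GloVe in practice) feed an LSTM encoder.
txt_input = Input(shape=(max_length,))
txt_embed = Embedding(vocab_size, embedding_dim, mask_zero=True)(txt_input)
txt_lstm = LSTM(256)(Dropout(0.5)(txt_embed))

# Decoder: the image and text feature vectors are combined to predict the next word.
merged = add([img_dense, txt_lstm])
decoder = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[img_input, txt_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```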