Neural Image Caption Generator

Project overview 📃

Given an image, generate a suitable caption describing it, using a CNN and LSTM RNN cells. The project was part of an academic course, CS 535 Deep Learning, and was done in a team of three.

January 2020 – March 2020

Tech 👨🏻‍💻

Python, NumPy, pandas, PyTorch, scikit-learn, Matplotlib

📕 Source code:

GitHub

🗯️ The Problem

Given an image, the model should generate a suitable caption describing the content and actions in the image. Several approaches to this problem have been developed in parallel; this project tackles it with a combined CNN and RNN model.

🎯 Architecture

The architecture is inspired by the implementation described in this paper, which combines a pre-trained CNN with several LSTM RNN cells. In addition, the project implemented captioning with an attention mechanism.
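
To make the CNN-plus-LSTM idea concrete, here is a minimal PyTorch sketch of a decoder that conditions on an image feature vector. It is an illustration under stated assumptions, not the project's actual code: the class name `CaptionDecoder`, all layer sizes, and the 2048-dimensional feature size are hypothetical, the image features are random stand-ins for what a pre-trained CNN would produce, and the attention variant is not shown.

```python
import torch
import torch.nn as nn


class CaptionDecoder(nn.Module):
    """LSTM decoder conditioned on a CNN feature vector (hypothetical sketch)."""

    def __init__(self, feature_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.project = nn.Linear(feature_dim, embed_dim)  # image features -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word ids -> embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)  # hidden state -> word scores

    def forward(self, features, captions):
        # features: (batch, feature_dim); captions: (batch, seq_len) of token ids
        image_step = self.project(features).unsqueeze(1)     # (batch, 1, embed_dim)
        word_steps = self.embed(captions[:, :-1])             # (batch, seq_len - 1, embed_dim)
        inputs = torch.cat([image_step, word_steps], dim=1)   # image first, then the words
        hidden, _ = self.lstm(inputs)                         # (batch, seq_len, hidden_dim)
        return self.to_vocab(hidden)                          # (batch, seq_len, vocab_size)


torch.manual_seed(0)
vocab_size = 1000  # assumed vocabulary size, for illustration only
decoder = CaptionDecoder(feature_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=vocab_size)

features = torch.randn(4, 2048)                   # stand-in for pooled CNN features
captions = torch.randint(0, vocab_size, (4, 12))  # stand-in for tokenized captions

logits = decoder(features, captions)
# The output at step t is trained to predict caption token t.
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), captions.reshape(-1))
print(logits.shape, loss.item())
```

Feeding the projected image features as the first LSTM input is one common way to wire the two halves together; other designs initialize the LSTM hidden state from the image instead.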

Process:

The following steps were performed over the course of the project:

The project was split into two main parts: obtaining a pre-trained CNN, and connecting the CNN's output to RNN LSTM cells. This is similar to an encoder-decoder network in which the pre-trained CNN acts as the encoder and the LSTM cells act as the decoder.
The project was implemented on three datasets: Flickr8k, Flickr30k and MSCOCO.
We used a pre-trained ResNet-based classifier network for transfer learning. This network extracts features from the input image, which are then fed to the RNN LSTM cells to predict the caption word by word.
The model was trained entirely with the PyTorch library for Python; the Natural Language Toolkit (NLTK) was also used for wrangling the caption data.
The model was trained for roughly 12-14 hours on a 2 GB Nvidia graphics card, which amounted to 3-4 epochs in total on the MSCOCO dataset. Training reached a perplexity in the range of 5-10, which could likely be improved by training for more epochs.
The average BLEU score recorded was 46.
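
The two metrics mentioned in the steps above can be sketched in plain Python. This is a simplified illustration, not the evaluation code used in the project: the function names are hypothetical, `bleu` handles a single reference with no smoothing, and it returns a value in [0, 1] (scores are often reported multiplied by 100). Perplexity is the exponential of the mean per-token cross-entropy, so a perplexity of 5-10 corresponds to a loss of about 1.6-2.3 nats.

```python
import math
from collections import Counter


def perplexity(mean_cross_entropy):
    """Perplexity from a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy)


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed sentence-level BLEU in [0, 1]."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped overlap: a candidate n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty n-gram match gives a score of zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "a dog runs across the grass".split()
candidate = "a dog runs on the grass".split()

print(perplexity(math.log(5)))                # loss of ln(5) nats
print(bleu(candidate, reference, max_n=2))    # unigram and bigram precision only
print(bleu(candidate, reference))             # no shared 4-gram, so unsmoothed BLEU-4 is zero
```

For real evaluation, a maintained implementation with multiple references and smoothing (for example, the one in NLTK) is a safer choice than this sketch.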

Some results during the training of the model:

Figure: Image Captioning Model

Figure: Training and validation loss vs. timesteps

Figure: Perplexity vs. timesteps

Figure: Encoder-decoder architecture

Made with 🖤 by Gulshan