Combining a CNN with an LSTM to produce an image captioning system

Since when can computers translate an image into natural language?

The time has come. The time when I can finally see, if I close my eyes, data flowing through an LSTM in my head. What's more, if I close my eyes, I can see it flowing through the entire architecture in Figure A. Worrying? Most likely. Satisfying? Very much so.

A big thanks to Udacity for doing what they do. This project is from their Computer Vision Nanodegree, which I'm enjoying every minute of. The project is in this repo:


The goal is to design an automated system that generates natural language captions for images.

Previous thoughts

CNNs generate low-level representations of images. It makes sense to think that RNNs (LSTMs in this case) can work with those representations, treating them like any other input.

Solving the problem

This paper demonstrates great success in combining an EncoderCNN (ResNet, with pretrained parameters) with a DecoderRNN (LSTM) to solve the problem of image captioning. The implementation is inspired by the findings outlined in the paper.

Furthermore, the hyper-parameters employed throughout the implementation are largely those recommended by the paper. I have not experimented much, since computing resources and time have been scarce: training the network takes a long time and is expensive.

At a very high level, the implementation consists of sending the low-level output of the CNN into the LSTM (together with other inputs). This way, the LSTM propagates forward with ‘knowledge’ of the image.

Files

One file contains the implementation of EncoderCNN and DecoderRNN, another contains the definition of the Vocabulary class, and a third contains the definition of the CoCoDataset class and the get_loader() method.

0_Dataset.ipynb contains the initial frolicking with the data.

1_Preliminaries.ipynb contains an initial run through the data and model configuration.

2_Training.ipynb contains the training code and results.

3_Inference.ipynb contains the inference code.


I have employed the COCO Dataset for this project and used its API to download the data. The data is structured as follows:

For every datapoint:

  • 1 image
  • 5 corresponding captions

Naturally, training is all about feeding an image into the CNN, having the LSTM output words, calculating the loss with respect to one of the 5 corresponding captions, and then tweaking the weights.

For an NN to understand a word, the word needs to be tokenized, that is, mapped to an integer index. The Vocabulary and CoCoDataset classes take care of that (Figure B).
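To make the tokenization step concrete, here is a minimal sketch of a word-to-index mapping. It is not the repo's Vocabulary class: the special tokens, the whitespace split and the helper names build_vocab and tokenize are assumptions made for illustration. The vocab_threshold argument plays the role of the parameter of the same name listed further down.

```python
from collections import Counter

# Special tokens are an assumption of this sketch, not taken from the repo.
START, END, UNK = "<start>", "<end>", "<unk>"

def build_vocab(captions, vocab_threshold=5):
    """Map every word seen at least `vocab_threshold` times to an integer index."""
    counts = Counter(word for c in captions for word in c.lower().split())
    words = sorted(w for w, n in counts.items() if n >= vocab_threshold)
    return {w: i for i, w in enumerate([START, END, UNK] + words)}

def tokenize(caption, word2idx):
    """Turn a caption into a list of indices, wrapped in start/end markers."""
    unk = word2idx[UNK]
    body = [word2idx.get(w, unk) for w in caption.lower().split()]
    return [word2idx[START]] + body + [word2idx[END]]

captions = ["a dog runs", "a dog sleeps", "a cat runs"]
word2idx = build_vocab(captions, vocab_threshold=2)

# "jumps" is not in the vocabulary, so it falls back to the <unk> index.
tokens = tokenize("a dog jumps", word2idx)
```

Words below the threshold collapse into a single unknown token, which keeps the vocabulary (and therefore the decoder's output layer) small.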

Captions are not all of equal length. To avoid padding shorter captions with many useless zeros that would propagate forward through the network, captions are grouped into ‘buckets’ by length: a caption of length N joins the bucket of captions with length N, and a probability distribution over lengths is built from those buckets. During training, a length N is randomly sampled from this distribution and the system prepares a batch of captions of that length.

If, for instance, the length 10 is randomly sampled, every caption in the batch has 10 tokens and the decoder output has shape [batch_size, 10, vocab_size]. Naturally, the number of LSTM unfoldings (an abstraction of the LSTM cell that represents the passing of time) is equal to the sequence length.
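The length-based sampling can be sketched roughly as follows. This is an illustration, not the code in get_loader(): the helper names are hypothetical, and I am assuming that a length is drawn in proportion to how many captions have it, and that a batch is drawn with replacement from the matching bucket.

```python
import random
from collections import defaultdict

def make_length_buckets(tokenized_captions):
    """Group caption indices by caption length."""
    buckets = defaultdict(list)
    for idx, tokens in enumerate(tokenized_captions):
        buckets[len(tokens)].append(idx)
    return buckets

def sample_batch_indices(buckets, batch_size, rng=random):
    """Draw a length weighted by bucket size, then a batch from that bucket."""
    lengths = list(buckets)
    weights = [len(buckets[n]) for n in lengths]
    length = rng.choices(lengths, weights=weights, k=1)[0]
    # Sampling with replacement keeps batch_size fixed even for small buckets.
    return length, rng.choices(buckets[length], k=batch_size)

tokenized = [[0, 3, 1], [0, 4, 1], [0, 3, 4, 1], [0, 5, 1]]
buckets = make_length_buckets(tokenized)
length, batch = sample_batch_indices(buckets, batch_size=2, rng=random.Random(0))
```

Every caption in a batch has the same length, so it can be stacked into a tensor without any padding.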

(Training) Architecture

I have prepared a diagram (Figure A) to explain the architecture, which can do a better job than many words. The following is happening:

  1. The Encoder (CNN + Embedding Layer) produces a low-level representation of the image. The embedding layer outputs data of dimension [embed_size]. Since we are dealing with batches, the output dimensions of the encoder are [batch_size, embed_size].
  2. The output of the encoder is fed into the first unfolding of the Decoder (LSTM).
  3. More precisely, it is concatenated with the incoming word vectors, which pass through an embedding layer of their own. That embedding layer outputs data of shape [batch_size, sequence_length, embed_size], where embed_size matches that of the embedding layer in the encoder.
  4. The result of the concat operation is data of shape [batch_size, sequence_length + 1, embed_size]. The CNN output joins the embedded word tokens as just another element of the sequence and, to all effects, becomes its first element.
  5. The Decoder then propagates forward with this data. More specifically, when a data ‘package’ of shape [batch_size, sequence_length + 1, embed_size] arrives at the LSTM:
     1. There is quite a bit of black magic happening in terms of logistics (see the last point of this list).
     2. The first unit of the sequence (2nd dimension) is the CNN output.
     3. The LSTM processes the first unit of the sequence, whose dimension is set by the embedding process. The cell then outputs ht and ct. Both are passed on to the next unfolding, but ht is also the output of the cell, at the top.
     4. This output (ht), once mapped to vocab_size scores, is measured against the expected result via Cross Entropy Loss, which combines LogSoftmax (expressing the log-probability of each of the vocab_size elements, 8855 in this case, being the true one) and nn.NLLLoss(), which is a loss function like any other.
     5. Then on to the next timestep (word token) along the sequence. The LSTM unfolds successively for every element of the sequence.
     6. Then backpropagation and gradient descent happen.
     7. Logistically speaking, however, the LSTM manages the batch_size dimension in the ‘background’. All of the above happens in batches: if batch_size is 10, data ‘packages’ come into the LSTM 10 by 10. Output also comes in batches, with shape [batch_size, sequence_length, vocab_size].
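The shapes in the steps above can be checked with a small sketch. This is not the repo's DecoderRNN: the class name DecoderSketch, the final linear layer that maps the LSTM output to vocab_size scores, and the choice to drop the last caption token before embedding (so that inputs and targets line up once the image feature is prepended) are all assumptions. The sizes are tiny so that it runs quickly.

```python
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    """Hypothetical decoder: word embedding -> LSTM -> linear layer over the vocabulary."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)  # assumed projection to word scores

    def forward(self, features, captions):
        # Assumption: drop the last token, so sequence_length = caption_len - 1.
        words = self.embed(captions[:, :-1])                        # [batch, seq_len, embed]
        # The image feature becomes the first element of the sequence.
        inputs = torch.cat([features.unsqueeze(1), words], dim=1)   # [batch, seq_len + 1, embed]
        hidden, _ = self.lstm(inputs)                               # [batch, seq_len + 1, hidden]
        return self.fc(hidden)                                      # [batch, seq_len + 1, vocab]

torch.manual_seed(0)
batch_size, caption_len = 4, 10
embed_size, hidden_size, vocab_size = 8, 16, 20

decoder = DecoderSketch(embed_size, hidden_size, vocab_size)
features = torch.randn(batch_size, embed_size)                       # stand-in for the encoder output
captions = torch.randint(0, vocab_size, (batch_size, caption_len))   # token indices

scores = decoder(features, captions)                                 # one score vector per caption token

# Cross entropy over every (batch, timestep) position at once.
criterion = nn.CrossEntropyLoss()
loss = criterion(scores.reshape(-1, vocab_size), captions.reshape(-1))

# The same value computed by hand: LogSoftmax followed by NLLLoss.
log_probs = torch.log_softmax(scores.reshape(-1, vocab_size), dim=1)
manual_loss = nn.NLLLoss()(log_probs, captions.reshape(-1))
```

Note that the LSTM handles the whole batch and the whole sequence in a single call; the per-timestep unfolding described above happens inside nn.LSTM.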


As mentioned, there has been little room for experimentation. The parameters employed have been the following:

batch_size = 64

embed_size = 256

hidden_size = 512

vocab_threshold = 5

epochs = 1

learnable parameters: params = list(encoder.embed.parameters()) + list(decoder.parameters())

optimizer = torch.optim.Adam(params, lr=0.001)

NB: even though the paper recommends using SGD, it became clear very quickly that Adam worked better in my runs.
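The parameter list above means that only the encoder's embedding layer and the decoder are trained, while the pretrained CNN stays as it is. Here is a rough sketch of that setup with stand-in modules. EncoderStandIn, its backbone attribute and the explicit freezing with requires_grad_ are assumptions for illustration, not the repo's code.

```python
import torch
import torch.nn as nn

class EncoderStandIn(nn.Module):
    """Stand-in for EncoderCNN: a frozen 'backbone' plus a trainable embed layer."""
    def __init__(self, feature_size, embed_size):
        super().__init__()
        # Placeholder for the pretrained ResNet; frozen so it is never updated.
        self.backbone = nn.Linear(feature_size, feature_size)
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.embed = nn.Linear(feature_size, embed_size)

encoder = EncoderStandIn(feature_size=32, embed_size=256)
decoder = nn.LSTM(input_size=256, hidden_size=512, batch_first=True)  # stand-in for DecoderRNN

# Only the encoder's embed layer and the decoder are handed to the optimizer.
params = list(encoder.embed.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)
```

Leaving the backbone out of params is enough to stop the optimizer from updating it. Freezing it as well avoids computing gradients that would never be used.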


Inference is performed by the sample method of the DecoderRNN class (Figure C).

The unfolding of the LSTM is an abstraction, meaning that to perform inference with it you have to run it in a loop (see Figure C).

In inference, an image is passed through the Encoder (having undergone the appropriate transformations). The output of the encoder is fed as the one and only input of the first unfolding of the LSTM and propagated forward. The hidden state (the ht and ct tuple) is then passed on to the next unfolding, whilst the output of the cell is passed through a torch.max function.

After that, the resulting index (torch.max returns a (values, indices) tuple) is passed through the embedding layer and into the next unfolding of the LSTM cell, and so forth up to the maximum sequence length.

This way, the DecoderRNN produces a sequence that forms a natural language description of the image originally fed in.
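The inference loop described above can be sketched as a greedy decoder. This is not the sample method from the repo: the function name greedy_sample, the separate embed / lstm / fc modules and the linear layer producing the word scores are assumptions. The modules are untrained, so the output is only meaningful as a shape check.

```python
import torch
import torch.nn as nn

def greedy_sample(embed, lstm, fc, features, max_len=20):
    """Greedy decoding: feed the image feature first, then feed back each predicted word."""
    inputs = features.unsqueeze(1)                 # [1, 1, embed_size]
    states = None                                  # the LSTM starts from a zero (ht, ct)
    caption = []
    for _ in range(max_len):
        hidden, states = lstm(inputs, states)      # hidden: [1, 1, hidden_size]
        scores = fc(hidden.squeeze(1))             # [1, vocab_size]
        _, index = torch.max(scores, dim=1)        # torch.max returns (values, indices)
        caption.append(index.item())
        inputs = embed(index).unsqueeze(1)         # predicted word becomes the next input
    return caption

torch.manual_seed(0)
embed_size, hidden_size, vocab_size = 8, 16, 20
embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

with torch.no_grad():
    # torch.randn stands in for the encoder output of a single image.
    caption = greedy_sample(embed, lstm, fc, torch.randn(1, embed_size), max_len=5)
```

The returned list of indices would then be mapped back to words through the vocabulary. A real implementation might also stop early once an end-of-caption token is produced.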


Model validation is a task I have not been able to tackle yet. However, here are a few examples of the system's output. Bear in mind, I only trained it for 1 epoch. This is pretty exciting / scary stuff.

Investor, Technologist. Post opinions, not financial advice. Do your own research. Follow me at TW:alc2022
