In this new series of blog posts, I’m going to discuss structured prediction through the lens of conditional image generation. I’m taking IFT6266 at the University of Montreal, taught by Prof. Aaron Courville, and this is the course project.

The problem is to fill the missing inner (32, 32) sub-image given the outer (64, 64) image *and* a textual caption which describes the whole image.

The above example uses full-resolution samples from the MSCOCO dataset. In this course, however, we will tackle the simpler problem of down-sampled images of size (64, 64). Here’s an example:

A black Honda motorcycle parked in front of a garage.

This is a 64×64 image, upsampled using ‘nearest pixel’ so that we can see the details. Note that most image viewers, including WordPress, upsample using linear, bilinear or cubic interpolation, which would be misleading since the images would appear smoother. ‘Nearest pixel’ upsampling does not change the pixel values, and thus presents the original image in all its raw glory. Here’s another example:

A room with blue walls and a white sink and door
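
To make the setup concrete, here is a minimal NumPy sketch of the two operations described so far: cutting the inner (32, 32) region out of a (64, 64) image, and ‘nearest pixel’ upsampling for viewing. The function names `split_image` and `upsample_nearest` are hypothetical, the inner region is assumed to be centered, and a random array stands in for a real MSCOCO sample.

```python
import numpy as np

def split_image(image, inner=32):
    """Split an (H, W, C) image into the outer context and the inner target.

    Assumption: the missing square is centered. The context keeps the
    original shape, with the inner square set to zero.
    """
    h, w = image.shape[:2]
    top, left = (h - inner) // 2, (w - inner) // 2
    target = image[top:top + inner, left:left + inner].copy()
    context = image.copy()
    context[top:top + inner, left:left + inner] = 0
    return context, target

def upsample_nearest(image, factor):
    """Nearest-pixel upsampling: repeat every pixel `factor` times per axis."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

# Random stand-in for a down-sampled image (no real data is loaded here).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

context, target = split_image(image)   # model input and prediction target
big = upsample_nearest(image, 4)        # enlarged copy for viewing
```

Because `upsample_nearest` only repeats existing pixels, every value in `big` already occurs in `image`, which is the property that makes this kind of upsampling faithful for inspection.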

My first thought is that the distribution seems mostly unimodal, or perhaps has a handful of modes. By this I mean that there is probably enough information in the outer image and the caption to define each and every pixel of the inner image *independently*. Mathematically, this implies the modeling assumption that the conditional distribution of the inner image, given the outer image and the caption, factorizes into a product of per-pixel conditional distributions.
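
In symbols, the independence assumption can be written as follows. The notation is my own shorthand, assumed for illustration: $x^{\text{in}}$ is the inner (32, 32) sub-image, $x^{\text{out}}$ the outer image, $c$ the caption, and $(i, j)$ indexes the inner pixels.

```latex
p\left(x^{\text{in}} \mid x^{\text{out}}, c\right)
  = \prod_{i=1}^{32} \prod_{j=1}^{32}
    p\left(x^{\text{in}}_{ij} \mid x^{\text{out}}, c\right)
```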

However, even if the above assumption holds true in the *technical* sense, things could be different practically. A feedforward convolutional layer is one model which makes the above assumption, i.e. it parametrizes a function whose output units are mutually independent given the input.
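
One practical consequence of the independence assumption is that the log-likelihood of the inner image is just a sum of per-pixel terms. As a minimal sketch, suppose each pixel is modeled as a Gaussian with fixed variance centered on the model output. The Gaussian choice and the function name `factorized_gaussian_log_likelihood` are my assumptions for illustration, not something fixed by the task. Under that assumption, maximizing the likelihood is the same as minimizing the squared error.

```python
import numpy as np

def factorized_gaussian_log_likelihood(target, mean, sigma=1.0):
    """Log-likelihood of `target` when every pixel is an independent
    Gaussian with standard deviation `sigma`, centered on `mean`."""
    per_pixel = (-0.5 * ((target - mean) / sigma) ** 2
                 - np.log(sigma)
                 - 0.5 * np.log(2.0 * np.pi))
    # Independence means the joint log-likelihood is a plain sum.
    return per_pixel.sum()

# Random stand-ins for the true inner image and a model's prediction.
rng = np.random.default_rng(0)
target = rng.random((32, 32, 3))
mean = rng.random((32, 32, 3))

ll = factorized_gaussian_log_likelihood(target, mean)

# With sigma = 1 the log-likelihood is the negative half sum of squared
# errors plus a constant that does not depend on the prediction.
sse = ((target - mean) ** 2).sum()
const = -0.5 * target.size * np.log(2.0 * np.pi)
```

Here `ll` equals `-0.5 * sse + const` up to floating-point error, so a feedforward model trained with a squared-error loss is implicitly making this factorized assumption.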

While the independence assumption might be correct theoretically, a feedforward model might still fail in practice for other reasons. For instance, it might be difficult to optimize, or currently known architectural components might not be well suited to the task. If the feedforward model works, we would know that both the modeling assumption and the architecture are reasonable. If it doesn’t, that is still useful information, but we won’t be sure what failed: the independence assumption or the particular architecture choice. Some evidence could be obtained by qualitative analysis of the results, and we will attempt to do so.

The other popular models that could be used for this task are:

- Ancestral Sampling Methods
  - Autoregressive Modeling (PixelCNN)
  - Latent Variable Models
  - Generative Adversarial Networks
  - Variational Bayes
  - Normalizing flows
- Iterative Sampling Methods
  - Parametric Sampler (e.g., GSNs)
  - Non-parametric Sampler (e.g., Hamiltonian VI)

Most other methods are a combination of two or more of the above, as far as I know.
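
For contrast with the independence assumption, the autoregressive entry in the list above factorizes the inner image with the chain rule, so each pixel is also conditioned on the pixels generated before it. The notation is assumed for illustration: $x^{\text{in}}_k$ is the $k$-th inner pixel under some chosen ordering, $N$ is the number of inner pixels, $x^{\text{out}}$ is the outer image and $c$ is the caption.

```latex
p\left(x^{\text{in}} \mid x^{\text{out}}, c\right)
  = \prod_{k=1}^{N}
    p\left(x^{\text{in}}_{k} \mid x^{\text{in}}_{<k},\, x^{\text{out}}, c\right)
```

Unlike the per-pixel product, this factorization makes no independence assumption; the price is that sampling has to proceed one pixel at a time.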

The next blog post will go into detail on each of these paradigms, discussing the theoretical and practical assumptions they make and their advantages and disadvantages.