## Conditional PixelCNN

The first serious model I tried is a PixelCNN. A PixelCNN is designed so that, when predicting the pixel at location (x, y), the network can only use pixels “before” it, i.e. pixels strictly above it, or pixels to its left in the same row.
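To make the “before” ordering concrete, here is a minimal sketch of the kind of kernel mask a masked convolution could use. It is an illustration under assumptions, not the code used for these experiments: the function name `causal_mask` and the `include_center` flag are hypothetical, the kernel size is assumed odd, and the ordering between colour channels is ignored.

```python
import numpy as np

def causal_mask(kernel_size: int, include_center: bool) -> np.ndarray:
    """Return a (k, k) 0/1 mask for a raster-scan masked convolution.

    Hypothetical helper: ones mark kernel taps the convolution may read.
    """
    k = kernel_size
    c = k // 2                      # index of the centre tap (k assumed odd)
    mask = np.zeros((k, k), dtype=np.float32)
    mask[:c, :] = 1.0               # every row strictly above the centre
    mask[c, :c] = 1.0               # same row, strictly to the left
    if include_center:
        mask[c, c] = 1.0            # later layers may also read the centre
    return mask

if __name__ == "__main__":
    print(causal_mask(3, include_center=False))
```

Multiplying a convolution kernel elementwise by this mask before applying it keeps the output at (x, y) from depending on pixels that come later in the raster order.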

I added an additional column in the architecture which does not use masked convolutions. It has access to pixels in all directions, but its input is masked: the inner 32×32 subimage is zeroed out. This allows the conditional PixelCNN to use information from outside the inner 32×32 sub-image, and use information from pixels towards the top-left as well, i.e. ones which occur earlier in the autoregressive chain. Captions were not used.
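The input masking for the extra column can be sketched as follows. This assumes the inner 32×32 patch is centred in the 64×64 image and that images are stored as `(N, H, W, C)` arrays; the name `mask_inner` is hypothetical.

```python
import numpy as np

def mask_inner(images: np.ndarray, inner: int = 32) -> np.ndarray:
    """Zero the central inner x inner patch of a batch of (N, H, W, C) images.

    Assumption: the missing region is centred in the image.
    """
    _, h, w, _ = images.shape
    top = (h - inner) // 2
    left = (w - inner) // 2
    out = images.copy()             # leave the caller's array untouched
    out[:, top:top + inner, left:left + inner, :] = 0
    return out
```

Because the hidden region is removed from the input itself, this column can use ordinary (unmasked) convolutions without leaking the pixels it is supposed to help predict.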

The results are encouraging. Qualitatively, the Conditional PixelCNN generates sharp images which mostly match the features of their surroundings, and which look like a reasonable prediction of the missing subimage if you don’t look closely. The predictions are not blurry at all, but they are still far from perfect. Notably, the model doesn’t seem to know what to generate. It mostly sidesteps the problem of generating real objects by producing something that merely tries to be consistent with the surroundings.

The model does better when generation only requires texture synthesis, such as the food plate, roads, and backgrounds. The model fails spectacularly when objects (large spatial structures) need to be generated.

## Avenues of Improvement

• Use captions! The current model doesn’t use any caption information. That said, I believe the caption-less performance is not yet good enough for caption information to make a difference.
• The teacher forcing discrepancy. Autoregressive models are trained using teacher forcing. In practice, most models are imperfect, due to non-convergence of training, lack of capacity, or the data under-determining the problem. Thus, at sample time, it’s easy to generate a partial sample on which the model performs badly, so that its conditional distributions are badly out of tune with the empirical data distribution. There are multiple fixes to this problem, such as professor forcing.
• Hierarchical PixelCNN, which makes some conditional independence assumptions so that the generation process does not run top-left to bottom-right, but proceeds by iteratively super-resolving images, starting from a 1×1 image, and then 2×2, 4×4, 8×8, …, 64×64. Although the independence assumptions are not ideal, the changed generation process could provide a prior that better suits this dataset.
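The coarse-to-fine schedule in the last item can be illustrated by building the sequence of target images, 1×1 up to 64×64, that such a model would super-resolve between. This is only a sketch: it assumes each coarser level is a 2×2 average pool of the finer one, which is my choice for illustration, and `image_pyramid` is a hypothetical name.

```python
import numpy as np

def image_pyramid(image: np.ndarray) -> list:
    """Return [1x1, 2x2, ..., HxH] versions of a square (H, H, C) image.

    Assumption: coarser levels are 2x2 average pools of finer levels.
    """
    h, w, c = image.shape
    assert h == w and h & (h - 1) == 0, "expects a square power-of-two image"
    levels = [image.astype(np.float64)]
    while levels[-1].shape[0] > 1:
        cur = levels[-1]
        s = cur.shape[0] // 2
        # Group pixels into 2x2 blocks and average within each block.
        pooled = cur.reshape(s, 2, s, 2, c).mean(axis=(1, 3))
        levels.append(pooled)
    return levels[::-1]             # coarsest first, full resolution last
```

For a 64×64 input this yields seven levels, and each generation step would model one level conditioned on the one before it.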

## Conditional Image Generation: Introduction

In this new series of blog posts, I’m going to discuss structured prediction through the lens of conditional image generation. I’m taking IFT6266 at the University of Montreal, taught by Prof. Aaron Courville, and this is the course project.

The problem is to fill the missing inner (32, 32) sub-image given the outer (64, 64) image and a textual caption which describes the whole image.

The above example uses full-resolution samples from the MSCOCO dataset. In this course, however, we will tackle the simpler problem with down-sampled images of size (64, 64). Here’s an example:

A black Honda motorcycle parked in front of a garage.

This is a 64×64 image, but upsampled using ‘nearest pixel’ so that we can see the details. Note that most typical image viewers, including WordPress, would upsample using linear, bilinear or cubic interpolation, which would be misleading since the images would appear smoother. ‘Nearest pixel’ upsampling does not change the values of the pixels, and thus presents the original image in all its raw glory. Here’s another example:

A room with blue walls and a white sink and door
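For reference, ‘nearest pixel’ upsampling as described above amounts to repeating each pixel along both spatial axes. A minimal NumPy sketch, assuming an `(H, W, C)` array and an integer scale factor (the name `upsample_nearest` is hypothetical):

```python
import numpy as np

def upsample_nearest(image: np.ndarray, factor: int) -> np.ndarray:
    """Enlarge an (H, W, C) image by an integer factor without interpolation."""
    rows = np.repeat(image, factor, axis=0)   # duplicate each row
    return np.repeat(rows, factor, axis=1)    # then duplicate each column
```

No new pixel values are created, so the enlarged image contains exactly the values of the original.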

My first thought is that the distribution $P(inner | outer, caption)$ seems mostly unimodal, or perhaps has a handful of modes. By this I mean that there is probably enough information in the outer image and the caption to define each and every pixel of the inner image independently. Mathematically, this implies the modeling assumption:

$P(inner | outer, caption) = \prod_{pixel \in inner}P(pixel | outer, caption)$

However, even if the above assumption holds true in the technical sense, things could be different practically. A feedforward convolutional layer is one model which makes the above assumption, i.e. it parametrizes a function where each output unit is mutually independent given the input.

While the independence assumption might be correct theoretically, a feedforward model might fail practically for other reasons. For instance, it might be difficult to optimize, or currently known architectural components might not be well-suited to the task. If the feedforward model works, we would know that both the modeling assumption and the architecture are reasonable. If it doesn’t, then although that is good information, we won’t be sure what failed: the independence assumption or the particular architecture choice. Still, some evidence could be obtained by qualitative analysis of the results, and we will attempt to do so.
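Under the independence assumption, the log-likelihood of the inner image is just a sum of per-pixel log-probabilities. The sketch below shows this, assuming a single channel with `K` discrete intensity values and a network that outputs one vector of unnormalized scores per pixel. Neither detail is fixed by the setup above, and `factorized_log_likelihood` is a hypothetical name.

```python
import numpy as np

def factorized_log_likelihood(logits: np.ndarray, target: np.ndarray) -> float:
    """Sum of per-pixel log-probabilities under a fully factorized model.

    logits: (H, W, K) unnormalized scores, computed from outer image and caption.
    target: (H, W) integer intensities in [0, K).
    """
    # Numerically stable log-softmax over the K intensity values.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the observed intensity at each pixel.
    per_pixel = np.take_along_axis(log_probs, target[..., None], axis=-1)[..., 0]
    return float(per_pixel.sum())
```

Because the scores for one pixel never depend on the values chosen for other inner pixels, all pixels can be predicted in a single forward pass, which is the practical appeal of the feedforward approach.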

The other popular models that could be used for this task are:

1. Ancestral Sampling Methods
    1. Autoregressive Modeling (PixelCNN)
2. Latent Variable Models