The first serious model I tried was a PixelCNN. A PixelCNN is designed so that, when predicting the pixel at location (x, y), the network can only use pixels “before” it, i.e. pixels strictly above it, or pixels to its left in the same row.
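The "before" constraint is usually enforced by multiplying each convolution kernel by a 0/1 mask. The post does not show how it built the mask, so the following is only a minimal sketch of the usual raster-scan mask; the function name `causal_mask` and the `include_center` flag are hypothetical names chosen for illustration.

```python
def causal_mask(kernel_size, include_center=False):
    """Build a kernel_size x kernel_size 0/1 mask for a masked convolution.

    Entry [i][j] is 1 when the kernel tap at row i, column j looks at a pixel
    that comes earlier in raster-scan order than the centre pixel: any tap in
    a row above the centre, or a tap to the left of the centre in the same row.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the kernel has a centre")
    c = kernel_size // 2
    mask = []
    for i in range(kernel_size):
        row = []
        for j in range(kernel_size):
            above = i < c
            left_in_same_row = (i == c and j < c)
            at_center = (i == c and j == c)
            keep = above or left_in_same_row or (include_center and at_center)
            row.append(1 if keep else 0)
        mask.append(row)
    return mask


for row in causal_mask(3):
    print(row)
```

With `include_center=False` the centre tap is dropped, so a layer applied directly to the image cannot see the pixel it is predicting. Deeper layers typically keep the centre tap (`include_center=True`), because their input at that position already depends only on earlier pixels.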
I added a column to the architecture which does not use masked convolutions. It has access to pixels in all directions, but its input is masked: the inner 32×32 subimage is zeroed out. This lets the conditional PixelCNN use information from everywhere outside the inner 32×32 subimage, in addition to the pixels above and to the left, i.e. the ones which occur earlier in the autoregressive chain. Captions were not used.
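As an illustration of the input masking for the unmasked column, here is a minimal sketch that zeroes the central block of a square image. It assumes a single-channel image stored as a list of rows and a centred hole; the helper name `mask_center` is hypothetical, and the 64×64 size used in the example is an assumption.

```python
def mask_center(image, hole=32):
    """Return a copy of a square image with its central hole x hole block zeroed.

    `image` is a list of rows (single channel, for simplicity). The original
    image is left untouched.
    """
    size = len(image)
    if hole > size:
        raise ValueError("hole cannot be larger than the image")
    start = (size - hole) // 2
    out = [list(row) for row in image]
    for y in range(start, start + hole):
        for x in range(start, start + hole):
            out[y][x] = 0
    return out


# Example: a 64x64 image of ones keeps a 16-pixel border on every side.
context = mask_center([[1] * 64 for _ in range(64)])
```

Because the hole is zeroed before this column sees the image, the column can use ordinary, unmasked convolutions without leaking the pixels that are to be predicted.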
The results are encouraging. Qualitatively, the conditional PixelCNN is able to generate sharp images which, for the most part, match the features of their surroundings and look like a reasonable prediction of the missing subimage if you don’t look closely. The predictions are not blurry at all, but still far from perfect. Notably, it seems that the model doesn’t really know what to generate. It mostly sidesteps the problem of generating real objects by producing something that merely tries to be consistent with the surroundings.
The model does better when generation only requires texture synthesis, such as the food plate, roads, and backgrounds. The model fails spectacularly when objects (large spatial structures) need to be generated.
Avenues of Improvement
- Use captions! The current model doesn’t use any caption information. That said, I believe the caption-less performance is not yet good enough for caption information to help.
- The teacher forcing discrepancy. Autoregressive models are trained using teacher forcing, i.e. each pixel is predicted conditioned on the ground-truth previous pixels. In practice, most models are imperfect, due to non-convergence of training, lack of capacity, or the data being under-determined for the problem. Thus, at sample time, it’s easy to generate a partial sample on which the model performs badly, and the conditional distributions are then badly out of tune with the empirical data distribution. There are multiple fixes to this problem, such as professor forcing.
- Hierarchical PixelCNN, which makes some conditional independence assumptions so that the generation process does not run from top-left to bottom-right, but proceeds by iteratively super-resolving images, starting from a 1×1 image, and then 2×2, 4×4, 8×8 …, 64×64. Although the independence assumptions are not ideal, the change in the generation process could possibly provide a better prior that suits this dataset.
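
To make the coarse-to-fine order in the last item concrete, here is a small sketch that lists the resolutions visited when doubling from 1×1 up to the final size. The helper name `resolution_schedule` is hypothetical, and the sketch covers only the ordering, not the model itself.

```python
def resolution_schedule(final_size=64):
    """Side lengths visited when doubling from 1x1 up to final_size x final_size."""
    if final_size < 1 or final_size & (final_size - 1):
        raise ValueError("final_size must be a power of two")
    sizes = [1]
    while sizes[-1] < final_size:
        sizes.append(sizes[-1] * 2)
    return sizes


print(resolution_schedule(64))  # [1, 2, 4, 8, 16, 32, 64]
```

That is seven super-resolution stages, compared with 64 × 64 = 4096 sequential pixel predictions in raster-scan order. How many sequential steps each stage needs depends on which independence assumptions are made, which is left open here.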