This post summarizes my idea of Convolutional Auto-encoder (ConvAE), which is the direct extension of Auto-encoder model by using ConvNet’s architecture (they are essentially same actually considering same size filters). This post assumed that the reader understand how ConvNet works.

ConvAE: Basic Theory

As any other Auto-encoder model, ConvAE tries to learn representation of input signal (primarily 2D data) by employing encoder-decoder paradigm. The basic structure is a ConvNet with 3 layers, but you can actually improve it to a symmetric network that has multiple layers. Conventionally, auto-encoders are forced to learn the entire receptive field since the constraints of MLP networks. Therefore, in order to learn filters of local receptive fields, people collect random image patches from the dataset and feed them in a MLP network. This is proven to be a good initialization of ConvNet in some early papers. The major advantage of ConvAE is that all the filters are traveling through all possible position of the image, so we don’t have to collect the image patches manually. And the result filters can also be used by further supervised ConvNet models.