A typical machine learning situation assumes we have a large number of training vectors, for example 16×16 gray level images of the digits 0 to 9, each image labelled with the digit whose pattern is shown by the variation of gray levels in the image. In such a situation, we want the machine learning system, be it a decision tree or a neural network, to learn to output the label of the input image. Such learning is what we call supervised learning. Let's consider another machine learning scenario. We have only a collection of gray level images with no labels attached to them, and suppose we want to train a system that will reproduce the input image at the output as faithfully as possible. At first, you may question the need to produce at the output an approximation of the input. Why not keep working with the input directly? Well, being able to map the input to a representation from which we can map it back to an approximation of the input opens up possibilities of learning hidden relationships, i.e. the latent features, present in our input data. Such hidden relationships can then be used for dimensionality reduction or as features for classification.
One way to learn such hidden relationships is via an autoencoder. As shown in the figure below, it is a neural network with one hidden layer and with input and output layers of the same size. The hidden layer works as an encoder, mapping the input to a new representation. The output layer acts as a decoder, producing a reconstruction of the input from that representation. Such a network can be trained via the usual backpropagation algorithm because the desired output of each output neuron is known: we want the output layer to produce a copy of the input. Learning in such a scenario is known as unsupervised learning, as there are no labels attached to the training vectors.
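The training procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the layer sizes, learning rate, and synthetic low-dimensional data are all illustrative choices, and the input serves as its own target during backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic unlabeled data lying near a 3-dimensional manifold,
# so a 3-unit hidden code can reconstruct it reasonably well.
n_in, n_hidden = 8, 3
Z = rng.random((200, 3))
A = rng.normal(0, 1, (3, n_in))
X = sigmoid(Z @ A)                          # 200 training vectors

W1 = rng.normal(0, 0.1, (n_hidden, n_in))   # encoder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_in, n_hidden))   # decoder weights
b2 = np.zeros(n_in)

# reconstruction error before any training, for comparison
mse0 = np.mean((sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2) - X) ** 2)

lr = 0.5
for _ in range(2000):
    # forward pass: encode, then decode
    H = sigmoid(X @ W1.T + b1)              # hidden representation
    Y = sigmoid(H @ W2.T + b2)              # reconstruction of the input
    # backward pass: squared error, with X as its own target
    dY = (Y - X) * Y * (1 - Y)
    dH = (dY @ W2) * H * (1 - H)
    W2 -= lr * dY.T @ H / len(X)
    b2 -= lr * dY.mean(axis=0)
    W1 -= lr * dH.T @ X / len(X)
    b1 -= lr * dH.mean(axis=0)

mse = np.mean((Y - X) ** 2)                 # error after training
```

After training, the reconstruction error drops well below its initial value, and the rows of `W1` play the role of the learned feature masks discussed next.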
An example of the latent features learned by an autoencoder using a training set of 400 face images is shown below. The images are from the ORL face database. They were resized to 12×12 pixels, resulting in 144-dimensional training vectors. The size of the hidden layer was set to 36, and the resulting 144×36×144 autoencoder was trained using the autoencoder package. Each image in the image matrix below represents the feature mask corresponding to one hidden neuron. The latent features learned by an autoencoder are similar to those obtained via principal component analysis or singular value decomposition, as is the case in the present example.
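The PCA/SVD baseline that the autoencoder features are being compared against can be sketched as follows. Synthetic random vectors stand in for the 144-dimensional face images; the top 36 right singular vectors play the role of the 36 hidden-neuron feature masks.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 144))     # stand-in for 400 face vectors
k = 36                              # analogue of the 36 hidden units

# PCA via SVD of the centered data: project onto the top-k
# right singular vectors, then reconstruct from the projections.
mu = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
codes = (X - mu) @ Vt[:k].T         # 36-dimensional "latent features"
X_hat = codes @ Vt[:k] + mu         # reconstruction from the codes

err = np.mean((X - X_hat) ** 2)     # rank-36 reconstruction error
```

Each row of `Vt[:k]` can be reshaped to 12×12 and displayed as an image, which is the PCA counterpart of the feature masks shown in the image matrix above.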
It is not necessary to have fewer hidden neurons than inputs to learn interesting patterns in the input vectors. The number of neurons in the hidden layer can even be greater than the size of the input layer, and an autoencoder can still learn interesting patterns provided some additional constraints are imposed on learning. One such constraint is the sparsity constraint, and the resulting network is known as a sparse autoencoder. Under the sparsity constraint, we try to control the number of hidden layer neurons that become active, that is, produce an output close to 1, for any input. Suppose we have 100 hidden neurons and we restrict, on average, only 10 of them to be active for an input vector; the sparsity is then said to be 10%. The way to include the sparsity constraint is to modify the error criterion by adding a penalty term that reflects the deviation from the desired sparsity, and then apply backpropagation incorporating the sparsity penalty. Experience with different levels of sparsity indicates an inverse relationship between the level of sparsity and the scale of the relationships captured in the training data: a higher level of sparsity tends to capture more local features, and vice versa.
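One common form of this penalty term, shown here as a sketch, is the Kullback-Leibler divergence between the desired average activation `rho` (e.g. 0.1 for 10% sparsity) and each hidden neuron's actual activation averaged over the training set, `rho_hat`. The function name and the specific values are illustrative.

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat, eps=1e-8):
    """KL-divergence penalty: zero when every hidden neuron's average
    activation equals the target rho, growing with the deviation."""
    rho_hat = np.clip(rho_hat, eps, 1 - eps)   # guard against log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho = 0.1                          # desired 10% sparsity
on_target = np.full(100, 0.1)      # 100 hidden units, all at the target
off_target = np.full(100, 0.5)     # all far more active than desired

p0 = kl_sparsity_penalty(rho, on_target)    # zero: target is met
p1 = kl_sparsity_penalty(rho, off_target)   # large: target is violated
```

During training, this penalty is scaled by a weight and added to the reconstruction error, and its gradient with respect to the hidden activations flows back through the network along with the usual backpropagation terms.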
A deep autoencoder has multiple hidden layers. Such autoencoders are used to build features at successive levels of abstraction and have been used to pre-train deep neural networks, hence the name deep autoencoder. The training of a deep autoencoder is carried out in stages, one hidden layer at a time, as shown in the figure below. In Step 1, the first hidden layer, shown in blue, is trained using the training vectors. In Step 2, the output of the first hidden layer neurons acts as the input for training the second hidden layer, shown in pink. In Step 3, both trained hidden layers are combined, an output layer, in deep blue, is added, and the combined network is ready for further training via supervised learning to perform classification. This step-wise training can be carried out for many more hidden layers to build pre-trained deep neural networks.
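The three steps above can be sketched as follows. The `train_layer` helper is a hypothetical name for training one sigmoid autoencoder layer and keeping only its encoder half; the layer sizes, epoch count, and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_layer(X, n_hidden, epochs=500, lr=0.5):
    """Train one sigmoid autoencoder layer on X by backpropagation;
    return only the encoder weights (the decoder is discarded)."""
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_in, n_hidden)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1.T + b1)
        Y = sigmoid(H @ W2.T + b2)
        dY = (Y - X) * Y * (1 - Y)
        dH = (dY @ W2) * H * (1 - H)
        W2 -= lr * dY.T @ H / len(X); b2 -= lr * dY.mean(axis=0)
        W1 -= lr * dH.T @ X / len(X); b1 -= lr * dH.mean(axis=0)
    return W1, b1

X = rng.random((200, 16))               # unlabeled training vectors

# Step 1: train the first hidden layer on the raw inputs.
W1, b1 = train_layer(X, 8)
H1 = sigmoid(X @ W1.T + b1)

# Step 2: the first layer's outputs become the training inputs
# for the second hidden layer.
W2, b2 = train_layer(H1, 4)
H2 = sigmoid(H1 @ W2.T + b2)

# Step 3: stack both trained encoders; a supervised output layer
# would now be added on top of H2 and the whole network fine-tuned
# with labelled data.
```

Each stage only ever trains a single shallow autoencoder, which is what makes the scheme tractable for networks with many hidden layers.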
While autoencoders have been used primarily for learning low-dimensional representations and as a way to pre-train deep neural networks, there is also work in which deep autoencoders have been used for content-based image retrieval. Such an approach has the advantage that the query time is independent of the collection size.
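The reason query time can be independent of collection size can be illustrated with a small sketch: if the deep autoencoder maps each image to a short binary code, the codes can serve as hash-table keys, so a query costs one lookup regardless of how many images are stored. Here, thresholded random vectors stand in for the autoencoder's hidden activations.

```python
import numpy as np

rng = np.random.default_rng(3)

def binary_code(activations, threshold=0.5):
    """Threshold hidden activations into a hashable binary code."""
    return tuple(int(a > threshold) for a in activations)

index = {}                               # code -> list of image ids
for image_id in range(10_000):
    code = binary_code(rng.random(12))   # 12-bit code per "image"
    index.setdefault(code, []).append(image_id)

# Retrieval: one hash lookup, whose cost does not grow with the
# number of indexed images.
query_code = binary_code(rng.random(12))
matches = index.get(query_code, [])
```

In practice the candidates in the matching bucket (and in buckets within a small Hamming distance of the query code) would then be ranked by a finer similarity measure.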