Deconvolution in Deep Learning

In the deep learning literature, the term deconvolution is used differently from its conventional meaning in image and signal processing. This blog post briefly explains the difference.

Deconvolution in Image & Signal Processing

The Oxford dictionary defines deconvolution as “a process of resolving something into its constituent elements or removing complication.” In image and signal processing, deconvolution is an image/signal restoration technique used to enhance an image/signal degraded by blurring and noise. The observed degraded image, g, is modeled as g = h∗f + η, where f is the input image, h represents the degradation function, the symbol ∗ denotes convolution, and η stands for additive noise in the image acquisition process. In other words, the degradation is treated as a convolution of the input image with h. Thus, the aim of image restoration, or deconvolution, is to recover f given the degraded image g. There are several methods for this task, in both the frequency and the spatial (time-domain for one-dimensional signals) domains, each based on different assumptions. One such method is the regularized filter, which finds an estimate \hat{f} by minimizing the squared error between the observed image g and the re-blurred estimate h∗\hat{f}, subject to a smoothness constraint on \hat{f}. An example of deblurring using the regularized filter approach is shown below.

The result of applying deconvolution. The input image is on the left and the restored image is on the right.

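The degradation model and a regularized restoration can be sketched in one dimension with NumPy. This is only an illustration under stated assumptions, not the method used to produce the figure: the box-shaped signal, the 5-tap blur kernel, the noise level, the Laplacian smoothness penalty, and the weight lam are all made up for the example, and the convolution is circular so that it can be computed with the FFT.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# f: the "true" signal (assumed here to be a simple box).
f = np.zeros(n)
f[22:42] = 1.0

# h: assumed blur kernel [1, 4, 6, 4, 1] / 16, centered at index 0 (circular).
h = np.zeros(n)
h[[0, 1, 2, -1, -2]] = np.array([6, 4, 1, 4, 1]) / 16.0
H = np.fft.fft(h)

# g = h * f + eta, with circular convolution done in the frequency domain.
eta = 0.002 * rng.standard_normal(n)
g = np.real(np.fft.ifft(H * np.fft.fft(f))) + eta

# Smoothness penalty: discrete Laplacian [1, -2, 1], also centered at index 0.
p = np.zeros(n)
p[[0, 1, -1]] = [-2.0, 1.0, 1.0]
P = np.fft.fft(p)

# Regularized inverse filter: minimizes ||g - h*f_hat||^2 + lam * ||p*f_hat||^2.
lam = 1e-3  # assumed regularization weight
G = np.fft.fft(g)
F_hat = np.conj(H) * G / (np.abs(H) ** 2 + lam * np.abs(P) ** 2)
f_hat = np.real(np.fft.ifft(F_hat))

err_degraded = np.mean((g - f) ** 2)
err_restored = np.mean((f_hat - f) ** 2)
print("MSE of degraded signal:", err_degraded)
print("MSE of restored signal:", err_restored)
```

The regularization term keeps the division stable at frequencies where the blur kernel passes almost nothing; without it, the noise at those frequencies would be amplified without bound.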
Convolution and Deconvolution in Deep Learning

Deconvolution in deep learning is not concerned with restoring a degraded signal or image; rather, it is concerned with mapping a set of data values to a larger set of data values, that is, up-sampling the data. To understand this, let's begin by considering a 4×4 data patch and a 3×3 convolution filter as shown below.

A 4×4 data patch and a 3×3 convolution mask

Let's assume that we want to calculate the convolution result only for those patch positions where the convolution mask fits entirely inside the patch. In other words, we want to perform convolution with no padding. This leaves only four positions where the mask can be centered: second row-second column, second row-third column, third row-second column, and third row-third column. As a refresher, the result of convolution at the second row-second column position is given by 0*3 + 1*3 + 2*2 + 2*0 + 2*0 + 0*1 + 0*3 + 1*1 + 2*2, where the first element in each multiplication term comes from the convolution mask. The result is 12. Calculating similarly at the remaining positions, we get the 2×2 matrix with rows [12 12] and [10 17] as the convolution result for the given data patch and convolution mask.
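The four sums can be checked with a few lines of NumPy. This is a minimal sketch: the patch values are taken from the vector representation of the data patch used in this post, the mask values are read off the first factors of the arithmetic above, and the helper name valid_conv2d is made up for the example. As in the hand calculation, the mask is applied without flipping, which is what deep learning frameworks call convolution.

```python
import numpy as np

# 4x4 data patch and 3x3 mask from the worked example.
patch = np.array([[3, 3, 2, 1],
                  [0, 0, 1, 3],
                  [3, 1, 2, 2],
                  [2, 0, 0, 2]])
mask = np.array([[0, 1, 2],
                 [2, 2, 0],
                 [0, 1, 2]])

def valid_conv2d(data, kernel):
    """Apply the kernel at every position where it fits entirely inside data."""
    kh, kw = kernel.shape
    out_h = data.shape[0] - kh + 1
    out_w = data.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=data.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the mask and the window under it, summed.
            out[i, j] = np.sum(kernel * data[i:i + kh, j:j + kw])
    return out

result = valid_conv2d(patch, mask)
print(result)
# [[12 12]
#  [10 17]]
```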

Let's look at the above convolution operation in matrix-vector form, which is how the computation is commonly carried out. Converting the 4×4 data patch to a vector by reading its elements left to right and top to bottom, we can write its vector representation as \bold{D}^T = [3 3 2 1 0 0 1 3 3 1 2 2 2 0 0 2]. Let's represent our 3×3 convolution mask with generic elements as

Convolution mask with generic elements

To obtain a vector representation for the convolution mask, we create one 4×4 array for each of the four positions where we want to calculate the convolution, place the mask at that position, and fill the remaining elements of the array with 0’s. We then convert each array to a vector by reading its elements left to right and top to bottom, which generates four 16-dimensional vectors as shown in the figure below.

Converting convolution mask to vectors

We combine the four vectors representing the convolution operation at the four positions in our data patch as columns, and get a 16×4 matrix C. The transpose of matrix C is shown below.

Matrix representation of convolution operation

Replacing the generic convolution mask values with actual mask elements used earlier, matrix C can be written as

With the above matrix-vector representation, the convolution operation can be written simply as \bold{D}^T\bold{C} = [12 12 10 17], which upon rearranging as a 2×2 array gives the result shown earlier.
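The construction of C and the product \bold{D}^T\bold{C} can be reproduced as follows. This is a sketch under the same assumptions as the worked example (no padding, unit stride); the function name conv_matrix is made up for illustration, and the mask values are the ones used in the hand calculation earlier.

```python
import numpy as np

patch = np.array([[3, 3, 2, 1],
                  [0, 0, 1, 3],
                  [3, 1, 2, 2],
                  [2, 0, 0, 2]])
mask = np.array([[0, 1, 2],
                 [2, 2, 0],
                 [0, 1, 2]])

def conv_matrix(kernel, in_shape):
    """Build the (in_h*in_w) x (number of positions) matrix C described above."""
    in_h, in_w = in_shape
    kh, kw = kernel.shape
    columns = []
    for i in range(in_h - kh + 1):
        for j in range(in_w - kw + 1):
            # Place the mask at one position inside an otherwise zero array...
            placed = np.zeros(in_shape, dtype=kernel.dtype)
            placed[i:i + kh, j:j + kw] = kernel
            # ...and read it left to right, top to bottom.
            columns.append(placed.ravel())
    return np.stack(columns, axis=1)

C = conv_matrix(mask, patch.shape)  # shape (16, 4)
D = patch.ravel()                   # the 16-dimensional vector D
flat = D @ C                        # D^T C

print(C.T)                 # each row is the mask placed at one position
print(flat)                # [12 12 10 17]
print(flat.reshape(2, 2))
```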

Transposed Convolution

In deep learning, deconvolution essentially refers to the operation performed when the computation runs through a convolution layer in the reverse direction, from its output back to its input. This happens during error propagation, and also when generating a segmented image in semantic segmentation. Taking again a simplified view without pooling and using the above example, deconvolution in this case involves mapping a 2×2 array to a 4×4 array. Once we unroll the 2×2 array into a 4-dimensional row vector, \bold{E}, we can obtain 16 data values as a 16-dimensional row vector by performing the matrix multiplication \bold{E}\bold{C}^T using the transposed convolution matrix. Rearranging the resulting 16-dimensional vector as a 4×4 array gives us an up-sampled result. Note that this restores only the shape of the original patch, not its values. Thus, deconvolution in deep learning refers to transposed convolution, and it has no connection with deconvolution for image/signal restoration.
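The up-sampling step can be sketched with the same numbers. The choice of \bold{E} = [12 12 10 17] (the convolution result from the earlier example) is only an assumption made for illustration, since any 4-dimensional vector would do, and conv_matrix is a hypothetical helper that places the mask at each of the four positions. The loop at the end shows an equivalent view of \bold{E}\bold{C}^T: each input value stamps a scaled copy of the mask onto the output, and overlapping stamps are added.

```python
import numpy as np

mask = np.array([[0, 1, 2],
                 [2, 2, 0],
                 [0, 1, 2]])

def conv_matrix(kernel, in_shape):
    """16x4 matrix C: one flattened, zero-filled placement of the mask per column."""
    in_h, in_w = in_shape
    kh, kw = kernel.shape
    columns = []
    for i in range(in_h - kh + 1):
        for j in range(in_w - kw + 1):
            placed = np.zeros(in_shape, dtype=kernel.dtype)
            placed[i:i + kh, j:j + kw] = kernel
            columns.append(placed.ravel())
    return np.stack(columns, axis=1)

C = conv_matrix(mask, (4, 4))
E = np.array([12, 12, 10, 17])        # assumed 2x2 input, unrolled row by row

# Transposed convolution: 4 values in, 16 values out.
upsampled = (E @ C.T).reshape(4, 4)
print(upsampled)

# Equivalent scatter-add view of the same operation.
check = np.zeros((4, 4), dtype=int)
for idx, value in enumerate(E):
    i, j = divmod(idx, 2)             # position in the 2x2 input
    check[i:i + 3, j:j + 3] += value * mask
print(np.array_equal(upsampled, check))
```

The 4×4 output has the same shape as the original data patch but different values, which is why the operation is an up-sampling step and not an inverse of the convolution.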

While the above illustration was done with no padding, unit stride, and no pooling, it is not difficult to see how a similar matrix-vector representation can be used with padded data/image patches and with different strides and pooling. In those situations, shaping the data patches and convolution masks into the matrix-vector representation requires additional considerations, but the basic idea remains the same.
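As a rough sketch of one such extension, the matrix construction can take stride and padding into account by placing the mask on a zero-padded grid and then discarding the padded border, since those weights only ever multiply zeros. The function name conv_matrix, its parameters, and the choice of stride 2 with padding 1 are assumptions made for illustration; pooling is not covered here.

```python
import numpy as np

def conv_matrix(kernel, in_shape, stride=1, padding=0):
    """Return (C, out_shape) so that flat_input @ C is the strided, padded convolution."""
    in_h, in_w = in_shape
    kh, kw = kernel.shape
    pad_h, pad_w = in_h + 2 * padding, in_w + 2 * padding
    out_h = (pad_h - kh) // stride + 1
    out_w = (pad_w - kw) // stride + 1
    columns = []
    for i in range(out_h):
        for j in range(out_w):
            # Place the mask on the padded grid at the strided position.
            placed = np.zeros((pad_h, pad_w), dtype=kernel.dtype)
            r, c = i * stride, j * stride
            placed[r:r + kh, c:c + kw] = kernel
            # Keep only the part that overlaps the real (unpadded) input.
            inner = placed[padding:padding + in_h, padding:padding + in_w]
            columns.append(inner.ravel())
    return np.stack(columns, axis=1), (out_h, out_w)

patch = np.array([[3, 3, 2, 1],
                  [0, 0, 1, 3],
                  [3, 1, 2, 2],
                  [2, 0, 0, 2]])
mask = np.array([[0, 1, 2],
                 [2, 2, 0],
                 [0, 1, 2]])

C, out_shape = conv_matrix(mask, patch.shape, stride=2, padding=1)
down = (patch.ravel() @ C).reshape(out_shape)   # convolution: 4x4 -> 2x2
up = (down.ravel() @ C.T).reshape(patch.shape)  # transposed convolution: 2x2 -> 4x4
print(C.shape, out_shape)
print(down)
print(up)
```

With the default arguments (stride=1, padding=0) the function reduces to the 16×4 matrix C of the worked example.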