Deep Learning and Feature Engineering


Deep learning is currently enjoying immense popularity because of what researchers have been able to demonstrate in the last few years, and because of the investment big names like Baidu, Facebook, Google, and Microsoft are making in this technology. Deep learning is being used to address important problems in computer vision, speech recognition, and natural language processing, giving computers sensory capabilities on par with humans. The computer vision community has been running a very large scale object recognition competition, known as the ImageNet challenge, in which researchers and scientists from universities and industry research labs across the globe compete to develop image recognition systems capable of recognizing a very large number of objects in images. The ImageNet competition has been ongoing since 2010. Early this year, a team of Microsoft researchers demonstrated a system that was trained using 1.2 million images containing 1,000 different objects. Their system achieved a 4.94% error rate, besting the estimated human error rate of 5.1%. It is based on convolutional neural networks, a deep neural network architecture with many layers that is popular for computer vision and object recognition tasks. Another impetus for the popularity of deep learning came when the Kaggle dogs-versus-cats recognition competition was won by a team using deep learning; the team trained its network in less than an hour and achieved over 97% accuracy.

A Bit of History

Deep learning is not the first neural network technique to have caused a great level of excitement. Similar excitement was felt about fifty years ago over Rosenblatt's perceptron learning algorithm. It was shown that a perceptron could learn to distinguish between two classes of objects when presented with examples from the two classes, adjusting the perceptron's weights to produce the correct output. However, the excitement was short lived when it was realized that while perceptron learning worked well for so-called linearly separable problems, there appeared to be no way to extend it to nonlinearly separable problems by using a network of layered perceptrons. The figure below shows the difference between linearly and nonlinearly separable problems. In the left panel, a line can be drawn to separate circles from squares; perceptron learning can find such a line by starting with a randomly drawn line and modifying it by repeatedly looking at examples of circles and squares until it achieves complete separation. In the right panel, there is no line that completely separates circles from squares. Given such a set of examples, perceptron learning never converges and is thus unable to find a solution.

Examples of linearly and nonlinearly separable data
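To make the update rule concrete, here is a minimal sketch of perceptron learning in Python on a hypothetical two-dimensional dataset. The data, learning rate, and epoch limit are my own illustrative choices, not from the original work; the idea is only that each misclassified example nudges the separating line.

```python
def train_perceptron(examples, epochs=100, lr=0.1):
    """Perceptron learning: adjust the weights after each misclassified example.
    `examples` is a list of ((x1, x2), label) pairs with labels +1 / -1."""
    w = [0.0, 0.0]   # weights defining the separating line
    b = 0.0          # bias (offset of the line)
    for _ in range(epochs):
        errors = 0
        for (x1, x2), label in examples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if pred != label:              # misclassified: nudge the line
                w[0] += lr * label * x1
                w[1] += lr * label * x2
                b += lr * label
                errors += 1
        if errors == 0:                    # converged: classes fully separated
            break
    return w, b

# Hypothetical linearly separable data: class +1 lies above the line x1 + x2 = 1
data = [((0.2, 0.1), -1), ((0.4, 0.3), -1), ((0.1, 0.5), -1),
        ((0.9, 0.8), 1), ((0.7, 0.6), 1), ((0.8, 0.9), 1)]
w, b = train_perceptron(data)
```

On nonlinearly separable data like that in the right panel, the inner loop never reaches zero errors, which is exactly the limitation that stalled the first wave of excitement.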

The next wave of excitement in neural networks came in the eighties with the development of backpropagation learning for multiple layer feedforward neural networks, often referred to as multiple layer perceptron (MLP) networks. These networks have many layers, known as hidden layers, between the leftmost and rightmost layers of neurons, as shown in the figure below. The backpropagation learning algorithm, using chain-rule computation of gradients, provided a way to adjust the interconnection weights in a layered network of neurons, which could then distinguish between nonlinearly separable classes of objects with high accuracy. Backpropagation learning soon found numerous applications in a variety of settings, including speech and character recognition. Most applications relied on one or two hidden layers, because training networks with many more layers proved problematic: the gradients computed at layers farther from the output layer tended to be either too small or too large, causing unstable learning. This behavior of gradient values at early layers has been termed the vanishing gradient problem. Consequently, focus shifted towards the engineering of novel features, for example HOG (Histogram of Oriented Gradients), MFCC (Mel-Frequency Cepstral Coefficients), SIFT (Scale Invariant Feature Transform), and SURF (Speeded Up Robust Features), to name a few, that could be applied to image and speech signals to extract useful indicators to be used in conjunction with a one- or two-hidden-layer neural network or some other suitable classifier for object and speech recognition tasks.

Multiple layer feedforward neural network
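The chain-rule mechanics of backpropagation can be sketched in a few lines of NumPy on a one-hidden-layer network trained on XOR, the classic nonlinearly separable problem a single perceptron cannot solve. The layer sizes, learning rate, and iteration count below are illustrative assumptions, not taken from any system described in the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: output is 1 exactly when the two inputs differ
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))      # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))      # hidden -> output weights
b2 = np.zeros(1)

lr = 1.0
losses = []
for _ in range(5000):
    # Forward pass through the layers
    h = sigmoid(X @ W1 + b1)      # hidden layer activations
    out = sigmoid(h @ W2 + b2)    # network output
    losses.append(float(np.mean((out - y) ** 2)))

    # Backward pass: the chain rule applied layer by layer
    d_out = (out - y) * out * (1 - out)    # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)     # gradient propagated back to the hidden layer

    # Gradient-descent weight updates
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)
```

Note how the hidden-layer gradient `d_h` is the output gradient multiplied back through `W2` and scaled by the sigmoid derivative `h * (1 - h)`; stacking many such multiplications is precisely where the vanishing gradient problem arises.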

The feature engineering approach was the dominant approach until recently, when deep learning techniques started demonstrating recognition performance better than that of carefully crafted feature detectors. Deep learning shifts the burden of feature design to the underlying learning system itself, alongside the classification learning typical of earlier multiple layer neural networks. From this perspective, a deep learning system is fully trainable from raw input, for example image pixels, to the final output of recognized objects. This is illustrated by the figure below, which shows the features learned by different layers for recognizing objects in ImageNet data. The leftmost panel shows 30 low-level features that are learned directly from image pixels. The middle panel shows the next level of features, learned from the features of the left panel. Similarly, the rightmost panel shows features learned from the previous layer. These features are reported in a paper by Zeiler and Fergus.

Learned features and their aggregation using ImageNet data

One might ask what led to the successful training of deep networks, i.e. multiple layer feedforward networks with many layers, in recent years. The answer lies in a few key changes made in the architecture and training of deep networks relative to previous work on multilayer networks. The first key change was the incorporation of shared weights and biases in each hidden layer, similar to the idea of applying local operators to images in computer vision, so that the output of each hidden neuron represents the result of a convolution operation on its inputs from the previous layer, hence the name convolutional neural networks. Sharing reduces the number of unknown parameters/weights for the learning task at hand. The second key component was the incorporation of better regularization techniques to further constrain potential solutions and avoid overfitting. Another key change from previous efforts was the use of activation functions such as the rectified linear function instead of sigmoidal neurons; this activation function has been shown not to exhibit vanishing gradient behavior, thus yielding stable learning. Of course, the easy availability of vast amounts of training data and of cheap computational power are also key factors in the success of deep learning.
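The activation-function point can be illustrated numerically. During backpropagation, the gradient picks up one factor of the activation's derivative per layer: the sigmoid's derivative is at most 0.25, so the product shrinks geometrically with depth, while the rectified linear function contributes a factor of exactly 1 for active units. The depth and pre-activation value below are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 30          # number of stacked layers (illustrative)
z = 0.5             # a representative pre-activation value (illustrative)

# Backpropagated gradient scale = product of per-layer activation derivatives
sig_grad = np.prod([sigmoid(z) * (1 - sigmoid(z))] * depth)   # <= 0.25 per layer
relu_grad = np.prod([1.0 if z > 0 else 0.0] * depth)          # exactly 1 per active layer

print(f"sigmoid path gradient after {depth} layers: {sig_grad:.3e}")
print(f"ReLU path gradient after {depth} layers: {relu_grad:.3e}")
```

After 30 sigmoid layers the gradient scale is vanishingly small, while the ReLU path keeps it at 1, which is why deep stacks of rectified linear units train stably.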

Feature Engineering and Feature Learning

While deep learning has yielded amazing results by learning features and feature hierarchies, that doesn't mean we should abandon feature engineering and dive fully into deep learning. There are learning tasks where feature engineering can produce a simpler model while matching or even outperforming deep learning, as the example described below shows.

The example consists of three raw measurements, the lengths of three line segments, used to predict whether the three lines can be arranged to form a triangle. I generated 300 labeled examples of length triplets for learning. A sampling of these examples is shown in the figure below. A value of 1 in the label field means the three lines of the given lengths can be arranged to form a triangle.

A sampling of learning examples
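The labeling rule behind such examples is the triangle inequality: three lengths form a triangle exactly when the longest is shorter than the sum of the other two. Here is a sketch of how a dataset like this could be generated; the sampling range and seed are my own assumptions, since the post does not say how the original 300 triplets were drawn.

```python
import random

def can_form_triangle(a, b, c):
    """Label rule: the longest side must be strictly shorter than the sum of the other two."""
    s = sorted([a, b, c])
    return int(s[0] + s[1] > s[2])

random.seed(42)  # hypothetical seed for reproducibility
examples = [(a, b, c, can_form_triangle(a, b, c))
            for a, b, c in ((random.uniform(1, 10),   # hypothetical length range
                             random.uniform(1, 10),
                             random.uniform(1, 10)) for _ in range(300))]
```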

To perform learning without feature engineering, the training data was used as given and was partitioned into training and test sets in a 70:30 ratio. Using the SPSS package's neural learning module, I trained multiple layer neural networks with varying numbers of neurons in one and two hidden layers. The best performance was given by a network with two hidden layers, four neurons in the first hidden layer and three in the second. Over multiple trials, this configuration yielded almost 100% accuracy, occasionally misclassifying one or two of the 300 examples.

Next, I repeated the learning task after defining three features. The first feature was the length of the longest line; the second was the length of the second longest line; the third was the length of the shortest line. With these features, I repeated the experiment. Not surprisingly, a single-neuron network was able to yield 100% accuracy. The weights learned by this neuron approximated the relationship that, to form a triangle, the length of the longest line must be smaller than the sum of the lengths of the other two. Thus, a bit of feature engineering not only gave a more understandable model but also performed better.
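The engineered features make the decision linear: with f1 the longest, f2 the second longest, and f3 the shortest length, a triangle exists exactly when f2 + f3 - f1 > 0, a rule a single linear neuron can express. The sketch below hand-sets weights that encode this relationship; they are illustrative, not the actual weights SPSS learned.

```python
def engineered_features(a, b, c):
    """Sort the lengths: f1 = longest, f2 = second longest, f3 = shortest."""
    f1, f2, f3 = sorted([a, b, c], reverse=True)
    return f1, f2, f3

def single_neuron(a, b, c, w=(-1.0, 1.0, 1.0), bias=0.0):
    """One linear unit on the engineered features: fires (outputs 1) exactly
    when f2 + f3 - f1 > 0, i.e. the triangle inequality as a linear rule."""
    f = engineered_features(a, b, c)
    return int(sum(wi * fi for wi, fi in zip(w, f)) + bias > 0)

print(single_neuron(3, 4, 5))   # 1: valid triangle
print(single_neuron(1, 2, 10))  # 0: the longest side is too long
```

In the raw input space the same decision boundary is nonlinear (the sorting depends on which input is largest), which is why the unengineered network needed two hidden layers to match it.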

Although the triangle example used here is pretty simple, it does illustrate the message that I want to convey: while deep learning, with its built-in feature learning, is amazing, there are applications where a good understanding of the problem domain can lead to features that yield a better model and higher accuracy.