Who hasn’t read or heard the story of the dying old man whose sons were always fighting? On his deathbed he called them all together and asked each to break a single stick, which the sons did easily, wondering what their old man was up to. Next, the old man tied the sticks into a bundle and asked each son to break it, teaching his sons the lesson of strength in unity. It turns out the lesson holds true for building classifiers too. Instead of devoting considerable effort and time to training a single classifier that performs with high accuracy, one can reach a high level of performance simply by training a large number of classifiers that each perform only slightly better than 50% accuracy, and then pooling their votes to arrive at a final decision. Such low-performance classifiers in an ensemble of classifiers are called weak learners. The following slide shows why a bunch of weak learners working together can yield a level of performance much higher than any individual weak learner’s.
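The arithmetic behind that slide can be sketched directly: if the weak learners err independently, the probability that a majority of them is wrong shrinks rapidly as the ensemble grows. A minimal sketch in Python (plain binomial computation, no library assumed beyond the standard `math` module):

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a majority of n independent learners,
    each correct with probability p, votes for the right class."""
    k_min = n // 2 + 1  # smallest number of votes that wins a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# 25 weak learners, each only 60% accurate: the majority vote is
# right roughly 85% of the time, far above any single learner.
print(round(majority_accuracy(25, 0.60), 3))
```

The key assumption here is independence of the errors, which is exactly why the next paragraph insists that the weak learners not be strongly correlated.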
One may ask whether there are any requirements on a bunch of weak learners that are going to work together. Well, these weak learners shouldn’t be strongly correlated; rather, they should complement each other. So the question is: how do we train such weak learners? Some of the possible training approaches are:
– Different algorithms for different learners
– Same algorithm but different parameters
– Different features for different learners
– Different data subsets for different learners
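Whichever of these approaches produces the learners, their decisions still have to be pooled. A minimal plurality-vote combiner, in plain Python with no particular library assumed (the learner predictions here are made up for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-learner predictions (one list per learner) into
    a final label for each example by plurality vote."""
    n_examples = len(predictions[0])
    final = []
    for i in range(n_examples):
        votes = Counter(learner[i] for learner in predictions)
        final.append(votes.most_common(1)[0][0])
    return final

# Three hypothetical learners, each wrong on a different example;
# because their mistakes don't overlap, the vote recovers every label.
preds = [
    [1, 0, 1, 1],   # learner A errs on example 4
    [1, 1, 1, 0],   # learner B errs on example 2
    [0, 0, 1, 0],   # learner C errs on example 1
]
print(majority_vote(preds))  # [1, 0, 1, 0]
```

Notice that if all three learners made their mistakes on the same examples (i.e., were highly correlated), voting would fix nothing.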
When we use different algorithms, for example Bayes, neural networks, support vector machines, etc., we end up with an ensemble of classifiers or learners where each learner is of a different type. Such an ensemble is known as a heterogeneous ensemble. We could also use learners of the same type but train them with different parameters to minimize the learners’ correlation. However, better independence between the learners is obtained when each learner uses a different subset of features, different subsets of training data, or both.

A way of choosing different data subsets for training different weak learners is bagging, which stands for bootstrap aggregating. When using bagging, it is a good idea to use the kinds of learners or classification models that are highly sensitive to changes in the training data, for example decision tree learners. That way, different subsets of training data will result in different decision trees that show low correlation. On the other hand, if you choose, say, the nearest neighbor classifier, then bagging will not produce good results: nearest neighbor is a stable learner, so changes in the training data may not produce appreciable changes in its predictions, yielding a highly correlated set of learners. Some ensemble methods try to train different learners with different data subsets as well as with different feature subsets. This is the approach used in the ensemble method known as Random Forests, where a collection of small decision trees is trained using different features and different data subsets.
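The bagging step itself can be sketched in a few lines. This is a bare-bones illustration, not the API of any particular package; `train_one` is a stand-in for whatever base learner you choose, and the "model" in the usage example is just the majority label of a sample:

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement; on average about
    63% of the originals appear, and the rest are duplicates."""
    return [rng.choice(data) for _ in data]

def bagging_train(data, train_one, n_learners=11, seed=0):
    """Fit each weak learner on its own bootstrap sample.
    `train_one` maps a dataset to a fitted model; unstable learners
    such as decision trees benefit most, as noted above."""
    rng = random.Random(seed)
    return [train_one(bootstrap_sample(data, rng))
            for _ in range(n_learners)]

# Toy usage: ten (x, label) pairs; each "model" is merely the
# rounded mean label of the bootstrap sample it was trained on.
data = [(x, int(x > 5)) for x in range(10)]
models = bagging_train(data, train_one=lambda sample:
                       round(sum(y for _, y in sample) / len(sample)))
```

Because each learner sees a different resample, unstable base learners end up genuinely different, which is what the voting step needs.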
There is another approach that is possible for combining a group of weak learners. This approach is known as boosting. In this approach, you first train a weak learner and check where it fails. To overcome those failures you train another learner, and then another, and so on. In this way, the successive learners pay increasing attention to training examples not correctly learned by the earlier learners. A systematic way to do this is the AdaBoost algorithm, widely available in machine learning and data mining packages.
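To make the reweighting idea concrete, here is a minimal from-scratch sketch of AdaBoost using one-dimensional decision stumps as the weak learners. All names are illustrative; this is a teaching sketch, not the interface of any real package:

```python
import math

def train_stump(xs, ys, weights):
    """Pick the (threshold, polarity) pair with the lowest weighted
    error. Labels are +1/-1; a stump predicts the sign of
    polarity * (x - threshold)."""
    best = None
    for thr in xs:
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (1 if pol * (x - thr) >= 0 else -1) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, rounds=5):
    """AdaBoost: after each round, reweight the examples so the next
    stump focuses on the mistakes of its predecessors."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, weights)
        err = max(err, 1e-10)                 # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Misclassified examples gain weight; correct ones lose it.
        weights = [w * math.exp(-alpha * y *
                                (1 if pol * (x - thr) >= 0 else -1))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps; alpha is each stump's say."""
    score = sum(alpha * (1 if pol * (x - thr) >= 0 else -1)
                for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1
```

As a usage note: on the labels [+1, -1, -1, +1] over x = 0..3, no single stump can be right everywhere, yet a few boosted rounds classify all four points correctly, because later stumps concentrate on the examples the earlier ones missed.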