Word Similarity, Google Distance, and Baseball and Cricket

Two words are considered similar if they mean the same thing or are used in the same context, for example teacher, students, and classroom. Words are also similar if one is a type of another, for example terrier, dog, and mammal. Measuring the similarity between words is needed in many text mining tasks.

Baseball and cricket are two games with lots of similarities. Both are bat-and-ball games, both are scored in runs, and in both the judging is done by umpires. Given these similarities, I thought it would be interesting to measure the conceptual or semantic similarity between the words “baseball” and “cricket” using the normalized Google distance (NGD). The NGD is an interesting measure of word similarity that relies on the page counts returned by the Google search engine. The logic behind it is that if two words have similar meanings, or are used in similar contexts, then they are likely to occur together on a large number of web pages. On the other hand, if a pair of words has nothing in common, then relatively few web pages will contain both. This idea was proposed by Rudi L. Cilibrasi and Paul M.B. Vitanyi in a paper in 2007. The NGD between two words, x and y, is computed using the following formula:

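$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}$$

Here f(x) and f(y) are the numbers of web pages containing x and y respectively, f(x, y) is the number of pages containing both words, and N is the total number of web pages indexed by the search engine.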

The total number of web pages can be estimated by searching for a very common word, for example “the”, which appears on nearly every English-language page. To calculate the NGD between a pair of words, all we have to do is use the Google search engine three times to get the respective page counts: once for each word and once for both words together. There is even a web-based calculator that one can use for this.
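For readers who prefer to compute the value themselves, here is a minimal Python sketch of the NGD formula. The function name `ngd`, its parameter names, and the page counts in the example are my own assumptions for illustration; the counts are made up and are not real search results.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google distance from page counts (hypothetical helper).

    fx, fy : number of pages containing word x and word y
    fxy    : number of pages containing both words
    n      : estimated total number of indexed pages
    """
    if min(fx, fy, fxy) <= 0:
        # With no pages (or no co-occurrence) the logarithm is undefined;
        # treating the distance as infinite is a choice made in this sketch.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Made-up counts, for illustration only.
example = ngd(fx=1_000_000, fy=500_000, fxy=100_000, n=10_000_000_000)
print(round(example, 4))
```

Two words that always appear together (fx = fy = fxy) get a distance of 0, and the distance grows as the pages containing both become rare relative to the pages containing each one.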

To put the similarity between “baseball” and “cricket” in perspective, I decided to pick another seven words related to different sports. These are: “soccer”, “football”, “basketball”, “hockey”, “skiing”, “surfing”, and “swimming”. Using the web-based NGD calculator, I obtained the following matrix of NGD values between the different word pairs.

            Baseball  Cricket  Soccer  Football  Basketball  Hockey  Skiing  Surfing  Swimming
Baseball        0.00     0.57    0.40      0.37        0.20    0.27    0.51     0.92      0.56
Cricket         0.57     0.00    1.75      1.24        1.63    1.15    1.26     0.87      1.47
Soccer          0.40     1.75    0.00      0.39        0.37    0.49    0.82     1.07      0.50
Football        0.37     1.24    0.39      0.00        0.32    0.42    0.62     0.70      0.38
Basketball      0.20     1.63    0.37      0.32        0.00    0.38    0.72     1.02      0.46
Hockey          0.27     1.15    0.49      0.42        0.38    0.00    0.71     0.98      0.85
Skiing          0.51     1.26    0.82      0.62        0.72    0.71    0.00     0.49      0.61
Surfing         0.92     0.87    1.07      0.70        1.02    0.98    0.49     0.00      0.68
Swimming        0.56     1.47    0.50      0.38        0.46    0.85    0.61     0.68      0.00

Using R, I clustered these words with single-link clustering, treating the NGD values as distances. The resulting dendrogram is shown below.

[Figure: dendrogram from single-link clustering of the nine sports words, based on the NGD matrix]
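I did the clustering in R, but the merge order is easy to reproduce. The following is a rough Python sketch of naive single-link agglomerative clustering on the matrix above; it is not the R code I used, and the function name `single_link` and its return format are assumptions made for this illustration.

```python
words = ["baseball", "cricket", "soccer", "football", "basketball",
         "hockey", "skiing", "surfing", "swimming"]

# NGD matrix from the table above, rows and columns in the order of `words`.
D = [
    [0.00, 0.57, 0.40, 0.37, 0.20, 0.27, 0.51, 0.92, 0.56],
    [0.57, 0.00, 1.75, 1.24, 1.63, 1.15, 1.26, 0.87, 1.47],
    [0.40, 1.75, 0.00, 0.39, 0.37, 0.49, 0.82, 1.07, 0.50],
    [0.37, 1.24, 0.39, 0.00, 0.32, 0.42, 0.62, 0.70, 0.38],
    [0.20, 1.63, 0.37, 0.32, 0.00, 0.38, 0.72, 1.02, 0.46],
    [0.27, 1.15, 0.49, 0.42, 0.38, 0.00, 0.71, 0.98, 0.85],
    [0.51, 1.26, 0.82, 0.62, 0.72, 0.71, 0.00, 0.49, 0.61],
    [0.92, 0.87, 1.07, 0.70, 1.02, 0.98, 0.49, 0.00, 0.68],
    [0.56, 1.47, 0.50, 0.38, 0.46, 0.85, 0.61, 0.68, 0.00],
]

def single_link(labels, dist):
    """Naive single-link clustering (hypothetical helper).

    Repeatedly merges the two clusters whose closest members are nearest.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    index = {label: i for i, label in enumerate(labels)}
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster distance is the smallest pairwise distance.
                d = min(dist[index[a]][index[b]]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = single_link(words, D)
for a, b, d in merges:
    print(f"{d:.2f}  {' + '.join(a)}  |  {' + '.join(b)}")
```

In this sketch the first merge is baseball with basketball at 0.20, and cricket is the last word to join, at 0.57 through its link to baseball.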

The clustering produces some interesting results. It does not place baseball and cricket close to each other; instead, baseball and basketball are shown to be the most similar. It also shows hockey, football, and soccer to be close to each other. One explanation for why baseball and basketball are deemed most similar on the basis of the NGD is that both are predominantly North American games: a very large number of websites in the USA are devoted to sports, and newspapers there routinely carry stories about both. On the other hand, cricket-playing countries, for example Australia, England, India, Pakistan, and Sri Lanka, play little baseball, and thus websites from these countries are not likely to mention both baseball and cricket on the same page.

So what can we say about NGD from this simple exercise? Well, the NGD appears to have a hidden geographical component, and there might be an application where this could be exploited.