Many applications require representing text documents through a set of numerical measures. Once computed such measures can be used for a multitude of purposes. For example, we can use such measures for:
1. Finding similarities between different authors or even for establishing the authorship of a document of unknown authorship.
2. Classifying documents into different categories.
3. Grouping documents based on their content measured in terms of frequencies of different words present in different documents.
In this blog post, I will show how to use Excel’s text processing functions to compute three basic measures that are part of the group of measures used for authorship identification. The first measure is the average word size in terms of number of characters per word indicating the preference for simple or complex words. The second measure is the average sentence length measured in words because some writers use simple, short sentences and some have a tendency to use complex, long sentences.The third measure relates to the frequencies of function words, common words (articles, prepositions, pronouns etc.) that do not carry any semantic or content information but their usage pattern varies from author to author. The frequencies of function words have been found to give a good discrimination between different writing styles.
To see how Excel’s text functions can be used for computing the desired measures, let us assume we have copied and pasted a sample writing document for analysis in cell A2. We will first try to count the number of words in the sample text in cell A2. Recall that the LEN function gives us the number of characters in a string and the TRIM function removes all excess spaces. We can thus measure the length of the sample text in cell A2 by the following formula:
We know that after trimming the words in the text are separated by single spaces. Thus, if we were to get the length of the text after removing spaces, we can find the number of words in the text. Therefore, we use the following formula to measure the text length after removing the spaces:
In the above formula, we are using the SUBSTITUTE function to remove all spaces. The difference between the above two lengths incremented by 1 then gives us the count of the number of words in the text.
Next, let us try to find the number of sentences in the text. Here, we use the same idea that we used for finding words. We will remove all periods (.) from the text using the SUBSTITUTE function and then measure the length. The difference in two lengths will tell us the number of periods and hence the number of sentences. Thus, we use the following formula to count the number of sentences:
=LEN(TRIM(A2)) – LEN(SUBSTITUTE(TRIM(A2), “.”,””))
Note that the above formula will miss those sentences that end with a question mark(?) or with an exclamation point(!). Thus, we should look for question mark and exclamation point as well to get a better count. One way to do this is to use the following formula:
=LEN(TRIM(A2)) – LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(TRIM(A2),”.”,””), “?”,””), “!”,””))
Now that we have the number of sentences, the total number of words, and the total number of characters, we can easily get the average words per sentence and characters per word.
As an example, consider the following text in cell A2:
At every moment you choose yourself. But do you choose your self? Body and soul contain a thousand possibilities out of which you can build many I’s. But in only one of them is there a congruence of the elector and the elected. Only one – which you will never find until you have excluded all those superficial and fleeting possibilities of being and doing with which you toy, out of curiosity or wonder or greed, and which hinder you from casting anchor in the experience of the mystery in life, and the consciousness of the talent entrusted to you which is your I.
Using the above formulas, we get the following numbers:
Length after trimming (Number of characters in the text): 566
Length after deleting space occurrences: 463
Number of word: 566 – 463 + 1 = 104
Average word size: 463/104 = 4.45
Number of sentences: 5
Average sentence length in words: 104/5 =20.8
I should mention here that the above formula for counting the number of sentences is still not perfect; it will occasionally return an over count. For example, consider the following sentence:
Books by Mr. Excel are very popular.
The above formula will return the number of sentences as 2. Similarly, the formula will return an over count for the following sentence:
Article 2.3 applies in such cases.
Despite of occasional over count, the above formula for counting the number of sentences works well.
The SUBSTITUTE function is again useful to count the frequencies of occurrences of different function words. Let us say, we are looking for function word “and” to count how many times it is present in the sample text. The following formula will give us the desired count:
=(LEN(TRIM(A2))-LEN(SUBSTITUTE(TRIM(A2),” and “,””)))/5
Note that the pattern searched by the SUBSTITUTE function has the string “and” preceded and followed by space. This is to make sure we do not count the string “and” when it is part of a word, for example in “land”. The division by 5 is being made because each occurrence of the string “ and “ will be removed by the SUBSTITUTE function resulting in the total string length decreasing by 5. In a similar fashion, we can compute frequencies of occurrences for any set of function words.
The approach described here for representing text through a set of numerical measures is known as stylometry. For those of you, whose interest is piqued by this blog, I suggest reading the following article:
[This blog post also appears on my other blog ExcelVBAComputing: Learning Excel via Problem Solving]