An overwhelming amount of data exists as unstructured text. Such data must be processed before it is useful for machine learning and pattern discovery. One such processing step is extracting all mentions of predefined entity types, for example persons, organizations, locations, and dates. This is part of what is known as information extraction, and the particular task of extracting predefined entities is called named entity recognition (NER). NER has many applications. One example is running it on financial news streams to update a database with information about organizational changes. Another is building a question-answering system about persons, locations, events, and organizations. NER is also useful in finding relationships between different entities of interest.
A simple way to view named entity recognition is as a method that identifies proper names in text and then classifies them into predefined classes or categories. While person, location, and organization are the universal categories of interest, specific applications may require a different set of categories. For example, a medical application might require the names of drugs and disease symptoms. The machine learning approach to NER uses a training corpus in which the entities are labeled. A tokenizer first segments the text into tokens: words, numbers, and punctuation marks. This is followed by feature extraction. The features for a word are typically designed to reflect its local context, for example the k neighboring words appearing before and after it and their respective part-of-speech tags. The choice of features largely determines the accuracy of the NER system.
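To make the tokenization and local-context idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular NER system (including the Azure module) builds its features: the function names `tokenize` and `window_features`, the `<PAD>` marker, and the example sentence are all my own assumptions, and part-of-speech tags are left out because they would need a separate tagger.

```python
import re

def tokenize(text):
    # Split text into words/numbers (\w+) and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def window_features(tokens, i, k=2):
    # Features for token i: the token itself plus the k words on each side.
    # "<PAD>" is a hypothetical marker for positions outside the sentence.
    feats = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
    }
    for d in range(1, k + 1):
        feats[f"prev_{d}"] = tokens[i - d].lower() if i - d >= 0 else "<PAD>"
        feats[f"next_{d}"] = tokens[i + d].lower() if i + d < len(tokens) else "<PAD>"
    return feats

# Illustrative sentence, not taken from the data used in this post.
tokens = tokenize("Booz Allen Hamilton hired 500 people in Virginia.")
features = [window_features(tokens, i) for i in range(len(tokens))]
```

Each entry of `features` is a dictionary that a sequence classifier could consume, one per token. A real system would add many more features, such as part-of-speech tags and word shapes.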
In my last blog post on Azure ML Studio, I mentioned that text analytics is an important part of the Studio, so I wanted to try the text analytics modules. This post is a summary of my experience using the NER module. The first thing I noticed was that there is very little information on the module, as the screenshot of the NER help page shows.
Despite the scant documentation, I ran the NER module with the sample data set included in the Studio for this task connected to the Story input port, and left the CustomResources port unconnected. My assumption was that the NER module is already trained, so there is no need to supply anything to the second input port. I was pleasantly surprised that the module ran and produced the correct output. Next, I added a few more sentences to the NER sample data file by copying the opening paragraphs of some stories from the New York Times and the Indian Express. This also produced correct results for each of the stories. Satisfied that the NER module was working well, I wanted to see how it would do on tweets, where the sentences are short and informal.
I downloaded the tweet data from Sentiment 140. Although the data was originally collected for sentiment analysis, it can be used for NER because many tweets refer to persons, organizations, and locations. The tweets are in a CSV file, and I used just 500 of them to see how well the NER module would perform. A screenshot of the first several lines of the tweet data file is shown below.
After uploading the data, I set up the experiment as shown below. The uploaded tweets file is named testdata.manual.2009.06.14.csv. The Project Columns module selects Col. 6 of the data file, because that column holds the tweet text.
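For readers who want to do the same column selection outside the Studio, the step can be approximated in a few lines of Python. This is a rough sketch and not what the Project Columns module does internally. The helper `select_column` is a hypothetical name, and the two sample rows are made up for illustration; the only thing taken from the experiment is that the tweet text sits in the sixth column and that 500 rows are used.

```python
import csv
import io
from itertools import islice

def select_column(lines, col_number=6, limit=500):
    # Read at most `limit` CSV rows and keep only the given 1-based column.
    reader = csv.reader(lines)
    return [row[col_number - 1]
            for row in islice(reader, limit)
            if len(row) >= col_number]

# Made-up rows with six quoted fields each; the last field is the tweet text.
sample = io.StringIO(
    '"4","3","Mon May 11 03:17:40 UTC 2009","kindle2","user_a","Loving my new Kindle, it is great"\n'
    '"0","4","Mon May 11 03:18:03 UTC 2009","aig","user_b","Not happy with AIG today"\n'
)
tweets = select_column(sample)
```

With the real file, the `io.StringIO` object would be replaced by an open file handle, for example `open("testdata.manual.2009.06.14.csv", newline="")`, possibly with an explicit encoding.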
After running the experiment, I visualized the output of the NER module to see how well it did on tweets. Here is a screenshot of that visualization for the first few tweets. The first column is the tweet number, beginning with 0. The second column is the recognized entity. The third column shows the entity’s position in the tweet text string. The fourth column is the number of characters in the entity string, and the fifth column indicates the entity type. Some of the errors made by the NER module have been highlighted. For example, the string “Ok” is incorrectly taken to mean Oklahoma. In tweet number 12, the system fails to recognize “espn” as an organization. Similarly, “Visa” is missed as an organization. Although “Booz Allen Hamilton” should have been recognized as an organization, it is classified as a person. These errors show the difficulty of the NER task, especially when dealing with the short, informal text strings found in tweets.
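The five-column output described above can be read as a list of (tweet number, entity, position, length, type) records. The sketch below shows one way to hold such records in Python and to check that the position and length really point at the entity string. Everything in it is an assumption for illustration: the `Mention` type, the `span_matches` helper, the example tweets, the type codes "LOC" and "ORG", and the premise that the position is a zero-based character offset. None of it is taken from the actual module output.

```python
from typing import NamedTuple

class Mention(NamedTuple):
    tweet_index: int   # column 1: tweet number, starting at 0
    entity: str        # column 2: recognized entity string
    offset: int        # column 3: position in the tweet (assumed zero-based)
    length: int        # column 4: number of characters in the entity
    entity_type: str   # column 5: entity type (codes here are hypothetical)

def span_matches(tweets, m):
    # True if slicing the tweet at (offset, length) yields the entity string.
    text = tweets[m.tweet_index]
    return text[m.offset:m.offset + m.length] == m.entity

# Made-up tweets and records, loosely modeled on the kind of errors seen.
tweets = ["Ok, heading to Seattle tomorrow", "watching espn tonight"]
rows = [
    Mention(0, "Ok", 0, 2, "LOC"),        # a false positive like the "Ok" case
    Mention(0, "Seattle", 15, 7, "LOC"),
]
all_ok = all(span_matches(tweets, m) for m in rows)
tweets_without_mentions = sorted(set(range(len(tweets))) - {m.tweet_index for m in rows})
```

Listing the tweets that produced no records at all, as `tweets_without_mentions` does, is a quick way to find candidates for missed entities such as “espn”.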
To see how Azure ML Studio compares with other similar recognizers, I fed the first 28 tweets to the Stanford Named Entity Tagger. A screenshot of the results is shown below.
You can judge for yourself where the Azure ML Studio NER stands. As always, feel free to comment and let me know if you have any questions.