Information Retrieval Vs. Data Retrieval

Information retrieval (IR) is a field of study dealing with the representation, storage, and organization of documents, and with access to them. The documents may be books, reports, pictures, videos, web pages or multimedia files. The whole point of an IR system is to give a user easy access to documents containing the desired information. The best-known example of an IR system is the Google search engine.

An information retrieval system differs from a data retrieval system in the following ways:

– IR deals with unstructured or semi-structured data, while a data retrieval system (a database management system, or DBMS) deals with structured data with well-defined semantics.

– Querying a DBMS produces exact results, or no results at all if no exact match is found.

– Querying an IR system produces multiple ranked results, and partial matches are allowed.
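The contrast between exact and ranked partial matching can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy `records` and `documents`, the function names, and the scoring rule (a simple count of shared terms, which is far cruder than what a real search engine uses).

```python
# Toy data; all keys, names, and texts are illustrative assumptions.
records = {"emp-1001": "Alice", "emp-1002": "Bob"}

documents = {
    "d1": "database systems store structured records",
    "d2": "search engines rank web pages by relevance",
    "d3": "ranking web documents for a search query",
}


def data_retrieval(key):
    """Exact match, DBMS style: return the record, or None if the key is absent."""
    return records.get(key)


def information_retrieval(query):
    """Partial match, IR style: score documents by shared terms and rank them."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:  # a document matching only some terms still qualifies
            scored.append((doc_id, overlap))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


print(data_retrieval("emp-1003"))                   # no exact match, so None
print(information_retrieval("web search ranking"))  # ranked list of partial matches
```

The lookup either finds the record or returns nothing, while the query returns every document that shares at least one term with it, ordered by how many terms match.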

Document Characterization

Three kinds of characteristics are associated with a document.

Metadata characterization

This kind of characterization refers to ownership, authorship and other items of information about a document. The Library of Congress subject coding is also an example of metadata. Another example of metadata is the category headings at the Internet search engine Yahoo. To standardize category headings, many areas use specific ontologies, which are hierarchical taxonomies of terms describing certain knowledge topics.

Presentation Characterization

This refers to attributes that control the formatting or presentation of a document.

Content Characterization

This refers to attributes that denote the semantic content of a document. Content characterization is of primary interest in IR. The common practice in IR is to represent a textual document by a set of keywords called index terms or simply terms. An index term is a word or a phrase in a document whose semantics give an indication of the document’s theme. Index terms are mainly nouns, because nouns carry meaning by themselves.
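As a rough sketch of how index terms might be extracted from a text, the snippet below keeps the most frequent words that are not on a small stopword list. The stopword list, the `index_terms` function, and the `top_k` parameter are assumptions made for this example; filtering stopwords is only a crude stand-in for selecting nouns, which would require a part-of-speech tagger.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {
    "a", "an", "and", "are", "by", "for", "in", "is", "of", "on",
    "or", "that", "the", "to", "with",
}


def index_terms(text, top_k=5):
    """Return up to top_k candidate index terms, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]


sample = "The retrieval system ranks documents. Documents are indexed by the retrieval system."
print(index_terms(sample, top_k=3))
```

The document is then represented by the returned list of terms rather than by its full text.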

The same concept can be applied to images and multimedia documents, characterizing them in terms of words using the aptly named Bag of Words (BoW) representation.

What counts as a word in an image, however, is a tricky question. One way to define such words is through vector quantization, which yields a set of code words (a code word is simply an image patch of a certain size) that can be used to assemble given images. Once images are characterized in terms of code words, attributes such as the frequencies and joint frequencies of code word occurrences can be computed and used to sort images into meaningful groups. An example of one such grouping is shown in the figure below, where the grouping of pictures is shown in the form of a graph.
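The idea of code words can be made concrete with a minimal sketch. It assumes that each image patch has already been flattened into a list of pixel intensities, and it uses plain k-means as the vector quantizer, which is one common choice but not necessarily the one behind the figure. The function names, the toy patches, and the evenly spaced initialization are all assumptions made to keep the example short and deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(patch, codebook):
    """Index of the code word closest to the given patch."""
    return min(range(len(codebook)), key=lambda i: squared_distance(patch, codebook[i]))


def build_codebook(patches, k, iterations=10):
    """Vector quantization via plain k-means; returns k code words."""
    # Simplification: start from evenly spaced patches instead of random ones.
    codebook = [list(patches[i * len(patches) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for patch in patches:
            clusters[nearest(patch, codebook)].append(patch)
        for i, members in enumerate(clusters):
            if members:  # keep the old code word if its cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def bag_of_words(image_patches, codebook):
    """Histogram of code word occurrences for one image."""
    histogram = [0] * len(codebook)
    for patch in image_patches:
        histogram[nearest(patch, codebook)] += 1
    return histogram


# Toy 2x2 patches flattened to 4 values: three dark, three bright.
training_patches = [
    [0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1],
    [9, 9, 9, 9], [10, 9, 10, 9], [9, 10, 9, 10],
]
codebook = build_codebook(training_patches, k=2)
print(bag_of_words([[0, 0, 0, 0], [1, 0, 1, 0], [9, 9, 9, 9]], codebook))
```

Each image is reduced to a histogram of code word counts, and images with similar histograms can then be grouped together.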