The most common kind of search that computer users perform every day is the Boolean query: the user submits one or more terms to a search engine, which applies a Boolean algorithm to find the documents containing those terms, supported by an index of all terms in the database. This simple form of Boolean query, which can be implemented efficiently over large databases, suffers from several limitations: the number of retrieved documents is typically prohibitively large, and a substantial part of the retrieved documents is irrelevant to the user's information need.
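As a rough illustration of the index-backed Boolean query described above, the following Python sketch builds a small inverted index and answers AND and OR queries against it. The function names (build_index, boolean_and, boolean_or), the whitespace tokenization, and the sample documents are assumptions made for this example; they are not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: terms are lowercase, whitespace-separated words.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


def boolean_or(index, terms):
    """Return the ids of documents that contain at least one query term."""
    result = set()
    for term in terms:
        result |= index.get(term.lower(), set())
    return result


# Hypothetical three-document database.
docs = {
    1: "packet switching networks",
    2: "boolean query over an index",
    3: "vector space query model",
}
index = build_index(docs)
print(boolean_and(index, ["query", "index"]))
print(boolean_or(index, ["packet", "vector"]))
```

Note that the result is an unordered set: every document either matches or does not, with no ranking among the matches. This is one reason a Boolean query over a large database can return far more documents than a user can inspect.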
A widely used alternative to the Boolean query is the similarity query, which is typically based on the vector-space model. Under this model, documents are viewed as (algebraic) vectors over terms. A query, q, may consist of many terms, and may even comprise a complete document. It too is treated as a body of text rather than merely as a combination of search terms, and it is represented as a vector as well. The retrieval task then reduces to searching the database for the document vectors that are most similar to the query vector. Other approaches based on the vector-space model also aim to reduce the dependency of the retrieved documents on the particular choice of query terms, and so to improve retrieval. One way to do this is to reweight the query terms, so that terms occurring in relevant documents receive a higher weight than those occurring in irrelevant ones. This process is called relevance feedback.
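The similarity query and the reweighting step can be sketched in Python as follows. The sketch assumes raw term counts as vector weights and cosine similarity as the similarity measure, and it uses a Rocchio-style update for the reweighting; the text above does not fix any of these choices, and the names to_vector, cosine, rank and reweight, as well as the beta and gamma values, are illustrative only.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a term -> count vector (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


def reweight(query_vec, relevant, irrelevant, beta=0.75, gamma=0.25):
    """Rocchio-style feedback: raise the weight of terms found in relevant
    documents and lower the weight of terms found in irrelevant ones."""
    new_q = dict(query_vec)
    for vecs, signed_coeff in ((relevant, beta), (irrelevant, -gamma)):
        for vec in vecs:
            for term, w in vec.items():
                new_q[term] = new_q.get(term, 0.0) + signed_coeff * w / len(vecs)
    # Terms whose weight falls to zero or below are dropped from the query.
    return {t: w for t, w in new_q.items() if w > 0}


# Hypothetical database and query.
doc_vecs = {
    "d1": to_vector("boolean query index"),
    "d2": to_vector("vector space model query"),
    "d3": to_vector("packet switching"),
}
q = to_vector("vector query")
print(rank(q, doc_vecs))

# Suppose the user marks d2 as relevant and d1 as irrelevant.
q2 = reweight(q, [doc_vecs["d2"]], [doc_vecs["d1"]])
print(rank(q2, doc_vecs))
```

Unlike the Boolean case, every document receives a score, so the results come back ranked, and the reweighted query carries terms (here "space" and "model") that the user never typed.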
1c. Text categorization
A task often addressed by information retrieval systems is that of text categorization: the labeling of text with category tags drawn from a predefined set of categories. There are two approaches to text categorization. The first is the knowledge-engineering approach, in which an engineer, consulting an expert who knows the information in the database, encodes a set of rules that determine how the data is categorized. The drawback of this method is that the rules must be continually revised to keep up with the information in the database, which results in what is known as the knowledge engineering bottleneck. The other is the machine learning (ML) approach, in which a text classifier is viewed as a function learned by an inductive process from a training set of example documents that have already been classified into the predefined set of categories. ML-based classification is divided into two types: hard and soft classification. Under hard classification, a document is strictly assigned to a single category. In contrast, soft classification entails ranking the categories by relevance for each document. Under this approach, the classifier returns, for each category, a number between 0 and 1 (called the categorization status value, CSV).1
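The difference between hard and soft classification can be sketched in Python. The example below assumes a very simple learner chosen only for illustration: each category is summarized by the summed term counts of its training documents, and the CSV of a category is the cosine similarity between the document and that summary, which lies between 0 and 1 for non-negative counts. The names train, soft_classify and hard_classify and the training data are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a term -> count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity, clamped to at most 1.0 to absorb rounding error."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return min(1.0, dot / (norm_u * norm_v))


def train(training_set):
    """Learn one summed term vector per category from (text, category) pairs."""
    profiles = {}
    for text, category in training_set:
        profiles.setdefault(category, Counter()).update(to_vector(text))
    return profiles


def soft_classify(profiles, text):
    """Soft classification: rank all categories by their CSV for this text."""
    vec = to_vector(text)
    scores = [(category, cosine(vec, profile)) for category, profile in profiles.items()]
    return sorted(scores, key=lambda s: (-s[1], s[0]))


def hard_classify(profiles, text):
    """Hard classification: assign the single top-ranked category."""
    return soft_classify(profiles, text)[0][0]


# Hypothetical pre-classified training documents.
training_set = [
    ("goal match team", "sports"),
    ("team wins match", "sports"),
    ("stock market price", "finance"),
    ("market price falls", "finance"),
]
profiles = train(training_set)
print(soft_classify(profiles, "the team scored a goal"))
print(hard_classify(profiles, "the team scored a goal"))
```

Here hard classification is obtained by keeping only the top-ranked category from the soft ranking; a system could equally apply a threshold to the CSV, a choice that depends on the application.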
History of the internet
To touch on this subject only briefly: J.C.R. Licklider of MIT wrote a series of memos in which he envisioned what he called the "Galactic Net." Leonard Kleinrock of MIT published the first paper on packet switching