DSIR model
DSIR ("Distributional Semantics based Information Retrieval") is a retrieval model based on the vector space model. In this model, the contents of retrievable objects, such as words, phrases, sentences, and documents, are represented uniformly as vectors in a multi-dimensional space. These vectors are derived from a co-occurrence matrix computed on the collection of text documents being indexed. Semantic proximity between objects is then interpreted as the geometric proximity between their corresponding vectors in this multi-dimensional space, called the "meaning space".
Definitions
The model assumes that there is a correlation between the meaning of a word and its observable distributional characteristics within particular contexts in a given language. These distributional characteristics can be either the "occurrences" of the word itself or its "co-occurrences" with the other words appearing in the documents.
Word contexts are characterized on the basis of the "co-occurrence statistic", a source of distributional information that is easily extracted from a document collection. The co-occurrence statistic of a word with respect to one of its neighbours is the number of times the two co-occur within a pre-defined boundary, the "distributional environment", such as a sentence, a paragraph, a section, a whole document, or a window of k words.
A co-occurrence matrix M can then be built over each word i and each neighbour j, where M_{i,j} is the co-occurrence statistic of i with j. With n neighbours, the vector representation of a word i is the corresponding row of the matrix: v_i = (M_{i,1}, M_{i,2}, ..., M_{i,n}).
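As an illustration of the construction above, the following minimal sketch counts co-occurrences using whole sentences as the distributional environment. The function name `cooccurrence_matrix`, the toy sentences, and the choice to count every pair of distinct words in a sentence symmetrically are assumptions made for the example, not details taken from the DSIR system itself.

```python
from itertools import combinations


def cooccurrence_matrix(environments):
    """Build a symmetric word-by-word co-occurrence matrix.

    `environments` is a list of token lists; each list is one
    distributional environment (here: one sentence). This counting
    scheme is an illustrative assumption.
    """
    vocab = sorted({word for env in environments for word in env})
    index = {word: k for k, word in enumerate(vocab)}
    size = len(vocab)
    matrix = [[0] * size for _ in range(size)]
    for env in environments:
        # Every pair of token positions in the same environment co-occurs once.
        for a, b in combinations(env, 2):
            if a != b:  # ignore a word co-occurring with itself
                matrix[index[a]][index[b]] += 1
                matrix[index[b]][index[a]] += 1
    return vocab, matrix


# Toy collection (hypothetical data).
sentences = [
    ["cat", "eats", "fish"],
    ["dog", "eats", "meat"],
    ["cat", "chases", "dog"],
]
vocab, M = cooccurrence_matrix(sentences)

# The vector of a word is its row in the matrix.
cat_vector = M[vocab.index("cat")]
eats_vector = M[vocab.index("eats")]
```

Swapping sentences for paragraphs or windows of k words only changes how `environments` is produced; the counting step stays the same.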
A document vector V_d is derived from the vectors of the words the document contains, for example by combining them as a weighted sum. A query is represented in the same way, as in the vector space model, and a similarity function such as the cosine is used to compare each document with the query.
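The comparison step can be sketched as follows, assuming a document (or query) vector is the plain unweighted sum of the vectors of its words, which is only one possible reading of the description above. The helper names `document_vector` and `cosine`, and the hand-made three-dimensional word vectors, are hypothetical and chosen purely for illustration.

```python
import math


def document_vector(words, word_vectors):
    """Sum the vectors of the known words in `words` (unweighted: an assumption)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        if word in word_vectors:  # skip words with no vector
            for k, value in enumerate(word_vectors[word]):
                total[k] += value
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy word vectors (hypothetical values standing in for co-occurrence rows).
word_vectors = {
    "cat": [2, 0, 1],
    "fish": [2, 0, 0],
    "car": [0, 3, 0],
    "road": [0, 2, 1],
}

documents = {
    "d1": ["cat", "fish"],
    "d2": ["car", "road"],
}
query = ["cat"]

query_vec = document_vector(query, word_vectors)
scores = {
    name: cosine(query_vec, document_vector(words, word_vectors))
    for name, words in documents.items()
}
# Rank documents by decreasing similarity to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting the query shares most of its direction with `d1`, so `d1` is ranked first; a weighted sum would only change how `document_vector` accumulates each word.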
References
- {{ cite | title=Parallel DSIR Text Indexing System: Using Multiple Master/Slave Concept | url=http://www.springerlink.com/content/ebja71gh68u4q5b0/ | author=A. Rungsawang | coauthors=P. Laohawee, A. Tangpong | year=2000 }}