DSIR model

DSIR stands for "Distributional Semantics based Information Retrieval". It is a retrieval model based on the vector space model, in which the contents of retrievable objects, such as words, phrases, sentences, and documents, are represented uniformly as multi-dimensional vectors. These vectors are derived from a co-occurrence matrix computed over the collection of text documents being indexed. Semantic proximity between objects is then interpreted as geometric proximity between the corresponding vectors in this multi-dimensional space, called the "meaning space".

Definitions

DSIR assumes that there is a correlation between the meaning of a word and its observable distributional characteristics in particular contexts of a given language. These distributional characteristics can be either "occurrences" of the word itself or its "co-occurrences" with other words appearing in the documents.

Word contexts are characterized by the "co-occurrence statistic", a source of distributional information that is easily extracted from a document collection. The co-occurrence statistic of a word is the number of times it co-occurs with one of its neighbours within a pre-defined boundary, the "distributional environment", such as a sentence, paragraph, section, whole document, or window of k words.

We can then build a co-occurrence matrix M, where the entry M_i,j is the co-occurrence statistic of word i with neighbour j. The vector representation of word i is the corresponding row of the matrix, v_i = (M_i,1, M_i,2, ..., M_i,n), where n is the number of neighbour words considered.
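As an illustrative sketch (not the original DSIR implementation), the matrix M can be built by scanning each document with a window of k words as the distributional environment; the toy corpus, whitespace tokenizer, and window size here are assumptions for the example:

```python
from collections import defaultdict

def cooccurrence_matrix(documents, k=2):
    """For each word i and neighbour j, count how often j appears
    within k words of i (the distributional environment)."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        tokens = doc.lower().split()
        for pos, word in enumerate(tokens):
            lo, hi = max(0, pos - k), min(len(tokens), pos + k + 1)
            # every token in the window except the word itself is a neighbour
            for other in tokens[lo:pos] + tokens[pos + 1:hi]:
                counts[word][other] += 1
    return counts

docs = ["the cat sat on the mat", "the dog sat on the rug"]
M = cooccurrence_matrix(docs, k=2)
# M["sat"]["on"] is the co-occurrence statistic of "sat" with "on"
```

Row M[i] then gives the coordinates of the word vector v_i against the neighbour vocabulary.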

A document d is in turn represented by a vector V_d built from the vectors of the words it contains. A query is represented in the same way as in the vector space model, and a similarity function such as the cosine is used to compare the document with the query.
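A minimal sketch of the retrieval step, under the assumption (not stated precisely in the text) that a document vector is the component-wise sum of its word vectors; the toy matrix M and vocabulary are illustrative:

```python
import math

# toy co-occurrence counts: rows are words, columns the neighbour vocabulary
vocab = ["the", "sat", "on", "mat", "rug"]
M = {
    "cat": {"the": 2, "sat": 1},
    "dog": {"the": 2, "sat": 1},
    "mat": {"the": 1, "on": 1},
}

def word_vector(w):
    # row of M for word w, in a fixed column order
    return [M.get(w, {}).get(j, 0) for j in vocab]

def doc_vector(tokens):
    # sum the word vectors component-wise (one common construction)
    v = [0] * len(vocab)
    for w in tokens:
        for i, x in enumerate(word_vector(w)):
            v[i] += x
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

d = doc_vector(["cat", "mat"])
q = doc_vector(["dog"])   # the query is represented exactly like a document
score = cosine(d, q)
```

Ranking then amounts to sorting documents by their cosine score against the query vector.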
