
Stanford NLP (Coursera) Notes (17) - Introduction to Ranked Retrieval


The problem with Boolean search is "feast or famine": a query tends to return either far too many results or far too few. A basis for scoring is needed to rank the search results.

1. Jaccard Coefficient

In terms of set theory, it is “intersection” over “union”:

\[
Jaccard(A, B) = \frac{|A\cap B|}{|A\cup B|}.
\]

Note that \(Jaccard(A, A) = 1\), and \(Jaccard(A, B) = 0\) if \(A\) and \(B\) are disjoint sets.
Issues to notice:

  • term frequencies are not considered
  • it does not work well for rare terms

Usually a normalization can be applied: using \(\sqrt{|A\cup B|}\) instead of \(|A\cup B|\).
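
As a quick illustration, here is a minimal sketch of the Jaccard score in Python, including the square-root variant (the function name `jaccard` and the example strings are my own, not from the course):

```python
from math import sqrt

def jaccard(a, b, normalized=False):
    """Jaccard coefficient of two term sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0
    # Optional variant: divide by sqrt(|A ∪ B|) instead of |A ∪ B|.
    denom = sqrt(union) if normalized else union
    return len(a & b) / denom

query = "ides of march".split()
doc = "caesar died in march".split()
print(jaccard(query, doc))  # 1 shared term / 6 distinct terms ~ 0.167
```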

2. Term Frequency Weight

Recall the term-document matrix from the previous section: the word count \(TF(t, d)\) of term \(t\) in document \(d\) can be used instead of a binary value. This is also called the Bag-of-Words model, where word orderings are ignored (a big issue). Another issue is that relevance doesn't increase proportionally with term frequency.

A variation is adding the log scale:

\[
W(t, d) = \begin{cases} 1 + \log_{10} TF(t, d), & \text{if } TF(t, d) > 0, \\ 0, & \text{otherwise}. \end{cases}
\]

The score for a query-document pair is obtained by summing the weights over the terms \(t\) that appear in both:

\[
Score = \sum_{t\in q\cap d}\left(1 + \log_{10}TF(t, d)\right)
\]
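
A minimal sketch in Python (the helper names `log_tf_weight` and `log_tf_score` are hypothetical, not from the course):

```python
from math import log10

def log_tf_weight(tf):
    """W(t, d) = 1 + log10(TF(t, d)) if TF(t, d) > 0, else 0."""
    return 1 + log10(tf) if tf > 0 else 0.0

def log_tf_score(query_terms, doc_terms):
    """Sum the log-TF weights over terms shared by query and document."""
    return sum(log_tf_weight(doc_terms.count(t))
               for t in set(query_terms) if t in doc_terms)

doc = "to be or not to be".split()
# "to" and "be" each appear twice: 2 * (1 + log10(2)) ~ 2.602
print(log_tf_score(["to", "be", "question"], doc))
```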

3. Inverse Document Frequency Weight

An observation is that rare terms are more informative than frequent terms.

Define \(DF(t)\) as the number of documents containing the term \(t\), so \(DF(t) \leq N\) where \(N\) is the total number of documents. The inverse document frequency is then defined as

\[
IDF(t) = \log_{10}\left(\frac{N}{DF(t)}\right),
\]

The log is applied to dampen the effect.

Inverse document frequency has no effect on the ranking of one-term queries, since it scales every document's score by the same constant.
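
A minimal sketch in Python, using a made-up four-document toy corpus to show that rarer terms get higher weights:

```python
from math import log10

def idf(term, docs):
    """IDF(t) = log10(N / DF(t)), where DF(t) counts documents containing t."""
    df = sum(1 for d in docs if term in d)
    return log10(len(docs) / df) if df > 0 else 0.0

docs = [set(d.split()) for d in [
    "the tragedy of julius caesar",
    "the long march",
    "brutus and caesar",
    "the noble brutus",
]]
print(idf("the", docs))    # DF = 3 of 4 -> log10(4/3) ~ 0.125 (common, low weight)
print(idf("march", docs))  # DF = 1 of 4 -> log10(4)   ~ 0.602 (rare, high weight)
```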

4. TF-IDF Weight

Combining the above two weights together is TF-IDF:

\[
W(t, d) = \left(1 + \log_{10}TF(t, d)\right)\times \log_{10}\left(\frac{N}{DF(t)}\right),
\]

and the score for a query-document pair is again the sum of the weights over shared terms:

\[
Score = \sum_{t\in q\cap d} W(t, d).
\]
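
Putting the two pieces together, a sketch of TF-IDF scoring and ranking over the same kind of toy corpus (the function name `tfidf_score` and the corpus are mine, not from the course):

```python
from math import log10

def tfidf_score(query_terms, doc_terms, docs):
    """Score(q, d): sum of (1 + log10 TF(t, d)) * log10(N / DF(t)) over shared terms."""
    n = len(docs)
    total = 0.0
    for t in set(query_terms):
        tf = doc_terms.count(t)
        if tf == 0:
            continue  # only terms appearing in both query and document contribute
        df = sum(1 for d in docs if t in d)
        total += (1 + log10(tf)) * log10(n / df)
    return total

corpus = ["the tragedy of julius caesar", "the long march",
          "brutus and caesar", "the noble brutus"]
docs = [set(d.split()) for d in corpus]
query = "caesar march".split()
for d in sorted(corpus, key=lambda d: -tfidf_score(query, d.split(), docs)):
    print(round(tfidf_score(query, d.split(), docs), 3), d)
```

The rare term "march" dominates the ranking here, since its IDF is higher than that of "caesar", which appears in half the documents.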

5. Retrieval Model - VSM (Vector Space Model)

The documents are ranked by score, where the score can be either a similarity or a distance. One issue with using distance is that it is not normalized: longer documents end up farther away even when their term distributions are similar. A commonly used measure is cosine similarity, which is equivalent to the dot product of length-normalized vectors.
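
A minimal sketch of cosine similarity in Python (the function name and example vectors are made up):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

# Equivalent view: length-normalize each vector to unit length first,
# then take a plain dot product.
u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice as long
print(cosine(u, v))  # prints 1.0: length differences don't affect the score
```

Since cosine depends only on vector direction, documents of different lengths are scored fairly, which avoids the normalization issue of raw distance.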