
Stanford NLP (Coursera) Notes (18) - Word Meaning and Similarity

Introduction to Semantics.

1. Word Senses and Word Relations

Lemma and Word Form: a lemma is the dictionary form that groups word forms sharing the same stem, part of speech, and rough semantics. Each lemma can have multiple meanings, or senses, where a sense is a discrete representation of one aspect of a lemma’s meaning. For example, “bank” has two senses.

Homonymy: words that share a form but have unrelated meanings. It has two sub-categories: homographs (same written form) and homophones (different written form but same pronunciation).

Polysemy: a single word that has multiple related senses.

Synonyms and Antonyms: synonyms are words that have the same meaning in some or all contexts; the relation holds between senses rather than words. For example, big and large have the same meaning in some contexts but not in others: we can say “big sister” but not “large sister”. Antonyms are words that have opposite meanings.

Hyponymy and Hypernymy: hyponymy formally defines an IS-A hierarchy: if A is a B, then A is a hyponym of B, B is a hypernym of A, and B subsumes A.

2. WordNet and other Online Thesauri

Applications of Thesauri and Ontologies (an ontology is a model that describes objects, their attributes, and the relationships between them):

  • Information Extraction
  • Information Retrieval
  • Question Answering
  • Machine Translation

The definition of “sense” in WordNet is the synset, i.e. synonym set. WordNet is available in nltk.
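
As a quick illustration, here is a minimal sketch of browsing synsets through nltk; the word “bank” and the sense id `bank.n.02` are just examples, and sense ids depend on the installed WordNet version.

```python
import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet data if missing

from nltk.corpus import wordnet as wn

# Each synset is one sense of the word "bank".
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())

# Synonyms (lemma names) and hypernyms of one sense
# (bank.n.02 is the "financial institution" sense in WordNet 3.0).
deposit_bank = wn.synset("bank.n.02")
print(deposit_bank.lemma_names())
print(deposit_bank.hypernyms())
```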

3. Word Similarity: Thesaurus Method

Word similarity is not the same as word relatedness. In short, thesaurus-based methods measure how “nearby” two words are in the hypernym hierarchy.

Path based similarity: denote \(pathlen(c_{1}, c_{2})\) as the number of edges in the shortest path between two sense nodes in the hypernym graph. The similarity between two senses is then

\[
sim_{path}(c_{1}, c_{2}) = \frac{1}{pathlen(c_{1}, c_{2})},
\]

and the similarity between two words takes the best pair of senses:

\[
wordsim(w_{1}, w_{2}) = \max_{c_{1}\in\text{senses}(w_{1}),\, c_{2}\in\text{senses}(w_{2})} sim_{path}(c_{1}, c_{2}).
\]
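
A rough sketch of this measure using nltk’s WordNet interface is below. Note that nltk’s `path_similarity` scores \(1/(pathlen + 1)\) rather than \(1/pathlen\), so the numbers differ slightly from the formula above; the example words are arbitrary.

```python
from nltk.corpus import wordnet as wn

def word_path_similarity(w1, w2):
    """Max path similarity over all noun-sense pairs of two words."""
    best = 0.0
    for c1 in wn.synsets(w1, pos=wn.NOUN):
        for c2 in wn.synsets(w2, pos=wn.NOUN):
            score = c1.path_similarity(c2)  # 1 / (shortest path length + 1)
            if score is not None and score > best:
                best = score
    return best

print(word_path_similarity("nickel", "dime"))        # close in the hierarchy
print(word_path_similarity("nickel", "philosophy"))  # far apart
```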

The issue with this simple method is that all edges are weighted equally, while nodes high in the hierarchy are very abstract. Improved methods are therefore proposed to:

  • represent the weight of each edge independently
  • treat words connected only through abstract nodes as less similar

Information content similarity (Resnik method): define \(P(c)\) as the probability that a randomly selected word in a corpus is an instance of concept \(c\). Formally, there is a distinct random variable, ranging over words, associated with each concept in the hierarchy; the lower a node sits in the hierarchy, the lower its probability. Let \(word(c)\) be the set of all words that are children of node \(c\), and let \(N\) be the total number of word tokens in the corpus; then

\[
P(c) = \frac{1}{N}\sum_{w\in word(c)} count(w).
\]

The information content is defined as \(IC(c) = -\log P(c)\), and the most informative subsumer \(LCS(c_{1}, c_{2})\) is defined as the most informative (lowest) node in the hierarchy that subsumes both \(c_{1}\) and \(c_{2}\). The Resnik similarity between two senses is the information they have in common:

\[
sim_{resnik}(c_{1}, c_{2}) = -\log P(LCS(c_{1}, c_{2})).
\]
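
As a sketch, the LCS and Resnik similarity are also available in nltk. This assumes the Brown-corpus information-content file (`ic-brown.dat`) shipped with nltk’s `wordnet_ic` data, and the sense ids may vary by WordNet version.

```python
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("wordnet_ic", quiet=True)

from nltk.corpus import wordnet as wn, wordnet_ic

# Concept probabilities P(c) estimated from Brown corpus counts.
brown_ic = wordnet_ic.ic("ic-brown.dat")

nickel = wn.synset("nickel.n.02")  # the coin sense
dime = wn.synset("dime.n.01")

# Lowest (most informative) node subsuming both senses.
print(nickel.lowest_common_hypernyms(dime))

# Resnik similarity: information content of the LCS.
print(nickel.res_similarity(dime, brown_ic))
```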

4. Word Similarity: Distributional Method

Distributional methods (a.k.a. vector space models) measure word similarity by whether two words appear in similar distributional contexts. They address several issues with thesaurus-based methods:

  • not every language has a thesaurus
  • many words and phrases are missing
  • thesauri work less well for verbs and adjectives, which have less structured hyponymy relations

Recall the term-document matrix from the previous section: two words are defined as similar if their vectors are similar. Unlike document similarity, which uses counts over entire documents, word vectors can be based on a small context window, so each word is represented by a vector of counts of its context words; this gives the term-context matrix. Also, instead of the raw counts or TF-IDF weights used in the term-document matrix, Positive Pointwise Mutual Information (PPMI) is used. The pointwise mutual information between two words is

\[
PMI(w_{1}, w_{2}) = \log_{2}\frac{P(w_{1}, w_{2})}{P(w_{1})P(w_{2})},
\]

and PPMI simply replaces all negative PMI values with 0, i.e. \(PPMI(w_{1}, w_{2}) = \max(PMI(w_{1}, w_{2}), 0)\).
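
A minimal numpy sketch of the PPMI weighting, using a made-up 3 x 3 term-context count matrix:

```python
import numpy as np

# counts[i, j] = how often word i occurs with context word j (toy numbers).
counts = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 4.0, 1.0],
    [0.0, 1.0, 3.0],
])

total = counts.sum()
p_wc = counts / total                               # joint P(w, c)
p_w = counts.sum(axis=1, keepdims=True) / total     # marginal P(w)
p_c = counts.sum(axis=0, keepdims=True) / total     # marginal P(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))

ppmi = np.maximum(pmi, 0.0)   # clamp negative PMI (and -inf from zero counts) to 0
ppmi = np.nan_to_num(ppmi)    # safeguard against nan entries
print(ppmi)
```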

Cosine similarity can still be used:

\[
cosine(\mathbf{v},\mathbf{w}) = \frac{\sum^{N}_{i=1}v_{i}w_{i}}{\sqrt{\sum^{N}_{i=1}v_{i}^{2}\sum^{N}_{i=1}w_{i}^{2}}},
\]

where \(v_{i}\) is the PPMI value for word \(v\) in context \(i\), and \(w_{i}\) is the PPMI value for word \(w\) in context \(i\).
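
A short numpy sketch of this cosine computation on two made-up PPMI vectors:

```python
import numpy as np

def cosine(v, w):
    """Cosine of the angle between vectors v and w."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.2, 0.0, 0.7, 0.0])   # PPMI vector for one word (illustrative)
w = np.array([0.9, 0.3, 0.5, 0.0])   # PPMI vector for another word (illustrative)
print(cosine(v, w))
```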