[Paper Reading]: Enriching Word Vectors with Subword Information.
The limitation of skip-gram and other models used to learn word embeddings is that each word is represented by a distinct vector, with the morphology of the word ignored. This paper proposes a new model based on the skip-gram model in which each word is represented as a bag of character n-grams, and the word is represented as the sum of these n-gram representations. With this model, words that do not appear in the training data (out-of-vocabulary, OOV) also receive vector representations.
1. Recap of Skip-gram Model
With a word vocabulary of size \(W\), the goal is to learn a vector representation for each word \(w\). Given a large training corpus represented as a sequence of words \(w_{1}, w_{2}, \cdots, w_{T}\), the skip-gram model minimizes the negative log-likelihood:
\[
-\sum^{T}_{t=1}\sum_{c\in\text{context}(t)}\log p(w_{c}\mid w_{t}), \qquad p(w_{c}\mid w_{t}) = \frac{e^{s(w_{t}, w_{c})}}{\sum^{W}_{j=1}e^{s(w_{t}, j)}}.
\]
The scoring function \(s\) can be simply parameterized as the dot product of word and context vectors:
\[
s(w_{t}, w_{c}) = u^{\top}_{w_{t}}v_{w_{c}}.
\]
Computing this softmax over the whole vocabulary is expensive, so the problem is instead framed as a set of independent binary classification tasks: for the word at position \(t\), all context words are treated as positive examples, and negative examples are sampled from the vocabulary.
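Concretely, using the binary logistic loss and denoting by \(\mathcal{N}_{t,c}\) the set of negative examples sampled from the vocabulary, the loss for the word at position \(t\) and a context word \(w_{c}\) is:
\[
\log\left(1 + e^{-s(w_{t}, w_{c})}\right) + \sum_{n\in\mathcal{N}_{t,c}}\log\left(1 + e^{s(w_{t}, n)}\right).
\]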
2. Subword Model
A different scoring function is proposed that takes the internal structure of words into account. Each word is represented as a bag of character n-grams. Special boundary symbols “<” and “>” are added at the beginning and end of each word. Taking “where” with trigrams (n = 3) as an example, it is represented as <wh, whe, her, ere, re>, plus the special sequence <where> for the whole word. Note that the trigram her extracted from “where” is distinct from the special sequence <her> corresponding to the word “her”.
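A minimal sketch of this extraction in Python (the function name and the default n = 3 are mine for illustration; the paper actually extracts all n-grams for \(3 \le n \le 6\)):

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word with boundary symbols, plus the whole word."""
    wrapped = "<" + word + ">"
    ngrams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    if wrapped not in ngrams:   # avoid a duplicate entry for very short words
        ngrams.append(wrapped)  # special sequence for the whole word
    return ngrams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```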
Given a dictionary of n-grams of size \(G\), denote by \(\mathcal{G}_{w}\subset\{1,\cdots,G\}\) the set of n-grams appearing in \(w\), and let \(z_{g}\) be the vector representation of n-gram \(g\). The scoring function is
\[
s(w, c) = \sum_{g\in\mathcal{G}_{w}} z_{g}^{\top}v_{c}.
\]
This simple model allows representations to be shared across words, which makes it possible to learn reliable representations for rare words.
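As a rough sketch (reusing char_ngrams from above; the function and variable names are mine), the score sums the n-gram vectors of the word and takes a dot product with the context vector. Because the word vector is built from n-gram vectors, the same computation yields a representation for an OOV word from whatever of its n-grams were seen in training:

```python
import numpy as np

def subword_score(word, context_vec, z, n=3):
    """s(w, c) = sum over n-grams g in G_w of z_g^T v_c."""
    w_vec = np.zeros_like(context_vec)
    for g in char_ngrams(word, n):
        if g in z:               # n-grams missing from the dictionary are skipped
            w_vec = w_vec + z[g]
    return float(w_vec @ context_vec)
```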
3. Experiments
Morphological information significantly improves performance on syntactic tasks. In contrast, it does not help on semantic questions, and it even degrades performance for German and Italian.