
GloVe - Word Embeddings Utilizing Global Statistics

[Paper Reading]: GloVe: Global Vectors for Word Representation.

GloVe is an unsupervised learning algorithm for learning word embeddings. It combines the advantages of the two major model families: global matrix factorization and local context window methods. It uses a log-bilinear model to capture ratios of co-occurrence probabilities as linear meaning components in the word vector space.

1. Introduction

The two main model families for learning word vectors are:

  • global matrix factorization: such as LSA
  • local context window: such as the skip-gram model

Both of these model families suffer from significant drawbacks. Global matrix factorization methods perform relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Local context window methods poorly utilize the statistics of the corpus, since they train on separate local context windows instead of global co-occurrence counts.

GloVe is a specific weighted least squares (log-bilinear regression) model that trains on global word-word co-occurrence (not word-document) counts. The model produces a word vector space that achieves 75% accuracy on the word analogy dataset.

2. The GloVe Model

Let \(X\) be the matrix of word-word co-occurrence counts: \(X_{ij}\) is the number of times word \(j\) occurs in the context window of word \(i\), \(X_{i} = \sum_{k}X_{ik}\) is the number of times any word appears in the context of word \(i\), and \(P_{ij} = P(j|i) = X_{ij}/X_{i}\) is the probability that word \(j\) appears in the context of word \(i\).
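
To make the notation concrete, here is a minimal Python sketch (my own, not from the paper or its released code) of how \(X\), \(X_{i}\), and \(P_{ij}\) could be computed from a tokenized toy corpus with a symmetric context window; the corpus and function names are made up for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """Build word-word co-occurrence counts X from a list of token lists.

    X[i][j] counts how many times word j appears within `window` tokens of
    word i (symmetric context; the paper additionally down-weights a pair
    by 1/distance, which is omitted here for simplicity).
    """
    X = defaultdict(lambda: defaultdict(float))
    for tokens in corpus:
        for pos, word in enumerate(tokens):
            lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
            for c_pos in range(lo, hi):
                if c_pos != pos:
                    X[word][tokens[c_pos]] += 1.0
    return X

corpus = [["ice", "is", "solid"], ["steam", "is", "gas"]]
X = cooccurrence_counts(corpus, window=2)
X_i = {w: sum(ctx.values()) for w, ctx in X.items()}                       # X_i = sum_k X_ik
P = {w: {c: n / X_i[w] for c, n in ctx.items()} for w, ctx in X.items()}   # P_ij = X_ij / X_i
```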

The way that certain aspects of meaning can be extracted from the co-occurrence matrix through probability ratios can be illustrated by the example in the paper: consider probe words \(k\) in the contexts of “ice” and “steam”. For words related to ice but not steam (e.g. “solid”), the ratio \(P(k|\text{ice})/P(k|\text{steam})\) is large; for words related to steam but not ice (e.g. “gas”), the ratio is small; and for words related to both or to neither (e.g. “water”, “fashion”), the ratio is close to 1. Compared with the raw probabilities, the probability ratio is better able to distinguish relevant words from irrelevant ones.
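
As a quick numerical illustration of this point, here is a small snippet (mine, not from the paper) that computes these ratios; the hard-coded probabilities are approximations of the values reported in the paper's ice/steam table, standing in for corpus estimates.

```python
# approximate conditional probabilities P(k|i), in place of corpus estimates
P = {
    "ice":   {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5},
    "steam": {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5},
}

# large for ice-related words, small for steam-related words, near 1 otherwise
for k in ["solid", "gas", "water", "fashion"]:
    ratio = P["ice"][k] / P["steam"][k]
    print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")
```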

Denote the probability ratio as

\[
F(w_{i}, w_{j}, \tilde{w}_{k}) = \frac{P_{ik}}{P_{jk}},
\]

where \(w\in R^{d}\) are word vectors and \(\tilde{w}\in R^{d}\) are separate context word vectors.

Since vector spaces are inherently linear structures, \(F\) can be restricted to functions that depend only on the difference of the two target word vectors, therefore

\[
F(w_{i} - w_{j}, \tilde{w}_{k}) = \frac{P_{ik}}{P_{jk}},
\]

next, take the dot product of the arguments so that \(F\) does not mix the vector dimensions in undesirable ways:

\[
F\left((w_{i} - w_{j})^\top \tilde{w}_{k}\right) = \frac{P_{ik}}{P_{jk}}.
\]

Then note that for a word-word co-occurrence matrix, the distinction between a word and a context word is arbitrary and the two roles can be exchanged. Restoring this symmetry requires \(F\) to be a homomorphism that maps differences in its argument to quotients of its values. Finally,

\[
F\left((w_{i} - w_{j})^\top \tilde{w}_{k}\right) = \frac{F\left(w_{i}^{\top}\tilde{w}_{k}\right)}{F\left(w_{j}^{\top}\tilde{w}_{k}\right)},
\]

the solution of the above equation is \(F = \exp\) (which indeed satisfies \(\exp(a - b) = \exp(a)/\exp(b)\)), or

\[
w_{i}^{\top}\tilde{w}_{k} = \log P_{ik} = \log X_{ik} - \log X_{i}.
\]

Since \(\log X_{i}\) is independent of \(k\), it can be absorbed into a bias term \(b_{i}\); adding an additional bias \(\tilde{b}_{k}\) for \(\tilde{w}_{k}\) restores the symmetry:

\[
w_{i}^{\top}\tilde{w}_{k} + b_{i} + \tilde{b}_{k} = \log X_{ik}.
\]

Including an additive shift inside the logarithm, i.e. using \(\log(1 + X_{ik})\), avoids the divergence when \(X_{ik} = 0\).

A main drawback of this model is that it weights all co-occurrences equally, including rare and noisy ones. A weighted least squares regression model is proposed to address this problem,

\[
J = \sum^{V}_{i,j=1}f\left(X_{ij}\right)\left(w_{i}^{\top}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij}\right)^{2}.
\]

In the paper \(f\) is parameterized as \(f(x) = (x / x_{\max})^{\alpha}\) if \(x < x_{\max}\) and \(f(x) = 1\) otherwise, with \(x_{\max} = 100\) and \(\alpha = 3/4\).
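
As a concrete illustration, here is a small NumPy sketch (my own, not the paper's implementation) of the weighting function \(f\) and the objective \(J\), summed over the non-zero entries of a dense toy co-occurrence matrix; the names and shapes are assumptions made for this example.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x)."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least squares loss J, summed over non-zero co-occurrence counts."""
    rows, cols = np.nonzero(X)
    x = X[rows, cols]
    pred = np.sum(W[rows] * W_tilde[cols], axis=1) + b[rows] + b_tilde[cols]
    return np.sum(weight(x) * (pred - np.log(x)) ** 2)

# toy example with a random co-occurrence matrix and random parameters
rng = np.random.default_rng(0)
V, d = 5, 3
X = rng.integers(0, 10, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```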

3. Relationship to Other Models

The starting point for the skip-gram model is a softmax model \(Q_{ij}\) for the probability that word \(j\) appears in the context of word \(i\):

\[
Q_{ij} = \frac{\exp\left(w_{i}^{\top}\tilde{w}_{j}\right)}{\sum^{V}_{k=1}\exp\left(w_{i}^{\top}\tilde{w}_{k}\right)}.
\]

Training proceeds by minimizing the global objective function

\[
J = -\sum_{i\in\text{corpus}}\sum_{j\in\text{context}(i)} \log Q_{ij} = -\sum^{V}_{i=1}\sum^{V}_{j=1}X_{ij}\log Q_{ij}.
\]
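
The second equality simply groups repeated \((i, j)\) pairs encountered while scanning the corpus into their co-occurrence counts \(X_{ij}\); a quick numerical check of this regrouping, with made-up context pairs and an arbitrary row-stochastic \(Q\), might look like:

```python
import numpy as np
from collections import Counter

# made-up (center, context) index pairs as they would appear scanning a corpus
pairs = [(0, 1), (0, 1), (0, 2), (1, 0), (1, 2), (1, 2), (1, 2)]

# arbitrary model probabilities Q_ij (each row sums to 1, as in the softmax)
Q = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])

per_position = -sum(np.log(Q[i, j]) for i, j in pairs)            # sum over corpus positions
counts = Counter(pairs)                                           # X_ij
grouped = -sum(n * np.log(Q[i, j]) for (i, j), n in counts.items())
print(np.isclose(per_position, grouped))                          # True
```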

Using \(X_{ij} = X_{i}P_{ij}\), \(J\) can be rewritten as

\[
J = -\sum^{V}_{i=1}X_{i}\sum^{V}_{j=1}P_{ij}\log Q_{ij} = \sum^{V}_{i=1}X_{i}H(P_{i}, Q_{i})
\]

where \(H(P_{i}, Q_{i})\) is the cross entropy of the distributions \(P_{i}\) and \(Q_{i}\). One could interpret this objective as a “global skip-gram” model.

Cross entropy has the unfortunate property that distributions with long tails are often modeled poorly, with too much weight given to unlikely events. The normalization factor of \(Q\) is also computationally expensive to evaluate. A natural choice is a least squares objective in which the normalization factors in \(Q\) and \(P\) are discarded,

\[
J = \sum_{i,j} X_{i}\left(\hat{P}_{ij} - \hat{Q}_{ij}\right)^{2},
\]

where \(\hat{P}_{ij} = X_{ij}\) and \(\hat{Q}_{ij} = \exp\left(w_{i}^{\top}\tilde{w}_{j}\right)\) are the unnormalized distributions. Now another problem emerges: \(X_{ij}\) often takes very large values. An effective remedy is to minimize the squared error of the logarithms of \(\hat{P}\) and \(\hat{Q}\) instead,

\[
J = \sum_{i,j}X_{i}\left(\log \hat{P}_{ij} - \log \hat{Q}_{ij}\right)^{2} = \sum_{i,j} X_{i}\left(w_{i}^{\top}\tilde{w}_{j} - \log X_{ij}\right)^{2}.
\]

This is equivalent to the GloVe loss function proposed in the previous section, with the weighting factor \(X_{i}\) generalized to \(f(X_{ij})\) and the bias terms added.

4. Experiments

In constructing \(X\), the size of the context window needs to be specified, as well as whether to distinguish left context from right context. For all experiments, the model is trained using AdaGrad, stochastically sampling the non-zero elements of \(X\), with an initial learning rate of 0.05. The number of iterations is set to 50 for vectors smaller than 300 dimensions, and 100 otherwise. As for certain types of neural networks, training multiple instances and combining the results can help reduce overfitting and noise and generally improve results; accordingly, the final word vectors are taken as \(W + \tilde{W}\).
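
For illustration, here is a minimal sketch of such a training loop (my own simplification, not the released GloVe code): visit the non-zero entries of a dense toy \(X\) in random order, compute the gradient of the weighted squared error for each entry, and apply per-parameter AdaGrad updates; the default values mirror the learning rate and iteration count quoted above.

```python
import numpy as np

def train_glove(X, dim=50, epochs=50, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Toy GloVe trainer: AdaGrad updates over the non-zero entries of a dense X."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W = rng.normal(scale=0.1, size=(V, dim))        # word vectors
    Wt = rng.normal(scale=0.1, size=(V, dim))       # context word vectors
    b, bt = np.zeros(V), np.zeros(V)                # biases
    # AdaGrad accumulators, initialized to 1 so the first step uses the base learning rate
    gW, gWt = np.ones_like(W), np.ones_like(Wt)
    gb, gbt = np.ones_like(b), np.ones_like(bt)
    rows, cols = np.nonzero(X)
    for _ in range(epochs):
        for idx in rng.permutation(len(rows)):      # stochastic pass over non-zero entries
            i, j = rows[idx], cols[idx]
            x = X[i, j]
            f = min((x / x_max) ** alpha, 1.0)      # weighting function f(X_ij)
            err = W[i] @ Wt[j] + b[i] + bt[j] - np.log(x)
            grad_wi = 2.0 * f * err * Wt[j]
            grad_wj = 2.0 * f * err * W[i]
            grad_b = 2.0 * f * err
            W[i] -= lr * grad_wi / np.sqrt(gW[i])
            Wt[j] -= lr * grad_wj / np.sqrt(gWt[j])
            b[i] -= lr * grad_b / np.sqrt(gb[i])
            bt[j] -= lr * grad_b / np.sqrt(gbt[j])
            gW[i] += grad_wi ** 2
            gWt[j] += grad_wj ** 2
            gb[i] += grad_b ** 2
            gbt[j] += grad_b ** 2
    return W + Wt                                   # combine word and context vectors
```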