Language Modeling - Probability Perspective.
1. Probabilistic Language Model
The language model we discuss here refers to a probabilistic language model, i.e. one that assigns a probability to a sentence. This is a widely used concept. We often need to compare the probabilities of different texts:
- machine translation: translated sentences with higher probability are more likely to be reasonable results
- spell correction: we can suggest the sentence with higher probability, i.e. the one containing the correctly spelled words
- …
We denote the probability of a sentence as
\[
P(W) = P(w_1w_2w_3\dots w_n),
\]
There is another related probability: the probability of an upcoming word,
\[
P(w_5|w_1w_2w_3w_4).
\]
In probability terms, these two quantities correspond to a joint probability and a conditional probability. Models that compute either of them are called language models.
To compute the (joint) probability we can apply the chain rule:
\[
P(W) = P(w_{1})P(w_{2}|w_{1})P(w_{3}|w_{1}w_{2})\dots P(w_{n}|w_{1}w_{2}w_{3}\dots w_{n-1}).
\]
For example, the probability of “I want to learn NLP” is
\[
P(\text{I want to learn NLP}) = P(\text{I})P(\text{want}|\text{I})P(\text{to}|\text{I want})P(\text{learn}|\text{I want to})P(\text{NLP}|\text{I want to learn}).
\]
You can see we need to know a series of conditional probabilities. A straightforward way to estimate them is to count occurrences:
\[
P(\text{NLP}|\text{I want to learn}) = \frac{count(\text{I want to learn NLP})}{count(\text{I want to learn})},
\]
but there are some problems: 1) there are too many possible sentences; 2) there is not enough data to estimate them - in other words, we would need a huge corpus. Thus we use the Markov assumption: the current word depends only on a few previous words rather than on all previous words:
\[
P(\text{NLP}|\text{I want to learn})\approx P(\text{NLP}|\text{learn}).
\]
This leads to N-gram models, where \(N-1\) denotes how many previous words we include:
- Uni-gram (1-gram)
\[
P(W) = P(w_1)P(w_2)\dots P(w_n)
\]
- Bi-gram (2-gram)
\[
P(W) = P(w_{1})P(w_{2}|w_{1})P(w_{3}|w_{2})\dots P(w_{n}|w_{n-1})
\]
- …
In general this is an insufficient model of language because language has long-distance dependencies. But we can often get away with N-gram models.
Now we can use the same counting method to estimate the probabilities. Taking the bigram model as an example,
\[
P(w_2|w_1) = \frac{count(w_1, w_2)}{count(w_1)},
\]
this is also the maximum likelihood estimate.
In practice, we always do this in log space: it avoids underflow, and adding is faster than multiplying:
\[
\log P(w_1w_2w_3) = \log(P(w_1)P(w_2|w_1)P(w_3|w_2)) = \log P(w_1) + \log P(w_2|w_1) + \log P(w_3|w_2).
\]
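As a minimal sketch of the two formulas above (the toy corpus, the `<s>`/`</s>` boundary markers, and the function names are my own illustration, not from the course notes):

```python
import math
from collections import Counter

# A toy corpus; sentences are padded with <s> / </s> boundary markers.
corpus = [["<s>", "i", "want", "to", "learn", "nlp", "</s>"],
          ["<s>", "i", "want", "to", "eat", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

def bigram_mle(w1, w2):
    # Maximum likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def sentence_log_prob(sentence):
    # Sum log conditional probabilities instead of multiplying raw probabilities.
    return sum(math.log(bigram_mle(w1, w2)) for w1, w2 in zip(sentence, sentence[1:]))

print(sentence_log_prob(["<s>", "i", "want", "to", "learn", "nlp", "</s>"]))
```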
2. Model Evaluation
After we have the model, we want to know how good it is. The model should prefer good sentences to bad sentences. As usual we have a training set and a test set: we train the language model on the training set and evaluate it on the test set. To compare two models, we can
- put each model in a task (spelling corrector, speech recognizer, machine translation system)
- run the task and get an accuracy for each model, e.g. how many misspelled words are corrected properly or how many words are translated correctly
This is extrinsic evaluation of N-gram models. It can be time consuming - it may take days or weeks. Therefore we sometimes use an intrinsic evaluation: perplexity.
One intuition for perplexity is the Shannon game: how well can we predict the next word? A better model of a text is one that assigns a higher probability to the word that actually occurs. If you have basic machine learning knowledge, you know that a good model should make good predictions on an unseen test set; the same holds for language models. Given the probability of the test set, perplexity is the inverse probability of the test set, normalized by the number of words:
\[
PP(w_1w_2w_3\dots w_N) = P(w_1w_2w_3\dots w_N)^{-\frac{1}{N}}.
\]
Therefore maximizing probability is equivalent to minimizing perplexity.
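A short sketch of this definition, computed in log space for stability (the example numbers are my own, not from the course):

```python
import math

def perplexity(log_prob, num_words):
    # PP = P(w_1 ... w_N)^(-1/N) = exp(-(1/N) * log P)
    return math.exp(-log_prob / num_words)

# Example: a 5-word test sentence whose model log probability is log(1/100).
print(perplexity(math.log(1 / 100), 5))   # ~2.51
```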
3. Generalization and Zeros
N-gram models only work well for word prediction if the test corpus looks like the training corpus. This is easy to understand: if some N-grams in the test set do not appear in the training set, test performance will be bad, since we assign zero probability to those N-grams. Therefore, to improve our model we need to generalize, starting from those zero probabilities.
One of the methods is smoothing. The intuition behind smoothing is simple: steal probability mass from seen events to generalize better. The simplest smoothing is add-one smoothing, also known as Laplace smoothing:
- pretend we saw each word one more time than we did
- add one to all counts
therefore
\[
P_{\text{add-one}}(w_2|w_1) = \frac{count(w_1, w_2) + 1}{count(w_1) + V}
\]
where \(V\) is the vocabulary size - this makes sure the total probability is still 1. In real-world problems we usually use better smoothing methods, since add-one smoothing is a blunt instrument and not the best choice for N-grams. But in some cases it is still useful: in text classification we don't care about exact estimates but just want to separate classes, and in domains without too many zeros it can also be a good approximation.
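A sketch of the formula, reusing the hypothetical `unigram_counts` / `bigram_counts` from the earlier toy-corpus snippet:

```python
def bigram_add_one(w1, w2, vocab_size):
    # P_add-one(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

V = len(unigram_counts)                  # vocabulary size of the toy corpus
print(bigram_add_one("i", "want", V))    # a seen bigram gets slightly less mass than its MLE
print(bigram_add_one("i", "learn", V))   # an unseen bigram now gets non-zero probability
```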
4. Backoff and Interpolation
Now we introduce some more advanced techniques. Sometimes it helps to use less context: condition on less context for contexts we haven't learned much about. This is where backoff and interpolation come in:
Backoff: use the trigram if there is good evidence for it, otherwise back off to the bigram or unigram.
Interpolation: mix the unigram, bigram, and trigram estimates.
Usually interpolation works better:
\[
\hat{P}(w_3|w_1w_2) = \lambda_1 P(w_3|w_1w_2) + \lambda_2 P(w_3|w_2) + \lambda_3 P(w_3),
\]
where \(\sum_{i} \lambda_i = 1\). To choose the \(\lambda\)s, we can set aside a held-out corpus and pick the setting that gives the largest probability to the held-out corpus - this is similar to using cross-validation to choose hyperparameters.
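A sketch of simple linear interpolation with fixed weights, reusing the hypothetical toy corpus and counts from the earlier snippets; the λ values here are arbitrary placeholders that would really be tuned on a held-out corpus:

```python
# Trigram counts from the same hypothetical toy corpus as before.
trigram_counts = Counter((w1, w2, w3)
                         for sent in corpus
                         for w1, w2, w3 in zip(sent, sent[1:], sent[2:]))

def interpolated_prob(w1, w2, w3, lambdas=(0.5, 0.3, 0.2)):
    # P_hat(w3 | w1 w2) = l1*P(w3|w1 w2) + l2*P(w3|w2) + l3*P(w3), with l1+l2+l3 = 1
    l1, l2, l3 = lambdas
    p_tri = trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]  # trigram MLE (assumes the context was seen)
    p_bi = bigram_mle(w2, w3)
    p_uni = unigram_counts[w3] / sum(unigram_counts.values())
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(interpolated_prob("i", "want", "to"))
```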
5. Advanced Smoothing methods
5.1 Add-k smoothing
We can generalize to add-k smoothing:
\[
P_{\text{add-k}}(w_2|w_1) = \frac{count(w_1, w_2) + k}{count(w_1) + kV},
\]
or, writing \(m = kV\),
\[
P_{\text{add-k}}(w_2|w_1) = \frac{count(w_1, w_2) + m\frac{1}{V}}{count(w_1) + m}.
\]
Unigram prior smoothing then replaces the uniform \(\frac{1}{V}\) with the unigram probability \(P(w_2)\):
\[
P_{\text{unigram-prior}}(w_2|w_1) = \frac{count(w_1, w_2) + m P(w_2)}{count(w_1) + m}.
\]
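A sketch of both variants, again reusing the hypothetical counts from the earlier snippets:

```python
def bigram_add_k(w1, w2, k, vocab_size):
    # Add-k: (count(w1, w2) + k) / (count(w1) + k*V)
    return (bigram_counts[(w1, w2)] + k) / (unigram_counts[w1] + k * vocab_size)

def bigram_unigram_prior(w1, w2, m):
    # Unigram prior: (count(w1, w2) + m*P(w2)) / (count(w1) + m)
    p_w2 = unigram_counts[w2] / sum(unigram_counts.values())
    return (bigram_counts[(w1, w2)] + m * p_w2) / (unigram_counts[w1] + m)

print(bigram_add_k("i", "want", k=0.5, vocab_size=len(unigram_counts)))
print(bigram_unigram_prior("i", "want", m=4))
```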
5.2 Estimating unseen count
We can use the count of things we've seen once to help estimate the count of things we've never seen. We denote by \(N_c\) the number of things we've seen exactly \(c\) times.
Example: you are fishing (a scenario from Josh Goodman) and have caught: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel (18 fish in total),
- how likely is it that the next species is trout? 1/18
- how likely is it that the next species is new? 3/18, since \(N_1=3\)
- assuming so, how likely is it that the next species is trout? it must be less than 1/18
This leads to the Good-Turing calculation (smoothing): the probability mass reserved for unseen things and the adjusted count \(c^{*}\) for things seen \(c\) times are
\[
P(\text{things with zero count}) = \frac{N_1}{N},
\]
\[
c^{*} = \frac{(c+1)N_{c+1}}{N_c}.
\]
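A sketch of the calculation on the fishing example (the dictionary and variable names are my own illustration):

```python
from collections import Counter

# Counts per species from the fishing example.
species_counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(species_counts.values())          # 18 fish in total
Nc = Counter(species_counts.values())     # N_c: how many species were seen exactly c times

p_unseen = Nc[1] / N                      # P(next species is new) = N_1 / N = 3/18

def good_turing_count(c):
    # c* = (c + 1) * N_{c+1} / N_c  (undefined when N_{c+1} = 0; real implementations smooth the N_c values)
    return (c + 1) * Nc[c + 1] / Nc[c]

p_trout = good_turing_count(1) / N        # adjusted count for trout: 2 * N_2 / N_1 = 2/3
print(p_unseen, p_trout)                  # 0.166..., 0.037... (indeed less than 1/18)
```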
6. More Discussions
There is more to discuss about language modeling, for example Kneser-Ney smoothing and how to apply language modeling at huge web scale. The course instructor mentioned these topics but didn't give many details, so we stop here and will post a more detailed discussion in the future.