Bangda Sun

Practice makes perfect

Stanford NLP (coursera) Notes (5) - Spell Correction

Spell Correction Task.

In this post we will discuss something practical, one of the applications of NLP in real life. Spell correction is a widely deployed application; we can find it in many places: word processing apps, search engines, etc.

1. Different Tasks

There are two steps in spelling correction: 1) spelling error detection; 2) spelling error correction, which includes autocorrection and suggesting corrections.

Errors also fall into several categories: 1) non-word errors: words that do not exist; 2) real-word errors: typographical errors (there -> three) and cognitive errors (too -> two).

2. Non-word Error

For non-word spelling error detection, any word that does not appear in the dictionary / vocabulary is an error (out-of-vocabulary, OOV). Therefore a larger dictionary is better, since it helps ensure that the flagged OOV words are real non-words.
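As a minimal sketch of this idea (the word list file name `words.txt` is a placeholder, not a specific resource), non-word error detection is just a set-membership check against the vocabulary:

```python
# Minimal sketch of non-word error detection: flag any token that is
# out-of-vocabulary (OOV). "words.txt" is a placeholder for any large word list.
import re

with open("words.txt") as f:
    vocabulary = set(word.strip().lower() for word in f)

def detect_non_word_errors(text):
    """Return the tokens in `text` that do not appear in the vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in vocabulary]

print(detect_non_word_errors("I have two acress of land"))  # e.g. ['acress']
```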

3. Real-word Error

Real-word errors are more complex. For every word \(w\) we need to generate a candidate set: candidate words with similar pronunciation or similar spelling, and then we choose the best candidate (including \(w\) itself). We can use two methods: the noisy channel model and a classifier.
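One common way to generate the candidate set (not from the lecture itself, but a standard trick in the style of Peter Norvig's spell corrector) is to enumerate all strings within one edit of the word and keep those that appear in the vocabulary:

```python
# Sketch of candidate generation: all vocabulary words within one edit
# (deletion, insertion, substitution, transposition) of the input word.
# `vocabulary` is assumed to be a set of known words (see the sketch above).
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def candidates(word, vocabulary):
    """Known words within edit distance 1, plus the word itself if known."""
    return {w for w in edits1(word) | {word} if w in vocabulary}
```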

3.1 Noisy Channel

The intuition of the noisy channel model is that we feed an original word into a channel, and a noisy word comes out. We observe \(x\), a misspelled word, and the correct word \(w\) should be

\[
\hat{w} = \arg\max_{w\in V} P(w|x) = \arg\max_{w\in V} \frac{P(x|w)P(w)}{P(x)} = \arg\max_{w\in V} P(x|w)P(w),
\]

where the denominator \(P(x)\) can be dropped since it does not depend on \(w\). From a Bayesian perspective, \(P(w)\) is the prior and \(P(x|w)\) is the likelihood (the channel / error model): the probability of the edit that turns \(w\) into \(x\), using operations like deletion / insertion / substitution / transposition. We define

  • \(del(x, y)\) to be \(count(xy \text{ typed as } x)\)
  • \(ins(x, y)\) to be \(count(x \text{ typed as }xy)\)
  • \(sub(x, y)\) to be \(count(x \text{ typed as } y)\)
  • \(trans(x, y)\) to be \(count(xy\text{ typed as }yx)\)

then we can calculate the corresponding likelihoods,

\[
P(x|w) = \begin{cases}
del(w_{i-1}, w_{i}) / count(w_{i-1} w_{i}), & \text{deletion} \\
ins(w_{i-1}, x_{i}) / count(w_{i-1}), & \text{insertion} \\
sub(x_{i}, w_{i}) / count(w_{i}), & \text{substitution} \\
trans(w_{i}, w_{i+1}) / count(w_{i} w_{i+1}), & \text{transposition.}
\end{cases}
\]
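To make the channel model concrete, here is a small sketch of the deletion case; all counts below are invented for illustration, not real confusion-matrix statistics:

```python
# Sketch of the channel model for the deletion case:
#   P(x|w) = del(w_{i-1}, w_i) / count(w_{i-1} w_i)
# e.g. the misspelling "acress" can be produced from "actress"
# by deleting "t" after "c". All counts are made up for illustration.
del_counts = {("c", "t"): 25}          # del(c, t) = count("ct" typed as "c")
bigram_char_counts = {"ct": 1000}      # count("ct") in the corpus

def p_deletion(prev_char, deleted_char):
    """Channel probability of deleting `deleted_char` after `prev_char`."""
    num = del_counts.get((prev_char, deleted_char), 0)
    denom = bigram_char_counts.get(prev_char + deleted_char, 0)
    return num / denom if denom else 0.0

print(p_deletion("c", "t"))   # 25 / 1000 = 0.025
```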

In practice we also combine the channel model with a language model: a unigram or bigram language model provides the prior \(P(w)\), and with a bigram model we can take the surrounding words into account.
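As a hedged sketch of putting the two pieces together, the ranking below multiplies a channel probability by a unigram prior; all counts and probabilities are invented for illustration (the candidates echo the classic "acress" example from the lecture):

```python
# Sketch: rank candidate corrections w for an observed misspelling x by
# P(x|w) * P(w), where P(w) is estimated from unigram counts.
# All numbers are made up for illustration.
unigram_counts = {"actress": 9321, "across": 10527, "acres": 12874}
total_words = 44_000_000

channel_prob = {          # P(x = "acress" | w), from a confusion-matrix model
    "actress": 0.000117,
    "across": 0.000021,
    "acres": 0.0000321,
}

def noisy_channel_score(w):
    prior = unigram_counts[w] / total_words   # language model P(w)
    likelihood = channel_prob[w]              # channel model P(x|w)
    return likelihood * prior

best = max(channel_prob, key=noisy_channel_score)
print(best, noisy_channel_score(best))
```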

3.2 Classifier

This topic intersects with machine learning: besides the channel model and the language model, we can also generate features from the text to help us build a better classifier. We will discuss this in later posts!