Maximum Entropy Classifiers.
1. Generative vs Discriminative
So far we’ve looked at generative models, which model joint distributions: given paired observations of data \(x\) and hidden classes \(y\), a generative model places probabilities over both the observed data and the hidden structure - it generates the observed data from the hidden structure and tries to maximize \(P(x, y)\). It turns out to be trivial to choose the weights: just use relative frequencies. All the classic statistical NLP models are generative: N-gram models, the Naive Bayes classifier, Hidden Markov Models, Probabilistic Context-Free Grammars, etc.
But conditional or discriminative models are now widely used, because:
- they give high-accuracy performance
- they make it easy to incorporate lots of linguistically important features
- they allow automatic building of language independent, retargetable NLP modules
Discriminative models work with conditional distributions: they take the data as given, put a probability over the hidden structure given the data, and try to maximize \(P(y|x)\). Examples include Logistic Regression, Conditional Loglinear / Maximum Entropy Models, Conditional Random Fields, as well as SVMs and the Perceptron.
2. Extract Features from Text
Features are elementary pieces of evidence that link aspects of what we observe with a class \(y\) that we want to predict. A feature is a function with a bounded real value:
\[
f: X \times Y \rightarrow \mathbb{R}.
\]
In NLP, a feature is typically an indicator function of properties of the input and a particular class. Example features: say we want to classify a word (in context) as LOCATION or DRUG; we can extract features like:
\[
\begin{aligned}
f_{1} &= [y = \text{ LOCATION } \text{and }w_{-1}=\text{ “in” }\text{ and }isCapitalized(w)] \\
f_{2} &= [y = \text{ LOCATION } \text{and }hasAccentedLatinChar(w)]
\end{aligned}
\]
Then when we train the model, it will assign to each feature a weight: a positive weight votes that this configuration is likely correct; a negative weight votes that this configuration is likely incorrect.
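To make this concrete, here is a minimal Python sketch of the two indicator features above. It assumes the observation \(x\) is a small dict holding the current word `w` and its left neighbor `w_prev`; that representation and the helper names are just illustrative choices, not part of any fixed API.

```python
import unicodedata

# Illustrative helpers; the observation x is assumed to be a dict
# like {"w": "Québec", "w_prev": "in"}.

def is_capitalized(w):
    return w[:1].isupper()

def has_accented_latin_char(w):
    # crude check: any Latin letter whose Unicode name says it carries a diacritic
    return any("LATIN" in unicodedata.name(c, "") and "WITH" in unicodedata.name(c, "")
               for c in w)

def f1(x, y):
    return 1 if y == "LOCATION" and x["w_prev"] == "in" and is_capitalized(x["w"]) else 0

def f2(x, y):
    return 1 if y == "LOCATION" and has_accented_latin_char(x["w"]) else 0

print(f1({"w": "Québec", "w_prev": "in"}, "LOCATION"))  # 1
print(f2({"w": "Québec", "w_prev": "in"}, "LOCATION"))  # 1
print(f1({"w": "Québec", "w_prev": "in"}, "DRUG"))      # 0
```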
For feature expectations, we will make use of two quantities:
- the empirical count (expectation) of a feature:
\[
\hat{E}(f_{i}) = \sum_{(x,y)}f_{i}(x,y),
\]
where the sum runs over the observed training pairs;
- the model expectation of a feature:
\[
E(f_{i}) = \sum_{(x,y)}P(x,y)\,f_{i}(x,y).
\]
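As a Python sketch of these two quantities: assume `data` is the list of observed \((x, y)\) pairs, `classes` is the set of labels, and `p_y_given_x(y, x)` is the model's conditional distribution; the joint \(P(x,y)\) in the model expectation is approximated, as is standard in maxent training, by the empirical \(P(x)\) times the model's \(P(y|x)\).

```python
def empirical_count(f, data):
    # empirical count/expectation: sum of f(x, y) over the observed pairs
    return sum(f(x, y) for x, y in data)

def model_expectation(f, data, classes, p_y_given_x):
    # E(f) = sum over (x, y) of P(x, y) * f(x, y),
    # approximating P(x, y) by (1/N) * P(y|x) for each observed x
    n = len(data)
    return sum(p_y_given_x(y, x) * f(x, y) / n
               for x, _gold in data
               for y in classes)
```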
3. Linear Classifiers
Key components:
- linear function from feature sets to classes
- assign a weight \(\lambda_i\) to each feature \(f_i\)
- we consider each class for an observed datum \(x\)
- for a pair \((x, y)\), features vote with their weights:
\[
Vote(y) = \sum_i \lambda_i f_i(x,y)
\]
- choose the class \(y\) which maximizes the vote (a minimal sketch of this rule follows below)
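Here is that sketch of the voting rule, assuming `features` is a list of feature functions \(f_i(x, y)\) and `lambdas` the matching list of weights (both names are illustrative):

```python
def vote(x, y, features, lambdas):
    # Vote(y) = sum_i lambda_i * f_i(x, y)
    return sum(lam * f(x, y) for lam, f in zip(lambdas, features))

def classify(x, classes, features, lambdas):
    # pick the class whose total vote is highest
    return max(classes, key=lambda y: vote(x, y, features, lambdas))

# e.g., with the f1, f2 sketched earlier and some illustrative weights:
# classify({"w": "Québec", "w_prev": "in"}, ["LOCATION", "DRUG"], [f1, f2], [1.8, -0.6])
```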
There are many ways to choose the weights, depending on how we model the problem and how we train the model. Here we use the Maximum Entropy classifier (essentially multiclass logistic regression; in NLP it is usually called a Maximum Entropy model), which models the conditional probability as follows:
\[
P(y|x) = \frac{\exp\left(\sum_i \lambda_i f_i(x,y)\right)}{\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x,y')\right)}.
\]
Here the exponential function makes everything positive, and the denominator can be viewed as normalizing the votes. The weights \(\lambda_i\) are the parameters of this model, combined with the features via the softmax function.
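As a sketch, reusing the `vote()` function from the snippet above, the conditional probability is just exponentiate-and-normalize over the per-class votes:

```python
import math

def p_y_given_x(y, x, classes, features, lambdas):
    # softmax over the votes: exp each class's score, then normalize
    scores = {c: math.exp(vote(x, c, features, lambdas)) for c in classes}
    return scores[y] / sum(scores.values())
```

(In a real implementation one would subtract the maximum vote before exponentiating, to avoid overflow.)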
The way we estimate the parameters is by maximizing the (conditional) likelihood of the training data (MLE). Taking the log, we get:
\[
\log\prod_{(x,y)}P(y|x) = \sum_{(x,y)}\log P(y|x) = \sum_{(x,y)}\log \frac{\exp\left(\sum_i \lambda_i f_i(x,y)\right)}{\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x,y')\right)}
\]
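A sketch of this objective, again reusing `vote()` and assuming `data` is the list of observed \((x, y)\) pairs; in practice the \(\lambda_i\) are then found by numerical optimization (e.g. gradient ascent), usually with a regularizer added:

```python
import math

def log_likelihood(data, classes, features, lambdas):
    total = 0.0
    for x, y in data:
        # log P(y|x) = vote(x, y) - log( sum over y' of exp(vote(x, y')) )
        log_z = math.log(sum(math.exp(vote(x, c, features, lambdas))
                             for c in classes))
        total += vote(x, y, features, lambdas) - log_z
    return total
```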
4. Maximum Entropy Models
An equivalent approach: we want a distribution which is as uniform as possible, except in the specific ways we require (namely, matching the feature expectations above). Uniformity means high entropy.