Bangda Sun

Practice makes perfect

NBSVM - A Strong Classification Baseline

[Paper reading] Baselines and Bigrams: Simple, Good Sentiment and Topic Classification.

1. Introduction

When I participated in the Toxic Comment Classification Challenge on kaggle this March, I saw a kernel that implemented NB-LR as a baseline, and it outperformed plain Logistic Regression. NB-LR is a variation of NBSVM from the paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification by Sida Wang and Christopher Manning. At that time I just forked the kernel and didn't look into the details. Recently I started a new NLP-related competition on kaggle, which is a good chance to review some NLP concepts and learn something new, and I decided to start from this baseline.

2. Highlights

Here are the key concepts and conclusions in the paper.

2.1 Introduction

  • Bigram features are still strong performers on snippet sentiment classification tasks
  • NB performs better than SVM on short snippet sentiment tasks, while SVM performs better on longer documents

  • Combining generative and discriminative models by adding NB log-count ratio features to an SVM (NBSVM) gives a strong and robust baseline

  • The paper confirms that MNB (Multinomial NB) is usually better and more stable than Multivariate Bernoulli NB

2.2 Methods

A base classifier:

\[
y = \text{sign}\left(\mathbf{w}^\top \mathbf{x} + b\right),
\]

where \(y\in\{+1, -1\}\). Let \(V\) be the feature set and \(\mathbf{f}^{(i)}\in \mathbb{R}^{|V|}\) be the feature count vector of the \(i\)th sample, where \(f^{(i)}_{j}\) is the number of occurrences of feature \(V_{j}\) in the \(i\)th sample.

Define the count vector

\[
\mathbf{p} = \alpha + \sum_{i: y^{(i)} = 1}\mathbf{f}^{(i)}, ~\mathbf{q} = \alpha + \sum_{i: y^{(i)} = -1}\mathbf{f}^{(i)},
\]

as you can see, the count vectors are just the sums of the feature count vectors over the training samples of each label, and \(\alpha\) is the smoothing parameter.

The log-count ratio is defined as

\[
\mathbf{r} = \log\left(\frac{\mathbf{p} / ||\mathbf{p}||_{1}}{\mathbf{q} / ||\mathbf{q}||_{1}}\right)
\]
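As a concrete illustration, here is a minimal sketch of how \(\mathbf{p}\), \(\mathbf{q}\) and \(\mathbf{r}\) can be computed from a sparse document-term count matrix. This is my own code, not from the paper or any particular kernel; the function name and arguments are just chosen to follow the notation above.

```python
import numpy as np
from scipy import sparse

def log_count_ratio(F, y, alpha=1.0):
    """NB log-count ratio r from a count matrix F (n_samples x |V|)
    and labels y in {+1, -1}; alpha is the smoothing parameter."""
    F = sparse.csr_matrix(F)
    p = alpha + np.asarray(F[y == 1].sum(axis=0)).ravel()   # positive-class counts
    q = alpha + np.asarray(F[y == -1].sum(axis=0)).ravel()  # negative-class counts
    return np.log((p / p.sum()) / (q / q.sum()))
```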

2.2.1 SVM

For text classification, \(\mathbf{x}^{(i)} = \mathbf{f}^{(i)}\), and \(\mathbf{w}\), \(b\) are obtained from

\[
\arg\min_{\mathbf{w}, b} C \sum^{n}_{i=1} \max\left(0, 1 - y^{(i)}(\mathbf{w}^\top \mathbf{f}^{(i)} + b)\right)^2 + ||\mathbf{w}||^{2}_{2} ,
\]

this form (L2 regularization and L2 loss) works best in the paper's experiments (L1 loss is less stable).
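Assuming scikit-learn is available (this is my own sketch, not code from the paper), this objective corresponds to LinearSVC with an L2 penalty and squared hinge loss; the toy documents below are only there to make the snippet runnable:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy data, only to make the snippet self-contained.
docs = ["good great movie", "not good", "great plot", "bad boring plot"]
y = np.array([1, -1, 1, -1])

vec = CountVectorizer(ngram_range=(1, 2))    # unigram + bigram counts
F = vec.fit_transform(docs)                  # the feature count vectors f^(i)

# penalty='l2' + loss='squared_hinge' matches the L2-regularized,
# L2-loss objective above; C is the same trade-off parameter.
svm = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0)
svm.fit(F, y)
```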

2.2.2 NBSVM

Now \(\mathbf{x}^{(i)} = \mathbf{f}^{(i)}\circ\mathbf{r}\), where \(\circ\) is the element-wise product. This already does very well for long documents, and an interpolation between MNB and SVM performs excellently for all documents. The interpolated model is

\[
\mathbf{w}' = (1 - \beta)\frac{||\mathbf{w}||_{1}}{|V|} + \beta\mathbf{w},
\]

where \(\beta\in [0, 1]\) is the interpolation parameter. This interpolation can be seen as a form of regularization: trust NB unless SVM is very confident.
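Putting the pieces together, here is a minimal end-to-end NBSVM sketch. It is my own illustration under the same assumptions as the snippets above (toy data, my own variable names), not the authors' code:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["good great movie", "not good", "great plot", "bad boring plot"]
y = np.array([1, -1, 1, -1])

vec = CountVectorizer(ngram_range=(1, 2))
F = vec.fit_transform(docs).astype(np.float64)          # f^(i)

# NB log-count ratio r with smoothing alpha = 1, as defined above.
alpha = 1.0
p = alpha + np.asarray(F[y == 1].sum(axis=0)).ravel()
q = alpha + np.asarray(F[y == -1].sum(axis=0)).ravel()
r = np.log((p / p.sum()) / (q / q.sum()))

# NBSVM features: x^(i) = f^(i) ∘ r (element-wise product).
X_nb = F.multiply(r).tocsr()

svm = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0).fit(X_nb, y)

# Interpolate between MNB and SVM: w' = (1 - beta) * (||w||_1 / |V|) + beta * w.
beta = 0.25
w = svm.coef_.ravel()
w_prime = (1 - beta) * (np.abs(w).sum() / w.size) + beta * w
b = svm.intercept_[0]

preds = np.sign(X_nb @ w_prime + b)                     # sign(w'^T x + b)
```

Replacing LinearSVC with Logistic Regression in this sketch gives the NB-LR variant mentioned in the introduction.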

2.3 Conclusions

  • MNB is better at snippets, SVM is better at full-length reviews

  • The benefit of bigrams depends on the task: word bigram features are not commonly used in text classification, probably because they have mixed and overall limited utility in topical classification tasks. Sentiment classification gains much more from bigrams, because they can capture modified verbs and nouns

  • NBSVM is a robust performer on both snippets and longer documents. One disadvantage is having to choose the interpolation parameter \(\beta\): the performance on longer documents is virtually identical for \(\beta\in [1/4, 1]\), while \(\beta = 1/4\) is on average 0.5% better for snippets than \(\beta=1\). Using \(\beta\in[1/4, 1/2]\) makes NBSVM more robust than more extreme values

  • Multivariate Bernoulli NB usually performs worse than MNB. The only place where it’s comparable to MNB is for snippet tasks using only unigrams

  • For MNB and NBSVM, using binarized feature counts is slightly better than using raw counts; the difference is negligible for snippets (see the short sketch after this list)

  • Using Logistic Regression in place of SVM gives similar results
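For reference, binarization here just means clipping counts to presence/absence, \(\hat{\mathbf{f}}^{(i)} = \mathbf{1}\{\mathbf{f}^{(i)} > 0\}\). A tiny sketch (mine, not the paper's code):

```python
import numpy as np
from scipy import sparse

# Binarize a count matrix: keep only presence/absence of each feature.
F = sparse.csr_matrix(np.array([[2, 0, 1],
                                [0, 3, 0]]))
F_bin = (F > 0).astype(np.float64)
print(F_bin.toarray())   # [[1. 0. 1.]
                         #  [0. 1. 0.]]
```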