[Paper reading]: Field-aware Factorization Machines for CTR Prediction.
Linear models (e.g., logistic regression) and factorization machines are widely used in CTR prediction tasks. This paper proposes Field-aware Factorization Machines (FFM), a variant of factorization machines that is famous for winning several CTR prediction competitions; the key component is the notion of a field.
1. Introduction
Logistic regression is widely used in CTR prediction tasks due to its simplicity and high efficiency. Given a dataset \( (\mathbf{x}_{i}, y_{i})\) for \(i = 1,2,\cdots,m\), where the label \(y_{i}\in\{-1, +1\}\), logistic regression is learned by minimizing the logistic loss (a.k.a. log loss or cross-entropy loss):
\[
\mathbf{w} = \arg\min_{\mathbf{w}} \sum^{m}_{i=1}\log\left(1 + \exp(-y_{i}\phi_{\text{LM}}(\mathbf{w}, \mathbf{x}_{i}))\right) + \frac{\lambda}{2}||\mathbf{w}||^{2}_{2},
\]
where \(\phi_{\text{LM}}(\mathbf{w}, \mathbf{x}_{i}) = \mathbf{x}_{i}^{\top}\mathbf{w}\).
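As a concrete illustration, here is a minimal NumPy sketch of this regularized objective; the function name and array shapes are illustrative, not from the paper.

```python
import numpy as np

def logistic_loss(w, X, y, lam=1.0):
    """Regularized logistic loss with a linear model phi_LM(w, x) = x^T w.

    Assumed (illustrative) shapes: X is (m, n), y in {-1, +1} with shape (m,), w is (n,).
    """
    z = y * (X @ w)                      # y_i * phi_LM(w, x_i)
    loss = np.sum(np.log1p(np.exp(-z)))  # sum_i log(1 + exp(-y_i * x_i^T w))
    reg = 0.5 * lam * np.dot(w, w)       # (lambda / 2) * ||w||_2^2
    return loss + reg
```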
Here only raw features are used for modeling, and each feature is assumed to act independently. However, feature interactions (a.k.a. feature conjunctions) are crucial for CTR prediction. Brute-force enumeration of all feature pairs (Poly2) or a polynomial-kernel SVM could work, but factorization machines (FM) are a more efficient model that factorizes the interaction matrix.
A variant of FM called pairwise interaction tensor factorization (PITF) was proposed earlier for personalized tag recommendation; it considers not only the interaction between user and item, but also the tag: (user, item), (user, tag) and (item, tag). FFM is a more general predictor built on the same idea.
2. FFM
2.1 Motivation and Basis
As mentioned before, Poly2 and FM with \(n\) features model 2-way feature interactions as follows:
\[
\begin{aligned}
\phi_{\text{Poly2}}(\mathbf{w}, \mathbf{x}) &= \sum^{n}_{j_{1}=1}\sum^{n}_{j_{2}=j_{1} +1}w_{j_{1}j_{2}}x_{j_{1}}x_{j_{2}} \\
\phi_{\text{FM}}(\mathbf{w}, \mathbf{x}) &= \sum^{n}_{j_{1}=1}\sum^{n}_{j_{2}=j_{1} +1}\langle \mathbf{w}_{j_{1}}, \mathbf{w}_{j_{2}}\rangle x_{j_{1}}x_{j_{2}} \\
\end{aligned}
\]
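To make the two formulas concrete, here is a minimal NumPy sketch of both scores (naive double loops), plus the well-known \(O(nk)\) reformulation of the FM term that is referenced later; names and shapes are illustrative, not from the paper.

```python
import numpy as np

def phi_poly2(W_pair, x):
    """Poly2: one scalar weight per feature pair. W_pair is (n, n), x is (n,)."""
    n = x.shape[0]
    return sum(W_pair[j1, j2] * x[j1] * x[j2]
               for j1 in range(n) for j2 in range(j1 + 1, n))

def phi_fm(W, x):
    """FM (naive O(n^2 k)): one k-dim latent vector per feature. W is (n, k), x is (n,)."""
    n = x.shape[0]
    return sum((W[j1] @ W[j2]) * x[j1] * x[j2]
               for j1 in range(n) for j2 in range(j1 + 1, n))

def phi_fm_fast(W, x):
    """Equivalent O(nk) form:
    0.5 * sum_f [ (sum_j w_{jf} x_j)^2 - sum_j w_{jf}^2 x_j^2 ]."""
    t = W.T @ x                   # (k,) : sum_j w_{jf} x_j
    u = (W ** 2).T @ (x ** 2)     # (k,) : sum_j w_{jf}^2 x_j^2
    return 0.5 * np.sum(t * t - u)
```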
Consider the example data from the paper (a small artificial CTR dataset), where “+” and “-” represent the numbers of clicked and unclicked impressions.
For most CTR datasets like this, features can be grouped into fields. In the example data, the features ESPN, Vogue and NBC all belong to the field Publisher; the features Nike, Gucci and Adidas all belong to the field Advertiser. FFM incorporates this field information. If one sample is
Clicked, Publisher: ESPN, Advertiser: Nike, Gender: Male,
then FM models the feature interactions as follows:
\[
\langle \mathbf{w}_{\text{ESPN}}, \mathbf{w}_{\text{Nike}} \rangle + \langle \mathbf{w}_{\text{ESPN}}, \mathbf{w}_{\text{Male}} \rangle + \langle \mathbf{w}_{\text{Nike}}, \mathbf{w}_{\text{Male}} \rangle,
\]
which are the dot products of all pairs of the rows “ESPN”, “Nike” and “Male” in the latent factor matrix \(\mathbf{W}\in \mathbb{R}^{n\times k}\).
In FFM, unlike FM where each feature has only one latent vector, each feature has several latent vectors, one per field of the other feature it interacts with:
\[
\langle \mathbf{w}_{\text{ESPN}, A}, \mathbf{w}_{\text{Nike}, P} \rangle + \langle \mathbf{w}_{\text{ESPN}, G}, \mathbf{w}_{\text{Male}, P} \rangle + \langle \mathbf{w}_{\text{Nike}, G}, \mathbf{w}_{\text{Male}, A} \rangle,
\]
Therefore, instead of one latent factor matrix, there are three latent factor matrices here, one per field pair: (Publisher, Advertiser), (Publisher, Gender), (Advertiser, Gender). This is based on the assumption that interactions with different fields can be different. In general:
\[
\phi_{\text{FFM}}(\mathbf{w}, \mathbf{x}) = \sum^{n}_{j_{1}=1}\sum^{n}_{j_{2} = j_{1}+1}\langle \mathbf{w}_{j_{1}f_{2}}, \mathbf{w}_{j_{2}f_{1}} \rangle x_{j_{1}} x_{j_{2}},
\]
where \(f_{1}\) and \(f_{2}\) are the fields of \(j_{1}\) and \(j_{2}\), respectively. If the latent dimension is \(k\), the number of interaction parameters in FM is \(nk\), with time complexity \(O(nk)\); the number of interaction parameters in FFM is \(nfk\), with time complexity \(O(n^{2}k)\), where \(f\) is the number of fields. However, since in FFM each latent vector only needs to learn the effect with a specific field, usually
\[
k_{\text{FFM}} \ll k_{\text{FM}}.
\]
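Below is a minimal NumPy sketch of \(\phi_{\text{FFM}}\), assuming the latent vectors are stored as an \((n, f, k)\) array and that `field[j]` gives the field index of feature \(j\); this is only an illustration of the formula, not the paper's optimized implementation.

```python
import numpy as np

def phi_ffm(W, x, field):
    """FFM score. W is (n, f, k): one k-dim latent vector per (feature, field) pair;
    x is (n,) feature values; field[j] is the field index of feature j.
    Only non-zero features are enumerated, so the cost is O(nnz^2 * k) per sample.
    """
    nz = np.flatnonzero(x)
    s = 0.0
    for a in range(len(nz)):
        for b in range(a + 1, len(nz)):
            j1, j2 = nz[a], nz[b]
            # latent vector of j1 toward j2's field, and of j2 toward j1's field
            s += (W[j1, field[j2]] @ W[j2, field[j1]]) * x[j1] * x[j2]
    return s

# Toy usage mirroring the (ESPN, Nike, Male) example: 3 one-hot features, 3 fields.
W = 0.01 * np.random.randn(3, 3, 4)   # n=3 features, f=3 fields, k=4
x = np.ones(3)                        # all three features are active
field = np.array([0, 1, 2])           # Publisher, Advertiser, Gender
score = phi_ffm(W, x, field)
```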
2.2 Incorporate Field Information
For categorical features, the fields are usually easy to identify based on the format of data.
For numerical features, a naive way is to treat each feature as a dummy field, but there may be too many unique values. Another possible way is to discretize each numerical feature into a categorical one, in other words to use a histogram to approximate the exact distribution.
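A minimal sketch of such discretization, assuming quantile (histogram) binning; the helper name and bin count are illustrative, not from the paper.

```python
import numpy as np

def discretize(values, n_bins=10):
    """Bucket a numerical feature into categorical bin ids using quantile edges;
    each bin id can then be used as a categorical feature of one field."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(values, edges)    # bin index in [0, n_bins - 1]

# Example: raw ages become bin ids that act as categorical features of an "age" field.
ages = np.array([18.0, 23.0, 35.0, 35.0, 47.0, 61.0])
bins = discretize(ages, n_bins=3)        # e.g. array([0, 0, 1, 1, 2, 2])
```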
There are also cases where all features are in one field; this happens in NLP, where the only field is the text itself, say “sentence”. If there is only one field, then FFM is effectively reduced to FM.
3. Experiments and Conclusions
Experiments are mainly done on two Kaggle CTR competition datasets, Criteo and Avazu, and the comparison is among three models: logistic regression, FM and FFM. The results show that:
- Since logistic regression is a convex optimization problem, different optimization methods should in theory reach the same solution; in practice they do not, which indicates that the stopping condition of an optimization method can affect model performance.
- FFM outperforms the other models in terms of log loss, but it also takes longer to train.
- FM is a good balance between log loss and speed.
Finally, a guideline for applying FFM to different kinds of datasets is summarized:
- FFM should be effective for datasets that contain categorical features which are transformed to binary features.
- If the transformed dataset is not sparse enough, FFM seems to bring less benefit.
- It is more difficult to apply FFM to numerical datasets.