
A Simple Framework for Predicting CTR in Display Advertising

[Paper reading]: Simple and Scalable Response Prediction for Display Advertising.

This paper gives a general introduction to the two core prediction tasks in display advertising: CTR (click-through rate) and CVR (conversion rate) prediction. I read it mainly for its comprehensive overview of computational advertising. I had participated in several CTR competitions on Kaggle and Tianchi, but at the time I simply treated them as binary classification problems and had little domain knowledge about the underlying business.

1. Introduction

Display advertising is a form of online advertising where advertisers pay publishers for placing ads on publishers’ web pages. Spot markets offer advertisers a wide range of payment options. If the goal of an advertising campaign is getting its message in front of the target audience, then paying per impression (CPM) is normally the advertiser’s choice. Many advertisers, however, prefer not to pay for an impression unless it leads the user to their website, so there are other options such as cost-per-click (CPC) and cost-per-conversion (CPA). An auction that supports these conditional payment options needs to convert advertiser bids to an expected price per impression (eCPM), which is one of the key components in ranking ads. For CPM ads, the eCPM is identical to the bid; for CPC or CPA ads, the eCPM depends on the CTR and CVR.
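Concretely, with CVR taken as the post-click conversion rate (as defined in Section 3.1 below), one way to write out the conversion of bids to eCPM is the following (my own summary of the ranking quantity, not a formula copied from the paper):

\[
\text{eCPM}_{\text{CPM}} = \text{bid}, \qquad
\text{eCPM}_{\text{CPC}} = \text{CTR}\times\text{bid}, \qquad
\text{eCPM}_{\text{CPA}} = \text{CTR}\times\text{CVR}\times\text{bid}.
\]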

The challenges in display advertising differ from those in search and search advertising (the other two major types of online advertising). In display advertising, the auctioneer does not really have access to the content of the ads; the ads might not even be hosted by the auctioneer. Furthermore, the content of an ad might be generated dynamically depending on the properties of the user.

While there have been some attempts to capture the content of the ad or its landing page, this requires non-trivial effort and is not always possible. Content, web graph, and anchor text information are therefore absent in display advertising, leaving the auctioneer mostly with unique identifiers as representation.

Therefore this paper proposes a simple ML framework to address the issues above, one that scales to billions of samples and hundreds of millions of parameters.

Factors that can have an impact on users’ response to ads are included explicitly as features:

  • context features: query in sponsored search, publisher page in content match, etc;
  • content features: text of ads, etc;
  • user features: demographical features, etc;
  • feedback features: historical aggregated log data, etc.

2. Related Work

Logistic regression and tree models are widely used in the computational advertising area. For tree models, one challenge is categorical features with very high cardinality; there are different ways to handle this, such as feature hashing or using gain ratio with thresholds as the splitting criterion. For logistic regression, the advantages are obvious: it can be parallelized easily to handle large scale data, since advertising data is usually very sparse; and it is very flexible, as it can incorporate different kinds of regularization. Bayesian logistic regression is also an option.

There are more advanced techniques listed as references in the paper.

3. Data and Features

The data used in this paper comes from traffic logs of Yahoo!’s Right Media Exchange. Publishers label their web pages with a site ID, and they can also tag different parts of the same page with different section IDs.

3.1 Target

The targets are click and post-click conversion (PCC). A click is the event that occurs when a user interacts with an ad by means of a mouse click. A post-click conversion is an action the user takes after visiting the ad’s landing page, such as subscribing to an email list, making a reservation, or purchasing a product. The quantities measuring these two events are the CTR and CVR mentioned before.

In general, several conversion events could be associated with the same click, since advertisers may show a sequence of conversion-generating pages to the user after a click. Conversions can also be delayed: they can happen minutes, hours, or even days after the click. In the example dataset, 86.7% of conversions are triggered within 10 minutes of the click, and 39.2% occur within 1 minute; using a 1 hour delay window would cover 95.5% of conversion events. These observations are important when building train / test datasets.
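As a concrete illustration of how such an attribution window could be used when building the datasets, here is a minimal sketch (the record layout and field names such as `click_id` and `timestamp` are hypothetical, not taken from the paper):

```python
from datetime import timedelta

# Covers ~95.5% of conversions according to the delay statistics above.
ATTRIBUTION_WINDOW = timedelta(hours=1)

def label_clicks(clicks, conversions):
    """Attach a binary PCC label to each click using a fixed attribution window."""
    # Earliest conversion time observed for each click.
    first_conv = {}
    for conv in conversions:
        cid, t = conv["click_id"], conv["timestamp"]
        if cid not in first_conv or t < first_conv[cid]:
            first_conv[cid] = t

    labeled = []
    for click in clicks:
        conv_time = first_conv.get(click["click_id"])
        is_conv = (conv_time is not None and
                   conv_time - click["timestamp"] <= ATTRIBUTION_WINDOW)
        labeled.append({**click, "pcc_label": int(is_conv)})
    return labeled
```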

3.2 Features

The features are grouped into four sets:

  • advertiser: advertiser ID, advertiser network, ad campaign, ad creative, conversion ID, ad group, ad size, ad creative type, offer type ID (ad category);
  • publisher: publisher ID, publisher network, site, section, URL, page referrer;
  • user: gender, age, region, network speed, accept cookies, geo;
  • time: serve time, click time.

4. Modeling

4.1 Logistic Regression

Logistic regression (a.k.a. the maximum entropy model) is widely used. Given a training set \((\mathbf{x}_{i}, y_{i})\) with \(\mathbf{x}_{i}\in \{0, 1\}^{d}\) a sparse binary feature vector in a \(d\)-dimensional space and \(y_{i}\in\{-1, +1\}\) a binary target, the goal is to find a weight vector \(\mathbf{w}\in \mathbb{R}^{d}\). The predicted probability of a sample being positive is

\[
P(y=1|\mathbf{x}) = \frac{1}{1 + \exp\left(-\mathbf{w}^{\top}\mathbf{x}\right)}.
\]

Logistic regression is a linear model of the log odds ratio,

\[
\log\frac{P(y=1|\mathbf{x})}{P(y=-1|\mathbf{x})} = \mathbf{w}^{\top}\mathbf{x},
\]

and, with L2 regularization, the weight vector is learned by minimizing the regularized logistic loss

\[
\sum^{n}_{i=1}\log\left(1 + \exp\left(-y_{i}\mathbf{w}^{\top}\mathbf{x}_{i}\right)\right) + \frac{\lambda}{2}||\mathbf{w}||^{2}_{2}.
\]
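As a reference point, here is a minimal dense NumPy sketch of this optimization using plain gradient descent; the paper’s actual solver is distributed and works on sparse data, so this is only meant to make the loss and prediction formulas concrete:

```python
import numpy as np

def train_logistic_regression(X, y, lam=1.0, lr=0.1, n_iters=500):
    """Minimize sum_i log(1 + exp(-y_i * w.x_i)) + lam/2 * ||w||^2 by gradient descent.

    X: (n, d) array of 0/1 features; y: (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)                    # y_i * w^T x_i
        sigma = 1.0 / (1.0 + np.exp(margins))    # derivative factor of the logistic loss
        grad = -(X.T @ (y * sigma)) + lam * w    # gradient of the regularized loss
        w -= lr * grad
    return w

def predict_proba(X, w):
    """P(y = 1 | x) = 1 / (1 + exp(-w^T x))."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```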

4.2 Hashing Trick

The idea is to use a hash function to reduce the number of values a feature can take, a.k.a. value reduction.

Compared with manually removing infrequent values, hashing requires less data preprocessing and no dictionary, so it is very straightforward. The main problem with hashing is collisions, but the paper argues they are not a major concern.
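A minimal sketch of this value reduction, assuming each sample is a dict of categorical features (the feature names below are made up for illustration, and real systems would use a much cheaper hash function than MD5):

```python
import hashlib

def hash_features(record, b=24):
    """Map a dict of categorical features to a list of active indices in {0, ..., 2^b - 1}."""
    num_buckets = 1 << b
    indices = []
    for name, value in record.items():
        key = f"{name}={value}".encode("utf-8")
        h = int(hashlib.md5(key).hexdigest(), 16)
        indices.append(h % num_buckets)
    return indices

# Toy impression with advertiser / publisher / user features.
impression = {"advertiser_id": "123", "site": "news.example.com", "gender": "F"}
print(hash_features(impression, b=24))
```

Hashing the (feature name, value) pair rather than the value alone keeps identical values of different features from always colliding.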

Consider the case where there is only one feature, the first value has been observed \(n_{1}\) times (all on negative samples), and the second value has been observed \(n_{2}\) times (all on positive samples). If there were no collision, the weight for the first value would go to \(-\infty\) and the weight for the second value to \(+\infty\), leading to a log likelihood of 0 on all these samples. However, if the two values collide, the negative log likelihood is

\[
-n_{1}\log\frac{n_{1}}{n_{1} + n_{2}} - n_{2}\log\frac{n_{2}}{n_{1} + n_{2}},
\]

which is indeed the log likelihood achieved by the solution where all the weights are at 0 except the one where the collision occurs, which has value \(\log(n_{2} / n_{1})\). This loss is large only when both \(n_{1}\) and \(n_{2}\) are large, and that scenario can be considered a worst case, because the two values are extremely predictive and there is no redundancy across features.
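As a quick numerical check with hypothetical counts:

```python
import math

n1, n2 = 1000, 50  # made-up counts: n1 negatives on value 1, n2 positives on value 2

# Negative log likelihood when the two values collide into a single weight,
# so the best attainable prediction is the pooled rate n2 / (n1 + n2).
nll = -n1 * math.log(n1 / (n1 + n2)) - n2 * math.log(n2 / (n1 + n2))
print(nll, nll / (n1 + n2))  # total loss and average loss per sample
```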

5. Experiments

Hashing trick. The results show that for the same model size, the hashing-based model is slightly superior. There are two other advantages:

  • convenience: no need to find the most important values and keep them in a dictionary;
  • real-time efficiency: hashing is faster than a dictionary lookup.

Subsampling. Sometimes similar test errors can be achieved with smaller training sets, in which case there is no need for large scale learning. Two subsampling strategies are considered:

  • subsampling of negative samples: empirical results show that a subsampling rate of 1% was a good trade-off between model training time and predictive accuracy (a minimal sketch follows this list);
  • subsampling of all samples: the results show a drop in accuracy after subsampling.
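A minimal sketch of negative subsampling, assuming samples are dicts with a `label` field in {0, 1}; the re-calibration step is a standard correction for downsampled negatives, not a formula quoted from the paper:

```python
import random

NEG_RATE = 0.01  # keep 1% of negatives, as in the paper's experiments

def subsample_negatives(samples, rate=NEG_RATE, seed=0):
    """Keep every positive sample and a random `rate` fraction of the negatives."""
    rng = random.Random(seed)
    return [s for s in samples if s["label"] == 1 or rng.random() < rate]

def recalibrate(p, rate=NEG_RATE):
    """Map a probability learned on subsampled negatives back to the full distribution."""
    return p / (p + (1.0 - p) / rate)
```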

6. Feature Selection

This section discusses feature selection using conditional mutual information and a reference distribution. I’ll come back to this someday to give a more detailed summary.

7. Non-Stationarity

Display advertising is a non-stationary process, as the set of active advertisers, campaigns, publishers and users is constantly changing.

There are two perspectives to quantify the changes:

  • Ad creation rate: measured at three levels - conversion, creative and campaign identifiers. Analysis shows that advertisers tend to keep the same conversion identifiers even after they create new ad creatives and campaigns.
  • Ad lifetime: also measured at the three levels; analysis shows that conversion IDs live much longer than creative IDs and campaign IDs, which is consistent with the conclusion of the ad creation rate analysis.