Bangda Sun

Practice makes perfect

Estimating CTR for New Ads in Search Advertising

[Paper reading]: Predicting Clicks: Estimating the Click-Through Rate for New Ads.

This paper shows how to use features of ads, terms, and advertisers to learn a model that accurately predicts the CTR for new ads. It is an older paper compared with the previous two papers I read.

1. Introduction

Most search engines are funded through textual advertising placed next to their search results. This paper focuses on the most common online advertising model: pay-per-performance with cost-per-click (CPC) billing, where the search engine gets paid every time an ad is clicked by a user.

For ads that have been displayed repeatedly, the empirical CTR could serve as the estimate, but it has very high variance and cannot be applied to new ads. For new ads entering the system, no historical information exists.

This paper addresses the problem of estimating CTR for newly created ads and advertising accounts. It shows that a model using ad information (metadata), information about the page the ad points to, and statistics of related ads can reasonably predict the future CTR for that ad.

2. Motivation

The key task for a search engine advertising system is to determine what advertisements should be displayed, in what order, for each query that search engine receives. As with search results, the probability that a user clicks on an advertisement declines rapidly with display position.

To maximize ad quality and total revenue, most search engines order their ads primarily by expected revenue:

\[
E_{\text{ad}}(\text{revenue}) = P_{\text{ad}}(\text{click})\cdot CPC_{\text{ad}},
\]

The \(CPC\) for an ad is its bid (in a first-price auction) or the bid of the next-highest bidder (in a second-price auction).
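The expected-revenue ordering above can be sketched in a few lines. The ad names, click probabilities, and CPC values below are purely illustrative:

```python
# Hypothetical sketch: rank candidate ads by expected revenue,
# E[revenue] = P(click) * CPC. All names and numbers are made up.

def rank_by_expected_revenue(ads):
    """ads: list of (name, p_click, cpc) tuples."""
    return sorted(ads, key=lambda a: a[1] * a[2], reverse=True)

candidates = [
    ("ad_a", 0.05, 0.40),   # expected revenue 0.020
    ("ad_b", 0.02, 1.50),   # expected revenue 0.030
    ("ad_c", 0.10, 0.25),   # expected revenue 0.025
]
ranked = rank_by_expected_revenue(candidates)
print([name for name, _, _ in ranked])  # ['ad_b', 'ad_c', 'ad_a']
```

Note how the ad with the highest CTR is not necessarily ranked first: a lower-CTR ad with a high enough bid can have larger expected revenue.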

The search advertising market has grown significantly in recent years: many new advertisers enter every day, existing advertisers frequently launch new advertising campaigns, and existing ads are sometimes targeted to new queries. As a result there is a large inventory of ads for which the search engine has no prior information.

Previous research estimates the CTR of new ads by using the CTRs of existing ads with the same bid terms or topic clusters. But even within the same term there can be large variation. To account for these within-keyword variations, it is important to incorporate features that depend on more than just the terms the ad was bid on.

3. Search Advertising Framework

Whenever an ad is displayed on the search results page, it has some chance of being viewed by the user. As a simplification, consider the probability that an ad is clicked on to be dependent on two factors:

  • the probability that it is viewed
  • the probability that it is clicked on, given that it is viewed

\[
P(\text{click}|\text{ad, position}) = P(\text{click}|\text{ad, seen, position})\cdot P(\text{seen}|\text{ad, position}),
\]

Assume the probability that an ad is seen depends only on its position, i.e. \(P(\text{seen}|\text{ad, position}) = P(\text{seen}|\text{position})\), and that the probability it is clicked on given that it is seen is independent of position, i.e. \(P(\text{click}|\text{ad, seen, position}) = P(\text{click}|\text{ad, seen})\). CTR is then defined as \(P(\text{click}|\text{ad, seen})\).
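Under these assumptions, the position-independent CTR can be estimated by dividing clicks by "effective" views, i.e. views discounted by the per-position seen probability. A minimal sketch, where the position curve `P_SEEN` is an assumed stand-in (the paper does not give these numbers):

```python
# Sketch of the decomposition above: with P(seen | position) assumed
# known (this position curve is hypothetical), the position-independent
# CTR P(click | ad, seen) is estimated as clicks / effective views.

P_SEEN = {1: 0.9, 2: 0.5, 3: 0.3}  # assumed view probability per position

def position_normalized_ctr(impressions):
    """impressions: list of (position, clicked) pairs for one ad."""
    clicks = sum(clicked for _, clicked in impressions)
    effective_views = sum(P_SEEN[pos] for pos, _ in impressions)
    return clicks / effective_views if effective_views else 0.0

# One click over four impressions at mixed positions:
impressions = [(1, 1), (1, 0), (2, 0), (3, 0)]
print(position_normalized_ctr(impressions))
```

This is why an ad shown mostly in low positions can have a higher estimated CTR than its raw click/view ratio suggests: its views count for less.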

4. Data and Baseline Model

The data is collected from the Microsoft Web search engine; each ad contains the following information:

  • landing page: the URL that a user is redirected to upon clicking the ad
  • bid term (key-words): the query for which this ad should be displayed
  • title and body: ad title and description shown to the user
  • display URL: the URL shown to the user at the bottom of the ad
  • clicks: the number of times the ad has been clicked since it was entered into the system
  • views: the number of times the ad has been seen since it was entered into the system

The same ad content may appear for different bid terms. In fact, the user interface for the ad system encourages this: account holders create an order, which is the ad info, and an associated collection of the terms for which the ad should be displayed. Each pairing of the ad text with a term is a unique ad, as the CTR for different terms varies significantly. The train/test split is at the advertiser level: all ads by the same advertiser go into the same split. Also, "premium advertisers" are removed.

The base model is logistic regression, trained with the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method and cross-entropy as the loss function. All features are normalized to have zero mean and unit standard deviation, and logarithms / squares of the features are added to capture non-linearity. The evaluation metric is the average KL divergence between predicted CTR and true CTR on the test set (lower is better).
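A minimal sketch of this baseline pipeline, assuming standardized features and observed CTRs as targets (the data here is synthetic; the real features are described in the following sections):

```python
# Sketch of the baseline: logistic regression fit with L-BFGS on a
# cross-entropy loss against observed CTRs, evaluated by average KL
# divergence. Feature construction is omitted; inputs are assumed to
# be already standardized.
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_ctr_model(X, ctr):
    """X: (n, d) standardized features; ctr: observed CTRs in (0, 1)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column

    def cross_entropy(w):
        p = np.clip(sigmoid(Xb @ w), 1e-9, 1 - 1e-9)
        return -np.mean(ctr * np.log(p) + (1 - ctr) * np.log(1 - p))

    res = minimize(cross_entropy, np.zeros(Xb.shape[1]), method="L-BFGS-B")
    return res.x

def avg_kl(ctr, pred):
    """Average KL divergence between true and predicted CTRs (lower is better)."""
    ctr = np.clip(ctr, 1e-9, 1 - 1e-9)
    pred = np.clip(pred, 1e-9, 1 - 1e-9)
    return np.mean(ctr * np.log(ctr / pred)
                   + (1 - ctr) * np.log((1 - ctr) / (1 - pred)))
```

Cross-entropy against fractional CTR targets (rather than 0/1 click labels) is what makes this a CTR-regression model; the average KL metric then directly measures how far the predicted click distribution is from the observed one.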

5. Estimating Term CTR

There is significant variation in the average CTR for different bid terms.

5.1 Term CTR

The first feature is the CTR of other ads (not including those of the current advertiser) that have the same bid term,

\[
f_{0}(\text{ad}) = \frac{\alpha \overline{CTR} + N(\text{ad}_{\text{term}})\, CTR(\text{ad}_{\text{term}})}{\alpha + N(\text{ad}_{\text{term}})},
\]

where \(N(\text{ad}_{\text{term}})\) is the number of ads with the given bid term, \(CTR(\text{ad}_{\text{term}})\) is the average CTR for those ads, \(\overline{CTR}\) is the average CTR over all ads in the training set, and \(\alpha\) sets the strength of the prior (it acts like \(\alpha\) pseudo-observations of the global average).
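The smoothing formula above is a simple shrinkage toward the global mean; a sketch with illustrative numbers:

```python
# Sketch of the smoothed term-CTR feature f0 above: alpha acts as a
# prior worth alpha pseudo-observations of the global mean CTR.
# All values here are illustrative.

def smoothed_term_ctr(term_ctr, n_term_ads, mean_ctr, alpha=100):
    """Shrink the term's average CTR toward the global mean CTR."""
    return (alpha * mean_ctr + n_term_ads * term_ctr) / (alpha + n_term_ads)

# A rare term (few ads) stays close to the global mean...
print(smoothed_term_ctr(0.20, n_term_ads=2, mean_ctr=0.05))
# ...while a well-observed term keeps most of its own CTR.
print(smoothed_term_ctr(0.20, n_term_ads=5000, mean_ctr=0.05))
```

This is why a term seen only a couple of times cannot drag the estimate far from \(\overline{CTR}\), which is exactly the high-variance problem the prior is meant to fix.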

To take advantage of other ads that have related terms, consider ads with subsets and/or supersets of the bid term. Let \(R_{mn}(t)\) be the set of ads whose bid terms share at least one word with \(t\), with exactly \(m\) words of \(t\) missing from the ad's term and exactly \(n\) words of the ad's term missing from \(t\):

\[
R_{mn}(t) = \{ \text{ad}: |\text{ad}_{\text{term}} \cap t| > 0 \text{ and } |t - \text{ad}_{\text{term}}| = m \text{ and } |\text{ad}_{\text{term}} - t| = n \},
\]

Given \(R_{mn}\), the feature (average CTR of related ads) for a given ad is

\[
CTR_{mn}(\text{term}) = \frac{1}{|R_{mn}(\text{term})|}\sum_{x\in R_{mn}(\text{term})} CTR_{x},
\]

and \(CTR_{mn}(\text{term})\) can also be smoothed like term CTR.
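Treating bid terms as word sets makes the \(R_{mn}\) definition easy to sketch. The ads and CTR values below are made up for illustration:

```python
# Sketch of the related-term sets R_mn above: terms are word sets; an
# ad is in R_mn(t) if it shares at least one word with t, with m words
# of t missing from the ad's term and n extra words in the ad's term.
# The ad list and CTRs are illustrative.

def related_ads(t, ads, m, n):
    t_words = set(t.split())
    out = []
    for term, ctr in ads:
        w = set(term.split())
        if w & t_words and len(t_words - w) == m and len(w - t_words) == n:
            out.append((term, ctr))
    return out

def ctr_mn(t, ads, m, n):
    """Average CTR over R_mn(t), or None if the set is empty."""
    rel = related_ads(t, ads, m, n)
    return sum(c for _, c in rel) / len(rel) if rel else None

ads = [("red shoes", 0.04), ("shoes", 0.06), ("cheap red shoes", 0.03)]
print(ctr_mn("red shoes", ads, m=0, n=0))  # exact match -> 0.04
print(ctr_mn("red shoes", ads, m=1, n=0))  # subset "shoes" -> 0.06
print(ctr_mn("red shoes", ads, m=0, n=1))  # superset "cheap red shoes" -> 0.03
```

So \(m=0, n=0\) recovers exact-match ads, while other \((m, n)\) pairs pull in progressively looser subset/superset relatives.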

6. Estimating Ad Quality

Even within a term there is significant variation in ad CTR. This raises a question: can we use features of the ad itself to come up with an even better estimate of a given ad's CTR? Beyond the ad title, description, and URL, there are at least five rough categories of influence on the user:

  • appearance: is the ad aesthetically pleasing?
  • attention capture: does the ad draw the user in?
  • reputation: is the advertiser a known or reputable brand? if the user is not familiar with the advertiser, would they guess that the advertiser is a good brand?
  • landing page quality: the quality of the landing page may be indicative of the probability the user will click the ad.
  • relevance: how relevant is the ad to search query term?

Each category can derive several features:

  • appearance:
    • how many words are in the title and body?
    • does the advertisement have good capitalization?
    • does the advertisement contain too many exclamation points, dollar signs or other punctuations?
    • does the advertisement use short words or long words?
  • attention capture:
    • does the title or body contain action words such as “buy”, “join”, “subscribe”, etc.?
    • does the ad provide numbers (such as specific discounts, prices, etc)?
  • reputation:
    • does the display URL end with .com, .net, .org, .edu?
    • how long is the URL?
    • how many segments are in the URL?
    • does the URL contain dashes or numbers? (good, short domains can be expensive)
  • landing page quality:
    • does the page contain Flash?
    • what fraction of the page is covered with images?
    • is it W3C compliant?
    • does the page use style sheets?
    • is the page covered with ads?
  • relevance:
    • does the bid term appear in the title exactly?
    • do any subsets of the term appear in the title and body?
    • what fraction of the term's words appear in the body?

7. Measuring Order Specificity

Now look at how the CTR of an ad may vary depending on the variety of terms it was originally associated with. When an advertiser wishes to create advertisements, they enter an order, which is the title, description, etc., and a set of terms used to establish relevance to the user query. This generates \(N\) ads, one for each term in the order. The order text may also contain a parameter that is filled in with the term.

For example, there is an order:

Title: Buy shoes now,
Description: Shop at our discount shoe warehouse!
URL: shoes.com
Terms: {buy shoes, shoes, cheap shoes}

This will result in three ads, all with the same title and description, but with different bid terms.

In some cases, the ads may be more broadly targeted. For example,

Title: Buy {item} now
Description: Shop at our discount warehouse!
URL: store.com
Terms: {shoes, TVs, grass, paint}

Since the second order generates less targeted ads, it might have a lower CTR than the first.

Category entropy can be used to capture how targeted an order is. To categorize a term, perform a web search for it and run a text classification algorithm on the resulting result snippets, assigning each term to one of \(C\) categories. Then measure the entropy of the distribution of categories across the order's bid terms, and use that as a feature of the model.
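Given the category label for each bid term (however the classifier produces it), the entropy computation itself is straightforward. The category names below are illustrative:

```python
# Sketch of order specificity via category entropy: given one category
# label per bid term (from some upstream classifier, not shown), compute
# the entropy of the category distribution. Category names are made up.
import math
from collections import Counter

def category_entropy(categories):
    """Entropy (in bits) of the terms' category distribution."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# The focused shoe order: every term lands in one category -> entropy 0.
print(category_entropy(["apparel", "apparel", "apparel"]))
# The broad warehouse order: four distinct categories -> entropy 2 bits.
print(category_entropy(["apparel", "electronics", "garden", "home"]))
```

Lower entropy means a more tightly targeted order, matching the intuition that the shoe order should out-click the generic warehouse order.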

8. More Discussions

One significant direction of future work is making the CTR estimate dependent on the user's query. In this paper the CTRs are predicted independently of the query. In the broad-match case, the query has a looser relation to the bid term.

The model predicts the expected CTR for a new ad rather than producing an ad ranking. It could also be used to inform advertisers what they should change about an ad they are creating in order to increase its CTR.

More features could be incorporated. Those found useful for the static and dynamic ranking of web pages might prove particularly beneficial. Data on how often users have visited the ad's landing page or its domain, how long they remain on that page, whether they click back or follow a link off the page, etc., could prove useful.