Stanford NLP (Coursera) Notes (20) - Text Summarization | Bangda Sun

Text Summarization Basics.

1. Introduction to Text Summarization

The goal of text summarization is to produce an abridged version of a text that contains the information that is important or relevant to a user. Applications include:

outlines or abstracts of documents
action items from a meeting
simplifying text by compressing sentences

For single-document summarization, the task is mainly to produce an abstract, outline, or headline. For multi-document summarization, the task is to produce a gist of the content of a group of documents.

In addition to generic summarization, there is also query-focused summarization, which summarizes a document with respect to an information need expressed in a user query.

A simple baseline is to take the first sentence of the document as the summary and use it as the text snippet (like the snippets shown in Google search results).
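As a rough illustration of this baseline, here is a minimal sketch. The function name `first_sentence_snippet`, the regex-based sentence splitter, and the `max_chars` limit are assumptions made for the example; a real system would likely use a proper sentence tokenizer.

```python
import re


def first_sentence_snippet(text: str, max_chars: int = 160) -> str:
    """Return the first sentence of `text`, truncated to `max_chars`."""
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    # This is an assumption; abbreviations like "Dr." will fool it.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    snippet = sentences[0] if sentences else ""
    if len(snippet) > max_chars:
        # Leave room for the trailing ellipsis.
        snippet = snippet[: max_chars - 3].rstrip() + "..."
    return snippet


doc = "Text summarization produces a short version of a text. It has many uses."
print(first_sentence_snippet(doc))
```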

2. Generating Snippets

Summarization has three stages:

Content selection
- choose sentences to be extracted from the document
Information ordering
- choose an order to place them in the summary
Sentence realization
- clean up the sentences

A simple method just uses document order and the original sentences for the last two steps, so we can focus on the first step: content selection.
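The simple pipeline above can be sketched as follows. The per-sentence `scores` are assumed to come from some content-selection method, and the names `select_top_k` and `summarize` are hypothetical, chosen only for this example.

```python
from typing import List, Sequence


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Content selection: indices of the k highest-scoring sentences."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def summarize(sentences: Sequence[str], scores: Sequence[float], k: int = 2) -> List[str]:
    chosen = select_top_k(scores, k)
    # Information ordering: keep the original document order.
    ordered = sorted(chosen)
    # Sentence realization: reuse the original sentences unchanged.
    return [sentences[i] for i in ordered]


sentences = ["Intro.", "Key finding one.", "Side remark.", "Key finding two."]
scores = [0.1, 0.9, 0.3, 0.8]
print(summarize(sentences, scores, k=2))
```

With this split, only `select_top_k` (or the scoring that feeds it) needs to change when a different content-selection method is tried.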

Unsupervised content selection: choose sentences that have salient or informative words. Two approaches are proposed to define salient words:

TF-IDF
Topic signature: choose a set of salient words using mutual information and log-likelihood ratio
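To make the TF-IDF option concrete, here is a minimal sketch that scores each sentence by the average TF-IDF weight of its words. The tokenizer, the smoothed IDF formula `log((1 + N) / (1 + df))`, the use of a separate background `corpus`, and averaging as the way to combine word weights are all assumptions for illustration; the topic-signature approach is not shown.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def build_idf(corpus: Sequence[str]) -> Callable[[str], float]:
    """Return an idf(word) function estimated from a background corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    # Smoothed IDF: words in every document get 0, unseen words get the maximum.
    return lambda w: math.log((1 + n_docs) / (1 + doc_freq[w]))


def score_sentences(sentences: Sequence[str], idf: Callable[[str], float]) -> List[float]:
    """Average TF-IDF weight per sentence; TF is counted over the whole document."""
    tf = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        weights = [tf[w] * idf(w) for w in words]
        scores.append(sum(weights) / len(weights) if weights else 0.0)
    return scores


corpus = ["the cat sat on the mat", "the dog sat on the log", "the bird sat on the wire"]
idf = build_idf(corpus)
print(score_sentences(["The volcano erupted.", "The cat sat."], idf))
```

Sentences made up of words that are common in the background corpus score low, while sentences with rarer words score high, which is the intuition behind using TF-IDF to find salient words.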

Supervised content selection: given a labeled training set of good summaries for each document, align the sentences in the document with the sentences in the summary, extract features (position, sentence length, word informativeness, etc.), then train a binary classifier (put the sentence in the summary or not). The problems include:

labeled training data is not easy to get
alignment is not easy to do
performance is not better than unsupervised approaches

Unsupervised approaches are more commonly used in practice.
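For reference, the alignment and feature-extraction steps of the supervised setup might look like the sketch below. Using Jaccard word overlap with a `threshold` of 0.5 as the alignment rule is an assumption made for the example, as are the function names and the two features shown; the classifier itself is omitted.

```python
import re
from typing import Dict, List, Sequence, Set


def words(sentence: str) -> List[str]:
    return re.findall(r"[a-z0-9']+", sentence.lower())


def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def align_labels(doc_sents: Sequence[str], summary_sents: Sequence[str],
                 threshold: float = 0.5) -> List[int]:
    """Label a document sentence 1 if it overlaps enough with any summary sentence."""
    summary_sets = [set(words(s)) for s in summary_sents]
    labels = []
    for sent in doc_sents:
        sent_set = set(words(sent))
        best = max((jaccard(sent_set, u) for u in summary_sets), default=0.0)
        labels.append(1 if best >= threshold else 0)
    return labels


def extract_features(doc_sents: Sequence[str]) -> List[Dict[str, float]]:
    """Two illustrative features: relative position and length in words."""
    n = len(doc_sents)
    return [
        {"position": i / max(n - 1, 1), "length": len(words(s))}
        for i, s in enumerate(doc_sents)
    ]


doc = ["Stocks fell sharply on Monday.", "Analysts had lunch.", "Markets may recover soon."]
summary = ["Stocks fell sharply."]
print(align_labels(doc, summary))
print(extract_features(doc))
```

The resulting feature rows and 0/1 labels are what a binary classifier would be trained on. The sketch also shows why alignment is hard: a summary sentence that paraphrases the document would fall below any simple overlap threshold.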