Text Summarization Basics
1. Introduction to Text Summarization
The goal of text summarization is to produce an abridged version of a text that contains the information that is important or relevant to a user. Applications include:
- outlines or abstracts of documents
- action items from a meeting
- simplifying text by compressing sentences
In single-document summarization, the goal is mainly to produce an abstract, outline, or headline. In multi-document summarization, the goal is to produce a gist of the content.
In addition to generic summarization, there is also query-focused summarization, which summarizes a document with respect to an information need expressed in a user query.
A simple baseline is to take the first sentence of the document as the summary and use it as the text snippet (like the snippets shown in Google search results).
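The first-sentence baseline can be sketched in a few lines. This is a minimal illustration, not a production snippet generator: the function names `split_sentences` and `lead_snippet`, the regex-based sentence splitter, and the `max_chars` limit are all assumptions made for the example.

```python
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def lead_snippet(text, max_chars=160):
    """Return the first sentence, shortened to fit max_chars if necessary."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    first = sentences[0]
    if len(first) <= max_chars:
        return first
    # Cut at the last word boundary that fits, then mark the truncation.
    cut = first[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."

print(lead_snippet("Summarization shortens text. It keeps the key facts."))
```

A real system would likely use a proper sentence tokenizer, since abbreviations such as "Dr." break this simple splitter.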
2. Generating Snippets
Summarization has three stages:
- Content selection: choose sentences to be extracted from the document
- Information ordering: choose an order to place them in the summary
- Sentence realization: clean up the sentences
A simple method just uses document order and the original sentences for the last two stages, so we can focus on the first stage: content selection.
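The simple method above can be sketched as a three-function pipeline. This is only an illustration under stated assumptions: the function names are hypothetical, `score` is any caller-supplied function that rates a sentence, and the word-count scorer in the usage line is a placeholder rather than a recommended salience measure.

```python
def select_content(sentences, score, k):
    """Content selection: indices of the k highest-scoring sentences."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    return ranked[:k]

def order_information(indices):
    """Information ordering: keep the original document order."""
    return sorted(indices)

def realize_sentences(sentences, indices):
    """Sentence realization: reuse the original sentences unchanged."""
    return " ".join(sentences[i] for i in indices)

def summarize(sentences, score, k=2):
    chosen = select_content(sentences, score, k)
    ordered = order_information(chosen)
    return realize_sentences(sentences, ordered)

doc = ["short.", "this one is much longer.", "mid length here."]
# Placeholder scorer (an assumption): longer sentences score higher.
print(summarize(doc, score=lambda s: len(s.split()), k=2))
```

Because the last two stages are trivial here, the quality of the summary depends entirely on the scoring function passed to content selection.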
Unsupervised content selection: choose sentences that contain salient or informative words. Two approaches have been proposed for defining salient words:
- TF-IDF: weight each word by its term frequency times its inverse document frequency
- Topic signature: choose a set of salient words using mutual information and the log-likelihood ratio
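As a rough sketch of the TF-IDF approach, the code below scores each sentence by the average TF-IDF weight of its words. Several details are assumptions made for the example and are not specified in these notes: the function names, the regex tokenizer, the smoothed IDF formula, the use of a separate background collection for document frequencies, and averaging as the way to combine word weights into a sentence score.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_sentence_scores(sentences, background_docs):
    """Score each sentence by the mean TF-IDF weight of its words.

    TF is counted over the document being summarized (all sentences);
    DF is counted over background_docs, a hypothetical reference collection.
    """
    n_docs = len(background_docs)
    doc_freq = Counter()
    for doc in background_docs:
        doc_freq.update(set(tokenize(doc)))
    term_freq = Counter(w for s in sentences for w in tokenize(s))

    def weight(word):
        # Smoothed IDF so that words unseen in the background stay finite.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
        return term_freq[word] * idf

    scores = []
    for s in sentences:
        words = tokenize(s)
        scores.append(sum(weight(w) for w in words) / len(words) if words else 0.0)
    return scores

background = ["the cat sat", "the dog ran", "the cat slept"]
doc = ["the cat is here", "quantum entanglement is strange"]
print(tfidf_sentence_scores(doc, background))
```

In this toy input the second sentence scores higher, because its words are rare in the background collection. The sentences with the highest scores would then be passed on as the selected content.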
Supervised content selection: given a labeled training set of good summaries for each document, align the sentences in the document with the sentences in the summary, extract features (position, sentence length, word informativeness, etc.), and then train a binary classifier (put the sentence in the summary or not). The problems include:
- labeled training data is not easy to get
- the alignment is not easy to do
- performance is not better than that of unsupervised approaches
Unsupervised approaches are more commonly used in practice.
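To make the supervised setup more concrete, here is a minimal sketch of the alignment and feature-extraction steps. It is an illustration only: the function names, the Jaccard word-overlap alignment, the 0.5 threshold, and the two features shown are assumptions chosen for the example, and the classifier itself is left out.

```python
import re

def tokens(text):
    """Set of lowercase alphanumeric words in a sentence."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def label_sentences(doc_sentences, summary_sentences, threshold=0.5):
    """Alignment: label a document sentence 1 if it closely matches any
    summary sentence, else 0. The threshold is an arbitrary assumption."""
    summary_sets = [tokens(s) for s in summary_sentences]
    labels = []
    for sent in doc_sentences:
        words = tokens(sent)
        best = max((jaccard(words, s) for s in summary_sets), default=0.0)
        labels.append(1 if best >= threshold else 0)
    return labels

def sentence_features(doc_sentences):
    """Features: relative position in the document and length in words."""
    n = len(doc_sentences)
    return [
        {"position": i / max(n - 1, 1), "length": len(s.split())}
        for i, s in enumerate(doc_sentences)
    ]

doc = ["The cat sat on the mat.", "It was raining outside.", "The dog barked loudly."]
summary = ["A cat sat on the mat."]
print(label_sentences(doc, summary))
print(sentence_features(doc))
```

The resulting feature dictionaries and 0/1 labels could then be fed to any binary classifier. The fragile overlap threshold in this sketch hints at why alignment is listed among the problems above.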