Bangda Sun

Practice makes perfect

Stanford NLP (coursera) Notes (10) - Relation Extraction

Relation Extraction.

1. Relation Extraction

Last post we briefly introduced Information Extraction and one of its tasks: Named Entity Recognition (NER). This time we will continue: not only extracting entities, but also extracting the relationships among them, e.g. the IS-A relation, the instance-of relation, etc. (more can be found in the WordNet thesaurus). For example, after we extract entities from a company report we get Company/Location/Date; to build a more advanced knowledge structure, we focus on relation triples like Founding-year(IBM, 1911) and Founding-location(IBM, New York).

Why Relation Extraction?

  • create new structured knowledge bases, useful for many applications;
  • augment current knowledge bases;
  • support Question-Answering systems.

Two resources:

  • Automated Content Extraction (ACE) defines 17 relations in its 2008 “Relation Extraction Task”, with specific guidelines for extraction;
  • Unified Medical Language System (UMLS) specifies 54 relations among 134 entity types.

Besides hand-written patterns, we can also use supervised learning, semi-supervised learning, and unsupervised learning.

2. Hand-Written Patterns

First let’s look at the simplest one, the IS-A relation. The early intuition comes from Hearst (1992):

  • Y such as X ((, X)* (, and|or) X);
  • such Y as X;
  • X or other Y;
  • X and other Y;
  • Y including X;
  • Y, especially X.
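The patterns above can be sketched as regular expressions. This is a minimal illustration, not a production matcher: noun phrases are approximated here by runs of capitalized words, whereas a real system would use a POS tagger or chunker.

```python
import re

# Approximate an NP as one or more capitalized words (a simplification;
# a real system would use a chunker to find noun phrases).
NP = r"([A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"

# A few Hearst (1992) patterns; each maps a regex match to an
# (X, Y) pair meaning "X IS-A Y".
PATTERNS = [
    # "Y such as X"  ->  (X, Y)
    (re.compile(NP + r" such as " + NP), lambda m: (m.group(2), m.group(1))),
    # "X and other Y" / "X or other Y"  ->  (X, Y)
    (re.compile(NP + r" (?:and|or) other " + NP), lambda m: (m.group(1), m.group(2))),
    # "Y including X"  ->  (X, Y)
    (re.compile(NP + r" including " + NP), lambda m: (m.group(2), m.group(1))),
]

def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs found by the hand-written patterns."""
    pairs = []
    for pattern, build in PATTERNS:
        for m in pattern.finditer(sentence):
            pairs.append(build(m))
    return pairs
```

For example, `extract_isa("Companies such as IBM announced earnings.")` yields `[("IBM", "Companies")]`, i.e. IBM IS-A company.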

There are more relations, like Located-in, Founded, Cures, etc. Named entities are also helpful when extracting relations.

Advantages:

  • hand-crafted rules tend to be high-precision;
  • can be tailored to specific domains.

Disadvantages:

  • hand-crafted rules often have low recall;
  • writing them is time-consuming work.

3. Supervised Relation Extraction

The basic task for the classifier is to decide whether any two entities are related. The specific steps are as follows:

  • choose a set of relations we’d like to extract;
  • choose a set of relevant named entities;
  • find and label data: choose a corpus, label the named entities in it, hand-label the relations between these entities, and split into training and test sets;
  • train a classifier on training set.

For features, we could extract word-based features (words before/after the target entities, words between target entities), entity-based features (POS tags of entities) and syntactic features (constituent path, base syntactic chunk path, typed-dependency path), etc.
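As a sketch of the word-based and entity-based features above, the function below builds a feature dictionary for a candidate entity pair. The entity spans are assumed to come from an upstream NER step; the feature names are illustrative, not from the lecture.

```python
def pair_features(tokens, e1_span, e2_span):
    """Word- and entity-based features for a candidate pair.

    tokens: list of word tokens for one sentence.
    e1_span, e2_span: (start, end) token indices of the two entities,
    with e1 appearing before e2 (end is exclusive).
    """
    s1, t1 = e1_span
    s2, t2 = e2_span
    return {
        # words immediately before the first entity and after the second
        "word_before_e1": tokens[s1 - 1] if s1 > 0 else "<S>",
        "word_after_e2": tokens[t2] if t2 < len(tokens) else "</S>",
        # bag of words between the two entities
        "words_between": tuple(tokens[t1:s2]),
        # head (last) word of each entity
        "e1_head": tokens[t1 - 1],
        "e2_head": tokens[t2 - 1],
    }
```

For the sentence "IBM was founded in New York" with entity spans (0, 1) and (4, 6), the feature `words_between` is `("was", "founded", "in")`: exactly the kind of context a classifier can learn to associate with a Founding-location relation. These dictionaries would then be vectorized and fed to any standard classifier.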

Advantages:

  • can achieve high accuracy, given enough training data and test data similar to the training data.

Disadvantages:

  • labeling a large training set is expensive;
  • classifier may not generalize well to different genres.

4. Semi-Supervised and Unsupervised Relation Extraction

When we have no labels, or even no training data at all, we can start from a few seed tuples or a few high-precision patterns.

We can then use bootstrapping: use the seeds to directly learn to populate a relation. First gather a set of seed pairs that have relation \(R\), then iterate:

  1. find sentences with these pairs;
  2. look at the context between or around the pair and generalize the context to create patterns;
  3. use the patterns to grep for more pairs.
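The loop above can be sketched on a toy corpus. This is a deliberately simplified illustration: the seed pair and sentences are made up, a pattern is just the literal string between the two entities, and there is no pattern scoring or filtering, which a real bootstrapping system (e.g. for avoiding semantic drift) would need.

```python
import re

def bootstrap(corpus, seeds, iterations=2):
    """Toy bootstrapping: grow a set of related pairs from seed pairs.

    corpus: list of sentences (plain strings).
    seeds: set of (x, y) pairs known to have the target relation.
    """
    pairs = set(seeds)
    patterns = set()
    for _ in range(iterations):
        # Steps 1-2: find sentences containing a known pair and take the
        # literal context between the entities as a pattern.
        for sent in corpus:
            for x, y in list(pairs):
                if x in sent and y in sent:
                    ctx = sent.split(x, 1)[1].split(y, 1)[0].strip()
                    if ctx:
                        patterns.add(ctx)
        # Step 3: use the patterns to grep the corpus for new pairs.
        for sent in corpus:
            for ctx in patterns:
                m = re.match(r"(\w+) " + re.escape(ctx) + r" (\w+)", sent)
                if m:
                    pairs.add((m.group(1), m.group(2)))
    return pairs, patterns
```

Starting from the single seed ("IBM", "Armonk") on a corpus containing "IBM was founded in Armonk" and "Microsoft was founded in Redmond", the first pass learns the pattern "was founded in" and the grep step then discovers the new pair ("Microsoft", "Redmond").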

There are also more advanced methods available, such as Distant Supervision.