Bangda Sun

Practice makes perfect

Stanford NLP (coursera) Notes (9) - Information Extraction and NER

Information Extraction and Named Entity Recognition.

1. Information Extraction

Information extraction (IE) systems are another widely used NLP application in real life. They can:

  • find and understand limited relevant parts of texts;
  • gather information from many pieces of text;
  • produce a structured representation of relevant information;
  • organize information so that it is useful to people, and put it in a semantically precise form that allows further inferences to be made by computer algorithms.

Roughly, you can think of it as extracting “who did what to whom, when” from texts. For example: gathering earnings, profits, and board members from company reports; I know this type of task is used by many financial companies nowadays.

To be more specific and professional, one of the core tasks in IE is Named Entity Recognition (NER), which finds persons, dates, locations, organizations, etc. in text.

The usage includes:

  • named entities can be indexed, linked off, etc;
  • sentiment can be attributed to companies or products;
  • a lot of IE relations are associations between named entities;
  • answers from QA systems are often named entities.
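As a quick illustration of what NER output looks like (not part of the course itself), here is a minimal sketch using the off-the-shelf spaCy library; it assumes spaCy and its small English model are installed.

```python
# Minimal NER sketch with spaCy (assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm` have been run).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple reported record profits on Tuesday, CEO Tim Cook told "
          "investors in Cupertino.")

# Print each recognized entity with its type; the exact output depends on
# the model, but typically: Apple/ORG, Tuesday/DATE, Tim Cook/PERSON,
# Cupertino/GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```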

2. NER Task

From a machine learning perspective, NER is a classification task: is this piece of text (a word or phrase) a named entity, and if so, of which type? The general pipeline is:

2.1 Training a classifier

  • collect a set of representative training documents;
  • label each token with its entity class or OTHER;
  • feature engineering (current word, prev/next word, POS, previous label, etc), see the sketch after this list;
  • train classifiers.
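To make the feature engineering step concrete, here is a small sketch (my own illustration, not code from the course) of a function that maps a token position to a feature dictionary using the current word, its neighbors, a simple shape feature, and the previous label:

```python
# Sketch of token-level feature extraction for NER; the feature set here
# is a hypothetical, minimal choice.
def token_features(tokens, i, prev_label):
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[0].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        "prev_label": prev_label,
    }

tokens = ["Tim", "Cook", "visited", "Cupertino", "."]
print(token_features(tokens, 3, prev_label="O"))
# {'word': 'cupertino', 'is_capitalized': True, 'prev_word': 'visited',
#  'next_word': '.', 'prev_label': 'O'}
```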

2.2 Testing

  • receive new documents
  • apply the model for each token
  • output the extracted entities
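Putting 2.1 and 2.2 together, here is a minimal end-to-end sketch with scikit-learn; the toy corpus, labels, and deliberately tiny feature set are illustrative only, just to show the shape of the pipeline.

```python
# Toy NER pipeline sketch: dict features -> DictVectorizer -> logistic
# regression, then applying the model to each token of a new document.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(tokens, i):
    return {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
    }

# Tiny labeled "corpus": one training example per token.
train_sents = [
    (["Tim", "Cook", "leads", "Apple", "."], ["PER", "PER", "O", "ORG", "O"]),
    (["Apple", "is", "based", "in", "Cupertino", "."],
     ["ORG", "O", "O", "O", "LOC", "O"]),
]
X, y = [], []
for tokens, labels in train_sents:
    for i, label in enumerate(labels):
        X.append(features(tokens, i))
        y.append(label)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# Testing: apply the model to each token of a new document.
test_tokens = ["Cook", "visited", "Cupertino"]
pred = clf.predict(vec.transform([features(test_tokens, i)
                                  for i in range(len(test_tokens))]))
print(list(zip(test_tokens, pred)))
```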

Finally, for model evaluation we can use Recall/Precision/F-score, which were already introduced in my previous posts ;)
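As a quick reminder, counting true positives (TP), false positives (FP) and false negatives (FN) over the predicted entities:

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]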

Besides the classic machine learning classifiers discussed before (Naive Bayes, Logistic Regression/Maximum Entropy, SVM), the task we face here can also be viewed as a sequence problem, and there are other statistical models (such as the Conditional Markov Model, a.k.a. Maximum Entropy Markov Model) that use conditional evidence from observations and previous decisions. Another example of such a sequence problem is POS (Part-of-Speech) tagging.

In sequence problems, we make inferences sequentially.

  • Greedy inference

We just start at the beginning and use the classifier at each position to assign a label. The classifier can depend on previous labeling decisions as well as observations (a sketch follows below).

Advantages: Fast, saves space; easy to implement.

Disadvantages: Greedy, may commit errors that it cannot recover from.
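A minimal sketch of greedy decoding, assuming a score function (e.g. a log-probability from a trained MEMM-style classifier) is given:

```python
def greedy_decode(tokens, labels, score):
    """score(tokens, i, prev_label, label) -> float is assumed given,
    e.g. a log-probability from a trained classifier."""
    sequence, prev = [], "<START>"
    for i in range(len(tokens)):
        # Pick the best label at position i given the observations and
        # the label already committed at position i - 1.
        best = max(labels, key=lambda y: score(tokens, i, prev, y))
        sequence.append(best)
        prev = best
    return sequence
```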

  • Beam inference

At each position, keep the top \(K\) complete sequences and extend each sequence in each local way. The extensions compete for the \(K\) slots at the next position.

Advantages: Fast, a beam size of 3-5 is almost as good as exact inference; easy to implement.

Disadvantages: Inexact, the globally best sequence can fall off the beam.
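A sketch of beam inference under the same assumed score function:

```python
def beam_decode(tokens, labels, score, beam_size=3):
    """Keep the top-K partial sequences at each position;
    score(tokens, i, prev_label, label) -> log-probability, assumed given."""
    beam = [(0.0, ["<START>"])]          # (total log-prob, label sequence)
    for i in range(len(tokens)):
        candidates = [
            (logp + score(tokens, i, seq[-1], y), seq + [y])
            for logp, seq in beam
            for y in labels
        ]
        # The extensions compete for the K slots at the next position.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    best_logp, best_seq = max(beam, key=lambda c: c[0])
    return best_seq[1:]                  # drop the <START> symbol
```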

  • Viterbi inference

Dynamic programming / memoization; requires a small window of state influence.

Advantages: Exact, the globally best sequence is returned.

Disadvantages: Harder to implement long-distance state-state interactions.
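A sketch of Viterbi decoding with the same assumed score function, under a first-order assumption (the label at position \(i\) may only depend on the label at position \(i-1\)):

```python
def viterbi_decode(tokens, labels, score):
    """Exact decoding by dynamic programming;
    score(tokens, i, prev_label, label) -> log-probability, assumed given."""
    n = len(tokens)
    # best[i][y]: best log-prob of any label sequence ending in y at position i
    best = [dict() for _ in range(n)]
    back = [dict() for _ in range(n)]
    for y in labels:
        best[0][y] = score(tokens, 0, "<START>", y)
    for i in range(1, n):
        for y in labels:
            p = max(labels,
                    key=lambda prev: best[i - 1][prev] + score(tokens, i, prev, y))
            best[i][y] = best[i - 1][p] + score(tokens, i, p, y)
            back[i][y] = p               # remember the best predecessor label
    # Follow the backpointers from the best final label.
    seq = [max(labels, key=lambda y: best[n - 1][y])]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```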