Text Processing
1. Regular Expressions
The first concept in text processing is the regular expression: a formal language for specifying text strings. We can use regular expressions to search for specific patterns in text and do further processing based on the matches.
1.1 Disjunctions
If we want to match a pattern drawn from a set of letters or symbols, the basic way is to put them inside square brackets [ ] (one pair of square brackets matches one symbol). For example, the pattern [wW]ood matches both Wood and wood. Commonly used ranges include [A-Z] (an upper-case letter), [a-z] (a lower-case letter) and [0-9] (a single digit).
The opposite is the negated disjunction: we add a caret inside the square brackets. For example, [^wW]ood matches any string whose last three letters are ood but whose preceding character is neither w nor W; put simply, “don’t match Wood or wood”.
Other disjunction and counting operators include:
- | : matches either the pattern on its left or the one on its right (“or”)
- ? : 0 or 1 of the previous char
- * : 0 or more of the previous char
- + : 1 or more of the previous char
- . : any char (a placeholder)
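To make these operators concrete, here is a quick demo with Python’s re module; the patterns and test strings are just illustrative:

```python
import re

# Each pattern exercises one operator from the list above.
print(re.findall(r"colou?r", "color colour"))     # ['color', 'colour']
print(re.findall(r"o*h!", "h! oh! ooh!"))         # ['h!', 'oh!', 'ooh!']
print(re.findall(r"ba+", "ba baa baaa"))          # ['ba', 'baa', 'baaa']
print(re.findall(r"beg.n", "begin began begun"))  # ['begin', 'began', 'begun']
print(re.findall(r"cat|dog", "a cat and a dog"))  # ['cat', 'dog']
```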
1.2 Anchors
The two main anchors are ^ and $, which represent the beginning and the end of the text. For example, ^[A-Za-z] matches the R in “Regular expression”, ^[^A-Za-z] matches the 1 in “1. introduction”, and .$ matches the ? in “The end?”.
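And a small demo of the anchors, again with Python’s re module:

```python
import re

print(re.match(r"^[A-Za-z]", "Regular expression").group())  # 'R'
print(re.match(r"^[^A-Za-z]", "1. introduction").group())    # '1'
print(re.search(r".$", "The end?").group())                  # '?'
```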
Generally speaking, regular expressions are widely used and play a surprisingly large role in NLP. We can use them for text cleansing or to generate features for machine learning algorithms.
2. Word Tokenization
Text normalization is very important, and nearly every NLP task needs it: in NLP we face text in many formats (symbols, characters, words, sentences and documents), not the standard tabular data of machine learning textbooks. Text normalization includes:
- Segmenting / tokenizing words in running text
- Normalizing word formats
- Segmenting sentences in running text
We use Type to indicate an element of the vocabulary and Token to indicate an instance of that type in running text. For example, “the cat sat on the mat” contains six tokens but only five types, because “the” appears twice.
Word tokenization raises many issues. For example, should we remove the “-” in “state-of-the-art”? How many tokens is “San Francisco” (one or two)? Different languages also raise different issues. Take Chinese as an example: Chinese words are composed of characters (a character is generally one syllable and one morpheme, and the average word is 2.4 characters long), so we always need an extra step called word segmentation. Many algorithms are available for this, such as the maximum matching algorithm, sketched below.
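Below is a minimal sketch of greedy maximum matching; the mini-dictionary is a toy assumption, and real segmenters use much larger lexicons or statistical models:

```python
def max_match(text, dictionary):
    """Greedily take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest possible substring first; fall back to a
        # single character if nothing in the dictionary matches.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy example with a hypothetical mini-dictionary.
dictionary = {"他", "特别", "喜欢", "北京烤鸭", "北京", "烤鸭"}
print(max_match("他特别喜欢北京烤鸭", dictionary))
# ['他', '特别', '喜欢', '北京烤鸭']
```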
3. Word Normalization and Stemming
3.1 Normalization
Words also come in different forms. A lemma is the shared base form: for instance, “cat” and “cats” have the same lemma. A wordform is the full inflected surface form: “cat” and “cats” are different wordforms.
Also, in information retrieval, the indexed text and the query terms must have the same form: “U.S.A.” should match “USA”. In that way, we implicitly define equivalence classes of terms.
3.2 Case folding
Reducing all letters to lower case is a common method in text normalization. But sometimes case is also informative, as in words like “US” (vs. “us”) and “MT”.
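As a toy sketch, case folding combined with deleting periods collapses “U.S.A.”, “USA” and “usa” into one equivalence class; the normalize function here is purely illustrative:

```python
def normalize(term):
    # Case-fold and drop periods, e.g. "U.S.A." -> "usa".
    return term.replace(".", "").lower()

print(normalize("U.S.A.") == normalize("USA"))  # True
```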
3.3 Lemmatization
Lemmatization means reducing inflections or variant forms to the base form: “am”, “are” and “is” are reduced to “be”; “cat”, “cats”, “cat’s” and “cats’” are reduced to “cat”. To do this we have to find the correct dictionary headword form, so the result is always a valid word.
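One way to try this out is NLTK’s WordNet-based lemmatizer (assuming nltk is installed and the wordnet data package has been downloaded):

```python
import nltk
nltk.download("wordnet", quiet=True)  # dictionary data for the lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))          # 'cat' (default POS is noun)
print(lemmatizer.lemmatize("is", pos="v"))   # 'be'
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'
```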
3.4 Stemming
Words are built from morphemes: stems (the core meaning-bearing units) and affixes (bits and pieces that adhere to stems).
Stemming, used in information retrieval, means reducing terms to their stems by crudely chopping off affixes. For example, “automates”, “automatic” and “automation” are all reduced to “automat”. Here you can see the difference from lemmatization: “automat” is not an element of the vocabulary. That is, stemming a word or sentence may produce strings that are not actual words.
There are many stemming algorithms; the most common one is Porter’s algorithm. It consists of cascaded rewriting rules such as:
- sses -> ss
- ies -> i
- (v)ing -> (v) (drop the suffix if the stem contains a vowel: walking -> walk)
- (v)ed -> (v)
- ational -> ate
- izer -> ize
- ator -> ate
We can observe that these rules are based on linguistics.
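NLTK also ships an implementation of Porter’s algorithm, so we can watch the rule cascade in action; note that the outputs need not be real words:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "walking", "relational",
             "digitizer", "operator", "automation"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni, walking -> walk
```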
4. Sentence Segmentation
In most cases we are faced not only with words but with whole documents such as comments, reviews and articles. Therefore how to segment sentences is also an important problem.
Punctuation is one aspect we should consider in sentence segmentation. ! and ? are relatively unambiguous sentence boundaries (and sometimes useful for indicating sentiment), but . is quite ambiguous: it could mark the end of a sentence, but it also appears in abbreviations like “Dr.” and “Inc.” and in numbers like “0.5” and “4.3%”. Sometimes we need specific rules to decide, with regular expressions for the implementation; a sketch follows.
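Here is a minimal sketch of such a rule-based splitter; the abbreviation list is a toy assumption, and real systems need far richer lists and rules:

```python
import re

# Periods after these tokens do not end a sentence (toy list).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "etc."}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundaries: ., ! or ? followed by whitespace.
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        # Skip periods belonging to abbreviations or bare numbers.
        if match.group() == "." and (
            last_word in ABBREVIATIONS or re.fullmatch(r"\d+\.", last_word)
        ):
            continue
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith paid $4.50 there. He laughed! The end?"))
# ['Dr. Smith paid $4.50 there.', 'He laughed!', 'The end?']
```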