
Inputting & PreProcessing Text

Jake Batsuuri

Input Methods, String & Unicode, Regular Expression Use Cases

Introduction

We need to be able to analyze new texts, not just Moby Dick, so we need to learn how to get text into our programs and preprocess it.

  • Import local text files
    • Strings
    • Files
    • Regex
  • Import web text files
    • Dispense with HTML & Markup
  • Split the text into words and punctuation
    • Tokenization
    • Stemming
  • Export formatted output
We start with the usual imports:

from __future__ import division
import nltk, re, pprint

EBooks

# urllib.request is part of the Python standard library -- no pip install needed
import urllib.request

url = "https://www.gutenberg.org/files/2554/2554-0.txt"   # Crime and Punishment
raw = urllib.request.urlopen(url).read().decode('utf8')   # decode the downloaded bytes into text
type(raw)
# <type 'str'>
len(raw)
# 1176831
raw[:75]
# 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
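
Before tokenizing the text below, make sure NLTK's Punkt tokenizer models are installed (a one-time download; very recent NLTK releases may also ask for a punkt_tab resource):

import nltk
nltk.download('punkt')    # models used by nltk.word_tokenize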

Tokenization

tokens = nltk.word_tokenize(raw)
type(tokens)
# <type 'list'>
len(tokens)
# 255809
tokens[:10]
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

More Functions

text = nltk.Text(tokens)
type(text)
# <type 'nltk.text.Text'>
text[1020:1060]
# ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
# 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
# 'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
# ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
text.collocations()
# Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr
# Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch;
# Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; Andrey
# Semyonovitch; old woman; Literary Archive; Dmitri Prokofitch; great
# deal; United States; Praskovya Pavlovna; Porfiry Petrovitch; ear rings

Header and Footer

>>> raw.find("PART I")
5303
>>> raw.rfind("End of Project Gutenberg's Crime")
1157681
>>> raw = raw[5303:1157681]
>>> raw.find("PART I")
0

The find() and rfind() (“reverse find”) methods help us get the right index values to use for slicing the string.

We overwrite raw with this slice, so now it begins with “PART I” and goes up to (but not including) the phrase that marks the end of the content.
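
Rather than typing in those index values by hand, we can compute them each time (a small sketch, assuming raw holds the full downloaded text):

start = raw.find("PART I")                              # where the novel proper begins
end = raw.rfind("End of Project Gutenberg's Crime")     # where the license footer begins
raw = raw[start:end]                                    # keep only the body of the text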

HTML

>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urllib.request.urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

You can type print html to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.

>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
they say too few people now carry the gene for blondes to last beyond the next tw
t blonde hair is caused by a recessive gene . In order for a child to have blonde
to have blonde hair , it must have the gene on both sides of the family in the gra
there is a disadvantage of having that gene or by chance . They don ' t disappear
ondes would disappear is if having the gene was a disadvantage and I do not think

Beautiful Soup
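
Note that nltk.clean_html() was dropped in NLTK 3; current releases point you to an HTML parser such as BeautifulSoup instead. A minimal sketch, assuming the beautifulsoup4 package is installed:

from bs4 import BeautifulSoup                          # pip install beautifulsoup4

raw = BeautifulSoup(html, 'html.parser').get_text()    # strip the markup, keep the visible text
tokens = nltk.word_tokenize(raw)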

Search Engine Results

Finally, the markup in the result returned by a search engine may change unpredictably, breaking any pattern based method of locating particular content (a problem which is ameliorated by the use of search engine APIs).

RSS Feeds

>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.clean_html(content))
[u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our', u'visiting',
u'graduate', u'students', u'from', u'the', u'PRC', u'.', u'Thinking', u'that', u'I',

Note that the resulting strings have a u prefix to indicate that they are Unicode strings (more on this in the Unicode section below).

Reading Local Files

>>> f = open('document.txt')
>>> raw = f.read()

To find the right directory

>>> import os
>>> os.listdir('.')

Read one line at a time

>>> f = open('document.txt', 'rU')
>>> for line in f:
...     print line.strip()
Time flies like an arrow.
Fruit flies like a banana.

Here we use the strip() method to remove the newline character at the end of the input line.

>>> path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
>>> raw = open(path, 'rU').read()

Binary Files

Text often comes in binary formats—such as PDF and MSWord—that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats.

Extracting text from multicolumn documents is particularly challenging.

For one-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described below.

If the document is already on the Web, you can enter its URL in Google’s search box.

The search result often includes a link to an HTML version of the document, which you can save as text.
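
If you do need to pull text out of a PDF programmatically, here is a minimal sketch with the pypdf library (the filename is a placeholder):

from pypdf import PdfReader          # pip install pypdf

reader = PdfReader('document.pdf')   # placeholder local file
raw = '\n'.join(page.extract_text() or '' for page in reader.pages)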

User Input

Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a line of input, call the Python function raw_input().

>>> s = raw_input("Enter some text: ")
# Enter some text: On an exceptionally hot evening early in July
>>> print "You typed", len(nltk.word_tokenize(s)), "words."
# You typed 8 words.
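
Note that raw_input() and the print statement are Python 2 constructs; in Python 3 the same interaction looks like this:

s = input("Enter some text: ")
print("You typed", len(nltk.word_tokenize(s)), "words.")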

Pipeline

>>> raw = open('document.txt').read()
>>> type(raw)
<type 'str'>
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>

>>> words = [w.lower() for w in tokens]
>>> type(words)
<type 'list'>

>>> vocab = sorted(set(words))
>>> type(vocab)
<type 'list'>
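
The same steps wrapped into a small helper (a sketch; the function name is ours):

def build_vocab(path):
    """Read a plain-text file and return its sorted, lowercased vocabulary."""
    raw = open(path).read()
    tokens = nltk.word_tokenize(raw)
    words = [w.lower() for w in tokens]
    return sorted(set(words))

vocab = build_vocab('document.txt')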

Strings

Single Quotation Mark

>>> monty = 'Monty Python'
>>> monty
'Monty Python'
>>> circus = 'Monty Python\'s Flying Circus'
>>> circus
"Monty Python's Flying Circus"
>>> circus = 'Monty Python's Flying Circus'
File "<stdin>", line 1
circus = 'Monty Python's Flying Circus'
# SyntaxError: invalid syntax

Double Quotation Mark

>>> circus = "Monty Python's Flying Circus"
>>> circus
"Monty Python's Flying Circus"

Triple Quotation Mark

>>> couplet = "Shall I compare thee to a Summer's day?"\
... "Thou are more lovely and more temperate:"
>>> print couplet
# Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:

>>> couplet = ("Rough winds do shake the darling buds of May,"
... "And Summer's lease hath all too short a date:")
>>> print couplet
# Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:
>>> couplet = """Shall I compare thee to a Summer's day?
... Thou are more lovely and more temperate:"""
>>> print couplet
# Shall I compare thee to a Summer's day?
# Thou are more lovely and more temperate:

>>> couplet = '''Rough winds do shake the darling buds of May,
... And Summer's lease hath all too short a date:'''
>>> print couplet
# Rough winds do shake the darling buds of May,
# And Summer's lease hath all too short a date:

Concatenation

>>> 'very' + 'very' + 'very'
# 'veryveryvery'
>>> 'very' * 3
# 'veryveryvery'

Print

>>> grail = 'Holy Grail'
>>> print monty + grail
# Monty PythonHoly Grail
>>> print monty, grail
# Monty Python Holy Grail
>>> print monty, "and the", grail
# Monty Python and the Holy Grail

Individual Chars

>>> monty[0]
'M'
>>> monty[3]
't'
>>> monty[5]
' '

Again, as with lists, we can use negative indexes for strings, where -1 is the index of the last character. Positive and negative indexes give us two ways to refer to any position in a string. In this case, the string has a length of 12, so indexes 5 and -7 both refer to the same character (a space). (Notice that 5 = len(monty) - 7.)

>>> monty[-1]
'n'
>>> monty[5]
' '
>>> monty[-7]
' '

Print Chars

>>> sent = 'colorless green ideas sleep furiously'
>>> for char in sent:
...     print char,
...
# c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y

Count Chars

>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw('melville-moby_dick.txt')
>>> fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
>>> fdist.keys()
# ['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm', 'c', 'w',
# 'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']

You might like to visualize the distribution using fdist.plot(). The relative character frequencies of a text can be used in automatically identifying the language of the text.
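
As a rough illustration of that idea, here is a sketch (not a serious language identifier) that builds per-language character frequency profiles from the UDHR corpus shipped with NLTK and picks the closest profile for a new piece of text:

from nltk.corpus import udhr

def char_freqs(text):
    # relative frequency of each alphabetic character
    fd = nltk.FreqDist(ch.lower() for ch in text if ch.isalpha())
    total = float(fd.N())
    return dict((ch, n / total) for ch, n in fd.items())

profiles = dict((lang, char_freqs(udhr.raw(lang)))
                for lang in ['English-Latin1', 'German_Deutsch-Latin1', 'French_Francais-Latin1'])

def guess_language(text):
    freqs = char_freqs(text)
    # pick the profile with the smallest squared difference from the text's character frequencies
    return min(profiles, key=lambda lang: sum((freqs.get(ch, 0.0) - p) ** 2
                                              for ch, p in profiles[lang].items()))

guess_language("Wherever humans live, they have rights.")   # expected to pick 'English-Latin1'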

Substrings

>>> monty[6:10]
# 'Pyth'

The slice [m,n] contains the characters from position m through n-1.

Negative Indexing

We can also slice with negative indexes—the same basic rule of starting from the start index and stopping one before the end index applies; here we stop before the space character.

>>> monty[-12:-7]
# 'Monty'
>>> monty[:5]
# 'Monty'

>>> monty[6:]
# 'Python'

Substring Membership

>>> phrase = 'And now for something completely different'
>>> if 'thing' in phrase:
... print 'found "thing"'
# found "thing"

Substring Index

>>> monty.find('Python')
# 6

More Operations

  • Method Functionality

    • s.find(t) Index of first instance of string t inside s (-1 if not found)

    • s.rfind(t) Index of last instance of string t inside s (-1 if not found)

    • s.index(t) Like s.find(t), except it raises ValueError if not found

    • s.rindex(t) Like s.rfind(t), except it raises ValueError if not found

    • s.join(text) Combine the words of the text into a string using s as the glue

    • s.split(t) Split s into a list wherever a t is found (whitespace by default)

    • s.splitlines() Split s into a list of strings, one per line

    • s.lower() A lowercased version of the string s

    • s.upper() An uppercased version of the string s

    • s.title() A titlecased version of the string s

    • s.strip() A copy of s without leading or trailing whitespace

    • s.replace(t, u) Replace instances of t with u inside s
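
A few of these in action:

s = '  Monty Python  '
s.strip()                          # 'Monty Python'
'Monty Python'.lower()             # 'monty python'
'-'.join(['Monty', 'Python'])      # 'Monty-Python'
'Monty Python'.split()             # ['Monty', 'Python']
'Monty Python'.replace('o', '0')   # 'M0nty Pyth0n'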

Lists Vs Strings

>>> query = 'Who knows?'
>>> beatles = ['John', 'Paul', 'George', 'Ringo']

>>> query[2]
# 'o'
>>> beatles[2]
# 'George'
>>> query[:2]
# 'Wh'
>>> beatles[:2]
# ['John', 'Paul']
>>> query + " I don't"
# "Who knows? I don't"
>>> beatles + 'Brian'
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# TypeError: can only concatenate list (not "str") to list
>>> beatles + ['Brian']
# ['John', 'Paul', 'George', 'Ringo', 'Brian']
>>> beatles[0] = "John Lennon"
>>> del beatles[-1]
>>> beatles
['John Lennon', 'Paul', 'George']
>>> query[0] = 'F'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: object does not support item assignment

This is because strings are immutable: you can’t change a string once you have created it. However, lists are mutable, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value.

Unicode

Code Point

Unicode supports over a million characters. Each character is assigned a number, called a code point. In Python, code points are written in the form \uXXXX, where XXXX is the number in four-digit hexadecimal form.

Types of Encodings

Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can support only a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.

  • Basically, ASCII covers far fewer characters than Unicode.
    • ASCII proper is a 7-bit code with 128 characters; single-byte extensions such as Latin-1 and Latin-2 cover 256.
    • UTF-8 can encode every Unicode code point (over a million) because it uses 1 to 4 bytes per character.

Bit Spaces

Generally speaking, Unicode is the universal representation, while individual encodings cover only part of it; that is why translating encoded text into Unicode is called decoding.

Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode—translation into Unicode is called decoding. Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding— this translation out of Unicode is called encoding.
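
For example, round-tripping a Polish word through UTF-8 (shown here in Python 3 terms, where str holds Unicode text and bytes holds its encoded form):

word = u'\u015awiat'           # the Unicode string 'Świat' ("world")
data = word.encode('utf-8')    # encoding: Unicode -> bytes, b'\xc5\x9awiat'
data.decode('utf-8')           # decoding: bytes -> Unicode, back to 'Świat'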

Glyphs

From a Unicode perspective, characters are abstract entities that can be realized as one or more glyphs. Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters to glyphs.

Codecs

The Python codecs module provides functions to read encoded data into Unicode strings, and to write out Unicode strings in encoded form. The codecs.open() function takes an encoding parameter to specify the encoding of the file being read or written. So let’s import the codecs module, and call it with the encoding 'latin2' to open our Polish file as Unicode:

>>> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

>>> import codecs
>>> f = codecs.open(path, encoding='latin2')
  • In this context, a codec is simply a coder/decoder: a module that translates between Unicode strings and a particular byte encoding (not an audio or video compressor).

To write Unicode data out to a file, open it in write mode with the target encoding:

f = codecs.open(path, 'w', encoding='utf-8')

Ordinal

To get the ordinal value of a character:

ord('a')
# 97

The hexadecimal 4 digit notation for 97 is 0061:

>>> a = u'\u0061'
>>> a
# u'a'
>>> print a
# a

Tokenizing Unicode

Here line is a lowercased line read from the Polish file we opened with codecs above; the \uXXXX escapes in the output are its non-ASCII Polish characters.

>>> nltk.word_tokenize(line)
# [u'niemc\xf3w', u'pod', u'koniec', u'ii', u'wojny', u'\u015bwiatowej',
#  u'na', u'dolny', u'\u015bl\u0105sk', u'zosta\u0142y']
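
A minimal sketch of reading and tokenizing the file line by line (reusing path and codecs from above):

f = codecs.open(path, encoding='latin2')
for line in f:
    line = line.strip().lower()      # each line comes back as a decoded Unicode string
    print(nltk.word_tokenize(line))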

Regular Expression Application

Many linguistic tasks require pattern matching; for example, endswith('ed') finds words that end with 'ed'. Regular expressions help us do this kind of matching efficiently and robustly.

To use regular expressions in Python we have to import re and some words to process:

>>> import re
>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

Basic Metacharacters

We will use the re.search(p, s) function to check whether the pattern p can be found somewhere inside the string s. For example, to find words ending in ed we search with the pattern ‹‹ed$››.

Caret

^ matches the start of the string: you can think of it as matching an invisible empty position just before the first character.

re.search('^pre', w)

This only matches words that begin with pre.

Dollar Sign

>>> [w for w in wordlist if re.search('ed$', w)]
# ['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]

$ matches the end of the string.

Dot Wildcard

>>> [w for w in wordlist if re.search('^..j..t..$', w)]
# ['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', ...]

The dot is a wildcard that matches any single character.

Maybe Question

How do you accept both forms, email and e-mail?

‹‹^e-?mail$››

The ? makes the character right before it optional.

Summing It Up

sum(1 for w in text if re.search('^e-?mail$', w))

Ranges and Closures

Two or more words that are entered with the same keystrokes are called textonyms.

Both golf and hole are 4653 with the T9 system, used for entering text on mobile.

What other words could be produced with the same sequence? These are the kinds of questions regular expressions can help us answer.

In other words, which words match ‹‹^[ghi][mno][jlk][def]$››?

[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
# ['gold', 'golf', 'hold', 'hole']
  • The first character must be one of [ghi], the second one of [mno], and so on
  • The order inside each bracket doesn't matter: ‹‹^[hig][mon][jkl][fed]$›› matches the same words

Understanding Regular Expressions & Kleene Closures

‹‹^[ghijklmno]+$›› = ‹‹^[g-o]+$››

‹‹^[a-fj-o]+$›› = ‹‹^[abcdefjklmno]+$››

>>> chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
>>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]

# ['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine',
# 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']
>>> [w for w in chat_words if re.search('^[ha]+$', w)]

# ['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh',
# 'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa',
# 'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', ...]

It should be clear that + simply means “one or more instances of the preceding item,” which could be an individual character like m, a set like [fed], or a range like [d-f].

Now let’s replace + with *, which means “zero or more instances of the preceding item.” The regular expression ‹‹^m*i*n*e*$››:

  • will match everything that we found using «^m+i+n+e+$»,
  • but also words where some of the letters don’t appear at all, e.g., me, min, and mmmmm.

Note that the + and * symbols are sometimes referred to as Kleene closures, or simply closures.
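
A quick sanity check of that claim, reusing the chat_words list from above: every word matched by the + version must also be matched by the * version.

plus_matches = set(w for w in chat_words if re.search('^m+i+n+e+$', w))
star_matches = set(w for w in chat_words if re.search('^m*i*n*e*$', w))
plus_matches <= star_matches   # True; the * version would also admit 'me', 'min', etc. if present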

Matching Not Something

The ^ operator has another function when it appears as the first character inside square brackets. For example, «[^aeiouAEIOU]» matches any character other than a vowel. We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using «^[^aeiouAEIOU]+$» to find items like these:

  • :):):),
  • grrr,
  • cyb3r, and
  • zzzzzzzz.

Notice this includes non-alphabetic characters.

Matching Different Patterns

Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating the use of some new symbols: \, {}, (), and |.

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)]
# ['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5',
# '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99',
# '1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]

>>> [w for w in wsj if re.search('^[A-Z]+\$$', w)]
# ['C$', 'US$']

>>> [w for w in wsj if re.search('^[0-9]{4}$', w)]
# ['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', ...]

>>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
# ['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', ...]

>>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
# ['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting',
# 'savings-and-loan']

>>> [w for w in wsj if re.search('(ed|ing)$', w)]
# ['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', ...]
  • Escape \ You probably worked out that a backslash means that the following character is deprived of its special powers and must literally match a specific character in the word. Thus, while . is special, \. only matches a period.
  • Curly Brackets { } The braced expressions, like {3,5}, specify the number of repeats of the previous item.
  • Or | The pipe character indicates a choice between the material on its left or its right.
  • Order of Operations ( ) Parentheses indicate the scope of an operator, and they can be used together with the pipe (or disjunction) symbol like this: «w(i|e|ai|oo)t», matching wit, wet, wait, and woot.

Backslash

To the Python interpreter, a regular expression is just like any other string. If the string contains a backslash followed by particular characters, it will interpret these specially. For example, \b would be interpreted as the backspace character.

In general, when using regular expressions containing backslash, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing. We do this by prefixing the string with the letter r, to indicate that it is a raw string. For example, the raw string r'\band\b' contains two \b symbols that are interpreted by the re library as matching word boundaries instead of backspace characters. If you get into the habit of using r'...' for regular expressions—as we will do from now on—you will avoid having to think about these complications.
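
A quick illustration of the difference:

re.findall('\band\b', 'sand and gravel')     # [] -- here \b was read as a backspace character
re.findall(r'\band\b', 'sand and gravel')    # ['and'] -- here \b is a word boundary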

Regular Expressions Use Cases

The previous examples all involved searching for words w that match some regular expression regexp, using re.search(regexp, w).

Extracting Word Pieces

The re.findall() method finds all non-overlapping matches of the given regular expression.

>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
# ['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
# 16
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()
# [('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),
# ('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95),
# ('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ...]

Reconstructing Word Pieces

Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, such as glue them back together or plot them.

>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print nltk.tokenwrap(compress(w) for w in english_udhr[:75])
# Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
# of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
# of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
# rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
# and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and

Conditional Frequency Distribution & Find All

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
  a    e    i    o    u
k 418  148  94   420  173
p 83   31   105  34   51
r 187  63   84   89   79
s 0    0    100  2    1
t 47   8    0    148  37
v 93   27   105  48   49
>>> cv_word_pairs = [(cv, w) for w in rotokas_words
... for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
# ['kasuari']
>>> cv_index['po']
# ['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa',
# 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', ...]

Finding Word Stems

For some language processing tasks we want to ignore word endings, and just deal with word stems, for example in querying a search engine.

Naïve Approach

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word

Regex Equivalent

>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']

Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions. Here’s the revised version.

>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']

However, we’d actually like to split the word into stem and suffix. So we should just parenthesize both parts of the regular expression:

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
# [('process', 'ing')]

This looks promising, but still has a problem. Let’s look at a different word, processes:

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
# [('processe', 's')]

The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star operator is “greedy,” so the .* part of the expression tries to consume as much of the input as possible. If we use the “non-greedy” version of the star operator, written *?, we get what we want:

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
# [('process', 'es')]

This works even when we allow an empty suffix, by making the content of the second parentheses optional:

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
# [('language', '')]

This approach still has many problems (can you spot them?), but we will move on to define a function to perform stemming, and apply it to a whole text:

Better Approach

>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem

>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
>>> [stem(t) for t in tokens]
# ['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond',
# 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
# '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from',
# 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words, such as distribut and deriv, but these are acceptable stems in some applications.
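
Our hand-rolled stem() is only a toy; NLTK also ships with real stemmers, which can be applied to the same tokens. A brief sketch with the Porter stemmer:

porter = nltk.PorterStemmer()
[porter.stem(t) for t in tokens]
# e.g. 'lying' becomes 'lie' and 'distributing' becomes 'distribut'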

Searching Tokenized Text

You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens). For example, ‹‹<a> <man>›› finds all instances of a man in the text. The angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored (behaviors that are unique to NLTK’s findall() method for texts).

  • In the following example, we include <.*> , which will match any single token, and enclose it in parentheses so only the matched word (e.g., monied) and not the matched phrase (e.g., a monied man) is produced.
>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>")
# monied; nervous; dangerous; white; white; white; pious; queer; good;
# mature; white; Cape; great; wise; wise; butterless; white; fiendish;
# pale; furious; better; certain; complete; dismasted; younger; brave;
# brave; brave; brave
  • The second example finds three-word phrases ending with the word bro.
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>")
# you rule bro; telling you bro; u twizted bro
  • The last example finds sequences of three or more words starting with the letter l .
>>> chat.findall(r"<l.*>{3,}")
# lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
# la la; lovely lol lol love; lol lol lol.; la la la; la la la

Consolidate your understanding of regular expression patterns and substitutions using nltk.re_show(p, s), which annotates the string s to show every place where pattern p was matched, and nltk.app.nemo(), which provides a graphical interface for exploring regular expressions. For more practice, try some of the exercises on regular expressions at the end of this chapter.
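
For example, re_show marks each match of the pattern in place (a small sketch):

nltk.re_show('ing', 'processing strings')
# process{ing} str{ing}s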

Hypernyms

It is easy to build search patterns when the linguistic phenomenon we’re studying is tied to particular words. In some cases, a little creativity will go a long way. For instance, searching a large text corpus for expressions of the form x and other ys allows us to discover hypernyms:

>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
# speed and other activities; water and other liquids; tomb and other
# landmarks; Statues and other monuments; pearls and other jewels;
# charts and other items; roads and other features; figures and other
# objects; military and other areas; demands and other factors;
# abstracts and other compilations; iron and other metals

With enough text, this approach would give us a useful store of information about the taxonomy of objects, without the need for any manual labor. However, our search results will usually contain false positives, i.e., cases that we would want to exclude. For example, the result demands and other factors suggests that demand is an instance of the type factor, but this sentence is actually about wage demands. Nevertheless, we could construct our own ontology of English concepts by manually correcting the output of such searches.

Other Articles

This post is part of a series of stories that explores the fundamentals of natural language processing:

  1. Context of Natural Language Processing: Motivations, Disciplines, Approaches, Outlook
  2. Notes on Formal Language Theory: Objects, Operations, Regular Expressions and Finite State Automata
  3. Natural Language Tool Kit 3.5: Search Functions, Statistics, Pronoun Resolution
  4. What Are Regular Languages?: Minimization, Finite State Transducers, Regular Relations
  5. What Are Context Free Languages?: Grammars, Derivations, Expressiveness, Hierarchies
  6. Inputting & PreProcessing Text: Input Methods, String & Unicode, Regular Expression Use Cases

In the next article, we will explore Normalizing, Tokenizing and Sentence Segmentation. For the table of contents and more content click here.

References

  • Clark, Alexander. The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, 2013.
  • Eisenstein, Jacob. Introduction to Natural Language Processing. The MIT Press, 2019.
  • Bird, Steven, et al. Natural Language Processing with Python. O’Reilly, 2009.
  • Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Pearson, 2014.
  • Barker-Plummer, Dave, et al. Language, Proof and Logic. CSLI Publications, Center for the Study of Language and Information, 2011.