The source code for this blog is available on GitHub.


Text Segmentation

Jake Batsuuri

Normalizing Text

What does normalizing a text do?

In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g., set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored.

Often we want to go further than this and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.

  • Stemming
  • Lemmatization
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)

Stemming

NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer, you should use one of these in preference to crafting your own using regular expressions, since NLTK’s stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), whereas the Lancaster stemmer does not.

  • Porter > Lancaster
porter = nltk.PorterStemmer()
[porter.stem(t) for t in tokens]
# ['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
# 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
# '.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
# 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']

lancaster = nltk.LancasterStemmer()
[lancaster.stem(t) for t in tokens]
# ['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
# 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
# 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
# 'from', 'som', 'farc', 'aqu', 'ceremony', '.']

Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The Porter Stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words (illustrated in Example 3-1, which uses object-oriented programming techniques that are outside the scope of this book, string formatting techniques to be covered in Section 3.9, and the enumerate() function to be explained in Section 4.2).

class IndexedText(object):
	def __init__(self, stemmer, text):
		self._text = text
		self._stemmer = stemmer
		self._index = nltk.Index((self._stem(word), i) for (i, word) in enumerate(text))

	def concordance(self, word, width=40):
		key = self._stem(word)
		wc = width/4 # words of context
		for i in self._index[key]:
			lcontext = ' '.join(self._text[i-wc:i])
			rcontext = ' '.join(self._text[i:i+wc])
			ldisplay = '%*s' % (width, lcontext[-width:])
			rdisplay = '%-*s' % (width, rcontext[:width])
			print ldisplay, rdisplay

	def _stem(self, word):
		return self._stemmer.stem(word).lower()

porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

# r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
# beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
# Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
# doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
# ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
# you . Oh TIM : To the north there lies a cave -- the cave of Caerbannog --
# h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
# not stop our fight ' til each one of you lies dead , and the Holy Grail returns t

Lemmatization

The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the stemmers just mentioned. Notice that it doesn’t handle lying, but it converts women to woman.

wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

'''
# ['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
# 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
# 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
# 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
# 'aquatic', 'ceremony', '.']
'''

The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas (or lexicon headwords).
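For instance, a minimal sketch of compiling such a lemma vocabulary from the sample text above (reusing the wnl and tokens variables from the earlier snippets):

sorted(set(wnl.lemmatize(t.lower()) for t in tokens if t.isalpha()))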

Another normalization task involves identifying non-standard words, including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks.
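A minimal sketch of this kind of mapping, assuming the placeholder tokens 0.0 and AAA mentioned above and a hypothetical normalize_token() helper:

import re

def normalize_token(token):
	# Map decimal numbers to a single placeholder token.
	if re.match(r'\d+(\.\d+)?$', token):
		return '0.0'
	# Map acronyms (two or more consecutive capitals) to a placeholder token.
	if re.match(r'[A-Z]{2,}$', token):
		return 'AAA'
	return token.lower()

[normalize_token(t) for t in ['The', 'U.N.', 'spent', '12.40', 'USD']]
# ['the', 'u.n.', 'spent', '0.0', 'AAA']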

Regular Expressions for Tokenization

Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.

Naïve Tokenization

The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice’s Adventures in Wonderland:

raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, it is not enough to match any space characters in the string, since this results in tokens that contain a \n newline character; instead, we need to match any number of spaces, tabs, or newlines.

import re

re.split(r' ', raw)
# ["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
# 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',
# 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',
# "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

re.split(r'[ \t\n]+', raw)
# ["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
# 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
# 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
# "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

re.split(r'\s+', raw)

The regular expression «[ \t\n]+» matches one or more spaces, tabs (\t), or newlines (\n). Other whitespace characters, such as carriage return and form feed, should really be included too. Instead, we will use a built-in re abbreviation, \s, which means any whitespace character. The second statement in the preceding example can be rewritten as re.split(r'\s+', raw).

Important: Remember to prefix regular expressions with the letter r (meaning “raw”), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.

Splitting on Non-Word Characters

Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters, equivalent to [a-zA-Z0-9_]. It also defines the complement of this class, \W, i.e., all characters other than letters, digits, or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:

re.split(r'\W+', raw)
# ['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
# 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
# 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
# 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',
# '']

Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). With re.findall(r'\w+', raw), we get the same tokens, but without the empty strings, using a pattern that matches the words instead of the spaces. Now that we’re matching the words, we’re in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g., ’s) but that sequences of two or more punctuation characters are separated.

re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Let’s generalize the \w+ in the preceding expression to permit word-internal hyphens and apostrophes: «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it’s. (We need to include ?: in this expression for reasons discussed earlier.) We’ll also add a pattern to match quote characters so these are kept separate from the text they enclose.

print re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I',
"won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup',
'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper',
'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']

Symbol   Function

\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character



Regex Symbol List and Examples



. Period, matches any single character, except a line ending. For example, the regex below matches shirt, short, and any other string with a single character between sh and rt.

sh.rt

^ Caret, matches a term only if it appears at the beginning of a paragraph or a line. For example, the regex below matches a paragraph or a line that starts with Apple.

^Apple

^ Caret inside square brackets negates the character class. For example, the regex below matches any character except a, b, c, d, or e.

[^a-e]

$ Dollar sign, matches a term only if it appears at the end of a paragraph or a line. For example, the regex below matches a paragraph or a line that ends with bye.

bye$

[ ] Square brackets, matches any single character from the bracketed list. For example, the regex below matches bad, bed, bcd, brd, and bod.

b[aecro]d

- Hyphen, represents a range of letters or numbers, and is often used inside square brackets. For example, the regex below matches kam, kbm, kcm, k2m, k3m, k4m, and k5m.

k[a-c2-5]m

( ) Parentheses, group one or more regular expressions. For example, the regex below matches codexpedia.com, codexpedia.net, and codexpedia.org.

codexpedia\.(com|net|org)

{n} Curly brackets with one number inside, matches exactly n occurrences of the preceding character. For example, the regular expression below matches a four-digit string, and only a four-digit string, because there is a ^ at the beginning and a $ at the end of the regex.

^[\d]{4}$

{n,m} Curly brackets with two numbers inside, matches between n and m occurrences of the preceding character. For example, the regular expression below matches google, gooogle, and goooogle.

go{2,4}gle

{n,} Curly brackets with a number and a comma, matches at least n occurrences of the preceding character. For example, the regex below matches google, gooogle, goooogle, and so on.

go{2,}gle

| Pipe, matches either the regular expression preceding it or the regular expression following it. For example, the regex below matches dates in the formats MM/DD/YYYY, MM.DD.YYYY, and MM-DD-YYYY. It also matches mixed separators such as MM.DD-YYYY.

^(0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])[-/.][0-9]{4}$

? Question mark, matches zero or one occurrence of the preceding character. For example, the regular expression below matches apple and apples.

apples?

* Asterisk, matches zero or more occurrences of the preceding character. For example, the regular expression below matches cl, col, cool, coool, …, coooooooooool, …

co*l

+ Plus, matches one or more occurrences of the preceding character. For example, the regular expression below matches col, cool, coool, …, cooooooooooool, …

co+l

(?!...) Negative lookahead, matches only if the following characters do not match the given expression. For example, the regular expression below matches the character q only if the character after q is not a digit; it matches the q in abdqk, quit, and qeig, but not in q2kd or sdkq8d.

q(?![0-9])

\ Backslash, turns off the special meaning of the next character. For example, the regex below treats the period as a literal character, so it matches a.b only.

a\.b

\b Backslash and b, matches a word boundary. For example, “\bwater” finds “watergun” but not “cleanwater”, whereas “water\b” finds “cleanwater” but not “watergun”.

\n Backslash and n, represents a line break.

\t Backslash and t, represents a tab.

\w Backslash and w, equivalent to [a-zA-Z0-9_], matches an alphanumeric character or underscore. Conversely, capital \W matches any character that is not alphanumeric or underscore.

\d Backslash and d, matches digits 0 to 9, equivalent to [0-9].

[:alpha:] or [A-Za-z] represents an alphabetic character.

[:digit:] or [0-9] or [\d] represents a digit.

[:alnum:] or [A-Za-z0-9] represents an alphanumeric character.

This regex matches email addresses

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b

This regex matches website links ending in .com, .org, .edu, .gov, or .us

https?://(www\.)?[A-Za-z0-9]+\.(com|org|edu|gov|us)/?.*

This regex matches social security numbers.

^[0-9]{3}-[0-9]{2}-[0-9]{4}$





NLTK Regex Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall() (as we’ve been using it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses.

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)     # set flag to allow verbose regexps
(?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)*         # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%?   # currency and percentages, e.g. $12.40, 82%
| \.\.\.               # ellipsis
| [][.,;"'?():_`-]     # these are separate tokens; the hyphen goes last so it is literal
'''
nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
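For instance, a quick sketch of the gaps behavior, splitting the same text on whitespace instead of matching the tokens themselves:

nltk.regexp_tokenize(text, r'\s+', gaps=True)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']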

We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and then report any tokens that don’t appear in the wordlist, using set(tokens).difference(wordlist). You’ll probably want to lowercase all the tokens first.
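A rough sketch of that idea, using NLTK’s English wordlist purely as an example vocabulary (the names wordlist and toks are illustrative):

wordlist = set(w.lower() for w in nltk.corpus.words.words())
toks = nltk.regexp_tokenize(text, pattern)
sorted(set(t.lower() for t in toks).difference(wordlist))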

Issues with Tokenization

Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across the board, and we must decide what counts as a token depending on the application domain.

When developing a tokenizer it helps to have access to raw text which has been manually tokenized, in order to compare the output of your tokenizer with high-quality (or “gold-standard”) tokens. The NLTK corpus collection includes a sample of Penn Treebank data, including the raw Wall Street Journal text (nltk.corpus.treebank_raw.raw()) and the tokenized version (nltk.corpus.treebank.words()).

A final issue for tokenization is the presence of contractions, such as didn’t. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: did and n’t (or not). We can do this work with the help of a lookup table.
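A minimal sketch of such a lookup table; the table contents and the expand() helper here are illustrative, not a standard NLTK resource:

CONTRACTIONS = {"n't": 'not', "'ll": 'will', "'ve": 'have', "'re": 'are'}

def expand(token):
	# Replace a contraction token with its expanded form, if we know it.
	return CONTRACTIONS.get(token, token)

[expand(t) for t in nltk.word_tokenize("I didn't say we'll go")]
# e.g. ['I', 'did', 'not', 'say', 'we', 'will', 'go']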

Segmentation

Tokenization is a type of segmentation, at the word level. We can also segment at the sentence level.

Sentence Segmentation

Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. As we have seen, some corpora already provide access at the sentence level. In the following example, we compute the average number of words per sentence in the Brown Corpus:

len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20.250994070456922 # Average sentence length

In other cases, the text is available only as a stream of characters. Before tokenizing the text into words, we need to segment it into sentences. NLTK facilitates this by including the Punkt sentence segmenter (Kiss & Strunk, 2006).

Here is an example of its use in segmenting the text of a novel. (Note that if the segmenter’s internal data has been updated by the time you read this, you will see different output.)

import pprint

sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = sent_tokenizer.tokenize(text)
pprint.pprint(sents[171:181])

['"Nonsense!',
'" said Gregory, who was very rational when anyone else\nattempted paradox.',
'"Why do all the clerks and navvies in the\nrailway trains look so sad and tired,...',
'I will\ntell you.',
'It is because they know that the train is going right.',
'It\nis because they know that whatever place they have taken a ticket\nfor that ...',
'It is because after they have\npassed Sloane Square they know that the next stat...',
'Oh, their wild rapture!',
'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation w...',
'"\n\n"It is you who are unpoetical," replied the poet Syme.']

Notice that this example is really a single sentence, reporting the speech of Mr. Lucian Gregory. However, the quoted speech contains several sentences, and these have been split into individual strings. This is reasonable behavior for most applications.

Sentence segmentation is difficult because a period is used to mark abbreviations, and some periods simultaneously mark an abbreviation and terminate a sentence, as often happens with acronyms like U.S.A.
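As a quick illustration, consider a sentence with abbreviation periods; the exact output depends on the Punkt model shipped with your NLTK installation, so treat it as indicative only:

tricky = "I met Dr. Smith at the U.S.A. pavilion. He waved."
# The periods after 'Dr' and inside 'U.S.A.' do not end sentences,
# while the one after 'pavilion' does; the segmenter must decide which is which.
nltk.sent_tokenize(tricky)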

Representing Word Segmentations

text = "**doyouseethekitty**seethedoggy**doyoulikethekitty**likethedoggy"
seg1 = "**0000000000000001**00000000001**00000000000000001**00000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

Reconstruct segmented text from string representation: seg1 and seg2 represent the initial and final segmentations of some hypothetical child-directed speech; the segment() function can use them to reproduce the segmented text.

def segment(text, segs):
	words = []
	last = 0
	for i in range(len(segs)):
		if segs[i] == '1':
			words.append(text[last:i+1])
			last = i+1
	words.append(text[last:])
	return words

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

segment(text, seg1)
['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

segment(text, seg2)
['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you',
'like', 'the', 'kitty', 'like', 'the', 'doggy']

Minimizing the Objective Function

def evaluate(text, segs):
	words = segment(text, segs)
	text_size = len(words)
	lexicon_size = len(' '.join(list(set(words))))
	return text_size + lexicon_size

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"

segment(text, seg3)
['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like',
'thekitt', 'y', 'like', 'thedogg', 'y']
evaluate(text, seg3)
46
evaluate(text, seg2)
47
evaluate(text, seg1)
63

The final step is to search for the pattern of zeros and ones that minimizes this objective function, shown in Example 3-4. Notice that the best segmentation includes “words” like thekitty, since there’s not enough evidence in the data to split this any further.

Search with Simulated Annealing

from random import randint

def flip(segs, pos):
	return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
	for i in range(n):
		segs = flip(segs, randint(0,len(segs)-1))
	return segs

def anneal(text, segs, iterations, cooling_rate):
	temperature = float(len(segs))
	while temperature > 0.5:
		best_segs, best = segs, evaluate(text, segs)
		for i in range(iterations):
			guess = flip_n(segs, int(round(temperature)))
			score = evaluate(text, guess)
			if score < best:
				best, best_segs = score, guess
		score, segs = best, best_segs
		temperature = temperature / cooling_rate
		print evaluate(text, segs), segment(text, segs)
	print
	return segs

Non-deterministic search using simulated annealing: Begin searching with phrase segmentations only; randomly perturb the zeros and ones proportional to the “temperature”; with each iteration the temperature is lowered and the perturbation of boundaries is reduced.

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"

anneal(text, seg1, 5000, 1.2)
60 ['doyouseetheki', 'tty', 'see', 'thedoggy', 'doyouliketh', 'ekittylike', 'thedoggy']
58 ['doy', 'ouseetheki', 'ttysee', 'thedoggy', 'doy', 'o', 'ulikethekittylike', 'thedoggy']
56 ['doyou', 'seetheki', 'ttysee', 'thedoggy', 'doyou', 'liketh', 'ekittylike', 'thedoggy']
54 ['doyou', 'seethekit', 'tysee', 'thedoggy', 'doyou', 'likethekittylike', 'thedoggy']
53 ['doyou', 'seethekit', 'tysee', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
51 ['doyou', 'seethekittysee', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
42 ['doyou', 'see', 'thekitty', 'see', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
'0000100100000001001000000010000100010000000100010000000'

With enough data, it is possible to automatically segment text into words with a reasonable degree of accuracy. Such methods can be applied to tokenization for writing systems that don’t have any visual representation of word boundaries.

Formatting: Lists to Strings

Often we write a program to report a single data item, such as a particular element in a corpus that meets some complicated criterion, or a single summary statistic such as a word-count or the performance of a tagger.

More often, we write a program to produce a structured result; for example, a tabulation of numbers or linguistic forms, or a reformatting of the original data. When the results to be presented are linguistic, textual output is usually the most natural choice.

However, when the results are numerical, it may be preferable to produce graphical output. In this section, you will learn about a variety of ways to present program output.

From Lists to Strings

The simplest kind of structured object we use for text processing is lists of words. When we want to output these to a display or a file, we must convert these lists into strings. To do this in Python we use the join() method, and specify the string to be used as the “glue”:

silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']

' '.join(silly)
'We called him Tortoise because he taught us .'
';'.join(silly)
'We;called;him;Tortoise;because;he;taught;us;.'
''.join(silly)
'WecalledhimTortoisebecausehetaughtus.'

So ' '.join(silly) means: take all the items in silly and concatenate them as one big string, using ' ' as a spacer between the items. I.e., join() is a method of the string that you want to use as the glue. (Many people find this notation for join() counterintuitive.) The join() method only works on a list of strings—what we have been calling a text—a complex type that enjoys some privileges in Python.
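A quick sketch of what happens when the list contains a non-string (the exact error message varies between Python versions):

' '.join(['We', 'called', 'him', 42])
# raises TypeError, because 42 is an integer rather than a string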

Strings and Formats

We have seen that there are two ways to display the contents of an object:

word = 'cat'

print word
cat
word
'cat'
sentence = """hello
world"""

print sentence
hello
world
sentence
'hello\nworld'

The print command yields Python’s attempt to produce the most human-readable form of an object. The second method—naming the variable at a prompt—shows us a string that can be used to recreate this object. It is important to keep in mind that both of these are just strings, displayed for the benefit of you, the user. They do not give us any clue as to the actual internal representation of the object.

There are many other useful ways to display an object as a string of characters. This may be for the benefit of a human reader, or because we want to export our data to a particular file format for use in an external program.

Formatted output typically contains a combination of variables and pre-specified strings. For example, given a frequency distribution fdist, we could do:

fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in fdist:
	print word, '->', fdist[word], ';',

# dog -> 4 ; cat -> 3 ; snake -> 1 ;

String Formatting Expressions

Apart from the problem of unwanted whitespace, print statements that contain alternating variables and constants can be difficult to read and maintain. A better solution is to use string formatting expressions.

for word in fdist:
	print '%s->%d;' % (word, fdist[word]),

# dog->4; cat->3; snake->1;

To understand what is going on here, let’s test out the string formatting expression on its own. (By now this will be your usual method of exploring new syntax.)

'%s->%d;' % ('cat', 3)
'cat->3;'

'%s->%d;' % 'cat'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string

The special symbols %s and %d are placeholders for strings and (decimal) integers. We can embed these inside a string, then use the % operator to combine them. Let’s unpack this code further, in order to see this behavior up close:

String and Decimal Operators ⇒ Conversion Specifiers

'%s->' % 'cat'
'cat->'

'%d' % 3
'3'

'I want a %s right now' % 'coffee'
'I want a coffee right now'

We can have a number of placeholders, but following the % operator we need to specify a tuple with exactly the same number of values:

"%s wants a %s %s" % ("Lee", "sandwich", "for lunch")
'Lee wants a sandwich for lunch'

We can also provide the values for the placeholders indirectly. Here’s an example using a for loop:

template = 'Lee wants a %s right now'
menu = ['sandwich', 'spam fritter', 'pancake']
for snack in menu:
	print template % snack

Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now

The %s and %d symbols are called conversion specifiers. They start with the % character and end with a conversion character such as s (for string) or d (for decimal integer). The string containing conversion specifiers is called a format string. We combine a format string with the % operator and a tuple of values to create a complete string formatting expression.

Lining Things Up

So far our formatting strings generated output of arbitrary width on the page (or screen), such as %s and %d. We can specify a width as well, such as %6s, producing a string that is padded to width 6.

'%6s' % 'dog'
'   dog'

'%-6s' % 'dog'
'dog   '

It is right-justified by default, but we can include a minus sign to make it left-justified. In case we don’t know in advance how wide a displayed value should be, the width value can be replaced with a star in the formatting string, then specified using a variable.

width = 6
'%-*s' % (width, 'dog')
'dog   '

Other control characters are used for decimal integers and floating-point numbers. Since the percent character % has a special interpretation in formatting strings, we have to precede it with another % to get it in the output.

count, total = 3205, 9375
"accuracy for %d words: %2.4f%%" % (total, 100 * count / total)
'accuracy for 9375 words: 34.1867%'

An important use of formatting strings is for tabulating data. Recall that in Section 2.1 we saw data being tabulated from a conditional frequency distribution. Let’s perform the tabulation ourselves, exercising full control of headings and column widths, as shown in Example 3-5. Note the clear separation between the language processing work, and the tabulation of results.

def tabulate(cfdist, words, categories):
	print '%-16s' % 'Category',
	for word in words:                            # column headings
		print '%6s' % word,
	print
	for category in categories:
		print '%-16s' % category,                 # row heading
		for word in words:                        # for each word
			print '%6d' % cfdist[category][word], # print table cell
		print                                     # end the row

from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
	(genre, word)
	for genre in brown.categories()
	for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
tabulate(cfd, modals, genres)

Recall from the listing in Example 3-1 that we used a formatting string "%*s". This allows us to specify the width of a field using a variable.

'%*s' % (15, "Monty Python")
'   Monty Python'

We could use this to automatically customize the column to be just wide enough to accommodate all the words, using width = max(len(w) for w in words). Remember that the comma at the end of print statements adds an extra space, and this is sufficient to prevent the column headings from running into each other.
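For example, a small sketch of that idea applied to the modal column headings from the tabulation above:

width = max(len(w) for w in modals)   # 5, the length of 'could' and 'might'
'%*s' % (width, 'may')
# '  may'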

Writing Results to a File

It is often useful to write output to files as well. The following code opens a file output.txt for writing, and saves the program output to the file.

output_file = open('output.txt', 'w')
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
for word in sorted(words):
	output_file.write(word + "\n")

What is the effect of appending \n to each string before we write it to the file? If you’re using a Windows machine, you may want to use word + "\r\n" instead. What happens if we do output_file.write(word)?

When we write non-text data to a file, we must convert it to a string first. We can do this conversion using formatting strings, as we saw earlier. Let’s write the total number of words to our file, before closing it.

len(words)
2789

str(len(words))
'2789'

output_file.write(str(len(words)) + "\n")
output_file.close()

You should avoid filenames that contain space characters, such as output file.txt, or that are identical except for case distinctions, e.g., Output.txt and output.TXT.

Text Wrapping

When the output of our program is text-like, instead of tabular, it will usually be necessary to wrap it so that it can be displayed conveniently. Consider the following output, which overflows its line, and which uses a complicated print statement:
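(The snippet below is a reconstruction; the word list saying is inferred from the wrapped output shown later in this section.)

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']
for word in saying:
	print word, '(' + str(len(word)) + '),',

# After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1),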

We can take care of line wrapping with the help of Python’s textwrap module. For maximum clarity we will separate each step onto its own line:

from textwrap import fill
format = '%s (%d),'
pieces = [format % (word, len(word)) for word in saying]
output = ' '.join(pieces)
wrapped = fill(output)
print wrapped

# After (5), all (3), is (2), said (4), and (3), done (4), , (1), more
# (4), is (2), said (4), than (4), done (4), . (1),

Notice that there is a linebreak between more and its following number. If we wanted to avoid this, we could redefine the formatting string so that it contained no spaces (e.g., '%s_(%d),'), then instead of printing the value of wrapped, we could print wrapped.replace('_', ' ').
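A short sketch of that variation:

format = '%s_(%d),'
pieces = [format % (word, len(word)) for word in saying]
wrapped = fill(' '.join(pieces))
print wrapped.replace('_', ' ')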

Summary

  • Substrings are accessed using slice notation: 'Monty Python'[1:5] gives the value onty.

    • If the start index is omitted, the substring begins at the start of the string;
    • if the end index is omitted, the slice continues to the end of the string.
  • Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python'].

  • Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives 'Monty/Python'.

  • We can read text from a file f using text = open(f).read().

  • We can read text from a URL u using text = urlopen(u).read().

  • We can iterate over the lines of a text file using for line in open(f).

  • Tokenization is the segmentation of a text into basic units—or tokens—such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer nltk.word_tokenize().

  • Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g., appear).

  • Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.

  • If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.

  • When backslash is used before certain characters, e.g., \n, this takes on a special meaning (newline character);

    • however, when backslash is used before regular expression wildcards and operators, e.g., ., |, $, these characters lose their special meaning and are matched literally.
  • A string formatting expression template % arg_tuple consists of a format string template that contains conversion specifiers like %-6s and %0.2d.

  • Simulated annealing is a heuristic for finding a good approximation to the optimum value of a function in a large, discrete search space, based on an analogy with annealing in metallurgy. The technique is described in many Artificial Intelligence texts.