The classifier that you will be building needs features to use for classifying different items. A feature is anything that you can determine as being either present or absent in the item. When considering documents for classification, the items are the documents and the features are the words in the document. When using words as features, the assumption is that some words are more likely to appear in spam than in nonspam, which is the basic premise underlying most spam filters. Features don’t have to be individual words, however; they can be word pairs or phrases or anything else that can be classified as absent or present in a particular document.
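As a rough illustration of the present-or-absent idea, the sketch below treats a document as the set of features it contains. The helper name extract_features, its use_pairs flag, and the sample text are hypothetical and used only for illustration; they are not part of the chapter's own code.

```python
import re

def extract_features(text, use_pairs=False):
    """Return the set of features present in text (hypothetical helper)."""
    # Split on runs of non-word characters and lowercase each word
    words = [w.lower() for w in re.split(r"\W+", text) if w]
    features = set(words)
    if use_pairs:
        # Adjacent word pairs can also serve as present/absent features
        features.update(" ".join(pair) for pair in zip(words, words[1:]))
    return features

features = extract_features("Buy cheap pills, buy now", use_pairs=True)
print("cheap" in features)      # a single word that is present
print("buy cheap" in features)  # a word pair that is present
print("casino" in features)     # a word that is absent
```

Because the result is a set, a feature is recorded once no matter how often it occurs, which matches the notion of a feature being simply present or absent in a document.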
Create a new file called docclass.py, and add a function called getwords to extract the features from the text: