Now you can train the filter with some text.

SPAM> (train "Make money fast" 'spam)

And then see what the classifier thinks.

SPAM> (classify "Make money fast")

SPAM

SPAM> (classify "Want to go to the movies?")

UNSURE

While ultimately all you care about is the classification, it'd be nice to be able to see the raw score too. The easiest way to get both values without disturbing any other code is to change classification to return multiple values.

(defun classification (score)
  (values
   (cond
     ((<= score *max-ham-score*) 'ham)
     ((>= score *min-spam-score*) 'spam)
     (t 'unsure))
   score))

You can make this change and then recompile just this one function. Because classify returns whatever classification returns, it'll also now return two values. But since the primary return value is the same, callers of either function who expect only one value won't be affected. Now when you test classify, you can see exactly what score went into the classification.
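For instance, a caller that does want both values can capture them with the standard MULTIPLE-VALUE-BIND macro, while a caller that uses only the primary value sees nothing new:

```lisp
;; Capture both the classification and the raw score returned by CLASSIFY.
(multiple-value-bind (type score) (classify "Make money fast")
  (format t "~a (score: ~f)~%" type score))
```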

SPAM> (classify "Make money fast")

SPAM

0.863677101854273D0

SPAM> (classify "Want to go to the movies?")

UNSURE

0.5D0

And now you can see what happens if you train the filter with some more ham text.

SPAM> (train "Do you have any money for the movies?" 'ham)

1

SPAM> (classify "Make money fast")

SPAM

0.7685351219857626D0

It's still spam but a bit less certainly so, since money has now also been seen in ham text.

SPAM> (classify "Want to go to the movies?")

HAM

0.17482223132078922D0

And now this is clearly recognizable ham thanks to the presence of the word movies, now a hammy feature.

However, you don't really want to train the filter by hand. What you'd really like is an easy way to point it at a bunch of files and train it on them. And if you want to test how well the filter actually works, you'd like to then use it to classify another set of files of known types and see how it does. So the last bit of code you'll write in this chapter will be a test harness that tests the filter on a corpus of messages of known types, using a certain fraction for training and then measuring how accurate the filter is when classifying the remainder.

Testing the Filter

To test the filter, you need a corpus of messages of known types. You can use messages lying around in your inbox, or you can grab one of the corpora available on the Web. For instance, the SpamAssassin corpus[257] contains several thousand messages hand classified as spam, easy ham, and hard ham. To make it easy to use whatever files you have, you can define a test rig that's driven off an array of file/type pairs. You can define a function that takes a filename and a type and adds it to the corpus like this:

(defun add-file-to-corpus (filename type corpus)
  (vector-push-extend (list filename type) corpus))

The value of corpus should be an adjustable vector with a fill pointer. For instance, you can make a new corpus like this:

(defparameter *corpus* (make-array 1000 :adjustable t :fill-pointer 0))

If you have the hams and spams already segregated into separate directories, you might want to add all the files in a directory as the same type. This function, which uses the list-directory function from Chapter 15, will do the trick:

(defun add-directory-to-corpus (dir type corpus)
  (dolist (filename (list-directory dir))
    (add-file-to-corpus filename type corpus)))

For instance, suppose you have a directory mail containing two subdirectories, spam and ham, each containing messages of the indicated type; you can add all the files in those two directories to *corpus* like this:

SPAM> (add-directory-to-corpus "mail/spam/" 'spam *corpus*)

NIL

SPAM> (add-directory-to-corpus "mail/ham/" 'ham *corpus*)

NIL

Now you need a function to test the classifier. The basic strategy will be to select a random chunk of the corpus to train on and then test the corpus by classifying the remainder of the corpus, comparing the classification returned by the classify function to the known classification. The main thing you want to know is how accurate the classifier is—what percentage of the messages are classified correctly? But you'll probably also be interested in what messages were misclassified and in what direction—were there more false positives or more false negatives? To make it easy to perform different analyses of the classifier's behavior, you should define the testing functions to build a list of raw results, which you can then analyze however you like.
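To make the raw-results idea concrete, test-from-corpus might collect one property list per classified message. The following is only a sketch of that shape: start-of-file is the text-reading helper described below, and *max-chars* is a hypothetical limit on how much of each message to read, not names fixed by the text:

```lisp
;; Sketch only -- assumes START-OF-FILE and a hypothetical *MAX-CHARS* limit.
;; Collects one plist of raw results per message for later analysis.
(defun test-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) collect
        (destructuring-bind (file type) (aref corpus idx)
          (multiple-value-bind (classification score)
              (classify (start-of-file file *max-chars*))
            (list :file file
                  :type type
                  :classification classification
                  :score score)))))
```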

The main testing function might look like this:

(defun test-classifier (corpus testing-fraction)
  (clear-database)
  (let* ((shuffled (shuffle-vector corpus))
         (size (length corpus))
         (train-on (floor (* size (- 1 testing-fraction)))))
    (train-from-corpus shuffled :start 0 :end train-on)
    (test-from-corpus shuffled :start train-on)))

This function starts by clearing out the feature database.[258] Then it shuffles the corpus, using a function you'll implement in a moment, and figures out, based on the testing-fraction parameter, how many messages it'll train on and how many it'll reserve for testing. The two helper functions train-from-corpus and test-from-corpus will both take :start and :end keyword parameters, allowing them to operate on a subsequence of the given corpus.
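One plausible way to write the shuffling function, sketched here as a Fisher-Yates shuffle over a copy so the original corpus vector is left untouched:

```lisp
;; Sketch of SHUFFLE-VECTOR: a Fisher-Yates shuffle over a copy.
(defun shuffle-vector (v)
  (let ((result (copy-seq v)))
    (loop for i from (length result) downto 2
          ;; Swap a random element from the unshuffled prefix into place.
          do (rotatef (aref result (random i))
                      (aref result (1- i))))
    result))
```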

The train-from-corpus function is quite simple: loop over the appropriate part of the corpus, use DESTRUCTURING-BIND to extract the filename and type from the list found in each element, and then pass the text of the named file, along with the type, to train. Since some mail messages, such as those with attachments, are quite large, you should limit the number of characters taken from each message. You can obtain the text with a function start-of-file, which you'll implement shortly.
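A sketch along those lines, assuming the start-of-file helper just described and a hypothetical *max-chars* limit chosen purely for illustration:

```lisp
;; Sketch -- *MAX-CHARS* is an illustrative limit on how much of each
;; message to read; START-OF-FILE is the helper described above.
(defparameter *max-chars* (* 10 1024))

(defun train-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) do
        (destructuring-bind (file type) (aref corpus idx)
          (train (start-of-file file *max-chars*) type))))
```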

You are reading Practical Common Lisp