in a moment, that takes a filename and a maximum number of characters to return. train-from-corpus looks like this:
(defparameter *max-chars* (* 10 1024))

(defun train-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) do
        (destructuring-bind (file type) (aref corpus idx)
          (train (start-of-file file *max-chars*) type))))
The test-from-corpus function is similar except you want to return a list containing the results of each classification so you can analyze them after the fact. Thus, you should capture both the classification and score returned by classify and then collect a list of the filename, the actual type, the type returned by classify, and the score. To make the results more human readable, you can include keywords in the list to indicate which values are which.
(defun test-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) collect
        (destructuring-bind (file type) (aref corpus idx)
          (multiple-value-bind (classification score)
              (classify (start-of-file file *max-chars*))
            (list
             :file file
             :type type
             :classification classification
             :score score)))))
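As a small illustration of why the keywords help, the sketch below builds one result by hand and pulls values out of it with GETF. The pathname, symbols, and score are made-up sample values, not output from the real classifier.

```lisp
;; A hypothetical result, shaped like one element of the list
;; returned by test-from-corpus. All values here are invented.
(defparameter *example-result*
  (list :file #p"mail/0001.txt"
        :type 'ham
        :classification 'spam
        :score 0.92))

;; Because the list is a plist, each value can be looked up by name
;; rather than by position.
(getf *example-result* :type)           ; => HAM
(getf *example-result* :classification) ; => SPAM
(getf *example-result* :score)          ; => 0.92
```

Looking values up by keyword means later analysis code doesn't depend on the order in which test-from-corpus happens to list them.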
To finish the implementation of test-classifier, you need to write two utility functions that don't really have anything to do with spam filtering: shuffle-vector and start-of-file.
An easy and efficient way to implement shuffle-vector is to use the Fisher-Yates algorithm.[259] You can start by implementing a function, nshuffle-vector, that shuffles a vector in place. The name follows the same convention as other destructive functions such as NCONC and NREVERSE. It looks like this:
(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)
The nondestructive version simply makes a copy of the original vector and passes it to the destructive version.
(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))
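One way to convince yourself that the two functions behave as described is the following sketch. It repeats the definitions so it can run on its own; the sample vector and the helper same-elements-p are hypothetical names introduced only for this check.

```lisp
;; Definitions repeated from the text so the example is self-contained.
(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)

(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))

;; Hypothetical sample data, not part of the spam corpus.
(defparameter *sample* (vector 1 2 3 4 5 6 7 8))

(defun same-elements-p (a b)
  "True if vectors A and B hold the same numbers, ignoring order."
  (equalp (sort (copy-seq a) #'<) (sort (copy-seq b) #'<)))

(let ((shuffled (shuffle-vector *sample*)))
  ;; The shuffled copy is a permutation of the original ...
  (assert (same-elements-p shuffled *sample*))
  ;; ... and the nondestructive version leaves the original untouched.
  (assert (equalp *sample* #(1 2 3 4 5 6 7 8)))
  ;; The destructive version returns the very vector it was given.
  (assert (eq shuffled (nshuffle-vector shuffled))))
```

The order of the shuffled elements depends on the state of the random number generator, so the check compares contents rather than any particular ordering.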
The other utility function, start-of-file, is almost as straightforward with just one wrinkle. The most efficient way to read the contents of a file into memory is to create an array of the appropriate size and use READ-SEQUENCE to fill it in. So it might seem you could make a character array that's either the size of the file or the maximum number of characters you want to read, whichever is smaller. Unfortunately, as I mentioned in Chapter 14, the function FILE-LENGTH isn't entirely well defined when dealing with character streams since the number of characters encoded in a file can depend on both the character encoding used and the particular text in the file. In the worst case, the only way to get an accurate measure of the number of characters in a file is to actually read the whole file. Thus, it's ambiguous what FILE-LENGTH should do when passed a character stream; in most implementations, FILE-LENGTH always returns the number of octets in the file, which may be greater than the number of characters that can be read from the file.
However, READ-SEQUENCE returns the number of characters actually read. So, you can attempt to read the number of characters reported by FILE-LENGTH (or the maximum you want, whichever is smaller) and return a substring if the actual number of characters read was smaller.
(defun start-of-file (file max-chars)
  (with-open-file (in file)
    (let* ((length (min (file-length in) max-chars))
           (text (make-string length))
           (read (read-sequence text in)))
      (if (< read length)
        (subseq text 0 read)
        text))))
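To try start-of-file without touching the corpus, you could write a short scratch file and read it back, as in the sketch below. The function definition is repeated so the example runs on its own; the scratch filename and the helper demo-start-of-file are hypothetical and exist only for this demonstration. The file is created in the current default directory and deleted afterward.

```lisp
;; Definition repeated from the text so the example is self-contained.
(defun start-of-file (file max-chars)
  (with-open-file (in file)
    (let* ((length (min (file-length in) max-chars))
           (text (make-string length))
           (read (read-sequence text in)))
      (if (< read length)
        (subseq text 0 read)
        text))))

;; Hypothetical scratch file, resolved against *default-pathname-defaults*.
(defparameter *demo-file* (merge-pathnames "start-of-file-demo.txt"))

(defun demo-start-of-file (contents max-chars)
  "Write CONTENTS to the scratch file, read it back, then delete the file."
  (with-open-file (out *demo-file* :direction :output :if-exists :supersede)
    (write-string contents out))
  (unwind-protect (start-of-file *demo-file* max-chars)
    (delete-file *demo-file*)))

(demo-start-of-file "Hello, world" 5)    ; => "Hello"
(demo-start-of-file "Hello, world" 1024) ; => "Hello, world"
```

The first call is cut off by max-chars; the second is limited by the file itself, so the whole text comes back however FILE-LENGTH counts it.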
Now you're ready to write some code to analyze the results generated by test-classifier. Recall that test-classifier returns the list returned by test-from-corpus in which each element is a plist representing the result of classifying one file. This plist contains the name of the file, the actual type of the file, the classification, and the score returned by classify. The first bit of analytical code you should write is a function that returns a symbol indicating whether a given result was correct, a false positive, a false negative, a missed ham, or a missed spam. You can use DESTRUCTURING-BIND to pull out the :type and :classification elements of an individual result list (using &allow-other-keys to tell DESTRUCTURING-BIND to ignore any other key/value pairs it sees) and then use nested ECASE to translate the different pairings into a single symbol.
(defun result-type (result)
  (destructuring-bind (&key type classification &allow-other-keys) result
    (ecase type
      (ham
       (ecase classification
         (ham 'correct)
         (spam 'false-positive)
         (unsure 'missed-ham)))
      (spam
       (ecase classification
         (ham 'false-negative)
         (spam 'correct)
         (unsure 'missed-spam))))))
You can test out this function at the REPL.
