and 1 minus the hamminess. This has the nice effect that when the two scores agree (high spamminess and low hamminess, or vice versa) you'll end up with a strong indicator near either 0 or 1. But when the spamminess and hamminess scores are both high or both low, then you'll end up with a final value near 1/2, which you can treat as an 'uncertain' classification.
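To see how the combination behaves, you can plug in some hypothetical fisher values (the numbers below are invented purely for illustration) and evaluate the combining arithmetic by hand:

```lisp
;; Hypothetical fisher outputs, invented for illustration.
;; Clearly spammy message: fisher of the spam probabilities is near 1,
;; fisher of the ham probabilities is near 0.
(let ((h (- 1 0.98)) (s (- 1 0.03)))
  (/ (+ (- 1 h) s) 2.0))   ; ==> 0.975 -- a strong spam indicator

;; Ambiguous message: both fisher values near 1/2.
(let ((h (- 1 0.5)) (s (- 1 0.5)))
  (/ (+ (- 1 h) s) 2.0))   ; ==> 0.5 -- uncertain
```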

The score function that implements this scheme looks like this:

(defun score (features)
  (let ((spam-probs ()) (ham-probs ()) (number-of-probs 0))
    (dolist (feature features)
      (unless (untrained-p feature)
        (let ((spam-prob (float (bayesian-spam-probability feature) 0.0d0)))
          (push spam-prob spam-probs)
          (push (- 1.0d0 spam-prob) ham-probs)
          (incf number-of-probs))))
    (let ((h (- 1 (fisher spam-probs number-of-probs)))
          (s (- 1 (fisher ham-probs number-of-probs))))
      (/ (+ (- 1 h) s) 2.0d0))))

You take a list of features and loop over them, building up two lists of probabilities, one listing the probabilities that a message containing each feature is a spam and the other that a message containing each feature is a ham. As an optimization, you can also count the probabilities while looping over them and pass the count to fisher to avoid having to count them again in fisher itself. The value returned by fisher will be low if the list of probabilities contains too many low probabilities to have come from random text. Thus, a low fisher score for the spam probabilities means there were many hammy features; subtracting that score from 1 gives you a probability that the message is a ham. Conversely, subtracting the fisher score for the ham probabilities from 1 gives you the probability that the message is a spam. Combining those two probabilities gives you an overall spamminess score between 0 and 1.

Within the loop, you can use the function untrained-p to skip features extracted from the message that were never seen during training. These features will have spam counts and ham counts of zero. The untrained-p function is trivial.

(defun untrained-p (feature)
  (with-slots (spam-count ham-count) feature
    (and (zerop spam-count) (zerop ham-count))))

The only other new function is fisher itself. Assuming you already have an inverse-chi-square function, fisher is conceptually simple.

(defun fisher (probs number-of-probs)
  "The Fisher computation described by Robinson."
  (inverse-chi-square
   (* -2 (log (reduce #'* probs)))
   (* 2 number-of-probs)))

Unfortunately, there's a small problem with this straightforward implementation. While using REDUCE is a concise and idiomatic way of multiplying a list of numbers, in this particular application there's a danger the product will be too small to be represented as a floating-point number. In that case, the result will underflow to zero. And if the product of the probabilities underflows, all bets are off because taking the LOG of zero will either signal an error or, in some implementations, result in a special negative-infinity value, which will render all subsequent calculations essentially meaningless. This is particularly unfortunate in this function because the Fisher method is most sensitive when the input probabilities are low, near zero, and therefore in the most danger of causing the multiplication to underflow.

Luckily, you can use a bit of high-school math to avoid this problem. Recall that the log of a product is the same as the sum of the logs of the factors. So instead of multiplying all the probabilities and then taking the log, you can sum the logs of each probability. And since REDUCE takes a :key keyword parameter, you can use it to perform the whole calculation. Instead of this:

(log (reduce #'* probs))

write this:

(reduce #'+ probs :key #'log)
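Applying that substitution to the earlier definition (this is simply the same fisher with the one expression swapped, not a new function) gives an underflow-safe version:

```lisp
(defun fisher (probs number-of-probs)
  "The Fisher computation described by Robinson."
  (inverse-chi-square
   ;; Summing the logs is mathematically equivalent to taking the
   ;; log of the product but can't underflow the way the product can.
   (* -2 (reduce #'+ probs :key #'log))
   (* 2 number-of-probs)))
```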

Inverse Chi Square

The implementation of inverse-chi-square in this section is a fairly straightforward translation of a version written in Python by Robinson. The exact mathematical meaning of this function is beyond the scope of this book, but you can get an intuitive sense of what it does by thinking about how the values you pass to fisher will affect the result: the more low probabilities you pass to fisher, the smaller the product of the probabilities will be. The log of a small product will be a negative number with a large absolute value, which is then multiplied by -2, making it an even larger positive number. Thus, the more low probabilities were passed to fisher, the larger the value it'll pass to inverse-chi-square. Of course, the number of probabilities involved also affects the value passed to inverse-chi-square. Since probabilities are, by definition, less than or equal to 1, the more probabilities that go into a product, the smaller it'll be and the larger the value passed to inverse-chi-square. Thus, inverse-chi-square should return a low probability when the Fisher combined value is abnormally large for the number of probabilities that went into it. The following function does exactly that:

(defun inverse-chi-square (value degrees-of-freedom)
  (assert (evenp degrees-of-freedom))
  (min
   (loop with m = (/ value 2)
         for i below (/ degrees-of-freedom 2)
         for prob = (exp (- m)) then (* prob (/ m i))
         summing prob)
   1.0))

Recall from Chapter 10 that EXP raises e to the argument given. Thus, the larger value is, the smaller the initial value of prob will be. But that initial value will then be adjusted upward slightly for each degree of freedom as long as m is greater than the number of degrees of freedom. Since the value returned by inverse-chi-square is supposed to be another probability, it's important to clamp the value returned with MIN since rounding errors in the multiplication and exponentiation may cause the LOOP to return a sum just a shade over 1.
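A quick REPL check illustrates the behavior (the values shown are approximate, and the exact printed representation will vary by implementation): a Fisher value close to the degrees of freedom is unremarkable, while one much larger yields a very low probability.

```lisp
(inverse-chi-square 4 4)   ; ==> approximately 0.406
(inverse-chi-square 20 4)  ; ==> approximately 0.0005
```

For the first call, m is 2 and the loop sums e^-2 plus e^-2 * 2, or about 0.406; for the second, m is 10 and the sum e^-10 * (1 + 10) is only about 0.0005.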

Training the Filter

Since you wrote classify and train to take a string argument, you can test them easily at the REPL. If you haven't yet, you should switch to the package in which you've been writing this code by evaluating an IN-PACKAGE form at the REPL or using the SLIME shortcut change-package. To use the SLIME shortcut, type a comma at the REPL and then type the name at the prompt. Pressing Tab while typing the package name will autocomplete based on the packages your Lisp knows about. Now you can invoke any of the functions that are part of the spam application. You should first make sure the database is empty.
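For example, using the IN-PACKAGE form directly (this assumes your package is named, or nicknamed, SPAM, as the prompts below suggest; substitute whatever name you actually used, and note that the printed package object will look different in different Lisps):

```lisp
CL-USER> (in-package :spam)
#<PACKAGE "SPAM">
SPAM>
```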

SPAM> (clear-database)


0
