SPAM> (result-type '(:FILE #p"foo" :type ham :classification ham :score 0))
CORRECT

SPAM> (result-type '(:FILE #p"foo" :type spam :classification spam :score 0))
CORRECT

SPAM> (result-type '(:FILE #p"foo" :type ham :classification spam :score 0))
FALSE-POSITIVE

SPAM> (result-type '(:FILE #p"foo" :type spam :classification ham :score 0))
FALSE-NEGATIVE

SPAM> (result-type '(:FILE #p"foo" :type ham :classification unsure :score 0))
MISSED-HAM

SPAM> (result-type '(:FILE #p"foo" :type spam :classification unsure :score 0))
MISSED-SPAM

Having this function makes it easy to slice and dice the results of test-classifier in a variety of ways. For instance, you can start by defining predicate functions for each type of result.

(defun false-positive-p (result)
  (eql (result-type result) 'false-positive))

(defun false-negative-p (result)
  (eql (result-type result) 'false-negative))

(defun missed-ham-p (result)
  (eql (result-type result) 'missed-ham))

(defun missed-spam-p (result)
  (eql (result-type result) 'missed-spam))

(defun correct-p (result)
  (eql (result-type result) 'correct))

With those functions, you can easily use the list and sequence manipulation functions I discussed in Chapter 11 to extract and count particular kinds of results.

SPAM> (count-if #'false-positive-p *results*)
6

SPAM> (remove-if-not #'false-positive-p *results*)
((:FILE #p"ham/5349" :TYPE HAM :CLASSIFICATION SPAM :SCORE 0.9999983107355541d0)
 (:FILE #p"ham/2746" :TYPE HAM :CLASSIFICATION SPAM :SCORE 0.6286468956619795d0)
 (:FILE #p"ham/3427" :TYPE HAM :CLASSIFICATION SPAM :SCORE 0.9833753501352983d0)
 (:FILE #p"ham/7785" :TYPE HAM :CLASSIFICATION SPAM :SCORE 0.9542788587998488d0)
 (:FILE #p"ham/1728" :TYPE HAM :CLASSIFICATION SPAM :SCORE 0.684339162891261d0)
 (:FILE #p"ham/10581" :TYPE HAM :CLASSIFICATION SPAM :SCORE 0.9999924537959615d0))
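Because each result is just a plist, the same sequence functions can also reorder what you extract. As a minimal sketch, the following sorts a list of results from highest score to lowest, so the most confidently wrong classifications come first. The name by-descending-score and the sample data (including the file names) are made up for illustration; they aren't part of the chapter's code.

```lisp
(defun by-descending-score (results)
  "Return a fresh list of RESULTS sorted from highest :score to lowest."
  ;; SORT is destructive, so sort a copy and leave the caller's list alone.
  (sort (copy-list results) #'> :key (lambda (result) (getf result :score))))

;; Hypothetical sample data in the same plist format as the results above.
(defparameter *sample-false-positives*
  '((:file #p"ham/a" :type ham :classification spam :score 0.63d0)
    (:file #p"ham/b" :type ham :classification spam :score 0.99d0)
    (:file #p"ham/c" :type ham :classification spam :score 0.95d0)))

;; Print each file with its score, worst offender first.
(dolist (result (by-descending-score *sample-false-positives*))
  (format t "~a ~,2f~%" (getf result :file) (getf result :score)))
```

With the chapter's definitions loaded, you'd presumably call it as (by-descending-score (remove-if-not #'false-positive-p *results*)).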

You can also use the symbols returned by result-type as keys into a hash table or an alist. For instance, you can write a function to print a summary of the counts and percentages of each type of result using an alist that maps each type plus the extra symbol total to a count.

(defun analyze-results (results)
  (let* ((keys '(total correct false-positive
                 false-negative missed-ham missed-spam))
         (counts (loop for x in keys collect (cons x 0))))
    (dolist (item results)
      (incf (cdr (assoc 'total counts)))
      (incf (cdr (assoc (result-type item) counts))))
    (loop with total = (cdr (assoc 'total counts))
          for (label . count) in counts
          do (format t "~&~@(~a~):~20t~5d~,5t: ~6,2f%~%"
                     label count (* 100 (/ count total))))))

This function will give output like this when passed a list of results generated by test-classifier:

SPAM> (analyze-results *results*)
Total:               3761 : 100.00%
Correct:             3689 :  98.09%
False-positive:         4 :   0.11%
False-negative:         9 :   0.24%
Missed-ham:            19 :   0.51%
Missed-spam:           40 :   1.06%
NIL
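The alist works well for a fixed, short list of keys. If you'd rather use a hash table, as mentioned earlier, the counting step might look something like the following sketch. The function tally-by and the sample data are hypothetical, not part of the chapter's code, and a simple plist accessor stands in for result-type so the example runs on its own.

```lisp
(defun tally-by (key-fn items)
  "Return a hash table mapping each value of KEY-FN to the number of ITEMS that produced it."
  (let ((counts (make-hash-table :test #'eql)))
    ;; The default of 0 lets INCF create an entry the first time a key is seen.
    (dolist (item items counts)
      (incf (gethash (funcall key-fn item) counts 0)))))

;; Hypothetical sample data; real results would also carry :file and :score.
(defparameter *sample*
  '((:type ham :classification ham)
    (:type ham :classification spam)
    (:type spam :classification spam)))

(let ((counts (tally-by (lambda (r) (getf r :classification)) *sample*)))
  (format t "ham: ~d, spam: ~d~%"
          (gethash 'ham counts 0)
          (gethash 'spam counts 0)))
```

With the chapter's definitions loaded, the call would presumably be (tally-by #'result-type *results*). Unlike the alist version, the table has no entry for a result type that never occurs, which is why the lookups above supply a default of 0.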

And as a last bit of analysis you might want to look at why an individual message was classified the way it was. The following functions will show you:

(defun explain-classification (file)
  (let* ((text (start-of-file file *max-chars*))
         (features (extract-features text))
         (score (score features))
         (classification (classification score)))
    (show-summary file text classification score)
    (dolist (feature (sorted-interesting features))
      (show-feature feature))))

(defun show-summary (file text classification score)
  (format t "~&~a" file)
  (format t "~2%~a~2%" text)
  (format t "Classified as ~a with score of ~,5f~%" classification score))

(defun show-feature (feature)
  (with-slots (word ham-count spam-count) feature
    (format
     t "~&~2t~a~30thams: ~5d; spams: ~5d;~,10tprob: ~,f~%"
     word ham-count spam-count (bayesian-spam-probability feature))))

(defun sorted-interesting (features)
  (sort (remove-if #'untrained-p features) #'< :key #'bayesian-spam-probability))

What's Next

Obviously, you could do a lot more with this code. To turn it into a real spam-filtering application, you'd need
