247
There has since been some disagreement over whether the technique Graham described was actually 'Bayesian.' However, the name has stuck and is well on its way to becoming a synonym for 'statistical' when talking about spam filters.
248
It would, however, be poor form to distribute a version of this application using a package starting with com.gigamonkeys since you don't control that domain.
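As a minimal sketch of the convention, a package for your own version would be named after a domain you do control. The name com.example.spam below is a placeholder chosen for illustration, not a name used by the book's code.

```lisp
;; Hypothetical package name: replace com.example with the reversed
;; form of a domain you actually control.
(defpackage :com.example.spam
  (:use :common-lisp))

(in-package :com.example.spam)
```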
249
A version of CL-PPCRE is included with the book's source code available from the book's Web site. Or you can download it from Weitz's site at http://www.weitz.de/cl-ppcre/.
250
The main reason to use PRINT-UNREADABLE-OBJECT is that it takes care of signaling the appropriate error if someone tries to print your object readably, such as with the ~S FORMAT directive.
251
PRINT-UNREADABLE-OBJECT also signals an error if it's used when the printer control variable *PRINT-READABLY* is true. Thus, a PRINT-OBJECT method consisting solely of a PRINT-UNREADABLE-OBJECT form will correctly implement the PRINT-OBJECT contract with regard to *PRINT-READABLY*.
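A minimal sketch of such a method follows. The word-feature class, its slots, and the output format are assumptions made so the example stands alone; they may not match the chapter's definitions exactly.

```lisp
;; Assumed class definition, included only to make the example self-contained.
(defclass word-feature ()
  ((word       :initarg :word :accessor word)
   (spam-count :initform 0    :accessor spam-count)
   (ham-count  :initform 0    :accessor ham-count)))

;; The method body is a single PRINT-UNREADABLE-OBJECT form, so it
;; signals PRINT-NOT-READABLE whenever *PRINT-READABLY* is true and
;; otherwise prints the object wrapped in #<...>.
(defmethod print-object ((object word-feature) stream)
  (print-unreadable-object (object stream :type t)
    (with-slots (word ham-count spam-count) object
      (format stream "~s :hams ~d :spams ~d" word ham-count spam-count))))
```

Printing an instance normally should produce something in the #<...> syntax, while binding *PRINT-READABLY* to T around the same call should signal PRINT-NOT-READABLE instead of printing anything.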
252
If you decide later that you do need to have different versions of increment-count for different classes, you can redefine increment-count as a generic function and this function as a method specialized on word-feature.
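One way that redefinition might look is sketched below. The word-feature class is an assumed stand-in so the example runs on its own, and the FMAKUNBOUND call is there only because DEFGENERIC signals an error if the name is still bound to an ordinary function.

```lisp
;; Assumed class definition, included only to make the example self-contained.
(defclass word-feature ()
  ((word       :initarg :word :accessor word)
   (spam-count :initform 0    :accessor spam-count)
   (ham-count  :initform 0    :accessor ham-count)))

;; Discard any earlier DEFUN of the same name before defining the
;; generic function; this is harmless if the name is not yet defined.
(fmakunbound 'increment-count)

(defgeneric increment-count (feature type)
  (:documentation "Increment the count of FEATURE for messages of TYPE."))

;; The body of the original function becomes a method specialized on
;; word-feature; other feature classes can supply their own methods.
(defmethod increment-count ((feature word-feature) type)
  (ecase type
    (ham  (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))
```

Existing callers do not need to change, since a generic function is called exactly like an ordinary one.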
253
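Technically, the key in each clause of a CASE or ECASE is interpreted as a list designator: a single nonlist key designates a list containing just that key, while a list designates itself. Thus, each clause can have multiple keys, and CASE and ECASE will select the clause whose list of keys contains the value of the key form. For example, if you wanted to make good a synonym for ham and bad a synonym for spam, you could write increment-count like this: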
  (defun increment-count (feature type)
    (ecase type
      ((ham good) (incf (ham-count feature)))
      ((spam bad) (incf (spam-count feature)))))
254
Speaking of mathematical nuances, hard-core statisticians may be offended by the sometimes loose use of the word
255
Robinson's articles that directly informed this chapter are 'A Statistical Approach to the Spam Problem' (published in Linux Journal and available at http://www.linuxjournal.com/article.php?sid=6467 and in a shorter form on Robinson's blog at http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html) and 'Why Chi? Motivations for the Use of Fisher's Inverse Chi-Square Procedure in Spam Classification' (available at http://garyrob.blogs.com/whychi93.pdf). Another article that may be useful is 'Handling Redundancy in Email Token Probabilities' (available at http://garyrob.blogs.com//handlingtokenredundancy94.pdf). The archived mailing lists of the SpamBayes project (http://spambayes.sourceforge.net/) also contain a lot of useful information about different algorithms and approaches to testing spam filters.
256
Techniques that combine nonindependent probabilities as though they were, in fact, independent, are called