Painters: Big Ideas from the Computer Age (O'Reilly, 2004)

247

There has since been some disagreement over whether the technique Graham described was actually 'Bayesian.' However, the name has stuck and is well on its way to becoming a synonym for 'statistical' when talking about spam filters.

248

It would, however, be poor form to distribute a version of this application using a package starting with com.gigamonkeys since you don't control that domain.

249

A version of CL-PPCRE is included with the book's source code available from the book's Web site. Or you can download it from Weitz's site at http://www.weitz.de/cl-ppcre/.

250

The main reason to use PRINT-UNREADABLE-OBJECT is that it takes care of signaling the appropriate error if someone tries to print your object readably, such as with the ~S FORMAT directive.

251

PRINT-UNREADABLE-OBJECT also signals an error if it's used when the printer control variable *PRINT-READABLY* is true. Thus, a PRINT-OBJECT method consisting solely of a PRINT-UNREADABLE-OBJECT form will correctly implement the PRINT-OBJECT contract with regard to *PRINT-READABLY*.
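As a minimal sketch of the point made in the two notes above, the following defines a PRINT-OBJECT method consisting solely of a PRINT-UNREADABLE-OBJECT form. The class sample-feature, its slots, and the helper try-print-readably are hypothetical stand-ins for illustration, not the chapter's actual definitions.

```lisp
;; Hypothetical stand-in class, for illustration only.
(defclass sample-feature ()
  ((word :initarg :word)
   (spam-count :initarg :spam-count :initform 0)))

;; The whole method body is one PRINT-UNREADABLE-OBJECT form.
(defmethod print-object ((object sample-feature) stream)
  (print-unreadable-object (object stream :type t)
    (with-slots (word spam-count) object
      (format stream "~s :spams ~d" word spam-count))))

(defun try-print-readably (object)
  "Print OBJECT with *PRINT-READABLY* true; return :NOT-READABLE
if that signals PRINT-NOT-READABLE, otherwise the printed string."
  (let ((*print-readably* t))
    (handler-case (prin1-to-string object)
      (print-not-readable () :not-readable))))
```

Printed normally, an instance shows up in the unreadable #<...> syntax. Passing the same instance to try-print-readably should return :NOT-READABLE, because PRINT-UNREADABLE-OBJECT signals the error before the body runs.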

252

If you decide later that you do need to have different versions of increment-count for different classes, you can redefine increment-count as a generic function and this function as a method specialized on word-feature.
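A sketch of what that later redefinition might look like, assuming a pared-down word-feature class that holds only the two counters (a simplified stand-in, not the chapter's real class). Because DEFGENERIC is specified to signal an error when the name already belongs to an ordinary function, the sketch removes any existing definition with FMAKUNBOUND first.

```lisp
;; Assumed, simplified stand-in for the chapter's word-feature class.
(defclass word-feature ()
  ((ham-count :initform 0 :accessor ham-count)
   (spam-count :initform 0 :accessor spam-count)))

;; Drop any existing ordinary function definition of the same name.
(fmakunbound 'increment-count)

(defgeneric increment-count (feature type)
  (:documentation "Increment the count on FEATURE selected by TYPE."))

;; The old function body becomes a method specialized on word-feature.
(defmethod increment-count ((feature word-feature) type)
  (ecase type
    (ham (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))
```

Other feature classes could then supply their own methods without touching callers of increment-count.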

253

Technically, the key in each clause of a CASE or ECASE is interpreted as a list designator, an object that designates a list of objects. A single nonlist object, treated as a list designator, designates a list containing just that one object, while a list designates itself. Thus, each clause can have multiple keys; CASE and ECASE will select the clause whose list of keys contains the value of the key form. For example, if you wanted to make good a synonym for ham and bad a synonym for spam, you could write increment-count like this:

(defun increment-count (feature type)
  (ecase type
    ((ham good) (incf (ham-count feature)))
    ((spam bad) (incf (spam-count feature)))))
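To try the list-designator behavior without the rest of the chapter's code, here is a small self-contained sketch. The function normalize-type and the unsure key are hypothetical, introduced only for this example.

```lisp
(defun normalize-type (type)
  "Map TYPE or one of its synonyms to a canonical symbol."
  (ecase type
    ((ham good) 'ham)      ; a list of keys: matches either symbol
    ((spam bad) 'spam)
    (unsure 'unsure)))     ; a single nonlist key designates the list (unsure)
```

Calling (normalize-type 'good) returns HAM, while a value matching no clause, such as (normalize-type 'other), makes ECASE signal a TYPE-ERROR.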

254

Speaking of mathematical nuances, hard-core statisticians may be offended by the sometimes loose use of the word probability in this chapter. However, since even the pros, who are divided between the Bayesians and the frequentists, can't agree on what a probability is, I'm not going to worry about it. This is a book about programming, not statistics.

255

Robinson's articles that directly informed this chapter are 'A Statistical Approach to the Spam Problem' (published in the Linux Journal and available at http://www.linuxjournal.com/article.php?sid=6467 and in a shorter form on Robinson's blog at http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html) and 'Why Chi? Motivations for the Use of Fisher's Inverse Chi-Square Procedure in Spam Classification' (available at http://garyrob.blogs.com/whychi93.pdf). Another article that may be useful is 'Handling Redundancy in Email Token Probabilities' (available at http://garyrob.blogs.com//handlingtokenredundancy94.pdf). The archived mailing lists of the SpamBayes project (http://spambayes.sourceforge.net/) also contain a lot of useful information about different algorithms and approaches to testing spam filters.

256

Techniques that combine nonindependent probabilities as though they were, in fact, independent, are called naive Bayesian. Graham's original proposal was essentially a naive Bayesian classifier with some 'empirically derived' constant factors thrown in.
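As an illustration of treating the probabilities as independent, the sketch below combines per-word spam probabilities with the product rule from Graham's 'A Plan for Spam', assuming equal prior odds of spam and ham. The name naive-combine is made up for this example. It is shown only for contrast and is not the combining approach drawn from Robinson's articles.

```lisp
(defun naive-combine (probabilities)
  "Combine per-feature spam probabilities as if they were independent.
Assumes equal priors. Signals a division-by-zero error if the list
contains both 0 and 1, since both products are then zero."
  (let ((spam (reduce #'* probabilities))
        (ham (reduce #'* (mapcar (lambda (p) (- 1 p)) probabilities))))
    (/ spam (+ spam ham))))
```

With exact rationals, (naive-combine '(9/10 9/10)) evaluates to 81/82: two moderately spammy words push the combined score well above either one alone. That overconfidence is the cost of ignoring the dependence between the words.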
