257

Several spam corpora, including the SpamAssassin corpus, are linked from http://nexp.cs.pdx.edu/~psam/cgi-bin/view/PSAM/CorpusSets.

258

If you wanted to conduct a test without disturbing the existing database, you could bind *feature-database*, *total-spams*, and *total-hams* with a LET, but then you'd have no way of looking at the database after the fact, unless you returned the values you used within the function.
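
A minimal sketch of that idea, assuming the three variables are special variables defined with DEFVAR and that the database is a hash table. The initial values, the function name call-with-scratch-database, and its thunk argument are hypothetical and chosen only for illustration:

```lisp
;; Assumed stand-ins for the real definitions in the spam-filter code.
(defvar *feature-database* (make-hash-table :test #'equal))
(defvar *total-spams* 0)
(defvar *total-hams* 0)

(defun call-with-scratch-database (thunk)
  "Call THUNK with fresh bindings of the three variables, then return
the scratch values so they can still be inspected afterward."
  (let ((*feature-database* (make-hash-table :test #'equal))
        (*total-spams* 0)
        (*total-hams* 0))
    (funcall thunk)
    ;; The LET bindings are undone on exit, so hand the values back.
    (values *feature-database* *total-spams* *total-hams*)))
```

Because the variables are special, any code called from the thunk sees the scratch bindings, and the global values are untouched once the function returns.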

259

This algorithm is named for the same Fisher who invented the method used for combining probabilities and for Frank Yates, his coauthor of the book Statistical Tables for Biological, Agricultural and Medical Research (Oliver & Boyd, 1938) in which, according to Knuth, they provided the first published description of the algorithm.
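
For reference, the Fisher-Yates shuffle can be sketched as follows. This is an illustrative version rather than a copy of the chapter's code, and the name shuffle-sketch is made up for this note:

```lisp
(defun shuffle-sketch (vector)
  "Destructively shuffle VECTOR in place and return it."
  (loop for i from (1- (length vector)) downto 1
        ;; Pick a position from 0 through I, inclusive.
        for j = (random (1+ i))
        ;; Swap the element at I with the randomly chosen one.
        do (rotatef (aref vector i) (aref vector j)))
  vector)
```

Each element is swapped with a randomly chosen element at or before its own position, working from the end of the vector toward the front.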

260

In ASCII, the first 32 characters are nonprinting control characters originally used to control the behavior of a Teletype machine, causing it to do such things as sound the bell, back up one character, move to a new line, and move the carriage to the beginning of the line. Of these 32 control characters, only three, the newline, carriage return, and horizontal tab, are typically found in text files.

261

Some binary file formats are in-memory data structures: on many operating systems it's possible to map a file into memory, and low-level languages such as C can then treat the region of memory containing the contents of the file just like any other memory; data written to that area of memory is saved to the underlying file when it's unmapped. However, these formats are platform-dependent, since the in-memory representation of even such simple data types as integers depends on the hardware on which the program is running. Thus, any file format that's intended to be portable must define a canonical representation for all the data types it uses that can be mapped to the actual in-memory data representation on a particular kind of machine or in a particular language.

262

The term big-endian and its opposite, little-endian, borrowed from Jonathan Swift's Gulliver's Travels, refer to the way a multibyte number is represented in an ordered sequence of bytes such as in memory or in a file. For instance, the number 43981, or abcd in hex, represented as a 16-bit quantity, consists of two bytes, ab and cd. It doesn't matter to a computer in what order these two bytes are stored as long as everybody agrees. Of course, whenever there's an arbitrary choice to be made between two equally good options, the one thing you can be sure of is that everybody is not going to agree. For more than you ever wanted to know about it, and to see where the terms big-endian and little-endian were first applied in this fashion, read 'On Holy Wars and a Plea for Peace' by Danny Cohen, available at http://khavrinen.lcs.mit.edu/wollman/ien-137.txt.
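
The two byte orders for the example number can be sketched like this. The function u2-bytes and its order keywords are hypothetical names used only for this illustration:

```lisp
(defun u2-bytes (n order)
  "Return the two bytes of the 16-bit integer N as a list, in storage order.
ORDER is :big-endian or :little-endian."
  (let ((high (ldb (byte 8 8) n))   ; most significant byte
        (low  (ldb (byte 8 0) n)))  ; least significant byte
    (ecase order
      (:big-endian (list high low))
      (:little-endian (list low high)))))

(u2-bytes #xabcd :big-endian)    ; => (171 205), that is, ab cd
(u2-bytes #xabcd :little-endian) ; => (205 171), that is, cd ab
```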

263

LDB and DPB, a related function, were named after the DEC PDP-10 assembly instructions that did essentially the same thing. Both functions operate on integers as if they were represented using two's-complement format, regardless of the internal representation used by a particular Common Lisp implementation.
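
A few expressions, chosen here purely as illustrations, show both functions and the two's-complement behavior:

```lisp
;; LDB extracts a byte: the 8 bits starting at bit 8 of #xabcd.
(ldb (byte 8 8) #xabcd)        ; => 171 (#xab)

;; DPB returns a new integer with those bits replaced.
(dpb #xff (byte 8 8) #xabcd)   ; => 65485 (#xffcd)

;; Negative integers are treated as two's complement, so the low
;; 8 bits of -1 are all ones.
(ldb (byte 8 0) -1)            ; => 255
```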

264

Common Lisp also provides functions for shifting and masking the bits of integers in a way that may be more familiar to C and Java programmers. For instance, you could write read-u2 yet a third way, using those functions, like this:

(defun read-u2 (in)
  (logior (ash (read-byte in) 8) (read-byte in)))

which would be roughly equivalent to this Java method:

public int readU2 (InputStream in) throws IOException {
  return (in.read() << 8) | (in.read());
}

The names LOGIOR and ASH are short for LOGical Inclusive OR and Arithmetic SHift. ASH shifts an integer a given number of bits to the left when its second argument is positive or to the right if the second argument is negative. LOGIOR combines integers by logically ORing each bit. Another function, LOGAND, performs a bitwise AND, which can be used to mask off certain bits. However, for the kinds of bit twiddling you'll need to do in this chapter and the next, LDB and BYTE will be both more convenient and more idiomatic Common Lisp style.
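
To compare the two styles, here is a small sketch that extracts eight bits at a given position both ways. The function names byte-at-with-mask and byte-at-with-ldb are hypothetical:

```lisp
(defun byte-at-with-mask (n position)
  "Extract the 8 bits of N starting at bit POSITION using ASH and LOGAND."
  ;; A negative second argument makes ASH shift to the right.
  (logand (ash n (- position)) #xff))

(defun byte-at-with-ldb (n position)
  "Extract the same 8 bits using LDB and BYTE."
  (ldb (byte 8 position) n))

(byte-at-with-mask #xabcd 8)   ; => 171
(byte-at-with-ldb #xabcd 8)    ; => 171
```

The shift-and-mask version mirrors what a C or Java programmer would write, while the LDB version states the size and position of the field directly.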

265

Originally, UTF-8 was designed to represent a 31-bit character code and used up to six bytes per code point.
