There have been numerous claims that data-mining techniques have been used successfully in counter-terrorism intelligence analyses, but little has surfaced to support these claims. The idea is that by analyzing the characteristics and profiles of known terrorists, it should be feasible to predict who in a sample population might also be a terrorist. This is a good example of the potential pitfalls in applying such analytical techniques to practical problems: this type of profiling generates hypotheses, for which there may be good substantiation, but a hypothesis is not a proof. The risk is that overly zealous law enforcement personnel, again highly motivated for good reasons, overreact to an individual who fits the profile but is not a terrorist. There is enough evidence in the media, albeit sensationalized, to suggest this is a real risk. Only careful investigation can prove whether the possibility is a probability. The degree to which a data-mining process supports the business goals or scientific objectives of data exploration is much more important than the algorithms and data-mining tools it uses.
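The profiling pitfall described above can be illustrated with a short calculation. The sketch below is not taken from any real system. The prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen only for illustration, and the function name is our own. It applies Bayes' rule to estimate what fraction of individuals flagged by a profile would actually belong to the target group.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability that a flagged individual truly belongs to the target group.

    prevalence          -- fraction of the population in the target group
    sensitivity         -- P(flagged | in target group)
    false_positive_rate -- P(flagged | not in target group)
    """
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical values, assumed purely for illustration:
# 1 person in 100,000 is in the target group, the profile catches 99%
# of them, and it wrongly flags 1% of everyone else.
ppv = positive_predictive_value(
    prevalence=1e-5,
    sensitivity=0.99,
    false_positive_rate=0.01,
)
print(f"Share of flagged individuals who are true matches: {ppv:.4%}")
```

Under these assumed numbers, well under 1 percent of flagged individuals are true matches, even though the profile looks accurate on paper. This is why a profile match should be treated as a hypothesis to investigate and not as a conclusion.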
1.7 ORGANIZATION OF THIS BOOK
After introducing the basic concepts of data mining in Chapter 1, the rest of the book follows the basic phases of a data-mining process. Chapters 2 and 3 explain the common characteristics of large, raw data sets and the typical techniques of data preprocessing. The text emphasizes the importance and influence of these initial phases on the final success and quality of data-mining results. Chapter 2 provides basic techniques for transforming raw data, including data sets with missing values and with time-dependent attributes. Outlier analysis, a set of important techniques for preprocessing messy data, is also explained in this chapter. Chapter 3 deals with the reduction of large data sets and introduces efficient methods for reducing features, values, and cases. Once the data set is preprocessed and prepared for mining, a wide spectrum of data-mining techniques is available, and the selection of a technique or techniques depends on the type of application and the data characteristics. In Chapter 4, before introducing particular data-mining methods, we present the general theoretical background and formalizations applicable to all mining techniques. The essentials of the theory can be summarized with the question: How can one learn from data? The emphasis in Chapter 4 is on statistical learning theory and the different types of learning methods and learning tasks that may be derived from the theory. Problems of evaluation and deployment of developed models are also discussed in this chapter.
Chapters 5 to 11 give an overview of common classes of data-mining techniques. Predictive methods are described in Chapters 5 to 8, while descriptive data mining is covered in Chapters 9 to 11. Selected statistical inference methods are presented in Chapter 5, including the Bayesian classifier, predictive and logistic regression, analysis of variance (ANOVA), and log-linear models. Chapter 6 summarizes the basic characteristics of the C4.5 algorithm as a representative of logic-based techniques for classification problems. Basic characteristics of the Classification and Regression Trees (CART) approach are also introduced and compared with the C4.5 methodology. Chapter 7 discusses the basic components of artificial neural networks and introduces two classes, multilayer perceptrons and competitive networks, as illustrative representatives of neural-network technology. Practical applications of data-mining technology show that the use of several models in predictive data mining increases the quality of results. This approach is called ensemble learning, and its basic principles are given in Chapter 8.
Chapter 9 explains the complexity of clustering problems and introduces agglomerative, partitional, and incremental clustering techniques. Different aspects of local modeling in large data sets are addressed in Chapter 10, and common techniques of association-rule mining are presented. Web mining and text mining are becoming central topics for many researchers, and the new algorithms resulting from this work are summarized in Chapter 11. A number of new topics and recent trends in data mining have been emphasized in the last seven years. Some of these topics, such as graph mining, and temporal, spatial, and distributed data mining, are covered in Chapter 12. Important legal restrictions and guidelines, and security and privacy aspects of data-mining applications, are also introduced in this chapter. Most of the techniques explained in Chapters 13 and 14, about genetic algorithms and fuzzy systems, are not directly applicable in mining large data sets. Recent advances in the field show that these technologies, derived from soft computing, are becoming more important in better representing and computing data as they are combined with other techniques. Finally, Chapter 15 recognizes the importance of data-mining visualization techniques, especially those for representation of large-dimensional samples.
It is our hope that we have succeeded in producing an informative and readable text supplemented with relevant examples and illustrations. All chapters in the book have a set of review problems and reading lists. The author is preparing a solutions manual for instructors who might use the book for undergraduate or graduate classes. For an in-depth understanding of the various topics covered in this book, we recommend to the reader a fairly comprehensive list of references, given at the end of each chapter. Although most of these references are from various journals, magazines, and conference and workshop proceedings, it is obvious that, as data mining is becoming a more mature field, there are many more books available, covering different aspects of data mining and knowledge discovery. Finally, the book has two appendices with useful background information for practical applications of data-mining technology. In Appendix A we provide an overview of the most influential journals, conferences, forums, and blogs, as well as a