list of commercially and publicly available data-mining tools, while Appendix B presents a number of commercially successful data-mining applications.

The reader should have some knowledge of the basic concepts and terminology associated with data structures and databases. Some background in elementary statistics and machine learning may also be useful, but it is not strictly required: the concepts and techniques discussed in the book can be applied without deeper knowledge of the underlying theory.

1.8 REVIEW QUESTIONS AND PROBLEMS

1. Explain why it is not possible to analyze some large data sets using classical modeling techniques.

2. Do you recognize in your business or academic environment any problems whose solution can be obtained through classification, regression, or deviation detection? Give examples and explain.

3. Explain the differences between statistical and machine-learning approaches to the analysis of large data sets.

4. Why are preprocessing and dimensionality reduction important phases in successful data-mining applications?

5. Give examples of data where the time component is recognized explicitly, and of other data where the time component is implicit in the way the data are organized.

6. Why is it important that the data miner understand data well?

7. Give examples of structured, semi-structured, and unstructured data from everyday situations.

8. Can a set with 50,000 samples be called a large data set? Explain your answer.

9. Enumerate the tasks that a data warehouse may solve as a part of the data-mining process.

10. Many authors classify OLAP tools as standard data-mining tools. Give the arguments for and against this classification.

11. Churn is a concept originating in the telephone industry. How can the same concept apply to banking or to human resources?

12. Describe the concept of actionable information.

13. Go to the Internet and find a data-mining application. Report the decision problem involved, the type of input available, and the value contributed to the organization that used it.

14. Determine whether or not each of the following activities is a data-mining task. Discuss your answer.

(a) Dividing the customers of a company according to their age and sex.

(b) Classifying the customers of a company according to the level of their debt.

(c) Analyzing the total sales of a company in the next month based on current-month sales.

(d) Classifying a student database based on a department and sorting based on student identification numbers.

(e) Determining the influence of the number of new University of Louisville students on the stock market value.

(f) Estimating the future stock price of a company using historical records.

(g) Monitoring the heart rate of a patient with abnormalities.

(h) Monitoring seismic waves for earthquake activities.

(i) Extracting frequencies of a sound wave.

(j) Predicting the outcome of tossing a pair of dice.

1.9 REFERENCES FOR FURTHER STUDY

Berson, A., S. Smith, K. Thearling, Building Data Mining Applications for CRM, McGraw-Hill, New York, 2000.

The book is written primarily for the business community, explaining the competitive advantage of data-mining technology. It bridges the gap between understanding this vital technology and implementing it to meet a corporation’s specific needs. Basic phases in a data-mining process are explained through real-world examples.

Han, J., M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, San Francisco, CA, 2006.

This book gives a sound understanding of data-mining principles. The primary orientation of the book is for database practitioners and professionals, with emphasis on OLAP and data warehousing. In-depth analysis of association rules and clustering algorithms is an additional strength of the book. All algorithms are presented in easily understood pseudo-code, and they are suitable for use in real-world, large-scale data-mining projects, including advanced applications such as Web mining and text mining.

Hand, D., H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.

The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data-mining algorithms and their applications. The second section, data-mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The third section shows how all of the preceding analyses fit together when applied to real-world data-mining problems.

Olson, D., Y. Shi, Introduction to Business Data Mining, McGraw-Hill, Englewood Cliffs, NJ, 2007.

Introduction to Business Data Mining was developed to introduce students, as opposed to professional practitioners or engineering students, to the fundamental concepts of data mining. Most importantly, this text shows readers how to gather and analyze large sets of data to gain useful business understanding. The authors’ team has had extensive experience with the quantitative analysis of business as well as with data-mining analysis. They have both taught this material and used their own graduate students to prepare the text’s data-mining reports. Using real-world vignettes and their extensive knowledge of this new subject, David Olson and Yong Shi have created a text that demonstrates data-mining processes and techniques needed for business applications.

Westphal, C., T. Blaxton, Data Mining Solutions: Methods and Tools for Solving Real-World Problems, John Wiley, New York, 1998.

This introductory book gives a refreshing “out-of-the-box” approach to data mining that will help the reader to maximize time and problem-solving resources, and prepare for the next wave of data-mining visualization techniques. Extensive coverage of data-mining software tools is valuable to readers who are planning to set up their own data-mining environment.

2

PREPARING THE DATA

Chapter Objectives

Analyze basic representations and characteristics of raw and large data sets.

Apply different normalization techniques on numerical attributes.

Recognize different techniques for data preparation, including attribute transformation.

Compare different methods for elimination of missing values.

Construct a method for uniform representation of time-dependent data.

Compare different techniques for outlier detection.

Implement some data preprocessing techniques.

2.1 REPRESENTATION OF RAW DATA

Data samples introduced as rows in Figure 1.4 are basic components in a data-mining process. Every sample is described with several features, and there are different types of values for every feature. We will start with the two most common types: numeric and categorical. Numeric values include real-value variables or integer variables such as age, speed, or length. A feature with numeric values has two important properties: its values have an order relation (2 < 5 and 5 < 7) and a distance relation (d(2.3, 4.2) = 1.9).
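The two properties of numeric features can be illustrated with a minimal Python sketch. The function name `distance` and the use of the absolute difference as the distance measure are assumptions made for illustration only; the text itself gives just the example values.

```python
def distance(a: float, b: float) -> float:
    """Assumed distance relation for a numeric feature: absolute difference."""
    return abs(a - b)


# Order relation: numeric values can be compared, and therefore sorted.
order_holds = (2 < 5) and (5 < 7)
ordered = sorted([7, 2, 5])

# Distance relation: d(2.3, 4.2) is 1.9, up to floating-point rounding.
d = round(distance(2.3, 4.2), 10)

print(order_holds, ordered, d)
```

Both relations are used directly by later numeric preprocessing steps such as normalization, which is why the distinction between numeric and categorical features matters early in data preparation.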

In contrast, categorical (often called symbolic) variables have neither of
