a core requirement is understanding, coordination, and successful cooperation between all team members. The best results in data mining are achieved when data-mining experts combine their experience with that of organizational domain experts. While neither group needs to be fully proficient in the other’s field, it is certainly beneficial for each to have a basic background in the other’s area of focus.

Introducing a data-mining application into an organization is essentially not very different from any other software application project, and the following conditions have to be satisfied:

1. There must be a well-defined problem.

2. The data must be available.

3. The data must be relevant, adequate, and clean.

4. The problem should not be solvable by means of ordinary query or OLAP tools only.

5. The results must be actionable.

A number of data-mining projects have failed in past years because one or more of these criteria were not met.

The initial phase of a data-mining process is essential from a business perspective. It focuses on understanding the project objectives and business requirements, and then converting this knowledge into a data-mining problem definition and a preliminary plan designed to achieve the objectives. The first objective of the data miner is to understand thoroughly, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The data miner’s goal is to uncover important factors at the beginning that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions. Data-mining projects do not fail because of poor or inaccurate tools or models. The most common pitfalls in data mining involve a lack of training, overlooking the importance of a thorough pre-project assessment, not employing the guidance of a data-mining expert, and not developing a strategic project definition adapted to what is essentially a discovery process. A lack of competent assessment, environmental preparation, and resulting strategy is precisely why the vast majority of data-mining projects fail.

The model of a data-mining process should help to plan, work through, and reduce the cost of any given project by detailing the procedures to be performed in each phase. The model should provide a complete description of all phases, from problem specification to deployment of the results. Initially the team has to answer the key question: What is the ultimate purpose of mining these data, and more specifically, what are the business goals? The key to success in data mining is a precise formulation of the problem the team is trying to solve. A focused statement usually results in the best payoff. Knowledge of an organization’s needs or scientific research objectives will guide the team in formulating the goal of a data-mining process. The prerequisite to knowledge discovery is understanding the data and the business. Without this deep understanding, no algorithm, regardless of sophistication, is going to provide results in which a final user should have confidence. Without this background a data miner will not be able to identify the problems to be solved, or even to interpret the results correctly. To make the best use of data mining, we must make a clear statement of project objectives. An effective statement of the problem will include a way of measuring the results of a knowledge discovery project. It may also include details about a cost justification. Preparatory steps in a data-mining process may also include analysis and specification of the type of data-mining task, and selection of an appropriate methodology and corresponding algorithms and tools. When selecting a data-mining product, we have to be aware that different products generally have different implementations of a particular algorithm, even when they identify it by the same name. Implementation differences can affect operational characteristics such as memory usage and data storage, as well as performance characteristics such as speed and accuracy.

The data-understanding phase starts early in the project, and it includes important and time-consuming activities that can have an enormous influence on the final success of the project. “Get familiar with the data” is a phrase that calls for serious analysis of the data, including the source of the data, owner, organization responsible for maintaining the data, cost (if purchased), storage organization, size in records and attributes, size in bytes, security requirements, restrictions on use, and privacy requirements. The data miner should also identify data-quality problems and discover first insights into the data, such as data types, definitions of attributes, units of measure, list or range of values, collection information, time and space characteristics, and missing and invalid data. Finally, we should detect interesting subsets of data in these preliminary analyses to form hypotheses about hidden information. An important characteristic of a data-mining process is the relative time spent on each of its steps, and the proportions, presented in Figure 1.6, are counterintuitive. Some authors estimate that about 20% of the effort is spent on business objective determination, about 60% on data preparation and understanding, and only about 10% on data mining and analysis.

Figure 1.6. Effort in data-mining process.
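
The first-pass checks named in the data-understanding phase (data types, range or list of values, and missing or invalid data) can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function name profile, the set of markers treated as missing, and the sample customer records are all hypothetical and do not come from any particular data-mining tool.

```python
from collections import Counter

# Markers treated as "missing" in this sketch. This is an assumption;
# real data sources use many different conventions.
MISSING = {None, "", "NA", "?"}


def profile(records):
    """Summarize each attribute of a list of dict records.

    For every attribute, report the number of missing values, the
    observed Python types, and either the numeric range or the list
    of distinct values.
    """
    attributes = sorted({name for row in records for name in row})
    summary = {}
    for name in attributes:
        # An attribute absent from a record counts as missing.
        values = [row.get(name) for row in records]
        present = [v for v in values if v not in MISSING]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        info = {
            "missing": len(values) - len(present),
            "types": Counter(type(v).__name__ for v in present),
        }
        if present and len(numeric) == len(present):
            info["range"] = (min(numeric), max(numeric))
        else:
            info["values"] = sorted(set(map(str, present)))
        summary[name] = info
    return summary


# Hypothetical sample data, invented purely for illustration.
records = [
    {"customer_id": 1, "age": 34, "region": "north"},
    {"customer_id": 2, "age": None, "region": "south"},
    {"customer_id": 3, "age": 51, "region": "?"},
    {"customer_id": 4, "age": 29},
]

report = profile(records)
for name, info in report.items():
    print(name, info)
```

A summary of this kind does not replace the questions about ownership, cost, or privacy listed above; it only makes the data-quality part of the checklist concrete enough to repeat on every new data source.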

Technical literature reports only on successful data-mining applications. To increase our understanding of data-mining techniques and their limitations, it is crucial to analyze not only successful but also unsuccessful applications. Failures or dead ends also provide valuable input for data-mining research and applications. We have to underscore the intensive conflicts that have arisen between practitioners of “digital discovery” and classical, experience-driven human analysts objecting to these intrusions into their hallowed turf. One good case study is that of U.S. economist Orley Ashenfelter, who used data-mining techniques to analyze the quality of French Bordeaux wines. Specifically, he sought to relate auction prices to certain local annual weather conditions, in particular, rainfall and summer temperatures. His finding was that hot and dry years produced the wines most valued by buyers. Ashenfelter’s work and analytical methodology resulted in a deluge of hostile invective from established wine-tasting experts and writers. There was a fear of losing
