Nuggets® and Data Mining

White Paper

Michael Gilman, Ph.D.

Data Mining Technologies Inc.

2 Split Rock

Melville, NY 11747

631 692-4400

e-mail mgilman@data-mine.com

May 2000

Management Overview

Today’s business environment is more competitive than ever. The difference between survival and defeat often rests on a thin edge of higher efficiency than the competition. This advantage is often the result of better information technology providing the basis for improved business decisions. The problem of how to make such business decisions is therefore crucial. But how is this to be done? One answer is through the better analysis of data.

Some estimates hold that the amount of information in the world doubles every twenty years. Undoubtedly the volume of computer data increases at a much faster rate. In 1989 the total number of databases in the world was estimated at five million, most of which were small dBase files. Today the automation of business transactions produces a deluge of data because even simple transactions like telephone calls, shopping trips, medical tests and consumer product warranty registrations are recorded in a computer. Scientific databases are also growing rapidly. NASA, for example, has more data than it can analyze. The human genome project will store thousands of bytes for each of the several billion genetic bases. The 1990 US census data of over a billion bytes contains an untold quantity of hidden patterns that describe the lifestyles of the population.

How can we explore this mountain of raw data? Most of it will never be seen by human eyes and even if viewed could not be analyzed by “hand.” Computers provide the obvious answer.

The computer method we should use to process the data then becomes the issue. Although simple statistical methods were developed long ago, they are not as powerful as a new class of “intelligent” analytical tools collectively called data mining methods.

Data mining is a new methodology for improving the quality and effectiveness of the business and scientific decision making process. It complements, and can often replace, other business decision assistance tools, such as statistical analysis, computer reporting and querying. Data mining can achieve a high return on investment by exploiting one of an enterprise’s most valuable and often overlooked assets: DATA!

Byte Magazine reported that some companies have reaped returns on investment of as much as 1,000 times their initial investment on a single project. More and more companies are realizing that the massive amounts of data that they have been collecting over the years can be their key to success. With the proliferation of data warehouses, this data can be mined to uncover the hidden nuggets of knowledge. Simply put, data mining tools are fast becoming a business necessity. The Gartner group has predicted that data mining will be one of the five hottest technologies in the early years of the new century.

There are currently several data mining techniques available. This white paper will discuss the leading ones and present an exciting and powerful new data mining method, Nuggets®, that uses breakthrough technology, combining proprietary algorithms and genetic methodology, to offer significant benefits over other methods.

Management Summary - The Bottom Line

Many of the standard analytical tools that do not use data mining have powerful capabilities for performing sophisticated user-driven queries. They are, however, limited in their ability to discover trends and complex patterns in a database because the user must “think up” a hypothesis and then test it. Relevant and important hypotheses may not be obvious, or may lie in patterns obscured within the data.

Data mining tools, however, analyze data by automatically formulating hypotheses about data. The problem that often confronts researchers new to the field is that there are a variety of data mining techniques available—which one to choose? All these tools give you answers. Some are more difficult to use than others, and they differ in other, superficial ways, but most importantly, the underlying algorithms used differ and the nature of these algorithms is directly related to the quality of the results obtained.

This paper will discuss the strengths and weaknesses of some methods available today, define data mining with some examples and explain the benefits of Nuggets® vis-à-vis the alternatives. It will describe how Nuggets® gives you the power to extract knowledge from your data that is unavailable with other approaches.

What is Data Mining?

Definition

The objective of data mining is to extract valuable information from your data, to discover the “hidden gold.” This gold is the valuable information in that data. Small changes in strategy, provided by data mining’s discovery process, can translate into a difference of millions of dollars to the bottom line. With the proliferation of data warehouses, data mining tools are fast becoming a business necessity. An important point to remember, however, is that you do not need a data warehouse to successfully use data mining—all you need is data.

Many traditional reporting and query tools and statistical analysis systems use the term “data mining” in their product descriptions. Exotic Artificial Intelligence-based systems are also being touted as new data mining tools. This leads to the question, “What is a data mining tool and what isn't?” The ultimate objective of data mining is knowledge discovery. Data mining methodology extracts predictive information from databases. With such a broad definition, however, an on-line analytical processing (OLAP) product or a statistical package could qualify as a data mining tool, so we must narrow the definition. To be a true knowledge discovery method, a data mining tool should unearth information automatically. By this definition data mining is data-driven, whereas traditional statistical, reporting and query tools are user-driven.

User Driven Analysis

Query Generators

Traditionally the goal of identifying and utilizing information hidden in data has proceeded via query generators and data interpretation systems. A user formulates a theory with a hypothesis and queries the database to test the validity of this hypothesis.

For example, a user might hypothesize about the relationship between industrial sales of color copiers and customers' specific industries. The user would generate a query against the data and segment the results into a report. Typically, the generated information provides a good overview.

This verification type of analysis is limited in at least three ways, however. First, it's usually based on a hunch. In our example, the hunch is that the industry in which a potential customer operates correlates with the number of copiers it buys or leases. Second, the quality of the extracted information depends on the user's interpretation of the results—and is thus subject to error. Third, the user’s ability to hypothesize is usually limited to two or three variables at best. There may, however, be many relationships that exist among more than two or three variables that the user will not search for.

Statistics

Statistical methods have long been used to extract information from data. Multifactor analyses of variance and multivariate analyses include statistical methods that could identify the relationships among factors that influence the outcome of copier sales, for example. Pearson product‑moment correlations measure the strength and direction of the relationship between each database field and the dependent variable.
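As an illustration of the Pearson product-moment correlation mentioned above, here is a minimal sketch in Python. The function name `pearson_r` and the copier figures are hypothetical and chosen only for the example; they do not come from any real data set.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: employees per customer and copiers bought by that customer
employees = [10, 20, 30, 40, 50]
copiers = [1, 2, 3, 4, 5]
r = pearson_r(employees, copiers)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship between one field and the dependent variable. The measure says nothing about relationships that are non-linear or that involve several fields at once.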

One of the problems with these approaches is that the techniques tend to focus on tasks in which all the attributes have continuous or ordinal values. Many of the techniques are also parametric, that is, they assume a particular probability distribution of the variables. Many methods also assume that a relationship is expressible as a linear combination of the attribute values. Statistical methodology also very often assumes normally distributed data, a sometimes tenuous supposition in the real world. These assumptions are not usually verified in practice and therefore the results are questionable.

Nuggets® True Data Mining

What is True Data Mining?

The generation of a query stems from the need to know certain facts, such as regional sales reports stratified by type of business; data mining projects stem from the need to discover more general information such as the factors that influence these sales. One way to identify a true data mining tool is by how it operates on the data: is it manual (top‑down) or automatic (bottom‑up)? In other words, does the user or the software originate the query?

Neural networks and decision tree methods qualify as true automatic data mining tools because they autonomously interrogate the data for patterns.

Data mining tools offer great potential for corporate data warehouses since they discover rather than confirm trends or patterns in data.

Decision tree tools are symbolic classifiers, also known as rule-induction programs or decision‑tree generators. They use statistical algorithms or machine‑learning algorithms such as ID3, C4.5, AC2, CART, CHAID, CN2, or modifications of these algorithms. Symbolic classifiers split a database into classes that differ as much as possible in their relation to a selected output. That is, the tool partitions a database according to the results of statistical tests often directed by the user.

How Data Mining Works

Data mining includes several steps: problem analysis, data extraction, data cleansing, rules development, output analysis and review. Data mining sources are typically flat files extracted from on-line sets of files, from data warehouses or from other data sources. Data may, however, be derived from almost any source. Whatever the source of data, data mining will often be an iterative process involving these steps.

Rule Generators

Some data mining tools generate their findings in the format of "if then" rules. The results are thus more understandable to the decision makers. Here's an example of the data mining rules that Nuggets® might discover for a project to target potential product buyers.

Rule 1.

IF CUSTOMER SINCE = 1978 through 1994

AND REVOLVING LIMIT = 5120 through 8900

AND CREDIT/DEBIT RATIO = 67

THEN Potential Buyer = Yes with a confidence factor of 89%

Rule 2.

IF CUSTOMER SINCE = 1994 through 1996

AND REVOLVING LIMIT = 1311 through 5120

AND CREDIT/DEBIT RATIO = 67

THEN Potential Buyer = Yes with a confidence factor of 49%
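To show how rules of this form might be applied to a new record, here is a minimal Python sketch. It is not the Nuggets® implementation. The `Rule` class, the field names, the `best_rule` helper and the choice to treat each range as inclusive at both ends are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict   # field name -> (low, high), assumed inclusive at both ends
    outcome: str       # the THEN part of the rule
    confidence: float  # confidence factor as a fraction

    def matches(self, record):
        """True when every IF condition holds for the record."""
        return all(low <= record[field] <= high
                   for field, (low, high) in self.conditions.items())

# The two example rules, transcribed with hypothetical field names
rules = [
    Rule({"customer_since": (1978, 1994),
          "revolving_limit": (5120, 8900),
          "credit_debit_ratio": (67, 67)}, "Potential Buyer = Yes", 0.89),
    Rule({"customer_since": (1994, 1996),
          "revolving_limit": (1311, 5120),
          "credit_debit_ratio": (67, 67)}, "Potential Buyer = Yes", 0.49),
]

def best_rule(rules, record):
    """Return the matching rule with the highest confidence, or None."""
    matching = [rule for rule in rules if rule.matches(record)]
    return max(matching, key=lambda rule: rule.confidence, default=None)

# A hypothetical prospect record
prospect = {"customer_since": 1985, "revolving_limit": 6000, "credit_debit_ratio": 67}
hit = best_rule(rules, prospect)
print(hit.outcome, hit.confidence)
```

Because each rule is a readable set of conditions, a decision maker can see exactly which conditions produced a given prediction.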

Data Analysis Methods - An Overview

The following represents a discussion of some of the most popular methods used to extract information from data.

Non-Data Mining Methods

Query Tools

Most of these tools come with graphical components. Some support a degree of multi-dimensionality such as crosstab reporting, time series analysis, drill down, slice and dice and pivoting.

Pros

These tools are sometimes a good adjunct to data mining tools in that they allow the analyst an opportunity to get a feel for the data. They can help to determine the quality of the data and which variables might be relevant for a data mining project to follow. They are useful to further explore the results supplied by true data mining tools.

Cons

Simply put -- you must formulate the questions specifically. What are the sales in the northeast region by salesperson and product? If a person’s income is between $50K and $100K what is the probability they will respond to our mailing? What percentage of patients will have nausea if they take penicillin and if they also take a beta blocker drug?

This approach works well if you have the time to investigate the large number of questions that may be involved, which you almost never will. For example, a data mining problem with 200 variables where each variable can have up to 200 values has 200²⁰⁰, or about 1.6 x 10⁴⁶⁰, possible combinations of values. This number is so large that all the computers on earth operating for the rest of the life of our galaxy could not explore these possibilities. Nuggets®, however, while not exploring them all explicitly, does examine them implicitly through intelligent search methods that avoid the insignificant ones.
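The size of this search space can be checked with a few lines of Python. The sketch below assumes that every one of the 200 variables takes exactly 200 values, so the number of combinations is 200 raised to the power 200.

```python
from math import log10

variables = 200
values_per_variable = 200

# Work in logarithms to express the count in scientific notation
log_total = variables * log10(values_per_variable)
exponent = int(log_total)                # power of ten
mantissa = 10 ** (log_total - exponent)  # leading digits

print(f"about {mantissa:.1f} x 10^{exponent} combinations")

# Python integers are unbounded, so the exact digit count can also be checked
digits = len(str(values_per_variable ** variables))
```

Both calculations give a number with 461 digits, far beyond what any exhaustive search could enumerate.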

Querying, therefore, is most effective when the investigation is limited to a relatively small number of “known” questions.

Statistics

There are a variety of statistical methods used in data mining projects. Again these are not true data mining tools (see the discussion of querying methods immediately above). Statistical tools are widely used in science and industry and provide excellent features for describing and visualizing large chunks of data.

Some of the methods commonly used are regression analysis, correlation, CHAID analysis, hypothesis testing, and discriminant analysis.

Pros

Statistical analysis is often a good ‘first step’ in understanding data. These methods deal well with numerical data where the underlying probability distributions of the data are known. They are not as good with nominal data such as “good”, “better”, “best” or “Europe”, “North America”, “Asia” or “South America”.

Cons

Statistical methods require statistical expertise, or a project person well versed in statistics who is heavily involved. Such methods require difficult-to-verify statistical assumptions and do not deal well with non-numerical data. They suffer from the “black box aversion syndrome”. This means that non-technical decision makers, those who will either accept or reject the results of the study, are often unwilling to make important decisions based on a technology that gives them answers but does not explain how it got the answers. To tell a non-statistician CEO that she or he must make a crucial business decision because of a favorable R value statistic is not usually well received. Using a true data mining tool such as Nuggets®, the CEO can be told exactly how the conclusion was arrived at.

Another problem is that statistical methods are valid only if certain assumptions about the data are met. Some of these assumptions are: linear relationships between pairs of variables, non-multicollinearity, normal probability distributions, independence of samples. If you do not validate these assumptions because of time limitations or are not familiar with them, your analysis may be faulty and therefore your results may not be valid. Even if you know about them you may not have the time or information to verify the assumptions.

Data Mining Methods

Neural Nets

This is a popular technology, particularly in the financial community. These mathematical models were originally developed in the 1960’s to model biological nervous systems in an attempt to mimic thought processes.

Pros

These models may have potential in applications where there is intense human sensory processing such as speech recognition and vision. The end result of a Neural Net project is a mathematical model of the process. They deal well with numerical attributes but not as well with nominal data. Some people feel they are equivalent in certain aspects to regression analysis.

Cons

There is still much controversy regarding the efficacy of Neural Nets. One major objection to the method is that the development of a Neural Net model is partly an art and partly a science in that the results often depend on the individual who built the model. That is, the model form (called the network topology), and hence the results, may differ from one researcher to another for the same data. A frequent problem is “overfitting”, which results in good prediction of the data used to build the model but bad results with new data. The final results may also depend on the initial settings of weights, which are usually guesses.

The “black box syndrome” also applies here to an even greater extent than in statistics because the underlying technology is not as well accepted and has not been in existence for as long.

Decision Trees

A decision tree is a technique for partitioning a training file into a set of rules. A decision tree consists of nodes and branches. The starting node is called the root node. Depending upon the results of a test, the training file is partitioned into two or more subsets. The end result is a set of rules covering all possibilities.

Pros

Fairly fast with certain algorithms. Results are rules stated in English.

Cons

By far the most important negative for decision trees is that they are forced to make decisions along the way based on limited information that implicitly leaves out of consideration the vast majority of potential rules in the training file. This approach may leave valuable rules undiscovered since decisions made early in the process will preclude some good rules from being discovered later.
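A small constructed example can make this limitation concrete. The Python sketch below uses a hypothetical training file in which the outcome depends on two attributes jointly. A splitter that evaluates one attribute at a time and stops when no single attribute improves on chance would find nothing here, yet a two-attribute rule holds every time. Real decision tree algorithms differ in their split and stopping criteria, so this is an illustration of the general concern and not a description of any particular product.

```python
from itertools import product

# Hypothetical training file: outcome is "yes" only when exactly one of a, b is 1
records = [
    {"a": a, "b": b, "outcome": "yes" if a != b else "no"}
    for a, b in product([0, 1], repeat=2)
    for _ in range(25)
]

def confidence(records, conditions, outcome="yes"):
    """Fraction of records meeting all conditions that have the given outcome."""
    matching = [r for r in records
                if all(r[field] == value for field, value in conditions.items())]
    return sum(r["outcome"] == outcome for r in matching) / len(matching)

# Every single-attribute test looks no better than a coin flip
single = {(field, value): confidence(records, {field: value})
          for field in ("a", "b") for value in (0, 1)}

# The two-attribute rule is always right
combined = confidence(records, {"a": 1, "b": 0})

print(single)
print(combined)
```

Each one-attribute test has a confidence of 50%, while the rule IF a = 1 AND b = 0 THEN outcome = yes has a confidence of 100%.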

Nuggets® - True Rule Induction

Nuggets® uses proprietary search algorithms to develop English “if - then” rules. These algorithms use genetic methods and learning techniques to “intelligently” search for valid hypotheses that become rules. In the act of searching, the algorithms “learn” about the training data as they proceed. The result is a very fast and efficient search strategy that does not preclude any potential rule from being found. The new and proprietary aspects include the way in which hypotheses are created and the searching methods. The criteria for valid rules are set by the user.

Nuggets® also provides a suite of tools to use the rules for prediction of new data, understanding, classifying and segmenting data. The user can also query the rules or the data to perform special studies.

Pros

This method is fast and efficient in finding patterns. It can generate rules with many different dependent variables simultaneously, or the user can direct the system to search for rules of a specific type. Tools are provided that allow you to use the rules to predict a file of new data, predict a single record from a file (useful when prospective data is constantly being updated and needs to be predicted frequently), query rules and data, and segment data for market research. Nuggets® handles highly non-linear relationships and noisy or incomplete data. It currently runs on Windows 98, Windows 2000 and Windows NT, although data can be imported from other platforms.

Cons

Does not run directly on mainframes, but mainframe data can be imported and run on client PCs.

What Nuggets® is Not

Nuggets® is not a statistical tool. It does not rely on statistical assumptions such as independence, linear relationships, absence of multicollinearity, normality, etc. It finds rules for which a set of independent variables is correlated with a result. This non-statistical notion of correlation simply means that given the ‘IF’ condition, the ‘THEN’ condition occurs a given percentage of the time.

For example suppose we develop the following rule:

IF Credit Rating = Good AND Bank Balance = over $10,000 AND Employed = Yes THEN Successful Loan = Yes, with a confidence factor of 87%

This means that, of the examples in the training file that satisfied the ‘IF’ condition, 87% turned out to be successful loans. Thus the predictor variables, in this case credit rating, employment and bank balance, were correlated (i.e. associated) with a successful loan. Notice that Nuggets® is not suggesting a cause and effect relationship. A bank balance of over $10,000 is probably not the cause of the loan being good. It is merely associated with it in combination with the other factors as stated by the rule.
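The confidence factor described here is a simple proportion, which can be sketched in Python as follows. The training records are fabricated purely for the illustration, and the field names and the `confidence_factor` helper are hypothetical; the sketch does not show how Nuggets® searches for rules.

```python
def confidence_factor(records, if_condition, then_condition):
    """Return (confidence, count) for the records that satisfy the IF condition."""
    matching = [r for r in records if if_condition(r)]
    if not matching:
        return None, 0
    hits = sum(1 for r in matching if then_condition(r))
    return hits / len(matching), len(matching)

def make(rating, balance, employed, success):
    return {"credit_rating": rating, "bank_balance": balance,
            "employed": employed, "successful_loan": success}

# Made-up training file: 100 records meet the IF condition, 87 of them successful
training = (
    [make("Good", 15000, "Yes", "Yes")] * 87
    + [make("Good", 15000, "Yes", "No")] * 13
    + [make("Poor", 2000, "No", "No")] * 50
)

conf, count = confidence_factor(
    training,
    lambda r: (r["credit_rating"] == "Good"
               and r["bank_balance"] > 10000
               and r["employed"] == "Yes"),
    lambda r: r["successful_loan"] == "Yes",
)
print(f"{conf:.0%} of {count} matching records")
```

The 50 records that do not satisfy the ‘IF’ condition play no part in the calculation; the confidence factor describes only the cases the rule covers.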

Stages in a Nuggets® Data Mining Operation

A most important point to note is that with any data mining effort it is helpful if the user possesses good domain knowledge about the business or scientific aspects of the data mining effort. The following steps should be undertaken before the data are presented to the model.

Problem Analysis

A data mining effort must begin with a mission statement that defines the kind of results required. While this is true for all types of research, for a data mining project the mission statement can be much less structured than those usually needed.

Here is a market research example: Suppose we are interested in determining which type of cross promotion to run when an item is on sale in a supermarket. The analysis can tell us what other items are likely to be purchased with the item in question.

Data Extraction

Potential sources of data should be explored before meaningful analysis can take place. Analysts can use existing sources of data or acquire new data for the analysis. Developing a sound model often involves combining a number of data sources (for example, mailing lists, marketing data, census data, company sales records, and so forth). Often these files are part of a relational system, but Nuggets® requires a single flat file. This requires the use of a “pre-processing” extraction step to create the input file from the user data.

For our marketing sample, we would probably start with marketing data, such as that compiled at a supermarket cash register from shopper card records.

Data Cleansing

Often historical databases contain noisy or missing data. Some data mining methods are more sensitive to these factors than others. Nuggets® handles noisy and/or missing data well. If possible, the training file should be reviewed for such problems. If these errors are not discovered at this stage, they may contribute to lower quality results. Nuggets® provides a data dictionary that helps you to find erroneous data.

Rules Development

The data mining tool takes the training file and examines it for the underlying patterns. Nuggets® allows you to define the number of times the potential rule must occur before it is considered valid, and the percentage of the records that must display the pattern. You thus have control over the acceptable validity (i.e. confidence factor) for the case under study.

In our marketing sample, we might want to look at, say, which other items people buy when they buy pretzels. We could set the confidence factor (the proportion of time the rule must be true in the training file) to 75%, and the number of records to which the rule must apply to twenty. Thus, for the rule IF buys pretzels = “Yes” THEN buys item = “Yes” to be considered valid, there would have to be at least twenty instances of pretzels being bought in the data, and at least 15 of them would have to include the other item as well.
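The two thresholds in this example can be sketched as a small validity check in Python. The basket data, the item name "beer" and both function names are hypothetical; only the thresholds of twenty records and 75% come from the example above.

```python
def rule_counts(baskets, if_item, then_item):
    """Count baskets containing if_item, and those that also contain then_item."""
    with_if = [basket for basket in baskets if if_item in basket]
    with_both = [basket for basket in with_if if then_item in basket]
    return len(with_if), len(with_both)

def is_valid_rule(n_if, n_both, min_records=20, min_confidence=0.75):
    """A rule is valid when it covers enough records and holds often enough."""
    return n_if >= min_records and n_if > 0 and n_both / n_if >= min_confidence

# Made-up shopper-card data: 20 baskets with pretzels, 15 of which also hold beer
baskets = (
    [{"pretzels", "beer"}] * 15
    + [{"pretzels"}] * 5
    + [{"beer"}] * 10
)

n_if, n_both = rule_counts(baskets, "pretzels", "beer")
print(n_if, n_both, is_valid_rule(n_if, n_both))
```

With twenty pretzel baskets and fifteen that also contain the other item, the rule just meets both thresholds. One fewer matching basket in either count would cause it to be rejected.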

Output Analysis

Once the rules have been developed, they can be analyzed. Nuggets® orders the rules by their confidence factors in a report.

How Does Nuggets® Work?

Nuggets® is a data mining system for PC users that puts the power of a complete data mining environment on everyone’s desktop. It uses powerful new rule induction methodology to make explicit the relationships in both numeric and non-numeric information.

Nuggets® is automatic. This means that Nuggets® finds rules automatically without need for further interaction unless the user desires it.

Nuggets® then uses the rule library it has built to forecast expected results from new information, based on the “experience” contained in your existing database. The new information the user provides is called a “prospect” file.

How Nuggets® Can Help Your Organization

Features

· Power to extract knowledge from data that other methods cannot

· Automatic rule generation in English “if-then” rules

· Ability to handle complex non-linear relationships

· Handles missing data

· Handles noisy data

· Assists in finding data errors

· Provides predictions for new data

· Allows powerful querying of rules or data

· Fast rule generation with new algorithms

· User friendly, intuitive interface

· Provides validation module

· Reverse engineers information implicit in databases

Areas of Potential Application

The following list includes only a few of the possible applications.

Business

· Banking -- mortgage approval, loan underwriting, fraud analysis and detection

· Finance -- analysis and forecasting of business performance, stock and bond analysis

· Insurance -- bankruptcy prediction, risk analysis, credit and collection models

· Web Marketing -- targeted banner ads and cross selling opportunities

· Direct Marketing -- market research, product success prediction

· Market Research -- media selection, broadcasting analysis, product segmentation

· Maintenance -- forecasting vehicle and equipment maintenance needs

Manufacturing

· Fault analysis, quality control, preventive maintenance scheduling, automated systems

Medicine

· Epidemiological studies, toxicology, diagnosis, drug interactions, risk factor analysis, quality control, retrospective drug studies

Scientific Research

· General modeling of all types

Technical Information

Nuggets® is a true 32-bit system that runs on Windows 9X, 2000 or NT. It is multi-user and parallel processing enabled, and it predicts, forecasts, generalizes and validates.
