Nuggets® and Data Mining
White Paper

Michael Gilman, Ph.D.
Data Mining Technologies Inc.
2 Split Rock, Melville, NY 11747
631 692-4400
e-mail: mgilman@data-mine.com
May 2000

Management Overview
Today’s business environment is more competitive than
ever. The difference between survival and defeat often rests on a thin
edge of higher efficiency than the competition. This advantage is often
the result of better information technology providing the basis for
improved business decisions. The problem of how to make such business
decisions is therefore crucial. But how is this to be done? One answer is
through the better analysis of data. Some estimates hold that the amount of information in
the world doubles every twenty years. Undoubtedly the volume of computer
data increases at a much faster rate. In 1989 the total number of
databases in the world was estimated at five million, most of which were
small dBase files. Today the automation of business transactions produces
a deluge of data because even simple transactions like telephone calls,
shopping trips, medical tests and consumer product warranty registrations
are recorded in a computer. Scientific databases are also growing rapidly.
NASA, for example, has more data than it can analyze. The human genome
project will store thousands of bytes for each of the several billion
genetic bases. The 1990 US census data of over a billion bytes contains an
untold quantity of hidden patterns that describe the lifestyles of the
population. How can we explore this mountain of raw data? Most of
it will never be seen by human eyes and even if viewed could not be
analyzed by “hand.” Computers provide the obvious answer. The computer method we should use to process the data
then becomes the issue.
Although simple statistical methods were developed long ago, they
are not as powerful as a new class of “intelligent” analytical tools
collectively called data mining methods. Data mining is a new methodology for improving the
quality and effectiveness of the business and scientific decision making
process. It complements, and can often replace, other business decision
assistance tools, such as statistical analysis, computer reporting and
querying. Data mining can
deliver high-return-on-investment decisions by exploiting one of an enterprise’s most valuable and often overlooked assets: DATA! Byte
Magazine reported that some companies have reaped returns on
investment of as much as 1,000 times their initial investment on a single
project. More and more
companies are realizing that the massive amounts of data that they have
been collecting over the years can be their key to success. With the
proliferation of data warehouses, this data can be mined to uncover the
hidden nuggets of knowledge. Simply put, data mining tools are fast
becoming a business necessity. The Gartner Group has predicted that data
mining will be one of the five hottest technologies in the early years of
the new century. There are currently several data mining techniques
available. This white paper
will discuss the leading ones and present an exciting and powerful new
data mining method, Nuggets®, which uses breakthrough technology combining proprietary algorithms and genetic methodology to offer significant benefits over other methods.

Management Summary - The Bottom Line
Many of the standard analytical tools that do not use
data mining have powerful capabilities for performing sophisticated
user-driven queries. They are, however, limited in their ability to
discover trends and complex patterns in a database because the user must
“think up” a hypothesis and then test it. Relevant and important hypotheses
may not be obvious or come from patterns obscured within the data.
Data mining tools, however, analyze data by
automatically formulating hypotheses about data. The problem that often confronts
researchers new to the field is that there are a variety of data mining
techniques available—which one to choose? All these tools give you
answers. Some are more
difficult to use than others, and they differ in other, superficial ways,
but most importantly, the underlying algorithms used differ and the nature
of these algorithms is directly related to the quality of the results
obtained. This paper will discuss the strengths and weaknesses of
some methods available today, define data mining with some examples and
explain the benefits of Nuggets® vis-à-vis the alternatives. It will
describe how Nuggets® gives you power to extract knowledge from your data
unavailable with other approaches.

What is Data Mining?

Definition
The objective of data mining is to extract valuable
information from your data, to discover the “hidden gold.” This gold is
the valuable information in that data. Small changes in strategy, provided
by data mining’s discovery process, can translate into a difference of
millions of dollars to the bottom line. With the proliferation of data
warehouses, data mining tools are fast becoming a business necessity. An
important point to remember, however, is that you do not need a data
warehouse to successfully use data mining—all you need is data.
Many traditional reporting and query tools and
statistical analysis systems use the term "data mining" in their product
descriptions. Exotic Artificial Intelligence-based systems are also being
touted as new data mining tools.
This leads to the question, “What is a data mining tool and what
isn't?” The ultimate objective of data mining is knowledge discovery. Data
mining methodology extracts predictive information from databases. With
such a broad definition, however, an on-line analytical processing (OLAP)
product or a statistical package could qualify as a data mining tool, so
we must narrow the definition. To be a true knowledge discovery method, a
data mining tool should unearth information automatically. By this definition
data mining is data-driven, whereas traditional statistical, reporting and query tools are user-driven.

User Driven Analysis

Query Generators
Traditionally the goal of identifying and utilizing
information hidden in data has proceeded via query generators and data
interpretation systems. A user formulates a theory with a hypothesis and
queries the database to test the validity of this
hypothesis. For example, a user might hypothesize about the
relationship between industrial sales of color copiers and customers'
specific industries. The user would generate a query against the data and
segment the results into a report. Typically, the generated information
provides a good overview.
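As a rough sketch of such a user-driven query (the table, the column names, and the figures below are invented for illustration), the analyst might run something like the following in Python and then read the resulting report:

    import pandas as pd

    # Hypothetical sales table: one row per industrial customer
    sales = pd.DataFrame({
        "industry":       ["legal", "legal", "medical", "printing", "printing"],
        "copiers_bought": [1,        2,        1,         6,          4],
    })

    # The user-driven query: average color copier purchases by industry
    report = sales.groupby("industry")["copiers_bought"].mean()
    print(report)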
This verification type of analysis is limited in at least three ways, however. First, it's usually based on a hunch. In our example, the hunch is that
the industry in which a potential customer operates correlates with the
number of copiers it buys or leases. Second, the quality of the extracted
information depends on the user's interpretation of the results—and is
thus subject to error. Third, the user’s ability to hypothesize is usually
limited to two or three variables at best. There may, however, be many
relationships that exist among more than two or three variables that the
user will not search for.

Statistics
Statistical methods have long been used to extract
information from data.
Multifactor analyses of variance and multivariate analyses include
statistical methods that could identify the relationships among factors
that influence the outcome of copier sales, for example. Pearson
product‑moment correlations measure the strength and direction of the
relationship between each database field and the dependent
variable.
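As a small illustration of the kind of statistic being described (the numbers below are made up), a Pearson correlation between one candidate field and the dependent variable can be computed directly:

    import numpy as np

    copiers_sold = np.array([ 3,  5,  9, 12, 15, 20])   # dependent variable
    sales_calls  = np.array([10, 12, 20, 25, 30, 41])   # one database field

    # Pearson product-moment correlation: +1 is a perfect positive relationship
    r = np.corrcoef(sales_calls, copiers_sold)[0, 1]
    print(f"Pearson r = {r:.2f}")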
One of the problems with these approaches is that the techniques tend to focus on tasks in which all the attributes have
continuous or ordinal values.
Many of the methods are also parametric; that is, they assume a
particular probability distribution of the variables. Many methods also
assume that a relationship is expressible as a linear combination of the
attribute values. Statistical methodology also very often assumes normally
distributed data—a sometimes tenuous supposition in the real world. These
assumptions are not usually verified in practice and therefore the results
are questionable.

Nuggets® True Data Mining

What is True Data Mining?
The generation of a query stems from the need to know
certain facts, such as regional sales reports stratified by type of
business; data mining projects stem from the need to discover more general
information such as the factors that influence these sales. One way to
identify a true data mining tool is by how it operates on the data: is it
manual (top‑down) or automatic (bottom‑up)? In other words, does the user or
the software originate the query? Neural networks and decision tree methods qualify as
true automatic data mining tools because they autonomously interrogate the
data for patterns. Data mining tools offer great potential for corporate
data warehouses since they discover rather than confirm trends or patterns in
data. Most of these symbolic classifiers are also known as
rule-induction programs or decision‑tree generators. They use statistical
algorithms or machine‑learning algorithms such as ID3, C4.5, AC2, CART,
CHAID, CN2, or modifications of these algorithms. Symbolic classifiers
split a database into classes that differ as much as possible in their
relation to a selected output. That is, the tool partitions a database
according to the results of statistical tests often directed by the user.
How Data Mining Works
Data mining includes several steps: problem analysis,
data extraction, data cleansing, rules development, output analysis and
review. Data mining sources
are typically flat files extracted from on-line sets of files, from data
warehouses, or other data sources. Data may, however, be derived from almost any source. Whatever the source of data, data mining will often be an iterative process involving these steps.

Rule Generators
Some data mining tools generate their findings in the
format of "if then" rules. The results are thus more understandable to the
decision makers. Here's an example of the data mining rules that Nuggets®
might discover for a project to target potential product
buyers.

Rule 1.
IF CUSTOMER SINCE = 1978 through 1994
AND REVOLVING LIMIT = 5120 through 8900
AND CREDIT/DEBIT RATIO = 67
THEN Potential Buyer = Yes, with a confidence factor of 89%

Rule 2.
IF CUSTOMER SINCE = 1994 through 1996
AND REVOLVING LIMIT = 1311 through 5120
AND CREDIT/DEBIT RATIO = 67
THEN Potential Buyer = Yes, with a confidence factor of 49%
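The sketch below shows one way such rules could be stored and then applied to score a new prospect record. It is only an illustration of the idea, not the internal representation used by Nuggets®; the field names and thresholds simply echo the sample rules above.

    # Each rule: a list of (field, low, high) conditions plus a confidence factor.
    rules = [
        {"conditions": [("customer_since", 1978, 1994),
                        ("revolving_limit", 5120, 8900),
                        ("credit_debit_ratio", 67, 67)],
         "prediction": "Potential Buyer = Yes", "confidence": 0.89},
        {"conditions": [("customer_since", 1994, 1996),
                        ("revolving_limit", 1311, 5120),
                        ("credit_debit_ratio", 67, 67)],
         "prediction": "Potential Buyer = Yes", "confidence": 0.49},
    ]

    def matching_rules(record, rules):
        """Return the rules whose every condition the record satisfies."""
        return [r for r in rules
                if all(low <= record[field] <= high
                       for field, low, high in r["conditions"])]

    prospect = {"customer_since": 1990, "revolving_limit": 6000, "credit_debit_ratio": 67}
    for rule in matching_rules(prospect, rules):
        print(rule["prediction"], f"(confidence {rule['confidence']:.0%})")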
Data Analysis Methods - An Overview
The
following represents a discussion of some of the most popular methods used
to extract information from data.

Non-Data Mining Methods

Query Tools

Most of these tools come with graphical components. Some support a degree of multi-dimensionality, such as crosstab reporting, time series analysis, drill down, slice and dice, and pivoting.

Pros

These tools are sometimes a good adjunct to data mining
tools in that they allow the analyst an opportunity to get a feel for the
data. They can help to determine the quality of the data and which
variables might be relevant for a data mining project to follow. They are
useful to further explore the results supplied by true data mining
tools.

Cons

Simply put, you must formulate the questions
specifically. What are the sales in the northeast region by salesperson
and product? If a person’s income is between $50K and $100K what is the
probability they will respond to our mailing? What percentage of patients
will have nausea if they take penicillin and if they also take a beta
blocker drug? This approach works well if you have the time to
investigate the large number of questions that may be involved, which you
almost never will. For
example, a data mining problem with 200 variables, where each variable can have up to 200 values, has 200^200, or about 1.6 x 10^460, possible combinations of values. This number is so large that all the computers on earth, operating for the rest of the life of our galaxy, could not explore these possibilities.
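The arithmetic behind that figure is easy to verify: 200 variables with up to 200 values each give 200^200 possible combinations, as the short calculation below confirms.

    import math

    # 200 variables, each with up to 200 possible values -> 200**200 combinations
    exponent = 200 * math.log10(200)                    # about 460.2
    mantissa = 10 ** (exponent - math.floor(exponent))  # about 1.6
    print(f"roughly {mantissa:.1f} x 10^{int(exponent)} combinations")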
Nuggets®, however, while not exploring them all explicitly, does examine them implicitly, through intelligent search methods that avoid the insignificant ones. Querying, therefore, is most effective when the
investigation is limited to a relatively small number of “known”
questions.

Statistics

There are a variety of statistical methods used in data
mining projects. Again these are not true data mining tools (see the
discussion of querying methods immediately above). Statistical tools are
widely used in science and industry and provide excellent features for
describing and visualizing large chunks of data. Some of the methods commonly used are regression
analysis, correlation, CHAID analysis, hypothesis testing, and discriminant analysis.

Pros

Statistical analysis is often a good ‘first step’ in
understanding data. These methods deal well with numerical data where the
underlying probability distributions of the data are known. They are not
as good with nominal data such as “good”, “better”, “best” or “Europe”, “North America”, “Asia” or “South America”.

Cons

Statistical methods require statistical expertise, or a
project person well versed in statistics who is heavily involved. Such methods require difficult-to-verify statistical assumptions and do not deal well with non-numerical data. They suffer from the “black box aversion syndrome”. This means that non-technical decision makers, those who will either accept or reject
the results of the study, are often unwilling to make important decisions
based on a technology that gives them answers but does not explain how it
got the answers. Telling a non-statistician CEO that she or he must make a crucial business decision because of a favorable R-value statistic is not usually well received. With a true data mining tool such as Nuggets®, the CEO can be told exactly how the conclusion was arrived at. Another problem is that statistical methods are valid
only if certain assumptions about the data are met. Some of these
assumptions are: linear relationships between pairs of variables,
non-multicollinearity, normal probability distributions, independence of
samples. If you do not validate these assumptions because of time
limitations or are not familiar with them, your analysis may be faulty and
therefore your results may not be valid. Even if you know about them you
may not have the time or information to verify the
assumptions.

Data Mining Methods

Neural Nets

This is a popular technology, particularly in the
financial community. These mathematical models were originally developed
in the 1960’s to model biological nervous systems in an attempt to mimic
thought processes.

Pros

These models may have potential in applications where
there is intense human sensory processing such as speech recognition and
vision. The end result of a Neural Net project is a mathematical model of
the process. They deal well with numerical attributes but not as well with
nominal data. Some people feel they are equivalent in certain aspects to
regression analysis.

Cons

There is still much controversy regarding the efficacy
of Neural Nets. One major objection to the method is that the development
of a Neural Net model is partly an art and partly a science in that the
results often depend on the individual who built the model. That is, the
model form (called the network topology) and hence the results, may differ
from one researcher to another for the same data. There is also the frequently occurring problem of “overfitting”, which results in good prediction on the data used to build the model but poor results on new data. The final
results may depend on the initial settings of weights that are usually
guesses. The “black box syndrome” also applies here to an even
greater extent than in statistics because the underlying technology is not
as well accepted and has not been in existence for as long.
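The overfitting problem mentioned above is easy to demonstrate with any over-flexible model. The toy example below uses a high-degree polynomial rather than a neural network, but the symptom is the same: a near-perfect fit on the training data and poor predictions on data held back for testing.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 12)
    y = 2 * x + rng.normal(scale=0.1, size=x.size)   # underlying pattern is linear

    train_x, test_x = x[:8], x[8:]                   # hold back the last four points
    train_y, test_y = y[:8], y[8:]

    for degree in (1, 7):                            # a sensible model vs. an over-flexible one
        fit = np.polyfit(train_x, train_y, degree)
        train_err = np.abs(np.polyval(fit, train_x) - train_y).mean()
        test_err  = np.abs(np.polyval(fit, test_x) - test_y).mean()
        print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")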
Decision Trees

A decision tree is a technique for partitioning a
training file into a set of rules.
A decision tree consists of nodes and branches. The starting node
is called the root node. Depending upon the results of a test at each node, the training file is partitioned into two or more subsets. The end result is a set
of rules covering all possibilities.
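A minimal sketch of the idea, using the open-source scikit-learn library (not any of the commercial tools discussed in this paper) and a made-up training file, is shown below; the printed tree reads directly as a set of if-then splits.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy training file: [revolving limit, credit/debit ratio] -> potential buyer?
    X = [[5120, 40], [8900, 67], [1311, 80], [2000, 30], [7500, 55], [1500, 90]]
    y = ["Yes",      "Yes",      "No",       "No",       "Yes",      "No"]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["REVOLVING LIMIT", "CREDIT/DEBIT RATIO"]))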
Pros

Fairly fast with certain algorithms. Results are rules stated in English.

Cons

By far the most important negative for decision trees
is that they are forced to make decisions along the way based on limited
information that implicitly leaves out of consideration the vast majority
of potential rules in the training file. This approach may leave valuable
rules undiscovered since decisions made early in the process will preclude
some good rules from being discovered later.

Nuggets® - True Rule Induction

Nuggets® uses proprietary search algorithms to develop
English “if-then” rules. These algorithms use genetic methods and
learning techniques to “intelligently” search for valid hypotheses that
become rules. In the act of searching, the algorithms “learn” about the
training data as they proceed. The result is a very fast and efficient
search strategy that does not preclude any potential rule from being
found. The new and proprietary aspects include the way in which hypotheses
are created and the searching methods. The criteria for valid rules are
set by the user. Nuggets® also provides a suite of tools to use the
rules for prediction of new data, understanding, classifying and
segmenting data. The user can also query the rules or the data to perform
special studies.

Pros

This method is fast and efficient in finding patterns.
It can generate rules with many different dependent variables
simultaneously, or the user can direct the system to search for rules of a specific type. Tools are provided that allow you to use the rules to
predict a file of new data, predict a single record from a file (useful
when prospective data is constantly being updated and needs to be
predicted frequently), query rules and data and segment data for market
research. Nuggets® handles highly non-linear relationships and noisy or
incomplete data. Nuggets® currently runs on Windows 98, Windows 2000 and Windows NT, although data can be imported from other platforms.

Cons

It does not run directly on mainframes, but data can be imported to run on client PCs.

What Nuggets® is Not
Nuggets® is not a statistical tool. It
does not use statistical assumptions such as independence, linear
relationships, multicollinearity, normality, etc. It finds rules for which a set of independent variables is correlated with a result. This
non-statistical notion of correlation simply means that given the ‘IF’
condition, the ‘THEN’ condition occurs a given percentage of the time.
For example, suppose we develop the following rule:

IF Credit Rating = Good
AND Bank Balance = over $10,000
AND Employed = Yes
THEN Successful Loan = Yes, with a confidence factor of 87%

This means that, of the training-file examples that satisfied the ‘IF’ condition, 87% turned out to be successful loans. Thus the predictor variables, in this case credit rating,
employment and bank balance, were correlated (i.e. associated) with a
successful loan.
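In code, the calculation behind such a confidence factor is nothing more than the fraction computed below. The records and field names are invented; this illustrates the definition, not Nuggets® itself.

    # Training records (hypothetical)
    records = [
        {"credit": "Good", "balance": 12000, "employed": True,  "loan_ok": True},
        {"credit": "Good", "balance": 15000, "employed": True,  "loan_ok": True},
        {"credit": "Good", "balance": 11000, "employed": True,  "loan_ok": False},
        {"credit": "Poor", "balance": 20000, "employed": True,  "loan_ok": False},
    ]

    # Records satisfying the IF part of the rule
    matches = [r for r in records
               if r["credit"] == "Good" and r["balance"] > 10000 and r["employed"]]

    # Confidence factor: share of those records for which the THEN part also holds
    confidence = sum(r["loan_ok"] for r in matches) / len(matches)
    print(f"confidence factor = {confidence:.0%}")   # 67% on this toy data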
Notice that Nuggets® is not suggesting a cause-and-effect relationship. A bank balance of over $10,000 is probably not the cause of the loan being good; it is merely associated with it, in combination with the other factors, as stated by the rule.
Stages in a Nuggets® Data Mining Operation
A most important point to note is that with any data mining effort, it is helpful if the user possesses good domain knowledge about the business or scientific aspects of the problem. The following steps should
be undertaken before the data are presented to the model.

Problem Analysis
A data mining effort must begin with a mission
statement that defines the kind of results required. While this is true for all types
of research, for a data mining project the mission statement can be much
less structured than is usually needed. Here is a market research example: suppose we are
interested in determining which type of cross promotion to run when an
item is on sale in a supermarket.
The analysis can tell us what other items are likely to be
purchased with the item in question.

Data Extraction
Potential sources of data should be explored before
meaningful analysis can take place.
Analysts can use existing sources of data or acquire new data for
the analysis. Developing a sound model often involves combining a number
of data sources, (for example, mailing lists, marketing data, census data,
company sales records, and so forth). Often these files are part of a
relational system, but Nuggets® requires a single flat file. This requires the use of a “pre-processing” extraction step to create the input file from the user data. For our marketing sample, we would probably start with
marketing data, such as that compiled at a supermarket cash register from
shopper card records.
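As a sketch of that pre-processing step (the tables and column names here are hypothetical), two relational tables might be joined and written out as the single flat input file like this:

    import pandas as pd

    shoppers  = pd.DataFrame({"card_id": [1, 2], "region": ["NE", "SW"]})
    purchases = pd.DataFrame({"card_id": [1, 1, 2],
                              "item": ["pretzels", "beer", "chips"]})

    # Join the relational tables into one flat file for the mining run
    flat = purchases.merge(shoppers, on="card_id", how="left")
    flat.to_csv("training_file.csv", index=False)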
Data Cleansing

Often historical databases contain noisy or missing
data. Some data mining methods are more sensitive to these factors than
others. Nuggets® handles noisy and/or missing data well. If possible, the
training file should be
reviewed for such problems.
If these errors are not discovered at this stage, they may
contribute to lower quality results. Nuggets® provides a data dictionary that helps you to find erroneous data.
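A generic first pass at such a review, independent of any particular product, can be as simple as counting missing values and flagging out-of-range entries (the column names and limits below are invented):

    import pandas as pd

    training = pd.DataFrame({
        "age":     [34, 51, None, 29, 240],            # 240 is clearly an entry error
        "balance": [1200.0, None, 560.0, 80.0, 95.0],
    })

    print(training.isna().sum())                                       # missing values per column
    print(training[(training["age"] < 0) | (training["age"] > 120)])   # out-of-range rows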
Rules Development
The data mining tool takes the training file and
examines it for the underlying patterns. Nuggets® allows you to define the
number of times the potential rule must occur before it is considered
valid, and the percentage of the records that must display the pattern.
You thus have control over the acceptable validity (i.e. confidence
factor) for the case under study. In our marketing sample, we might want to look at, say,
which other items people buy when they buy pretzels. We could set the confidence factor
(the proportion of time the rule must be true in the training file) to
75%, and the number of records to which the rule must apply to
twenty. Thus, for the rule “IF buys item = Yes THEN buys pretzels = Yes” to be considered valid, there would have to be at least twenty records in which the other item was bought, and in at least fifteen of those the shopper would have to have bought pretzels as well.
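The check itself is straightforward to state in code. The basket data below is invented, and the item “mustard” simply stands in for “the other item”:

    # 40 toy shopper baskets (10 copies of a 4-basket pattern)
    baskets = [{"pretzels", "mustard", "beer"}, {"cheese", "crackers"},
               {"pretzels", "mustard"}, {"beer", "chips"}] * 10

    covered = [b for b in baskets if "mustard" in b]   # records matching the IF part
    hits    = [b for b in covered if "pretzels" in b]  # ...that also match the THEN part

    support, confidence = len(covered), len(hits) / len(covered)
    verdict = "accepted" if support >= 20 and confidence >= 0.75 else "rejected"
    print(f"rule {verdict}: applies to {support} records, confidence {confidence:.0%}")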
Output Analysis

Once the rules have been developed, they can be
analyzed. Nuggets® orders the
rules by their confidence factors in a report.

How Does Nuggets® Work?
Nuggets® is a data mining
system for PC users that puts the power of a complete data mining
environment on everyone’s desktop.
It uses powerful new rule induction methodology to make explicit
the relationships in both numeric and non-numeric
information. Nuggets® is automatic. This means that Nuggets® finds
rules automatically without need for further interaction unless the user
desires it. Nuggets® then uses the rule library it has built to
forecast expected results from new information, based on the “experience”
contained in your existing database.
The new information the user provides is called a “prospect” file.
How Nuggets® Can Help Your Organization

Features

· Power to extract knowledge from data that other methods can not
· Automatic rule generation in English “if-then” rules
· Ability to handle complex non-linear relationships
· Handles missing data
· Handles noisy data
· Assists in finding data errors
· Provides predictions for new data
· Allows powerful querying of rules or data
· Fast rule generation with new algorithms
· User friendly, intuitive interface
· Provides validation module
· Reverse engineers information implicit in databases

Area of Potential Application
The following list includes only a few of the possible applications.

Business

· Banking -- mortgage approval, loan underwriting, fraud analysis and detection
· Finance -- analysis and forecasting of business performance, stock and bond analysis
· Insurance -- bankruptcy prediction, risk analysis, credit and collection models
· Web Marketing -- targeted banner ads and cross selling opportunities
· Direct Marketing -- market research, product success prediction
· Market Research -- media selection, broadcasting analysis, product segmentation
· Maintenance -- forecasting vehicle and equipment maintenance needs

Manufacturing

· Fault analysis, quality control, preventive maintenance scheduling, automated systems

Medicine

· Epidemiological studies, toxicology, diagnosis, drug interactions, risk factor analysis, quality control, retrospective drug studies

Scientific Research

· General modeling of all types

Technical Information
Nuggets® is a true 32-bit system that runs on Windows 9X, 2000, or NT. It is multi-user and parallel-processing enabled, and it predicts, forecasts, generalizes and validates.