23-04-2012, 12:42 PM
ASSOCIATION RULE MINING & SQL BASED APPROACHES: EVALUATION AND ANALYSIS
Introduction:
Rapid growth in storage capacity, together with falling storage costs and increasing computing power, has made it feasible for organizations to store and process unprecedented amounts of organizational data. These organizations, though sitting on a gold mine of data, have not yet been able to fully capitalize on its value. Typically, the data captures business trends over a period of time, but the nuggets of useful knowledge hidden in it are not easy to discern. To compete effectively in today’s market, decision makers need to identify and use the information buried in the collected data and take advantage of high-return opportunities in a timely fashion. Given a database of sufficient size and quality, data mining technology can generate new business opportunities by providing better insight into the business, based on the collected information. The key here is the generation of previously unknown knowledge from huge datasets.
The mining process is driven by the required outcome: a specific data mining technique is chosen based on what we want to learn. Data mining has become an important area of research because of its ability to extract valuable information from data. With the recent emergence of data mining as a field, there is a need to build practical data mining systems that can obtain valuable information from large databases. Data mining techniques need to be integrated with traditional database systems so that they can be used alongside related applications such as OLAP and data warehousing. This paper presents a few approaches to integrating data mining techniques with relational databases.
The different data mining techniques and their outcomes are briefly discussed below:
Classification: This is the process of assigning items to classes defined by a classifying attribute. A model is built from the values of the other attributes to assign each item to a particular class. A training dataset, in which the class of each item is already known, is typically used for building and tuning the model. The classification technique may be used, for example, to identify the most probable consumers for a product, based on their spending patterns.
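As a rough illustration of the idea, the sketch below builds a very simple model (one mean vector per class, i.e. a nearest-centroid classifier) from a labelled training set and uses it to classify a new customer. The technique, the attribute choice (monthly spend, purchases per month) and all data values are assumptions made for this example only; the paper does not prescribe a particular classifier.

```python
# Minimal nearest-centroid classifier (illustrative only; data and names are made up).
from statistics import mean

def train(rows, labels):
    """Build the model: one centroid (mean attribute vector) per class."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return {label: tuple(mean(col) for col in zip(*members))
            for label, members in by_class.items()}

def classify(model, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: dist(model[label]))

# Hypothetical training set: (monthly spend, purchases per month) with known classes.
training_rows = [(520, 9), (480, 8), (610, 11), (60, 1), (90, 2), (40, 1)]
training_labels = ["likely", "likely", "likely", "unlikely", "unlikely", "unlikely"]

model = train(training_rows, training_labels)
prediction = classify(model, (450, 7))  # a new customer with unknown class
print(prediction)
```

In practice a separate set of labelled items, not used for training, would be held back to check how well the model classifies unseen data.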
Clustering: Clustering tries to group the data set so that data points in the same cluster are similar to one another, while data points in different clusters are dissimilar. A similarity measure needs to be defined, and the quality of the outcome depends to a large extent on how appropriate that measure is for the data set or the application domain. Clustering can be used, for example, to divide a market into distinct groups so that each group can be targeted with a different strategy.
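A minimal sketch of one well-known clustering method, k-means, is given below. It assumes squared Euclidean distance as the (dis)similarity measure and takes the first k points as the initial centroids; both choices, and the sample points, are assumptions for illustration, and a different measure may suit a different domain better.

```python
# Minimal k-means sketch (illustrative only; data and initialisation are assumptions).

def sq_dist(p, q):
    """Squared Euclidean distance, used here as the dissimilarity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    centroids = list(points[:k])  # naive deterministic initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers described by two numeric attributes.
customers = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.2, 8.7)]
centroids, clusters = kmeans(customers, k=2)
print(clusters)
```

No class labels are supplied anywhere in this sketch; the groups emerge only from the distances between the points.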
The basic difference between classification and clustering is that in classification the classes are known in advance (supervised learning), while clustering does not assume any prior knowledge of the clusters (unsupervised learning).
Prediction: Prediction estimates the values of continuous-valued attributes. The past history of the attributes is used to build the model. This technique is very commonly used to predict the sales of a product.
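One simple way to build such a model is to fit a straight line to the history by least squares and extrapolate it. The sketch below assumes a linear trend and uses made-up monthly sales figures; it is only one of many possible prediction models.

```python
# Least-squares line fit for a sales forecast (illustrative only; data is made up).

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100.0, 110.0, 120.0, 130.0, 140.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(forecast)
```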
Deviation analysis: This technique compares current data with previously defined normal values to detect anomalies. Deviation analysis tools may be useful in security systems, where they can warn the authorities of any sharp deviation in a particular user's resource usage.
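A minimal sketch of this comparison is shown below. It assumes the "normal values" are summarised by the mean and standard deviation of a baseline sample, and flags a reading that lies more than a chosen number of standard deviations from the mean. The threshold of 3 and the usage figures are assumptions for illustration.

```python
# Simple deviation check against a baseline (illustrative only; data is made up).
from statistics import mean, stdev

def is_anomalous(baseline, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard deviations
    away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A constant baseline: any different value counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold

# Hypothetical daily resource usage of one user under normal conditions.
baseline_usage = [48, 52, 50, 47, 53, 51, 49]

print(is_anomalous(baseline_usage, 120))  # sharp jump in usage
print(is_anomalous(baseline_usage, 52))   # within the normal range
```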
Literature Review:
Association rule mining is one of the most researched areas of data mining and has recently received much attention from the database community. Association rules have proven to be quite useful in the marketing and retail communities as well as in other, more diverse fields. In this section we provide an overview of earlier research on association rule mining integrated with an RDBMS.
SQL Approach for Candidate Generation
For the SQL formulation, the database is represented as a relation with two attributes: Tid and item. Multiple tuples of this transaction relation represent the items associated with a single transaction. Candidate and frequent itemsets are represented as relations containing a set of attributes, each representing an item. In the kth pass, the set of candidate itemsets Ck is generated from the frequent itemsets Fk-1 (generated in the (k-1)th pass) as shown below:
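As a concrete illustration of this representation (a sketch, not the original listing), the Python snippet below uses the standard sqlite3 module to build a small transaction relation and to run the usual Apriori join step for k = 3. The table names T, F2 and C3, the column names and the sample data are assumptions for this example; F2 is supplied by hand rather than computed from T, and the subsequent prune step of Apriori is omitted.

```python
# Join step of candidate generation for k = 3, run through SQLite
# (illustrative only; table names and data are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Transaction relation: one (tid, item) tuple per item in a transaction.
cur.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
cur.executemany("INSERT INTO T VALUES (?, ?)", [
    (1, "bread"), (1, "butter"), (1, "milk"),
    (2, "bread"), (2, "butter"), (2, "milk"),
    (3, "bread"),
])

# Frequent 2-itemsets: one attribute per item, items in lexicographic order.
# Supplied by hand here; a real run would derive them in the previous pass.
cur.execute("CREATE TABLE F2 (item1 TEXT, item2 TEXT)")
cur.executemany("INSERT INTO F2 VALUES (?, ?)", [
    ("bread", "butter"), ("bread", "milk"), ("butter", "milk"),
])

# Join F2 with itself: two itemsets that agree on their first k-2 items and
# differ in the last one are merged into a single k-item candidate.
cur.execute("CREATE TABLE C3 (item1 TEXT, item2 TEXT, item3 TEXT)")
cur.execute("""
    INSERT INTO C3
    SELECT a.item1, a.item2, b.item2
    FROM F2 AS a, F2 AS b
    WHERE a.item1 = b.item1
      AND a.item2 < b.item2
""")

candidates = cur.execute(
    "SELECT item1, item2, item3 FROM C3 ORDER BY item1, item2, item3"
).fetchall()
print(candidates)
```

The `a.item2 < b.item2` condition keeps the items of each candidate in lexicographic order, so the same itemset is not generated twice in different orders.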