17-05-2012, 11:15 AM
DATA mining technology
4.documentation.doc (Size: 5.5 MB / Downloads: 56)
Introduction
DATA mining technology is now used in a wide variety of fields. Applications include the analysis of customer transaction records, web site logs, credit card purchase information, call records, to name a few. The interesting results of data mining can provide useful information such as customer behavior for business managers and researchers. One of the most important data mining applications is association rule mining.
We only need to focus on the methods of finding the frequent itemsets in the database. The Apriori approach was the first to address this issue. Apriori finds frequent itemsets by scanning a database to check the frequencies of candidate itemsets, which are generated by merging frequent subitemsets. However, Apriori-based algorithms have undergone bottlenecks because they have too many candidate itemsets.
Apriori-based hardware schemes require loading the candidate itemsets and the database into the hardware. Since the capacity of the hardware is fixed, if the number of items in the database is larger than the hardware capacity, the data items must be loaded separately. Therefore, the process of comparing candidate itemsets with the database needs to be executed several times. Similarly, if the number of candidate itemsets is larger than the capacity of the hardware, the pattern matching procedure has to be separated into many rounds. Clearly, it is infeasible for any hardware design to load the candidate itemsets and the database into hardware for multiple times. Since the time complexity of those steps that need to load candidate itemsets or database items into the hardware is in proportion to the number of candidate itemsets and the number of items in the database, this procedure is very time consuming. In addition, numerous candidate itemsets and a huge database may cause a bottleneck in the system.
MOTIVATION:
We only need to focus on the methods of finding the frequent itemsets in the database. The Apriori approach was the first to address this issue. Apriori-based hardware schemes require loading the candidate itemsets and the database into the hardware. Since the capacity of the hardware is fixed, if the number of items in the database is larger than the hardware capacity, the data items must be loaded separately.
Therefore, the process of comparing candidate itemsets with the database needs to be executed several times. Clearly, it is infeasible for any hardware design to load the candidate itemsets and the database into hardware for multiple times. Since the time complexity of those steps that need to load candidate itemsets or database items into the hardware is in proportion to the number of candidate itemsets and the number of items in the database, this procedure is very time consuming. In addition, numerous candidate itemsets and a huge database may cause a bottleneck in the system.
We propose a HAsh-based and PiPelIned (abbreviated as HAPPI) architecture for hardware-enhanced association rule mining. Our System solves the bottleneck problem in a priori-based hardware scheme with objectives:
• Applying Pipelining methodology in the HAPPI architecture.
• Comparing item sets with the database and collect useful information.
• Reducing the number of candidate item sets and items in the database simultaneously.
• Reduce the frequency of loading the database into the hardware.
BACKGROUND:
Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on database containing transactions. Apriori finds frequent itemsets by scanning a database to check the frequencies of candidate itemsets, which are generated by merging frequent sub itemsets. Apriori uses to count candidate item sets efficiently. Apriori-based algorithms have undergone bottlenecks because they have too many candidate itemsets. So we can’t reduce the frequency of loading the database into hardware.
CONTRIBUTION OF THE HESIS
The thesis implements the integration of a simple computational structure for data mining onto the hard drive controller itself. The data mining implemented here is not Apriori, but rather the problem of exact and inexact string matching, a much more computationally regular problem compared to the Apriori algorithm. However, the work is useful, and will become more so as FPGA performance scales up and significantly exceeds the data supply capabilities of hierarchical memory systems. We base our comparisons of hardware performance versus various state-of-the-art software implementations as we are unaware of any comparable hardware implementation of the Apriori algorithm. Extensive research exists on parallelizing correlation algorithms, but we focus on single-processor machine performance. The array architecture provides a performance improvement of orders of magnitude over the state-of-the-art software implementations. The system is easily scalable and introduces an efficient “systolic injection” method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling. However, the support operation still requires two orders of magnitude more time than the other stages. We focus on addressing the support problem in this paper. The architecture we develop in this paper is entirely distinct from the work in.
OBJECTIVE OF THE THESIS
The main objective of this thesis is, it can effectively reduce the frequency of loading the database into the hardware. The time complexity of those steps that need to load candidate itemsets or database items into the hardware is also reduced through the implementation of a HAsh-based and PiPelIned (abbreviated as HAPPI) architecture for hardware enhanced association rule mining. Therefore, one can effectively reduce the frequency of loading the database into the hardware.