18-07-2012, 10:04 AM
SEMINAR ON MIME
MIME.doc (Size: 1.31 MB / Downloads: 39)
INTRODUCTION
Data mining is an inherently iterative process; the results of one analysis often lead to new questions, requiring more analysis. In an ideal world, this process is streamlined. That is, data mining is not only iterative, but also interactive: the user can give such feedback immediately, and easily browse the results. In traditional pattern mining, however, algorithms typically produce large amounts of patterns, many of which are not interesting to the user, and the results are typically only given in a flat text file, making it hard to analyze the results. By instead providing an iterative and interactive process, the user would be able to explore and refine the discovered patterns on the fly. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DESCRIPTION OF THE SYSTEM
We consider transactional (supermarket) databases D where each transaction contains a number of items. A pattern is an itemset (a group of items {A,B,C} that occur together)or a rule (two groups of items {D,E} ! {F,G, H} where the presence of the first group implies the presence of the second group with a given confidence). The support s of a pattern is the number of transactions in the database in which the pattern is present. The frequency f is the supports relative to the size of the dataset, denoted as |D|. In this setting, frequent itemset mining is defined as the process of finding all patterns in a database that have a frequency fp higher than or equal to a user-specified threshold fu. Most pattern mining techniques produce an amount of output that due to size is difficult to post-process. One could try to reduce the number of results by making the quality thresholds more strict. Unfortunately this does not guarantee the usefulness of the produced patterns. To this end more is required. MIME combines knowledge of the expert and computational strength of a computer to increase the probability of finding interesting patterns. This is achieved in a visual framework where the expert can create his own pattern collection. In order to evaluate the created/mined patterns.
Worksheets and comparing miners
Our tool allows for multiple worksheets. Worksheets can be based on a dataset that is already in use, in which case rules can be copied or moved from one sheet to another. It is also possible to use one of the mining algorithms to construct a collection of patterns, select some of the patterns and copy these to a new worksheet such that we do not lose the original mining results. When using different worksheets, we can also use the global overview and global measures to compare the results of different mining sessions of the same algorithm, or to compare the results of different mining algorithms. For the first comparison we create 4 different workspaces using the Abstracts database, and run the Apriori miner by Borgelt using 4 different frequency levels (40, 30, 20 and 10%). Figure 2(d) shows the global overview of this setting. Using global information we can adapt the results of the mining algorithms to improve these values. For the second comparison we can for instance compare the outcome of 4 different tile-based methods with similar thresholds, this way the user can see which miner gives better results for a certain dataset, using information provided by the global overview as reference.
Worksheet colors and basket colors
MIME supports graphical and textual (numerical) values for showing ranks. The numerical Values can be used to obtain the worksheet coloring shown in Figure 2b. As is the case with the ranking of the items in the source dock, patterns can obtain a coloring scheme depending on the interestingness measure and the other rules in the worksheet. This way the user gets a quick understanding of which measures are similar for patterns without knowing the exact ranks. Items inside a basket of a pattern can also obtain a coloring scheme based on their personal influence on the pattern. The ranking of the pattern is computed when omitting the specified item and the rank is compared to the current ranking. If the current ranking is better, the item gets a positive coloring (green), and if the current ranking is worse, the item gets a negative coloring (red). If the item does not have an influence it gets a neutral (white) coloring.
RELATEDWORK
A lot of work has been done comparing and evaluating different objective interestingness measures [8, 11]. The most important outcome is that there is no single measure that
can be used for all purposes, and even worse, for some purposes, there exists no measure, and only subjective and semantic criteria based on experience and background knowledge can produce actionable results. In our tool we have incorporated several objective interestingness measures, but it is the combination of user knowledge and objective measures enabling subjective interestingness criteria to be applied. Also in the context of Inductive Databases several interactive constraint-based mining frameworks have been studied [7]. Here, the user typically has the ability to specify all kinds of constraints that are being used during the mining process. In these systems, the visualization and adaptation of the results has not been a major concern. Moreover, our framework could be built on top of such an inductive database implementation [3]. Most similar to the system presented here is the framework for mining decision trees proposed by Ankerst et al. [1]. The user and computer work together in this system such that in each step either the user or the computer can make the decision for a new split. The computer also provides extra computational power by showing the best split, lookahead information, purity measures, etc. Our approach is similar, but applies to frequent itemsets and association rules, instead of decision trees. Some methods for visualizing patterns exist [4, 5]. These visualizers present the output of mining algorithms in a compact and graphical format, and allow to further filter the output using queries. They do not provide means to mine the database interactively using subjective criteria and also do not allow to further explore existing patterns.
CONCLUSIONS AND FUTUREWORK
MIME is a framework for the interactive mining, exploration, and post-processing of patterns, and allows for easy comparison of different algorithms and pattern collections. It makes the user part of the mining process and so allows for creating patterns and adapting patterns by different mining algorithms and quality measures, as well as personal knowledge and interest. Post processing algorithms (clustering algorithms for instance) can be applied to the created collection or the user can create hierarchies himself. The tool also contains a plugin system that allows extension of the tool with existing software. There are still many options for extending MIME that we consider important future work. First, iterative mining algorithms can be incorporated in this system, allowing a speed-up of mining sessions with different quality constraints. Second, caching and pre-computation can be used in this framework by keeping track of user-interactions that are performed most and creating an interaction hierarchy. The system learns about user-interactions and can then decide to pre-compute information. Third, support for mining of decision trees and episodes is in progress. At last, we can also make more use of any-time algorithms, for instance when computing the Best Pattern Extension.