ABSTRACT:
This paper proposes an improved KNN-GA algorithm with a special focus on ranking attributes by their value as an important component of effective classification. After a subset of higher-ranked attributes is selected, KNN is applied and the result is further optimized by GA. The algorithm concentrates on the attributes of prime importance for classification, which improves its performance. A data set may contain redundant and irrelevant attributes, in which case classification may produce less accurate results. Traditional KNN classification has three limitations: high calculation complexity, dependency on the training set, and no weight difference between samples. To overcome these difficulties, an improved KNN-GA algorithm with attribute ranking is proposed. Compared with the traditional algorithm, the proposed algorithm improves classification performance significantly.
INTRODUCTION:
K-nearest neighbours (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). KNN has been used as a non-parametric technique in statistical estimation and pattern recognition since the early 1970s. Genetic algorithms (GA) belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. Traditional KNN has the following problems:
1. High calculation complexity: To find the k nearest neighbour samples, the similarities between the query and all training samples must be calculated. When the number of training samples is small, the KNN classifier is no longer optimal, but if the training set contains a huge number of samples, the KNN classifier needs more time to calculate the similarities. This problem can be addressed in three ways: reducing the dimensionality of the feature space, using smaller data sets, or using an improved algorithm.
2. Dependency on the training set: The classifier is built from the training samples alone and uses no additional data. This makes the algorithm totally dependent on the training set, and it must be recalculated for even a small change in the training set.
3. No weight difference between samples: All training samples are treated equally; there is no distinction between samples with larger values and samples with lower values. The entire data set is considered every time distances are calculated, which reduces the performance of the classifier drastically.
Thus, in this paper we propose an improved KNN-GA algorithm with a special focus on ranking the attributes as an important component of effective classification. Only the attributes with higher values are considered for classification, and the less important attributes are omitted. This improves calculation time and gives better results.
LITERATURE SURVEY:
N. Suguna and K. Thanushkodi proposed combining the Genetic Algorithm (GA) with the k-Nearest Neighbor (KNN) algorithm, called Genetic KNN (GKNN), to overcome the limitations of traditional KNN [1]. M. Akhil Jabbar, B. L. Deekshatulu and Priti Chandra combined KNN and the genetic algorithm to improve the classification accuracy of a heart disease data set; they used genetic search as a goodness measure to remove redundant and irrelevant attributes and to rank the attributes that contribute more towards classification, removing the least-ranked attributes [2]. Sanwta Ram Dogiwal, Y. Singh Shishodi and Abhay Upadhyaya showed how the Gabor filter can be used for image segmentation to extract features [4].
METHODOLOGY AND DESIGN:
In the proposed algorithm, the dataset is loaded first. Once the data is loaded, GA is applied to the dataset and the features are ranked based on their value: features with a high value are ranked higher and features with less value are ranked lower. During classification, only the subset of higher-ranked attributes is considered by the classifier. KNN-GA is then applied to the selected subset and the accuracy of the classifier is calculated. A sketch of this pipeline is given below.
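The following is a minimal sketch of this pipeline (load data, rank features, keep the top-ranked subset, classify with KNN, report accuracy), assuming Python with scikit-learn is available. The GA-based ranking is stood in for here by a simple univariate score, and the built-in breast-cancer dataset is only a placeholder.

from sklearn.datasets import load_breast_cancer            # placeholder dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Rank the features and keep only the 10 highest-ranked ones
# (a univariate F-score stands in for the GA-based ranking of the paper).
selector = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Apply KNN on the selected subset and measure the classifier's accuracy.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_sel, y_train)
print("accuracy:", accuracy_score(y_test, knn.predict(X_test_sel)))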
KNN:
To demonstrate a k-nearest neighbour analysis, let's consider the task of classifying a new object (query point) among a number of known examples. This is shown in the figure below, which depicts the examples (instances) with the plus and minus signs and the query point with a red circle. Our task is to estimate (classify) the outcome of the query point based on a selected number of its nearest neighbours. In other words, we want to know whether the query point can be classified as a plus or a minus sign.
To proceed, let's consider the outcome of KNN based on 1 nearest neighbour. It is clear that in this case KNN will predict the outcome of the query point to be a plus (since the closest point carries a plus sign). Now let's increase the number of nearest neighbours to 2. This time KNN will not be able to classify the query point, since the second closest point is a minus, and so the plus and the minus signs achieve the same score (i.e., win the same number of votes). For the next step, let's increase the number of nearest neighbours to 5. This defines a nearest-neighbour region, indicated by the circle in the figure. Since this circle contains 2 plus signs and 3 minus signs, KNN will assign a minus sign to the query point.
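This voting behaviour can be sketched in a few lines of Python. The point coordinates and labels below are hypothetical, chosen only so that k = 1, 2 and 5 reproduce the outcomes discussed above.

import math
from collections import Counter

def knn_predict(query, points, labels, k):
    # Sort training points by Euclidean distance to the query,
    # then take a majority vote among the k nearest labels.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2D examples: two plus signs and three minus signs.
points = [(1.0, 1.0), (1.6, 1.6), (2.0, 1.2), (1.2, 2.2), (2.4, 2.0)]
labels = ['+', '-', '+', '-', '-']
query = (1.2, 1.2)

for k in (1, 2, 5):
    print(k, knn_predict(query, points, labels, k))
# k=1 -> '+'; k=2 is a tie (one '+', one '-'), which this simple
# implementation breaks in favour of the nearer class; k=5 -> '-'.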
GA:
Genetic Algorithms (GAs) are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics laid down by Charles Darwin ("survival of the fittest"). They are used to solve optimization problems. A GA exploits historical information to direct the search into regions of better performance within the search space. The basic techniques of GAs are designed to simulate the processes in natural systems necessary for evolution.
The Algorithm
1. randomly initialize population(t)
2. determine fitness of population(t)
3. repeat
   3.1 select parents from population(t)
   3.2 perform crossover on parents creating population(t+1)
   3.3 perform mutation of population(t+1)
   3.4 determine fitness of population(t+1)
4. until best individual is good enough
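A minimal, self-contained Python sketch of this loop follows. Bit-string individuals and a toy "one-max" fitness (the number of 1-bits) are assumptions made purely for illustration; in the proposed method the fitness function would instead score attribute subsets.

import random

def fitness(ind):
    # Toy "one-max" fitness: count of 1-bits (an assumption for illustration).
    return sum(ind)

def select(pop):
    # Fitness-proportional (roulette-wheel) selection of one parent;
    # +1 keeps every weight positive even if all fitnesses are zero.
    return random.choices(pop, weights=[fitness(i) + 1 for i in pop], k=1)[0]

def crossover(p1, p2):
    # Single-point crossover: exchange the values up to a random point.
    point = random.randrange(1, len(p1))
    return p2[:point] + p1[point:], p1[:point] + p2[point:]

def mutate(ind, rate=0.01):
    # Flip each bit with a low probability.
    return [1 - b if random.random() < rate else b for b in ind]

def genetic_algorithm(n_bits=20, pop_size=30, generations=50):
    # 1. randomly initialize population(t)
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # 3. select, crossover and mutate to create population(t+1)
        next_pop = []
        while len(next_pop) < pop_size:
            c1, c2 = crossover(select(pop), select(pop))
            next_pop += [mutate(c1), mutate(c2)]
        pop = next_pop
        # 4. stop when the best individual is good enough
        if fitness(max(pop, key=fitness)) == n_bits:
            break
    return max(pop, key=fitness)

print(genetic_algorithm())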
After an initial population is randomly generated, the algorithm evolves through three operators:
1. selection, which equates to survival of the fittest;
2. crossover, which represents mating between individuals;
3. mutation, which introduces random modifications.
1. Selection Operator
• Key idea: give preference to better individuals, allowing them to pass on their genes to the next generation. The goodness of each individual depends on its fitness. Fitness may be determined by an objective function or by a subjective judgement.
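A short sketch of fitness-proportional (roulette-wheel) selection, one common way of giving preference to better individuals; the fitness values below are placeholders.

import random

def roulette_select(population, fitnesses):
    # Pick one individual with probability proportional to its fitness.
    return random.choices(population, weights=fitnesses, k=1)[0]

population = ["A", "B", "C", "D"]
fitnesses = [1.0, 3.0, 5.0, 1.0]               # placeholder fitness values
print(roulette_select(population, fitnesses))  # "C" is selected most often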
2. Crossover Operator
• Crossover is the prime factor distinguishing GA from other optimization techniques. Two individuals are chosen from the population using the selection operator, a crossover site along their bit strings is randomly chosen, and the values of the two strings are exchanged up to this point.
• If S1 = 000000 and S2 = 111111 and the crossover point is 2, then S1' = 110000 and S2' = 001111.
• The two new offspring created from this mating are put into the next generation of the population. By recombining portions of good individuals, this process is likely to create even better individuals.
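A tiny sketch of this single-point crossover on bit strings, reproducing the S1/S2 example above (the crossover point is fixed at 2 here rather than chosen at random):

def single_point_crossover(s1, s2, point):
    # Exchange the values of the two strings up to the crossover point.
    return s2[:point] + s1[point:], s1[:point] + s2[point:]

s1, s2 = "000000", "111111"
print(single_point_crossover(s1, s2, 2))    # ('110000', '001111')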
3. Mutation Operator
• With some low probability, a portion of the new individuals will have some of their bits flipped. Its purpose is to maintain diversity within the population and inhibit premature convergence.
• Mutation alone induces a random walk through the search space. Mutation and selection (without crossover) create a parallel, noise-tolerant, hill-climbing algorithm.
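A minimal sketch of bit-flip mutation with a deliberately low flip probability:

import random

def mutate(bits, rate=0.01):
    # Flip each bit independently with small probability `rate`.
    flipped = []
    for b in bits:
        flipped.append(("1" if b == "0" else "0") if random.random() < rate else b)
    return "".join(flipped)

print(mutate("110000", rate=0.1))   # usually unchanged; occasionally a bit flips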
CONCLUSION:
Thus, this paper presents an improved KNN-GA with feature ranking as an important aspect, which overcomes the problems of traditional KNN-GA. The algorithm can be applied to datasets with fewer attributes, and it can be used for classification problems that demand high performance, such as classification of heart disease and early detection of lung cancer. It also saves computational cost and time.