08-11-2016, 04:03 PM
Abstract
Copying a code fragment and reusing it by pasting, with or without minor modifications, is a common practice that can improve the productivity of software development. Unintentional clones may also appear in the source code without the developer's awareness. While code cloning may increase initial productivity, it may also cause fault propagation, inflate the code base, and increase maintenance overhead. Detection techniques realized as tools, together with large-scale in-depth analyses of clones, inform clone management and help in devising effective techniques and strategies.
We developed a clone detector as a plug-in to the Eclipse IDE. It follows a hybrid approach that combines the strengths of both parser-based and text-based techniques, and applies a novel suffix-tree-based k-difference hybrid algorithm. The tool aids clone-aware development by allowing a focused search for clones of any code fragment of the developer's interest.
This project presents an approach that takes as input two clone fragments, detected by any tool, and applies the following three steps to determine whether they can be re-factored without any side effects.
Here we detect code cloning using the code attributes given below:
1] Number of lines
2] Number of brackets
3] Number of imports
4] Number of commented lines
5] Author name
6] File name
7] Last modified date
INTRODUCTION
Software projects contain much similar code (i.e., code clones), which may be introduced by many commonly adopted software development practices, such as reusing a generic framework, following a specific programming pattern, and directly copying and pasting code. These practices can improve the productivity of software development by quickly replicating similar functionalities. However, such practices, especially copying and pasting, can also reduce program maintainability and introduce subtle programming errors. For example, when enhancements or bug fixes are done on a piece of duplicated code, it is often necessary to make similar modifications to the other instances of the code. It is easy for developers to miss some instances of the duplicated code and thus to introduce subtle bugs. “I think I have fixed the bug. Why is it still happening?” and “Why does the function work well in that way, but not in this way?” may be example questions that software maintainers ask and which may allude to clone-related bugs. Finding similar code automatically is an important step to alleviate the aforementioned issues. Here we have proposed a technique for eliminating similar code to help reduce software maintenance cost.
It has been recognized that code duplication is a serious problem in software systems, with a negative effect on their maintenance and evolution [1]-[2]. Over the last few years, different research communities have developed several techniques to detect and analyse duplicated code [3]. More recent research focuses on clone management activities [4], including tracing clones in the history of a project, analysing the consistency of modifications to clones, incrementally updating clone groups as the project evolves, and prioritizing the refactoring of clones. In addition, several researchers have empirically investigated the effect of duplicated code on maintenance effort and cost, error-proneness due to inconsistent updates, software defects, change-proneness, and change propagation. However, there is a lack of tools that can automatically analyse software clones to determine whether they can be safely re-factored without changing the behaviour of the program. Refactorability analysis is an important but missing feature of clone management: when developers are interested in finding refactoring opportunities for duplicated code, it could be used to filter the clones that can be directly re-factored.
In this way, maintainers can focus on the parts of the code that can immediately benefit from refactoring, which improves maintainability.
This paper presents an approach that takes as input two clone fragments, detected by any tool, and applies the following three steps to determine whether they can be re-factored without any side effects.
Here we detect code cloning using the code attributes given below:
1] Number of lines
2] Number of brackets
3] Number of imports
Step 1: In this step, the approach finds code fragments with identical nesting structures within the input clones, which serve as potential refactoring opportunities. If two code fragments share a common nesting structure, they are considered unifiable, and therefore re-factorable.
Step 2: In this step, the approach finds a mapping between the statements of the code fragments that maximizes the number of mapped statements and minimizes the number of differences between them, by exploring the search space of alternative mapping solutions.
Step 3: In the last step, the differences between the mapped statements detected in the previous step are examined against a set of preconditions, to determine whether they can be parameterized without changing the program behaviour.
Optical Character Recognition (OCR), also called Optical Character Reading, is the process of taking an image of letters or typed text and converting it into data the computer understands. A good example is companies and libraries taking physical copies of books, magazines, or other old printed material and using OCR to put them onto computers. While far from perfect, OCR is currently the best method of digitizing typed pages of text.
The OCR algorithm relies on a set of learned characters: it compares the characters in the scanned image file to the characters in this learned set. Generating the learned set is quite simple. It requires an image file containing the desired characters in the desired font, and a text file representing the characters in that image file.
OCR is the recognition of printed or written text characters by a computer. This involves photo-scanning the text character by character, analyzing the scanned-in image, and then translating each character image into a character code, such as ASCII, commonly used in data processing. In OCR processing, the scanned-in image or bitmap is analyzed for light and dark areas in order to identify each alphabetic letter or numeric digit. When a character is recognized, it is converted into an ASCII code. Special circuit boards and computer chips designed expressly for OCR are used to speed up the recognition process.
OCR technology automatically recognizes characters through an optical mechanism. For human beings, the eyes are the optical mechanism: the image seen by the eyes is the input to the brain, and the ability to understand this input varies from person to person according to many factors. OCR is a technology that functions like the human ability to read, although it cannot yet compete with human reading capabilities.
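The template-matching idea described above, comparing a scanned glyph against a learned set of character bitmaps, can be sketched as follows. This is a minimal illustration, not a production OCR engine; the 3x3 "glyphs" and the two-character learned set are toy data invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OcrSketch {
    // Learned set: character -> binary bitmap (true = dark pixel).
    static final Map<Character, boolean[][]> LEARNED = new LinkedHashMap<>();
    static {
        LEARNED.put('I', new boolean[][] {
            {false, true, false},
            {false, true, false},
            {false, true, false}});
        LEARNED.put('L', new boolean[][] {
            {true, false, false},
            {true, false, false},
            {true, true, true}});
    }

    // Classify a scanned glyph as the learned character whose bitmap
    // agrees with it on the largest number of pixels.
    static char recognize(boolean[][] glyph) {
        char best = '?';
        int bestScore = -1;
        for (Map.Entry<Character, boolean[][]> e : LEARNED.entrySet()) {
            int score = 0;
            for (int r = 0; r < glyph.length; r++)
                for (int c = 0; c < glyph[r].length; c++)
                    if (glyph[r][c] == e.getValue()[r][c]) score++;
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // A noisy 'L' with one flipped pixel still matches 'L' best.
        boolean[][] noisyL = {
            {true, false, false},
            {true, false, true},
            {true, true, true}};
        System.out.println(recognize(noisyL)); // prints L
    }
}
```

Because the match is by maximum agreement rather than exact equality, the sketch tolerates the small amount of scanning noise mentioned above.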
BASIC CONCEPT
• Text-based Approach: In this approach, the target source program is considered as a sequence of lines/strings. Two code fragments are compared with each other to find sequences of identical text. Once two or more code fragments are found to be similar to their maximum possible extent (e.g., w.r.t. the maximum number of lines), they are returned as a clone pair or clone class by the detection technique.
• Token-based Approach: In the token-based detection technique, a sequence of tokens is formed by lexing/parsing/transforming the entire source code. This sequence is then scanned for duplicated subsequences of tokens, and finally the original code portions representing the duplicated subsequences are returned as clones. Compared to text-based approaches, a token-based approach is usually more robust against code changes such as formatting and spacing.
• Tree-based Approach: In the tree-based approach, a program is parsed into a parse tree or an abstract syntax tree (AST) using a parser for the language of interest. A matching technique searches for similar sub-trees, and the corresponding source code of the similar sub-trees is returned as clone pairs or clone classes.
• Program Dependency Graph (PDG): This approach goes one step further than the others and achieves a higher abstraction of the source code by considering its semantic information. The control flow and data flow information of a program is available in the PDG, which therefore carries semantic information. Once a set of PDGs is obtained from a subject program, an isomorphic sub-graph matching algorithm is applied to find similar sub-graphs, which are returned as clones.
• Metric-based Approach: Metric-based approaches gather different metrics for code fragments and compare these metric values instead of comparing the code directly. A number of clone detection techniques use different software metrics to detect similar code. First, a set of software metrics, called fingerprinting functions, is calculated for one or more syntactic units such as a class, a function, a method, or even a statement; then the metric values are compared to find code clones over those syntactic units.
• Hashing Technique: The tool uses an optimization based on a string hash function, which reduces the computational complexity by a constant factor determined by the number of characters in a line. Even so, meaningful clone resolution is difficult to achieve in a language-independent manner, because it is hard to guarantee that a detected clone represents a cohesive unit in the language being analyzed; the resolution process itself also depends on the language in question.
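The hashing optimization above can be sketched as follows: each line is normalized and hashed once, so comparing two fragments costs one hash lookup per line instead of a character-by-character comparison. This is a minimal illustration; the two sample fragments are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LineHashClones {
    // Normalization makes the comparison robust to spacing differences.
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Return the normalized lines of fragment2 whose hash also occurs
    // among the normalized lines of fragment1.
    static List<String> sharedLines(List<String> fragment1, List<String> fragment2) {
        Set<Integer> seen = new HashSet<>();
        for (String line : fragment1) seen.add(normalize(line).hashCode());
        List<String> shared = new ArrayList<>();
        for (String line : fragment2)
            if (seen.contains(normalize(line).hashCode()))
                shared.add(normalize(line));
        return shared;
    }

    public static void main(String[] args) {
        List<String> a = List.of("int x = 0;", "x++;", "return x;");
        List<String> b = List.of("int  x = 0;", "x--;", "return x;");
        System.out.println(sharedLines(a, b)); // [int x = 0;, return x;]
    }
}
```

A real tool would additionally resolve the matched lines back to cohesive syntactic units, which, as noted above, is language-dependent.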
Following are the two core program structures that are used in this approach:
a) Program Structure Tree.
b) Program Dependence Graph.
a) Program Structure Tree
The Program Structure Tree (PST) [5], introduced by Johnson et al., is a hierarchical representation of program structure based on the single-entry single-exit (SESE) regions of the control flow graph. Essentially, the PST captures the nesting relationship of SESE regions and chains of sequentially composed SESE regions.
b) Program Dependence Graph
The Program Dependence Graph (PDG) [6] is a directed graph with multiple edge types. In a PDG, the nodes denote the statements of a function or method, and the edges denote control and data flow dependencies between statements. This approach extends the PDG representation in two ways. First, it introduces composite variables that represent the state of the objects referenced in the body of a method, and creates data dependencies for these variables. Second, it adds two more types of edges to the PDG, which are helpful in the examination of preconditions: anti-dependencies and output dependencies.
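A minimal sketch of this PDG representation is shown below: statements are nodes, and typed edges record control and data dependencies, alongside the anti- and output-dependence edge types the approach adds. The three-statement fragment and its edges are toy data invented for this example.

```java
import java.util.List;

public class PdgSketch {
    enum EdgeType { CONTROL, DATA, ANTI, OUTPUT }

    record Edge(int from, int to, EdgeType type) {}

    // Statements of a toy fragment; the list indices are the PDG node ids.
    static final List<String> STATEMENTS = List.of(
        "int sum = 0;",     // node 0
        "if (n > 0)",       // node 1
        "sum = sum + n;");  // node 2

    // Build the typed dependence edges for the toy fragment above.
    static List<Edge> buildExample() {
        return List.of(
            new Edge(1, 2, EdgeType.CONTROL), // node 2 executes only under node 1
            new Edge(0, 2, EdgeType.DATA),    // node 2 reads sum defined at node 0
            new Edge(0, 2, EdgeType.OUTPUT)); // node 2 redefines sum after node 0
    }

    public static void main(String[] args) {
        for (Edge e : buildExample())
            System.out.println(STATEMENTS.get(e.from()) + "  -" + e.type()
                + "->  " + STATEMENTS.get(e.to()));
    }
}
```

The ANTI edge type is unused in this tiny fragment but would connect a read of a variable to a later redefinition of it.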
CHAPTER 3
LITERATURE SURVEY
In [1], an approach is presented that takes as input two clone fragments detected by any tool and applies three steps to determine whether they can be safely refactored (i.e., without any side effects). First, the approach finds code fragments with identical nesting structures within the input clones that could serve as potential refactoring opportunities; two code fragments are considered unifiable, and therefore refactorable, if they share a common nesting structure. In the second step, the approach finds a mapping between the statements of the code fragments that maximizes the number of mapped statements and minimizes the number of differences between them by exploring the search space of alternative mapping solutions.
In [2], a technique is presented for the refactoring of software clones in Java programs that tackles the aforementioned limitations. The approach takes as input two code fragments, or even entire methods, that have been detected as clones by clone detection tools, and applies three steps to determine whether the clones, or parts of them, can be safely refactored. In the first step, it tries to find identical control dependence structures within the clones that will serve as candidate refactoring opportunities. In the second step, it applies a mapping approach that tries to maximize the number of mapped statements and at the same time minimize the number of differences between them.
In [3], a comprehensive qualitative comparison and evaluation of all currently available clone detection techniques and tools is presented in the context of a unified conceptual framework. Beginning with a basic introduction to clone detection background and terminology, the current techniques and tools are organized into a taxonomy based on a generic clone detection process model, and are then classified, compared, and evaluated along two different dimensions.
In [4], the benefits of the reuse mechanism of code cloning are discussed. For instance, cloning existing code that is already known to be flawless might save developers from mistakes they might have made if they had to implement the same functionality from scratch. It also saves the time and effort of devising the logic and typing the corresponding textual code. Code cloning may also help in decoupling classes or components and facilitate the independent evolution of similar feature implementations.
In [5], the program structure tree (PST) is introduced: a hierarchical representation of program structure based on the single-entry single-exit (SESE) regions of the control flow graph. A linear-time algorithm is given for finding SESE regions and for building the PST of arbitrary control flow graphs (including irreducible ones). Next, a connection is established between SESE regions and control dependence equivalence classes, and it is shown how to use the algorithm to find control regions in linear time. Finally, some applications of the PST are discussed: many control flow algorithms, such as the construction of Static Single Assignment form, can be sped up by applying them in a divide-and-conquer style to each SESE region on its own.
In [6], an intermediate program representation is presented, called the program dependence graph (PDG), that makes explicit both the data and control dependences for each operation in a program. Data dependences have been used to represent only the relevant data flow relationships of a program; control dependences are introduced to analogously represent only the essential control flow relationships, and are derived from the usual control flow graph. Many traditional optimizations operate more efficiently on the PDG: since dependences in the PDG connect computationally related parts of the program, a single walk of these dependences is sufficient to perform many optimizations.
In [7], the functionality of Optical Character Recognition is combined with a speech synthesizer. The objective is to develop a user-friendly application that performs image-to-speech conversion on Android phones. The OCR takes an image as input, extracts the text from that image, and then converts it into speech. This system can be useful in various applications such as banking, the legal industry, other industries, and home and office automation. It is mainly designed for people who are unable to read any type of text document. The character recognition method is presented using OCR technology and an Android phone with a higher-quality camera.
PROBLEM STATEMENT & SCOPE
4.1 Problem Statement:
This approach processes two different forms of input:
1) Two code fragments declared as clones by a clone detection tool, within the body of the same method or of different methods.
2) Two method declarations considered to be duplicated, or containing duplicated code fragments somewhere inside their bodies.
This approach uses three major steps for assessing refactorability:
1) Nesting Structure Matching
2) Statement Mapping
3) Precondition Examination
4.2 Scope:
We propose a novel approach that automatically predicts whether a code cloning operation requires consistency maintenance at the time point of performing copy-and-paste operations.
We evaluate our approach under two usage scenarios:
1) Recommend that developers perform only the cloning operations, and gather statistics.
2) Recommend that developers perform all cloning operations unless they are predicted to require consistency maintenance, and perform analysis of the code.
SOFTWARE REQUIREMENT SPECIFICATION
6.1 User Interface Requirements:
The home page of the user interface will include a Browse button to take the version history as input.
One button for feature extraction; on clicking, it will show the features and their attributes.
One button for the code clone tracker, which will show whether code is copy-pasted.
One button for the change tracker, which will show the file name and line number where changes were performed.
One button for prediction, to get the final output.
6.2 Hardware Requirements:
System : Pentium IV, 2.4 GHz
Hard Disk : 40 GB
Mouse : Optical mouse
RAM : 512 MB
Keyboard : 101-key keyboard
6.3 Software Requirements:
Operating system : Windows 7 Ultimate (32-bit OS) /Windows 8 (64-bit OS)
Coding Language : Advanced Java
Data Base : MySQL
6.4 Functional Requirements:
6.4.1 Code Tracker:
It should identify copied code.
6.4.2 Change Tracker:
It should identify wherever changes are performed in the code.
6.4.3 Feature Extractor:
It should extract features & attributes of code.
6.4.4 Prediction module:
It should predict whether code is consistency-maintenance-required or consistency-maintenance-free.
6.5 Non-functional Requirements
6.5.1 Reliability
The degree to which the software is expected to perform its required functions under stated conditions for a stated period of time.
6.5.2 Availability
Software availability is the probability that a program is operating according to requirements at a given point in time.
6.5.3 Security
Software security is a software quality assurance activity that focuses on the identification and assessment of potential hazards that may affect software negatively and cause an entire system to fail.
6.5.4 Maintainability
The effort needed to make changes in the software.
Maintainability = suitability for debugging (localization and correction of errors) and for modification and extension of functionality.
The maintainability of a software system depends on its:
• Readability
• Extensibility
• Testability
6.5.5 Portability
1. The ease with which a software system can be adapted to run on computers other than the one for which it was designed.
2. The portability of a software system depends on:
• Degree of hardware independence
• Implementation language
• Extent of exploitation of specialized system functions
• Hardware properties
METHODOLOGY & ALGORITHMS
This approach processes two different forms of input:
1) Two code fragments declared as clones by a clone detection tool, within the body of the same method or of different methods.
2) Two method declarations considered to be duplicated, or containing duplicated code fragments somewhere inside their bodies.
7.1 Algorithm:
Step 1: Start.
Step 2: Take image files of two code fragments as input.
Step 3: Convert the image files of the code fragments into .txt files.
Step 4: Highlight the code that differs.
Step 5: Display the complete highlighted code to determine whether the code is copied or not.
Step 6: Display the similar code present in both code fragments.
Step 7: Display the number of lines present in code fragment 1.
Step 8: Display the number of lines present in code fragment 2.
Step 9: Display the total number of brackets in code fragment 1.
Step 10: Display the total number of brackets in code fragment 2.
Step 11: Display the number of imports in code fragment 1.
Step 12: Display the number of imports in code fragment 2.
Step 13: Display the commented lines of code fragment 1.
Step 14: Display the commented lines of code fragment 2.
Step 15: Display the file name of code fragment 1.
Step 16: Display the file name of code fragment 2.
Step 17: Display the author name of code fragment 1.
Step 18: Display the author name of code fragment 2.
Step 19: Display the last modified date of code fragment 1.
Step 20: Display the last modified date of code fragment 2.
Step 21: Generate a dependency graph from the analysis of the code fragments.
Step 22: Analyse the generated dependency graph in reverse.
Step 23: End.
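The attribute-extraction part of the algorithm (the per-fragment line, bracket, import, and comment counts) can be sketched as follows. This is a minimal illustration over one fragment given as a list of lines; the sample fragment is invented for this example.

```java
import java.util.List;

public class FragmentAttributes {
    static int lineCount(List<String> lines) {
        return lines.size();
    }

    // Count round and curly brackets across the fragment.
    static int bracketCount(List<String> lines) {
        int n = 0;
        for (String line : lines)
            for (char c : line.toCharArray())
                if (c == '{' || c == '}' || c == '(' || c == ')') n++;
        return n;
    }

    // Count lines that are import statements.
    static int importCount(List<String> lines) {
        int n = 0;
        for (String line : lines)
            if (line.trim().startsWith("import ")) n++;
        return n;
    }

    // Count single-line comments.
    static int commentCount(List<String> lines) {
        int n = 0;
        for (String line : lines)
            if (line.trim().startsWith("//")) n++;
        return n;
    }

    public static void main(String[] args) {
        List<String> fragment = List.of(
            "import java.util.List;",
            "// entry point",
            "public static void main(String[] args) {",
            "}");
        System.out.println(lineCount(fragment));    // 4
        System.out.println(bracketCount(fragment)); // 4
        System.out.println(importCount(fragment));  // 1
        System.out.println(commentCount(fragment)); // 1
    }
}
```

Running these counters on both fragments and comparing the resulting values gives the attribute comparison described in Steps 7 through 14.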
Here are the three major steps for assessing the refactorability in this approach:
1) Nesting Structure Matching:
In this step, the nesting structure of the input clone fragments is analyzed to find maximal isomorphic sub-trees. It is assumed that two code fragments can be unified only if they have an identical nesting structure. Each matched sub-tree pair is further investigated as a separate clone refactoring opportunity in the next steps.
When the nesting structure matching function is used, it outputs the copy-pasted code found by comparing the two files.
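The structural check behind this step can be sketched as follows: two fragments are candidates for unification only if their nesting trees (here reduced to trees of block kinds such as FOR and IF) are isomorphic. The tree shapes below are toy data invented for this example.

```java
import java.util.List;

public class NestingMatch {
    // A nesting-tree node: the kind of block and its nested children.
    record Node(String kind, List<Node> children) {}

    // Two nesting trees match if node kinds and child shapes agree throughout.
    static boolean isomorphic(Node a, Node b) {
        if (!a.kind().equals(b.kind())) return false;
        if (a.children().size() != b.children().size()) return false;
        for (int i = 0; i < a.children().size(); i++)
            if (!isomorphic(a.children().get(i), b.children().get(i)))
                return false;
        return true;
    }

    public static void main(String[] args) {
        Node clone1 = new Node("FOR", List.of(new Node("IF", List.of())));
        Node clone2 = new Node("FOR", List.of(new Node("IF", List.of())));
        Node clone3 = new Node("FOR", List.of(new Node("WHILE", List.of())));
        System.out.println(isomorphic(clone1, clone2)); // true
        System.out.println(isomorphic(clone1, clone3)); // false
    }
}
```

In the actual approach, each matched sub-tree pair (not just whole trees) becomes a separate refactoring opportunity for the next step.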
2) Statement Mapping:
The statements extracted within the sub-tree pairs from the previous step are mapped in a divide-and-conquer fashion. Taking advantage of the identical nesting structure of the isomorphic sub-trees, the global mapping problem is divided into smaller sub-problems. For each sub-problem, the corresponding program dependence sub-graphs are mapped by applying a Maximum Common Sub-graph (MCS) algorithm. At the end, these sub-solutions are combined into the global mapping solution. When the user clicks on Statement Mapping & Statistics, the Predictive Code Clone Detector frame is shown, displaying all the output obtained by comparing the two files.
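The mapping objective can be illustrated with a simplified stand-in: the approach itself maps statements via a Maximum Common Sub-graph algorithm over PDGs, but a longest-common-subsequence alignment over statement lists likewise maximizes the number of mapped statements and is easy to show self-contained. The two fragments below are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class StatementMapping {
    // Standard longest-common-subsequence dynamic program over two
    // statement lists; returns the statements mapped in both fragments.
    static List<String> mapStatements(List<String> f1, List<String> f2) {
        int[][] dp = new int[f1.size() + 1][f2.size() + 1];
        for (int i = f1.size() - 1; i >= 0; i--)
            for (int j = f2.size() - 1; j >= 0; j--)
                dp[i][j] = f1.get(i).equals(f2.get(j))
                        ? dp[i + 1][j + 1] + 1
                        : Math.max(dp[i + 1][j], dp[i][j + 1]);
        List<String> mapped = new ArrayList<>();
        int i = 0, j = 0;
        while (i < f1.size() && j < f2.size()) {
            if (f1.get(i).equals(f2.get(j))) { mapped.add(f1.get(i)); i++; j++; }
            else if (dp[i + 1][j] >= dp[i][j + 1]) i++;
            else j++;
        }
        return mapped;
    }

    public static void main(String[] args) {
        List<String> f1 = List.of("int t = 0;", "t += a;", "return t;");
        List<String> f2 = List.of("int t = 0;", "t += b;", "return t;");
        System.out.println(mapStatements(f1, f2)); // [int t = 0;, return t;]
    }
}
```

The unmapped statements (`t += a;` vs. `t += b;`) are exactly the differences that the next step examines against the preconditions.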
3) Precondition Examination:
Based on the differences between the mapped statements in the global solution, as well as any statements that were not mapped, a set of preconditions regarding the preservation of program behavior is examined. If no preconditions are violated, the clone fragments corresponding to the mapped statements can be safely refactored, and are thus considered refactorable.
When the precondition examination is used, it shows the number of lines, the number of brackets, and the number of imports of the files used in the code fragments.
The following figure shows the steps mentioned above: