09-09-2017, 10:19 AM
The tree is one of the most common and well-studied data structures in computing, and measuring the similarity of trees is key to analyzing tree-structured data. This is not trivial, because the inherent structural complexity of trees leads to a large search space. Tree kernels, a state-of-the-art measure of tree similarity, represent trees as vectors in a feature space and measure similarity in that space. Different features require different algorithms. Tree edit distance is another widely used similarity measure for trees. It quantifies similarity by the edit operations needed to transform one tree into another. Without any restriction on the edit operations, the computational cost is too high for large volumes of data. Approximations of tree edit distance have been introduced to improve efficiency, but they may compromise effectiveness.
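To make the "trees as vectors in a feature space" idea concrete, here is a minimal sketch of one very simple kernel. It is not the kernel used in the work described above: the choice of features (one count per parent label together with its child labels), the `(label, [children])` tree encoding, and the names `production_features` and `tree_kernel` are all assumptions made for illustration.

```python
from collections import Counter


def production_features(tree):
    """Count (label, child-labels) pairs in a tree given as (label, [children])."""
    label, children = tree
    feats = Counter()
    # One feature per node: its label plus the ordered labels of its children.
    feats[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        feats.update(production_features(child))
    return feats


def tree_kernel(t1, t2):
    """Dot product of the two feature-count vectors (hypothetical feature choice)."""
    f1, f2 = production_features(t1), production_features(t2)
    # A Counter returns 0 for features that are absent from f2.
    return sum(count * f2[feat] for feat, count in f1.items())


# Two small example trees (illustrative data only).
t1 = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", [])])])
t2 = ("S", [("NP", [("N", [])]), ("VP", [("V", [])])])

print(tree_kernel(t1, t2))  # shared features: S, N, VP, V
```

Choosing a different feature set (for example whole subtrees rather than single nodes with their children) changes what has to be counted, which is why different features call for different algorithms.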
Trees are represented as multidimensional sequences, and their similarity is measured on the basis of these sequential representations. Multidimensional sequences have sequential dimensions and spatial dimensions. We measure sequential similarity with the all common subsequences measure or the longest common subsequence measure, and spatial similarity with dynamic time warping. The two are then combined into a single measure of tree similarity. A brute-force algorithm for computing this similarity has a high computational cost, so, in the spirit of dynamic programming, two efficient algorithms with quadratic time complexity are designed. The new measures are evaluated in terms of classification accuracy with two popular classifiers (k-nearest neighbor and support vector machine) and in terms of effectiveness and efficiency in nearest-neighbor similarity search, using three data sets from natural language processing and information retrieval. The experimental results show that the new measures consistently and significantly outperform the baseline measures.
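The two building blocks named above, longest common subsequence and dynamic time warping, both have textbook dynamic programming solutions that run in quadratic time. The sketch below shows those textbook versions on plain sequences only. How a tree is turned into a multidimensional sequence and how the two scores are combined is not stated in this post, so neither step is attempted here; the function names `lcs_length` and `dtw_distance` and the use of absolute difference as the local cost are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, in O(len(a) * len(b))."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def dtw_distance(x, y):
    """Dynamic time warping distance between numeric sequences x and y.

    Assumption: the local cost is the absolute difference of two values.
    """
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] is the cheapest alignment cost of x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


print(lcs_length("ABCBDAB", "BDCABA"))
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Both functions fill an (n + 1) by (m + 1) table once, which is where the quadratic cost comes from, in contrast to a brute-force enumeration of all subsequences or all alignments.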