22-10-2016, 11:22 AM
Abstract
Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems: in some cases, a big graph is chopped into pieces that fit on one machine and are processed independently. Two algorithms are used for distributed balanced partitioning: the JA-BE-JA algorithm, and distributed graph cuts, which builds on the BK-algorithm.
Introduction
Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems: in some cases, a big graph is chopped into pieces that fit on one machine to be processed independently before stitching the results together, leading to certain sub-optimality from the interaction among different pieces. In other cases, links between different parts may show up in the running time and/or network communications cost, hence the desire to have small cut size. We study a distributed balanced partitioning problem where the goal is to partition the vertices of a given graph into k pieces, minimizing the total cut size.
Our algorithm consists of three main parts:
1. We first find a suitable mapping of the vertices to a line. This gives us an ordering of the vertices that presumably places (most) neighbours close to each other, thereby reducing the minimum-cut partitioning problem to an almost local optimization.
2. We next attempt to improve the ordering, mainly by swapping vertices in a semi-local manner. These moves are made so as to improve certain metrics (e.g., the cut size of a fully balanced partition).
3. Finally, we apply local post-processing optimization in the "split windows" (i.e., small intervals around the equal-size cut points, taking the permissible imbalance into account) to improve the partition's cut size.
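The three steps can be sketched on a toy graph. This is a deliberately simplified, hypothetical illustration: a plain BFS ordering stands in for the actual line-embedding step, and random order swaps accepted only when the balanced cut does not worsen stand in for the semi-local swapping and window post-processing. All function names here are illustrative, not the paper's implementation.

```python
import random
from collections import deque

def bfs_ordering(adj, start=0):
    """Step 1 (simplified): map vertices to a line via BFS so that
    many neighbours land close together on the line."""
    order, seen, q = [], {start}, deque([start])
    while q:
        v = q.popleft()
        order.append(v)
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                q.append(u)
    return order

def partition_from_order(order, k):
    """Cut the line into k equal-size contiguous pieces."""
    n = len(order)
    return {v: min(i * k // n, k - 1) for i, v in enumerate(order)}

def cut_size(adj, part):
    """Number of edges whose endpoints fall in different pieces."""
    return sum(1 for v in adj for u in adj[v] if v < u and part[v] != part[u])

def improve_by_swaps(adj, order, k, iters=1000, seed=0):
    """Steps 2-3 (simplified): randomly swap two positions on the line
    and keep the swap only if the balanced cut does not get worse."""
    rng = random.Random(seed)
    best = cut_size(adj, partition_from_order(order, k))
    for _ in range(iters):
        i, j = rng.randrange(len(order)), rng.randrange(len(order))
        order[i], order[j] = order[j], order[i]
        c = cut_size(adj, partition_from_order(order, k))
        if c <= best:
            best = c
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best

# Toy graph: two triangles joined by a single edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
order = bfs_ordering(adj)
order, cut = improve_by_swaps(adj, order, k=2)
print(cut)  # 1: the single joining edge is the ideal 2-way cut here
```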
Every day, petabytes of data are generated and processed in on-line social networking services. Some of this data can be modeled as a graph, in which nodes represent users and edges represent the relationships between them. Similarly, search engines manage very large amounts of data to capture and analyze the structure of the Internet. This data, too, can be modeled as a graph, with websites as nodes and the hyperlinks between them as edges. One important problem in graph-structured data processing is partitioning: extremely large graphs must be distributed over hosts in such a way that most adjacent edges are stored on the same host. The graph partitioning problem, sometimes referred to as the min-cut problem, is formulated as dividing a graph into a predefined number of components such that the number of edges between different components is small. A variant of this problem is the balanced or uniform graph partitioning problem, where it is also important that the components hold an equal number of nodes. Important applications include biological networks, circuit design, parallel programming, load balancing, graph databases and on-line social network analysis. The motivation for graph partitioning depends on the application: a good partitioning can be used to minimize communication cost, to balance load, or to identify densely connected clusters.
JA-BE-JA Algorithm
JA-BE-JA is a distributed balanced graph partitioning algorithm that does not require any global knowledge of the graph topology. That is, we do not have cheap access to the entire graph and must process it with only partial information. JA-BE-JA is a decentralized local search algorithm. Each node of the graph is a processing unit with local information about its neighbouring nodes, plus a small subset of random nodes in the graph, which it acquires through purely local interactions. Initially, every node selects a random partition, and over time nodes swap partitions to increase the number of neighbours they have in their own partition.
Our algorithm is uniquely designed to deal with extremely large distributed graphs. It achieves this through its locality, simplicity and lack of synchronization requirements, which enable it to be adapted easily to graph processing frameworks. Furthermore, JA-BE-JA can be applied to fully distributed graphs, where each network node represents a single graph vertex. To evaluate JA-BE-JA, we use multiple datasets with different characteristics, including a few synthetically generated graphs, some graphs that are well known in the graph partitioning community, and some sampled graphs from Facebook and Twitter. We first investigate the impact of different heuristics on the resulting partitioning of the input graphs, and then compare JA-BE-JA to METIS, a well-known centralized solution. We show that, although JA-BE-JA does not have cheap random access to the graph data, it can work as well as, and sometimes even better than, a centralized solution. In particular, for large graphs that represent real-world social network structures, such as Facebook and Twitter, JA-BE-JA outperforms METIS. In the next section we define the exact problem that we are targeting, together with the boundary requirements of the potential applications.
The problem that we address in this paper is distributed balanced k-way graph partitioning. In this section we formulate the optimization problem and describe our assumptions about the system we operate in. Balanced k-way graph partitioning: we are given an undirected graph G = (V, E), where V is the set of nodes (vertices) and E is the set of edges. A k-way partitioning divides V into k subsets. Intuitively, in a good partitioning the number of edges that cross the boundaries of components is minimized. This is referred to as the min-cut problem in graph theory. Balanced (uniform) partitioning refers to the problem of partitioning the graph into equal-sized components. The equal-size constraint can be softened by requiring that the partition sizes differ only by a factor of a small ε. A k-way partitioning can be given with the help of a partition function π.
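The objective and the softened balance constraint can be written down directly. The following is a minimal sketch (function names are illustrative): π is represented as a dictionary mapping each node to its piece, and the ε-softened constraint caps every piece at (1 + ε) times the ideal size.

```python
from math import ceil

def is_balanced(pi, k, eps):
    """Softened balance constraint: every piece holds at most
    (1 + eps) * ceil(n / k) nodes, where pi maps node -> piece index."""
    n = len(pi)
    cap = (1 + eps) * ceil(n / k)
    counts = [0] * k
    for p in pi.values():
        counts[p] += 1
    return all(c <= cap for c in counts)

def edge_cut(edges, pi):
    """Number of edges crossing piece boundaries -- the quantity to minimize."""
    return sum(1 for u, v in edges if pi[u] != pi[v])

# Example: a 6-node path graph split into k = 3 pieces of size 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
pi = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(is_balanced(pi, k=3, eps=0.0))  # True
print(edge_cut(edges, pi))            # 2: edges (1,2) and (3,4) are cut
```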
The algorithm can take advantage of the case when a computer hosts more than one graph node. We call this the one-host-multiple-nodes model. Here, nodes on the same host can benefit from shared memory on that host. For example, if a node exchanges some information with other nodes on the same host, the communication cost is negligible. However, information exchange across hosts is costly and constitutes the main body of the communication overhead. This model is interesting for data centers or cloud environments, where each computer can emulate thousands of nodes at the same time.
The basic idea
Recall that we defined the energy of the system as the number of edges between nodes with different colors, and the energy of a node as the number of its neighbours with a different color. The basic idea is to initialize colors uniformly at random, and then to apply heuristic local search to push the configuration towards lower energy states (min-cut). The local search operator is executed by all the graph nodes in parallel: each node attempts to change its color to the most dominant color among its neighbours. However, in order to preserve the size of the partitions, the nodes cannot change their color independently. Instead, they only swap their color with one another. Each node iteratively selects another node from either its neighbours or a random sample, and investigates the pair-wise utility of a color exchange. If the color exchange decreases the energy, the two nodes swap their colors; otherwise, they keep their colors. When applying local search, the key problem is to ensure that the algorithm does not get stuck in a local optimum. For this purpose, we employ the simulated annealing technique.
Note that, since no color is added to/removed from the graph, the distribution of colors is preserved during the course of optimization. Hence, if the initial random coloring of the graph is uniform, we will have balanced partitions at each step. We stress that this is a heuristic algorithm, so it cannot be proven (or, in fact, expected) that the globally minimal energy value is achieved. Exact algorithms are not feasible since the problem is NP-complete, so we cannot compute the minimum edge-cut in a reasonable time, even with a centralized solution and a complete knowledge of the graph. In Section V-F, however, we compare our results with the best known partitioning over a number of benchmark problem instances.
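The basic idea can be sketched as follows. This is a hypothetical simplification, not the paper's exact algorithm: the acceptance rule and cooling schedule below merely illustrate how a temperature above 1 lets some worse swaps through early on, and the constants (`rounds`, `t_start`) are illustrative.

```python
import random

def energy(v, adj, color):
    """Energy of a node: number of its neighbours with a different color."""
    return sum(1 for u in adj[v] if color[u] != color[v])

def total_energy(adj, color):
    """Energy of the system: edges between differently colored nodes."""
    return sum(energy(v, adj, color) for v in adj) // 2

def jabeja_sketch(adj, k, rounds=50, t_start=2.0, seed=1):
    """Simplified JA-BE-JA-style local search: start from a balanced random
    coloring, then let nodes swap colors pairwise when the swap lowers their
    combined energy; a cooling temperature accepts some worse swaps early to
    escape local optima. Because colors are only ever swapped, the color
    distribution -- and hence balance -- is preserved at every step."""
    rng = random.Random(seed)
    nodes = list(adj)
    color = {v: i % k for i, v in enumerate(nodes)}  # balanced initial coloring
    for r in range(rounds):
        temp = max(1.0, t_start - (t_start - 1.0) * r / rounds)  # cool toward 1
        for v in nodes:
            u = rng.choice(nodes)  # candidate partner from a random sample
            if color[u] == color[v]:
                continue
            before = energy(v, adj, color) + energy(u, adj, color)
            color[v], color[u] = color[u], color[v]
            after = energy(v, adj, color) + energy(u, adj, color)
            if not after < before * temp:  # not an (annealed) improvement
                color[v], color[u] = color[u], color[v]  # undo the swap
    return color

# Two triangles joined by one edge; the ideal 2-way cut is that edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
color = jabeja_sketch(adj, k=2)
print(total_energy(adj, color))  # remaining cut edges after local search
```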
Distributed Graph-cuts
Graph-cuts are widely used in computer vision. In order to speed up the optimization process and improve scalability for large graphs, a splitting method divides a graph into multiple subgraphs for parallel computation in both shared and distributed memory models. However, this parallel algorithm (the parallel BK-algorithm) does not have a polynomial bound on the number of iterations and is found to be non-convergent in some cases, due to the possible multiple optimal solutions of its sub-problems.
Graph-cuts optimization plays an important role in solving the energy minimization problem, which is usually derived from a Maximum A Posteriori (MAP) estimation of a Markov Random Field (MRF). As a consequence, how to increase the scalability and further speed up the optimization process of graph-cuts becomes an urgent task. On the other hand, to manage CPU power dissipation, processor makers favour multi-core chip designs. Therefore, fully exploiting the modern multi-core/multi-processor computer architecture to further boost the efficiency and scalability of graph-cuts has attracted much attention recently.
Perhaps the most widely used graph-cuts solver in the computer vision community is the algorithm proposed by Boykov and Kolmogorov (the BK-algorithm). This is a serial augmenting-path max-flow algorithm which effectively reuses the two search trees originating from s and t respectively. In order to parallelize the BK-algorithm for further efficiency, the graph is usually split into multiple parts, either disjoint or overlapping. Liu and Sun uniformly partitioned the graph into a number of disjoint subgraphs, concurrently ran the BK-algorithm in each subgraph to build short-range search trees within each subgraph, then adaptively merged adjacent subgraphs to enlarge the search range until only one subgraph remained and all augmenting paths were found. For some 2D image segmentation cases, this algorithm achieves a near-linear speedup with up to 4 computational threads.
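The augmenting-path idea behind the BK-algorithm, and the max-flow/min-cut duality that graph-cuts solvers exploit, can be illustrated with a plain Edmonds-Karp sketch. Note this is not the BK-algorithm itself: BK grows and reuses two search trees from s and t, whereas this version restarts a BFS for every augmenting path; the network below is a made-up toy example.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow: repeatedly find a shortest augmenting path by
    BFS in the residual graph and push the bottleneck flow along it. `cap` is
    a dense capacity matrix, modified in place to hold residual capacities."""
    flow, n = 0, len(cap)
    while True:
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            v = q.popleft()
            for u in range(n):
                if cap[v][u] > 0 and parent[u] == -1:
                    parent[u] = v
                    q.append(u)
        if parent[t] == -1:  # no augmenting path left: the flow is maximal
            return flow, parent
        bottleneck, v = float("inf"), t
        while v != s:  # find the smallest residual capacity on the path
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:  # push flow: reduce forward, grow reverse residuals
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

# Tiny s-t network: node 0 = s, node 3 = t.
cap = [[0, 3, 2, 0],
       [0, 0, 1, 2],
       [0, 0, 0, 2],
       [0, 0, 0, 0]]
flow, parent = max_flow(cap, 0, 3)
print(flow)  # 4: by duality, equal to the capacity of the minimum s-t cut
# Nodes still reachable from s in the residual graph form the s-side of the cut.
s_side = {v for v in range(4) if parent[v] != -1}
print(s_side)  # {0, 1, 2}
```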
However, this method requires a shared-memory model, which makes it difficult to use on distributed platforms. To make the algorithm applicable to both shared and distributed models, Strandmark and Kahl proposed a new parallel and distributed BK-algorithm (the parallel BK-algorithm) which splits the graph into overlapping subgraphs based on dual decomposition. The BK-algorithm is then run in a parallel and iterative fashion. Unfortunately, due to the possible multiple optimal solutions of the BK-algorithm, an inherent characteristic of all graph-cuts methods, the parallel BK-algorithm may fail to converge in some cases.
To remedy the non-convergence problem in the parallel BK-algorithm, we first introduced a merging method, along with the naive converged parallel BK-algorithm, which simply merges every two neighbouring subgraphs whenever the algorithm fails to converge under the current subgraph splitting configuration. We then introduced a new pseudo-boolean representation for graph cuts, namely the restricted homogeneous posiforms, developed an invariance analysis method for graph-cuts algorithms, and proved the correctness and effectiveness of the proposed merging method. Based on the merging method, we proposed a general parallelization framework called the dynamic parallel graph cuts algorithm, which allows the subgraph splitting configuration to be adjusted freely during the parallelization process while guaranteeing convergence. Extensive experiments on image segmentation and semantic segmentation with different numbers of computational threads demonstrated that, even under the simplest merging strategy, the naive converged parallel BK-algorithm not only has guaranteed convergence but also has computational efficiency competitive with the parallel BK-algorithm.