21-12-2012, 04:55 PM
A Gossip Protocol for Dynamic Resource Management in Large Cloud Environments
Abstract
We address the problem of dynamic resource
management for a large-scale cloud environment. Our
contribution includes outlining a distributed middleware
architecture and presenting one of its key elements: a gossip
protocol that (1) ensures fair resource allocation among
sites/applications, (2) dynamically adapts the allocation to
load changes and (3) scales both in the number of physical
machines and sites/applications. We formalize the resource
allocation problem as that of dynamically maximizing
the cloud utility under CPU and memory constraints.
We first present a protocol that computes an optimal
solution without considering memory constraints and prove
correctness and convergence properties. Then, we extend
that protocol to provide an efficient heuristic solution
for the complete problem, which includes minimizing the
cost for adapting an allocation. The protocol continuously
executes on dynamic, local input and does not require
global synchronization, as other proposed gossip protocols
do. We evaluate the heuristic protocol through simulation
and find its performance to be well-aligned with our design
goals.
INTRODUCTION
We consider the problem of resource management
for a large-scale cloud environment. Such an
environment includes the physical infrastructure and
associated control functionality that enables the provisioning
and management of cloud services. While
our contribution is relevant in a more general context,
we conduct the discussion from the perspective of the
Platform-as-a-Service (PaaS) concept, with the specific
use case of a cloud service provider that hosts sites in a
cloud environment. The stakeholders for this use case are
depicted in Figure 1a. The cloud service provider owns
and administers the physical infrastructure, on which
cloud services are provided.
SYSTEM ARCHITECTURE
Datacenters running a cloud environment often contain
a large number of machines that are connected by
a high-speed network. Users access sites hosted by the
cloud environment through the public Internet. A site is
typically accessed through a URL that is translated to a
network address through a global directory service, such
as DNS. A request to a site is routed through the Internet
to a machine inside the datacenter that either processes
the request or forwards it.
Figure 2 (left) shows the architecture of the cloud
middleware. The components of the middleware layer
run on all machines. The resources of the cloud are
primarily consumed by module instances; the
functionality of a site is made up of one or more
modules. In the middleware, a module either contains
part of the service logic of a site (denoted by mi in
Figure 2) or a site manager (denoted by SMi).
FORMALIZING THE PROBLEM OF RESOURCE
ALLOCATION BY THE CLOUD MIDDLEWARE
For this work, we consider a cloud as having computational
resources (i.e., CPU) and memory resources,
which are available on the machines in the cloud infrastructure.
As explained earlier, we restrict the discussion
to the case where all machines belong to a single cluster
and cooperate as peers in the task of resource allocation.
The specific problem we address is that of placing
modules (more precisely: identical instances of modules)
on machines and allocating cloud resources to these
modules, such that a cloud utility is maximized under
constraints. As cloud utility we choose the minimum
utility generated by any site, which we define as the
minimum utility of its module instances. We formulate
the resource allocation problem as that of maximizing
the cloud utility under CPU and memory constraints.
The solution to this problem is a configuration matrix
that controls the module scheduler and the request forwarder
components. At discrete points in time, events
occur, such as demand changes and the addition and
removal of sites or machines.
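This formulation can be illustrated with a minimal sketch. The linear utility below (the fraction of CPU demand that is satisfied) and all the names are assumptions for illustration; the paper defines its own utility function:

```python
# Hedged sketch of the max-min objective described above. The linear
# utility (fraction of CPU demand satisfied) is an assumption for
# illustration, not the paper's exact utility function.

def module_utility(cpu_alloc: float, cpu_demand: float) -> float:
    """Utility of one module instance: fraction of its demand satisfied."""
    return cpu_alloc / cpu_demand if cpu_demand > 0 else 1.0

def site_utility(instances) -> float:
    """A site's utility is the minimum utility of its module instances."""
    return min(module_utility(a, d) for a, d in instances)

def cloud_utility(sites) -> float:
    """The cloud utility is the minimum utility generated by any site."""
    return min(site_utility(inst) for inst in sites.values())

sites = {
    "site1": [(2.0, 4.0), (3.0, 3.0)],  # (CPU allocated, CPU demanded)
    "site2": [(1.0, 1.0)],
}
print(cloud_utility(sites))  # site1's first instance is the bottleneck: 0.5
```

The configuration matrix mentioned above would determine the allocations that feed into such a computation.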
A PROTOCOL FOR DISTRIBUTED RESOURCE
ALLOCATION
In this section, we present our protocol for resource
allocation in a cloud environment, which we call P*. It
is based on a heuristic algorithm for solving OP(2) and
is implemented in the form of a gossip protocol.
As a gossip protocol, P* has the structure of a round-based
distributed algorithm (whereby round-based does
not imply that the protocol is synchronous). When executing
a round-based gossip protocol, each node selects
a subset of other nodes to interact with, whereby the
selection function is often probabilistic. Nodes interact
via ‘small’ messages, which are processed and trigger
local state changes. Node interaction in P* follows
the so-called push-pull paradigm, whereby two nodes
exchange state information, process this information and
update their local states during a round.
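The round structure described above can be sketched as follows. This is a generic push-pull skeleton under simplifying assumptions: the scalar state and the averaging merge rule are placeholders, not the update rule of P*, which exchanges resource-allocation state instead:

```python
# Generic skeleton of a round-based push-pull gossip protocol. The
# averaging merge rule below is a placeholder, not the rule of P*.
import random

random.seed(0)  # deterministic peer selection for this demonstration

class Node:
    def __init__(self, node_id: int, state: float):
        self.node_id = node_id
        self.state = state  # local state; in P*, demands and allocations

    def merge(self, peer_state: float) -> None:
        # Placeholder local state change triggered by a received message.
        self.state = (self.state + peer_state) / 2.0

def gossip_round(nodes) -> None:
    """Each node selects a random peer; both exchange state and update."""
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        s_node, s_peer = node.state, peer.state  # the 'small' messages
        node.merge(s_peer)   # push-pull: both sides process the
        peer.merge(s_node)   # received state and update locally

nodes = [Node(i, float(i)) for i in range(8)]
for _ in range(20):
    gossip_round(nodes)
# With an averaging rule, states converge toward the global mean (3.5).
```

Note that nothing here is synchronized globally: each node only needs one randomly chosen peer per round, which is what lets such protocols scale.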
The Protocol and its Analysis
In this subsection, we present P’, a simplified version
of P* that ignores the memory constraints and the cost
of change in configuration. It assumes that a machine
can choose another machine uniformly at random for
interaction. P’ approximates P* well in cases where
the memory demand is significantly smaller than the
available memory capacity and the cost of change is
small. P’ can be analyzed with respect to correctness and
convergence, whereas a similar analysis of P* seems to
be very hard given the current state-of-the-art.
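The kind of pairwise step P’ builds on can be illustrated as follows. This is a hedged sketch, not the paper's exact rule: module demand is treated as divisible, and memory constraints and reconfiguration cost are ignored, as in P’. Two interacting machines shift demand from the heavier to the lighter one until their aggregate loads are equal; repeating this over random pairs drives all machines toward the same load:

```python
# Illustrative equalization step between two interacting machines.
# This is an assumption-laden sketch, not the exact rule of P'.

def equalize(a: dict, b: dict):
    """Shift module demand from the heavier machine to the lighter one
    until both carry the same total; demand is treated as divisible."""
    if sum(a.values()) < sum(b.values()):
        b, a = equalize(b, a)  # make a the heavier machine
        return a, b
    excess = (sum(a.values()) - sum(b.values())) / 2.0
    for mod in list(a):
        if excess <= 0:
            break
        move = min(a[mod], excess)      # move (part of) this module's demand
        a[mod] -= move
        b[mod] = b.get(mod, 0.0) + move
        excess -= move
    return a, b

m_heavy = {"m1": 6.0, "m2": 2.0}  # total demand 8
m_light = {"m3": 2.0}             # total demand 2
a, b = equalize(m_heavy, m_light)
# Both machines now carry a total demand of 5.0.
```

Under the stated assumptions, each such step preserves total demand while reducing the load imbalance between the pair, which is the intuition behind the convergence analysis of P’.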
A HEURISTIC SOLUTION TO OP(2)
In this subsection, we present the protocol P*, a
distributed heuristic algorithm to solve OP(2). P* can
be seen as an extension of P’. Recall that OP(2) differs
from OP(1) in that it considers memory constraints
and includes the secondary objective of minimizing the
cost of changing from the current configuration to a new one.
Considering the memory constraints for each machine
turns OP(1) into an NP-hard optimization problem [22].
P* is an asynchronous protocol. This means that a
machine does not synchronize the start time of a protocol
round with any other machine of the cloud. At the
beginning of a round (more precisely, at the start of the
loop of the active or passive thread), a machine reads
the current demands of the modules it runs. At the end
of a round (more precisely, at the end of the loop of the
active or passive thread) a machine updates its part of
the configuration matrix A. The matrix A thus changes
dynamically and asynchronously during the evolution of
the system.
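The round boundaries described above can be sketched like this. The dict-per-machine representation of a row of A and the proportional allocation rule are assumptions for illustration, not the P* heuristic; in the actual system, A controls the module scheduler and request forwarder:

```python
# Sketch of one asynchronous round: read local demands at the start,
# update this machine's part of the configuration matrix A at the end.
# The proportional-share rule is a placeholder, not the P* heuristic.

CAPACITY = 10.0  # CPU capacity of this machine (illustrative)

def read_demands(modules: dict) -> dict:
    """Start of a round: read the current demands of the local modules."""
    return dict(modules)  # in a real system, sampled from monitoring

def compute_allocation(demands: dict, capacity: float) -> dict:
    """Placeholder rule: split capacity in proportion to demand."""
    total = sum(demands.values())
    return {m: capacity * d / total for m, d in demands.items()}

def run_round(row_of_A: dict, modules: dict) -> None:
    demands = read_demands(modules)   # beginning of the round
    new_row = compute_allocation(demands, CAPACITY)
    row_of_A.clear()                  # end of the round: update this
    row_of_A.update(new_row)          # machine's part of the matrix A

row = {}
run_round(row, {"m1": 3.0, "m2": 1.0})
print(row)  # {'m1': 7.5, 'm2': 2.5}
```

Because each machine updates only its own row of A, rounds on different machines can proceed without any global synchronization, which matches the asynchronous design described above.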
RELATED WORK
The problem of resource management we address in
this paper is related to two lines of research, namely,
to that of application placement and to that of load
balancing in processor networks.
Application placement in datacenters is often modelled
through mapping a set of applications onto a set
of machines such that some utility function is maximized
under resource constraints. This approach has
been taken, e.g., in [18], [21], and solutions from these
works have been incorporated in middleware products
[6]. While these product solutions, in a similar way as
our scheme does, allow for computing an allocation that
maximizes the utility, they rely on centralized architectures,
which do not scale to system sizes we consider in
this paper.
The work in [25], which has been extended by [26],
presents a distributed middleware for application placement
in datacenters. As in this paper, the goal of that
work is to maximize a cluster utility under changing
demand, although a different concept of utility is used.
The choice of utility functions in that work is such that
service differentiation works very well under overload
conditions, with the risk of starving unpopular applications.
In contrast, our approach guarantees that every
module receives its fair share of the CPU resources of
the cloud, and that in underload conditions all modules
are guaranteed to have satisfied demands. The proposed
design in [25], [26] scales with the number of machines,
but it does not scale in the number of applications, as the
design in this paper does. (The concept of an application
in the referenced work roughly corresponds to the concept
of a site in this paper.)
DISCUSSION AND CONCLUSION
With this paper, we make a significant contribution
towards engineering a resource management middleware
for cloud environments. We identify a key component
of such a middleware and present a protocol that can be
used to meet our design goals for resource management:
fairness of resource allocation with respect to sites,
efficient adaptation to load changes and scalability of
the middleware layer in terms of both the number of
machines in the cloud as well as the number of hosted
sites/applications.
We presented a gossip protocol P* that computes, in a
distributed and continuous fashion, a heuristic solution to
the resource allocation problem for a dynamically changing
resource demand. We evaluated the performance of
this protocol through simulation. In all the scenarios we
investigated, the protocol achieves the three qualitative
design goals given in Section I. For instance, regarding
fairness, the protocol performs close to an ideal system
for scenarios where the ratio of the total memory
capacity to the total memory demand is large. More
importantly, the simulations suggest that the protocol
is scalable in the sense that none of the investigated
metrics change when the system size (i.e., the number of
machines) increases in proportion to the external load
(i.e., the number of sites).