Automatic Reconfiguration for Large-Scale Reliable Storage Systems
Abstract
Byzantine-fault-tolerant replication enhances the availability and reliability of Internet services that store critical state and
preserve it despite attacks or software errors. However, existing Byzantine-fault-tolerant storage systems either assume a static set of
replicas, or have limitations in how they handle reconfigurations (e.g., in terms of the scalability of the solutions or the consistency
levels they provide). This can be problematic in long-lived, large-scale systems where system membership is likely to change during
the system lifetime. In this paper, we present a complete solution for dynamically changing system membership in a large-scale
Byzantine-fault-tolerant system. We present a service that tracks system membership and periodically notifies other system nodes of
membership changes. The membership service runs mostly automatically, to avoid human configuration errors; is itself Byzantine-fault-tolerant and reconfigurable; and provides applications with a sequence of consistent views of the system membership. We
demonstrate the utility of this membership service by using it in a novel distributed hash table called dBQS that provides atomic
semantics even across changes in replica sets. dBQS is interesting in its own right because its storage algorithms extend existing
Byzantine quorum protocols to handle changes in the replica set, and because it differs from previous DHTs by providing Byzantine
fault tolerance and offering strong semantics. We implemented the membership service and dBQS. Our results show that the approach
works well in practice: the membership service is able to manage a large system, and the cost of changing the system membership is
low.
INTRODUCTION
Today, we are increasingly dependent on Internet
services, which provide important functionality and
store critical state. These services are often implemented on
collections of machines residing at multiple geographic
locations such as a set of corporate data centers. For
example, Dynamo uses tens of thousands of servers
located in many data centers around the world to build
a storage back-end for Amazon’s S3 storage service and its
e-commerce platform [1]. As another example, in Google’s
cluster environment each cluster includes an installation of
the GFS file system spanning thousands of machines to
provide a storage substrate [2].
Additionally, these systems are long lived and need to
continue to function even though the machines they run on
break or are decommissioned. Thus, failed nodes must be replaced with new machines; it is also necessary to add machines to the system for increased storage or throughput. The systems therefore need to be reconfigured regularly so that they can continue to function.
THE MEMBERSHIP SERVICE
This section describes the membership service (MS), which
provides a trusted source of membership information.
The MS describes membership changes by producing a
configuration, which identifies the set of servers currently in
the system, and sending it to all servers. To allow the
configuration to be exchanged among nodes without
possibility of forgery, the MS authenticates it using a
signature that can be verified with a well-known public key.
The MS produces configurations periodically rather than
after every membership change. The system moves through a succession of time intervals called epochs, and all configuration changes are batched at the end of each epoch. Producing
configurations periodically is a key design decision. It
allows applications that use the MS to be optimized for
long periods of stability (we expect that in storage
applications epochs could last for hours, although our
evaluation shows that we can support short epochs if
needed), and it reduces costs associated with propagating
membership changes (like signing configurations or transmitting
them).
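As an illustration of this design, the following sketch shows how a per-epoch configuration might be packaged and authenticated. The field and function names are ours, not the paper's, and an HMAC stands in for the asymmetric signature that the real MS would produce with a key pair whose public half is well known:

```python
import hashlib
import hmac
import json

# Stand-in for the MS signing key; a real deployment would use an
# asymmetric scheme (e.g. RSA or Ed25519) so that any node can verify
# with the well-known public key without being able to forge.
MS_KEY = b"well-known-ms-key"

def make_configuration(epoch: int, servers: list[str]) -> dict:
    """Batch all membership changes of an epoch into one signed configuration."""
    body = {"epoch": epoch, "servers": sorted(servers)}
    payload = json.dumps(body, sort_keys=True).encode()
    sig = hmac.new(MS_KEY, payload, hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}

def verify_configuration(config: dict) -> bool:
    """Any node checks the authenticator before adopting a configuration."""
    payload = json.dumps(config["body"], sort_keys=True).encode()
    expected = hmac.new(MS_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, config["sig"])

cfg = make_configuration(epoch=7, servers=["10.0.0.2:5000", "10.0.0.1:5000"])
assert verify_configuration(cfg)
```

Because configurations are produced once per epoch rather than per change, the signing and transmission cost is amortized over all membership events in that epoch.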
MS Functionality
Membership Change Requests
The MS responds to requests to add and remove servers.
We envision a managed environment with admission control since, otherwise, the system would be vulnerable to a Sybil attack [19], in which an adversary floods the system with malicious servers. Thus, we assume servers are added
by a trusted authority that signs certificates used as
parameters to these requests. The certificate for an ADD
request contains the network address and port number of
the new server, as well as its public key, whereas the
certificate to REMOVE a node identifies the node whose
membership is revoked using its public key.
The MS assigns each server a unique node ID uniformly
distributed in a large, circular ID space, which enables the
use of consistent hashing [20] to assign responsibility for
work in some of our MS protocols; applications can also use these IDs if desired. The MS chooses the server's node ID as
a SHA-1 hash of the values in the add certificate. To prevent
an attacker from adding a group of servers that are all
nearby in the ID space, we require that the node’s public
key be chosen by the trusted authority.
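A minimal sketch of this ID assignment and of consistent hashing over the circular ID space follows. The function names and the exact encoding of the certificate fields are our assumptions for illustration; only the use of SHA-1 over the add-certificate values is stated by the text:

```python
import hashlib

def node_id(address: str, port: int, public_key: bytes) -> int:
    """Derive a 160-bit node ID as the SHA-1 hash of the add-certificate
    fields, placing the server pseudo-randomly on the circular ID space."""
    cert_fields = f"{address}:{port}:".encode() + public_key
    return int.from_bytes(hashlib.sha1(cert_fields).digest(), "big")

def responsible_servers(object_id: int, server_ids: list[int], n: int) -> list[int]:
    """Consistent hashing: an object is assigned to the n servers whose
    IDs follow it clockwise on the ring, wrapping around at the top."""
    ring = sorted(server_ids)
    start = next((i for i, sid in enumerate(ring) if sid >= object_id), 0)
    return [ring[(start + k) % len(ring)] for k in range(n)]
```

Because the trusted authority chooses the public key, an attacker cannot grind over key material to cluster its servers at a chosen point on the ring.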
Freshness
Clients of the application using the MS need to verify the
freshness of their configuration information to ensure they
are communicating with the group that currently stores an
item of interest, and not an old group (which may have
exceeded the failure threshold).
We provide freshness by means of freshness certificates.
The mechanism works as follows: To use the replicated
service the client requires an unexpired freshness certificate.
It obtains a certificate by issuing a challenge to the MS. The
challenge contains a random nonce; the MS responds by
signing the nonce and current epoch number. The response
gives the client a period of time Tfc during which it may
execute requests; the value of Tfc is another system
parameter. The client determines when the period has
expired by using its clock: it reads the time when it sends
the challenge, and treats the corresponding freshness
certificate as valid until that time plus the duration Tfc. If
the certificate expires, the client halts application work until
it obtains a new one.
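The client-side bookkeeping can be sketched as follows; the class and method names are illustrative, not from the paper. The key point is that validity is measured from the client's own clock reading taken when the challenge was sent, so neither network delay nor MS clock skew can extend a certificate's lifetime:

```python
class FreshnessCertificate:
    """Client-side view of a freshness certificate (illustrative sketch)."""

    def __init__(self, epoch: int, t_sent: float, t_fc: float):
        # t_sent: client's local clock reading when the challenge was sent.
        # t_fc:   the system parameter Tfc (validity duration).
        self.epoch = epoch
        self.expires_at = t_sent + t_fc

    def is_valid(self, now: float) -> bool:
        """The certificate is valid strictly before t_sent + Tfc."""
        return now < self.expires_at

cert = FreshnessCertificate(epoch=3, t_sent=100.0, t_fc=60.0)
assert cert.is_valid(now=150.0)      # still within Tfc of the challenge
assert not cert.is_valid(now=160.0)  # expired: client halts until renewed
```

When `is_valid` returns false, the application layer stops issuing requests and reissues a challenge (with a fresh nonce) to obtain a new certificate.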
Reconfiguring the MS
There are two plausible ways to run the MS. The first is to
use special, separate nodes that are located in particularly
secure locations. The second is to use an “open” approach
in which the MS runs on regular system members: servers
occasionally act as MS replicas, in addition to running the
application. Our system can accommodate either approach by treating the first as a special case of the open one: we can mark servers (when they are added) to indicate the roles they are allowed to assume.
At the end of the epoch, the system may decide to move
the MS to a different set of servers. This can happen because
one of the MS replicas fails; in the open approach it may
also happen proactively (every k epochs) since the nodes
running the MS are attractive targets for attack and this way
we can limit the time during which such an attack can be
launched. The steps that are needed to move the MS occur
after the old MS executes the MOVEEPOCH operation, and
are summarized in Fig. 1.
DYNAMIC REPLICATION
This section describes how storage applications (or other
services) can be extended to handle reconfigurations using
the membership service. In particular, we present dBQS, a
read/write block storage system based on Byzantine
quorums [25]. dBQS uses input from the MS to determine
when to reconfigure.
dBQS is a strongly consistent distributed hash table
(DHT) that provides two types of objects. Public-key objects
are mutable and can be written by multiple clients, whereas
content-hash objects are immutable: once created, a content-hash object cannot change. In this paper, we describe only
public-key objects. In a separate document [21], we describe
the simpler protocols for content-hash objects, and also
dBFT, a state machine replication system based on the PBFT
algorithm, and a general methodology for how to transform
static replication algorithms into algorithms that handle
membership changes. A complete, formal description of the
main protocols used in dBQS and a proof of their
correctness can be found in a technical report [26].
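Although the content-hash protocols are deferred to [21], the property stated above, immutability, admits a standard self-certifying construction common to DHTs; we sketch it here as an assumption about dBQS rather than a confirmed detail of its protocol:

```python
import hashlib

def content_hash_id(data: bytes) -> bytes:
    """A content-hash object's ID is the hash of its contents (assumed
    construction): the object is immutable by definition, since changing
    the bytes changes the ID."""
    return hashlib.sha1(data).digest()

def verify_fetch(object_id: bytes, data: bytes) -> bool:
    """A client accepts a fetched block only if it hashes to the ID it
    requested, so even a Byzantine replica cannot substitute contents."""
    return content_hash_id(data) == object_id
```

This self-verification is why content-hash objects need much simpler protocols than mutable public-key objects, which require quorum reads and writes to get atomic semantics.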
Performance during an Epoch
This section evaluates the performance of the MS during an
epoch, and the impact of superimposing the service on
dBQS servers. A more detailed evaluation of the base
performance of dBQS can be found in [21].
Three types of membership activities happen during an
epoch: processing of membership events, such as node
additions and deletions; handling of freshness certificates;
and probing of system members. The first two are not a
major performance concern. We assume a deployment in
which membership events happen only occasionally;
processing of these events requires the use of PBFT, but
previous work [34] shows that this cost is modest for
reasonable values of f. Renewals of freshness certificates
are not a problem because the certificates are refreshed
infrequently and can be aggregated.
RELATED WORK
We begin by discussing prior work on systems like our MS
that provide membership control. Then, we discuss work on
replicated systems, like dBQS, that support a dynamic set of
replicas. At the end of the section, we discuss other examples
of large-scale Byzantine-fault-tolerant storage systems.
Membership Control
The membership service has the same goals as the group
membership modules present in group communication
systems, and our concepts of configurations and epochs are
equivalent to the notions of process groups and views
introduced in the virtual synchrony model [4]. The initial
work on group communication systems only tolerated crash
failures. Byzantine failures are handled by the Rampart [35]
and Secure Ring [36] systems. Adding and removing processes in these systems is a heavyweight operation: all nodes in the system execute the three-phase Byzantine agreement protocol introduced by these systems [6], which scales poorly with system size. We get around this
limitation by treating most nodes in the system as clients,
and using only a small subset of system nodes to carry out
the protocol. Thus, our solution is scalable in the number of system nodes, which are only clients of the protocols.
CONCLUSION
This paper presents a complete solution for building large
scale, long-lived systems that must preserve critical state in
spite of malicious attacks and Byzantine failures. We
present a storage service with these characteristics called
dBQS, and a membership service that is part of the overall
system design, but can be reused by any Byzantine-fault-tolerant large-scale system.