ABSTRACT
Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants, poses serious challenges for operating system structures.
We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.
We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.
1. INTRODUCTION
Computer hardware is changing and diversifying faster than system software. A diverse mix of cores, caches, interconnect links, IO devices and accelerators, combined with increasing core counts, leads to substantial scalability and correctness challenges for OS designers.
2. MOTIVATIONS
Most computers built today have multicore processors, and future core counts will increase [12]. However, commercial multiprocessor servers already scale to hundreds of processors with a single OS image, and handle terabytes of RAM and multiple 10Gb network connections. Do we need new OS techniques for future multicore hardware, or do commodity operating systems simply need to exploit techniques already in use in larger multiprocessor systems?
In this section, we argue that the challenges facing a general-purpose operating system on future commodity hardware are different from those associated with traditional ccNUMA and SMP machines used for high-performance computing. In a previous paper [8] we argued that single computers increasingly resemble networked systems, and should be programmed as such. We rehearse that argument here, but also lay out additional scalability challenges for general-purpose system software.
2.1 Systems are increasingly diverse
A general-purpose OS today must perform well on an increasingly diverse range of system designs, each with different performance characteristics [60]. This means that, unlike large systems for high-performance computing, such an OS cannot be optimized at design or implementation time for any particular hardware configuration.
To take one specific example: Dice and Shavit show how a reader-writer lock can be built to exploit the shared, banked L2 cache on the Sun Niagara processor [40], using concurrent writes to the same cache line to track the presence of readers [20]. On Niagara this is highly efficient: the line remains in the L2 cache. On a traditional multiprocessor, it is highly inefficient: the line ping-pongs between the readers' caches.
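To make the tradeoff concrete, the following is a minimal C sketch of a reader-indicator lock in the same spirit: readers announce themselves by writing to per-reader byte slots deliberately packed into one cache line. This is an illustration of the idea only, not Dice and Shavit's published algorithm; on a chip with a shared L2 cache the announcing writes stay on-chip, while on a traditional multiprocessor the same line migrates between caches.

/* Sketch of a reader-indicator reader-writer lock (illustrative only):
 * all reader slots share a single cache line, so tracking readers costs
 * concurrent writes to that one line. */
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_READERS 64          /* 64 one-byte slots = one 64-byte line */

struct rw_lock {
    _Alignas(64) atomic_char reader_slot[MAX_READERS];
    atomic_bool writer;         /* set while a writer holds the lock */
};

static void read_lock(struct rw_lock *l, int id)
{
    for (;;) {
        atomic_store(&l->reader_slot[id], 1);   /* announce this reader */
        if (!atomic_load(&l->writer))           /* no writer: we hold it */
            return;
        atomic_store(&l->reader_slot[id], 0);   /* back off and retry */
        while (atomic_load(&l->writer))
            ;                                   /* spin until writer leaves */
    }
}

static void read_unlock(struct rw_lock *l, int id)
{
    atomic_store(&l->reader_slot[id], 0);
}

static void write_lock(struct rw_lock *l)
{
    bool expected = false;
    while (!atomic_compare_exchange_weak(&l->writer, &expected, true))
        expected = false;                       /* exclude other writers */
    for (int i = 0; i < MAX_READERS; i++)       /* wait for readers to drain */
        while (atomic_load(&l->reader_slot[i]))
            ;
}

static void write_unlock(struct rw_lock *l)
{
    atomic_store(&l->writer, false);
}

Whether this layout is a win or a loss depends entirely on whether the readers' writes land in a cache shared by all of them, which is exactly the hardware-dependence the example is meant to illustrate.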
This illustrates a general problem: any OS design tied to a particular synchronization scheme (and data layout policy) cannot exploit features of different hardware. Current OS designs optimize for the common hardware case; this becomes less and less efficient as hardware becomes more diverse. Worse, when hardware vendors introduce a design that offers a new opportunity for optimization, or creates a new bottleneck for existing designs, adapting the OS to the new environment can be prohibitively difficult.
Even so, operating systems are forced to adopt increasingly complex optimizations [27,46,51,52,57] in order to make efficient use of modern hardware. The recent scalability improvements to Windows 7 to remove the dispatcher lock touched 6000 lines of code in 58 files and have been described as "heroic" [58]. The Linux read-copy update implementation required numerous iterations due to feature interaction [50]. Backporting receive-side-scaling support to Windows Server 2003 caused serious problems with multiple other network subsystems including firewalls, connection-sharing and even Exchange servers.
2.2 Cores are increasingly diverse
Diversity is not merely a challenge across the range of commodity machines. Within a single machine, cores can vary, and the trend is toward a mix of different cores. Some will have the same instruction set architecture (ISA) but different performance characteristics [34,59], since a processor with large cores will be inefficient for readily parallelized programs, but one using only small cores will perform poorly on the sequential parts of a program [31,42]. Other cores have different ISAs for specialized functions [29], and many peripherals (GPUs, network interfaces, and other, often FPGA-based, specialized processors) are increasingly programmable.
Current OS designs draw a distinction between general-purpose cores, which are assumed to be homogeneous and run a single, shared instance of a kernel, and peripheral devices accessed through a narrow driver interface. However, we are not the only researchers to see an increasing need for OSes to manage the software running on such cores much as they manage CPUs today [55]. Moreover, core heterogeneity means cores can no longer share a single OS kernel instance, either because the performance tradeoffs vary, or because the ISA is simply different.
2.3 The interconnect matters
Even for contemporary cache-coherent multiprocessors, message-passing hardware has replaced the single shared interconnect [18,33] for scalability reasons. Hardware thus resembles a message-passing network, as in the interconnect topology of the commodity PC server in Figure 2. While on most current hardware the cache-coherence protocol between CPUs ensures that the OS can continue to safely assume a single shared memory, networking problems such as routing and congestion, long familiar from large-scale multiprocessors, are becoming concerns inside commodity machines as well.
2.4 Messages cost less than shared memory
In 1978 Lauer and Needham argued that message-passing and shared-memory operating systems are duals, and the choice of one model over another depends on the machine architecture on which the OS is built [43]. Of late, shared-memory systems have been the best fit for PC hardware in terms of both performance and good software engineering, but this trend is reversing. We can see evidence of this by an experiment that compares the costs of updating a data structure using shared memory with the costs using message passing. The graph in Figure 3 plots latency against number of cores for updates of various sizes on the 4×4-core AMD system (described in Section 4.1).
In the shared memory case, threads pinned to each core directly update the same small set of memory locations (without locking) and the cache-coherence mechanism migrates data between caches as necessary. The curves labeled SHM1–8 show the latency per operation (in cycles) for updates that directly modify 1, 2, 4 and 8 shared cache lines respectively. The costs grow approximately linearly with the number of threads and the number of modified cache lines. Although a single core can perform the update operation in under 30 cycles, when 16 cores are modifying the same data it takes almost 12,000 extra cycles to perform each update. All of these extra cycles are spent with the core stalled on cache misses and therefore unable to do useful work while waiting for an update to occur.
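The shared-memory side of such an experiment can be sketched as below. This is an illustrative C program with hypothetical constants, not the paper's actual benchmark code; thread pinning and warm-up are omitted for brevity, and the timestamp counter is read with x86 rdtsc.

/* Sketch: NTHREADS threads update the same NLINES cache lines with no
 * locking; each write may force the line to migrate from another core's
 * cache, which is where the per-operation cost comes from. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64
#define NLINES     8            /* corresponds to the SHM8 curve */
#define NTHREADS   16
#define NOPS       100000

struct line { _Alignas(CACHE_LINE) volatile uint64_t word; };
static struct line shared[NLINES];

static inline uint64_t rdtsc(void)              /* x86 cycle counter */
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static void *worker(void *arg)
{
    uint64_t start = rdtsc();
    for (int op = 0; op < NOPS; op++)
        for (int i = 0; i < NLINES; i++)
            shared[i].word = op;                /* coherence traffic here */
    printf("cycles/op: %llu\n",
           (unsigned long long)((rdtsc() - start) / NOPS));
    return arg;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}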
In the case of message passing, client threads issue a lightweight remote procedure call [10] (which we assume fits in a 64-byte cache line) to a single server process that performs the update on their behalf. The curves labeled MSG1 and MSG8 show the cost of this synchronous RPC to the dedicated server thread. As expected, the cost varies little with the number of modified cache lines since they remain in the server's local cache. Because each request is likely to experience some queuing delay at the server proportional to the number of clients, the elapsed time per operation grows linearly with the number of client threads. Despite this, for updates of four or more cache lines, the RPC latency is lower than shared memory access (SHM4 vs. MSG8). Furthermore, with an asynchronous or pipelined RPC implementation, the client processors can avoid stalling on cache misses and are free to perform other operations.
The final curve, labeled Server, shows time spent performing each update operation as measured at the server end of the RPC channel. Since it excludes queuing delay, this cost is largely independent of the number of threads (and in fact decreases initially once there are enough outstanding client requests to keep the server 100% utilized). The cache efficiency of the single server is such that it can perform twice as many updates per second as all 16 shared-memory threads combined. The per-operation cost is dominated by message send and receive, and since these costs are symmetric at the client end, we infer that the difference between the Server and MSGn lines represents the additional cycles that would be available to the client for useful work if it were to use asynchronous messaging.
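A corresponding sketch of the message-passing side, again illustrative and using a simple sequence-number mailbox rather than the actual URPC implementation, might look like this: each client owns a cache-line-sized mailbox, and the server polls the mailboxes, applies each update to data that stays in its own cache, and replies by bumping the sequence number.

/* Sketch: synchronous RPC over per-client cache-line mailboxes
 * (hypothetical layout; names are illustrative). */
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64
#define NCLIENTS   15           /* 16 cores: 15 clients + 1 server */
#define NLINES     8

struct mailbox {
    _Alignas(CACHE_LINE) struct {
        atomic_uint_fast32_t seq;     /* odd = request pending, even = done */
        uint64_t payload[6];          /* request arguments, fits in the line */
    } slot;
};

static struct mailbox mbox[NCLIENTS];
static uint64_t server_data[NLINES];  /* touched only by the server core */

/* client side: post a request, spin until the server answers */
static void client_update(int id, uint64_t arg)
{
    struct mailbox *m = &mbox[id];
    uint_fast32_t s = atomic_load(&m->slot.seq);
    m->slot.payload[0] = arg;
    atomic_store(&m->slot.seq, s + 1);          /* publish the request */
    while (atomic_load(&m->slot.seq) == s + 1)
        ;                                       /* wait for the reply */
}

/* server side: poll mailboxes, perform updates on locally cached data */
static void server_loop(void)
{
    for (;;) {
        for (int c = 0; c < NCLIENTS; c++) {
            uint_fast32_t s = atomic_load(&mbox[c].slot.seq);
            if (s & 1) {                        /* request pending */
                for (int i = 0; i < NLINES; i++)
                    server_data[i] = mbox[c].slot.payload[0];
                atomic_store(&mbox[c].slot.seq, s + 1);   /* reply */
            }
        }
    }
}

The only lines that move between caches are the mailboxes themselves; the updated data structure never leaves the server's cache, which is why the cost at the server is nearly flat in the number of modified lines.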
This example shows scalability issues for cache-coherent shared memory on even a small number of cores. Although current OSes have point-solutions for this problem, which work on specific platforms or software systems, we believe the inherent lack of scalability of the shared memory model, combined with the rate of innovation we see in hardware, will create increasingly intractable software engineering problems for OS kernels.
2.5 Cache coherence is not a panacea
As the number of cores and the subsequent complexity of the interconnect grows, hardware cache-coherence protocols will become increasingly expensive. As a result, it is a distinct possibility that future operating systems will have to handle non-coherent memory [12,49,69], or will be able to realize substantial performance gains by bypassing the cache-coherence protocol [70].
It is already the case that programmable peripherals like NICs and GPUs do not maintain cache coherence with CPUs. Furthermore, many multicore processors have already demonstrated the use of non-coherent shared memory [15,26,68], and Mattson et al. [49] argue that the overhead of cache coherence restricts the ability to scale up to even 80 cores.
2.6 Messages are getting easier
There are legitimate software engineering issues associated with message-based software systems, and one might therefore question the wisdom of constructing a multiprocessor operating system based on a "shared nothing" model as we propose. There are two principal concerns, the first to do with not being able to access shared data, and the second to do with the event-driven programming style that results from asynchronous messaging.
However, the convenience of shared data is somewhat superficial. There are correctness and performance pitfalls when using shared data structures, and in scalable shared-memory programs (particularly high-performance scientific computing applications), expert developers are very careful about details such as lock granularity and how fields are laid out within structures. By fine-tuning code at a low level, one can minimize the cache lines needed to hold the shared data and reduce contention for cache line ownership. This reduces interconnect bandwidth and the number of processor stalls incurred when cache contents are stale.
The same kind of expertise is also applied to make commodity operating systems more scalable. As we have shown above, this leads to a challenge in evolving the system as tradeoffs change, because the knowledge required for effective sharing of a particular data structure is encoded implicitly in its implementation. Note that this means programmers must think carefully about a shared-memory program in terms of messages sent by the cache-coherence protocol in response to loads and stores to data locations.
The second concern with message passing is the resultant “stack ripping” and obfuscation of control flow due to the event-driven nature of such programs. However, traditional monolithic kernels are essentially event-driven systems, even on multiprocessors. OS developers are perhaps more accustomed to thinking in terms of state machines and message handlers than other programmers.
Finally, we note that a message-passing, event-driven programming model is also the norm for many other programming domains, such as graphical user interface programming, some types of network server, and large-scale computation on clusters (where it has completely replaced the "distributed shared virtual memory" paradigm). This has meant that the programmability of message-passing or event-driven systems is an active research area with promising results that seem a natural fit for the multikernel model, such as the Tame/Tamer C++ libraries [41] and the X10 parallel programming language [16]. As the need for multicore programming environments at the user level becomes more pressing, we expect tools like these, which support a message-passing abstraction, to become widespread.
2.7 Discussion
The architecture of future computers is far from clear but two trends are evident: rising core counts and increasing hardware diversity, both between cores within a machine, and between systems with varying interconnect topologies and performance tradeoffs.
This upheaval in hardware has important consequences for a monolithic OS that shares data structures across cores. These systems perform a delicate balancing act between processor cache size, the likely contention and latency to access main memory, and the complexity and overheads of the cache-coherence protocol. The irony is that hardware is now changing faster than software, and the effort required to evolve such operating systems to perform well on new hardware is becoming prohibitive.
Increasing system and interconnect diversity, as well as core heterogeneity, will prevent developers from optimizing shared memory structures at a source-code level. Sun Niagara and Intel Nehalem or AMD Opteron systems, for example, already require completely different optimizations, and future system software will have to adapt its communication patterns and mechanisms at runtime to the collection of hardware at hand. It seems likely that future general-purpose systems will have limited support for hardware cache coherence, or drop it entirely in favor of a message-passing model. An OS that can exploit native message-passing would be the natural fit for such a design.
We believe that now is the time to reconsider how the OS should be restructured to not merely cope with the next generation of hardware, but efficiently exploit it. Furthermore, rather than evolving an inherently shared-memory model of OS structure to deal with complex tradeoffs and limited sharing, we take the opposite approach: design and reason about the OS as a distributed, non-shared system, and then employ sharing to optimize the model where appropriate.
Figure 4 depicts a spectrum of sharing and locking disciplines. Traditional operating systems, such as Windows and variants of Unix, have evolved from designs at the far left of the continuum towards finer-grained locking and more replication. These changes have been driven by hardware developments that exposed scalability bottlenecks, particularly the adoption of multiple processors in commodity PCs. Mainstream OSes are currently moving towards the center, where several "hot" data structures are partitioned or replicated across cores. Research systems take this even further with mechanisms like clustered objects that improve the locality of partitioned data [24]. In contrast, we propose an OS architecture positioned at the extreme right of the spectrum, where all state is replicated by default and consistency is maintained using agreement protocols.
3. THE MULTIKERNEL MODEL
In this section we present our OS architecture for heterogeneous multicore machines, which we call the multikernel model. In a nutshell, we structure the OS as a distributed system of cores that communicate using messages and share no memory (Figure 1). The multikernel model is guided by three design principles:
1. Make all inter-core communication explicit.
2. Make OS structure hardware-neutral.
3. View state as replicated instead of shared.
These principles allow the OS to benefit from the distributed systems approach to gain improved performance, natural support for hardware heterogeneity, greater modularity, and the ability to reuse algorithms developed for distributed systems.
After discussing the principles in detail below, in Section 4 we explore the implications of these principles by describing the implementation of Barrelfish, a new operating system based on the multikernel model.
3.1 Make inter-core communication explicit
Within a multikernel OS, all inter-core communication is performed using explicit messages. A corollary is that no memory is shared between the code running on each core, except for that used for messaging channels. As we have seen, using messages to access or update state rapidly becomes more efficient than shared-memory access as the number of cache lines involved increases. We can expect such effects to become more pronounced in the future. Note that this does not preclude applications sharing memory between cores (see Section 4.8), only that the OS design itself does not rely on it.
Explicit communication patterns facilitate reasoning about the use of the system interconnect. In contrast to implicit communication (such as distributed shared memory, or the messages used for cache coherence), the knowledge of which parts of shared state are accessed when and by whom is exposed. It is, of course, established practice in OS kernels to design point-solution data structures that can be updated using only one or two cache misses on particular architectures, but it is hard to evolve such operations as hardware changes, and such optimizations ignore the wider picture of larger state updates in the OS involving multiple structures.
We have previously argued that as the machine increasingly resembles a network, the OS will inevitably behave as a distributed system [8]. Explicit communication allows the OS to deploy well-known networking optimizations to make more efficient use of the interconnect, such as pipelining (having a number of requests in flight at once), and batching (sending a number of requests in one message, or processing a number of messages together). In Section 5.2 we show the benefit of such techniques in the case of distributed capability management.
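Batching, for instance, can be sketched as follows. The structures and the stubbed send_msg transport are hypothetical, not Barrelfish's message format: rather than one interconnect transfer per operation, a core accumulates requests and ships them in a single message.

/* Sketch of request batching: one message amortized over many operations. */
#include <stdint.h>

#define BATCH_SIZE 16

struct op {                        /* one logical request */
    uint32_t opcode;
    uint64_t arg;
};

struct batch_msg {                 /* one message carrying many requests */
    uint32_t count;
    struct op ops[BATCH_SIZE];
};

/* stub transport: a real system would enqueue this on an inter-core channel */
static void send_msg(int core, const struct batch_msg *m)
{
    (void)core; (void)m;
}

static void batch_flush(struct batch_msg *b, int core)
{
    if (b->count == 0)
        return;
    send_msg(core, b);             /* one transfer for b->count operations */
    b->count = 0;
}

static void batch_add(struct batch_msg *b, int core,
                      uint32_t opcode, uint64_t arg)
{
    b->ops[b->count++] = (struct op){ opcode, arg };
    if (b->count == BATCH_SIZE)    /* ship a full batch; pipelining would   */
        batch_flush(b, core);      /* keep several such messages in flight  */
}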
This approach also enables the OS to provide isolation and resource management on heterogeneous cores, or to schedule jobs effectively on arbitrary inter-core topologies by placing tasks with reference to communication patterns and network effects. Furthermore, the message abstraction is a basic requirement for spanning cores which are not cache-coherent, or do not even share memory.
Message passing allows operations that might require communication to be split-phase, by which we mean that the operation sends a request and immediately continues, with the expectation that a reply will arrive at some time in the future. When requests and responses are decoupled, the core issuing the request can do useful work, or sleep to save power, while waiting for the reply. A common, concrete example is remote cache invalidations. In a highly concurrent scenario, provided that completing the invalidation is not required for correctness, it can be more important not to waste time waiting for the operation to finish than to perform it with the smallest possible latency.
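A split-phase invalidation might look roughly like the sketch below. The API names (invalidate_async, send_message) and the stubbed transport are hypothetical, not Barrelfish's interface: the caller posts a request together with a continuation and returns immediately, and the reply handler runs the continuation when the response arrives.

/* Sketch of a split-phase (asynchronous) remote invalidation. */
#include <stdbool.h>
#include <stddef.h>

typedef void (*continuation_fn)(void *arg, int status);

enum msg_type { MSG_INVALIDATE };

/* stub transport: a real system would enqueue onto an inter-core channel */
static void send_message(int core, enum msg_type t, void *payload, int tag)
{
    (void)core; (void)t; (void)payload; (void)tag;
}

#define MAX_PENDING 64
static struct pending {
    continuation_fn cont;          /* what to run when the reply arrives */
    void *arg;
    bool in_use;
} pending[MAX_PENDING];

/* phase one: issue the remote invalidation and return at once */
int invalidate_async(int target_core, void *addr,
                     continuation_fn cont, void *arg)
{
    for (int tag = 0; tag < MAX_PENDING; tag++) {
        if (!pending[tag].in_use) {
            pending[tag] = (struct pending){ cont, arg, true };
            send_message(target_core, MSG_INVALIDATE, addr, tag);
            return 0;              /* caller keeps doing useful work */
        }
    }
    return -1;                     /* too many requests already in flight */
}

/* phase two: called from the message-dispatch loop when a reply arrives */
void handle_invalidate_reply(int tag, int status)
{
    struct pending *p = &pending[tag];
    p->in_use = false;
    p->cont(p->arg, status);       /* resume the second phase */
}

Because many requests can be outstanding at once, this interface also gives the pipelining described above for free.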
Finally, a system based on explicit communication is amenable to human or automated analysis. The structure of a message-passing system is naturally modular, because components communicate only through well-defined interfaces. Consequently it can be evolved and refined more easily [23] and made robust to faults [30]. Indeed, a substantial theoretical foundation exists for reasoning about the high-level structure and performance of a system with explicit communication between concurrent tasks, ranging from process calculi such as Hoare's communicating sequential processes and the π-calculus, to the use of queuing theory to analyze the performance of complex networks.