06-09-2016, 09:23 AM
Large, virtualized pools of computational resources raise the possibility of a new, advantageous computing paradigm for scientific research. To help achieve this, new tools make the cloud platform behave virtually like a local homogeneous computer cluster, giving users access to high-performance clusters without requiring them to purchase or maintain sophisticated hardware.
Is high-performance scientific computation using cloud computing resources feasible as an alternative to traditional resources? The availability of large, virtualized pools of computational resources raises the possibility of a new, advantageous compute paradigm for scientific research. To achieve this, the authors developed a set of tools that make the cloud platform behave virtually like a local homogeneous computer cluster. As their study results show, for research groups that don't need advanced network performance, cloud computing can provide convenient access to reliable, high-performance clusters without requiring users to purchase and maintain or even understand sophisticated hardware and high-performance computational methods. For developers, cloud virtualization allows scientific codes to be optimized and preinstalled, facilitating control over the computational environment. The authors present results from preliminary tests for serial and parallelized versions of the widely used x-ray spectroscopy and electronic structure code FEFF on the Amazon elastic compute cloud, including CPU and network performance.
Modern cloud computing platforms vary, 1 – 3 but they share two critical features: they abstract the underlying compute components and they typically charge users incrementally based on their usage. The "pay-as-you-go" billing strategy isn't new, and it has many potential advantages, especially for scientists who don't require 24/7 accessibility. Many academic computational researchers have used shared compute facilities for decades and are accustomed to being billed per CPU-hour. What makes cloud architectures a compelling new product for scientific computing—and what differentiates them from existing supercomputing facilities—is the way they abstract the underlying compute components. These components range from hardware infrastructures to operating systems to software packages. This approach offers several advantages for scientists, users, and developers alike. First, unlike supercomputer centers, cloud hardware infrastructures give users and developers sweeping control of their clusters. This is useful when scientists have applications that require particular pieces of software to be installed at the system level. For clouds that provide software as a service, the scientist never has to install a thing. The cloud provider installs, maintains, and optimizes the application and the scientist merely conforms to a specific API.
Here, we focus on a configuration of the Amazon elastic compute cloud (EC2; see http://aws.amazon.com) for scientific computation. The EC2 is part of Amazon Web Services (AWS), which provides hardware infrastructure as a service. In other words, the hardware itself is abstracted into compute resources (EC2) and storage resources (simple storage service, or S3). Cloud computing services are widely used for commercial Web applications, such as Google Apps (www.googleapps/intl/en/business/index.html) or Microsoft's Azure (www.microsoftazure), but have scarcely been exploited for scientific computing applications 3 largely because of their differing requirements. First, scientists often require platforms that are good number crunchers, and virtualization software might interfere with this capability. Second, parallel scientific codes often require a high-performance network interconnect, and no cloud platform yet offers such capability.
In our study, we set out to assess EC2's computational and network capabilities for scientific high-performance computing (HPC). In particular, we demonstrate EC2's feasibility for ab initio electronic structure and x-ray spectroscopy modeling 6 , 7 using the real-space code FEFF ( http://leonardo.phys.washington.edu/feff), which is typical of many scientific computing applications. FEFF is a widely used scientific code that calculates the electronic and optical properties of arbitrary, complex systems in real-space for large atom clusters, and runs on a variety of computing environments. Here, we present results that benchmark the performance of both serial and parallel versions of FEFF on EC2 virtual machines. We also describe a toolset that we developed that lets users deploy their own virtual compute-clusters, both for FEFF and other parallel codes. Finally, we explore EC2's intra- and internode communication performance using a tightly coupled scientific code and the Intel MPI Benchmarks.
Our efforts are in two main areas:
• Development. We created an environment that permits the FEFF user community to run different software versions in their own EC2-resident compute clusters.
• Benchmarking. We tested the performance of FEFF and other scientific codes on EC2 hardware.
To accomplish these efforts, we first had to gain an understanding of the EC2 and S3 infrastructure.
EC2 and S3 Infrastructure
The Amazon EC2 service hosts Amazon machine images (AMIs) on generic hardware located "somewhere" within the Amazon computer network. Amazon offers a set of public AMIs that users can customize to their needs, as well as several types of hardware with various performance levels. Once users select and configure an AMI, they can store it in their Amazon S3 accounts for subsequent reuse. EC2's "elasticity" denotes users' ability to spawn an arbitrary number of AMI instances, scaling the computational resources up or down to match demand. Currently available EC2 instances range from small (a 32-bit platform with one virtual core, 1.7 Gbytes of memory, and 160 Gbytes of storage) to extra large (a 64-bit platform with eight virtual cores, 7 Gbytes of memory, and 1,690 Gbytes of storage). These limits will likely increase in the future.
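To make the instance range concrete, the following minimal Python sketch encodes the two instance sizes quoted above and picks the smallest one that satisfies a job's requirements. The `InstanceType` class and the `smallest_fit` helper are hypothetical names introduced for illustration; they are not part of any Amazon tool, and only the numbers come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    """Hypothetical record describing one EC2 instance size."""
    name: str
    bits: int          # platform word size
    cores: int         # virtual cores
    memory_gb: float
    storage_gb: int


# The two ends of the range quoted in the text.
SMALL = InstanceType("small", 32, 1, 1.7, 160)
EXTRA_LARGE = InstanceType("extra large", 64, 8, 7.0, 1690)


def smallest_fit(types: Sequence[InstanceType],
                 cores: int,
                 memory_gb: float) -> Optional[InstanceType]:
    """Return the least powerful type that meets the request, or None."""
    candidates = [t for t in types
                  if t.cores >= cores and t.memory_gb >= memory_gb]
    # Rank by core count first, then memory; default covers "nothing fits".
    return min(candidates, key=lambda t: (t.cores, t.memory_gb), default=None)
```

For example, `smallest_fit([SMALL, EXTRA_LARGE], cores=4, memory_gb=4.0)` selects the extra large type, while a request for 16 cores returns `None` because neither listed size can satisfy it.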
Toolsets
EC2 and S3 provide three toolsets for creating and using AMIs:
• AMI tools ( http://developer.amazonwebservicesconnec...rnalID=368) are command-line utilities used to bundle an image's current state and upload it to S3 storage.
• API tools ( http://developer.amazonwebservicesconnec...rnalID=351) serve as the client interface to the EC2 services, letting users register, launch, monitor, and terminate AMI instances.
• S3 libraries ( http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=47) let developers interact with the S3 server to manage stored files, such as AMIs.
The toolsets are available in several formats, including Python, Ruby, and Java. This variety of coding languages gives developers a range of options, letting them tailor their implementations to optimize results. Currently, we use the Java implementation, supplemented with our own set of Bash scripts. For developers to use EC2, they need all three sets of tools. In contrast, users need only the API tools, unless they want to modify and store our preconfigured images. We endeavored to make these tools transparent; users need neither cloud computing nor HPC expertise to run sophisticated parallel codes on the EC2 environment.
We also experimented with Elasticfox ( http://developer.amazonwebservicesconnec...rnalID=609) and S3fox ( http://developer.amazonwebservicesconnec...rnalID=366), two GUIs that provide partial implementations of the EC2 and S3 tools. Both are extensions of the Firefox browser and provide a user-friendly alternative to the EC2 and S3 tools for AWS users. Elasticfox gives an all-in-one picture of users' current AMI states and active instances, and lets users initiate, monitor, and terminate AMI instances. S3fox mimics the interface of many commonly used FTP programs, letting users create and delete S3 storage areas. We believe these graphical browser extensions will prove useful and intuitive for the scientific-user community. Therefore, we developed our user environment to accommodate these tools. Amazon also provides its own graphical interface, the AWS management console, as a browser-independent way of managing EC2 instances.
Usability
The scientific-user community includes a growing number of investigators with parallel computation experience who are familiar with Linux and MPI, but many others have little or no experience with such HPC environments. Also, while investigators are increasingly using HPC versions of FEFF to study complex materials, many users lack access to adequate HPC resources. One of cloud computing's advantages is its potential to provide such access, without requiring users to buy, maintain, or even understand HPC hardware. 8 , 9
With this in mind, we adopted a development strategy that aims to serve users lacking HPC resources. Our approach provides a complete, standalone MPI parallel runtime environment. Users need to know only their AWS account credentials; they don't need any experience with parallel codes, hardware, or the EC2. Our environment works well for the FEFF code. However, it's also general and can be used to launch any suitably configured Amazon EC2 Linux image for parallel computation. This gives users accustomed to using MPI on a workstation or cluster an immediately useful way to run both our FEFF MPI software and many other MPI parallel codes on the EC2.
Configuring EC2 for Serial Scientific Computing
For simplicity, we begin with an example of serial scientific computing. Starting from a public AWS AMI with a Fedora 8 Linux operating system, we created a FEFF AMI that provides both the command-line-driven FEFF code (version 8.4) and JFEFF, a Java-based GUI that facilitates FEFF84 execution. To make these programs run smoothly, we enhanced the AMI template with X11 and a Java runtime environment. JFEFF then functions as if it were running locally. The resulting AMI occupies approximately 550 Mbytes of S3 storage. Instance boot time varies depending on EC2 availability but, on average, an instance boots in about two minutes and rarely takes more than three-and-a-half minutes. Figure 1 shows a screenshot of JFEFF running on EC2 and the FEFF AMI console, with the Elasticfox control screen running in the background. As might be expected, the only notable limitation we observed is the GUI's relatively slow response when running over a network. To overcome this problem, we modified the JFEFF GUI so it can be run on a user's local machine while the FEFF executables are resident on the EC2 instances.
Figure 1. Amazon elastic compute cloud (EC2) for serial scientific computing. Components include the FEFF EC2 console in the upper right; the Java FEFF GUI in the lower right, and a Firefox browser running the Elasticfox extension in the background.
Benchmarking FEFF84 Serial Performance
As part of our overarching goal of achieving high-performance scientific computing in the cloud, we first used the FEFF AMI to test the serial performance of FEFF84 on instances with different computing power.
Figure 2 shows runtime versus cluster-size results for a typical full multiple-scattering FEFF calculation (a boron nitride crystal) on clusters containing between 29 and 157 atoms. This is realistic, as finite clusters with about 100 atoms are typically adequate to obtain bulk systems' converged spectra. We used two instance types, both running 32-bit operating systems:
• a small instance using a 2.6-GHz AMD Opteron processor, and
• a medium instance using a 2.33-GHz Intel Xeon processor.
Both instances ran FEFF compiled with GNU Fortran without optimization flags.
Figure 2. Comparing serial performance (runtime versus cluster size) of different AMI instance types for typical FEFF84 x-ray spectra calculations. Overall EC2 virtual performance is similar to that of a physical system, with the medium instance being about twice as fast as the small one. Optimized code in the small instance performs better than non-optimized code on the medium one.
For comparison, we included results from one of the University of Washington's local Linux systems with a 64-bit 2.0-GHz AMD Athlon processor. Figure 2 also shows the results we obtained with a highly optimized version of FEFF84. We produced the compiled executable on an AMD Opteron Linux system at UW's Department of Physics using the PGI Fortran compiler with the "-fast" optimization flag. This system is analogous to the one used in the small instance. As Figure 2 shows, the resulting code makes the small instance even faster than the medium one. Consequently, we believe that AMIs should include well-optimized HPC tools to provide good performance for scientific applications. Of course, developers can do this once and for all, so users won't have to configure and optimize such codes for novel compute environments.
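A runtime-versus-cluster-size comparison of this kind can be collected with a very small timing harness. The Python sketch below is only an illustration of the measurement pattern, not the procedure used for Figure 2: `time_runs`, `speedup`, and `placeholder_workload` are hypothetical names, and the placeholder workload is an arbitrary loop standing in for a real FEFF run.

```python
import time
from typing import Callable, Dict, Iterable


def time_runs(run: Callable[[int], None],
              cluster_sizes: Iterable[int]) -> Dict[int, float]:
    """Return wall-clock seconds taken by run(n) for each cluster size n.

    Here "cluster size" means the number of atoms in the calculation.
    """
    timings: Dict[int, float] = {}
    for n in cluster_sizes:
        start = time.perf_counter()
        run(n)
        timings[n] = time.perf_counter() - start
    return timings


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate runtime must be positive")
    return baseline_seconds / candidate_seconds


def placeholder_workload(n_atoms: int) -> None:
    """Stand-in for launching FEFF; NOT a physics calculation."""
    sum(i * i for i in range(n_atoms * 1000))


if __name__ == "__main__":
    # 29 and 157 are the smallest and largest cluster sizes named in the text.
    for n, seconds in time_runs(placeholder_workload, [29, 157]).items():
        print(f"{n} atoms: {seconds:.4f} s")
```

Running the same harness on each instance type and dividing the runtimes with `speedup` gives the kind of relative comparison the figure caption describes.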
Strategies and Tools for Parallel Cloud Computing on the EC2
Setting up a virtual cluster on the EC2 is similar to setting up a physical computer cluster, but certain aspects of the EC2 infrastructure pose unique challenges. In most physical clusters, for example, the administrator typically has complete control of the node IP addresses, but on the EC2, they're dynamically allocated at boot time. So, you must gather this and other information before you can configure a virtual cluster. Security is also quite different. For example, to reduce a cluster's vulnerability, access certificates aren't stored within the AMI and are only transferred during setup.
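Because addresses only exist once the instances have booted, any setup script has to wait before it can configure anything. The Python sketch below shows one way to express that wait. It assumes a caller-supplied `describe` function that returns one dictionary per instance with `launch_index`, `state`, and `private_ip` keys; that function, the key names, and `wait_for_addresses` itself are hypothetical and do not correspond to a specific EC2 tool.

```python
import time
from typing import Callable, Dict, List


def wait_for_addresses(describe: Callable[[], List[Dict[str, object]]],
                       expected: int,
                       poll_seconds: float = 5.0,
                       max_polls: int = 60,
                       sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll until `expected` instances are running, then return their
    internal IP addresses ordered by launch index."""
    for _ in range(max_polls):
        instances = describe()
        running = [i for i in instances
                   if i["state"] == "running" and i.get("private_ip")]
        if len(running) == expected:
            ordered = sorted(running, key=lambda i: i["launch_index"])
            return [str(i["private_ip"]) for i in ordered]
        sleep(poll_seconds)  # addresses are not known until boot completes
    raise TimeoutError(f"only part of the {expected}-node cluster booted")
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays; in normal use the default `time.sleep` applies.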
These challenges, together with our desire to make the compute-cluster setup transparent to users, steered us toward a cloud-cluster structure that's slightly different from most typical physical compute-clusters in three key ways. First, because Amazon charges for all instances booted regardless of CPU load, we did away with the usual server head node of physical clusters. Thus cloud clusters are composed of only compute nodes. However, we do designate one of the nodes as a common disk server because many parallel codes (such as FEFF) require shared disk access for all parallel processes. Second, we eliminated the multiuser cluster concept. The cloud clusters have a single user, specifically set up to run the payload program. Finally, users can launch as many clusters as they need to run several simulations simultaneously. Our EC2 cloud-cluster implementation is homogeneous, thus facilitating parallel task load balancing. Figure 3 shows the resulting cloud-cluster scenario.
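The layout described above can be summarized in a few lines of Python. This is a sketch under stated assumptions rather than the authors' implementation: `plan_cluster` and `billed_instances` are hypothetical helpers, and choosing the first address in the list as the disk server is an assumption of the sketch.

```python
from typing import Dict, List


def plan_cluster(private_ips: List[str]) -> Dict[str, object]:
    """Assign roles for a cluster with no dedicated head node."""
    if not private_ips:
        raise ValueError("a cluster needs at least one node")
    return {
        # The disk server is itself a compute node, so no instance sits idle.
        "disk_server": private_ips[0],
        "compute_nodes": list(private_ips),
    }


def billed_instances(compute_nodes: int, dedicated_head_node: bool = False) -> int:
    """Instances charged for: every booted instance counts, busy or not."""
    return compute_nodes + (1 if dedicated_head_node else 0)
```

With four compute nodes, a conventional design with a separate head node would boot five instances, whereas the head-node-free layout boots four, which is the motivation for dropping the head node when every booted instance is billed.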
Although not essential, we also encourage users to install the Elasticfox extension of the Mozilla Firefox browser. This extension provides a user-friendly monitor of the user's instances on the cloud, which might help avoid unnecessary charges for runaway instances. Currently, the EC2 cloud-cluster tools can be installed either per user or in server mode. In the near future, we plan to offer a specially configured EC2 AMI containing all the software required to launch a cloud cluster. This will eliminate tool and security certificate installation, thus further simplifying cloud-cluster interactions.
The toolset's main starting script is ec2_clust_launch, which sets up an EC2-cluster with N nodes. Figure 4 shows a typical launch sequence log for a cluster with two instances. First, this script requests N instances, parsing and storing the EC2 reservation's information. Then, ec2_clust_launch monitors the reservation's status until all instances have booted. Once the full reservation is running, we gather the required internal EC2 IP addresses, create the configuration files required to run MPI applications, and distribute them to all nodes. On most physical clusters, all the nodes share storage, which is usually mounted from a designated file server node. Here, we assign the instance with boot index 0 as this server and export and mount a scratch area on all nodes. We accomplish this using the standard network file system (NFS). In the final step, we transfer the secure shell (ssh) key files, used to connect to the cluster for computing purposes. We store all information about the cluster in a per-user database, which contains information that the other cluster tools will use. Several clusters can be launched simultaneously, each having a unique identifier given by the EC2 reservation ID or a user-assigned label. All scripts can be directed at a specific cluster by using the user-assigned label or the reservation ID. If neither is given, the scripts default to the last cloud cluster created. Users can access a list of active cloud clusters through the ec2_clust_list command.
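The bookkeeping described here (configuration files built from the gathered addresses, plus a per-user record that other tools can look up by label or reservation ID) might be sketched in Python as follows. The toolset itself uses Bash scripts, so this is an illustrative translation only: `ClusterRecord`, `ClusterRegistry`, the `/scratch` path, and the export options are all assumptions, and the one-host-per-line host file is a common MPI convention rather than a format confirmed by the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClusterRecord:
    """Hypothetical per-cluster entry in the per-user database."""
    reservation_id: str
    private_ips: List[str]        # ordered by boot index; index 0 is the NFS server
    label: Optional[str] = None   # optional user-assigned label

    def hostfile(self) -> str:
        """One host per line, a plain format many MPI launchers accept."""
        return "\n".join(self.private_ips) + "\n"

    def nfs_exports(self, path: str = "/scratch") -> str:
        """/etc/exports lines for the server, one per client node."""
        return "".join(f"{path} {ip}(rw,sync)\n" for ip in self.private_ips[1:])


@dataclass
class ClusterRegistry:
    """Hypothetical in-memory stand-in for the per-user database."""
    records: List[ClusterRecord] = field(default_factory=list)

    def add(self, record: ClusterRecord) -> None:
        self.records.append(record)

    def find(self, key: Optional[str] = None) -> ClusterRecord:
        """Look up by label or reservation ID; default to the newest cluster."""
        if not self.records:
            raise LookupError("no active cloud clusters")
        if key is None:
            return self.records[-1]
        for record in self.records:
            if key in (record.reservation_id, record.label):
                return record
        raise LookupError(f"no cluster matches {key!r}")
```

Calling `find()` with no argument mirrors the behavior of defaulting to the last cloud cluster created, while `find("some-label")` or `find("some-reservation-id")` directs a tool at a specific cluster.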