January 25, 2011
To understand the benefits and challenges of cloud computing for high performance computing, it helps to take a look back at what problems are solved by other modes of outsourcing computation—and what barriers remain. While the concept of running complex scientific applications on remote resources is certainly nothing new, there have been key technological—not to mention ideological—advances that have been slowly refining the process.
Kate Keahey, currently a scientist focusing on virtualization, resource management and cloud computing at Argonne National Laboratory, agrees that outsourcing scientific applications is a common, long-standing desire, but also contends that the paradigm shift to the cloud has created opportunities for scientific users that the grid was unable to supply. While she feels the grid did an enormous amount to build momentum for distributed computing, cloud computing has addressed some of the grid's limitations and opened new possibilities.
As one of the world’s notable researchers working to make clouds more suitable for the complex needs of scientific users, Keahey contends that while there are opportunities in the cloud, there are also hurdles that remain. However, as the space matures and more tools and processes are developed, the cloud may become a viable way for scientists to focus exclusively on their work, shedding the complexities of at least some of their physical resources and realizing the goal of on-demand, elastic provisioning.
In addition to her roles at Argonne and as a fellow at the University of Chicago’s Computation Institute, she leads the Nimbus Project, which provides scientists with an open source toolkit that allows them to turn their existing clusters into Infrastructure as a Service (IaaS) clouds.
In the following interview we talked about her background with distributed computing, the limitations of the grid, the challenges and benefits of cloud computing for HPC, and her view on critical elements that the community as a whole—vendors, users, and scientists alike—will need to address as the space matures.
HPCc: How was your interest in distributed systems and grids piqued?
Keahey: As a grad student, in 1995, I worked on the iWay experiment, which involved combining supercomputers using fast networks -- this was an amazing event; we were all trying to build applications that would run on these distributed supercomputers. For this experiment, I implemented the communication system that allowed an application to run on four supercomputers in distributed locations across the country: the Cornell Theory Center, the Pittsburgh Supercomputing Center, Indiana and NCSA. The application simulated the collision of the Andromeda and Milky Way galaxies (which is expected to happen sometime in the future) -- the simulation ran across all of those supercomputers; it ran for 25 hours and produced the right results.
This was an incredibly interesting experiment in that it highlighted the potential of combining supercomputers over the network. It was one of the most game-changing events I have seen. After the Supercomputing Conference that year it felt like the world had changed in some way -- that it was now possible to connect widely distributed supercomputers by networks and have them work in this configuration. What kinds of applications would be best suited for this was still a question, but it was clear that supercomputers would no longer be just isolated machines.
HPCc: Following the iWay experiment you eventually went on to Argonne where you worked on extending some of the lessons learned from this experience. What were some of the ways you started examining the possibilities of networks and the grid early on there?
Keahey: When I started working with grids, I noticed that something was missing; it was hard for application groups to use remote resources not because they were inaccessible but because they did not support the complex, application-specific environments required by scientific codes.
I was working with Fusion scientists at that time -- they had a code so complex that upgrades took a specialist 24 hours to install, yet it was a very widely used code that everyone in the community wanted to work with. It required specific versions of the operating system, libraries and tools. Running it on distributed grid sites was not an option because the environment on those sites did not typically support this finicky software stack. So in practice the complexity of the code prevented scientists from provisioning remote resources to run it on.
HPCc: What were some of the solutions you came up with at Argonne to make up for some of the environment management and control issues with the grid?
Keahey: We were trying to solve this “environment incompatibility” problem for the scientists so that their applications could run on any remote resource and eventually came up with virtualization. With virtualization, we knew we could create whatever environment was needed for a particular code in a virtual machine--then we could run that virtual machine on someone else’s resource. That did solve the problem--but not without creating a number of other problems in the process.
As an outsourcing paradigm, the grid had the shortcoming that it didn’t recognize the environment as an important aspect of outsourcing -- to some extent probably because there was nothing to be done about it; virtualization tools were not well developed at that point. Cloud computing recognizes the importance of an environment -- an appliance -- and uses virtualization to provide the required capability. Why is virtualization so good at this? Because it isolates the virtual machine from the underlying hardware. It became possible for providers to host virtual machines on their resources. Before, you couldn’t give a user root on your resources because once they had it they could do something bad; but now you can run a virtual machine and they can have root on that VM, since they are isolated from the actual hardware on which the VM runs.
HPCc: You saw several early case studies for cloud computing within the context of working around grid challenges—what were some of these experiences?
Keahey: The project with the Fusion scientists was one of the most inspiring. The code complexity I mentioned was not the only issue we were trying to solve; there was another feature the Fusion scientists wanted. As they run their experiments, they need to quickly analyze the outcomes on the fly in order to tune the experimental parameters as the experiment goes on. This analysis requires running codes with very quick turnarounds. To provide that turnaround, they had a cluster dedicated just to experiment support, which was used maybe 10% of the year.
It would again be interesting to use shared or grid resources for this purpose. But they needed immediate resource availability, whereas grid computing relied on batch scheduling -- useful as an institutional computing model, but one that doesn’t scale to multiple communities across the country trying to sort out their priorities on a resource. So if someone had needs like experiment support, a paper deadline, or a national emergency, they could not outsource them.
HPCc: You are one of the creators of Nimbus, which had its foundations in some of the work you were doing with the Fusion scientists and their needs. Describe what led to the creation and how it evolved.
Keahey: We came up with Nimbus about eight years ago during our work with the Fusion scientists -- we said, let’s deploy virtual machines for you if it will solve your problem, so we developed something called the Workspace Service, which was the first part of Nimbus. This was essentially middleware providing the same functionality that EC2 offers. We were able to deploy virtual machines on-demand on a remote resource via this prototype in 2002. We tried to get scientists interested in it, but at that time we were using VMware, so everybody was very excited until they realized they had to pay high licensing fees, and they’d say “why use a VM if I can buy a real machine for that amount of money?”
This problem got solved when Xen emerged; it was not only open source -- it was also fast—and it solved a large part of the performance overhead problem. From then on it became easier to run on virtual machines because the performance overhead was much smaller—it was like a huge barrier went down. It was a very significant step in enabling virtualization for scientific communities, since these communities have a very strong “need for speed”.
After a few years of R&D we released the Workspace Service’s first version in mid-2005; then we hit another problem: getting it deployed and used. We would go to application scientists, offer them Nimbus and say “you can deploy VMs on remote resources,” and they would say “that’s what I need, but when is TeraGrid or other large infrastructure going to buy into this?” Folks at TeraGrid would say “it looks promising, but we don’t see application groups with virtual machines lining up.” In other words, we had a chicken-and-egg problem. Then in mid-2006 Amazon announced EC2, which was a huge breakthrough for us because finally someone provided a service that was essentially exactly what we were trying to provide, and now we could get application scientists to start using this resource -- and we did.
After a while people started deploying Nimbus to provide sort of a private cloud to experiment with improving the infrastructure and doing cloud computing research, etc.
HPCc: To back up a little, the constant here is that scientists and researchers have been looking to outsource computing in any way possible, but grids were not proving flexible enough to handle some complex applications and user needs. So what problems are left for clouds, as a paradigm, to replace this other model of outsourcing computation -- are there still problems barring this movement for scientific applications?
Keahey: In science, people have tried to outsource computing in many ways, via university-wide efforts or efforts on a national scale such as grids. This gives them access to much more sophisticated resources than their institution could provide. While there are many outsourcing models -- and we mustn’t forget that grid computing created huge momentum in this space -- it seems cloud computing created a breakthrough because of its ease of use; it gave users exactly what they needed. At least this is true for a large group of users.
But cloud computing is a paradigm shift. Like every shift, it has some attractive elements, but it also creates problems. One of these problems is certainly performance, especially for HPC applications that have significant requirements in this space. There are many aspects to that, one of which is latency; another -- easier to deal with -- is throughput.
Not so long ago the major criticism was that clouds simply did not have the right hardware -- but Amazon has since announced Cluster Compute Instances, and its recent offering with GPUs has also gone a long way toward making the cloud more suitable for high performance computing.
Another is dealing with data and computation privacy in the cloud -- we are only beginning to understand the renegotiated trust relationships in this space. The cloud has made wonderful breakthroughs in isolating users from one another, but we cannot protect data from the cloud provider -- so now we can outsource, but privacy from the provider is an issue. For instance, the medical and healthcare community could benefit from the cloud in many ways, but right now the privacy status of their data is somewhat questionable. Some of this could be solved technically and some in regulatory ways.
Finally, there is also the issue of cloud markets -- many people are not sure what to do about cloud computing because there are no functioning cloud markets. If computing is in fact to become a tradable commodity, some things need to change. One aspect is standards, making it easy for users to choose between providers. Another is understanding the cost: how do the various offerings compare?
And finally, how do we use clouds? There’s a lot of technology to throw at the problem right now, with appliance management software and so on, but this is still an area that needs development. A new paradigm creates new usage patterns and thus the need for new tools -- what are the best tools to leverage it?
HPCc: We were talking earlier about performance in the cloud; how are scientists evaluating what a still immature cloud market has to offer them?
Keahey: It’s hard to provide a viable comparison between offerings in the cloud. For now it’s hard enough for consumers to understand the different instance variations on EC2, let alone compare Amazon’s offering with what Rackspace offers. This is particularly hard with virtualization, because when people compare resources they look at the architecture, clock speed, etc. -- but with cloud it’s harder, because you could have the same hardware configured to optimize different tradeoffs, so the ultimate performance from the perspective of the user can differ. For instance, you might configure your hypervisor for great throughput, but you will pay the price in CPU -- or it can be configured with different tradeoffs.
One more thing on performance comparisons -- a while ago we did an experiment with the STAR project at Brookhaven National Lab. They had a paper due, had one more simulation to produce, and all of their local resources were busy. We had worked with them before, and this time they asked us to produce the result on Amazon so they could run their simulation in time. We created a virtual cluster for them, it ran, and they made the deadline -- it was a huge success story. But after some time, their colleagues evaluated the different instance types and found that they could have produced the result for half the price if we had used a more powerful instance type.
Making these performance comparisons involves choice and investigation, and these are efforts that every group is making on its own. What’s needed is a benchmarking service -- across providers and instance types -- that maps the kind of scientific calculation you have to the corresponding best choices. Having something like that would be very valuable.
Without that, the effort gets repeated. You can pick a random instance and overpay or pay for someone’s time to do a cost-performance analysis – either way you are paying.
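The tradeoff behind the STAR anecdote can be made concrete with a back-of-the-envelope calculation. The sketch below uses entirely hypothetical instance names, hourly prices, and throughput figures (none of them real EC2 numbers) to show how an instance with a higher hourly price can still halve the total cost of a job when its throughput grows faster than its price:

```python
# Back-of-the-envelope cost comparison across instance types.
# All names, prices, and throughput figures are hypothetical.
INSTANCES = {
    "small":   {"price_per_hour": 0.10, "throughput": 1.0},   # baseline
    "large":   {"price_per_hour": 0.40, "throughput": 2.0},   # faster, but worse $/work
    "cluster": {"price_per_hour": 0.80, "throughput": 16.0},  # fastest, best $/work
}

def cost_of_job(work_units, instance):
    """Total cost of a job: hours needed at this throughput, times hourly price."""
    spec = INSTANCES[instance]
    hours = work_units / spec["throughput"]
    return hours * spec["price_per_hour"]

if __name__ == "__main__":
    # A simulation requiring 800 units of work:
    for name in INSTANCES:
        print(f"{name:8s} ${cost_of_job(800, name):7.2f}")
    # Here "cluster" costs 8x the hourly price of "small" but delivers 16x
    # the throughput, so the same job costs half as much in total.
```

The useful metric is price per unit of work (hourly price divided by throughput), not hourly price alone -- exactly the kind of comparison a benchmarking service across providers and instance types would automate.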
HPCc: What percentage of scientific applications are suitable, on a performance and cost level, for a public cloud resource like EC2?
Keahey: I wish I knew the answer; I’ve been pondering this question for a very long time now. It’s hard to characterize these applications because the scene keeps changing. My sense is that there are many scientific applications that can be done on this type of resource and new ones are emerging every day. So far we know that embarrassingly parallel applications have been doing very well on the clouds. HPC applications with low I/O overhead have been doing reasonably well also.
This issue of performance and suitability was best described in the words of a colleague contributing to Nimbus: “I don’t want a Ferrari, I want a pickup truck.” In this case, the “Ferraris” are the Blue Waters-type machines, the luxury end of computing -- but many people have more mundane computations, and there are many scientists in this category who could make use of cloud computing resources like EC2.
If you find an answer, or a guess from someone in a position to make this guess, I’d be very interested. Should we invest in buying high-end machines, or is this what’s going to advance science? Right now, nobody knows.
More on Grids, Clouds and Science…
Dr. Keahey has a great deal more to share related to deploying cloud computing resources for scientific applications. In one of the more insightful articles on cloud computing from the past year, entitled “Mohammad and the Mountain,” which can be found at the ScienceClouds resource (alongside a number of other interesting posts), she expands on performance, fault tolerance, and other issues using a rather unique metaphor.