The advent of public clouds has brought large-scale HPC resources within easy reach of ordinary companies. In many situations, especially for temporary projects, cloud solutions can be more affordable for HPC than acquiring the necessary compute power in-house. Before cloud, only a few companies, such as giant financial services firms, could afford this type of resource.
Last year, we saw considerable customer demand in our market, from companies in a variety of industries, for a large-scale HPC cluster on which to test software platforms. When we first described this need to traditional HPC vendors, we were repeatedly asked which industry consortium or government agency was behind the request, given the magnitude of the HPC environment sought. We were met with polite disbelief when we explained that this was for individual companies, not a giant organization. In the end, we decided to build the cluster software ourselves, with the goal of running it on both public and private clouds.
When we looked at various commercial and open source options for developing our software, we found that most are optimized for running different generic applications on the cluster simultaneously. To accommodate such requirements, the cluster is hard-partitioned: different operating systems are installed on individual compute nodes, and each machine is reserved for a particular application in advance, regardless of how many compute resources the application actually needs. This results in fairly low utilization of available compute resources, on the order of 30% on average. That is an adequate compromise for anyone who wants to build an HPC cluster as a generic resource and rent it out to the public, but it is not the best policy for serving the needs of a typical commercial user.
The HPC Solution
We decided to build the HPC software using different principles. The project was initially undertaken for customers who were especially interested in this solution and were seeking to reduce both costs and time: they didn't want to spend upwards of $1 million to accomplish the work, and didn't think they would be competitive in their own markets if the results took weeks.
Talking to users of HPC resources, we noticed that a typical company has only a single HPC application, or a set of related applications built around a common compute platform, and wants a software platform that minimizes compute times while maximizing utilization of available resources.
The HPC cluster was to be designed differently, around the principle of dynamically scheduling individual cores to ensure maximum hardware utilization. The cluster runs a single compute platform and handles related requests from a single vendor. This lets the security model be relaxed: code from different compute jobs can share the same operating system, so cores can be switched from one job to another in near real time.
The cluster is designed to solve a subset of problems in the HPC world, rather than attempting to be a generic, all-encompassing compute solution. The problems we chose to tackle are embarrassingly parallel in nature. Individual computation requirements are substantial enough that compute time is at least an order of magnitude higher than distribution time. Both the problem set and the solution results are small enough to be transmitted efficiently across the available network topology, and distribution time is several orders of magnitude lower than the overall job run time. The cluster software is designed to be integrated at the code level into a particular compute-intensive application, rather than providing a set of generic remote interfaces.
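The fit criteria above can be expressed as a simple screening check. The sketch below is illustrative only; the function name and the exact order-of-magnitude thresholds are our assumptions, not code from the product:

```python
def fits_cluster_profile(task_compute_s, task_distribution_s, job_runtime_s):
    """Heuristic check that a workload matches the cluster's target profile:
    per-task compute time at least an order of magnitude above distribution
    time, and distribution time several orders of magnitude below the
    overall job run time. Thresholds are illustrative assumptions."""
    return (task_compute_s >= 10 * task_distribution_s
            and task_distribution_s * 1000 <= job_runtime_s)
```

A task that computes for five seconds after 0.1 seconds of distribution, inside an hour-long job, passes; a task dominated by distribution overhead does not.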
Cluster Architecture
The cluster is designed to break jobs up into individual compute tasks, efficiently execute those tasks on available hardware resources, and return the results to the client application. It operates on bare hardware, on private clouds running OpenStack, and on public clouds. The different deployment scenarios address a range of available resources. Bare metal is the most efficient when the customer can afford to dedicate fixed compute resources to a single application. Private clouds serve well to distribute a company's internal hardware resources between applications, and allow an easy way to provision a different set of compute nodes or shift available resources from the HPC cluster to other needs. Public clouds serve well when bursty load levels and occasional demand make it impractical to purchase your own hardware.
The HPC software uses Apache Libcloud to drive provisioning across multiple hardware platforms. We have been major contributors to the Libcloud project, and have used it effectively in a number of software projects for customers. The HPC software includes a custom orchestration layer we developed. On top of the orchestration layer sit the cluster's HPC components (scheduler, job interface node, communication fabric, and compute nodes), which control efficient job execution.
Components
The communication fabric consists of a collection of RabbitMQ nodes. Individual instances are allocated to either the control plane, which carries reconfiguration and status messages, or the data plane, which transports tasks to compute nodes and returns results. The RabbitMQ instances are not clustered, as we have found that clustering severely decreases client reconnect rates. Instead, the client communication libraries distribute requests across the related RabbitMQ instances, providing an efficient scaling mechanism. Typically, problem and result payloads are sent in-band. However, the software has provisions to deploy a Memcached cluster to enhance RabbitMQ scalability for larger payloads. Memcached benefits configurations where the problem set is greater than tens of kilobytes and individual task solution sets are greater than a few hundred kilobytes.
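The in-band versus out-of-band decision comes down to payload size. The following minimal sketch shows one way such a routing rule could look; the function, thresholds, and key scheme are hypothetical stand-ins for the product's actual client library:

```python
# Thresholds loosely derived from the sizes mentioned above;
# both values are illustrative assumptions, not product defaults.
PROBLEM_OOB_BYTES = 50 * 1024    # "tens of kilobytes"
RESULT_OOB_BYTES = 300 * 1024    # "a few hundred kilobytes"

def route_payload(payload: bytes, kind: str) -> dict:
    """Decide whether a payload rides in the RabbitMQ message body
    (in-band) or is stashed in Memcached with only a key sent in-band."""
    limit = PROBLEM_OOB_BYTES if kind == "problem" else RESULT_OOB_BYTES
    if len(payload) <= limit:
        return {"transport": "in-band", "body": payload}
    key = f"{kind}:{hash(payload)}"  # e.g. memcached_client.set(key, payload)
    return {"transport": "memcached", "key": key}
```

Small problems travel inside the message; anything larger goes to Memcached, keeping the brokers' queues lightweight.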
The scheduler is responsible for allocating compute resources to individual jobs. It constructs a set of message queues for task and result delivery across different RabbitMQ instances and, via the control plane, instructs a subset of compute nodes to join those queues. The scheduler has a set of sophisticated policies that allow customers to reserve compute resources for different users, account for failed resources, and handle jobs that require specialized hardware, such as GPUs present on only a subset of the nodes. Scheduling decisions take 100-200 msec when the cluster's relative geographic proximity allows for low-latency communication.
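To make the policy ideas concrete, here is a toy greedy allocator that honors per-user reservations and GPU requirements. It is a minimal sketch under assumed data shapes; the real scheduler's policies and interfaces are not published in this article:

```python
def allocate(jobs, nodes):
    """Greedily grant cores to jobs, largest first.
    jobs:  [{'id', 'cores', 'needs_gpu', 'user'}]
    nodes: [{'id', 'free_cores', 'has_gpu', 'reserved_for'}]
    Returns {job_id: [(node_id, cores_granted), ...]}."""
    plan = {}
    for job in sorted(jobs, key=lambda j: j["cores"], reverse=True):
        needed, grants = job["cores"], []
        for node in nodes:
            if needed == 0:
                break
            if job["needs_gpu"] and not node["has_gpu"]:
                continue  # job requires specialized hardware
            if node["reserved_for"] not in (None, job["user"]):
                continue  # node is reserved for another user
            take = min(needed, node["free_cores"])
            if take:
                node["free_cores"] -= take
                grants.append((node["id"], take))
                needed -= take
        plan[job["id"]] = grants
    return plan
```

A GPU job is placed only on GPU-capable nodes, while CPU-only jobs fill the remaining cores.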
The job interface node provides a way to submit jobs to the cluster and retrieve results via REST and WSDL interfaces. The interface node provides redundancy by resubmitting failed jobs. Jobs are submitted to compute nodes via the data-plane communication fabric. There, client application code can break each job into compute tasks to be submitted to the job queue constructed by the scheduler, or further decompose it into subtasks through a series of steps. Using the HPC resources themselves for job decomposition allows this potentially computationally expensive step to be done in minimal time. Client code provided by the end user is optimized to break down each job type in the most efficient manner possible, interfacing to the HPC software via a set of libraries and well-defined APIs. The decomposition chain is reversed for result accumulation, until the final set of results is delivered to the interface host for client pickup. Components in the job chain control the degree of job parallelism and can inform the scheduler of their resource needs, allowing the scheduler to divert reserved but underutilized resources to alternate tasks if permitted by policy.
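The decomposition chain and its reversal can be illustrated with a toy example. Here a "job" is simply a list of work items split recursively into bounded tasks, then merged back in order; the real client-provided splitting code is of course application-specific, and these names are hypothetical:

```python
def decompose(job, max_task_size):
    """Recursively split a job (a list of work items) into tasks of
    at most max_task_size items, halving oversized chunks."""
    if len(job) <= max_task_size:
        return [job]
    mid = len(job) // 2
    return decompose(job[:mid], max_task_size) + decompose(job[mid:], max_task_size)

def accumulate(results):
    """Reverse the chain: merge per-task results back into one job result."""
    merged = []
    for result in results:
        merged.extend(result)
    return merged
```

Because `decompose` is itself parallelizable at each level, running it on the cluster keeps even expensive breakdowns fast.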
Compute nodes pull individual tasks from a designated message queue, fetch out-of-band data if necessary for task execution, and either perform the end calculations or break tasks down further, submitting the subtasks back to the queue. The actual computation software is provided by the client application and is distributed to individual nodes via the orchestration layer. The compute nodes monitor utilization of the cores assigned to individual job queues and will fetch tasks from alternative queues if assigned queues stay idle for longer than half the running average of task duration, or a policy-controlled timeout. Accumulated computational results are posted back to result queues or uploaded to the Memcached cluster, with notification transmitted via the data plane.
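The queue-switching rule amounts to tracking a running average of task duration and comparing idle time against half of it (capped by any policy timeout). A minimal sketch of that logic, with assumed names, might look like this:

```python
class QueueMonitor:
    """Tracks task durations on one job queue and decides when an idle
    core should switch to an alternative queue. Illustrative sketch;
    the product's actual monitoring code is not shown in the article."""

    def __init__(self, policy_timeout_s=None):
        self.avg_task_s = 0.0
        self.n = 0
        self.policy_timeout_s = policy_timeout_s

    def record_task(self, duration_s):
        # Incremental running average of observed task durations.
        self.n += 1
        self.avg_task_s += (duration_s - self.avg_task_s) / self.n

    def should_switch(self, idle_s):
        # Switch once idle longer than half the average task duration,
        # or sooner if a policy timeout is configured below that.
        threshold = self.avg_task_s / 2
        if self.policy_timeout_s is not None:
            threshold = min(threshold, self.policy_timeout_s)
        return self.n > 0 and idle_s > threshold
```

With tasks averaging six seconds, a core waits about three seconds on an empty queue before stealing work elsewhere.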
Performance
The overall design results in efficient utilization of available hardware resources, from a single job to hundreds of simultaneous jobs. Utilization in the 95% range is possible as long as the job breakdown yields more simultaneous tasks than there are total available compute cores.
Security
Security considerations for public clouds are handled via optional payload encryption in the libraries interfacing to the communication fabric, and by restricting node access via a distributed firewall.
Conclusion
In initial installations, we found that the solution significantly reduced the time for routine but vital work, eliminating weeks from schedules. As a result of this work, there is now a compelling approach that more companies are ready to deploy. By consolidating what previously took a month on local hardware into a three-hour run in the cloud, companies can not only decrease their costs, but also create a new competitive edge that helps them serve their customers faster and better.
A few final words about public vs. private clouds. High-density, low-cost compute blades are available from various vendors, with the current cost of a 10K-core cluster in the $4 million range. The cost of lighting up 10K cores on Amazon is roughly $600K a month, although this can be lowered substantially via reserved or spot instances. In general, though, if your utilization demands continuous usage, you could pay Amazon more in one year than the hardware acquisition cost of the cluster.
For bursty, highly scalable compute problems, however, HPC in the cloud is an excellent and affordable alternative for many companies.
About the Author
Gregory Elkinbard, Senior Director, Program Management, Mirantis
Greg leads a breakthrough technology team working on cloud and high performance computing solutions at Mirantis, a software engineering company delivering custom cloud platforms. During his career, he has brought to market many successful cloud, Web, security, networking, and embedded OS development projects, resulting in product lines of more than $30 million in revenue.
At Mirantis, he is currently involved with building on-demand IaaS and PaaS for public and private clouds, enabling architecture, grid solutions and highly scalable HPC clusters. These solutions are in very high demand by Mirantis customers to fuel their company’s growth.
At ServiceLive, a unit of Sears Holdings Corporation, Greg’s responsibilities as a team manager were focused on building B2B and B2C Web portals, and data warehouse applications. At Brocade Communications, he managed block storage virtualization projects and development of the CIFS virtualization gateway for the Brocade Files Team. In his early career, Greg worked as an engineer at Resonate, Sun Microsystems, and Amdahl.
He has a BS in Computer Science and Engineering from the University of California, Davis. Reach Greg at [email protected]