Running Compute Intensive Tasks in the Cloud

By Greg Elkinbard, Mirantis

October 25, 2011

The advent of public clouds has brought large-scale HPC resources within easy reach of ordinary companies. In many situations, especially for temporary projects, cloud solutions can be more affordable for HPC than acquiring the necessary amount of compute power in-house. Before cloud, only a few companies, such as giant financial services firms, could afford this type of resource.

Last year, there appeared to be considerable customer demand in our market, from companies in a variety of industries, for a large-scale HPC cluster to test software platforms. When we initially described this need to traditional HPC vendors, we were constantly asked what industry consortium or government agency was seeking this work, because of the sheer magnitude of the HPC environment being sought. We were met with polite disbelief when we told them that this was for individual companies, not a giant organization. In the end, we decided to build the cluster software ourselves, with the goal of being able to run it on public and private clouds.

When we looked at various commercial and open source options for developing our software, we found that most of them are optimized for running different generic applications on the cluster simultaneously. To accommodate such requirements, the cluster is hard partitioned, with different operating systems installed on individual compute nodes and each machine reserved for a particular application in advance, regardless of how many compute resources the application actually needs. This results in fairly low utilization of available compute resources, on the order of 30% on average. That is an adequate compromise for people who want to create an HPC cluster as a generic resource and then rent it out to the public, but it is not the best policy for serving the needs of a typical commercial user.

The HPC Solution

We decided to build the HPC software using different principles. This project was undertaken initially for some customers who were especially interested in this solution. They were seeking to reduce both costs and time. They didn’t want to spend upwards of $1 million to accomplish the work, and didn’t think they would be competitive in their own markets if it took weeks for the results.

When talking to users of HPC resources, we noticed that a typical company has only a single HPC application, or a set of related applications, built around a common compute platform. What they want is a software platform capable of minimizing their compute times while maximizing utilization of available resources.

The HPC cluster was to be designed differently, around the principle of dynamically scheduling individual cores to ensure maximum hardware utilization. The cluster runs a single compute platform and handles related requests from a single vendor. This allows the security model to be relaxed: code from different compute jobs can share the operating system, which lets us switch cores from one job to another in near real time.

The cluster is designed to solve a subset of problems in the HPC world, rather than attempting to be a generic, all-encompassing compute solution. The subset of problems we chose to tackle are embarrassingly parallel in nature. Individual computation requirements are substantial enough that compute time is at least an order of magnitude higher than the distribution time. Both the problem set and the solution results are small enough to be efficiently transmitted across the available network topology, and the distribution time is several orders of magnitude lower than the overall job run time. The cluster software is designed to be integrated at the code level into a particular compute-intensive application, rather than provide a set of generic remote interfaces.

Cluster Architecture

The cluster is designed to break jobs up into individual compute tasks, efficiently execute those tasks on available hardware resources, and return the results to the client application. It operates on bare hardware, private clouds running OpenStack, and public clouds. Different deployment scenarios are designed to address a range of available resources. Bare metal is the most efficient when the customer can afford to dedicate fixed compute resources to a single application. Private clouds serve well to distribute a company's internal hardware resources between applications, and provide an easy way to provision a different set of compute nodes or shift the available resources from the HPC cluster to other needs. Public clouds serve well when bursty load levels and occasional demand make it impractical to purchase your own hardware.

The HPC software uses Apache Libcloud to drive provisioning across multiple hardware platforms. We have been major contributors to the Libcloud project, and have used it effectively in a number of software projects for customers. The HPC software includes a custom orchestration layer we developed. On top of the orchestration layer are the cluster HPC components that control efficient job execution: the scheduler, job interface node, communication fabric, and compute nodes.
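
The article does not include code, but a minimal sketch of what driving provisioning through Apache Libcloud can look like is shown below. The provider choice, credentials, and node names are placeholders of ours, not the product's actual orchestration layer.

```python
# Minimal provisioning sketch using Apache Libcloud. Provider, credentials,
# and naming are illustrative placeholders, not the actual orchestration code.
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

def provision_compute_nodes(count, access_key, secret_key):
    """Boot `count` compute nodes on EC2 and wait until they are running."""
    driver_cls = get_driver(Provider.EC2)
    driver = driver_cls(access_key, secret_key)

    # A real deployment would pin exact image and size IDs; we just take
    # the first entries here to keep the sketch short.
    image = driver.list_images()[0]
    size = driver.list_sizes()[0]

    nodes = [driver.create_node(name='hpc-compute-%d' % i,
                                image=image, size=size)
             for i in range(count)]

    # Returns (node, ip_addresses) tuples once the nodes are reachable.
    return driver.wait_until_running(nodes)
```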

Components

The communication fabric consists of a collection of RabbitMQ nodes. Individual instances are allocated either to the control plane, which carries reconfiguration and status messages, or to the data plane, which transports tasks to compute nodes and returns results. RabbitMQ instances are not clustered, as we have found that clustering severely decreases client reconnect rates. Instead, the client communication libraries distribute requests across the related RabbitMQ instances, providing an efficient scaling mechanism. Typically, problem and result payloads are sent in-band. However, the software has provisions to deploy a Memcached cluster to enhance RabbitMQ scalability for larger payloads. Memcached benefits configurations where the problem set is greater than tens of kilobytes and individual task solution sets are greater than a few hundred kilobytes.
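
As an illustration of this split between in-band and out-of-band payloads, the sketch below publishes tasks across independent, non-clustered RabbitMQ instances and offloads large payloads to Memcached. The client libraries (pika, python-memcached), host names, queue naming, and the exact size threshold are assumptions of ours, not details taken from the product.

```python
import json
import uuid
import pika       # RabbitMQ client (one possible choice)
import memcache   # python-memcached client (one possible choice)

# Independent, non-clustered RabbitMQ instances forming the data plane.
DATA_PLANE_HOSTS = ['rabbit-1.internal', 'rabbit-2.internal']
OUT_OF_BAND_THRESHOLD = 100 * 1024  # assumed cutoff for Memcached offload

mc = memcache.Client(['memcached-1.internal:11211'])

def publish_task(job_id, task_payload):
    """Send one task, spreading load across RabbitMQ instances by job id."""
    # Client-side distribution instead of broker clustering.
    host = DATA_PLANE_HOSTS[hash(job_id) % len(DATA_PLANE_HOSTS)]
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue='tasks.%s' % job_id, durable=True)

    body = json.dumps(task_payload)
    if len(body) > OUT_OF_BAND_THRESHOLD:
        # Large payload: park it in Memcached, send only a reference in-band.
        ref = 'payload:%s' % uuid.uuid4().hex
        mc.set(ref, body)
        message = json.dumps({'payload_ref': ref})
    else:
        message = json.dumps({'payload': task_payload})

    channel.basic_publish(exchange='', routing_key='tasks.%s' % job_id,
                          body=message)
    connection.close()
```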

The scheduler is responsible for allocating compute resources to individual jobs. It constructs a set of message queues for task and result delivery across different RabbitMQ instances and instructs a subset of compute nodes, via the control plane, to join those queues. The scheduler has a set of sophisticated policies that allow customers to reserve compute resources for different users, account for failed resources, and handle jobs that need specialized hardware, such as GPUs, which may be present on only a subset of the nodes. Scheduling decisions take 100-200 msec when the cluster's geographic proximity allows for low-latency communication.
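
The scheduling policies themselves are not described in detail; the following is a rough conceptual sketch of ours showing how core allocation could honor reservations, skip failed nodes, and respect specialized hardware requirements such as GPUs. It is not the product's scheduler code.

```python
def allocate_cores(job, nodes, reservations):
    """Conceptual core-allocation sketch; the real scheduler policies are
    richer (queue construction, preemption, failure accounting, etc.)."""
    needed = job['requested_cores']
    reserved = reservations.get(job['user'], 0)   # cores reserved for this user
    granted = []

    for node in nodes:
        if node['failed']:
            continue                               # account for failed resources
        if job.get('needs_gpu') and not node['has_gpu']:
            continue                               # GPUs exist on a subset of nodes
        take = min(node['idle_cores'], needed - len(granted))
        granted.extend((node['name'], core) for core in range(take))
        if len(granted) >= needed:
            break

    # Consume the user's reservation first; anything beyond it comes from
    # the shared pool and may later be diverted to other jobs by policy.
    return {'granted': granted, 'from_reservation': min(len(granted), reserved)}
```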

The job interface node provides a way to submit jobs to the cluster and retrieve results via REST and WSDL interfaces. The interface node provides redundancy by resubmitting failed jobs. Jobs are submitted to compute nodes via the data plane of the communication fabric. There, client application code can break the jobs up into compute tasks to be submitted to the job queue constructed by the scheduler, or further broken down into subtasks through a series of steps. Utilizing the HPC resources for job decomposition allows this potentially expensive computational step to be done in minimal time. Client code provided by the end user is optimized to break down each job type in the most efficient manner possible, interfacing to the HPC software via a set of libraries and well-defined APIs. The decomposition chain is reversed for result accumulation, until the final set of results is delivered to the interface host for client pickup. Components in the job chain control the degree of job parallelism and can inform the scheduler of their resource needs, allowing the scheduler to divert reserved but underutilized resources to alternate tasks if permitted by policy.
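
As a hedged illustration of the decomposition chain (the class and method names are placeholders rather than the actual client API), client code that recursively splits a job into tasks and folds the results back together might look like this:

```python
class JobDecomposer(object):
    """Illustrative client-side decomposition hook; names are placeholders.

    The HPC libraries would call split() as a job or task arrives and
    combine() as results flow back up the decomposition chain.
    """

    def __init__(self, max_task_size):
        self.max_task_size = max_task_size

    def split(self, work_items):
        """Break a job into tasks small enough to schedule individually."""
        if len(work_items) <= self.max_task_size:
            return [work_items]            # already a leaf task
        mid = len(work_items) // 2
        # Recursive decomposition; each half may be split again on a
        # compute node, keeping this step cheap and parallel.
        return self.split(work_items[:mid]) + self.split(work_items[mid:])

    def combine(self, partial_results):
        """Fold partial results back into one result set (reverse of split)."""
        merged = []
        for partial in partial_results:
            merged.extend(partial)
        return merged
```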

Compute nodes pull individual tasks from a designated message queue, fetch out-of-band data if necessary for task execution, and either perform the end calculations or break tasks down into further tasks, submitting them back to the queue. The actual computation software is provided by the client application and is distributed to individual nodes via the orchestration layer. The compute nodes monitor utilization of the cores assigned to individual job queues and will fetch tasks from alternative queues if the assigned queues stay idle for longer than half the running average of the task duration, or a policy-controlled timeout. Accumulated computational results are posted back to results queues, or uploaded to the Memcached cluster with a notification transmitted via the data plane.
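
The idle-detection rule lends itself to a short sketch. The code below is our own illustration of the "half the running average task duration, or a policy timeout" heuristic, not the shipped compute-node code; the moving-average weighting and default timeout are assumptions.

```python
import time

class QueueIdleTracker(object):
    """Decides when a core should fetch work from an alternative queue."""

    def __init__(self, policy_timeout=30.0):
        self.policy_timeout = policy_timeout   # assumed policy fallback, seconds
        self.avg_duration = None               # running average of task duration
        self.last_task_done = time.time()

    def record_task(self, duration):
        """Update the running average after a task completes."""
        if self.avg_duration is None:
            self.avg_duration = duration
        else:
            # Exponential moving average keeps the estimate cheap to maintain.
            self.avg_duration = 0.8 * self.avg_duration + 0.2 * duration
        self.last_task_done = time.time()

    def should_fetch_from_alternate_queue(self):
        """True once the assigned queue has been idle longer than half the
        running average task duration, or the policy timeout."""
        idle = time.time() - self.last_task_done
        if self.avg_duration is None:
            return idle > self.policy_timeout
        return idle > self.avg_duration / 2.0 or idle > self.policy_timeout
```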

Performance

The overall design results in efficient utilization of available hardware resources, from a single job to hundreds of simultaneous jobs. Utilization in the 95 percent range is possible as long as the job breakdown results in a number of simultaneous tasks greater than the total number of available compute cores.

Security

Security considerations for public clouds are handled via optional payload encryption in the libraries interfacing to the communication fabric, and by restricting node access via a distributed firewall.
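
The article does not name the cipher or library used for payload encryption. As one possible sketch, symmetric encryption at the library layer using the cryptography package's Fernet recipe could look like the following; key management and the cipher actually used by the product are not described here, and key distribution is left out entirely.

```python
from cryptography.fernet import Fernet

# Shared symmetric key distributed to trusted cluster nodes out of band.
# (Fernet is one possible choice, not necessarily the product's scheme.)
key = Fernet.generate_key()
cipher = Fernet(key)

def encrypt_payload(payload_bytes):
    """Encrypt a task or result payload before it enters the data plane."""
    return cipher.encrypt(payload_bytes)

def decrypt_payload(token):
    """Decrypt a payload received from the communication fabric."""
    return cipher.decrypt(token)
```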

Conclusion

In initial installations, we found that the solution significantly reduced the time for routine, but very vital, work, eliminating weeks from schedules. As a result of this work, there is now a very compelling approach that more companies are ready to deploy. By consolidating what earlier took a month to accomplish on local hardware into a three-hour run in the cloud, companies can not only decrease their costs, but also create a new competitive edge that helps them serve their customers faster and better.

A few final words about public vs. private clouds. High-density, low-cost compute blades are available from various vendors, with the current cost of a 10K-core cluster in the $4 million range. The cost of lighting up 10K cores through Amazon is roughly $600K a month, although this cost can be lowered substantially via reservations or by using spot instances. In general, however, you could pay more to Amazon in one year than the hardware acquisition cost of the cluster if your utilization demands continuous usage.
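
Using the figures quoted above, the break-even point works out roughly as follows; this is a back-of-the-envelope calculation that ignores power, cooling, and operations costs for owned hardware.

```python
# Back-of-the-envelope break-even using the figures quoted above.
hardware_cost = 4_000_000         # 10K-core cluster purchase, USD
cloud_cost_per_month = 600_000    # 10K cores on Amazon, on-demand, USD/month

print(hardware_cost / cloud_cost_per_month)   # ~6.7 months to break even
print(12 * cloud_cost_per_month)              # $7.2M for one year of cloud usage
```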

However, for bursty highly scalable compute problems, HPC in the cloud is an excellent and affordable alternative for many companies.

About the Author

Gregory Elkinbard, Senior Director, Program Management, Mirantis

Greg leads a breakthrough technology team working on cloud and high performance computing solutions at Mirantis, a software engineering company delivering custom cloud platforms. During his career, he has brought to market many successful cloud, Web, security, networking, and embedded OS development projects, resulting in product lines of more than $30 million in revenue.

At Mirantis, he is currently involved with building on-demand IaaS and PaaS for public and private clouds, enabling architecture, grid solutions and highly scalable HPC clusters. These solutions are in very high demand by Mirantis customers to fuel their company’s growth.

At ServiceLive, a unit of Sears Holdings Corporation, Greg’s responsibilities as a team manager were focused on building B2B and B2C Web portals, and data warehouse applications. At Brocade Communications, he managed block storage virtualization projects and development of the CIFS virtualization gateway for the Brocade Files Team. In his early career, Greg worked as an engineer at Resonate, Sun Microsystems, and Amdahl.

He has a BS in Computer Science and Engineering from the University of California, Davis. Reach Greg at [email protected]
