June 21, 2012
Cloud services have become popular by offering accessibility, flexibility and capacity while remaining a fairly low-cost technology for the end user. For all the benefits providers can deliver, they are unable to guarantee 100 percent availability. A recent report from the International Working Group on Cloud Computing Resiliency (IWGCR) recounts past failures and the resulting financial repercussions.
The report tracked 13 cloud providers from 2007 onward, including industry heavyweights like Microsoft, Amazon and Google. Results showed the services racked up a combined 568 hours of downtime. 2009 was especially tough for Microsoft and Amazon, as both experienced multiple outages, amounting to roughly 48 hours.
The report helps to increase awareness that failures are almost unavoidable. Perhaps the most notable example comes from Navisite in November 2007, following its acquisition of Alabanza Corp. The company attempted to migrate and replace hundreds of Alabanza servers, moving them to its Andover facility, but a failure led to 165,000 websites going offline for seven days.
Outages have resulted in significant financial losses as well. Total damages across the 13 providers amounted to $71.7M. Navisite suffered the most, losing approximately $17M as a result of the 2007 outage.
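To put the report's totals in perspective, the figures quoted above can be turned into rough per-provider averages. The sketch below uses only the three numbers cited in this article (13 providers, 568 hours, $71.7M). The five-year observation window is an assumption made here for illustration, not a figure from the report, and the function name is hypothetical. Averages like these hide large differences between providers, so they should be read as a back-of-the-envelope estimate only.

```python
# Figures quoted in the article from the IWGCR report.
PROVIDERS = 13
TOTAL_DOWNTIME_HOURS = 568
TOTAL_LOSSES_USD = 71.7e6

# Assumption (not from the report): a five-year window, roughly 2007 to 2012.
ASSUMED_WINDOW_HOURS = 5 * 365 * 24


def downtime_summary(providers, downtime_hours, losses_usd, window_hours):
    """Return simple averages derived from aggregate downtime figures."""
    avg_downtime = downtime_hours / providers
    return {
        "avg_downtime_hours_per_provider": avg_downtime,
        "avg_loss_usd_per_provider": losses_usd / providers,
        "loss_usd_per_downtime_hour": losses_usd / downtime_hours,
        # Fraction of the assumed window during which an average provider was up.
        "implied_availability": 1 - avg_downtime / window_hours,
    }


if __name__ == "__main__":
    summary = downtime_summary(
        PROVIDERS, TOTAL_DOWNTIME_HOURS, TOTAL_LOSSES_USD, ASSUMED_WINDOW_HOURS
    )
    for key, value in summary.items():
        print(f"{key}: {value:,.4f}")
```

Under these assumptions the averages work out to roughly 43.7 hours of downtime per provider and roughly $126,000 in losses per hour of downtime, with an implied availability of about 99.9 percent.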
While a number of users suffer downtime along with their providers, some have designed their operations with failure in mind. In April 2011, Amazon suffered an outage that knocked out sites including Foursquare, Reddit, Quora, Hootsuite, Heroku and Engine Yard. Netflix, on the other hand, was able to weather the storm. In a blog post following the outage, the online video rental service said it had designed its systems for this type of issue:
“Why were some websites impacted while others were not? For Netflix, the short answer is that our systems are designed explicitly for these sorts of failures. When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3 and Cassandra services that we do depend upon were not affected by the outage.”
The blog post continued, recounting a number of design features that increase Netflix’s resiliency, but one stood out in particular. The company uses a service called “Chaos Monkey”, which simulates failures. Similar to emergency drills, the program shuts down internal services to prepare the engineering team for real-world scenarios:
“We run this service because we want engineering teams to be used to a constant level of failure in the cloud.”
Netflix has continued to harness the benefits of AWS with the expectation that outages will occur.
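The failure-drill idea can be illustrated with a toy simulation. The sketch below is not Netflix's Chaos Monkey and does not use any AWS API. The `Instance` class, the `chaos_monkey` function and the `call_with_failover` helper are all hypothetical names invented here, and the rule that one instance is always left running is a simplification made so the example stays self-contained. It only shows the general pattern: something deliberately terminates instances at random, and the calling code is written to survive that.

```python
import random


class Instance:
    """A hypothetical service instance that is either alive or terminated."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_monkey(instances, rng, kill_probability=0.3):
    """Randomly terminate instances, always leaving at least one alive."""
    killed = []
    for inst in instances:
        alive_count = sum(i.alive for i in instances)
        if inst.alive and alive_count > 1 and rng.random() < kill_probability:
            inst.alive = False
            killed.append(inst.name)
    return killed


def call_with_failover(instances, request):
    """Try each instance in turn and return the first successful response."""
    for inst in instances:
        try:
            return inst.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all instances are down")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a drill can be replayed
    pool = [Instance(f"node-{n}") for n in range(4)]
    for drill in range(3):
        killed = chaos_monkey(pool, rng)
        print(f"drill {drill}: terminated {killed}")
        print(call_with_failover(pool, "GET /titles"))
```

A client that called a single fixed instance would start failing as soon as that instance was terminated. Routing every request through a failover path is what lets the drill run without users noticing.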
As with any always-on utility, cloud services are prone to temporary failures. That prospect may deter a number of potential adopters, but these examples show the risks can be mitigated through sound planning and design.