June 21, 2012
Cloud services have become popular by offering accessibility, flexibility and capacity while remaining a fairly low-cost technology for the end user. For all the benefits providers can deliver, they are unable to guarantee 100 percent availability. A recent report from the International Working Group on Cloud Computing Resiliency (IWGCR) recounts past failures and the resulting financial repercussions.
The report has tracked 13 cloud providers since 2007, including industry heavyweights like Microsoft, Amazon and Google. Results showed the services racked up a combined 568 hours of downtime. 2009 was especially tough for Microsoft and Amazon, as both experienced multiple outages, amounting to roughly 48 hours.
The report helps to increase awareness that failures are almost unavoidable. Perhaps the most notable example comes from Navisite in November 2007, following the acquisition of Albanza Corp. The company attempted to migrate and replace hundreds of Albanza servers to its Andover facility, but a failure led to 165,000 websites going offline for seven days.
Outages have resulted in significant financial losses as well. Total damages across the 13 providers amounted to $71.7M. Navisite suffered the most, losing approximately $17M as a result of the 2007 outage.
While a number of users suffer downtime along with their providers, some have designed their operations with failure in mind. In April 2011, Amazon suffered an outage that knocked out sites including Foursquare, Reddit, Quora, Hootsuite, Heroku and Engine Yard. Netflix, on the other hand, was able to weather the storm. In a blog post following the outage, the online video rental service said it had designed its systems for this type of issue.
“Why were some websites impacted while others were not? For Netflix, the short answer is that our systems are designed explicitly for these sorts of failures. When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3 and Cassandra services that we do depend upon were not affected by the outage.”
The blog post continued, recounting a number of design features that increase Netflix’s resiliency, but one stood out in particular. The company uses a service called “Chaos Monkey”, which simulates failures. Similar to emergency drills, the program shuts down internal services to prepare the engineering team for real-world scenarios.
“We run this service because we want engineering teams to be used to a constant level of failure in the cloud.”
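The failure-drill idea can be illustrated with a small sketch. This is not Netflix's Chaos Monkey or its API; it is a toy model under stated assumptions, in which the `Instance` class, the `chaos_step` function and the `call_with_failover` helper are all hypothetical names invented for the example. It shows one randomly chosen instance being shut down and a caller that keeps working by trying the remaining instances.

```python
import random


class Instance:
    """Hypothetical stand-in for one running service instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        # A terminated instance refuses every request.
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_step(instances, rng):
    """Terminate one randomly chosen live instance, as a drill.

    Returns the terminated instance, or None if none were alive.
    """
    live = [inst for inst in instances if inst.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def call_with_failover(instances, request):
    """Try each instance in turn so one failure is not user-visible."""
    for inst in instances:
        try:
            return inst.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all instances are down")


if __name__ == "__main__":
    rng = random.Random()  # unseeded: the victim differs between runs
    pool = [Instance(f"svc-{n}") for n in range(3)]
    victim = chaos_step(pool, rng)
    print("terminated:", victim.name)
    print(call_with_failover(pool, "GET /titles"))
```

Running the drill routinely, rather than waiting for a real outage, is what exposes callers that lack a failover path like the one above.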
Netflix has continued to harness the benefits of AWS with the expectation that outages will occur.
As with any always-on utility, cloud services are prone to temporary failures. This reality may deter a number of potential adopters, but the examples above show that those risks can be mitigated through sound planning and design.