With the proliferation of public cloud infrastructures, our dependence on them has increased. Many vital services in research, industry and even everyday life have moved en masse onto the cloud. So what happens when the cloud services we depend on go down? Dr. Jose Luis Vazquez-Poletti shares some key thoughts on how the scientific community can provide answers to this problem.
On April 21, popular Web 2.0 services such as Reddit, Foursquare, Dipity, Quora, BigDoor and mobypicture dropped off the map. After the first chaotic moments, it became clear that the hosting service had something to do with the outage of all these services, so everyone pointed at Amazon. As a fun fact, Amazon had to deny any involvement of Skynet, the AI system from the Terminator saga, which was expected to take over humanity on 21 April 2011.
Returning to the cloud, did the entire Amazon infrastructure go down? The answer is no: the problem was located in the Northern Virginia (US-East) region and affected all of its availability zones. What started as a simple “networking event” ended in a resource shortage caused by a large amount of re-mirroring of Elastic Block Store (EBS) volumes. You can read the complete report on the incident here.
Why were these Web 2.0 services all hosted in the same region when there were five to choose from (2 in the USA, 1 in Europe and 2 in Asia)? The answer is price: the US-East region offers the cheapest instances in all modalities (on demand, spot and reserved). All of these services put their trust in the infallibility of Amazon and never thought of replicating their services in other regions.
The next aspect to consider is the Service Level Agreement (SLA) offered by Amazon. It promises to keep sites up and running 99.95% of the year, or it will return 10% of the monthly bill. This means that the allowed downtime is about 4.4 hours per year, but many services needed 4 days to recover!
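The downtime figure above follows from simple arithmetic, which the short sketch below reproduces. The function names are illustrative, and the calculation assumes a 365-day year and a 4-day outage counted as 96 continuous hours; it is not Amazon's own SLA accounting.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored (assumption)


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability percentage over a period."""
    return period_hours * (1 - availability_pct / 100)


def achieved_availability_pct(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Availability actually delivered, given the observed downtime."""
    return 100 * (1 - downtime_hours / period_hours)


budget = allowed_downtime_hours(99.95)  # yearly budget under a 99.95% SLA
outage = 4 * 24                         # a 4-day recovery, in hours

print(f"Allowed downtime per year: {budget:.2f} h")
print(f"A 4-day outage uses {outage / budget:.1f}x the yearly budget")
print(f"Resulting yearly availability: {achieved_availability_pct(outage):.2f}%")
```

Under these assumptions, a single 4-day outage consumes roughly 22 years' worth of the downtime budget.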
It is clear that cloud computing has moved the center of gravity of distributed application execution, by exploiting virtualization at different layers and by adding a level of complexity to the scheduling problem. While cloud computing can bring more flexibility to the design of applications, it also raises new research challenges. Compared with the traditional method of dedicating one server to a single application, consolidation through virtualization can boost the resource utilization rate by aggregating workloads from separate machines onto a small number of servers. Workloads can now be executed in a dense environment using far fewer machines, in which the impact of faults can be vastly magnified. For example, any single hardware failure will affect all the virtual servers on that physical machine, and under dynamic workloads it may be difficult to distinguish real faults from normal system behavior.
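To make the magnification effect concrete, here is a minimal sketch that compares how many virtual servers are lost when one physical machine fails. The workload and host counts are invented for illustration, and the model assumes workloads are spread evenly across hosts.

```python
import math


def servers_lost_per_host_failure(total_workloads, hosts):
    """Worst-case workloads lost when one host fails, assuming an even spread."""
    return math.ceil(total_workloads / hosts)


workloads = 100  # hypothetical number of applications

# Traditional setup: one dedicated server per application.
dedicated = servers_lost_per_host_failure(workloads, hosts=100)

# Consolidated setup: the same workloads packed onto 5 virtualized hosts.
consolidated = servers_lost_per_host_failure(workloads, hosts=5)

print(f"Dedicated: {dedicated} workload lost per hardware failure")
print(f"Consolidated: {consolidated} workloads lost per hardware failure")
```

In this toy model, consolidating onto 5 hosts raises the cost of one hardware failure from 1 workload to 20.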
Revisiting fault tolerance concepts is essential when provisioning is left to public cloud infrastructures, where a budget must also be met. Different strategies can be tailored, from hybrid architectures to service distribution across cloud providers. Additionally, cloud providers typically establish Service Level Agreements (SLAs) with their customers, and providers must also enforce Quality of Service (QoS) in their infrastructures, in an unreliable and highly dynamic environment.
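One of the simplest forms of service distribution is failover between replicas hosted in different regions or with different providers. The sketch below is only an illustration of that idea: the endpoint names are made up, and the health check is passed in as a plain function so that no particular provider API is assumed.

```python
from typing import Callable, Iterable, Optional


def pick_endpoint(endpoints: Iterable[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A failing health check is treated the same as an unhealthy replica.
            continue
    return None


# Hypothetical replicas, cheapest region first, another provider last.
replicas = [
    "us-east.example.org",
    "eu-west.example.org",
    "other-provider.example.net",
]

# Simulated outage of the primary region (stand-in for a real health probe).
down = {"us-east.example.org"}
chosen = pick_endpoint(replicas, lambda endpoint: endpoint not in down)
print(f"Serving from: {chosen}")
```

Keeping the cheapest region first preserves the price advantage in normal operation, while the other replicas are only used when the primary is unavailable.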
Cloud computing is playing an increasingly important role in current distributed computing, which involves a wide community. The cloud provides a scalable computational model in which users access services based on their requirements, without regard to where the services are hosted or how they are delivered: processing power, storage, network bandwidth or software usage can all be provided as services over the Internet. Consequently, applications developed on such on-demand infrastructures can be built upon more flexible principles, making them more fault tolerant, more resilient and more dynamic. Although past research on fault tolerance in distributed systems has produced a wide collection of algorithms for fault detection, identification and correction, these concepts will have to be revisited in the context of cloud computing.
In order to bring all these ideas together, I’m co-chairing the First International Workshop on fault Tolerant Architectures for Reliable Distributed Infrastructures and Services (TARDIS2011), which will be held at the Fourth IEEE International Conference on Utility and Cloud Computing (UCC2011). This event will take place in Melbourne (Australia) on December 5-8. Submissions are already open, and you are warmly invited to send yours. We expect contributions not only on cloud fault tolerance, but also on beneficial mechanisms coming from other paradigms, as explained before.
So… will you let the sky fall down?
About the Author
Dr. Jose Luis Vazquez-Poletti is an Assistant Professor in Computer Architecture at Complutense University of Madrid (Spain), and a Cloud Computing Researcher at the Distributed Systems Architecture Research Group (http://dsa-research.org/).
He is (and has been) directly involved in EU-funded projects, such as EGEE (Grid Computing) and 4CaaSt (PaaS Cloud), as well as many Spanish national initiatives.
From 2005 to 2009 his research focused on porting applications onto Grid Computing infrastructures, an activity that let him be “where the real action was”. These applications pertained to a wide range of areas, from Fusion Physics to Bioinformatics. During this period he acquired the skills needed for profiling applications and making them benefit from distributed computing infrastructures. Additionally, he shared these skills in many training events organized within the EGEE Project and similar initiatives.
Since 2010 his research interests have covered different aspects of cloud computing, always with real-life applications in mind, especially those pertaining to the high performance computing domain.
Website: http://dsa-research.org/jlvazquez/
Linkedin: http://es.linkedin.com/in/jlvazquezpoletti/