February 29, 2012
The Microsoft Windows Azure cloud experienced a serious outage early Wednesday morning (Greenwich Mean Time), preventing some customers from accessing data and applications. The problem, which affects the platform's service management component, is thought to be related to a leap day timing glitch. Microsoft has been actively working to restore the service, but Azure is still experiencing intermittent issues at the time of writing.
This is the company's official response:
On February 28th, 2012 at 5:45 PM PST Microsoft became aware of an issue impacting Windows Azure service management in a number of regions. This prevented customers from creating, updating or deleting their hosted services. The root cause of this incident has been traced back to a certificate issue that affected all Windows Azure sub-regions. In order to contain the incident we disabled service management for all Windows Azure sub-regions. During this incident there was no impact on Windows Azure Storage and all storage accounts remained fully accessible in all sub-regions.
The Windows Azure engineering team developed, validated and deployed a fix for this issue. The fix has been successfully deployed to most Windows Azure sub regions. Windows Azure service management has been restored in these regions. We were unable to deploy the fix successfully to some customers in 3 sub regions – North Central US, South Central US, and North Europe. Currently impacted customers may experience issues with Access Control 2.0, Marketplace, Service Bus and the Access Control & Caching Portal and as a result may experience a loss of application functionality. Engineering teams are actively working to resolve this issue as soon as possible.
We apologize for any inconvenience this has caused you and will update the Service Dashboard on an hourly basis until this incident is resolved. We are continuously working on improving the Windows Azure Platform to help ensure similar incidents do not occur in the future.
The Azure Service Dashboard first listed the problem on Feb. 29 at 1:45 a.m. GMT (Feb. 28, 5:45 p.m. PST):
We are experiencing an issue with Windows Azure service management. Customers will not be able to carry out service management operations. We are actively investigating this issue and working to resolve it as soon as possible. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
At 4 a.m. GMT, Feb. 29 (8 p.m. PST, Feb. 28), Microsoft said that it had identified the root cause of this incident, tracing it "back to a cert issue triggered on 2/29/2012 GMT."
At 9 a.m. GMT, Feb. 29 (1 a.m. PST, Feb. 29), the company announced that it would begin fixing the problem, starting with a gradual rollout of the hotfix in the North Central US sub-region. "As we proceed through the rollout," Microsoft stated, "we will progressively enable service management back for customers."
At 7:30 p.m. GMT, Feb. 29 (11:30 a.m. PST, Feb. 29), the software giant posted the following statement:
We are actively recovering Windows Azure hosted services in the North Central US, South Central US and North Europe sub-regions. More and more customers applications should be back up-and-running even if service management functionality is not yet restored. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
As of approximately 4:30 p.m. PST, the service is still experiencing multiple issues. While Microsoft has been proactive in issuing status updates, it has not provided an estimated time to full recovery. Software development times are notoriously unpredictable and the company doesn't want to over-promise, but that doesn't make the wait any less frustrating for customers.
This is not Azure's first face plant. Back in 2009, while still in beta, the service went down for 22 hours. Amazon and Google have also experienced outages. What does this say about the reliability of the cloud? As with any technology there is a risk-benefit profile, but reliability is supposed to be one of the cloud's calling cards. When it comes to lessons learned, this is a painful one, for customers and for the cloud provider. There's no question that people will be talking about these very public failures for a long time.
Update (information added to the article at 7:25 p.m. PST)
Microsoft has posted the following updates on the Azure Service Dashboard:
1:00 AM UTC We have restored full service management functionality for all Windows Azure hosted services in the North Central US sub-region. We have restored full service management functionality for most customers in the South Central US and North Europe sub-regions. We have published a recap of the incident since it started on the Windows Azure team blog (http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/windows-azure-service-disruption-update.aspx). Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
1:25 AM UTC We have restored full service management functionality for all Windows Azure hosted services in the North Europe sub-region. We have restored full service management functionality for most customers in the South Central US sub-region. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.