Solution Providers: AWS Cloud Outage Proves Relying On Cloud Vendors As Architects Is A 'Mistake'

Solution providers say the Amazon Web Services (AWS) S3 outage earlier this week points to a need for more aggressive independent cloud design and architecture.

"Cloud does not replace the need for good strong consulting and vision on how to actually architect for continuous delivery of business applications," said Jamie Shepard, senior vice president for health care and strategy at Lumenate, No. 152 on the 2016 CRN SP500. "When we architect systems we know that if there is a failover, there is enough full capacity to handle that workload. That is how we design infrastructure. That is why our hybrid cloud design business is accelerating. Relying on cloud vendors as architects is a mistake. You cannot rely on cloud vendors as architects or to be advocates for the customer."
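Shepard's failover rule can be expressed as a simple capacity check. The sketch below is illustrative only and is not Lumenate's tooling: the function name, site names and capacity figures are all hypothetical, and it assumes each site's capacity and the peak workload are measured in the same units.

```python
def survives_single_site_failure(site_capacity, peak_load):
    """Return True if peak_load still fits after losing any one site.

    site_capacity maps a hypothetical site name to the load it can carry,
    in the same units as peak_load (requests per second, for example).
    """
    if len(site_capacity) < 2:
        # A single site has nowhere to fail over to.
        return False
    total = sum(site_capacity.values())
    # Losing the largest site is the worst case, but checking every site
    # keeps the intent obvious.
    return all(total - capacity >= peak_load
               for capacity in site_capacity.values())


# Hypothetical numbers, chosen only to show both outcomes.
two_sites = {"on_prem": 100, "public_cloud": 100}
three_sites = {"on_prem": 100, "cloud_east": 100, "cloud_west": 100}

print(survives_single_site_failure(two_sites, 150))    # False
print(survives_single_site_failure(three_sites, 150))  # True
```

In this example, two sites of 100 units each cannot carry a 150-unit peak once one of them is lost, while three such sites can.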

Lumenate does not resell AWS cloud services, but it does leverage AWS and other public clouds, including Google and Azure, as part of complete software-defined hybrid cloud platforms, said Shepard.

For Shepard, one of the most telling takeaways from the four-hour outage on Tuesday was that AWS' own customer-facing Service Health Dashboard was not available during the crisis. "That's bad architecture," he said. "They basically leveraged one region and did not factor any redundancy into that system. AWS had to tweet the outage. That is how customers found out. That is unacceptable. No dashboard to look at. No one to call. You mean to tell me [if I'm an AWS customer whose] business is down and my users are asking me what is happening and I don't have a dashboard to tell me what is going on?"

AWS, for its part, apologized Thursday for the outage -- which was sparked by an AWS team member entering a bad command during the debugging of an S3 billing system -- and disclosed "several changes," including a move to run the Service Health Dashboard (SHD) across multiple AWS regions. "We understand the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions," the company said.

CRN reached out to AWS for further comment but did not receive a response at press time.

It is also telling, Shepard said, that AWS parent Amazon.com – the world's largest online retailer – remained up and running because it was architected to be geo-redundant while many other major retail websites took a hit. "The question is: did those customers know they did not have a geo-redundant system?" asked Shepard. "A lot of our customers right now are in a battle internally where people are telling them to leverage more public cloud. This is another proof point in the case for hybrid IT. It demonstrates the need for strong data center architects that understand the customers' business case and the challenges and pitfalls of public cloud enablement."

Apica, a website testing, optimization and monitoring provider, said that 54 of the top 100 internet retailers were affected by the outage, including three sites that went down completely – Express, Lululemon and One Kings Lane. Apica said S3 is Amazon's largest service and is used by more than half of its 1 million-plus customers, with 3 trillion to 4 trillion pieces of data in it.

Douglas Grosfield, the founder and CEO of Five Nines IT Solutions, a Kitchener, Ontario-based strategic service provider that provides high-level cloud consulting centered on public and private clouds, said far too many customers are blindly ceding their data to public cloud providers.

"The fox is watching the hen house when you have any major cloud provider like AWS acting as your sole source of advice, design and integration of your IT environment," said Grosfield. "I don't want to say the inmates are running the asylum, but that is clearly not best practice for cloud design, architecture and delivery. A basic tenet of ours is: you don't design a cloud architecture with a single point of failure."

Grosfield says he sees public cloud providers as the "Pied Piper" with far too many customers taking a "lemmings" approach with almost a "mindless drive" to public cloud. "Customers aren't thinking about the risks associated with putting all their eggs in one basket," he said. "You always need a plan B, a path to recovery in the event there is a failure. If you are offloading a workload to the public cloud, you better be setting up a hybrid architecture that has the level of failover and fault tolerance your business requires."

Grosfield said every time he consults with a customer that has started down the path of formulating a public cloud strategy they have inevitably looked at a public cloud as a "magic bullet."

"There is no single magic bullet," he said. "You need to invest in a distributed architecture. You have to take a hybrid approach. IT today is simply not a matter of speeds and feeds. It is delivering business outcomes to your end users and any 'all or nothing' tack will ultimately fail."

AWS said the S3 service disruption in the Northern Virginia (US-EAST-1) Region was caused by an S3 team member debugging an issue causing the S3 billing system to progress more slowly than expected. "Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended," said AWS.

As a result of the outage, AWS said, it has "added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future."
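The kind of safeguard AWS describes can be pictured as a guard that rejects any removal request that would drop a subsystem below its minimum capacity. The sketch below is an assumption-laden illustration, not AWS' actual tooling: the function, the exception class and the server counts are all hypothetical.

```python
class CapacityError(Exception):
    """Raised when a removal would breach the minimum capacity floor."""


def remove_servers(active, to_remove, min_required):
    """Return the server count left after removal, or refuse the request.

    All three arguments are hypothetical server counts for one subsystem.
    """
    if to_remove < 0 or to_remove > active:
        raise ValueError("to_remove must be between 0 and the active count")
    remaining = active - to_remove
    if remaining < min_required:
        # Refuse instead of trusting that the operator's input was correct.
        raise CapacityError(
            f"removing {to_remove} would leave {remaining} servers, "
            f"below the minimum of {min_required}"
        )
    return remaining


print(remove_servers(active=100, to_remove=10, min_required=80))  # 90

try:
    # A mistyped, oversized request is rejected rather than executed.
    remove_servers(active=100, to_remove=30, min_required=80)
except CapacityError as err:
    print(f"blocked: {err}")
```

The design point is that the check lives in the tool, so an incorrect input fails loudly before any capacity is taken out of service.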

Raymond Tuchman, CEO of Experis Technology Group, a fast-growing Potomac, Md.-based Hewlett Packard Enterprise private cloud powerhouse with its own 80,000-square-foot cloud services data center, said the AWS outage is going to go a long way toward helping to "educate" customers on the pros and cons of private versus public cloud.

Experis provides cloud consulting centered on both private and public cloud, but generally does not resell public cloud platforms like AWS, Google or Azure.

Tuchman says he sees many customers moving in a knee-jerk fashion to the public cloud as a means of putting IT infrastructure out of sight and out of mind. "A lot of clients want to outsource it and let a large company like AWS run it," he said. "They mistakenly think that then they won't have to think about it anymore. But that is a falsehood. When you outsource anything, you still have to concern yourself with it. It doesn't matter whether it is IT or something else. If you have someone doing plumbing at your house, you need to check to be sure they are credentialed and know what they are doing. You have to make sure they have the resources and capability to do it."

The problem is many corporations are "skeptical" of their own internal IT staff and are simply following prevailing "public opinion that says outsource it to AWS, Microsoft, Google or another large corporation and they will take care of me," said Tuchman. "That is simply not the case in real life. You have to worry. And as shown by the AWS outage one small error can affect thousands of customers."

Tuchman says customers are becoming increasingly aware that private cloud is in most cases more secure and more cost-effective than a public cloud. The cost differences, in fact, can be staggering, he said, with one recent cost analysis from one of his customers showing the cost of AWS at $159,000 per month versus $60,000 per month for private cloud.
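The arithmetic behind that comparison is worked through below, using only the two monthly figures Tuchman cited. The variable names are illustrative, and the numbers come from a single customer's analysis, so they should not be read as a general benchmark.

```python
# Monthly costs quoted by Tuchman from one customer's analysis.
aws_monthly = 159_000
private_monthly = 60_000

monthly_gap = aws_monthly - private_monthly    # dollars per month
annual_gap = monthly_gap * 12                  # dollars per year
cost_ratio = aws_monthly / private_monthly     # AWS cost as a multiple
savings_pct = monthly_gap / aws_monthly * 100  # share of the AWS bill saved

print(f"Monthly difference: ${monthly_gap:,}")        # $99,000
print(f"Annual difference:  ${annual_gap:,}")         # $1,188,000
print(f"AWS costs {cost_ratio:.2f}x the private cloud")  # 2.65x
print(f"Private cloud saves {savings_pct:.1f}%")      # 62.3%
```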

"If you have a consistent workload that is growing systematically, a private cloud is less risky, more secure and cost-effective than a public cloud," said Tuchman. "A lot of people now think you can't get fired if you buy AWS. But AWS is not infallible."