Editor’s note: This is the 17th article in the “Real Words or Buzzwords?” series from SecurityInfoWatch.com contributor Ray Bernard about how real words can become empty words and stifle technology progress.
A while ago I was checking out a cloud-based access control offering from a company that I didn’t know had anything “cloud” going on. This was a 40-year-old company (not Brivo, BluBØX, or Feenics). I used the link on their public website and logged in to the demo account on the cloud system.
I was checking out the ID badge issuance process, when suddenly the browser displayed a system error page, with details about the database error and the software code involved. The menu was still at the top of the page, so I navigated to the first page of the badge process, which listed the personnel in the demo account, but this time all of the photos were the same: a placeholder image that said, “No Photo.” Navigating around, I saw that anything having to do with Personnel showed the No Photo image, and seemed to be missing some parts of the personnel data previously displayed. Whatever I did had changed the state of something in the system and it wasn’t correcting itself.
This occurred at 11:30 p.m. and I was concerned that I might have taken some portion of their system offline. There was no tech support phone number to call, so after making notes and taking screenshots, I emailed the head of marketing (it was a website issue) and the head of sales (since this was the demo account that sales uses).
The next morning the problem was still there and I had received no feedback from my email. So, I called the head of marketing, who said, “Don’t worry about it. It’s a completely separate computer just for demos. It’s not our actual system. It’s not really up to date.” I was speechless, so she continued.
“Our actual system is on Microsoft Azure, and that guarantees 99.999% uptime. So, we wouldn’t have that same kind of problem. Don’t worry about it.” So, I didn’t. I just crossed that company off my list of candidate cloud system vendors and went on with other work. There were so many things wrong with the responses that it would have taken far too long to write an educational note back about it.
The first question I should have been asked was, “What prompted you to try our demo system?” I thought for sure the sales manager would follow up on a potentially hot prospect. After all, I took the time to write them an explanatory note and provide screenshots. How many other people ran into that problem and just closed the browser, mentally writing off the company and the product?
What’s Wrong with 99.999% Uptime?
More recently, I have had several sales people mention high availability and state that Microsoft Azure guarantees them “five nines” of uptime, referring to their SaaS (Software as a Service) offering. There are several things wrong with this thinking.
- No Guarantee for SaaS. SaaS vendors receive an uptime guarantee for the Platform as a Service (PaaS) or Infrastructure as a Service (IaaS) services they subscribe to. A cloud infrastructure provider can’t possibly guarantee that your cloud SaaS application won’t fail, so they don’t. Cloud application developers know this, and most but evidently not all sales folks do.
- It’s Not 99.999%. In its Service Level Agreement (SLA) for Virtual Machines, Microsoft Azure states (with my comment in brackets): “For all Virtual Machines that have two or more instances deployed in the same Availability Set [redundant virtual servers running on different physical hardware], we guarantee you will have Virtual Machine Connectivity to at least one instance at least 99.95% of the time. For any Single Instance Virtual Machine using premium storage for all Operating System Disks and Data Disks, we guarantee you will have Virtual Machine Connectivity of at least 99.9%.” This means that allowable downtime per month is about 22 minutes for 99.95%, and 44 minutes for 99.9%.
- Uptime Itself is not Guaranteed. The guarantee does not promise uptime; it provides a service credit if the monthly uptime percentage target is not met. For Virtual Machines in an Availability Set, less than 99.95% gets you a 10% credit, less than 99% gets you a 25% credit, and less than 95% gets you a 100% credit, with the credit being applicable to the next month’s billing.
- Downtime Excludes Maintenance. Continuing about Azure, for “periods of Downtime related to network, hardware, or Service maintenance or upgrades impacting Single Instances . . . we will publish notice or notify you at least five (5) days prior to the commencement of such Downtime.” Downtime is usually short (like a reboot), so rarely should scheduled downtime be a significant issue.
- Read the SLA details. Outside of the top five cloud providers, Service Level Agreement (SLA) terms can vary more than you would expect. Some SLAs state that downtime must be continuous, meaning that four separate incidents of 10 minutes each don’t count as 40 minutes of downtime. They count as zero downtime under the 99.9% uptime target, because none of the downtime periods was 44 minutes long. (That’s not how Azure does it, but some other providers calculate it that way.)
- Don’t Brag About Infrastructure Uptime. Making a big deal about uptime sends the message that your service didn’t use to have high uptime. Besides, it’s a brag about something that most companies have in common with their competitors, so its value is minimal at best. Just mention it, and brag instead about your application’s features.
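To put numbers on the SLA percentages discussed above, here is a minimal Python sketch of the arithmetic. It is my own illustration, not anything taken from a provider's SLA: the function names, the 30-day default month, and the 44-minute "continuous" threshold are all assumptions chosen to match the examples in the bullets.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_pct, days_in_month=30):
    """Minutes of downtime per month that an uptime target leaves room for.

    days_in_month defaults to 30 (an assumption); pass 31 for a longer month.
    """
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100.0 - uptime_pct) / 100.0


def counted_downtime(incident_minutes, min_continuous=0.0):
    """Downtime that counts against the target under a given SLA rule.

    incident_minutes: durations of separate outages, in minutes.
    min_continuous:   hypothetical rule under which an outage only counts
                      if it lasted at least this many minutes in one stretch.
                      0.0 means every minute of every outage counts.
    """
    return sum(d for d in incident_minutes if d >= min_continuous)


if __name__ == "__main__":
    # Monthly downtime budgets for the targets mentioned in the text,
    # plus the "five nines" figure that sales people like to quote.
    for target in (99.9, 99.95, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes per 30-day month")

    # Four separate 10-minute outages in one month.
    outages = [10, 10, 10, 10]
    print("Cumulative rule:", counted_downtime(outages), "minutes counted")
    print("Continuous-only rule:",
          counted_downtime(outages, min_continuous=44), "minutes counted")
```

Depending on whether the month has 30 or 31 days, the 99.9% budget works out to roughly 43 to 45 minutes, and a true 99.999% target would leave well under a minute per month. The second function shows why the counting rule matters as much as the percentage: the same four outages cost 40 minutes under a cumulative rule and nothing at all under a continuous-only rule.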
Infrastructure Uptime is Generally Good and Improving
With a big-name cloud provider, high availability for infrastructure is typically very good, and that trend is improving. For any cloud infrastructure provider, information should be available on their overall record of service, including how well they meet uptime targets. Vendors of cloud-based security applications get no real benefit from talking at length about infrastructure uptime. Customers assume it will be good (unless it’s the vendor’s own in-house data center). Customers are more concerned about how they can use the cloud application to improve their security-effectiveness or cost-effectiveness. Providing case study examples around these two factors will typically have a much higher sales ROI than talking about cloud data center technical details.
The User Experience is What Counts
The reason I started this article with a story of a cloud system problem is that many people I have talked to equate high availability with good user experience, when many more factors are likely to impact user experience than server or network uptime.
For example, one of the five essential cloud computing characteristics is “resource pooling using a multi-tenant model,” illustrated nicely on WhatIsCloud.com. Multitenancy allows several cloud system users to share the same underlying IT resource or its instance while each remains unaware that it may be used by others. The primary consideration I hear discussed is data isolation: that one cloud system customer cannot access the data of another cloud system customer, which is important. However, application performance is also an important part of the picture.
I have used a few client cloud-based systems whose responsiveness varied greatly depending upon the time of day or on some other invisible factor. This violates the principle that resource sharing should have no impact on users. With one system I checked out, some functions were not available due to a database issue, according to the error information provided, which suggested retrying after a short wait. Retrying worked, and I couldn’t tell if the system had recovered from a coding error in the application, or if the database resource had reached 100% utilization and simply wasn’t available. I checked out the same function again a few weeks later, running lots of searches to see if the problem would recur, and it didn’t. In fact, the system seemed consistently more responsive than before.
That’s the result we should expect from a cloud-based system: that the system is continually improving in features and performance.
My point is that multitenancy means more than just data isolation. It also means engineering the system so that the multitenancy aspect of the system has no negative impacts on the user experience. And customers have a right to expect that.
About the Author:
Ray Bernard, PSP CHS-III, is the principal consultant for Ray Bernard Consulting Services (RBCS), a firm that provides security consulting services for public and private facilities (www.go-rbcs.com). He is the author of the Elsevier book Security Technology Convergence Insights available on Amazon. Mr. Bernard is a Subject Matter Expert Faculty of the Security Executive Council (SEC) and an active member of the ASIS International member councils for Physical Security and IT Security.