By now, you are likely aware of the Amazon Web Services failure from April 21st through April 25th. What interests me most is not the failure itself, but the response. It seems that every time there is some sort of cloud failure, there is a widespread knee-jerk reaction that cloud is unreliable and shouldn't be used. Before too long, the outage will be forgotten and people will return to the attitude of 'cloud is great and we should move all of our production into the cloud.' It is important to note that 'cloud' is neither panacea nor nightmare.
Cloud is not a cure-all for everything IT. (Although there are several other considerations, I am focusing on failure here.) Cloud has many issues of its own that traditional infrastructures do not share. To look at this fairly, we need to quit talking about 'cloud' as though it were a single entity. There are a multitude of public clouds available for consumption. Each of these has some unique value proposition; one that does not will die. What these clouds have in common is a high level of complexity and a high level of scalability. The larger a cloud grows, the more likely it is to exhibit some level of failure.
Conversely, there is no reason to fear the cloud. This multitude of cloud offerings simply provides additional choice for various IT needs. Whether cloud is appropriate, and which cloud is appropriate, depends completely on the particular needs of the IT organization for the systems being evaluated. In the next part of this series, I will discuss how to evaluate cloud.
For now, I want to address the appropriate response to cloud failures: look for mechanisms to prevent and recover from them. If someone has a failure within a physical server, they (should) have multiple mechanisms to protect them from that failure. For example, if a block on a hard disk platter gets corrupted, there may be an ECC mechanism within the drive to recover that block. Next, the drive is likely a member of a RAID set, allowing for recovery. Lastly, the data is being backed up and may also be replicated. As you can see, this example has four methods of recovering from a single failure: ECC, RAID, backup, and replication.
There are mechanisms to recover from cloud failures as well. Some of these mechanisms are innate to the particular cloud. Many failures occur within a cloud every day, as should be expected at that scale. However, the built-in mechanisms prevent any effect on client workloads most of the time. Occasionally there have been, and will continue to be, times when failures affect one cloud customer, a larger portion of a cloud, or even an entire cloud. Some of these cases come from a circumstance that many fail to consider: failure of the company that owns the cloud. There have been times when entire clouds have been shut down without notice, leaving customers out in the cold.
There are two options to deal with this. First, one can use the cloud only for workloads where the risk of failure is acceptable. Second, one can employ methods of protecting oneself from failures. In some cases, this means careful code deployment. In others, it means utilizing third-party tools.
One example for cloud storage is TwinStrata CloudArray. Although its primary purpose is to facilitate cloud storage access without development, it includes the ability to replicate data across multiple clouds. For example, it can store the same data in both an Atmos-based cloud, such as AT&T or Windstream, and Amazon’s proprietary S3. Similarly, CloudSwitch provides mobility across multiple clouds.
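To make the idea of replicating data across multiple clouds a little more concrete, here is a minimal Python sketch. It is not how CloudArray or CloudSwitch work internally, which I am not describing here. The names `InMemoryStore`, `ReplicatedStore`, and `StoreUnavailable` are hypothetical, and the in-memory dictionaries merely stand in for real cloud storage providers. The sketch assumes the simplest possible policy: write every object to every cloud, and read from the first cloud that answers.

```python
from typing import Dict, List


class StoreUnavailable(Exception):
    """Raised when a (simulated) cloud cannot serve a request."""


class InMemoryStore:
    """Hypothetical stand-in for a single cloud storage provider."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise StoreUnavailable(self.name)
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise StoreUnavailable(self.name)
        return self._objects[key]


class ReplicatedStore:
    """Writes each object to all clouds; reads from the first that answers."""

    def __init__(self, backends: List[InMemoryStore]) -> None:
        self.backends = backends

    def put(self, key: str, data: bytes) -> List[str]:
        """Store the data everywhere possible and return the clouds that took it."""
        written = []
        for backend in self.backends:
            try:
                backend.put(key, data)
                written.append(backend.name)
            except StoreUnavailable:
                continue  # one cloud being down should not block the others
        if not written:
            raise StoreUnavailable("no cloud accepted the write")
        return written

    def get(self, key: str) -> bytes:
        """Return the object from the first cloud that is up and has it."""
        for backend in self.backends:
            try:
                return backend.get(key)
            except (StoreUnavailable, KeyError):
                continue  # fall through to the next replica
        raise StoreUnavailable("no cloud could return " + key)


# Simulate an outage of the first cloud after the data has been replicated.
cloud_a = InMemoryStore("cloud-a")
cloud_b = InMemoryStore("cloud-b")
store = ReplicatedStore([cloud_a, cloud_b])
store.put("report.txt", b"quarterly numbers")
cloud_a.available = False
print(store.get("report.txt"))  # still served, from cloud-b
```

A real deployment would also have to deal with a cloud that missed writes during its outage and needs to be brought back in sync, which this sketch deliberately ignores.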
In part 2, I will discuss cloud shopping.