So yesterday I wore a different hat for a few minutes. I was a scribe. I was asked to just sit in on a meeting the business team was having with one of our bigger customers, and to advise the team afterwards. The requirement was basic: they would like to move their applications to our cloud at some point. Cloud, as it's commonly known, is an interesting service with an even more interesting support model.
The desire among designers to avoid single points of failure in critical applications, so that catastrophic errors don't occur, runs deep. Such failures lead to huge financial losses and a diminished corporate brand for all parties involved.
I hope their admins/CIO have advised the business accordingly. Since in this case I was on the provider side, I figured it might help someone if I gave a few tips on what to look out for when going for critical hosting services from one of the cloud providers. (Also note I am more interested in storage and networking issues in the data center for now; we have just begun.) If you just buy some virtual machines and run applications, well.... So some tips follow:
Test disaster recovery. It is probably the most difficult one, but ensure that you have a solid disaster recovery plan. So let's assume you want to port/move 10 applications to the cloud:
As a client, make a thorough analysis of each application before even engaging a provider. This analysis should give direction to the service provider on how to port and test the applications in the cloud. If you'll do the porting yourself, then use it as a guide.
Your plan, when executed, should test the basic applications in the cloud, the service provider's configuration (what is needed for all ten applications) and also the additional functionality needed for a successful disaster recovery of those applications. Sadly no one in Kenya will do this for you. Well, no one that I know of.
Use whatever you get from the tests above to build an SLA. Don't blindly walk into an SLA. And don't walk out without one. Confirm in any way you can that there is at minimum architectural reliability.
Interrogate the person selling the solution: do they look clueful? Ask if you can get independent audits of the cloud infrastructure. Good providers will let you do it. What should be analyzed first? How do you gain confidence that the SLA you come up with covers all the various types of failure by the service provider? Performance metrics are still needed for each supported application. And please ensure you have statistics from before and after porting/migrating, so you can compare whether things are better or worse off.
I'll throw in some sample questions to ask and maybe just maybe someone will benefit from them:
Storage:
How many vendors are used for all application storage?
Is de-duplication addressed?
How is the SAN switching done?
Is only one SAN switch vendor used for all of the applications?
How many vendors are used for data replication, and how many for encrypting data, across all of the applications?
Which encryption algorithm is in use, and for which tool?
How many PKI vendors are used to manage certificates?
And lastly, where are the damn certificates stored?
You can go deeper (sorry I've been working overtime trying to figure out data centers of late so this will be lengthy)
For the network, find out which routers/switches are used within the data center.
Are they redundant?
Which firewalls, IDS/IPS and load balancers are used, and can they steer traffic between redundant data centers?
Can you test this?
Are the load balancers redundant?
Who is/are the ISPs for the internet connections, and are they on redundant fiber?
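One rough way to keep track of the answers to the redundancy questions above is a small inventory that flags anything without a partner. This is only a sketch in Python; the component names and counts are invented for illustration and do not describe any real provider.

```python
# Hypothetical inventory: component -> number of independent instances
# the provider says it runs. All names and counts are invented examples.
inventory = {
    "core router": 2,
    "SAN switch": 2,
    "firewall": 1,
    "load balancer": 2,
    "ISP uplink": 1,
}


def single_points_of_failure(components):
    """Return the names of components that have fewer than two instances."""
    return sorted(name for name, count in components.items() if count < 2)


print(single_points_of_failure(inventory))
```

Anything the function returns is a question to take back to the provider before the SLA is signed.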
As a client you at least need to know something about your apps:
What is the application's best user response time?
What is the response time under load at a certain number of concurrent users, and what is the peak number of users expected?
How long can the application take to recover from a failure before it affects your operations and leads to loss in any form? At what point do you lose your job? Get sued? :-)
Is there any component that could affect the application, e.g. an application tied to a MAC address?
etc etc....
Finally for each application, throw in a grid of information, maybe a row per application into the SLA. So here you probably want to have functionality requirements, performance metrics and financial penalties for the various types of downtime errors per application.
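To make the grid idea concrete, here is a minimal sketch in Python of one row per application. The field names, the two sample applications, their numbers and the penalty rule (charge only for downtime beyond the agreed recovery window) are all my own assumptions for illustration; your SLA will define its own.

```python
from dataclasses import dataclass


@dataclass
class SlaRow:
    """One row of the SLA grid, describing one application."""
    application: str
    required_functionality: str
    max_response_ms: int        # performance metric
    max_recovery_minutes: int   # outage length tolerated before penalties apply
    penalty_per_hour: float     # financial penalty for downtime past that window


# Hypothetical rows; real values come out of your own analysis and tests.
sla_grid = [
    SlaRow("billing", "invoice generation, payment posting", 800, 30, 5000.0),
    SlaRow("intranet", "staff portal, document search", 2000, 240, 200.0),
]


def downtime_penalty(row, outage_minutes):
    """Penalty for one outage, charged only for time past the allowed window."""
    excess_minutes = max(0, outage_minutes - row.max_recovery_minutes)
    return row.penalty_per_hour * excess_minutes / 60


for row in sla_grid:
    print(row.application, downtime_penalty(row, outage_minutes=90))
```

A real grid would probably carry more columns, for instance a separate penalty per type of downtime, but even this much forces both sides to write the numbers down.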
While cloud providers are not obligated to deploy an architecture identical to the client's, i.e. the same products, software, models and releases, the provider must meet similar functionality and response times. Areas where this deviation is a risk need to be documented, and the risk of downtime calculated and documented. This also includes the risk of brand loss and the potential for lawsuits.
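One simple way to put a number on that risk, and only one of several, is expected loss: the chance of an outage per year, times the hours it would last, times the cost per hour. The sketch below is in Python, and every deviation and figure in it is a placeholder I made up. Brand loss and lawsuits would need their own estimates.

```python
# Each documented deviation: (description, chance per year, hours down, cost per hour).
# Every value here is a made-up placeholder, not a measured figure.
deviations = [
    ("different load balancer model", 0.10, 4, 1000.0),
    ("different storage replication product", 0.05, 8, 1000.0),
]


def expected_annual_loss(items):
    """Sum of probability x outage hours x hourly cost over all deviations."""
    return sum(chance * hours * cost for _, chance, hours, cost in items)


print(expected_annual_loss(deviations))
```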
Some performance data for each application also needs to be collected to complete the SLA. Just so you can say: hey, it used to be like this, now it's worse off.... or better off...
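A before-and-after comparison does not need anything fancy. Here is a minimal sketch in Python using only the standard library; the response times are invented sample values in milliseconds for a single application, and in practice you would feed in your own measurements.

```python
from statistics import mean, median

# Hypothetical response times in milliseconds for one application.
before_ms = [420, 450, 480, 510, 900]
after_ms = [380, 400, 430, 470, 1200]


def summarize(samples):
    """Reduce a list of response times to a few headline numbers."""
    return {"mean": mean(samples), "median": median(samples), "worst": max(samples)}


def compare(before, after):
    """Change in each headline number; positive means slower after the move."""
    b, a = summarize(before), summarize(after)
    return {name: a[name] - b[name] for name in b}


print(compare(before_ms, after_ms))
```

Looking at more than one number matters: in this made-up sample the typical request got faster while the worst case got slower, and a single average would hide that.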
This will be fun......