Softlayer Cloud: a Scary Story of One Bad Service

Posted in: General, Networks
Tags: cloud, downtime, experience, issues, operations, softlayer

2 May 2011

Disclaimer: the information in this post is the author’s personal opinion and is not the opinion or policy of his employer.

It was spring 2010 when we decided that even though Softlayer’s server provisioning system is really great and it takes only a few hours to get a new server when we need it, that is still too long sometimes. We wanted to be able to scale up when needed and do it faster. It was especially critical because we were working hard on bringing Facebook integration to our site, and that project could have dramatically changed the capacity requirements for our application servers.

What buzzword comes to your mind when we talk about scaling up really fast, sometimes within minutes, not hours or days? Exactly: cloud computing! So, after some initial testing and playing around with Softlayer’s (really young back then) cloud solution called CloudLayer and talking to our account manager, we decided to switch our application from a bunch of huge and, at the time, pretty expensive 24-core monster servers to a cluster of 8-core cloud instances. To give you some perspective: we had ~250 cores at the start of the project, and by the end of 2010 we had more than 100 instances (we weren’t a small client with a few instances).

For those who are not familiar with Softlayer’s cloud: they sell you “dedicated” cores and memory, which is supposed to give you awesome performance characteristics compared to shared clouds like EC2.

Long story short, after a month of work on the project we had our application running on the cloud and were able to scale it up and down pretty fast if needed. And since the cloud was based on machines with faster CPUs and faster memory, we even saw improved performance of single-threaded request processing (average response time dropped by ~30%, as far as I remember). We were one happy operations team…

But then the problems started. Pretty painful and weird problems. The problems all looked exactly the same – some day (or night) some of our cloud instances would start blocking on I/O operations forever or for really long periods of time. Then some time after that our filesystems would switch to read-only mode because of write errors (timeouts) on their storage.
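For readers who want to catch the read-only remount early, here is a minimal Python sketch. It is not the monitoring described in this post; it assumes a Linux guest where `/proc/mounts` is available, and the function names and the list of ignored filesystem types are made up for illustration.

```python
# Hypothetical helper: report filesystems that are mounted read-only.
# Assumes the standard /proc/mounts layout:
#   device  mount_point  fstype  options  dump  pass

IGNORED_FSTYPES = ("proc", "sysfs", "tmpfs", "devpts", "squashfs", "iso9660")


def readonly_mounts(mounts_text, ignore_fstypes=IGNORED_FSTYPES):
    """Return the mount points whose option list contains 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in ignore_fstypes:
            continue  # pseudo or intentionally read-only filesystems
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


def check_host(mounts_path="/proc/mounts"):
    """Read the live mount table and return read-only mount points."""
    with open(mounts_path) as f:
        return readonly_mounts(f.read())
```

Calling `check_host()` from a cron job or a monitoring agent and alerting on a non-empty result would flag an instance whose root filesystem has flipped to read-only, although by that point the storage has usually been misbehaving for a while.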

Just to make it clear: our application does not log much to local disks and reads almost nothing from them. But sometimes we still need to read a config file or write some logging information, and syslogd was still there and sometimes needed to write to disk. So, many times a week we would wake up, see an instance dead (all critical processes locked in D-state), call Softlayer, and spend hours (literally) bringing it back up. Sometimes it’d happen to one server, sometimes to ten of them. For the first time in my life I was afraid to go to bed because I was almost certain a monitoring SMS would wake me up within a few hours. This was probably one of the most stressful periods of my professional life to date.
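The “D-state” mentioned above is the uninterruptible sleep state Linux reports for processes that are waiting on I/O. As an illustration only (this is a sketch under the assumption of a Linux `/proc` filesystem, not the tooling we used, and the function names are hypothetical), such processes can be listed like this:

```python
import os


def parse_stat(stat_text):
    """Parse the contents of /proc/<pid>/stat into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return (pid, command) pairs for processes currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

A process that shows up in state D once is normal, since any disk read passes through that state briefly. What we were seeing would correspond to the same PIDs appearing in `blocked_processes()` sample after sample, so a real check should compare several samples taken some seconds apart before alerting.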

What made the situation even worse was that nobody on Softlayer’s side seemed to care about the trend we were all seeing pretty clearly: their cloud was seriously broken. Every ticket about a dead instance was processed as if it were the first one ever with such a problem. Support people would poke around an instance for some time, then try to restart it, etc. After some time they’d give up and escalate it to the so-called InfoSys team, which would take its time (usually hours) to finally respond with a template answer telling us that there was a problem with their SAN, that we were the only customers experiencing it, and that they were working with their SAN vendor to fix it. We got dozens of tickets and they all looked exactly the same (compared word by word), even though they were sent weeks apart.

After a month (I believe) of this hell in our operations team’s life, and being close to a nervous breakdown, I contacted our dedicated account manager and told him we were going to start leaving Softlayer for some other provider (not only the cloud but all servers) if they did not fix the problem within one week. And this is where things started moving really fast: I received dozens of calls from all kinds of managers, was called in to a few meetings with a senior manager responsible for the cloud, etc., and within a week they found a solution. They’d set up local disks on their host systems and move our filesystems to those local disks instead of putting them on their SAN. This was supposed to “fix” the problem until they figured it out completely and, as expected, it did help us. The problems went away, and if one still happened, it would be on some instance they had forgotten to switch to local storage.

Time passed and for a few months we did not have any serious issues with the cloud. Then we started growing. Being pretty sure the problem had been solved by that time, we didn’t realize that new cloud instances were being created on the SAN again. And really soon we got bad news: their problem was still there. Any issue with our instances was handled according to the same template:

an instance breaks;
we receive a notification and file a ticket;
a Softlayer support guy spends up to an hour trying to figure out what’s happening;
the ticket is escalated to the InfoSys team;
we wait up to a day (yes, in two cases out of the dozens we had with them it took even more than a day!) for an answer from this systems team and for the problem to be resolved.

From my point of view, this InfoSys team was a really interesting case of a poorly integrated third-party service within a huge company (this is my vision of the problem; I do not know how it happened in reality). Every time we had a problem that needed their attention, the support guy would escalate it to the systems team and that was it: there was nothing more he could do, and the only thing we’d hear hours after the escalation was “we can’t contact systems team to get an update, we should wait”. Nobody from this team ever contacted us directly, and it was impossible to reach any of them. They never seemed to care about the downtimes we had; they’d follow the same “fix one instance at a time” pattern again and again, and the whole Softlayer support department would not be able to get a few words of an update from them.

At this point we started thinking about moving away from the cloud, but it would have involved a month or two of work, and having a small operations team with lots of concurrent tasks limited our ability to allocate so many resources to such a project, so we didn’t do it back then.

After a few more months of work with their cloud and intermittent downtimes (rare, but still painful), two things happened at the beginning of 2011:

First, we got the Softlayer guys to admit that their “dedicated” cores are actually Intel Hyper-Threading threads (1/2 of a CPU core). This made us realize that the cloud wasn’t actually cheap compared to the new hardware available on the market.

Second, our account manager and a few guys from the systems team asked us in plain words to leave their cloud because they could not support our I/O requirements anymore (remember what I told you before: we barely used their storage).

This was enough, and we finally made the decision to give up and go back to real hardware that only we control and manage. It took us more than a month of work, but we think we’ve built a pretty good system as the replacement for the Softlayer cloud-based solution.

Now, the reason I’ve decided to finally write this blog post is funny: I was going to write it a few times and never had a chance to do it. But today (April 30, 2011) one of the 5 cloud instances we have left in Softlayer’s cloud died. We use those instances for all kinds of experimental “servers”, etc., so it wasn’t an uber-critical downtime. So, this instance died and so far (4.5 hours after its death) the problem looks really similar to what we’ve experienced many, many times in the last year, and again they follow the same template: support guys spent an hour messing with the instance, then transferred the ticket to their InfoSys team, and 3 hours later there still weren’t any updates and they were not able to reach that team to get anything from them. Update: It took them 4.5 hours to recover the instance and the only explanation we got was “there were some problems with the host system”.

I really hope every single operations team in the world that considers using CloudLayer as a solution for their problems will google first, see my experience, and draw their own conclusions. These are the facts and my own experience, and it is up to you guys to decide how to interpret them and apply them to your own cases.

Homo-Adminus Blog

Yet Another Admin’s Blog