Softlayer Cloud: a Scary Story of One Bad Service
Disclaimer: the information in this post is the author’s personal opinion and is not the opinion or policy of his employer.
It was spring 2010 when we decided that even though Softlayer‘s server provisioning system is really great and it takes only a few hours to get a new server when we need it, it is still too long sometimes. We wanted to be able to scale up when needed and do it faster. It was especially critical because we were working hard on bringing up Facebook integration to our site and that project could have dramatically changed our application servers cloud capacity requirements.
What buzzword comes to your mind when we talk about scaling up really fast, sometimes within minutes, not hours or days? Exactly – cloud computing! So, after some initial testing and playing around with Softlayer’s (really young back then) cloud solution called CloudLayer and talking to our account manager, we decided to switch our application from a bunch of huge and at the time pretty expensive 24-core monster servers to a cluster of 8-core cloud instances. To give you some perspective: we had ~250 cores at the start of the project, and by the end of 2010 we had more than 100 instances (we weren’t a small client with a few instances).
For those who are not familiar with Softlayer cloud: they sell you “dedicated” cores and memory, which is supposed to give you awesome performance characteristics compared to shared clouds like EC2.
Long story short, after a month of work on the project we had our application running on the cloud and were able to scale it up and down pretty fast if needed. And since the cloud was based on faster cpu and faster memory machines, we even saw improved performance of single-threaded requests processing (avg. response time dropped by ~30% as far as I remember). We were one happy operations team…
But then the problems started. Pretty painful and weird problems. The problems all looked exactly the same – some day (or night) some of our cloud instances would start blocking on I/O operations forever or for really long periods of time. Then some time after that our filesystems would switch to read-only mode because of write errors (timeouts) on their storage.
Just to make it clear: our application does not log much to local disks and almost does not read anything. But sometimes we still need to read a config file or write some logging information, and syslogd was still there and sometimes needed to write to disk. So, many times a week we would wake up, see an instance dead (all critical processes locked in D-state), call Softlayer, and spend hours (literally) to bring it back up. Sometimes it’d happen to one server, sometimes to ten of them. First time in my life I was afraid to go to bed because I was almost certain a monitoring SMS would wake me up within a few hours. This was probably one of the most stressful periods of my professional life to date.
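For anyone who wants to catch this failure mode early, here is a minimal sketch (not necessarily what we ran back then) of a check for the two symptoms described above: processes stuck in D-state and filesystems that got remounted read-only. It assumes a Linux host with /proc mounted. The function names are made up for illustration, and mounts that are read-only on purpose would have to be filtered out by whatever alerting you put on top of it.

```python
import os


def parse_stat_state(stat_line):
    """Return (command name, state letter) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    left, _, right = stat_line.rpartition(")")
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def read_only_mounts(mounts_text):
    """Return mount points whose option list (4th field) contains 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and "ro" in fields[3].split(","):
            result.append(fields[1])
    return result


def d_state_processes(proc_root="/proc"):
    """Return (pid, command) for every process in uninterruptible sleep."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except (OSError, IndexError):
            continue  # the process exited or its stat file was unreadable
        if state == "D":
            stuck.append((int(entry), comm))
    return stuck


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        print("read-only mounts:", read_only_mounts(f.read()))
    print("D-state processes:", d_state_processes())
```

A single sample will also catch processes that are in D-state for a perfectly normal, short disk wait, so in practice you would only alert when the same pid shows up stuck across several consecutive checks.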
What made the situation even worse was that nobody on Softlayer’s side seemed to care about the trend we all were seeing pretty clearly – their cloud was seriously broken. Every ticket about a dead instance was processed as if it was the first one ever with such a problem. Support people would poke around an instance for some time, then try to restart it, etc. After some time they’d give up and escalate it to the so-called InfoSys team, which would take its time (usually hours) to finally respond with a template answer telling us that there was a problem with their SAN, that we were the only customers experiencing it, and that they were working with their SAN vendor to fix it. We’ve got dozens of tickets and the answers all looked exactly the same (compared word by word) even though they were sent weeks apart.
After a month (I believe) of this hell in our operations team’s life, and being close to a nervous breakdown, I contacted our dedicated account manager and told him we were going to start leaving Softlayer for some other provider (not only the cloud but all servers) if they did not fix the problem within one week. And this is where things started moving really fast: I received dozens of calls from all kinds of managers, was called in to a few meetings with a senior manager responsible for the cloud, etc., and within a week they found a solution. They’d set up local disks on their host systems and move our filesystems to those local disks instead of putting them on their SAN. This was supposed to “fix” the problem until they figured it out completely and, as expected, it did help us. The problems went away, and if one still happened, it would be on some instance they forgot to switch to local storage.
Time was passing by and for a few months we did not have any serious issues with the cloud. Then we’ve started growing. Being pretty sure the problem had been solved by that time we didn’t actually realize that new cloud instances were created on SAN again. And really soon we’ve got bad news – their problem was still there. Any issues happening with our instances were processed within the same template:
- an instance breaks;
- we receive a notification and file a ticket;
- Softlayer support guy spends up to an hour trying to figure out what’s happening;
- the ticket is escalated to the InfoSys team;
- we wait up to a day, sometimes longer (yes, in two cases out of the dozens we had with them it took more than a day!) for an answer from this systems team and the problem resolution.
This InfoSys team was, from my point of view, a really interesting case of a poorly integrated third-party service within a huge company (this is my vision of the problem, I do not know how this happened in reality). Every time we had a problem that needed their attention, the support guy would escalate it to the systems team and that was it – there was nothing more he could do, and the only thing we’d hear hours after the escalation was “we can’t contact systems team to get an update, we should wait”. Nobody from this team ever contacted us directly, it was impossible to reach out to any of them, and they never seemed to care about the downtimes we had – they’d follow the same “fix one instance at a time” pattern again and again, and the whole Softlayer support department would not be able to get a few words of an update from them.
At this point we started thinking about moving away from the cloud, but it’d involve a month or two of work, and having a small operations team with lots of concurrent tasks limited our ability to allocate so many resources to such a project, so we didn’t do that back then.
After a few more months of work with their cloud and having intermittent downtimes (rare, but still painful) at the beginning of 2011 two things happened:
First, we’ve got Softlayer guys to admit that their “dedicated” cores are actually Intel Hyper-Threading threads (1/2 of a cpu core). This made us realize that the cloud wasn’t actually cheap compared to the new hardware available on the market.
Second, our account manager and a few guys from the systems team in clear text asked us to leave their cloud because they could not support our I/O requirements anymore (remember I told you before – we barely used their storage).
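To put the Hyper-Threading admission into numbers, here is a small worked sketch. It takes the figures from this post (8-"core" instances, 100 of them as a lower bound) and uses the same rough simplification as above, that one hyper-thread counts as half of a physical core. The function names are hypothetical, and no real prices are assumed.

```python
def physical_core_equivalent(instances, threads_per_instance, threads_per_core=2):
    """Physical cores behind a fleet whose advertised 'cores' are hyper-threads."""
    return instances * threads_per_instance / threads_per_core


def price_per_physical_core(price_per_advertised_core, threads_per_core=2):
    """What one physical core really costs if each 'core' is a hyper-thread."""
    return price_per_advertised_core * threads_per_core


if __name__ == "__main__":
    instances = 100          # "more than 100 instances", so a lower bound
    threads_per_instance = 8  # each instance was sold as 8 "cores"
    print("advertised cores:", instances * threads_per_instance)
    print("physical-core equivalent:",
          physical_core_equivalent(instances, threads_per_instance))
```

In other words, under that simplification the real price per physical core is double the advertised per-core price, which is the number you should be comparing against dedicated hardware.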
This was enough and we’ve finally made the decision to give up and go back to real hardware only we control and manage. It took us more than a month of work, but we think we’ve got a pretty good system built as the replacement for the Softlayer cloud based solution.
Now, the reason I’ve decided to finally write this blog post is funny: I was going to write it a few times and didn’t have a chance to do it. But today (April 30, 2011) one of the 5 cloud instances we have left in Softlayer cloud has died. We use those instances for all kinds of experimental “servers”, etc., and it wasn’t an uber-critical downtime. So, this instance died and so far (4.5 hours after its death) the problem looks really similar to what we’ve experienced many, many times in the last year, and again they follow the same template: support guys spent an hour messing with the instance, then they transferred the ticket to their InfoSys team, and 3 hours later there weren’t any updates and they were not able to reach that team to get anything from them. Update: It took them 4.5 hours to recover the instance and the only explanation we’ve got was “there were some problems with the host system”.
I really hope every single operations team in the world considering using Cloudlayer as their solution for some problems would google first, see my experience and would make their own conclusions. These are the facts and my own experience and it is up to you guys to decide how to interpret them and apply them to your own cases.
25 Responses to this entry
Perhaps your words will have a bad effect on their business, but I hope Softlayer will soon release a clearer explanation of the case.
SL has been a well-known cloud service provider so far, but in my opinion they have to deal with unwanted problems better than they did in your case.
BTW, nothing is perfect, either Amazon or Google!
Yeah, nothing is perfect. But the thing is, their main offering – dedicated server hosting – is just awesome, the best I’ve ever seen. And when their cloud stuff was this bad and they couldn’t actually provide any support for it – this is what made me so frustrated and thinking that this was actually some crappy 3rd party service poorly integrated into their system.
Interesting post Alexey. Looking at your Quantcast graph, it appears you do have the type of workload that makes sense in the cloud–namely, much larger traffic spikes on weekdays. At Squidoo we also see higher traffic during the week, but fortunately we are able to absorb it using aggressive Varnish caching (Istvan from Percona set this up for us–you can read details on my blog).
Switching hosts might not be as bad as you think. For one, you may be able to get much more aggressive pricing given your size and reputation. Another hidden gem is that you get to document (if you haven’t already) and test your server provisioning and backup strategies. This is an invaluable task. Try to look at the bright side.
I think you guys are working on some interesting stuff. I would love to compare performance and operations notes sometime. Are you planning to attend Velocity Conf this year?
Yeah, we have caching in place too. I’d say ~90% of our traffic is served from squid. It really helps
re: Velocity – unfortunately I missed my chance to get the tickets and then it was all sold out
My experience with Softlayer’s cloud is similar. Their “SAN” was constantly giving us problems (slow and or hung instances due to IO), which forced me to quit using them. This was about a year ago so I don’t know if things have improved.
I have found their dedicated servers to perform well for very heavily loaded mysql databases (dual quad core xeons, 32GB RAM, RAID 10, 4x15k drives, BBU). This setup has worked well for our custom sharded Rails app (based on db-charmer which works great btw).
Hi Alexey,
As one of the folks on the front lines for SoftLayer, I’m bummed to read your experience in the cloud. While it’s true that the cloud isn’t necessarily the right medium for every application, I understand why it should fit your use case.
We’ve corresponded a few times in the past, and I’m going to send you an email with more info, but you should know that our number one priority is our customers’ experience, and the fact that yours has been so negative is a terrible black eye. I’ve made executive management aware of this post and have given some background on your account. I know it doesn’t mean much to you right now, but the experience you describe is completely unacceptable.
I understand your frustration and hope that as we move forward, you let me know directly if there’s anything I can do to help: khazard@softlayer.com
-Kevin
This is exactly the same story we had on the SoftLayer cloud. We have very low I/O requirements, but our systems would regularly be waiting on it (with average service request times often hanging around 1-60 *seconds*).
Any correspondence I had with the SoftLayer support team was terrible and followed the same format: say they’ll look into it, an hour later they’d “escalate it,” and I’d find out some time later they didn’t “correctly” escalate it so no one had even looked at it. When we finally did get to the right group, they would work with their SAN vendor to correct it – but never did before we were able to move to another provider.
While working with different levels of supervisors and executives there, I /feel/ I was outright lied to by a number of SoftLayer employees, because I genuinely don’t know how so many people can have so little of a clue what is going on there. As the cloud instances all use the SANs, one would think a client wouldn’t always be the first one to let SoftLayer know when a severely debilitating problem is going on with one of them. Having reached a (always temporary) solution with some of the engineers, it seems as though there are a handful of people who know how compartments of SoftLayer’s infrastructure works – and if you don’t reach one of them, you’re out of luck.
At any rate, I’ll be publishing a similar write-up shortly covering my experience in further detail… I just thought it was very interesting that we had such a similar experience.
This sounds a lot like the problems we had with Sun/Oracle Unified Storage. The systems were pitched as supporting “unlimited” snapshots, but in reality within 90-ish days of deployment snapshot deletions would exclusively lock the storage pools, taking them out long enough for the initiators to time-out, and on Solaris 10, you can’t restore an initiator when that happens. You just have to reboot the initiator.
Maybe SoftLayer bought Sun equipment? Ours was the Sun 7310. It sounds so similar to our issues it’s scary.
Doesn’t help you since you weren’t managing the SAN, but our “solution” was to disable de-dupe, never delete a snapshot, and migrate off the 7310s as fast as we could, onto Nexenta boxes. Crazy sounding I know since they’re very close to the same OS in most ways that matter. The big difference is you have zero access to address these issues on the Sun systems. With the Nexenta box you could simply order deletions, stagger them so you don’t queue up too much IO. Replicate in both directions easily if the worst happens and your Target is no longer serviceable. You have none of these options on the Sun equipment without being warned that you’ll void your very expensive warranty/service-contracts. That and they hide the log files and most tools you might use to debug the issues.
I can’t say enough bad about those systems. They really managed to take a great foundation and turn it into garbage with enterprisey BS.
Yeah, this sounds a lot like the explanations we’ve heard at the end of our cloud trial. They told us there were some problems with their SAN’s internal maintenance procedures taking up too much resources.
Alexey,
My name is Steve Kinman (skinman) and I knew your blog about CloudLayer was coming. I have a technical background but only in Windows and pre-cloud so I cannot speak to the technical challenges you list in your review of our service but can assure you that eyes will see it. The reason for my comment here is my concern for the support you received from Infosys. I am very interested in hearing your concerns and challenges you saw when dealing with them and other avenues of support. Please email me at skinman@softlayer.com if you have some free time. Sorry it took a review like this to get our full attention. We do take all feedback very seriously even when it hurts.
Skinman
Softlayer service has gone downhill amazingly over the past few years, I guess it’s the iplanet curse. High network latency and mystery outages and poor connectivity on the pod level let alone a datacenter level. I’ve stopped recommending them and we are planning to move what we have left shortly.
As an aside, most of the frontline staff have no idea what they are doing now. On ESX they told me to go to my start menu and open up adaptec storage manager. It really is like aol..
Beautiful post about poor qualification of cloud provider. If you make a cloud hosting you have to hire best hardware and software engineers or your clients one day will run from you.
I empathise, we had exactly the same experience though luckily we weren’t using the cloudlayer for mission-critical services. But we had the same problems with cloudlayer instances switching the root filesystem to read-only mode after a series of write errors, and I had exactly the same experience trying to get the helpdesk team to do anything other than minimal troubleshooting. Every time the response was “I rebooted your server and it’s ok now”.
Needless to say I’ve recommended that we avoid Softlayer’s cloud for future projects.
We’ve had the same problem at a smaller scale. One of our instances in particular has repeatedly switched to read-only mode. The response has been pretty much the same. Reboot of the server and a long fsck with hours of downtime.
It’s nice to get confirmation that others have experienced this, as when I insisted at Softlayer that there must be a larger issue I just got a standard “sorry for the inconvenience” type reply. Softlayer really need to be a bit more transparent with their clients that are investing on their infrastructure.
I have several dedicated and cloud instances with SL. Their dedicated machines are GREAT. Top notch data centres, connectivity, rapid provisioning, etc.
However. Stay the hell away from CloudLayer. I’ve had dozens of cloud machines on Rackspace, Amazon, Linode, and a bunch of other lesser known regional operators. I can’t even take SoftLayer’s “cloud” offering seriously in comparison. It’s a complete joke. You couldn’t even compare the offering with your average cheap virtual server provider.
- I/O performance in particular came up between 20-50x slower than the competition (depending on the tests and time of day)- it all runs on networked SAN instead of local RAID10. Which also makes it vulnerable to network congestion and router issues…………
- CPU doesn’t stack up in tests either
- Some of the default o/s install images are faulty
- I’ve had numerous, multiple-hour long outages with no-one to talk to, just dumb ass useless ticket responses, like “thank you for using softlayer” for hours on end while I’m sat pulling my hair out with sites down thanks to stuff like router issues. Hell, that’s why I’m in a data centre and not my own garage right?
- I could go on!!!!!
Hopefully they can hire some cloud gurus and get their acts together. If/when they do they’ll have the best platform around. Until then, choose dedicated if you want to use SL.
We’ve had similar issues on Amazon EBS. Once every few days, I/O operations on EBS device would freeze either completely or for about a minute. After weeks spent debugging and trying to reproduce the issue (which initially was thought to be userland), it turned out to be a kernel bug in Ubuntu 10.04, so upgrading to 10.10 completely solved the problem.
Hi Alexey,
Just wanted you to know we have the same problem. Two cloud machines (not even in production, no load) had the file system go read-only after our first month and again in month 2. Both times it took them an hour to bring the machine back up and do an fsck. Softlayer’s support was great, however the cloud service is unacceptable. I’ve been with Linode/Slicehost/Rackspace and have never had such problems.
On a good note though, their dedicated machines are a dream to work with. I wish there was a company with a good hybrid solution, but Softlayer as it is, is not it.
Good luck,
Phu
Same story here. Their dedicated server setup and hosting is fantastic but their cloud offering is embarrassingly bad.
A quick, concrete example that anyone can verify: rebooting a cloud instance on SoftLayer by typing “reboot” at the command line will work *maybe* 50% of the time on a good day, the instance will not come back up the other 50% of the time.
We’ve had a half dozen instances where even a hard reboot through their portal does nothing and we had to file tickets just to get an instance booted back up. One time we had an unclean shutdown and corrupted filesystem all from trying to reboot the thing. I’ve filed at least a dozen tickets and always get BS non-answers like the ones you mention above.
Finally, even when the instances are working the IO on them is really, really bad. My bonnie++ runs show 29MB/s read, 6MB/s write, and 86 seeks/s. To put that in perspective, you are looking at a fraction of the performance you would expect from a single 4200rpm hard drive. I’ve got a blog post in the works benching all of the cloud provider’s IO and Softlayer wins the “worst” award by an enormous margin.
The take away is that you should consider them for dedicated hosting but not VMs. It is frustrating though as if they brought the service level of their cloud offering up to snuff they would have a really killer combination for mixing and matching metal with VMs.
Found this blog entry via WHT forums. Thanks for the write up and info about your experience. Softlayer Cloud is on my list to test out as I am already testing out Amazon EC2, Rackspace Cloud, GoGrid all via Rightscale.com. Still disk end is the weak spot for cloud hosting. Some of the hardware behind these cloud hosting servers ain’t all that as I found i.e. Rackspace poor performance.
Would be interested to hear if they ever get to bottom of the issues you experienced.
cheers
George
[...] week, Alexey Kovyrin’s post on his frustrations caught my attention. In particular, the following two paragraphs just made me [...]
At one point we had considered their cloud service, however opted to their dedicated offerings. I did love how their provisioning was super fast, new boxes within 4 hours, always.
What aggravated our team was that when there were hardware issues, for example a drive that had failed or was about to fail (throwing SMART errors), they could not or did not want to do a direct drive-to-drive copy, which would have been completed within 2 hours; instead we had to load all the software and then transfer the data, which took much longer. The reason they gave us was that their policy prohibits them from touching data. I had expected more from them.
Indeed a poor excuse and not very flexible. They could easily have had a customer sign some form of waiver/agreement to ‘touch the data’?
Heya i’m for the first time here. I found this board and I to find It truly helpful & it helped me out a lot. I am hoping to give something back and help others like you helped me.
We’re a service-oriented managed cloud provider, in business since 2006. While there are always issues with shared infrastructure because it’s impossible to predict which users will push a component to the edge, with sufficient monitoring and aggressive hardware management, we’ve been able to support a reasonably large community of cloud clients in each of our hardware clusters with minimal cross-talk. It’s true, our prices are a bit higher on entry-level configurations (small number of cores) than cut-rate clouds but we believe the extra, personal service we give, and the extra performance headroom in our cloud provides our customers greater value, and they seem to agree.
What I find interesting is the tradeoff between the savings and increased reliability that can be provided in a shared cloud versus self-managed physical hardware, which people always seem to return to after a bad cloud experience. We offer virtual data centers in our cloud that “look” like dedicated hardware, including allocating “real” cores to applications, 80BGbit infiniband connections between servers and storage, and fast SANs – except there is guaranteed reliability, scalability, and pay-per-use. So the question is, how do you trade off any shared infrastructure drawbacks against having to build and manage your own hardware (or at least be responsible for systems design and administration)? The more I know about this, the better we can provide a cloud that is a nearly indistinguishable alternative to having your own hardware, unlike some of the public clouds today where you really have no idea what you’re getting (like the comment above where you get one hyper-thread as a “core”)
I’d love to hear from you.
-Eric Novikoff (eric@enki.co)
ENKI