Yesterday we experienced one of the longest disruptions to our products to date. For just under 2 hours the hosting platform that powers a large portion of our products was interrupted, resulting in outages for a significant percentage of our users.
The reliability of our products is as high a priority for us as it is for you, and when a disruption like this occurs we want to make sure we do everything we can to minimise the impact on you, and learn from the experience for the future.
So, what actually happened?
Despite all the preparations and backup plans in the world, sometimes things just go wrong.
AMP (the Aptoma Media Platform) uses hosting services primarily from IBM Softlayer, Rackspace and Avalio to manage the core Aptoma products. At 15:12 yesterday, Softlayer ran into problems and quickly posted news to their service pages:
“During a routine maintenance on the UPS systems in our Amsterdam data center, we experienced an outage affecting multiple PDUs (power distribution units) within our server room at 2014-05-19 13:12 UTC. Staff are working closely with the building engineers and vendors to identify the root cause of the power trip and failure of redundant systems”
In plain English, there was a power failure during a routine maintenance session. No data was lost, there were no software issues and all our systems performed as expected – it seems this time it was simply down to human error.
We’ve experienced this kind of situation ourselves, so we know the pain they were feeling, but they kept us updated during the process, and we were pleased that the partner we have chosen reacted so quickly and professionally to get everything up and running again.
What did we do about it?
Even though the problem was with Softlayer, we were not completely helpless.
Within 30 minutes we were able to get some cached services up and running for public-facing content, and made sure friendly error messages were displayed in the products – the last thing we wanted was for you to see a blank screen with no idea of what was happening.
We also posted continuous updates to our status pages at http://status.aptoma.no. You may not have seen this before, but we’ve had it in place for some time, and if you ever want to know more about our service status, this is the first place to look.
Here’s an example from yesterday:
and also an overview of the products affected (the ones with the red cross):
What did we learn?
When something goes wrong with one of our suppliers, the obvious question to ask is “why did we not have an alternative in place?”
We considered this factor when choosing Softlayer, but the network infrastructure and service they provide fit our needs perfectly, so our priority has been to optimise this rather than keep an inferior service on standby.
When something goes wrong, as it did yesterday, we want to be up and running and back to our normal level of service as soon as humanly possible, and our relationship with Softlayer enabled us to do just that.
By getting our cached services online quickly during the downtime, we discovered a new way to make an outage a little less painful for you, and you can be assured that we’ll be looking into ways to improve this further in the future.
We are sorry that you had to experience this disruption to your workflow. Hopefully this will remain an isolated incident, but we will be taking another look at our setup just to be sure there was nothing else we could have done. In the future we will also post more information about the hosting setup we have in place, for those who are curious to know more.
But for now, thanks for your patience during this inconvenient time.