Service outages 29.09.2014 and 01.10.2014

This week we had two urgent platform maintenance incidents that affected customers of our products for anywhere from 9 minutes to 1 hour 57 minutes.

We are sorry that you have experienced a disruption in your regular service. Availability of the services that we provide is our top priority.

If you have any questions or concerns about these outages, please do not hesitate to contact us at support@aptoma.com.

Details about the incidents

On Monday 29.09.2014, Rackspace data centers in London underwent urgent security maintenance. DrFront v2 customers were affected for a total of 9 minutes, starting at approximately 21:30.

On Wednesday 01.10.2014, we experienced the following downtime, starting at approximately 23:00.

DrVideo
APIs: zero downtime.
Publishing new content unavailable from 23:05 to 00:20 (1h:15m)

DrPublish
APIs: zero downtime.
Publishing new content unavailable from 23:05 to 00:20 (1h:15m)

DrMobile
APIs: zero downtime.
Publishing new content unavailable from 00:27 to 01:27 (1h)

DrFront V3
Unavailable from 23:03 to 00:07 (64 min)
Unavailable from 00:36 to 01:29 (53 min)
In total: 1h:57m

To maintain full transparency about our service platform, we have an active status page that details ongoing issues in real time at: http://status.aptoma.no. Please go there to subscribe to immediate notifications about any future incident. You can also follow updates at: https://twitter.com/aptomaops

Why the short notice?

A critical security flaw was recently identified in the Xen Hypervisor technology, which powers cloud providers like Amazon, Rackspace and IBM Softlayer. To apply the patch before the flaw could be exploited, swift action was needed, and a reboot of the host servers that run the virtual cloud servers was deemed necessary.

How did we prepare?

We have redundancy on most levels of our systems, but we are not automatically protected against sitewide reboots such as those we saw this week from Rackspace, Amazon and Softlayer. Based on reports from the Rackspace reboots on Monday, our own experience of minimal downtime (9 minutes), and Softlayer’s forecast of 10 to 20 minutes of disruption, we decided not to initiate extensive and potentially risky migrations of content production systems.

Before the incident we contacted, by email, all customers that could be affected, and most of them by phone as well. We rerouted API traffic to other data centers to ensure that data distribution would not be affected. For both Monday’s and Wednesday’s incidents, we kept a running report from the minute we received notice of the upgrades until resolution.

What we’ll do now

We’ve set aside resources for following up with affected customers; please reach out to support@aptoma.com or to me at geir@aptoma.com. We’ve also scheduled a follow-up meeting with IBM Softlayer on Tuesday, October 7th, and initiated an analysis of how we can better shield ourselves from sitewide platform disruptions such as these.

We apologize for any trouble these incidents may have caused you.

Best regards, Geir Berset
Aptoma AS

geir@aptoma.com