On Thursday, Service Fusion suffered a site-wide outage that lasted approximately one hour. Let me start with a sincere apology from our entire staff. We take the reliability of our systems very seriously; it’s our number one priority. I’ve written this post-mortem to give you an account of what happened, how it was resolved, and what we are doing to ensure it never happens again.
Just after 2pm Central on Thursday, the website experienced an outage when the freeable memory of the caching tier became too small to support the active connections on the site. Because the caching tier supports critical site functionality, the site could not render pages correctly, and users received server-level errors from the web farm.
Our immediate priority was to restore site functionality by increasing the freeable memory available to the caching tier. Because the site is cloud-hosted, this meant increasing memory on the caching cluster of servers at our Amazon-hosted data center. While this is a semi-automated process, it is not immediate, and it took the remainder of the outage window to allocate the additional memory. Once the allocation was complete, the server farm was recycled to reconnect to the caching tier, and site functionality was restored.
We are in the process of implementing additional levels of system alerting so that Engineering and Operations can take appropriate steps long before an outage occurs. Additionally, while the current caching tier is internally redundant, we are also building a fail-over caching tier of servers that we can use in the event of a catastrophic issue with the current tier.
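For readers curious what this kind of alerting can look like, here is a minimal sketch in Python of a threshold check on freeable memory. It is an illustration only, not our production monitoring: the names `MemoryThresholds` and `classify_freeable_memory`, the threshold values, and the "three consecutive samples" rule are all assumptions made for the example. In practice the samples would come from whatever monitoring system reports memory for the caching cluster.

```python
from dataclasses import dataclass
from typing import Sequence

MB = 1024 * 1024  # bytes per mebibyte


@dataclass(frozen=True)
class MemoryThresholds:
    """Hypothetical alert thresholds for freeable memory, in bytes."""
    warn_bytes: int
    critical_bytes: int


def classify_freeable_memory(
    samples: Sequence[int],
    thresholds: MemoryThresholds,
    sustained: int = 3,
) -> str:
    """Return 'ok', 'warn', or 'critical' for a series of memory samples.

    An alert fires only when the last `sustained` samples are all below
    a threshold, so a single brief dip does not page anyone.
    """
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    recent = list(samples[-sustained:])
    if len(recent) < sustained:
        return "ok"  # not enough data yet to judge a sustained condition
    if all(s < thresholds.critical_bytes for s in recent):
        return "critical"
    if all(s < thresholds.warn_bytes for s in recent):
        return "warn"
    return "ok"


# Example values only; real thresholds depend on the cluster's size and load.
thresholds = MemoryThresholds(warn_bytes=512 * MB, critical_bytes=128 * MB)
level = classify_freeable_memory([600 * MB, 500 * MB, 400 * MB, 300 * MB], thresholds)
print(level)
```

The point of the separate warning level is lead time: a warning raised while memory is shrinking but still adequate leaves room to add capacity before the critical level is reached.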
We want to reiterate our apology for the magnitude of this issue and the impact it had on our customers and their customers. We are working diligently to implement these next steps in Engineering to prevent issues like this from happening again.
Chief Technology Officer