System outage
Incident Report for Service Fusion
Postmortem

Site Outage Post-Mortem

On Thursday, October 11, Service Fusion suffered a site-wide outage that lasted approximately one hour. First, let me offer a sincere apology on behalf of our entire staff. We take the reliability of our systems very seriously; it’s our number one priority. I’ve written this post-mortem to give you an account of what happened, how it was resolved, and what we are doing to ensure it never happens again.

What Happened

Just after 2 p.m. Central on Thursday, the website experienced an outage when the freeable memory of the caching tier became too small to support the active connections on the site. Because the caching tier supports critical site functionality, the site could not render pages correctly, and users received server-level errors from the web farm.

How It Was Resolved

Our immediate priority was to restore site functionality by increasing the freeable memory available to the caching tier. Because the site is cloud-hosted, this meant increasing memory on the caching cluster of servers at our Amazon-hosted data center. While this is a semi-automated process, it is not immediate, and allocating the additional memory to the caching cluster took the remainder of the outage window. Once the allocation was complete, the server farm was recycled so that it could reconnect to the caching tier, and site functionality was restored.

Next Steps

We are implementing additional levels of system alerting so that Engineering and Operations can take appropriate steps long before an outage occurs. Additionally, while the current caching tier is internally redundant, we are also building a fail-over caching tier of servers that we can use in the event of a catastrophic issue with the current tier.
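For readers curious about what this kind of alerting can look like, here is a minimal sketch of a freeable-memory check. It is an illustration only and not our production alerting: the function name, the threshold values, and the sample figures are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values chosen for illustration, not taken from the incident.
    warning_fraction: float = 0.25   # warn when under 25% of memory is freeable
    critical_fraction: float = 0.10  # page when under 10% of memory is freeable


def classify_freeable_memory(
    freeable_bytes: Sequence[int],
    total_bytes: int,
    thresholds: Thresholds = Thresholds(),
) -> str:
    """Return "ok", "warning", or "critical" for a window of recent samples."""
    if not freeable_bytes:
        raise ValueError("need at least one freeable-memory sample")
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")

    # Use the lowest sample in the window so a brief dip is not averaged away.
    lowest_fraction = min(freeable_bytes) / total_bytes

    if lowest_fraction < thresholds.critical_fraction:
        return "critical"
    if lowest_fraction < thresholds.warning_fraction:
        return "warning"
    return "ok"


GIB = 1024 ** 3
# Hypothetical cache node with 16 GiB of memory and three recent samples.
status = classify_freeable_memory([8 * GIB, 6 * GIB, 3 * GIB], total_bytes=16 * GIB)
```

Raising a warning well above the critical level is the point of the design: it gives the team time to add memory, which is not an immediate operation, before connections start to fail.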

Closing

We want to reiterate our apology for the magnitude of this issue and the impact it had on our customers and their customers. We are working diligently to implement these Engineering next steps to prevent issues like this from happening again.

 

Sincerely,

Tony Fratiani
Chief Technology Officer

Service Fusion

Posted Oct 12, 2018 - 17:00 CDT

Resolved
The issue has been resolved and all sites are fully operational. A postmortem will be posted on this page once the investigation has been completed. We sincerely appreciate your patience as we worked on resolving this unforeseen issue.
Posted Oct 11, 2018 - 15:15 CDT
Update
We expect the issue to be resolved within 15-30 minutes. Please continue monitoring this page for updates.
Posted Oct 11, 2018 - 15:09 CDT
Identified
Our engineers were able to identify the root cause of this issue and are working on resolving it.
Posted Oct 11, 2018 - 14:20 CDT
Investigating
We are currently investigating a system outage. Please monitor this page for updates.
Posted Oct 11, 2018 - 14:17 CDT
This incident affected: Admin System - admin.servicefusion.com, Worker App, vip1, vip2 and other *.totalfsm.com domains, QuickBooks Desktop Integration, QuickBooks Online Integration, and Customer Web Portal & Customer Apps.