On March 31st, we experienced a wp-admin area outage starting at 13:31 EST. It took us 68 minutes to resolve and was rooted in the way we performed failovers inside our distributed storage layer.
It started at 13:28 EST, when we pushed a tweak to the NFS configuration live on all the storage nodes. At the master promote-demote stage, the failover proved unsuccessful, causing excessive load on the backend nodes running wp-admin, which were unable to read PHP code. We entered our level 2 emergency procedure, decoupling frontend serving from the backend and further investigating what had happened to the newly promoted storage master. We thus discovered, the hard way, that one state in our automatic demote-promote failover process had not been modeled. The recovery took a good 30 minutes.
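To illustrate what "a state that has not been modeled" means in a demote-promote failover, here is a minimal sketch in Python. It is not our actual failover code: the state names, event names and functions are all hypothetical, chosen only to show the idea of listing every allowed transition explicitly and failing loudly on anything else.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical node roles and in-between states, for illustration only.
    MASTER = auto()
    DEMOTING = auto()
    REPLICA = auto()
    PROMOTING = auto()
    FAILED = auto()


class Event(Enum):
    # Hypothetical failover events, for illustration only.
    DEMOTE_REQUESTED = auto()
    DEMOTE_DONE = auto()
    PROMOTE_REQUESTED = auto()
    PROMOTE_DONE = auto()
    ERROR = auto()


# Every transition the failover process is allowed to make.
TRANSITIONS = {
    (State.MASTER, Event.DEMOTE_REQUESTED): State.DEMOTING,
    (State.DEMOTING, Event.DEMOTE_DONE): State.REPLICA,
    (State.REPLICA, Event.PROMOTE_REQUESTED): State.PROMOTING,
    (State.PROMOTING, Event.PROMOTE_DONE): State.MASTER,
}


def next_state(state: State, event: Event) -> State:
    """Return the next state, refusing any transition that was never modeled."""
    if event is Event.ERROR:
        # Any error moves the node to FAILED so an operator can step in.
        return State.FAILED
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"unmodeled transition: {state.name} on {event.name}"
        ) from None


def unmodeled_pairs() -> list:
    """List the (state, event) pairs that have no explicit transition."""
    return [
        (s, e)
        for s in State
        for e in Event
        if e is not Event.ERROR and (s, e) not in TRANSITIONS
    ]
```

With a table like this, a helper such as `unmodeled_pairs()` can be reviewed before a rollout, so that every missing combination is either added to the table or consciously left as an error, rather than being found during a live failover.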
As a consequence, we held a two-day workshop with our entire team to design and plan a new storage architecture, which will be developed, tested and rolled out by the end of next month. Most importantly, it will implement a decentralized storage approach, alongside a new failover policy for the storage layer. We also expect a significant performance improvement for logged-in users and
Another consequence is a change in the way we monitor uptime and publish uptime data on our status page for all our customers’
And finally, we are committed to being transparent about such issues, should they ever occur again.
We take full responsibility for the outage and we believe we have learned a lot from it. We greatly appreciate our customers’ trust and positive attitude. Thank you!