Today, May 17, 2018, the BitMEX trading engine encountered several separate and previously unseen problems, causing intermittent feed latency and downtime throughout the day.
Disks mounted on the main trading engine hardware degraded sharply in performance at roughly 10:00 UTC, with disk I/O operations running at roughly 1/20 of their expected rate. As a result, scheduled archive and reindex jobs caused significant backpressure and feed latency.
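To illustrate why a 1/20 slowdown in disk throughput turns into backpressure, here is a minimal sketch of a queue whose consumer suddenly runs at 1/20 of its expected rate. All rates and the function name `backlog_over_time` are hypothetical values chosen for illustration; they are not measurements from the BitMEX engine, and the model is far simpler than a real trading system.

```python
def backlog_over_time(arrival_rate, service_rate, seconds):
    """Return the queue depth at the end of each second.

    arrival_rate: operations arriving per second
    service_rate: operations the disk can complete per second
    """
    backlog = 0.0
    depths = []
    for _ in range(seconds):
        # Work that cannot be served this second stays queued.
        backlog = max(0.0, backlog + arrival_rate - service_rate)
        depths.append(backlog)
    return depths


# Hypothetical numbers, for illustration only.
EXPECTED_RATE = 1000.0               # ops/sec the disks normally sustain
DEGRADED_RATE = EXPECTED_RATE / 20   # the "1/20 of expected" scenario
ARRIVAL_RATE = 400.0                 # ops/sec generated by the workload

healthy = backlog_over_time(ARRIVAL_RATE, EXPECTED_RATE, 60)
degraded = backlog_over_time(ARRIVAL_RATE, DEGRADED_RATE, 60)

print(healthy[-1])   # 0.0: the disk keeps up, so no queue forms
print(degraded[-1])  # 21000.0: 350 unserved ops accumulate every second
```

Under these assumed numbers, a workload that is comfortably within capacity on healthy disks builds an ever-growing backlog once throughput drops, which is the general shape of the backpressure described above.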
BitMEX runs redundant drives, but in this case both drives exhibited the degraded behavior simultaneously. We had no choice but to schedule maintenance downtime to replace them. Unfortunately, backpressure reached critical levels faster than we expected, so we brought the maintenance forward.
At no point was data integrity compromised by this problem, but restoring the machine to a functional state with nominal disk performance took longer than expected to execute and verify.
Once this work was complete, we restarted trading. Unfortunately, the next archive uncovered another problem: a reindex job combined with a previously rare request pattern led to unexpected index regeneration and symbol revalidation on specific tables. This produced another backpressure scenario with similar symptoms.
We have identified and fixed multiple contributing factors to the above behavior. The trading engine team will be closely monitoring engine performance throughout the day while continuing root cause analysis for the slowdowns.