Between 2019-02-01 23:05 UTC and 2019-02-02 07:24 UTC, one of our core servers had an error that prevented our executor servers from executing cron jobs. The problem was resolved at 2019-02-02 07:25 UTC, and the system has been working normally since then.
We're investigating the root cause of the problem and will strengthen the whole system from the bottom up once the thorough investigation is done.
In the preliminary inspection, we found that the failure is related to a shortage of partition space caused by the poor disk partitioning of a fairly old CentOS installation. When Redis ran BGSAVE against the partition, there was not enough space, so Redis kept repeating BGSAVE (as it's triggered by AOF file size). Eventually both partition space and RAM were exhausted, and the server could only partly function during the failure period.
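To illustrate the kind of guard that could catch this condition early, here is a minimal sketch of a free-space check for the partition that holds the Redis dump files. It is not our actual monitoring code: the function names, the `safety_factor` parameter, and its default value of 2.0 are assumptions made for illustration only.

```python
import shutil


def enough_space(free_bytes: int, dataset_bytes: int, safety_factor: float = 2.0) -> bool:
    """Return True if free space covers the dataset size times a safety margin.

    The safety factor is a hypothetical margin, not a Redis setting.
    """
    return free_bytes >= dataset_bytes * safety_factor


def partition_has_room(path: str, dataset_bytes: int, safety_factor: float = 2.0) -> bool:
    """Check the partition containing `path` using the standard library."""
    free_bytes = shutil.disk_usage(path).free
    return enough_space(free_bytes, dataset_bytes, safety_factor)
```

A monitoring job could call `partition_has_room` with the Redis data directory and the current dataset size, and raise an alert when it returns `False`, well before a background save starts failing.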
As a quick fix, we moved one of our Redis log servers to a new dedicated system with 4 times the RAM and 10 times the disk space.
Any cron jobs that were missed during the failure period were executed (once each) when the system came back online.
We're really sorry for the service disruption. We will investigate the failure further and publish more information if necessary.