Showing posts with label downtime. Show all posts

Feb 3, 2019

Service malfunction from 2019-02-01 23:05 UTC to 2019-02-02 07:24 UTC

From 2019-02-01 23:05 UTC to 2019-02-02 07:24 UTC, one of our core servers had an error that caused our executor servers to fail to execute cron jobs. The problem was resolved at 2019-02-02 07:25 UTC, and the system has been working normally since then.

We're investigating the root cause of the problem and will strengthen the whole system from the bottom up once the investigation is complete.

In the preliminary inspection, we found that the failure was related to a shortage of partition space caused by poor disk partitioning on a fairly old CentOS installation. While Redis was running BGSAVE to that partition, there was not enough space left, so Redis kept running BGSAVE again and again (as it is triggered by AOF file size). Eventually both the partition space and the RAM were exhausted, and the server could only partly function during the failure.
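The failure mode described above (a background save starting on a partition that cannot hold the result) can be guarded against with a simple free-space check. The following is only a minimal sketch, not what EasyCron actually runs: the function names, the `safety_factor` margin, and the idea of comparing free space against an estimated dataset size are all assumptions made for illustration.

```python
import shutil


def has_room_for_snapshot(free_bytes, dataset_bytes, safety_factor=2.0):
    """Return True if free space covers the dataset size times a safety margin.

    safety_factor is a hypothetical margin; the right value depends on the setup.
    """
    return free_bytes >= dataset_bytes * safety_factor


def partition_can_hold_snapshot(path, dataset_bytes, safety_factor=2.0):
    """Check the partition containing `path` using the real free-space figure."""
    free = shutil.disk_usage(path).free
    return has_room_for_snapshot(free, dataset_bytes, safety_factor)


if __name__ == "__main__":
    GIB = 2 ** 30
    # Hypothetical numbers: a 4 GiB dataset and 5 GiB of free space.
    print(has_room_for_snapshot(5 * GIB, 4 * GIB))   # margin not met
    print(has_room_for_snapshot(10 * GIB, 4 * GIB))  # margin met
```

A monitoring job could call `partition_can_hold_snapshot` with the directory that holds the Redis data files and raise an alert before the partition fills up.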

As a quick repair, we moved one of our Redis log servers to a new dedicated system with 4 times the RAM and 10 times the disk space.

Any cron jobs that missed executions during the failure were executed (once) when the system came back up.

We're really sorry for the service malfunction. We will investigate the failure further and publish more information if necessary.

Jun 13, 2017

Datacenter hardware maintenance

Our dedicated server provider OVH (ovh.com) has sent us an alert about a hardware replacement (electrical reboot) that will happen on June 14 at 6:00 AM EDT and last about 1 to 1.5 hours:
http://status.ovh.net/?do=details&id=14704&edit=yep

This hardware replacement will affect one of our core servers, on which our other servers rely to provide EasyCron's service.

As a result, our service will be interrupted during the intervention. It will recover automatically once the maintenance is finished and the server has booted again.

From the information we have, replacing the electrical part may have no impact on the server:
http://status.ovh.net/?do=details&id=14521
or may cause downtime of variable length (from a dozen minutes to 1 hour):
http://status.ovh.net/?do=details&id=14437
http://status.ovh.net/?do=details&id=14530

There is nothing we can do now to completely avoid this service interruption. However, we're working on extending our HA (high availability) strategy to cover this server, so that if similar maintenance happens in the future, our service will not be affected.

As a remedy, EasyCron will fire each cron job that missed executions during the maintenance *once* after the service is back to normal. That is, each affected cron job will be triggered one time after our server is up again, no matter how many executions it missed.
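To illustrate the "fire once" remedy, here is a minimal sketch of how affected jobs could be selected. It is not EasyCron's actual implementation: it assumes jobs run at a fixed interval from a known first run, and the names `missed_runs`, `jobs_to_catch_up`, and the sample jobs are hypothetical.

```python
from datetime import datetime, timedelta


def missed_runs(first_run, interval, outage_start, outage_end):
    """Count scheduled runs that fall inside [outage_start, outage_end)."""
    count = 0
    t = first_run
    while t < outage_end:
        if t >= outage_start:
            count += 1
        t += interval
    return count


def jobs_to_catch_up(jobs, outage_start, outage_end):
    """Return each job that missed at least one run; each is fired only once."""
    return [
        name
        for name, (first_run, interval) in jobs.items()
        if missed_runs(first_run, interval, outage_start, outage_end) > 0
    ]


if __name__ == "__main__":
    # Hypothetical outage window and jobs (naive local times).
    start = datetime(2017, 6, 14, 6, 39)
    end = datetime(2017, 6, 14, 7, 14)
    jobs = {
        "every_5_min": (datetime(2017, 6, 14, 6, 0), timedelta(minutes=5)),
        "hourly": (datetime(2017, 6, 14, 6, 0), timedelta(hours=1)),
        "daily_noon": (datetime(2017, 6, 14, 12, 0), timedelta(days=1)),
    }
    for name in jobs_to_catch_up(jobs, start, end):
        print("fire once:", name)
```

In this sketch the 5-minute job misses several runs but still appears only once in the catch-up list, while the noon job is not triggered because none of its scheduled runs fell inside the outage window.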

We're really sorry for the inconvenience this server maintenance will cause.

UPDATE:

Our server's downtime started at June 14th, 6:39 AM EDT and ended at June 14th, 7:14 AM EDT, lasting 35 minutes in total.
Our service was fully back to normal at June 14th, 7:20 AM EDT.