Wednesday 1st July 2020

MongoDB shared cluster: High load on the MongoDB shared cluster

The MongoDB shared cluster hosting free MongoDB databases is under higher load than usual. The load started rising at 15:25 UTC and slowly reached the point, ~30 minutes ago, where the cluster could no longer serve most requests as expected. Since then, requests are also expected to fail because of timeouts or aborted connections.

The service has been restarted. We will monitor it closely and add monitoring to better catch this kind of ramp-up.

Dedicated databases are not impacted by this issue. If you are impacted, you can migrate your free plan to a dedicated plan using the migration feature. You can find it in the "Migration" menu of your add-on.

22:41 UTC: Load seems to be back to normal. The monitoring has been adjusted, so we should now receive an alert at the start of such an event.

23:19 UTC: The issue is back. The load is not as high as before, but it might make the cluster slow.

23:54 UTC: The users putting the most load on the cluster have been contacted to help avoid this issue. Further actions will be taken later today if the issue persists.

2020-07-01 06:25 UTC: The node crashed after hitting a fatal assertion, then restarted.

06:38 UTC: The node is still unreachable for an unknown reason.

07:48 UTC: The cluster is currently being repaired. For an unknown reason, nodes would not listen on their network interfaces.

09:32 UTC: The repair is halfway through. The cluster might be up again in ~1h30.

11:50 UTC: The repair is done and the node successfully restarted. You should now be able to connect to the cluster. We are now restarting the follower node so that it can rejoin the cluster.

15:09 UTC: The leader node crashed again because of an assertion failure. It is unreachable again while MongoDB reads its entire journal and rebuilds the indexes.

15:30 UTC: It usually takes 1h30 for MongoDB to read the whole journal, so it should be up again around 16:20 UTC.

16:34 UTC: It is taking longer than usual.

20:06 UTC: The restarts weren't successful. The secondary node successfully started at some point but was shut down to avoid any issue with the primary one. We'll try starting it again.

2020-07-02 09:15 UTC: The first node has been accessible now and again but keeps crashing due to user activity. The second node failed to sync to the first node, so it cannot be used as primary right now. We are now trying to bring the first node back up without making it accessible to users so we can at least get backups of every database. Once this is done, we will update you on the next steps. This process will take a while, as MongoDB takes hours (literally) to come up after a crash.

12:00 UTC: The first node is finally back up (but incoming connections are shut off for now). We are now taking backups of all databases; you should see a new backup appear in your dashboard in the coming minutes or hours. Once this is done, we will start working on bringing the second node back in sync. Once the cluster is healthy, we will bring it back online.

14:30 UTC: Backups are complete. Customers who were using the free shared plan in production can create a new paid dedicated add-on and import the latest backup there. Meanwhile, we are rebuilding the second node from the first one to make the cluster healthy again. Once that is done, we will bring the service back up (if everything goes well).

15:55 UTC: The second node is synced up and the service is available again. We are still monitoring things closely.

18:35 UTC: The service is working smoothly, no issues or anomalies to report.