[RETROACTIVE] [PAR] Some database instances went down.

At 04:30 UTC: a Pulsar cluster started to behave abnormally (see https://www.clevercloudstatus.com/incident/574 ).
At 05:30 UTC: on PAR, notification services on the hypervisors tried to send messages in a loop, filling the system with stuck processes.
At 07:00 UTC: the OS of these hypervisors started to kill processes to make room. This impacted some applications and databases. We started shutting down the stuck processes and restarting the broken instances.
At 10:00 UTC: we finished restarting all the broken instances.
