Incidents
Full history of incidents.
January 2025
Network maintenance in the Paris region is planned on January 16th, 2025 between 21:30 UTC and 23:30 UTC. No service interruption or degradation is expected during this maintenance. We will update this incident throughout the operation.
EDIT 2025-01-16 21:27 UTC: The maintenance is about to start.
EDIT 2025-01-16 23:30 UTC: The maintenance is now over, no impact observed.
Maintenance is planned on Grafana starting at 9:00 CET (Paris) and is expected to last several minutes. We plan to upgrade Grafana to v9.5.5.
EDIT 10:10 CET: The update completed successfully.
16:50 UTC: We identified an issue where users can be disconnected when trying to view the add-on information of a MateriaKV. We are working on a fix.
EDIT 17:05 UTC: A fix has been deployed. If any disconnection still happens, please contact support.
Load balancers are experiencing connectivity issues with the event bus when retrieving configuration; we are investigating.
EDIT 10:00 UTC: The issue appears to be caused by packet loss between the SGP and PAR regions; we are investigating.
EDIT 11:00 UTC: The incident is related to an AS outside our network; we have contacted our network providers to mitigate the issue.
The shared RabbitMQ cluster in the Paris region is experiencing intermittently degraded performance. We are investigating.
EDIT 10:27 UTC: The underlying issue has been found: a RabbitMQ client was reconnecting too fast and too often, leading to a global increase in load on the cluster. The situation has been addressed, and we continue to monitor it.
EDIT 2025-01-13 09:45 UTC: The situation has been back to normal since 2025-01-10 10:30 UTC. The incident is over.
We are currently in the process of upgrading our Cellar Paris service. During this upgrade, customers might experience increased request latency when we restart certain components. No service interruption is to be expected. The upgrade should take place throughout the week starting on Monday 2025-01-13 around 09:00 UTC. We will update this maintenance once it is over or if anything comes up.
EDIT 2025-01-15 17:05 UTC: The maintenance is now over. No service interruption occurred.
We are currently in the process of upgrading our Cellar North service. During this upgrade, customers might experience increased request latency when we restart certain components. No service interruption is to be expected. The upgrade should take place between Wednesday 2025-01-08 15:00:00 UTC and Friday 2025-01-10. We will update this maintenance once it is over or if anything comes up.
EDIT 2025-01-09 16:40 UTC: The upgrade is now over. Some queries were slower than usual today around 12:00 UTC but no requests were lost.
Our monitoring detected a high number of simultaneous connections, which resulted in refused connections from the load balancer. We are investigating the issue and looking for a way to solve it.
EDIT 15:20 UTC: The number of simultaneous connections is back to normal; we are still investigating the reason behind this spike.
Our monitoring has detected that one of our hypervisors is unreachable; we are investigating the issue.
EDIT 18:00 UTC: The hypervisor is back up and running; we are proceeding with the recovery process.
EDIT 18:15 UTC: Services are back up and running; we are investigating the cause of the issue.
December 2024
A hypervisor is unreachable in the MEA region; we are investigating.
EDIT 07:41 UTC: The hypervisor has been back online since 07:27, as well as all of its services.
The processing components of the access logs pipeline have a bug that prevents us from processing access logs properly. We are currently investigating the issue.
EDIT 14:50 UTC: We found the bug and wrote a patch for it. We have deployed the patch in production and are now processing the access logs queues.
EDIT 10:15 UTC: The access logs backlog has been completely consumed; we have been processing logs on the fly since yesterday at 18:45 UTC.
(Times in UTC) A hypervisor has crashed in the RBX region. Applications that were running on this hypervisor are currently redeploying. We are investigating the cause of the crash.
- 00:35: It looks like the kernel went rogue. We are rebooting the server.
- 00:45: The server was successfully rebooted; we are now checking that all services are restarting correctly.
- 00:55: Everything is back to normal. All databases are up and running.
Maintenance is planned on Grafana starting at 21:00 CET (Paris) and is expected to last for 1 hour.
[01:00 CET] Maintenance is now complete.
A few nodes of the Pulsar storage layer, known as BookKeeper, crashed and took part of the Pulsar cluster down with them. We are restoring the BookKeeper cluster, after which we will help the Pulsar cluster recover.
EDIT 19:15 UTC: We have deployed a patch to fix the BookKeeper cluster, applied the new configuration, and rolled out the cluster. The Pulsar cluster should be available again.
EDIT 20:20 UTC: Some nodes of the BookKeeper cluster are under memory pressure; we are investigating the issue.
EDIT 21:20 UTC: We found the issue and are deploying the patch.
EDIT 21:50 UTC: Situation is back to normal.
We observed an issue when accessing Grafana Metrics dashboards, with the message "Access denied to this dashboard".
A patch is currently being deployed.
[12:30 CET]: All organisations have been patched.
A hypervisor in the Paris region is experiencing degraded I/O performance. We are looking into it.
EDIT 20:25 UTC+1: The hypervisor has been back to normal levels since 20:08 UTC+1. We are still investigating the cause of the slow I/O. Applications on this hypervisor were redeployed elsewhere to avoid any issues.
A hypervisor has crashed in the PAR region. Applications are currently redeploying. We are investigating the cause of the crash, probably an issue with the RAID array.
EDIT 11:50 CET: The hypervisor is up and running; we are still investigating the root cause.
A hypervisor has crashed in the PAR region. Applications are currently redeploying. We are investigating the cause of the crash.
EDIT 16:40 CET - The hypervisor has been restarted and is now running.
Since yesterday morning, we have been having difficulties with the access logs ingestion pipeline. We are working to solve the issue.
EDIT 14:20 UTC: We are catching up on the access logs lag and should finish consuming it during the night. We are working on a solution to speed up the recovery.
EDIT D+1 10:20 UTC: We have caught up on the access logs lag, but we still have an issue producing messages due to the Pulsar cluster's underlying metadata storage. We applied a patch yesterday that should improve production, but it takes time to complete its job; we are investigating a way to improve the current situation.
EDIT D+1 17:20 UTC: We have found a way to solve the issue with the access logs ingestion pipeline. We are currently deploying it.
EDIT D+2 19:00 UTC: We have deployed the patch, but we are currently impacted by this incident: https://www.clevercloudstatus.com/incident/927
EDIT D+2 19:20 UTC: We are catching up on the lag.
EDIT D+3 16:50 UTC: The situation is back to normal.
November 2024
A hypervisor has crashed in the PAR region. Applications are currently redeploying. We are investigating the cause of the crash.
EDIT 01:00 UTC: We have found the cause of the hypervisor crash: a broken RAID array due to a failing disk. We are restoring the RAID and making sure the system and data are OK.
EDIT 01:30 UTC: The hypervisor is up and running; we are restarting the databases on it.
EDIT 01:40 UTC: Databases have been restored.