Incidents
Full history of incidents.
March 2024
The metrics query is currently unavailable as some indexing shards are offline. We are working to get them back up as quickly as possible. There is no impact on the ingestion pipeline or the storage layer.
EDIT 13:30 UTC: Indexing components are back online and queries are available again.
A cleanup process has triggered some durability lag on our storage layer. You may experience query instability.
Mon Mar 25 20:32:34 2024 UTC: all components are back to normal
(times in UTC)
Around 21:00, part of the log drains stack broke in a way that our monitoring did not detect right away, and it started filling up the disk of the underlying RabbitMQ. At 21:37, we were alerted to the lack of space on RabbitMQ and started investigating around 22:10. At 22:57, the log drain stack was back up. However, to fix RabbitMQ, we had to drop the pending queues. Logs are still collected in our new logs infrastructure, but all drains lost the logs emitted between 21:00 and 22:57.
We are currently investigating request slowness on the Cellar north service.
EDIT 15:52 UTC: The issue has been identified and is being worked on. Timeouts have been very sporadic since 15:38 UTC, but some may still appear. We are continuing to work on the issue.
EDIT 17:30 UTC: The service has been stable for the past hour; we will continue to monitor it for the next few hours.
The MySQL dev add-on cluster was unreachable. This should now be fixed.
Scope:
- Database Load Balancer (configuration update)
Expected Impact:
- Brief disconnections or connection drops during the update process.
- Potential minor performance fluctuations.
Additional Information:
- We will deploy a patch on the load balancer that reduces memory consumption and enables more telemetry.
- Please report any issues with a method for reproducing the problem (see the sketch after this list for one possible approach).
- This maintenance is a direct follow-up to incident https://www.clevercloudstatus.com/incident/826, in order to propagate the patch.
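As an illustration of what such a reproduction method could look like, here is a minimal sketch, assuming a PostgreSQL add-on reached through the database load balancer and the psycopg2 driver; this is not an official Clever Cloud tool, and the connection details and interval are hypothetical placeholders for your own add-on. It repeatedly opens a connection and logs timestamped successes and failures.

```python
# Minimal reproduction sketch (assumptions: a PostgreSQL add-on behind the
# database load balancer, psycopg2 installed via `pip install psycopg2-binary`).
# All connection details below are hypothetical placeholders.
import time

import psycopg2

DSN = ("host=your-addon-host.example.com port=5432 dbname=mydb "
       "user=me password=secret connect_timeout=5")  # placeholder values

while True:
    stamp = time.strftime("%H:%M:%S")
    try:
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        cur.execute("SELECT 1")  # trivial query to prove the connection works
        cur.fetchone()
        cur.close()
        conn.close()
        print(f"{stamp} OK")
    except psycopg2.OperationalError as exc:
        print(f"{stamp} FAILED: {exc}")
    time.sleep(10)  # probe every 10 seconds during the maintenance window
```

A log produced this way gives our support a concrete timeline to correlate any disconnection with the patch rollout.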
EDIT 16:25 UTC: We have patched the RBX, RBXHDS and MTL regions.
EDIT 16:25 UTC: We are rolling out the patch on the PAR region.
EDIT 16:45 UTC: We have patched the PAR region and are starting on the WSW region.
EDIT 16:55 UTC: We have patched the WSW region and are starting on the SYD region.
EDIT 17:05 UTC: We have patched the SYD region and are starting on the GRAHDS region.
EDIT 17:05 UTC: We have patched the GRAHDS region and are starting on the SCW region.
EDIT 17:15 UTC: We have patched the SCW region and are starting on the SGP region.
EDIT 17:35 UTC: We have patched the SGP region as well. The maintenance is over.
We are currently investigating request slowness on the Cellar north service.
EDIT 19:46 UTC+1: The underlying storage system is currently having issues and is rebalancing data. No data loss is expected, but timeouts may occur. We are working to stabilize the system.
EDIT 20:09 UTC+1: The underlying storage system has been stable for the last 5 minutes. We are keeping an eye on it to make sure everything is okay.
EDIT 21:47 UTC+1: The service is now stable. We will need to perform additional maintenance to fully fix the underlying issue and will schedule the corresponding maintenance windows in the following days.
One instance of the load balancer was unreachable; we have applied a patch and rebooted it. We are watching the load balancer.
EDIT 14:00 UTC: We have updated the load balancer configuration; you may have seen some connections cut during the reload.
EDIT 14:30 UTC: We have seen the same instabilities on the RBX HDS database load balancer configuration and have applied the same patch as on the RBX database load balancer.
EDIT 2024-03-19T10:00:00Z: After a few days of observation, the currently deployed patch does not handle the issue correctly; we are working on improving it.
EDIT 2024-03-19T11:30:00Z: We have deployed a new version of the patch that should handle the issue; we are watching the metrics to validate it.
EDIT 2024-03-19T18:30:00Z : We have rolled out the patch as it handles the issue correctly. We are still watching metrics. However, we will need a few days to validate the behavior.
EDIT 2024-03-19T16:00:00Z: The patch is validated, as we no longer see the issue occurring.
We have seen network instabilities causing slowness and errors. We are investigating the issue.
We suspect that we may have been impacted by one of these maintenance operations from our infrastructure provider. See:
- https://network.status-ovhcloud.com/incidents/ym5hrzs2cn3b
- https://network.status-ovhcloud.com/incidents/np3j0flx9w24
EDIT 14:00 UTC: We are confident this was not linked to the infrastructure provider and was an isolated incident. We are still monitoring, but the issue seems to be solved.
Maintenance Window: 2024-03-06T09:00:00Z - 2024-03-06T11:00:00Z (UTC)
Scope:
- Application Load Balancer (software upgrade)
Expected Impact:
- Brief disconnections or connection drops during the upgrade process.
- Potential minor performance fluctuations.
Additional Information:
- We will deploy a patch on the load balancer control plane that affects sticky sessions.
- Please report any issues with a method for reproducing the problem (e.g., a curl command for application load balancer issues, or a small probe like the sketch below).
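As a hint of what such a reproduction could look like, here is a minimal sketch, assuming a Python environment with the `requests` package; the URL is a hypothetical placeholder for your own application behind the load balancer. It records status codes and response times you can attach to a report.

```python
# Minimal probe sketch (assumption: `requests` installed via `pip install requests`).
# The URL is a hypothetical placeholder for your own application.
import time

import requests

URL = "https://your-app.example.com/health"  # placeholder endpoint

while True:
    stamp = time.strftime("%H:%M:%S")
    try:
        resp = requests.get(URL, timeout=5)
        elapsed = resp.elapsed.total_seconds()
        print(f"{stamp} HTTP {resp.status_code} in {elapsed:.2f}s")
    except requests.RequestException as exc:
        print(f"{stamp} FAILED: {exc}")
    time.sleep(5)  # probe every 5 seconds during the upgrade window
```

Attaching the resulting log (or the equivalent curl command) to your report makes disconnections and slowdowns much easier to correlate with the upgrade.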
EDIT 11:00 UTC: We have updated the cleverapps load balancers and will soon restart them. We will proceed with further upgrades this afternoon.
EDIT 14:20 UTC: We are beginning the updates of the Paris load balancers.
EDIT 18:00 UTC: We are still updating the load balancers.
EDIT 19:30 UTC: We have stopped the update process for today; we will continue tomorrow.
EDIT 10:00 UTC: We are resuming the updates of the load balancers.
EDIT 10:45 UTC : We have finished the updates.
Maintenance Window: 2024-03-05T13:00:00Z - 2024-03-05T17:00:00Z (UTC)
Scope:
- Database Load Balancer (software and hardware upgrade)
- Application Load Balancer (software and hardware upgrade)
Expected Impact:
- Brief disconnections or connection drops during the upgrade process.
- Potential minor performance fluctuations.
Additional Information:
- Software upgrade already deployed on cleverapps.io (https://www.clevercloudstatus.com/incident/803) and on the Paris load balancers (https://www.clevercloudstatus.com/incident/807 and https://www.clevercloudstatus.com/incident/805).
- Please report any issues with a method for reproducing the problem (e.g., curl command for application load balancer issues and/or psql / redis / mysql queries for database load balancer).
EDIT: The maintenance window has been moved to next Tuesday.
EDIT 14:00 UTC : We are beginning the maintenance.
EDIT 16:00 UTC: We have finished installing the new hardware alongside the existing one in the RBXHDS region and will begin switching the traffic on the database and application load balancers. We are also starting the installation of the load balancers in the RBX region.
EDIT 16:20 UTC: We are switching the database load balancer instance.
EDIT 16:45 UTC: We have fully switched the load balancers of the RBXHDS region and have finished installing the new load balancers alongside the current ones in the RBX region. We will begin switching the traffic to the new instances.
EDIT 17:15 UTC: We have finished switching the traffic from the old load balancers to the new ones. The maintenance is over.
February 2024
We have detected trouble accessing Elasticsearch add-ons in the Montreal region.
We are working on it.
[2024-02-27 08:50 UTC] We identified the root cause and fixed it. Everything should now be OK.
We are detecting a high number of errors on our query layer. The impacted components have been restarted and are currently reloading.
A query batch has effectively DDoSed some query components. They are currently reloading, and queries are unavailable while they load.
Queries are back online.
We are detecting a higher number of errors than expected on the reverse proxies dedicated to add-ons in the MTL zone. A very small percentage of users may experience trouble connecting to their database. We are investigating.
EDIT 17:45 UTC: We have applied a configuration change to try to mitigate the issue. We are watching.
EDIT 18:00 UTC: We have performed a rolling reboot of the load balancers to give them more capacity.
The shared MTL cluster is located on the hypervisor that crashed. We are working on it: https://www.clevercloudstatus.com/incident/818
EDIT 16:50 UTC : The shared cluster is up and running.
A hypervisor in the MTL zone has crashed and is currently rebooting. Applications are being moved to other hypervisors. Add-ons on the rebooting hypervisor are not reachable. Once the hypervisor is back up, we will make sure all add-ons are up.
EDIT 16:45 UTC: The hypervisor has rebooted and is now operational.
A maintenance is planned on our DEV PostgreSQL add-on cluster (software upgrade).
EDIT 2024-02-19 21:00 UTC: The maintenance was successfully completed.
We are investigating an issue where git repositories are not updated when you push a new commit in the MTL region. Deployments are currently working as expected. We are looking into the issue.
EDIT 18:02 UTC: The issue has been identified and fixed. If you pushed any commits that didn't get applied, please let our support know about it so we can force a deployment.
We have an issue with git authentication during deployment. We are investigating the issue.
EDIT 15:30 UTC: We have resolved the git authentication issue.