Incidents
Full history of incidents.
March 2024
The metrics query is currently unavailable as some indexing shards are offline. We are working to get them back up as quickly as possible. There is no impact on the ingestion pipeline or the storage layer.
EDIT 13:30 UTC: Indexing components are back online and queries are available again.
A cleanup process has triggered some durability lag on our storage layer. You may experience query instability.
Mon Mar 25 20:32:34 2024 UTC: all components are back to normal
(times in UTC)
Around 21:00, part of the log drains stack broke in a way that our monitoring did not detect right away, and it started filling up the disk of the underlying RabbitMQ. At 21:37, we were alerted to the lack of space on RabbitMQ and started investigating around 22:10. At 22:57, the log drain stack was back up. However, to fix RabbitMQ, we had to drop the pending queues. Logs are still collected in our new logs infrastructure, but all drains lost the logs emitted between 21:00 and 22:57.
We are currently investigating request slowness on the Cellar north service.
EDIT 15:52 UTC: The issue has been identified and is being worked on. Timeouts have been very sporadic since 15:38 UTC, but some may still appear. We are continuing to work on the issue.
EDIT 17:30 UTC: The service has been stable for the past hour; we will continue to monitor it for the next few hours.
The MySQL dev add-on cluster was unreachable. This should now be fixed.
Scope:
- Database Load Balancer (configuration update)
Expected Impact:
- Brief disconnections or connection drops during the update process.
- Potential minor performance fluctuations.
Additional Information:
- We will deploy a patch on the load balancer that reduces memory consumption and enables more telemetry.
- Please report any issues with a method for reproducing the problem (see the sketch after this list for one possible approach).
- This maintenance is a direct follow-up to incident https://www.clevercloudstatus.com/incident/826, in order to propagate the patch.
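As an illustration of what such a reproduction method could look like, here is a minimal sketch, assuming a PostgreSQL add-on reached through the database load balancer and the psycopg2 driver; this is not an official Clever Cloud tool, and the connection details and interval are hypothetical placeholders for your own add-on. It repeatedly opens a connection and logs timestamped successes and failures.

```python
# Minimal reproduction sketch (assumptions: a PostgreSQL add-on behind the
# database load balancer, psycopg2 installed via `pip install psycopg2-binary`).
# All connection details below are hypothetical placeholders.
import time

import psycopg2

DSN = ("host=your-addon-host.example.com port=5432 dbname=mydb "
       "user=me password=secret connect_timeout=5")  # placeholder values

while True:
    stamp = time.strftime("%H:%M:%S")
    try:
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        cur.execute("SELECT 1")  # trivial query to prove the connection works
        cur.fetchone()
        cur.close()
        conn.close()
        print(f"{stamp} OK")
    except psycopg2.OperationalError as exc:
        print(f"{stamp} FAILED: {exc}")
    time.sleep(10)  # probe every 10 seconds during the maintenance window
```

A log produced this way gives our support a concrete timeline to correlate any disconnection with the patch rollout.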
EDIT 16:25 UTC: We have patched the RBX, RBXHDS and MTL regions.
EDIT 16:25 UTC: We are rolling out the patch on the PAR region.
EDIT 16:45 UTC: We have patched the PAR region and are starting on the WSW region.
EDIT 16:55 UTC: We have patched the WSW region and are starting on the SYD region.
EDIT 17:05 UTC: We have patched the SYD region and are starting on the GRAHDS region.
EDIT 17:05 UTC: We have patched the GRAHDS region and are starting on the SCW region.
EDIT 17:15 UTC: We have patched the SCW region and are starting on the SGP region.
EDIT 17:35 UTC: We have patched the SGP region as well. The maintenance is over.
We are currently investigating request slowness on the Cellar north service.
EDIT 19:46 UTC+1: The underlying storage system is currently having issues and is rebalancing data. No data loss is expected, but timeouts may occur. We are working to stabilize the system.
EDIT 20:09 UTC+1: The underlying storage system has been stable for the last 5 minutes. We are keeping an eye on it to make sure everything is okay.
EDIT 21:47 UTC+1: The service is now stable. We will need to perform additional maintenance to fully fix the underlying issue and will schedule the corresponding maintenance windows in the following days.
One instance of the load balancer was unreachable; we have applied a patch and rebooted it. We are watching the load balancer.
EDIT 14:00 UTC: We have updated the load balancer configuration; you may have seen some connections cut during the reload.
EDIT 14:30 UTC: We have seen the same instabilities on the RBX HDS database load balancer configuration and have applied the same patch as on the RBX database load balancer.
EDIT 2024-03-19T10:00:00Z: After a few days of observation, the currently deployed patch does not handle the issue correctly; we are working on improving it.
EDIT 2024-03-19T11:30:00Z: We have deployed a new version of the patch that should handle the issue; we are watching the metrics to validate it.
EDIT 2024-03-19T18:30:00Z : We have rolled out the patch as it handles the issue correctly. We are still watching metrics. However, we will need a few days to validate the behavior.
EDIT 2024-03-19T16:00:00Z: The patch is validated, as we no longer see the issue occurring.
We have seen network instabilities causing slowness and errors. We are investigating the issue.
We suspect that we may have been impacted by one of these maintenance operations from our infrastructure provider. See:
- https://network.status-ovhcloud.com/incidents/ym5hrzs2cn3b
- https://network.status-ovhcloud.com/incidents/np3j0flx9w24
EDIT 14:00 UTC: We are confident this was not linked to the infrastructure provider and was an isolated incident. We are still monitoring, but the issue seems to be solved.
Maintenance Window: 2024-03-06T09:00:00Z - 2024-03-06T11:00:00Z (UTC)
Scope:
- Application Load Balancer (software upgrade)
Expected Impact:
- Brief disconnections or connection drops during the upgrade process.
- Potential minor performance fluctuations.
Additional Information:
- We will deploy a patch on the load balancer control plane that affects sticky sessions.
- Please report any issues with a method for reproducing the problem (e.g., a curl command for application load balancer issues, or a small probe like the sketch below).
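As a hint of what such a reproduction could look like, here is a minimal sketch, assuming a Python environment with the `requests` package; the URL is a hypothetical placeholder for your own application behind the load balancer. It records status codes and response times you can attach to a report.

```python
# Minimal probe sketch (assumption: `requests` installed via `pip install requests`).
# The URL is a hypothetical placeholder for your own application.
import time

import requests

URL = "https://your-app.example.com/health"  # placeholder endpoint

while True:
    stamp = time.strftime("%H:%M:%S")
    try:
        resp = requests.get(URL, timeout=5)
        elapsed = resp.elapsed.total_seconds()
        print(f"{stamp} HTTP {resp.status_code} in {elapsed:.2f}s")
    except requests.RequestException as exc:
        print(f"{stamp} FAILED: {exc}")
    time.sleep(5)  # probe every 5 seconds during the upgrade window
```

Attaching the resulting log (or the equivalent curl command) to your report makes disconnections and slowdowns much easier to correlate with the upgrade.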
EDIT 11:00 UTC: We have updated the cleverapps load balancers and will soon restart them. We will proceed with further upgrades this afternoon.
EDIT 14:20 UTC: We are beginning the updates of the Paris load balancers.
EDIT 18:00 UTC: We are still updating the load balancers.
EDIT 19:30 UTC: We have stopped the update process for today; we will continue tomorrow.
EDIT 10:00 UTC: We are resuming the updates of the load balancers.
EDIT 10:45 UTC : We have finished the updates.
Maintenance Window: 2024-03-05T13:00:00Z - 2024-03-05T17:00:00Z (UTC)
Scope:
- Database Load Balancer (software and hardware upgrade)
- Application Load Balancer (software and hardware upgrade)
Expected Impact:
- Brief disconnections or connection drops during the upgrade process.
- Potential minor performance fluctuations.
Additional Information:
- Software upgrade already deployed on cleverapps.io (https://www.clevercloudstatus.com/incident/803) and on the Paris load balancers (https://www.clevercloudstatus.com/incident/807 and https://www.clevercloudstatus.com/incident/805).
- Please report any issues with a method for reproducing the problem (e.g., curl command for application load balancer issues and/or psql / redis / mysql queries for database load balancer).
EDIT: The maintenance window has been moved to next Tuesday.
EDIT 14:00 UTC : We are beginning the maintenance.
EDIT 16:00 UTC: We have finished installing the new hardware alongside the existing one in the RBXHDS region and will begin switching the traffic on the database and application load balancers. We are also starting the installation of the load balancers in the RBX region.
EDIT 16:20 UTC: We are switching the database load balancer instance.
EDIT 16:45 UTC: We have fully switched the load balancers of the RBXHDS region and have finished installing the new load balancers alongside the current ones in the RBX region. We will begin switching the traffic to the new instances.
EDIT 17:15 UTC: We have finished switching the traffic from the old load balancers to the new ones. The maintenance is over.
February 2024
We have detected trouble accessing Elasticsearch add-ons in the Montreal region.
We are working on it.
[2024-02-27 08:50 UTC] We identified the root cause and fixed it. Everything should now be OK.
We are detecting a high number of errors on our query layer. The impacted components have been restarted and are currently reloading.
A query batch has effectively DDoSed some query components. They are currently reloading, and queries are unavailable while they load.
Queries are back online.
We are detecting a higher number of errors than expected on the reverse proxies dedicated to add-ons in the MTL zone. A very small percentage of users may experience trouble connecting to their database. We are investigating.
EDIT 17:45 UTC: We have applied a configuration change to try to mitigate the issue. We are watching.
EDIT 18:00 UTC: We have performed a rolling reboot of the load balancers to give them more capacity.
The shared MTL cluster is located on the hypervisor that crashed. We are working on it: https://www.clevercloudstatus.com/incident/818
EDIT 16:50 UTC : The shared cluster is up and running.
A hypervisor in the MTL zone has crashed and is currently rebooting. Applications are being moved to other hypervisors. Add-ons on the rebooting hypervisor are not reachable. Once the hypervisor is back up, we will make sure all add-ons are up.
EDIT 16:45 UTC: The hypervisor has rebooted and is now operational.
A maintenance is planned on our DEV PostgreSQL add-on cluster (software upgrade).
EDIT 2024-02-19 21:00 UTC: The maintenance was successfully completed.
We are investigating an issue where git repositories are not updated when you push a new commit in the MTL region. Deployments are currently working as expected. We are looking into the issue.
EDIT 18:02 UTC: The issue has been identified and fixed. If you pushed any commits that didn't get applied, please let our support know about it so we can force a deployment.
We have an issue with git authentication during deployment. We are investigating the issue.
EDIT 15:30 UTC: We have resolved the git authentication issue.