Incident History

Full history of incidents.

Newest first

January 2024

Elevated rate of failed deployments 2 years ago

Fixed · Deployments · Global

We are seeing an elevated rate of failed deployments. We are investigating the issue.

EDIT 15:58 UTC: The issue has been identified and deployments should have been back to normal since 15:40 UTC.

[Metrics] Elevated queries error rate 2 years ago

Fixed · Access Logs · Global

We are seeing an elevated error rate for metrics read queries due to the underlying storage system. The problem has been identified and we are working toward its resolution. This can impact some Grafana dashboards or API queries. Write performance is not impacted.

Update Thu Jan 04 14:48:00 2024 UTC: We have triggered some data balancing. Some queries may take longer than expected. This can impact some Grafana dashboards or API queries. Write performance may be impacted.

Update Thu Jan 04 20:44:01 2024 UTC: Data balancing is more aggressive than expected and is overloading some components. Queries may be unavailable during that time.

Update Fri Jan 05 02:26:05 2024 UTC: Some components are still overloaded. We are currently catching up on the lag, but queries are disabled for now.

Update Fri Jan 05 08:01:45 2024 UTC: Our write path is still overloaded. We are searching for the bottleneck.

Update Fri Jan 05 16:03:48 2024 UTC: A cleanup subroutine has been triggered to balance and remove slack space from our internal B-tree storage. Queries are still disabled to speed up the process.

Update Sat Jan 06 11:25:28 2024 UTC: The lag has been absorbed. Queries are now enabled, and the cleanup subroutine is still in progress. You may notice latency spikes during queries.

Update Mon Jan 08 14:36:57 2024 UTC: The cleanup subroutine is still in progress, and some workloads have overloaded some components. Queries are disabled to speed up recovery.

Update Mon Jan 08 16:36:18 2024 UTC: Queries are now enabled.

Update Tue Jan 09 14:38:34 2024 UTC: Some StorageServers are lagging, meaning that a very small portion of the data is not available for queries. We are currently catching up on the lag.

Update Tue Jan 16 14:56:55 2024 UTC: Closing the ticket.

[PAR] Load balancer network connectivity 2 years ago

Fixed · Reverse Proxies · Global

We have removed the IP address 46.252.181.103 from the domain name domain.par.clever-cloud.com. One of our network partners detected an abnormal amount of traffic going to this IP address and began mitigating it. We are investigating the issue.

EDIT 15:15 UTC: We are still digging into the issue. The abnormal traffic is over and everything seems to be going back to normal.

EDIT 16:30 UTC: We have put the IP address 46.252.181.103 back in the load balancer pool.

December 2023

[NORTH] Partial Cellar requests timeout 2 years ago

Fixed · Cellar · Global

Between 16:58 UTC and 17:03 UTC, the Cellar service on the North region timed out on some requests. The faulty component has been decommissioned and further investigations will be done to understand the source of the timeouts. The service is currently up and running.

EDIT 2023-12-30 00:51 UTC: The problem has been identified and resolved. The component is back in the pool and is working as expected. This incident is now over.

[Metrics] Elevated queries error rate 2 years ago

Fixed · Access Logs · Global

We are seeing an elevated error rate for metrics queries due to the underlying storage system. The problem has been identified and we are working toward its resolution. This can impact some Grafana dashboards or API queries.

EDIT 09:44 UTC: The issue is not fully resolved yet but we are seeing improvements. We continue working on the issue.

EDIT 11:04 UTC: Queries have been working since 10:15 UTC. We continue monitoring to ensure everything is working as intended.

EDIT 15:43 UTC: Everything is back to normal. This incident is now over.

[RBX] Unreachable hypervisor 2 years ago

Fixed · Infrastructure · Global

A hypervisor is unreachable. We are investigating.

EDIT 03:17 UTC: No database is affected on this hypervisor and applications have been redeployed.

EDIT 03:30 UTC: The hypervisor has been rebooted and everything is back to normal.

[RBX] Unreachable hypervisor 2 years ago

Fixed · Infrastructure · Global

A hypervisor is unreachable. We are investigating.

EDIT 3:37 UTC: The issue seems to be related to the following OVH incident: https://bare-metal-servers.status-ovhcloud.com/incidents/x135vv46x85l

EDIT 3:45 UTC: Applications on this hypervisor are currently redeploying and there are no add-ons on it. We have also temporarily removed the A record from domain.rbx.clever-cloud.com to solve connection issues.

EDIT 4:00 UTC: Applications have been redeployed. We are waiting on the OVH team before going further.

EDIT 05:30 UTC: The hypervisor is reachable again. We are starting the recovery process.

EDIT 05:45 UTC: The recovery process is over and everything works normally. The load balancer IP affected by the incident will be put back in the pool later. For the record, the IP is 87.98.177.176 for domain.rbx.clever-cloud.com.

[PAR] Load balancer connection issues 2 years ago

Fixed · Reverse Proxies · Global

We are seeing the number of connections on load balancers rising. We are investigating.

EDIT 10:20 UTC: The investigation is still in progress and we are mitigating the issue by raising the maximum number of connections.

EDIT 11:00 UTC: We are now back to nominal values. We are still monitoring.

[Paris] Datacenter update 2 years ago

Fixed · Global

We are planning to do various updates on one of our datacenters in the Paris region starting at 14:15 UTC. It will last for a few hours. No issue is to be expected during this maintenance.

We will update this status accordingly.

EDIT 15:10 UTC: Maintenance is over, no impact during the operations.

[RBX] Hypervisor down 2 years ago

Fixed · Infrastructure · Global

Our monitoring detected that a hypervisor located in RBX-1 is unreachable. We are investigating.

EDIT 06:07 AM UTC: The hypervisor became unresponsive due to a very high CPU load average. It has been rebooted. Almost all databases are reachable; we are fixing the last ones.

EDIT 06:45 AM UTC: All databases are now up.

API instability 2 years ago

Fixed · API · Global

We have detected a high number of errors towards certain APIs. One of the core databases has been restarted to restore the service.

Internal load balancer desync 2 years ago

Fixed · Reverse Proxies · Global

We have detected a configuration issue on our internal load balancer. It has been fixed. You may have experienced issues connecting to api.clever-cloud.com and the console for a few minutes.

Matomo add-on creation does not work 2 years ago

Fixed · Matomo add-on · Global

We are investigating an issue with Matomo add-ons, which have been failing to create for a few days.

EDIT 2023-12-21 16:00 UTC+1: We found and fixed the root cause. Matomo add-ons can now be ordered again.

[Paris] Datacenter update 2 years ago

Fixed · Global

We are planning to do various updates on one of our datacenters in the Paris region starting at 15:40 UTC. It will last for a few hours. No issue is to be expected during this maintenance.

We will update this status accordingly.

EDIT 17:30 UTC: Maintenance is over, no impact during the operations.

[PAR] Load balancer connection issues 2 years ago

Fixed · Reverse Proxies · Global

We are observing connection issues on load balancers. We are investigating.

EDIT 16:00 UTC: We have found that one of our customers is under a DDoS attack. We are mitigating the issue.

EDIT 16:30 UTC: The DDoS attack seems to be mitigated. We are monitoring.

[JED] A hypervisor is unreachable 2 years ago

Fixed · Infrastructure · Global

A hypervisor in the Jeddah region has been unreachable since 10:25 UTC. We are investigating.

EDIT 10:55 UTC: The hypervisor went back online at 10:33 UTC. All applications were redeployed to another hypervisor. The incident is now over.

Hypervisor in SCW crashed 2 years ago

Fixed · Infrastructure · Global

A hypervisor in the SCW region crashed. We restarted it.

Some databases became unavailable. We are checking that they all rebooted correctly.

EDIT 15:51 UTC: All checks have completed. All the services are operational.

EDIT 04/12/2023 11:00 UTC: It seems that the load balancer behind the IP 212.129.27.183 was impacted by the incident. The issue is solved.

[SCW] A database reverse proxy went unresponsive for 3 minutes 2 years ago

Fixed · Reverse Proxies · Global

16:44 UTC: One of the reverse proxies for databases became unresponsive on SCW. We restarted it. 16:47 UTC: The reverse proxy has restarted and is working again.

Consequences: some applications on SCW may have lost connection to their database for a few minutes. They may have crashed and been redeployed by our monitoring.

November 2023

API seems to be slow 2 years ago

Fixed · API · Global

Our main API responds slowly. We are investigating to find out why.

EDIT 19:00 UTC: The issue has been solved.

[Paris] Datacenter update 2 years ago

Fixed · Global

We are planning to do various updates on one of our datacenters in the Paris region starting at 13:30 UTC. It will last for a few hours. No issue is to be expected during this window.

We will update this status accordingly.

EDIT 17:30 UTC: All updates are now over. Operations went smoothly and no impact was detected.