Clever Cloud Status

Incidents

Full history of incidents.

March 2025

Fixed · Pulsar · Global

09:47 UTC: After a routine upgrade of some components in a specific AZ, we have seen some disruptions in message ingestion.

09:15 UTC: One of the broker components was updated and rebooted.

10:28 UTC: Message ingestion started to recover and is back to normal levels; the incident is resolved.

Fixed · Infrastructure · Global

Network Incident: Disrupted link with network provider

We would like to inform our users that a network incident occurred last night on one of our provider's links. This incident led to disruptions in access to some of our services.

Incident Details:

  • Type of Incident: Network connectivity issue
  • Start Time: 01:30 UTC
  • Estimated End Time: 02:00 UTC

Impact:

Users may have experienced increased response times or interruptions in access to certain services.

Fixed · Global

We are currently updating our control plane and our images to support the new MySQL versions 8.4.3-3 and 8.0.41-32.

No impact expected

[16:20 CET] These new versions have been released correctly.

Fixed · Global

We are currently updating our control plane and our images to support the new PostgreSQL image versions 16 and 17.

No impact expected

[11:30 CET] This release has been completed correctly.

Fixed · Infrastructure · Global

A hypervisor is unreachable in the Paris region; we are investigating the issue.

EDIT 21:15 UTC: A failed NVMe drive is preventing the hypervisor from booting, but no customer services have been impacted. This hypervisor is only used for testing purposes and distributed systems.

Fixed · Infrastructure · Global

A hypervisor is temporarily overloaded in the MTL region; we are investigating.

EDIT 12:00 UTC: The overload is caused by a hardware failure; we are draining the hypervisor. Databases on this hypervisor will be migrated to another, healthy hypervisor.

EDIT 12:45 UTC: We have migrated a few databases off the hypervisor. During the operation we had to reboot the hypervisor, which resolved the issue; we are still investigating the root cause. Services on the hypervisor should be up.

Fixed · Infrastructure · Global

A hypervisor is rebooting in the MEA region; we are working to restart all services.

EDIT 16:44 CET: The hypervisor has rebooted and all services are available again.

Fixed · Infrastructure · Global

We are experiencing a network reachability issue in the Gouv region. We are looking into it.

EDIT 13:22 CET: Our infrastructure provider acknowledged the incident and is working on it.

EDIT 13:46 CET: Our infrastructure provider continues to investigate the issue.

EDIT 13:52 CET: The whole region is impacted; no service hosted there can be reached.

EDIT 14:17 CET: A fix has been implemented on our infrastructure provider's side. We regained access to the infrastructure at 14:08. We are making sure all services are restarted.

EDIT 14:32 CET: All services have been restarted; we will keep monitoring in case anything else comes up.

EDIT 14:40 CET: All KMS nodes are fully operational

Fixed · Infrastructure · Global

Clever Cloud Incident – Explanations and Lessons Learned

Today we experienced an incident affecting our infrastructure. Here is a summary of the causes, ongoing analysis, and planned actions to strengthen our resilience.

Incident Timeline:

  1. Power Outage: A power failure reduced our computing capacity by one-third, highlighting the need to expand our infrastructure to five datacenters to better absorb such incidents.
  2. Network Issue: A network outage related to BGP announcements followed, revealing an underlying issue that requires further investigation to prevent recurrence.

Why Did Recovery Take Time?

  1. Machine Reconnection: The corrective measure to prevent overload during VM reconnection to Pulsar was not fully effective (see the sketch after this list). An in-depth analysis is underway to improve this process.
  2. Orchestration Evolution: Our current system is reaching its limits. We are working on a new orchestration architecture to better manage recovery and optimize performance.
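
For context on item 1, a common generic way to limit this kind of overload when many VMs reconnect at once is jittered exponential backoff, so clients spread their retries out instead of all reconnecting at the same instant. The sketch below is a minimal illustration of that idea, not Clever Cloud's actual reconnection code; the `connect` callable and all parameters are hypothetical.

```python
import random
import time

def reconnect_with_backoff(connect, max_attempts=8, base_delay=1.0, max_delay=60.0):
    """Retry `connect` with capped exponential backoff and full jitter.

    Spreading retries over a randomized window prevents thousands of
    clients from hitting the broker at the same instant after a mass
    disconnection. `connect` is a hypothetical callable that raises
    ConnectionError on failure and returns a connection on success.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Exponential backoff capped at max_delay, randomized ("full jitter").
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    raise RuntimeError(f"could not reconnect after {max_attempts} attempts")
```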

Next Steps:

We will publish a detailed post-mortem and schedule meetings with customers to:

  • Analyze the incident,
  • Explain upcoming changes,
  • Demonstrate our commitment to improving infrastructure resilience.

We will keep you informed about these actions. Thank you for your patience and trust.

The Clever Cloud Team

Detailed timeline:

EDIT 2:11pm (UTC): One of the Paris datacenters has experienced an electricity issue. Some hypervisors have been rebooted.

EDIT 2:21pm (UTC): Applications are being restarted, the situation is stabilizing.

EDIT 2:28pm (UTC): Monitoring is up and generating new statuses

EDIT 2:34pm (UTC): The recovery process is still ongoing. Customers can open a ticket through their email endpoint in addition to the Web Console.

EDIT 2:43pm (UTC): Infrastructure is under high load. We're accelerating the recovery process with load sanitization.

EDIT 2:48pm (UTC): All load balancers are now back in sync. Per-service availability:

  • Load Balancing (dedicated instances): OK
  • Cellar Storage: OK
  • Orchestration: OK, but recovering a lot of runtime instances.
  • API: OK
  • Metrics API: KO (network topology split)
  • Databases: Globally OK (individual situations being worked on)
  • Monitoring: OK (in sync since a couple of minutes ago)
  • Infrastructure overall load: high

EDIT 3:00pm (UTC): Deployments are going through but are still slow. The Clever Cloud API is being restarted.

EDIT 3:05pm (UTC): Infrastructure load sanitization: 30%

EDIT 3:40pm (UTC): The overall situation is better. There are still a few thousand VMs in the recovery queue.

EDIT 3:48pm (UTC): Most databases should be available. We're experiencing an additional delay with some encrypted databases.

EDIT 3:53pm (UTC): Infrastructure load sanitization: 100%

EDIT 4:15pm (UTC): The remaining databases are recovered. A total fix is expected within the next 10 to 15 minutes.

EDIT 4:30pm (UTC): All applications should run fine (from the orchestration point of view). Orchestration monitoring is partially up and running (stuck apps will be unstuck shortly).

EDIT 5:10pm (UTC): All stuck applications should now be available (from the monitoring point of view).

EDIT 5:16pm (UTC): All databases are available

EDIT 5:44pm (UTC): The incident is considered closed (a few remaining cases are still being handled with customers).

February 2025

Fixed · Infrastructure · Global

A hypervisor in the PAR region is unreachable; we are working on it.

EDIT 16:30 UTC+1: This hypervisor faced a network issue that made it unreachable for several minutes. It is now up and running.

EDIT 19:15 UTC+1: The incident is over.

Fixed · MySQL shared cluster · Global

[2025-02-27] 12:16 UTC: We are deploying a new shared cluster for the Dev MySQL 8.4 add-on.

[2025-02-27] 11:40 UTC: New MySQL cluster successfully deployed.

[2025-02-27] 13:30 UTC: We are deploying a new shared cluster for the Dev PostgreSQL 15 add-on.

[2025-02-28] 09:17 UTC: New PostgreSQL cluster successfully deployed.

Fixed · Reverse Proxies · Global

We renewed the wildcard certificate *.services.clever-cloud.com and some services were impacted. We are investigating the issue and have rolled back the new certificate.

Fixed · Global

We are experiencing some problems with the Grafana service; we are currently investigating.

After a brief interruption, access to the service has been restored.

Fixed · Deployments · Global

Some deployments may fail unexpectedly. The deployment phases (build and run) may succeed, but the deployment may still be reported as a failure. We are investigating the issue.

EDIT 11:09 UTC: After investigation, the issue turned out to be misdiagnosed: no customer deployments failed when they should not have. This incident is therefore resolved.

Fixed · Infrastructure · Global

12:28 UTC: We were alerted that some services are not properly reachable; MateriaTS and MateriaKV are impacted.

12:30 UTC: An expired certificate was identified.

12:35 UTC: The certificate has been renewed.

12:38 UTC: The certificate is being propagated.

12:48 UTC: The certificate is now propagated; services are back to normal.

The MateriaTS and MateriaKV services are back to normal.
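
As an illustration of the kind of proactive check that catches this class of issue before a certificate expires in production, the sketch below fetches a server's TLS certificate and reports how many days remain before expiry. It is a generic example, not the monitoring Clever Cloud actually runs; the hostname and alert threshold are placeholders.

```python
import datetime
import socket
import ssl

def days_until_expiry(host: str, port: int = 443) -> float:
    """Return the number of days before the TLS certificate of host:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    now = datetime.datetime.now(datetime.timezone.utc).timestamp()
    return (expires_at - now) / 86400

if __name__ == "__main__":
    # Placeholder endpoint and threshold: adapt to the services you monitor.
    remaining = days_until_expiry("example.com")
    if remaining < 14:
        print(f"Certificate expires in {remaining:.0f} days: renew it now")
    else:
        print(f"Certificate is valid for another {remaining:.0f} days")
```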

Fixed · Infrastructure · Global

A hypervisor in the RBX region is unreachable; we are investigating. No databases are impacted, and applications are already being deployed on other hypervisors.

  • EDIT 09:45: The hypervisor is still down. A ticket has been opened with our infrastructure provider. At this time, no customer is impacted.
  • EDIT 09:55: The infrastructure provider has informed us that a hardware failure caused the crash. They are currently investigating in the datacenter.
  • EDIT 10:15: The hypervisor is up and running.

January 2025

Fixed · Reverse Proxies · Global

We are investigating instabilities in our load balancers in the Scaleway region.

EDIT 13:43 UTC: We are experiencing an extremely high number of connections with invalid data

EDIT 15:42 UTC: We updated the configuration on our load balancers to handle the connection peaks better

EDIT 21:38 UTC: The issue has been resolved since the configuration change.

Fixed · Global

We will conduct network maintenance in the Paris region tonight between 21:30 UTC and 23:30 UTC. The goal of this maintenance is to prepare for the maintenance planned on Thursday. No network degradation or impact is expected. We will update this incident throughout the operation.

EDIT 21:45 UTC: The maintenance has started on the routers.

EDIT 23:25 UTC: The maintenance is still in progress; we are extending the maintenance window until 00:00 UTC.

EDIT 00:00 UTC: We have encountered some unexpected issues and have decided to extend the maintenance again, until 00:30 UTC, to complete the migration.

EDIT 00:30 UTC: The maintenance is now over. No customer impact has been identified.

Fixed · Global

Network maintenance in the Paris region is planned for January 23rd, 2025 between 21:30 UTC and 23:30 UTC. No service interruption or degradation is expected during this maintenance. We will update this incident throughout the operation. This is the second of three maintenance operations to improve our transit peering.

EDIT 21:37 UTC: The maintenance will start shortly.

EDIT 00:47 UTC: The maintenance is now over. A brief connection cut was seen between 00:17:17 and 00:17:43 for incoming traffic on one of our routers.

Fixed · Infrastructure · Global

A hypervisor rebooted in the MTL region. This hypervisor only hosts some of the FSBucket add-ons as well as some DEV MySQL add-ons. We are investigating the root cause and starting to restore the service.

EDIT 16:45 UTC: The services are available again. The issue most probably comes from ongoing electrical maintenance. This server was located in room B710: https://network.status-ovhcloud.com/incidents/m9fzvt0nd8jm. We will follow up with OVH to get more information. In the meantime, we'll keep an eye on the maintenance status.