Incident History

Full history of incidents.

March 2023

Fixed · API · Global

(All times in UTC)

11:30 Our main API intermittently stops responding. We are investigating. This intermittently affects the following:

clever ssh may not succeed
Some deployments may not go through

Applications should keep running, but some monitoring deployments may fail.

12:55 The API seems to have stabilized. The database appears to have been under very heavy load. We are investigating the queries responsible for that load and will try to improve them.

[PAR] Investigating network issues 3 years ago

Fixed · Infrastructure · Global

We are currently investigating network issues on our Paris zone.

EDIT 17:15 UTC: The issue is now resolved. Part of our infrastructure in Paris could no longer reach some public DNS servers, causing multiple DNS queries to fail. An upstream network provider made a change that fixed the problem around 16:52 UTC.

Core API is experiencing issues 3 years ago

Fixed · API · Global

Clever Cloud Core API is currently experiencing performance issues. We are investigating it.

EDIT 16:03 UTC: We are seeing improvements. We continue to monitor the situation and investigate the root cause, and we are adding more data collection around the various points of contention.

[PAR] A hypervisor went down 3 years ago

Fixed · Infrastructure · Global

A hypervisor went down and we are investigating. Applications are being redeployed.

Update 11:11 AM UTC: The hypervisor has been rebooted and add-ons should be reachable. The root cause of the issue will be determined later. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.

Update 03:13 PM UTC: The same hypervisor went down again. It has been rebooted and add-ons should be reachable. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.

Core API is experiencing issues 3 years ago

Fixed · API · Global

Clever Cloud Core API is currently experiencing performance issues. We are investigating it.

EDIT 14:37 UTC: We are seeing improvements and continue to monitor the situation.

EDIT 16:23 UTC: The incident is now over.

[MTL] Deployment partial issues 3 years ago

Fixed · Deployments · Global

We are facing a network issue between MTL and our control plane that is causing some deployment issues. A workaround has been found and deployments are, as of now, working in this region. A ticket has been opened with our subcontractor to solve the root cause.

EDIT 03/03 02:15 PM UTC: Connectivity between MTL and our control plane is fully restored.

[RBX] Deployment partial issues 3 years ago

Fixed · Deployments · Global

We are experiencing failures when deploying apps to RBX. We are investigating.

EDIT 10:32 AM UTC: A connectivity issue has been detected between RBX and our control plane. The issue is now fixed.

February 2023

[Support] Ticket Center maintenance 3 years ago

Fixed · Global

Maintenance is planned on our Ticket Center tool on February 28th, 2023 at 19:00 UTC. Users will need to refresh their Clever Cloud Console (https://console.clever-cloud.com) to complete the update. Otherwise, the Ticket Center might display an authentication error. During that time, actions on tickets (creation, comment, ...) might fail.

The maintenance is expected to last 5 minutes. If you urgently need to contact us, you can send an email to support@clever-cloud.com.

EDIT 19:38 UTC: The maintenance is now over. Actions on the Ticket Center should be fully available. If you encounter any problems following this update, please email us at support@clever-cloud.com.

[JED] Hypervisor update on Jeddah 3 years ago

Fixed · Global

We need to conduct an update on our Jeddah hypervisors on February 28th, 2023. Services of impacted users will be migrated starting at 20:00 UTC before the update begins.

Impacted users will receive an email for each impacted service.

EDIT 2023-02-28 20:25 UTC: The maintenance is starting.

EDIT 2023-02-28 22:18 UTC: The maintenance is now over.

[PAR] Degraded performance toward github.com 3 years ago

Fixed · Infrastructure · Global

We are currently experiencing degraded performance toward github.com services from our Paris infrastructure. We are investigating the issue. Tools relying on GitHub (composer, go, ...) might take longer than usual to fetch their dependencies or experience connection timeouts or instability.

EDIT 15:48 UTC: We are seeing improvements and the situation is currently back to normal. The root cause appears to have been a BGP announcement change on GitHub's side that sent our traffic through suboptimal routes, leading to degraded performance. We keep monitoring the situation.

EDIT 16:30 UTC: The incident is fully resolved.

Core API is experiencing issues 3 years ago

Fixed · API · Global

Clever Cloud Core API is currently experiencing performance issues. We are investigating it.

[PAR] Planned hypervisor reboot 3 years ago

Fixed · Global

This is a follow-up to the various hypervisor incidents we have had over the last few weeks. A first batch of hypervisors will be updated to try to fix the issue. Impacted users will be contacted by email shortly.

The reboot is planned for tonight (15/02/2023) at 22:00 UTC. Maintenance will start at 21:00 UTC.

EDIT 21:07 UTC: The maintenance is starting. Add-ons will be automatically migrated in the next few minutes.

EDIT 22:52 UTC: The maintenance is over.

[PAR] A hypervisor went down 3 years ago

Fixed · Infrastructure · Global

A hypervisor went down and we are investigating. Applications are being redeployed.

EDIT 22:47 UTC: The hypervisor is back online and add-ons have been up for a few minutes. The root cause of the issue will be determined later. In the meantime, applications hosted on that hypervisor are still redeploying. We continue to monitor the situation.

EDIT 23:44 UTC: The incident is now over. Sorry for the inconvenience.

Deployments may be very slow or stuck when uploading cache artefacts 3 years ago

Fixed · Deployments · Global

We are currently seeing applications having trouble completing their deployments, especially when using a dedicated build VM. They may be stuck or very slow during the cache archive upload. We are investigating.

EDIT 10:55 UTC: The root cause has been found. It was only impacting multipart uploads. For deployments already at the upload phase, you will need to cancel the current deployment and start a new one for the problem to be fixed. Sorry for the inconvenience.

[PAR] A hypervisor went down 3 years ago

Fixed · Infrastructure · Global

A hypervisor went down and we are investigating.

EDIT 22:24 UTC: The hypervisor has been up again for 10 minutes. Add-ons are available again. We are making sure all applications have been redeployed.

EDIT 00:17 UTC: The incident is over.

A Hypervisor is unresponsive 3 years ago

Fixed · Infrastructure · Global

At 16:29 UTC, a staff member started investigating an alert on one of our hypervisors. They found that the hypervisor could no longer be logged into.

All services running on that hypervisor are still up and running, but deployments fail to stop the obsolete VMs and we cannot connect to the host itself. We suspect a "semi" kernel crash on the hypervisor's host. We are investigating and may reboot the hypervisor in the following minutes or hours. (First, we will try to migrate as many important services as possible to avoid causing too much downtime for our customers.)

EDIT 16:46 UTC: We are starting to migrate the add-ons hosted on the impacted hypervisor.

EDIT 18:54 UTC: We rebooted the hypervisor and everything went well. All the remaining services are up again.

[RETROACTIVE] Git repositories SSH remote identification changed 3 years ago

Fixed · Git repositories · Global

Between 12:39 UTC and 20:10 UTC, some users may have seen the error message WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! when pushing code using git+ssh to our Git repositories. This was due to an update of the allowed signature algorithms of our SSH servers. Users whose SSH known_hosts file stored a host key with an old signature algorithm were impacted.

The change has been rolled back.
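To check whether a local known_hosts file holds an entry of the kind described above, the recorded key algorithms for a host can be listed. The following is a minimal sketch, not an official tool: the function name, the sample host names and the truncated keys are all hypothetical, and it only handles plain (unhashed) entries in the standard OpenSSH known_hosts layout.

```python
def stored_key_types(known_hosts_text: str, hostname: str) -> list[str]:
    """Return the key algorithms recorded for hostname in known_hosts content.

    Assumption: only plain host entries are matched. Hashed entries
    (starting with "|1|") and wildcard patterns are not resolved.
    """
    key_types = []
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # An optional marker (@cert-authority or @revoked) may come first.
        if fields[0].startswith("@"):
            fields = fields[1:]
        if len(fields) < 3:
            continue  # not a "hosts keytype key" entry
        hosts, key_type = fields[0].split(","), fields[1]
        if hostname in hosts:
            key_types.append(key_type)
    return key_types


# Hypothetical sample content; host names and keys are placeholders.
sample = """\
# example entries, not real keys
git.example.com,192.0.2.10 ssh-rsa AAAAB3Nza...
git.example.com ssh-ed25519 AAAAC3Nza...
other.example.org ecdsa-sha2-nistp256 AAAAE2Vj...
"""
print(stored_key_types(sample, "git.example.com"))
```

On a real machine the text would typically come from ~/.ssh/known_hosts. OpenSSH's own `ssh-keygen -R <host>` removes all stored keys for a host, which is the usual way to clear a stale entry.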

January 2023

[MTL] Performance issues on MySQL shared cluster 3 years ago

Fixed · MySQL shared cluster · Global

A few customers have complained about performance issues on the MySQL shared cluster. We are investigating.

EDIT 10:00 UTC: We have upgraded the hardware of the MySQL shared cluster.

Monitoring detects an increasing number of unreachable virtual machines 3 years ago

Fixed · Infrastructure · Global

Monitoring detects an increasing number of unreachable virtual machines. It seems related to an update deployment.

EDIT 01:00 UTC: The update deployment has been rolled back.

Clever Cloud API is slow 3 years ago

Fixed · API · Global

Monitoring reports that the number of timeouts is increasing on the Clever Cloud API. We are investigating why.

EDIT 9:08 UTC: Backends behind the Clever Cloud API are up and running. The number of timeouts has decreased. Everything is operating normally.