Clever Cloud Status

Incidents

Full history of incidents.

Oldest first

April 2023

Fixed · Deployments · Global

We are having issues with some deployments. To fix them, we have halted deployments for a few minutes.

EDIT April 25, 08:04 AM UTC: We are still experiencing some deployment issues. The issue has been identified and we are working on a fix.

Fixed · Deployments · Global

Some VMs are currently stuck in the STOPPING state. We are investigating.

EDIT 8:00 AM UTC: VMs are no longer stuck.

Reason: a malicious user found a way to start a large number of huge instances and run resource-heavy cryptomining operations. This overloaded the hypervisors and made some APIs unresponsive. We blocked the user and took action to prevent future abuse of our service.

Fixed · Access Logs · Global

Metrics and access logs (through Grafana, the Web Console or the API) are experiencing either slowness or missing data. We are currently looking into it.

EDIT 10:40 AM UTC: We have found the issue and we are currently fixing it.

EDIT 04:01 PM UTC: The issue is resolved.

Fixed · Deployments · Global

Deployments are currently stuck, preventing users from deploying their applications. Other parts of the infrastructure are currently experiencing instabilities.

EDIT 09:35 UTC: The problem should now be mostly resolved. Some services might still have issues; dedicated incidents will be opened for them. We continue to monitor the situation.

Fixed · Access Logs · Global

A maintenance operation on the storage layer of the metrics platform slowed down ingestion, and we have a backlog of logs to consume.

EDIT 12:42 UTC: The maintenance operation is complete and no more lag is present.

Fixed · Reverse Proxies · Global

We are currently investigating reverse proxies issues on the Warsaw zone.

EDIT 20:30 UTC: The problem was due to increased load; capacity has been added to handle it. We continue to monitor the incident.

EDIT 00:53 UTC: We have not seen any other issues since 20:30 UTC. This problem is now fixed.

Fixed · MongoDB shared cluster · Global

The shared MongoDB cluster is experiencing issues due to abusive usage.

We have disabled the creation of new MongoDB DEV plans. This will give us time to set up a new cluster and clean the existing one.

You can still provision the other MongoDB plans.

Fixed · MongoDB shared cluster · Global

The MongoDB shared cluster is currently down. We are working on getting it back online.

It seems that the cluster received a large number of connections and could not handle the load.

The cluster is currently rebuilding; we are waiting for it to finish.

19:40 The cluster has finished rebuilding and is now accepting connections.
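
A shared cluster buckling under too many connections is usually many clients each opening an unbounded number of them; capping concurrency on the client side is a common mitigation. A minimal sketch (the helper and its limits are illustrative, not Clever Cloud's implementation):

```python
import threading

class BoundedConnections:
    """Cap how many connections one client opens at once.

    Illustrative helper: refusing to open connection N+1 keeps a shared
    database from being flooded when many workers spin up together.
    """

    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def acquire(self) -> bool:
        # Non-blocking: refuse (or queue and retry later) instead of
        # piling more load onto an already-saturated cluster.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

pool = BoundedConnections(2)
print(pool.acquire(), pool.acquire(), pool.acquire())  # → True True False
```

The third `acquire()` fails fast instead of opening yet another connection, which is the behavior that protects the shared cluster.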

Fixed · Reverse Proxies · Global

We've had reverse proxies instabilities on the Warsaw zone between 16:15 UTC and 16:22 UTC. During that time, some connections might have been refused or closed unexpectedly. The problem has been fixed and the underlying issue has been found.

March 2023

Fixed · Cellar · Global

Cellar control plane has detected inconsistencies. We are investigating the issue.

EDIT 09:25 UTC: We have begun the recovery process and are waiting for it to finish.

EDIT 09:32 UTC: The recovery process has ended successfully; the cluster is healthy.

Fixed · Infrastructure · Global

A hypervisor has crashed.

Fixed · Deployments · Global

We are observing that a few deployments are frozen or out of sync with the load-balancing system.

EDIT 15:00 UTC: The deployment system is now in sync and the frozen deployments are up and running.

Fixed · Access Logs · Global

The monitoring has detected that the metrics storage layer is offline. We are investigating.

EDIT 13:56 UTC: A node had crashed. The metrics storage layer has finished its recovery process; it will take about 20 minutes to consume the lag.

EDIT 14:22 UTC: The lag has been consumed and the metrics storage layer is operating normally.

Fixed · Infrastructure · Global

The monitoring has detected that a hypervisor is not responding. We are investigating.

EDIT 8:31 UTC: The hypervisor is up and running.

Fixed · Reverse Proxies · Global

We are currently investigating reverse proxies instabilities on our Paris zone.

EDIT 18:56 UTC: To be more specific about the instabilities, the connections were slower to be processed, increasing the response time, sometimes drastically. The root cause has been found and fixed at 18:42 UTC. Since then, everything is back to normal. We continue to monitor the situation.

EDIT 19:11 UTC: Additional investigation will be performed to pinpoint the exact cause of the problem and measures will be added to prevent it from happening again. Sorry for the inconvenience.

Fixed · API · Global

The main Clever Cloud API will go under maintenance for about 30 minutes, starting at 21:00 UTC.

During these 30 minutes, some deployments may not go through. Some calls may fail.

Everything seems to have gone well. The operation was over at 21:28.

EDIT 23:15 UTC: It seems that some application creations are failing following this change; we are investigating.

EDIT 00:10 UTC: A fix has been implemented and applications are now correctly created. Some users may have had the API answer a 200 - OK for application creation but following requests for that application would return a 404 - Not Found. Sorry for the inconvenience.
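
On the client side, transient failures like these (calls failing during a maintenance window, or a create that briefly returns 404 on follow-up reads) can be smoothed over with a bounded retry loop. A minimal sketch, where the flaky call is a stand-in for an API request and the delays are illustrative:

```python
import time

def with_retries(call, attempts=4, base_delay=0.5, retry_on=(Exception,)):
    """Run call(); on a listed exception, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Stand-in for an API request that fails twice with a transient error.
state = {"calls": 0}
def flaky_create():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient 5xx")
    return "created"

print(with_retries(flaky_create, base_delay=0.01))  # → created
```

Bounding the attempt count matters: retrying forever against a platform in maintenance only adds load.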

Fixed · Reverse Proxies · Global

(All times UTC)

  • At 20:10, one of the 4 reverse proxies on zone PAR stopped responding to some requests. No internal metrics changed and no unusual logs were written; the requests would simply time out. The other three proxies were still running, so the request errors appeared random.
  • At 20:25, it stopped responding at all.
  • At 20:40, our external monitoring tool alerted us. We investigated, found which reverse proxy had failed, and restarted it.
  • At 20:43, the reverse proxy was restarted and traffic flowed normally again.
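
The lesson in this timeline is that internal metrics can stay green while requests simply hang, so an external probe with a hard timeout is what catches the failure. A minimal sketch of such a probe (host, port, and timeout values are illustrative assumptions):

```python
import socket
import time

def probe(host: str, port: int, timeout: float = 3.0):
    """Try to open a TCP connection; return (ok, elapsed_seconds).

    The timeout is the point: it catches the failure mode where a server
    accepts nothing and requests hang, even while internal metrics still
    look healthy.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```

Probing each proxy individually, not only the load-balanced address, is what surfaces one bad member out of four: through the balanced address a single dead backend only shows up as occasional random errors.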
Fixed · MongoDB shared cluster · Global

(All times in UTC)

  • 16:30: We started seeing alerts about high load on the primary node.
  • 17:00: We started getting reports that the cluster was unreachable.
  • 18:00: After checking the cluster, we decided to restart the primary node. Data may have been lost, as the node was not writing / replicating correctly. While waiting for the primary to restart, the secondary did not elect itself as primary.
  • 19:30: The secondary was finally promoted to primary. We are blocking users with unfair use of the cluster.
  • 22:45: We detected that the restarted node had failed to rejoin the cluster. We decided to remove it entirely and re-create it from scratch.
  • 2023-03-13 10:00: The node fully reached the SECONDARY state and was put back into production.

Measures have been taken to prevent future unfair use of the cluster.
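
A secondary that does not promote itself is consistent with majority-based replica-set elections: a primary can only be elected while a strict majority of voting members is reachable and healthy. A sketch of that arithmetic (assuming a standard majority quorum; the member counts are illustrative, not this cluster's topology):

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary under a strict-majority quorum."""
    return voting_members // 2 + 1

def can_elect(voting_members: int, reachable: int) -> bool:
    return reachable >= majority(voting_members)

# In a 3-member set, 2 votes are needed: losing one node still leaves a
# quorum, but an overloaded or partitioned survivor set of 1 cannot elect.
print(majority(3), can_elect(3, 2), can_elect(3, 1))  # → 2 True False
```

This is also why an overloaded node that is up but unresponsive can stall an election almost as badly as a node that is down.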

Main API is down
Fixed · API · Global

(All times in UTC)

11:30 Our main API keeps intermittently failing to respond. We are investigating. This impacts the following, in an irregular fashion:

  • clever ssh may not succeed
  • Some deployments may not go through

Applications should keep running, but some monitoring deployments may fail.

12:55 The API seems to have stabilized. The database appears to have been under heavy load. We are investigating the queries responsible for that load and trying to improve them.

Fixed · Infrastructure · Global

We are currently investigating network issues on our Paris zone.

EDIT 17:15 UTC: The issue is now resolved. A part of our infrastructure in Paris couldn't access some public DNS servers anymore, leading to multiple DNS queries failing. An upstream network provider made a change that fixed the problem around 16:52 UTC.