Incidents
Full history of incidents.
April 2023
We are having issues with some deployments. To fix them, we are halting deployments for a few minutes.
EDIT April 25, 08:04 AM UTC: We are still experiencing some deployment issues. The issue has been identified and we are working on a fix.
Some VMs are currently stuck in the STOPPING state. We are investigating.
EDIT 8:00 AM UTC: VMs are no longer stuck.
Reason: a malicious user found a way to start a large number of huge instances and run resource-heavy cryptomining operations. This overloaded the hypervisors and made some APIs unresponsive. We blocked the user and took measures to prevent future abuse of our service.
Metrics and access logs (through Grafana, the Web Console or the API) are experiencing either slowness or missing data. We are currently looking into it.
EDIT 10:40 AM UTC: We have found the issue and we are currently fixing it.
EDIT 04:01 PM UTC: The issue is resolved.
Deployments are currently stuck, preventing users from deploying their applications. Other parts of the infrastructure are currently experiencing instabilities.
EDIT 09:35 UTC: The problem should now be mostly resolved. Some services might still have issues; dedicated incidents will be opened for them. We continue to monitor the situation.
A maintenance operation on the storage layer of the metrics platform slowed down ingestion and we have a backlog of logs to consume.
EDIT 12:42 UTC: The maintenance operation is complete and no more lag is present.
We are currently investigating reverse proxy issues on the Warsaw zone.
EDIT 20:30 UTC: The problem was due to increased load, and capacity has been added to handle it. We continue to monitor the incident.
EDIT 00:53 UTC: We have not seen any other issues since 20:30 UTC. This problem is now fixed.
The shared MongoDB cluster is experiencing issues because some users are abusing it.
We have disabled the creation of new MongoDB DEV plans. This will give us time to set up a new cluster and clean the existing one.
You can still provision the other MongoDB plans.
The shared MongoDB cluster is currently down. We are working on bringing it back up.
It seems the cluster received a large number of connections and could not handle the load.
The cluster is currently rebuilding. We are waiting for it to finish.
19:40: The cluster has finished rebuilding and is now accepting connections.
We experienced reverse proxy instabilities on the Warsaw zone between 16:15 UTC and 16:22 UTC. During that time, some connections might have been refused or closed unexpectedly. The problem has been fixed and the underlying issue has been found.
March 2023
Cellar control plane has detected inconsistencies. We are investigating the issue.
EDIT 09:25 UTC: We have started the recovery process and are waiting for it to complete.
EDIT 09:32 UTC: The recovery process has ended successfully and the cluster is healthy.
We are observing that a few deployments are frozen or out of sync with the load balancing system.
EDIT 15:00 UTC: The deployment system is now in sync and the frozen deployments are up and running.
The monitoring has detected that the metrics storage layer is offline. We are investigating.
EDIT 13:56 UTC: A node has crashed. The metrics storage layer has finished its recovery process; it will take 20 minutes to consume the lag.
EDIT 14:22 UTC: The lag has been consumed and the metrics storage layer is operating normally.
The monitoring has detected that a hypervisor is not responding. We are investigating.
EDIT 8:31 UTC: The hypervisor is up and running.
We are currently investigating reverse proxy instabilities on our Paris zone.
EDIT 18:56 UTC: To be more specific about the instabilities: connections were processed more slowly, which increased response times, sometimes drastically. The root cause was found and fixed at 18:42 UTC. Since then, everything has been back to normal. We continue to monitor the situation.
EDIT 19:11 UTC: Additional investigation will be performed to pinpoint the exact cause of the problem and measures will be added to prevent it from happening again. Sorry for the inconvenience.
The main Clever Cloud API will go under maintenance for about 30 minutes, starting at 21:00 UTC.
During these 30 minutes, some deployments may not go through. Some calls may fail.
Everything seems to have gone well. The operation was over at 21:28.
EDIT 23:15 UTC: It seems that some application creations are having issues following this change; we are investigating.
EDIT 00:10 UTC: A fix has been implemented and applications are now correctly created. Some users may have received a 200 - OK answer from the API for application creation while subsequent requests for that application returned a 404 - Not Found. Sorry for the inconvenience.
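For clients affected by the 200-then-404 window described above, a defensive create-then-poll pattern avoids acting on an application before it is actually readable. The sketch below is a minimal illustration only: the base URL, token, endpoint paths and response shape are hypothetical placeholders, not the actual Clever Cloud API.

```python
import time
import requests

API = "https://api.example.com/v2"   # hypothetical base URL, not the real endpoint
TOKEN = "..."                        # placeholder credential

def create_app_and_wait(payload, retries=5, delay=2.0):
    """Create an application, then poll until it is readable.

    Guards against the window where creation returns 200 OK but a
    following GET for the same application still returns 404 Not Found.
    """
    headers = {"Authorization": f"Bearer {TOKEN}"}
    resp = requests.post(f"{API}/applications", json=payload, headers=headers)
    resp.raise_for_status()
    app_id = resp.json()["id"]  # assumed response shape

    for _ in range(retries):
        check = requests.get(f"{API}/applications/{app_id}", headers=headers)
        if check.status_code == 200:
            return check.json()
        time.sleep(delay)  # back off while the creation propagates
    raise RuntimeError(f"application {app_id} still not visible after {retries} checks")
```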
(All times UTC)
- At 20:10, one of the 4 reverse proxies on the PAR zone stopped responding to some requests. No internal metrics changed and no unusual logs were written; the requests would simply time out. The other three were still running, so request errors appeared random.
- At 20:25, it stopped responding entirely.
- At 20:40, our external monitoring tool alerted us. We investigated, found which reverse proxy had failed, and restarted it.
- At 20:43, the reverse proxy was restarted and traffic flowed normally.
(All times in UTC)
16:30: We started seeing alerts about high load on the primary node. 17:00: We started getting reports about the cluster being unreachable. 18:00: After checking the cluster, we decided to restart the primary node.
Data may have been lost as the node was not writing / replicating correctly. We are still waiting for the primary node to restart. The secondary does not seem to elect itself as primary.
19:30: The secondary finally got promoted to primary. We are blocking users making unfair use of the cluster. 22:45: We detected that the node we restarted had failed to rejoin the cluster. We decided to remove it entirely and re-create it from scratch. 2023-03-13 10:00: The node has fully reached the "SECONDARY" state. We put it back into production.
Measures have been taken to prevent future unfair use by users.
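For readers wondering how the PRIMARY/SECONDARY states mentioned above can be observed: the replica set status is exposed by MongoDB's replSetGetStatus command. A minimal sketch using pymongo follows; the connection string is a placeholder and must point at an actual replica set member.

```python
from pymongo import MongoClient

# Placeholder connection string; in practice this points at the replica set.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# replSetGetStatus reports each member's state (PRIMARY, SECONDARY, RECOVERING, ...).
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])
```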
(All times in UTC)
11:30: Our main API keeps becoming unresponsive. We are investigating. This impacts the following, in an irregular fashion:
- clever ssh may not succeed
- Some deployments may not go through
Applications should keep running, but some monitoring deployments may fail.
12:55: The API seems to have stabilized. The database seems to have been under heavy load. We are investigating the queries responsible for that load and trying to improve them.
We are currently investigating network issues on our Paris zone.
EDIT 17:15 UTC: The issue is now resolved. A part of our infrastructure in Paris couldn't access some public DNS servers anymore, leading to multiple DNS queries failing. An upstream network provider made a change that fixed the problem around 16:52 UTC.
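As an illustration of the kind of failure involved, the sketch below checks whether a given public resolver still answers queries. It assumes the dnspython package; the resolver addresses and test name are examples only.

```python
import dns.resolver  # from the dnspython package

def resolver_answers(server_ip, name="example.com"):
    """Return True if the given DNS server answers an A query for `name`."""
    resolver = dns.resolver.Resolver(configure=False)  # ignore the system resolver config
    resolver.nameservers = [server_ip]
    resolver.lifetime = 3  # seconds before giving up
    try:
        resolver.resolve(name, "A")
        return True
    except Exception:
        return False

# Example: probe a couple of well-known public resolvers.
for ip in ("8.8.8.8", "1.1.1.1"):
    print(ip, "OK" if resolver_answers(ip) else "UNREACHABLE")
```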