Incidents
Full history of incidents.
April 2023
We are having issues with some deployments. To fix them, we are halting deployments for a few minutes.
EDIT April 25, 08:04 AM UTC: We are still experiencing some deployment issues. The issue has been identified and we are working on a fix.
Some VMs are currently stuck in the STOPPING state. We are investigating.
EDIT 8:00 AM UTC: VMs are no longer stuck.
Reason: a malicious user found a way to start a large number of huge instances and run resource-heavy cryptomining operations. This overloaded the hypervisors and made some APIs unresponsive. We blocked the user and took measures to prevent future abuse of our service.
Metrics and access logs (through Grafana, the Web Console or the API) are experiencing either slowness or missing data. We are currently looking into it.
EDIT 10:40 AM UTC: We have found the issue and we are currently fixing it.
EDIT 04:01 PM UTC: The issue is resolved.
Deployments are currently stuck, preventing users from deploying their applications. Other parts of the infrastructure are currently experiencing instabilities.
EDIT 09:35 UTC: The problem should now be mostly resolved. Some services might still have issues; dedicated incidents will be opened for them. We continue to monitor the situation.
A maintenance operation on the storage layer of the metrics platform slowed down ingestion and we have a backlog of logs to consume.
EDIT 12:42 UTC: The maintenance operation is complete and no more lag is present.
We are currently investigating reverse proxy issues on the Warsaw zone.
EDIT 20:30 UTC: The problem was due to increased load, and capacity has been added to handle it. We continue to monitor the incident.
EDIT 00:53 UTC: We have not seen any other issues since 20:30 UTC. This problem is now fixed.
The shared MongoDB cluster is experiencing issues because some users are abusing it.
We have disabled the creation of new MongoDB DEV plans. This will give us time to set up a new cluster and clean the existing one.
You can still provision the other MongoDB plans.
The shared MongoDB cluster is currently down. We are working on bringing it back up.
It seems the cluster received a large number of connections and could not handle the load.
The cluster is currently rebuilding. We are waiting for it to finish.
19:40: The cluster has finished rebuilding and is now accepting connections.
We experienced reverse proxy instabilities on the Warsaw zone between 16:15 UTC and 16:22 UTC. During that time, some connections might have been refused or closed unexpectedly. The problem has been fixed and the underlying issue has been found.
March 2023
Cellar control plane has detected inconsistencies. We are investigating the issue.
EDIT 09:25 UTC: We have started the recovery process and are waiting for it to complete.
EDIT 09:32 UTC: The recovery process has ended successfully and the cluster is healthy.
We are observing that a few deployments are frozen or out of sync with the load balancing system.
EDIT 15:00 UTC: The deployment system is now in sync and the frozen deployments are up and running.
The monitoring has detected that the metrics storage layer is offline. We are investigating.
EDIT 13:56 UTC: A node has crashed. The metrics storage layer has finished its recovery process; it will take 20 minutes to consume the lag.
EDIT 14:22 UTC: The lag has been consumed and the metrics storage layer is operating normally.
The monitoring has detected that a hypervisor is not responding. We are investigating.
EDIT 8:31 UTC: The hypervisor is up and running.
We are currently investigating reverse proxy instabilities on our Paris zone.
EDIT 18:56 UTC: To be more specific about the instabilities: connections were processed more slowly, which increased response times, sometimes drastically. The root cause was found and fixed at 18:42 UTC. Since then, everything has been back to normal. We continue to monitor the situation.
EDIT 19:11 UTC: Additional investigation will be performed to pinpoint the exact cause of the problem and measures will be added to prevent it from happening again. Sorry for the inconvenience.
The main Clever Cloud API will go under maintenance for about 30 minutes, starting at 21:00 UTC.
During these 30 minutes, some deployments may not go through. Some calls may fail.
Everything seems to have gone well. The operation was over at 21:28.
EDIT 23:15 UTC: It seems that some application creations are having issues following this change; we are investigating.
EDIT 00:10 UTC: A fix has been implemented and applications are now correctly created. Some users may have received a 200 - OK answer from the API for application creation while subsequent requests for that application returned a 404 - Not Found. Sorry for the inconvenience.
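For clients affected by the 200-then-404 window described above, a defensive create-then-poll pattern avoids acting on an application before it is actually readable. The sketch below is a minimal illustration only: the base URL, token, endpoint paths and response shape are hypothetical placeholders, not the actual Clever Cloud API.

```python
import time
import requests

API = "https://api.example.com/v2"   # hypothetical base URL, not the real endpoint
TOKEN = "..."                        # placeholder credential

def create_app_and_wait(payload, retries=5, delay=2.0):
    """Create an application, then poll until it is readable.

    Guards against the window where creation returns 200 OK but a
    following GET for the same application still returns 404 Not Found.
    """
    headers = {"Authorization": f"Bearer {TOKEN}"}
    resp = requests.post(f"{API}/applications", json=payload, headers=headers)
    resp.raise_for_status()
    app_id = resp.json()["id"]  # assumed response shape

    for _ in range(retries):
        check = requests.get(f"{API}/applications/{app_id}", headers=headers)
        if check.status_code == 200:
            return check.json()
        time.sleep(delay)  # back off while the creation propagates
    raise RuntimeError(f"application {app_id} still not visible after {retries} checks")
```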
(All times UTC)
- At 20:10, one of the 4 reverse proxies on the PAR zone stopped responding to some requests. No internal metrics changed and no unusual logs were written; the requests would simply time out. The other three were still running, so request errors appeared random.
- At 20:25, it stopped responding entirely.
- At 20:40, our external monitoring tool alerted us. We investigated, found which reverse proxy had failed, and restarted it.
- At 20:43, the reverse proxy was restarted and traffic flowed normally.
(All times in UTC)
16:30: We started seeing alerts about high load on the primary node. 17:00: We started getting reports about the cluster being unreachable. 18:00: After checking the cluster, we decided to restart the primary node.
Data may have been lost as the node was not writing / replicating correctly. We are still waiting for the primary node to restart. The secondary does not seem to elect itself as primary.
19:30: The secondary finally got promoted to primary. We are blocking users making unfair use of the cluster. 22:45: We detected that the node we restarted had failed to rejoin the cluster. We decided to remove it entirely and re-create it from scratch. 2023-03-13 10:00: The node has fully reached the "SECONDARY" state. We put it back into production.
Measures have been taken to prevent future unfair use by users.
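For readers wondering how the PRIMARY/SECONDARY states mentioned above can be observed: the replica set status is exposed by MongoDB's replSetGetStatus command. A minimal sketch using pymongo follows; the connection string is a placeholder and must point at an actual replica set member.

```python
from pymongo import MongoClient

# Placeholder connection string; in practice this points at the replica set.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# replSetGetStatus reports each member's state (PRIMARY, SECONDARY, RECOVERING, ...).
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])
```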
(All times in UTC)
11:30: Our main API keeps becoming unresponsive. We are investigating. This impacts the following, in an irregular fashion:
- clever ssh may not succeed
- Some deployments may not go through
Applications should keep running, but some monitoring deployments may fail.
12:55: The API seems to have stabilized. The database seems to have been under heavy load. We are investigating the queries responsible for that load and trying to improve them.
We are currently investigating network issues on our Paris zone.
EDIT 17:15 UTC: The issue is now resolved. A part of our infrastructure in Paris couldn't access some public DNS servers anymore, leading to multiple DNS queries failing. An upstream network provider made a change that fixed the problem around 16:52 UTC.
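As an illustration of the kind of failure involved, the sketch below checks whether a given public resolver still answers queries. It assumes the dnspython package; the resolver addresses and test name are examples only.

```python
import dns.resolver  # from the dnspython package

def resolver_answers(server_ip, name="example.com"):
    """Return True if the given DNS server answers an A query for `name`."""
    resolver = dns.resolver.Resolver(configure=False)  # ignore the system resolver config
    resolver.nameservers = [server_ip]
    resolver.lifetime = 3  # seconds before giving up
    try:
        resolver.resolve(name, "A")
        return True
    except Exception:
        return False

# Example: probe a couple of well-known public resolvers.
for ip in ("8.8.8.8", "1.1.1.1"):
    print(ip, "OK" if resolver_answers(ip) else "UNREACHABLE")
```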