Incidents
Full history of incidents.
September 2023
On Monday 2023-09-11 around 20:00 UTC, our main API (api.clever-cloud.com) will be unavailable. The CLI and Console will be impacted and may display errors for some requests. Deployments will also be impacted and won't be available either through the Console/CLI or using git.
The maintenance window is one hour, but the actual downtime is expected to last a few minutes at most.
EDIT 20:00 UTC: The maintenance is starting.
EDIT 20:02 UTC: The API is now unavailable as well as the Console.
EDIT 20:16 UTC: One of the steps took a bit more time than expected, we are back on track.
EDIT 20:44 UTC: Unexpected problems occurred and we are currently doing a rollback of the changes.
EDIT 20:54 UTC: The maintenance is over, the changes were rolled back and everything should now be operational again.
We are currently experiencing unreachable hypervisors on the JED region. We are investigating the issue.
EDIT 18:50 UTC: The hypervisors have been back online for 25 minutes now; all services were restarted by our monitoring.
A hypervisor has crashed; we are currently investigating the root cause.
EDIT 18:45 UTC: The hypervisor had a kernel panic. During the reboot, the kernel was upgraded and this issue should not occur again.
We've seen network instabilities on the PAR region. It is currently resolved but we are still investigating the root cause.
EDIT 06:34 UTC: The problem is back with elevated packet loss. Our network provider is currently having an incident and is looking into the issue.
EDIT 06:46 UTC: Some DNS domains for services hosted on other regions may also fail to resolve because their authoritative server is currently hosted on the PAR region.
EDIT 06:55 UTC: The incident is still ongoing and our network provider is still looking into the issue.
EDIT 07:20 UTC: Our upstream network provider is currently experiencing a DDoS attack. We are looking into using an alternative network transit to bypass it.
EDIT 07:47 UTC: We have been seeing improvements for the last 20 minutes. We are still waiting for confirmation that the issue is resolved.
EDIT 07:58 UTC: We are seeing some loss again.
EDIT 08:15 UTC: The DDoS is still happening. It's partially mitigated. We still see some loss, but there is less impact globally.
EDIT 10:54 UTC: We still see loss from time to time, but much less than before. We are keeping an eye on the situation.
EDIT 15:45 UTC: Most of the DDoS is mitigated; we haven't seen any loss in the past few hours. We are still monitoring the situation.
EDIT 2023-09-06 15:24 UTC: No instabilities have been detected since yesterday. The incident is now over.
When ordering a new database, it can take some time before it is reachable (the databases themselves are correctly created).
EDIT 21:38 UTC: The root cause was identified and a patch deployed.
Some websites became unreachable to external monitoring. This indicates the reverse proxies are not accepting connections as they should.
Metrics on the proxies seem OK. We are investigating why they are behaving this way.
It seems some applications were causing connections to queue up, blocking new connections. We are looking into ways to prevent this from happening.
The issue is resolved.
August 2023
We are currently investigating an elevated rate of TLS errors/timeouts on our Paris reverse proxies serving applications domains.
EDIT 14:53 UTC: We are seeing signs of improvements since 14:50. We continue monitoring the situation.
EDIT 15:23 UTC: We confirm that the issue has been resolved since 14:50. Sorry for the inconvenience this incident may have caused.
The logs infrastructure is currently not ingesting new logs. The root cause has been identified. In the meantime, if you need to access your logs, you can SSH to your application: https://www.clever-cloud.com/doc/reference/clever-tools/ssh-access/#show-your-applications-logs
EDIT 11:00 UTC: The problem is now resolved. Some logs may have been lost during that period. We apologize for the inconvenience.
Between 20:43 UTC and 20:48 UTC, an add-on reverse proxy of the PAR region was unreachable. Some applications may have had errors connecting to their add-ons during that time if they didn't automatically switch to another working proxy. The issue has been fixed.
Between 20:25 and 20:31 UTC, an FSBucket server was unreachable on the SCW region. The issue has been fixed; applications using an FSBucket add-on may have had I/O issues (read/write timeouts or hangs) during that time. Applications will reconnect to the FSBucket add-on automatically.
An issue is preventing our load balancing system from loading TLS certificates. We are investigating the issue.
EDIT 18:00 UTC: The issue is resolved.
We are currently experiencing issues with deployments.
Some applications may have been redeployed multiple times with the Monitoring/Unreachable reason. Most of those deployments were false positives. Other applications may currently have troubles deploying.
We are working on restoring the service.
EDIT 15:49 UTC: The underlying issue has been found and fixed. Some deployments may have failed even when there was no reason for them to fail. You can start them again if needed. If you still have deployment issues, feel free to reach out to our support team.
Between 13:51 UTC and 14:12 UTC, some requests may have failed to establish a TLS connection to the Cellar service.
The issue has been identified and has been fixed.
We have detected that the IP 212.129.27.183 is unreachable. We have identified the root cause and are waiting for feedback from the Scaleway cloud provider.
EDIT 12:39 UTC: The IP address is reachable again.
The storage layer for metrics and access logs has lost some data nodes. We are fixing the issue.
EDIT 09:18 UTC: We are recovering from the incident and catching up on the ingestion lag. The storage layer is now operational.
July 2023
We are detecting some errors on our reverse proxies, your apps may not be reachable. We are working on it.
EDIT 22:48 UTC: All reverse proxies are now working properly.
An issue with the control plane triggered some issues when ordering or migrating MySQL add-ons.
EDIT 12:50 UTC: The control plane has recovered; everything is now OK.
We are currently looking into an issue regarding application deployments. They may start but never complete.
EDIT 15:05 UTC: The issue appears to be limited to the Paris zone
EDIT 15:20 UTC: A countermeasure has been deployed to mitigate the issue. Deployments are now scheduled as expected. Some errors may still appear in your logs. We're processing stuck deployments, but you may cancel or start a new one if you want to prioritize your deployment.
We asked our provider to transfer the domain name cleverapps.io. The transfer completed at 12:30 UTC, and we saw that some records are missing or have the wrong value.
EDIT 15:00 UTC: We found that the NS and SOA records were incorrect and have updated them. EDIT 16:00 UTC: Everything is back to normal.
Following yesterday's deployment, we had issues with HTTP and TCP redirections, causing infinite loops and timeouts. We are investigating the issue.
EDIT 09:00 UTC: The issue was found and fixed.