Incident History
Full history of incidents.
October 2020
Metrics and access log requests may experience issues following maintenance on a core component of those features. Requests can either take a very long time to complete or simply return an error. We are working toward a fix.
No data will be lost; ingestion is simply delayed.
Impacted products:
- Metrics (in console or using the API)
- Access logs (charts in the console's overview or using the CLI / API)
EDIT 14:03 UTC: Ingestion is now catching up on the delay, everything looks good. Looks like it may take 30 to 40 minutes to go completely back to normal.
EDIT 14:25 UTC: Ingestion has now caught up, everything should be back to normal.
EDIT 21:26 UTC: New issues are ongoing, we are investigating.
EDIT 22:16 UTC: Ingestion is running. We are consuming queues.
EDIT 23:30 UTC: Ingestion is back to normal. Fixed.
Loading the console might result in various errors preventing users from logging in. We are currently investigating. The CLI shouldn't be impacted. Already loaded console webpages shouldn't be impacted either.
EDIT 13:02 UTC: A change causing this issue has been backed out. We will investigate further why it went wrong despite working correctly on our test infrastructure. Sorry for the disruption.
The Redsmin dashboard for Redis add-ons is currently unavailable. The Redsmin provider has been notified. We will update this post as soon as we have more information.
EDIT 10:54 UTC: Redsmin is currently working on a fix.
EDIT 19:54 UTC: The fix seems to be complete. Redsmin interfaces should now be able to load.
September 2020
There was an issue with the collection of new logs between 16:45 and 17:00 UTC. Some of these logs may have taken more time than usual to be processed. No logs have been lost.
07:27 UTC: The leader of the free shared PostgreSQL cluster crashed due to a disk issue.
10:00 UTC: The team sees the issue.
10:16 UTC: The team promotes the follower as leader.
10:50 UTC: All applications using databases on that cluster are redeployed.
We are currently looking into a login issue. Once you submit the form, the login process resets, preventing you from reaching the requested resource (console / CLI / other).
For any support queries, you can send us an email at support@clever-cloud.com
EDIT 14:26 UTC: The issue has been found and should now be fixed. We will investigate it further to prevent it from happening again.
August 2020
Our old Cellar cluster (cellar.services.clever-cloud.com) which still has some data nodes on Scaleway is currently unreachable due to networking issues on Scaleway's side: https://status.scaleway.com/incident/956
We are monitoring the situation. Our new Cellar cluster (cellar-c2.services.clever-cloud.com) is still reachable and works fine.
EDIT 12:02 UTC: A reverse proxy node is somehow still able to communicate with the nodes on Scaleway. All cellar-c1 traffic has been routed through that reverse proxy and requests should be served as expected.
EDIT 12:34 UTC: The network issue does not seem to be on Scaleway's side per se but rather on the side of Level3/CenturyLink, a more global networking provider.
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. The cluster is now fully reachable.
Postgresql-c1 which is an old PostgreSQL cluster still hosted on Scaleway may currently be unreachable due to some Level3/CenturyLink networking issues. Scaleway has an incident opened here: https://status.scaleway.com/incident/956
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. The cluster is now fully reachable.
Due to an outage of the Level3/CenturyLink networking provider, you might experience issues:
- reaching our services: if your ISP uses this provider, you might experience timeouts reaching our infrastructure
- reaching external services from our infrastructure: if you contact external services from our infrastructure, the peering routes might use this network provider and your requests might time out too.
This incident groups the previously opened incidents:
- https://www.clevercloudstatus.com/incident/294
- https://www.clevercloudstatus.com/incident/295
We do not have an ETA for the service to come back to normal.
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. All connections, either incoming to or outgoing from our services, should be working as expected. Please reach out to our support if not.
A hypervisor went down (electrically shut off) unexpectedly.
This was caused by a human error, partly related to a laggy UI (low-level UI of a server manager used for a group of servers).
The person who triggered this realized the issue immediately and restarted the server, which stopped responding to our monitoring for a total of 3 minutes.
Chronology:
14:01:30 UTC: The server goes down
14:04:30 UTC: The server responds to our monitoring again and starts restarting static VMs (add-ons and custom services)
14:07:05 UTC: The last static VM starts responding to our monitoring again.
Impact:
Customers with add-ons on this server will find connection errors in their application logs during those 3 to 6 minutes and those applications most likely responded with errors to end users during that time.
Customers with applications with a single instance which happened to be on that server will have experienced about 2 to 3 minutes of downtime before a new instance started responding on another server.
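The durations quoted above can be cross-checked from the chronology. The following is a minimal sketch that only redoes the arithmetic on the three published timestamps; the variable and function names are illustrative and are not part of any Clever Cloud tooling.

```python
from datetime import datetime

# Timestamps copied from the chronology (all UTC, same day).
FMT = "%H:%M:%S"
server_down = datetime.strptime("14:01:30", FMT)
server_back = datetime.strptime("14:04:30", FMT)
last_vm_back = datetime.strptime("14:07:05", FMT)


def format_duration(delta):
    """Render a timedelta as 'M min SS s'."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes} min {seconds:02d} s"


# Time during which the server did not respond to monitoring.
monitoring_gap = server_back - server_down

# Longest possible outage for an add-on: from shutdown until the
# last static VM responded again.
addon_worst_case = last_vm_back - server_down

print("Server unreachable:", format_duration(monitoring_gap))
print("Worst-case add-on outage:", format_duration(addon_worst_case))
```

The first value matches the 3 minutes during which the server was unreachable, and the second is consistent with the upper end of the "3 to 6 minutes" window given for add-ons.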
New access logs are currently not processed. They are currently kept until they can be processed again. Access logs emitted before this incident are fine.
This impacts:
- Fetching access logs using the CLI or the API
- Live request map in the console
- Total number of requests / status codes in the console (those are still available to display but the total of requests will be wrong in a few hours as access logs emitted won't be taken into account)
The issue has been identified and we are working toward a fix.
EDIT 14:07 UTC: The problem has been solved and the access logs stored have been processed. You should now be able to have an up-to-date livemap and fetch recent access logs using the CLI / API. Request count will be affected and won't be computed for the time window the access logs were not processed.
The Metrics platform is unavailable at the moment. We are investigating the source of the issue.
14:30 UTC: It looks like an issue with the storage backend, we are working on bringing it back to life.
14:52 UTC: The storage backend looks fine but writes are still failing. We are still investigating this issue. It may take a while.
15:11 UTC: Again, the storage backend looked perfectly fine... restarting everything did fix the issue though so then again maybe it wasn't fine after all. Writes are functional, ingestion is working at full speed, fresh data will be available in ~20 minutes.
15:30 UTC: Ingestion delay is back to normal. Incident is over.
SSH connections may have failed randomly over the past few hours. The root cause has been found and better monitoring will be put in place. Instances should have automatically reconnected after a restart of one of the main components. If they didn't, you can try to restart your application. If you absolutely need to SSH to your application to debug something before restarting it, contact our support.
July 2020
The Metrics platform is unavailable at the moment due to an issue with the storage backend. We are investigating.
16:23 UTC: Some storage nodes were misbehaving. The issue is now fixed: reads are functional again and ingestion is now catching up.
16:28 UTC: Ingestion delay has been divided by two, incident should be over in under 10 minutes.
16:28 UTC: Incident is over.
Access logs are currently unavailable to query. Activity maps in the console may show outdated data. Querying old access logs works fine, only newer logs aren't processed for now.
EDIT 13:27 UTC: All access logs should have been available again since 12:50 UTC. The root cause has been identified and will be addressed. Some logs may have been lost during that timeframe.
An add-on reverse proxy stopped responding properly at 18:53 UTC. Due to issues during the restart process, it was finally accessible again at 19:05 UTC.
During this incident, you may have seen random issues while opening new connections to your databases.
It seems that, following a Cloudflare DNS resolution outage, users may experience issues resolving our domains.
The incident on Cloudflare side: https://www.cloudflarestatus.com/incidents/b888fyhbygb8
EDIT 21:35 UTC: The DNS resolution seems to be back again, our services are currently reachable from our point of view. It may vary depending on your location.
EDIT 22:51 UTC: Cloudflare implemented a fix and we have not seen any new issues since then. This incident is now closed.
The access logs map and analytics have not been working since 13:45 UTC. We are working on a fix.
EDIT 16:30 UTC: fixed.
For 3 minutes, the MongoDB shared cluster experienced high load, which prevented new user connections. We are upscaling it.
EDIT 14:32 UTC: The cluster has been upscaled.
Some customers are experiencing issues resolving our clever-cloud.com domains and their own domains pointing to domain..clever-cloud.com.
At the moment, we know the problem is affecting customers of the French ISP Orange.
08:28 UTC: We found that Orange NS servers were indeed still using the faulty NS records from last night's incident. We have updated the zone on those name servers which should have never been used in the first place and hopefully Orange customers will be able to resolve our domains (and by extension their domains) properly.
08:42 UTC: Looks like the propagation is quite fast and this indeed fixed the issue for affected customers.