Incidents
Full history of incidents.
October 2020
The Redsmin dashboard for Redis add-ons is currently unavailable. The Redsmin provider has been notified. We will update this post as soon as we have more information.
EDIT 10:54 UTC: Redsmin is currently working on a fix.
EDIT 19:54 UTC: The fix seems to be complete. Redsmin interfaces should now be able to load.
September 2020
There was an issue with new log collection between 16:45 and 17:00 UTC. Some of these logs may have taken longer than usual to be processed. No logs were lost.
07:27 UTC: The leader of the free shared PostgreSQL cluster crashed due to a disk issue.
10:00 UTC: The team notices the issue.
10:16 UTC: The team promotes the follower as the new leader.
10:50 UTC: All applications using databases on that cluster are redeployed.
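For context on the promotion step: on a streaming-replication pair, failover amounts to promoting the standby to leader. Here is a minimal sketch, assuming PostgreSQL 12+ and the psycopg2 driver; the host and credentials are placeholders, not our internal tooling:

```python
# Minimal manual-failover sketch, assuming PostgreSQL 12+ streaming
# replication and the psycopg2 driver. Hostname and credentials below
# are placeholders for illustration only.
import psycopg2

follower = psycopg2.connect(host="follower.example.internal",
                            dbname="postgres", user="admin")
follower.autocommit = True

with follower.cursor() as cur:
    # True while the node is replaying WAL as a standby.
    cur.execute("SELECT pg_is_in_recovery();")
    if cur.fetchone()[0]:
        # Promote the standby to leader; waits until promotion completes.
        cur.execute("SELECT pg_promote(wait => true);")
```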
We are currently looking into a login issue. After you validate the form, the login process resets, preventing you from proceeding to the requested resource (console / CLI / other).
For any support queries, you can send us an email at support@clever-cloud.com
EDIT 14:26 UTC: The issue has been found and should now be fixed. We will investigate it further to prevent it from happening again.
August 2020
Our old Cellar cluster (cellar.services.clever-cloud.com), which still has some data nodes on Scaleway, is currently unreachable due to networking issues on Scaleway's side: https://status.scaleway.com/incident/956
We are monitoring the situation. Our new Cellar cluster (cellar-c2.services.clever-cloud.com) is still reachable and works fine.
EDIT 12:02 UTC: A reverse proxy node is somehow still able to communicate with the nodes on Scaleway. All cellar-c1 traffic has been routed through that reverse proxy and requests should be served as expected.
EDIT 12:34 UTC: The network issue seems not to be on Scaleway's side per se, but rather on the Level3/CenturyLink side, a more global networking provider.
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. The cluster is now fully reachable.
Postgresql-c1, an old PostgreSQL cluster still hosted on Scaleway, may currently be unreachable due to Level3/CenturyLink networking issues. Scaleway has an open incident here: https://status.scaleway.com/incident/956
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. The cluster is now fully reachable.
Due to an outage of the Level3/CenturyLink networking provider, you might experience issues:
- reaching our services: if your ISP uses this provider, you might experience timeouts reaching our infrastructure
- reaching external services from our infrastructure: if you contact external services from our infrastructure, the peering routes might use this network provider and your requests might time out too.
This incident groups the previously opened incidents:
- https://www.clevercloudstatus.com/incident/294
- https://www.clevercloudstatus.com/incident/295
We do not have an ETA for the service to come back to normal.
EDIT 15:17 UTC: The incident on Level3/CenturyLink seems to be resolved. All incoming and outgoing connections to and from our services should be working as expected. Please reach out to our support if not.
A hypervisor went down (electrically shut off) unexpectedly.
This was caused by human error, partly related to a laggy UI (the low-level UI of a server manager used for a group of servers).
The person who triggered this realized the issue immediately and restarted the server, which had stopped responding to our monitoring for a total of 3 minutes.
Chronology:
14:01:30 UTC: The server goes down
14:04:30 UTC: The server responds to our monitoring again and starts restarting static VMs (add-ons and custom services)
14:07:05 UTC: The last static VM starts answering to our monitoring again.
Impact:
Customers with add-ons on this server will find connection errors in their application logs for those 3 to 6 minutes, and their applications most likely returned errors to end users during that time.
Customers running single-instance applications that happened to be on that server experienced about 2 to 3 minutes of downtime before a new instance started responding on another server.
New access logs are currently not being processed. They are kept until processing can resume. Access logs emitted before this incident are fine.
This impacts:
- Fetching access logs using the CLI or the API
- Live request map in the console
- Total number of requests / status codes in the console (these are still displayed, but the request totals will become wrong over the next few hours as newly emitted access logs won't be taken into account)
The issue has been identified and we are working toward a fix.
EDIT 14:07 UTC: The problem has been solved and the access logs stored have been processed. You should now be able to have an up-to-date livemap and fetch recent access logs using the CLI / API. Request count will be affected and won't be computed for the time window the access logs were not processed.
The Metrics platform is unavailable at the moment. We are investigating the source of the issue.
14:30 UTC: It looks like an issue with the storage backend; we are working on bringing it back to life.
14:52 UTC: The storage backend looks fine but writes are still failing. We are still investigating this issue. It may take a while.
15:11 UTC: Again, the storage backend looked perfectly fine... but restarting everything did fix the issue, so maybe it wasn't fine after all. Writes are functional, ingestion is working at full speed, and fresh data will be available in ~20 minutes.
15:30 UTC: Ingestion delay is back to normal. Incident is over.
SSH connections may have failed randomly over the past few hours. The root cause has been found and better monitoring will be put in place. Instances should have reconnected automatically after a restart of one of the main components. If yours didn't, you can try restarting your application. If you absolutely need to SSH into your application to debug something before restarting it, ping us on the support.
July 2020
The Metrics platform is unavailable at the moment due to an issue with the storage backend. We are investigating.
16:23 UTC: Some storage nodes were misbehaving. The issue is now fixed: reads are functional again and ingestion is now catching up.
16:28 UTC: Ingestion delay has been halved; the incident should be over in under 10 minutes.
16:28 UTC: Incident is over.
Access logs are currently unavailable to query. Activity maps in the console may show outdated data. Querying old access logs works fine, only newer logs aren't processed for now.
EDIT 13:27 UTC: all access logs should be available again since 12:50 UTC. The root cause has been identified and will be addressed. Some logs may have been lost during that timeframe.
An add-on reverse proxy stopped responding properly at 18:53 UTC. Due to issues during the restart process, it was finally accessible again at 19:05 UTC.
During this incident, you may have seen random issues while opening new connections to your databases.
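For applications hit by this kind of transient failure, a retry with exponential backoff around connection setup usually bridges a short reverse-proxy restart. A minimal, driver-agnostic sketch; `open_connection` is a placeholder for your driver's connect call:

```python
# Generic retry-with-backoff around opening a database connection.
# `open_connection` is a placeholder for your driver's connect call.
import time

def connect_with_retry(open_connection, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return open_connection()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff: 0.5s, 1s, 2s, ...
            time.sleep(base_delay * (2 ** attempt))
```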
It seems that, following a Cloudflare outage affecting DNS resolution, users may experience issues resolving our domains.
The incident on Cloudflare's side: https://www.cloudflarestatus.com/incidents/b888fyhbygb8
EDIT 21:35 UTC: DNS resolution seems to be working again; our services are currently reachable from our point of view. This may vary depending on your location.
EDIT 22:51 UTC: Cloudflare implemented a fix and we did not see any new issue since then. This incident is now closed.
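A quick way to tell a resolver-side failure apart from a service outage is to test resolution locally. A minimal sketch using only the Python standard library; the domain below is just an example:

```python
# Quick local check: can this machine resolve a domain at all?
# Uses only the standard library; the domain below is an example.
import socket

try:
    infos = socket.getaddrinfo("api.clever-cloud.com", 443)
    print("resolved to:", sorted({info[4][0] for info in infos}))
except socket.gaierror as err:
    print("resolution failed:", err)  # likely a resolver-side issue
```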
The access logs map and analytics have not been working since 13:45 UTC. We are working on a fix.
EDIT 16:30 UTC: fixed.
For 3 minutes, the shared MongoDB cluster experienced high load, which prevented new user connections. We are upscaling it.
EDIT 14:32 UTC: the cluster has been upscaled.
Some customers are experiencing issues resolving our clever-cloud.com domains and their own domains pointing to domain..clever-cloud.com.
At the moment, we know the problem is affecting customers of the French ISP Orange.
08:28 UTC: We found that Orange NS servers were indeed still using the faulty NS records from last night's incident. We have updated the zone on those name servers which should have never been used in the first place and hopefully Orange customers will be able to resolve our domains (and by extension their domains) properly.
08:42 UTC: Looks like the propagation is quite fast and this indeed fixed the issue for affected customers.
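For this class of problem, comparing the NS records served by a specific resolver with those served by another public resolver shows whether stale records are still being handed out. A sketch assuming the third-party dnspython package; the resolver IP is just an example:

```python
# Compare the NS records returned by a specific resolver against the
# system default, to spot a resolver serving stale records. Assumes
# the third-party dnspython package; the resolver IP is an example.
import dns.resolver

def ns_records(domain, nameserver=None):
    resolver = dns.resolver.Resolver()
    if nameserver:
        resolver.nameservers = [nameserver]
    return sorted(r.target.to_text() for r in resolver.resolve(domain, "NS"))

print("default resolver:", ns_records("clever-cloud.com"))
print("via 9.9.9.9:", ns_records("clever-cloud.com", "9.9.9.9"))
```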
Our domain provider briefly gave out an empty DNS zone file after a configuration change.
EDIT 20:04 UTC - fixed.
June 2020
The MongoDB shared cluster hosting free MongoDB databases has a higher load than usual. The load started going up at 15:25 UTC, slowly reaching the point, ~30 minutes ago, where the cluster could not serve most requests as expected. Requests are also expected to have failed since then because of timeouts or aborted connections.
The service has been restarted and we will monitor it closely, as well as add monitoring to better catch this ramp-up.
Dedicated databases are not impacted by this issue. If you are impacted, you can migrate your free plan to a dedicated plan using the migration feature. You can find it in the "Migration" menu of your add-on.
22:41 UTC: Load seems back to its normal state. The monitoring has been adjusted, so we should now receive an alert at the start of such an event.
23:19 UTC: The issue is back, the load is not as high as before but it might make the cluster slow.
23:54 UTC: Users impacting the cluster the most have been contacted to avoid this issue. Further actions will be taken later today if the issue persists.
2020-07-01 06:25 UTC: The node crashed after hitting a fatal assertion and restarted.
06:38 UTC: The node is still unreachable for an unknown reason
07:48 UTC: The cluster is currently being repaired. For an unknown reason, nodes wouldn't listen on their network interfaces.
09:32 UTC: The repair is halfway through. The cluster should be up again in ~1h30.
11:50 UTC: The repair is done and the node successfully restarted. You should now be able to connect to the cluster. We are now restarting the follower node so it can rejoin the cluster.
15:09 UTC: The leader node crashed again because of an assertion failure, which means it is unreachable again while MongoDB reads its entire journal and rebuilds the indexes.
15:30 UTC: It usually takes 1h30 for MongoDB to read the whole journal, so it should be up again around 16:20 UTC.
16:34 UTC: It is taking longer than usual.
20:06 UTC: The restarts weren't successful. The secondary node successfully started at some point but was shut down to avoid any issue with the primary one. We'll try starting it again.
2020-07-02 09:15 UTC: The first node has been accessible now and again but keeps on crashing due to user activity. The second node failed to sync to the first node so it cannot be used as primary right now. We are now trying to bring the first node back up without making it accessible to users so we can at least get backups of every database. Once this is done, we will update you on the next steps. This process will take a while as Mongo takes hours (literally) to come up after a crash.
12:00 UTC: The first node is finally back up (but incoming connections are shut off for now). We are now taking backups of all databases, you should see a new backup appear in your dashboard in the coming minutes / hours. Once this is done, we will start working on bringing the second node back in sync. Once the cluster is healthy, we will bring it back online.
14:30 UTC: Backups are over, customers who were using the free shared plan in production can create a new paid dedicated add-on and import the latest backup there. Meanwhile, we are now rebuilding the second node from the first one to make the cluster healthy again. Once it's over, we will bring the service back up (if everything goes well).
15:55 UTC: The second node is synced up and the service is available again. We are still monitoring things closely.
18:35 UTC: The service is working smoothly, no issues or anomalies to report.
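For those running their own replica sets, overall member state can be checked from the driver. A minimal sketch assuming the pymongo driver and a user with monitoring privileges; the connection string is a placeholder, not our internal tooling:

```python
# Minimal replica-set health check, assuming the pymongo driver and a
# user with clusterMonitor-level privileges. The connection string is
# a placeholder for illustration only.
from pymongo import MongoClient

client = MongoClient("mongodb://user:pass@host1,host2/?replicaSet=rs0")
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    # stateStr is e.g. PRIMARY, SECONDARY, RECOVERING, STARTUP2...
    print(member["name"], member["stateStr"], "health:", member["health"])
```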