Clever Cloud Status

Incident History

Full history of incidents.

Newest first

July 2020

Fixed · Reverse Proxies · Global

Our domain provider briefly served an empty DNS zone file after a configuration change.

EDIT 20:04 UTC - fixed.

June 2020

Fixed · MongoDB shared cluster · Global

The MongoDB shared cluster hosting free MongoDB databases has a higher load than usual. The load started going up at 15:25 UTC, slowly reaching the point, ~30 minutes ago, where the cluster could not serve most requests as expected. Requests have likely also failed since then because of timeouts or aborted connections.

The service has been restarted and we will monitor it closely. We will also add monitoring to better catch this ramp-up.

Dedicated databases are not impacted by this issue. If you are impacted, you can migrate your free plan to a dedicated plan using the migration feature. You can find it in the "Migration" menu of your add-on.

22:41 UTC: Load seems back to its normal state. The monitoring has been adjusted and we should then receive an alert at the start of the event instead.

23:19 UTC: The issue is back, the load is not as high as before but it might make the cluster slow.

23:54 UTC: Users impacting the cluster the most have been contacted to avoid this issue. Further actions will be taken later today if the issue persists.

2020-07-01 06:25 UTC: The node crashed due to a fatal assertion hit and restarted

06:38 UTC: The node is still unreachable for an unknown reason

07:48 UTC: The cluster is currently being repaired. For an unknown reason, nodes wouldn't listen to their network interfaces.

09:32 UTC: The repair is halfway through. The cluster might be able to be up again in ~1h30

11:50 UTC: The repair is done and the node successfully restarted. You should now be able to connect to the cluster. We are now restarting the follower node so it can rejoin the cluster.

15:09 UTC: The leader node crashed again because of an assertion failure. It is now unreachable again while MongoDB reads its entire journal and rebuilds the indexes.

15:30 UTC: It usually takes 1h30 for mongodb to read the whole journal so it should be up again around 16:20 UTC.

16:34 UTC: It is taking longer than usual.

20:06 UTC: The restarts weren't successful. The secondary node successfully started at some point but was shut down to avoid any issue with the primary one. We'll try starting it again.

2020-07-02 09:15 UTC: The first node has been accessible now and again but keeps on crashing due to user activity. The second node failed to sync to the first node so it cannot be used as primary right now. We are now trying to bring the first node back up without making it accessible to users so we can at least get backups of every database. Once this is done, we will update you on the next steps. This process will take a while as Mongo takes hours (literally) to come up after a crash.

12:00 UTC: The first node is finally back up (but incoming connections are shut off for now). We are now taking backups of all databases, you should see a new backup appear in your dashboard in the coming minutes / hours. Once this is done, we will start working on bringing the second node back in sync. Once the cluster is healthy, we will bring it back online.

14:30 UTC: Backups are over, customers who were using the free shared plan in production can create a new paid dedicated add-on and import the latest backup there. Meanwhile, we are now rebuilding the second node from the first one to make the cluster healthy again. Once it's over, we will bring the service back up (if everything goes well).

15:55 UTC: The second node is synced up and the service is available again. We are still monitoring things closely.

18:35 UTC: The service is working smoothly, no issues or anomalies to report.

Fixed · Access Logs · Global

Metrics ingestion is delayed, we are investigating.

08:37 UTC: Ingestion delay is back to normal. Incident was caused by a few storage nodes misbehaving after a short network issue.

Fixed · API · Global

Our Core API is experiencing issues that impact deployments, we are working on it.

EDIT 14:50 UTC: situation is back to normal.

Fixed · FS Buckets · Global

Due to an excessive amount of temporary files created by various PHP sessions in a very short time (+6GB in two minutes), the underlying fs-bucket for PHP sessions became full. You may have noticed various errors from 'write failed: No space left on device' to some more "random" errors that were caused by this.

Applications using Redis as a session backend were not impacted by the session issue. They may still have been impacted if they generate temporary files, which are stored on the same fs-bucket.

Our clean-up policy for temporary files was not aggressive enough. We will shorten the clean-up interval to once a day and continue to monitor whether we need to increase the current disk space.

This incident started at 11:33 UTC+2 and was fully resolved at 11:41 UTC+2
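As an illustration of the kind of age-based clean-up mentioned above, here is a minimal Python sketch. It is not the clean-up job actually running on the platform: the function name `remove_stale_files`, the directory layout and the age threshold are all assumptions made for the example.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_files(directory: str, max_age_seconds: float,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files directly inside `directory` whose last
    modification is older than `max_age_seconds`.

    Hypothetical helper for illustration only. Returns the sorted names
    of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in Path(directory).iterdir():
        # Only plain files are considered; sub-directories are left alone.
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

Run from a scheduler, a shorter interval or a smaller `max_age_seconds` bounds how much a sudden burst of temporary files can accumulate between two passes.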

Fixed · Cellar · Global

We are investigating issues on Cellar add-ons. We are experiencing network issues.

EDIT 15:20 UTC: fixed.

Fixed · Reverse Proxies · Global

In our MTL zone, reverse proxies are currently not able to update themselves when a configuration change happens (application deployment, added domain, ...). We are looking into it.

19:00 UTC: If your application is deployed during this incident, the new version will not be served. Visitors will continue to see the old content, and the old instance will be kept until this incident is over.

19:12 UTC: The issue has been identified, we are fixing it.

19:20 UTC: The issue was caused by the configuration checker, which took far more time than usual before applying each configuration change. A configuration option that disables those checks inside the program handling the configuration has been enabled. The configuration is still checked by the reverse proxy itself, which is much faster.

Deployments should now be up-to-date.

Fixed · Infrastructure · Global

It seems a global network outage happened for 1 minute, leading to possible loss of connection to most of our services. It seems to be back for now but we are investigating and we will provide further information.

15:22 UTC: We continue to investigate what has been impacted. Deployments are currently disabled while we recover from the event.

15:27 UTC: Deployments are now available.

15:56 UTC: The situation on the platform has stabilized. It seems the outage was between our two datacenters in the Paris zone. We are asking our hosting provider for more details.

16:05 UTC: Our network provider came back to us. The network outage lasted for 1 minute and 20 seconds. One of the links was lost between those two datacenters. The backup link should have been up 2 seconds after the loss of the first link. But for some reason it did not switch (or not correctly). After a 1 minute timeout, all links were closed and reset leading to a new link election which takes ~20 seconds. From there, the connection has been restored. Our network provider will continue to investigate why the initial backup link did not switch.
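The reported duration can be checked against the figures given by the provider. The short Python sketch below only redoes that arithmetic; the constant and function names are illustrative and do not describe the provider's equipment or configuration.

```python
from datetime import timedelta

# Figures quoted in the provider's explanation above (names are illustrative).
EXPECTED_FAILOVER = timedelta(seconds=2)   # backup link should take over
LINK_TIMEOUT = timedelta(minutes=1)        # all links closed and reset after this
LINK_ELECTION = timedelta(seconds=20)      # approximate duration of a new link election


def outage_duration(backup_switched: bool) -> timedelta:
    """Return the expected loss of connectivity for each scenario."""
    if backup_switched:
        return EXPECTED_FAILOVER
    # Backup link did not take over: wait for the timeout, then re-elect.
    return LINK_TIMEOUT + LINK_ELECTION


print(outage_duration(backup_switched=False))  # 0:01:20
```

With a working backup link the interruption would have been about 2 seconds instead of 1 minute and 20 seconds.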

Once the network started working again, our monitoring was able to check what was "down". The services that were down were restarted, but this should not have affected access to your applications (they were mostly internal services). Add-on connections from applications should have been restored at the same time. If your application crashed because it could not reach its add-on, it should have been automatically redeployed once the deployment system was up again, which should have been shortly before 15:27 UTC.

We are sorry for the inconvenience this outage created. The time of this incident has been changed from 15:06 UTC to 15:04 UTC to correctly match the date and hours.

Fixed · Access Logs · Global

Metrics are currently unavailable.

An index node has been restarted to upscale it. Its replica did not like the surge of requests and decided to crash a few seconds later. We are currently in the process of upscaling all index nodes to avoid such issues, those 2 nodes were the last remaining on the list.

Index nodes have to scan the whole dataset on start, so this will take close to an hour to resolve.

08:07 UTC: Incident is over.

Fixed · Global

Access logs have been unavailable since 2020-06-04 10:34 UTC. Everything was operational again at 2020-06-04 11:25 UTC.

Fixed · Access Logs · Global

Metrics are currently unavailable as the 2 replicas of a chunk of the index are down. Estimated time to resolution: 30 minutes.

13:38 UTC: Incident is over.

Fixed · Access Logs · Global

Metrics and access logs currently have some delay in their ingestion. We are currently under one hour of delay with the gap closing at a low rate. Data for older metrics / access logs remain available. We are currently working to reduce the current ingestion delay.

EDIT 06:58 UTC: Ingestion is back at its normal rate and we are currently under 30 minutes of delay. This should be at 0 seconds of delay in the next couple of minutes.

EDIT 06:18 UTC: Ingestion delay has also been back to normal for a few minutes. Incident is over. Everything (access logs / metrics) should have the latest data again.

May 2020

Fixed · Global

Due to a backend issue, access logs between 17:05 UTC and 18:19 UTC were lost. All access logs emitted after 18:47 UTC can't be queried because they are not yet indexed. The indexing fails because of an issue with the GEO IP location feature. Once this is fixed, the logs will be indexed. This impacts access log retrieval through the command line and the various stats displayed in the console (last 24 hours of requests, heatmap, live map, status code, ...).

Update 21:04 UTC: The GEO IP feature has been fixed. It seems to have initially been broken by an automatic update of the GEO IP library, but more tests will need to be conducted to be sure of the root cause. All access logs between 18:47 UTC and now have been consumed and you should now be able to query them. We will work on improving the monitoring of the whole system to detect this kind of issue faster.

Fixed · Global

In an effort to deal with spikes of API load in recent days, the API and the deployments will be disabled for maintenance starting at 20:30 UTC tonight (22:30 Paris time) for a duration of up to 10 minutes. Thank you for your understanding.

20:30 UTC: Maintenance is starting.

20:35 UTC: API and deployments are back up, maintenance is over.

Fixed · Deployments · Global

Deployments were failing without any reason or logs shown in the console / CLI. The root cause was identified and fixed at 10:25 UTC. All deployments started after 09:55 UTC failed and must be restarted.

Sorry for the inconvenience. We are still watching the status of the deployment system to make sure the problem is indeed resolved.

EDIT 10:40 UTC: Everything is back to normal.

Fixed · Reverse Proxies · Global

19:43 UTC: One of the add-on reverse proxies stopped responding to some of the requests.

19:45 UTC: We restarted it. Some connections that were still working were lost, but the reverse proxy is now operational.

Fixed · Access Logs · Global

Metrics cannot be read at the moment because of an issue with the index components.

A chunk and its replica are both unresponsive, which means the service as a whole is unavailable. We are working on it.

10:00 UTC: An index node being unavailable threw us off on the wrong track. Its replica was actually working just fine, the issue was with both front read nodes being stuck at the same time. We will improve monitoring and try to figure out what went wrong and why.

April 2020

Fixed · Access Logs · Global

A misconfiguration on new Metrics storage nodes caused a bad cluster state, which in turn is causing issues with the ingestion. We are working on fixing this issue.

14:35 UTC: We found the cause of the issue and are working on fixing it.

14:47 UTC: The root cause is fixed and the ingestion is now running at full speed. The misconfiguration was only half the story: what caused this issue was a partial network split.

14:56 UTC: Ingestion is all caught up. Incident is over.

Fixed · Global

Access logs are currently unavailable. Queries to get access logs might not work as expected. We are working on a fix.

EDIT 13:12 UTC+2: This also impacts the real-time map in the console. You may not see live queries to your applications, but your applications still receive requests as usual.

EDIT 13:22 UTC+2: Fixed, but the access logs from the downtime period were deleted. We identified the root cause and are fixing it.

Fixed · MongoDB shared cluster · Global

At 10:12 UTC, a MongoDB data node of the free and shared cluster went down. It was restarted at 10:16 UTC. Connections and queries during that time frame may have failed.

Dedicated add-ons (XS SmallSpace and above) were not impacted