Clever Cloud Status

Incidents

Full history of incidents.

December 2020

Fixed · Access Logs · Global

Metrics and access logs are currently unavailable for querying. We are working on a fix.

EDIT 11:42 UTC: The issue has been fixed and metrics and access logs can be queried again. There is an ingestion delay (currently 30 minutes) that is being resolved.

EDIT 12:10 UTC: The ingestion delay is now resolved, everything should be back to normal.

Fixed · Global

Due to an issue in the advanced metrics listing system, we need to reset it. During the next few hours, the advanced metrics listing will either be unavailable or display partial results. The metrics themselves will stay available; only the listing will be affected.

EDIT 18:53 UTC: The maintenance is still in progress.

EDIT 00:00 UTC: The maintenance is done, the custom metrics should be available again.

Fixed · Deployments · Global

We are investigating an issue related to deployments. It looks like some deployments are not starting and others are not updating the reverse proxy configuration as expected.

13:54 UTC: Related to this issue, the API is unavailable at this time. We are working on it.

13:55 UTC: We stopped the deployments to avoid any more missing updates.

13:56 UTC: The API being unavailable means that the Console and the CLI will display various errors.

14:05 UTC: Git pushes are also unavailable; an error will occur. The main problem has been identified and we are working toward a resolution.

14:23 UTC: We are still working on fixing the root cause of this issue.

14:49 UTC: We are still working on fixing the root cause of this issue. In the meantime, we have managed to get a fully up-to-date configuration on some reverse proxies.

15:07 UTC: We believe we have fixed the root cause of the issue and are working on cleaning everything up.

15:15 UTC: Everything is looking good now. If you still have an issue, please contact us.

Fixed · Reverse Proxies · Global

Today, between 17:24 UTC and 17:34 UTC, customers using our Sozu reverse proxies may have noticed errors when connecting through one of the proxies. An upgrade maintenance was ongoing, which involved stopping the Sozu service and rebooting the machine. Unfortunately, the traffic wasn't correctly redirected to an alternative instance, leading to various TLS or HTTP errors when connecting to the unhealthy instance. Once the machine was up again, the traffic was handled correctly.

The root cause has not yet been found, but this shouldn't have happened, as we routinely perform such maintenance operations without any issues. We will look further into this. Apologies for the inconvenience.

Fixed · Reverse Proxies · Global

Today, between 16:00 UTC and 17:50 UTC, some reverse proxy configuration updates went missing. Applications that redeployed during this time frame may not have been correctly updated on some of our reverse proxies, leading to HTTP 503 / "This application is redeploying" or HTTP 404 / "Not Found" errors alongside the regular application responses.

The root cause is still unclear; additional investigation will be performed. A bit before 16:00 UTC, we had an incident on an internal tool that may be related.

Fixed · Deployments · Global

A part of the deployment system is experiencing higher load than usual, which may cause some delay before deployments actually start.

We are working on it.

16:23 UTC: This incident is over.

Fixed · Access Logs · Global

We are experiencing a significant delay in the ingestion pipeline of Metrics.

The original incident started at around 05:15 UTC and we have been containing it since then, keeping the lag under a few tens of seconds at worst.

It is now getting worse: our attempts at fixing the issue are currently having the opposite effect. This will take a while to solve.

11:17 UTC: The ingestion delay is now reduced to about 15 seconds. The issue is not completely solved, this is only a first step.

11:58 UTC: The ingestion delay is now back to normal. The root cause is not entirely fixed, so the issue may come back, but we will consider this incident resolved for now.

Fixed · Access Logs · Global

Metrics for applications and add-ons in the Clever Cloud Console, as well as old access logs (not the live ones), are currently unavailable. Status code charts and the heat map in the application overview will also be unavailable. The system is currently recovering.

EDIT 21:10 UTC: The service has been back to normal for about 30 minutes.

November 2020

Fixed · Infrastructure · Global

A network outage is currently affecting multiple servers. We are investigating. Multiple services may be in a degraded state or unreachable. Customers' applications and databases will experience the same issues.

EDIT 17:22 UTC+1: The network has been restored on those servers. We are continuing to investigate which services are impacted. Applications that lost network connectivity to our monitoring are restarting. Applications that crashed because they lost access to their database are also restarting.

EDIT 17:45 UTC+1: Deployments may still take some time to start or, for those ongoing, to finish. We are cleaning up the situation.

EDIT 18:17 UTC+1: Deployments have been back to normal since 18:05. We are still cleaning up the rest of the mess and making sure everything is back to normal and working fine.

EDIT 18:25 UTC+1: Incident is over.

The issue occurred during maintenance by our infrastructure provider, during which multiple power cables were disconnected on active switches. Some of our servers were linked to those switches, cutting their network access for 5 minutes. The backup network links of those servers were also affected, leading to a total loss of network connectivity. We will investigate this incident further with the infrastructure provider.

Fixed · Global

The database of the core API, and therefore the core API itself, will be unavailable for up to 5 minutes (~1 minute if everything goes to plan), starting at 11:00 UTC.

11:03 UTC: The maintenance is starting; the Console is in maintenance mode.

11:06 UTC: Maintenance is almost over.

11:07 UTC: Maintenance is over.

Fixed · Reverse Proxies · Global

We are investigating a major issue on public and internal reverse proxies (private ones are not affected).

EDIT 17:55 UTC: We identified the issue (DDoS).

EDIT 17:56 UTC: We fixed the issue on the internal reverse proxies.

EDIT 19:15 UTC: We are still working to fix the issue.

EDIT 20:30 UTC: Fixed; the situation is back to normal. We will publish a post mortem.


Post mortem 2020-11-15

16:45 UTC: Our monitoring raises an alert: public and internal reverse proxy traffic is abnormally decreasing. Dedicated reverse proxies for Premium clients are not impacted. The on-call team starts investigating.

16:53 UTC: We see a lot of HTTP requests timing out with PR_END_OF_FILE_ERROR randomly on multiple reverse proxies.

17:00 UTC: We diagnose a DDoS: a large number of IPs are sending an abnormal traffic pattern to identified domain names on our Paris infrastructure, preventing the reverse proxies from accepting connections and causing the reduced traffic.

17:30 UTC: After we ban these addresses, new ones are used for the attack, so we start banning IP ranges. During this period, we apply custom reverse proxy configurations to limit the attack's impact on various clients.

17:56 UTC: We apply these bans on the internal reverse proxies and the internal situation comes back to normal; we then apply them on the public reverse proxies.

18:00 UTC: Traffic is back to normal. PR_END_OF_FILE_ERROR has disappeared, but we are now facing SSL_ERROR_SYSCALL. We start investigating.

18:24 UTC: We determine that these errors are due to misconfigurations applied during the reverse proxy configuration changes.

20:06 UTC: All configurations are fixed and everything is working as usual. We are improving the reverse proxy auto-configuration to avoid error-prone manual actions, fixing custom clients' configuration items, and watching monitoring data closely.

20:14 UTC: The improved reverse proxy auto-configuration is deployed.

20:30 UTC: We announce the end of the incident. The attack logs will be used to improve our DDoS detection system.

Fixed · Infrastructure · Global

For about a minute around 10:23 UTC, our servers in one of our two Paris datacenters could not reach any outside network including the other datacenter.

The impact on applications deployed on more than one scaler should be nil (apart from database access, depending on your particular case). Applications deployed on a single instance had about a 50% chance of being affected, i.e. roughly the odds of running in the affected datacenter.

This network incident also impacted Metrics: the service was unavailable for 15 minutes after the incident, and ingestion was delayed for another 15 minutes.

As of now, we don't know exactly what happened, but we suspect that a router malfunctioned and went haywire for a minute.

Fixed · Infrastructure · Global

We have detected a network incident between Free (a French ISP) and our network provider for the Paris zone. We are seeing 40 to 50% packet loss on this interconnection.

Our network provider is investigating the issue.

10:48 UTC: We no longer experience packet loss on this interconnection. We are awaiting more information from our network provider on the cause and resolution of this incident.

10:57 UTC: The issue is back, we are experiencing the same amount of loss again.

11:07 UTC: The issue went away again. We are still awaiting word from our provider.

11:32 UTC: We are experiencing packet loss again on the same link.

11:35 UTC: The issue went away again.

11:36 UTC: The issue ultimately lies with Free and we cannot do anything about it from our side. Until the root cause is properly fixed, the packet loss may come and go.

14:53 UTC: Our network provider tells us that the peering link has been affected by the side effects of a DDoS targeting another of their customers. They are working on measures to prevent further attacks targeting this network, which should in turn prevent this link from getting overwhelmed.

October 2020

Fixed · Deployments · Global

We are seeing some deployments timing out. It looks like the retry mechanism is doing its job just fine and deployments are starting anyway for all affected applications, but you may be observing an unusual delay.

We are investigating this issue.

13:47 UTC: The issue is fixed. All deployments worked fine during this period, only delayed by a few seconds. The issue came from a misconfigured deployment component that was sending broken messages to hypervisors. The broken component has been dealt with.

Fixed · Infrastructure · Global
A core RabbitMQ node stopped responding and some databases were unreachable for 30 seconds. We are investigating the outage.

Some applications may register a connection loss to their database.

13:17 UTC: No further network loss. All critical parts of Clever Cloud have been checked and restarted to make sure they still communicate with each other.

Logs interruption
Fixed · Services Logs · Global

Logs were interrupted for 15 minutes due to an internal issue. They were recorded and are being ingested; it may take a few minutes to receive both the backlog and current logs.

The issue is fixed and we are awaiting full resolution.

Fixed · FS Buckets · Global

Some FS Buckets add-ons are experiencing issues. We have identified the issue and are working on its resolution.

EDIT 21:25 UTC: The issue is fixed. PHP applications may not work correctly; we are redeploying them.

EDIT 22:30 UTC: Applications with FS Buckets have been redeployed. The incident is closed.

Post mortem: An incorrect manual action caused the FS Buckets system to follow the wrong path between storage nodes. We applied a fix to prevent this from happening again.

Fixed · Cellar · Global

We identified availability issues on some Cellar add-ons and are working on their resolution.

EDIT 15:25 UTC: Fixed. We are investigating the cause.

EDIT 15:45 UTC: We identified the cause and applied a fix.

Fixed · Access Logs · Global

Metrics and access logs requests might experience issues following the maintenance of a core component of those features. Requests can either take a very long time to complete or simply return an error. We are working toward a fix.

Data won't be lost, the ingestion is simply delayed.

Impacted products:

  • Metrics (in console or using the API)
  • Access logs (charts in the console's overview or using the CLI / API)

EDIT 14:03 UTC: Ingestion is now catching up on the delay and everything looks good. It may take 30 to 40 minutes to get completely back to normal.

EDIT 14:25 UTC: Ingestion has now caught up, everything should be back to normal.

EDIT 21:26 UTC: New issues are ongoing, we are investigating.

EDIT 22:16 UTC: Ingestion is running. We are draining the queues.

EDIT 23:30 UTC: Ingestion is back to normal. Fixed.

Fixed · Console · Global

Loading the console might result in various errors preventing users from logging in. We are currently investigating. The CLI shouldn't be impacted. Already loaded console webpages shouldn't be impacted either.

EDIT 13:02 UTC: A change causing this issue has been backed out. We will investigate further why it went wrong despite working correctly on our test infrastructure. Sorry for the disruption.