Clever Cloud Status

Incident History

Full history of incidents.

Newest first

April 2020

Fixed · Access Logs · Global

Metrics and Access Logs are currently unavailable. We are working on a fix.

06:38 UTC: Everything is back online, ingestion is catching up.

06:52 UTC: Ingestion delay is back to normal.

Fixed · FS Buckets · Global

One of our FS Bucket systems is experiencing issues with write operations. We have identified the issue and are working to fix it.

EDIT 13:01 UTC: fixed.

Fixed · Access Logs · Global

We are experiencing significant packet loss on some Metrics storage nodes.

Ingestion is failing. Access to metrics may be difficult.

15:42:30 UTC: The network is back to normal. We are working on getting the ingestion back to its normal state. Metrics access may be shut down temporarily during this.

16:00 UTC: Ingestion is back online, working through 50 minutes of data.

16:14 UTC: Ingestion delay is almost back to normal.

16:17 UTC: Ingestion delay is back to normal. Incident is over.

Fixed · Cellar · Global

We are experiencing significant packet loss on some Cellar nodes. This may temporarily impact access to some files.

We are looking into it.

15:42:30 UTC: The network is back to normal. We are making sure the service goes back to normal.

16:15:00 UTC: Replication of objects created during the incident is ongoing. Service is operational but can be a little slower than usual.

17:05:00 UTC: Everything is back to normal.

Fixed · API · Global

The API was unavailable for about 5 to 10 minutes, during which most of the requests it received were hanging. Our functional monitoring did not report this issue, so it may have been related to authenticated requests only.

Our CLI and Console were impacted.

We will investigate this incident further.

March 2020

Fixed · Access Logs · Global

Metrics ingestion is completely stuck at the moment. We are investigating.

11:02 UTC: Ingestion is back online. It's unclear exactly what went wrong at the moment but it is most likely linked to the issue from yesterday. A complete reboot of all storage nodes 'fixed' the issue. Those storage nodes now have 48 minutes of buffered data to ingest.

11:11 UTC: Ingestion delay is very close to normal.

11:17 UTC: Ingestion delay is back to normal.

Fixed · Reverse Proxies · Global

Following a routine update on one of our public reverse proxies, some requests were hanging instead of being processed.

During the upgrade process, one of the workers of this reverse proxy continued to accept connections but didn't process them, holding them until the requests timed out. The issue was resolved by 11:10 UTC+1 and will be investigated further. This is the first time in months that the upgrade process has failed us, and we will take extra steps to avoid this issue and detect it faster.

Fixed · Access Logs · Global

We are experiencing a major network issue with the storage nodes of Metrics.

Because of this, we have temporarily disabled ingestion, which will make things easier to debug and fix.

17:26 UTC: Network issue seems to be gone, ingestion has been restarted.

17:31 UTC: Ingestion is going smoothly. We don't yet know what happened network-wise and are awaiting word from our provider. From our point of view, it looks like a congestion issue.

17:35 UTC: Ingestion delay is back to normal.

Fixed · Reverse Proxies · Global

An add-on reverse proxy was rejecting most of the connections it received for about 20 minutes. It was restarted at 21:55 UTC+1. Some applications might have seen connection errors because of the interruption but should be able to reconnect.

We will investigate this further as we have monitoring for such a case and it apparently didn't trigger here.

Fixed · cleverapps.io domains · Global

An issue with one of our reverse proxies led to TLS errors on *.cleverapps.io for about 20 minutes.

The issue has been resolved and the root cause has been found. A patch will be applied to avoid this happening again.

Fixed · Deployments · Global

We are experiencing issues on internal systems. We have disabled deployments to limit potential impacts on our internal systems.

EDIT 16:25 UTC: fixed.

N.B. Between the issues and the deactivation of deployments, some applications were responding with HTTP 503. This is now fixed.

Fixed · Infrastructure · Global

From 00:14:40 UTC to 00:23:10 UTC, the PAR zone was unreachable from multiple networks.

We don't know exactly what happened at this time but it looks like the impact was fairly minimal on actual users as we can't see any meaningful dip in aggregated incoming bandwidth usage of load balancers.

This post will be updated once we get more details from our network operator.

Fixed · Console · Global

The status of applications is not displayed correctly in the Console, but this does not affect whether they are actually up or down.

EDIT: it's now fixed, app status and ssh access are now operational.

Fixed · Access Logs · Global

Metrics cannot be queried currently; any request will return an empty result.

This is caused by multiple instances of the same component crashing at the same time.

We are working on fixing this. A definitive fix may take a while (30 minutes at best, 1h30 at worst).

14:41 UTC: Metrics are currently available, but this will probably not last, as there is only partial redundancy on the affected component and the cause of the crash is not fixed.

15:23 UTC: Metrics once again cannot be queried.

15:33 UTC: Metrics can be queried, but issues may still arise from time to time; the issue is still not fixed.

15:45 UTC: Two nodes of the storage backend crashed under the load caused by the reload of the first components. This caused a delay in ingestion and a pause in the reload of the first components. At this time, ingestion is catching up on the delay and queries are running fine despite the issues. You will most likely encounter issues as we work our way through this.

16:48 UTC: We have complete redundancy; this issue is now fixed.

Fixed · API · Global

We are experiencing issues on the Clever Cloud API that can affect Console and CLI requests.

EDIT 15:28 - we are still experiencing issues and are working on a fix.

EDIT 15:39 - fixed.

February 2020

Fixed · SSH Gateway · Global

An incident on a component used by the SSH gateway occurred at 08:56 UTC.

This issue was fixed at 09:12 UTC.

From 08:56 UTC to 09:12 UTC, all clever ssh commands would hang forever.

Since 09:12 UTC, you may get the message "Opening an ssh shell." and then nothing. If this does happen, you will have to restart the application you are trying to ssh to.

Fixed · FS Buckets · Global

Some FS Buckets are experiencing issues.

EDIT 18:24 - We have identified the issues; linked applications are redeploying.

January 2020

Fixed · Services Logs · Global

We are investigating an issue with the logs collection pipeline which is noticeably delayed.

14:17 UTC: A component of the "live logs" part of the pipeline was somewhat overloaded and gradually slowed everything down until the delay became noticeable. It has been restarted and the pipeline is now working through the delayed logs waiting in the queue.

14:21 UTC: The load came back up soon after the restart. We are working on bringing it down; we may have to shut it down temporarily to scale it up. (Quick note: we are working on a new pipeline which can be scaled at will without any downtime.)

14:25 UTC: We are temporarily shutting down the Logs API to make things easier.

14:34 UTC: Logs API is back and delay is back to <5 seconds, we are still watching the situation closely.

14:58 UTC: Everything is indeed back to normal.

Fixed · MongoDB shared cluster · Global

The MongoDB shared cluster lost its primary node at 12:45 UTC. The primary role should have been transferred to another node, but depending on your client configuration, you might not have automatically reconnected. The node's system was unresponsive and we had to forcefully reboot it.

Update 13:01: Everything is back to normal. Applications may have to be redeployed.

Fixed · Services Logs · Global

The BETA Access Logs dashboard is experiencing issues. We have identified the issues and are working to fix them.

EDIT 13:55 UTC: fixed.