Incidents

Full history of incidents.

April 2020

Packet loss issues with some Metrics storage nodes 6 years ago

Fixed · Access Logs · Global

We are experiencing significant packet loss issues with some Metrics storage nodes.

Ingestion is failing. Access to metrics may be difficult.

15:42:30 UTC: The network is back to normal. We are working on getting the ingestion back to its normal state. Metrics access may be shut down temporarily during this.

16:00 UTC: Ingestion is back online, working through 50 minutes of data.

16:14 UTC: Ingestion delay is almost back to normal.

16:17 UTC: Ingestion delay is back to normal. Incident is over.

Packet loss issues with some Cellar nodes 6 years ago

Fixed · Cellar · Global

We are experiencing significant packet loss issues with some Cellar nodes. This may impact access to some files temporarily.

We are looking into it.

15:42:30 UTC: The network is back to normal. We are making sure the service goes back to normal.

16:15:00 UTC: Replication of objects created during the incident is ongoing. Service is operational but can be a little slower than usual.

17:05:00 UTC: Everything is back to normal.

API unavailability 6 years ago

Fixed · API · Global

The API was unavailable for about 5 to 10 minutes, during which most of the requests it received were hanging. Our functional monitoring did not report this issue, so it may have been related to authenticated requests only.

Our CLI and Console were impacted.

We will investigate this incident further.

March 2020

Metrics ingestion halted 6 years ago

Fixed · Access Logs · Global

Metrics ingestion is completely stuck at the moment. We are investigating.

11:02 UTC: Ingestion is back online. It's unclear exactly what went wrong at the moment but it is most likely linked to the issue from yesterday. A complete reboot of all storage nodes 'fixed' the issue. Those storage nodes now have 48 minutes of buffered data to ingest.

11:11 UTC: Ingestion delay is very close to normal.

11:17 UTC: Ingestion delay is back to normal.

Requests hang during reverse proxy upgrade 6 years ago

Fixed · Reverse Proxies · Global

Following a routine update on one of our public reverse proxies, some requests were hanging instead of being processed.

During the upgrade process, one of the workers of this reverse proxy continued to accept connections but didn't process them, holding them until the requests timed out. The issue was resolved by 11:10 UTC+1 and will be investigated further. This is the first time in months that the upgrade process has failed us, and we will take extra steps to avoid this issue and detect it faster.

Metrics ingestion disabled due to network instability 6 years ago

Fixed · Access Logs · Global

We are experiencing a major network issue with the storage nodes of Metrics.

Because of this, we have temporarily disabled ingestion, which will make things easier to debug and fix.

17:26 UTC: The network issue seems to be gone; ingestion has been restarted.

17:31 UTC: Ingestion is going smoothly. We don't yet know what happened network-wise and are awaiting word from our provider. From our point of view, it currently looks like a congestion issue.

17:35 UTC: Ingestion delay is back to normal.

Unresponsive add-on reverse proxy 6 years ago

Fixed · Reverse Proxies · Global

An add-on reverse proxy was rejecting most of the connections it received for about 20 minutes. It was restarted at 21:55 UTC+1. Some applications might have seen connection errors because of the interruption but should be able to reconnect.

We will investigate this further as we have monitoring for such a case and it apparently didn't trigger here.

TLS errors on *.cleverapps.io domains 6 years ago

Fixed · cleverapps.io domains · Global

An issue with one of our reverse proxies led to TLS errors on *.cleverapps.io for about 20 minutes.

The issue has been resolved and the root cause has been found. A patch will be applied to avoid this happening again.

Deployments issue 6 years ago

Fixed · Deployments · Global

We are experiencing issues on internal systems. We have disabled deployments to limit potential impacts on our internal systems.

EDIT 16:25 UTC: fixed.

N.B. Between the issues and the deactivation of deployments, some applications were responding with HTTP 503. This is now fixed.

PAR unreachable from multiple networks 6 years ago

Fixed · Infrastructure · Global

From 00:14:40 UTC to 00:23:10 UTC, the PAR zone was unreachable from multiple networks.

We don't know exactly what happened at this time but it looks like the impact was fairly minimal on actual users as we can't see any meaningful dip in aggregated incoming bandwidth usage of load balancers.

This post will be updated once we get more details from our network operator.

Status of app not correctly displayed 6 years ago

Fixed · Console · Global

The status of applications is not correctly displayed in the Console, but this has no impact on whether they are actually up or down.

EDIT: It's now fixed; app status and SSH access are operational.

Metrics unavailability 6 years ago

Fixed · Access Logs · Global

Metrics cannot be queried currently; any request will return an empty result.

This is caused by multiple instances of the same component crashing at the same time.

We are working on fixing this. A definitive fix may take a while (30 minutes at best, 1h30 at worst).

14:41 UTC: Metrics are currently available, but this will probably not last, as there is only partial redundancy on the affected component and the cause of the crash is not fixed.

15:23 UTC: Once again, Metrics cannot be queried.

15:33 UTC: Metrics can be queried, but issues may still arise from time to time; the underlying issue is still not fixed.

15:45 UTC: Two nodes of the storage backend crashed under the load caused by the reload of the first components. This caused a delay in the ingestion and a pause in the reload of the first components. At this time, ingestion is catching up on the delay and queries are running fine despite the issues. You will most likely encounter issues as we work our way through this.

16:48 UTC: We have complete redundancy; this issue is now fixed.

API is experiencing issues 6 years ago

Fixed · API · Global

We are experiencing issues on the Clever Cloud API that can affect Console and CLI requests.

EDIT 15:28 - we are still experiencing issues and are working on a fix.

EDIT 15:39 - fixed.

February 2020

Unable to open an SSH shell on an application 6 years ago

Fixed · SSH Gateway · Global

An incident on a component used by the SSH gateway occurred at 08:56 UTC.

This issue was fixed at 09:12 UTC.

From 08:56 UTC to 09:12 UTC, all clever ssh commands would hang forever.

Since 09:12 UTC, you may get the message "Opening an ssh shell." and then nothing. If this does happen, you will have to restart the application you are trying to SSH into.

Some FS Buckets are experiencing issues 6 years ago

Fixed · FS Buckets · Global

Some FS Buckets are experiencing issues.

EDIT 18:24 - We identified the issues; the linked applications are redeploying.

January 2020

Logs ingestion delayed 6 years ago

Fixed · Services Logs · Global

We are investigating an issue with the logs collection pipeline which is noticeably delayed.

14:17 UTC: A component of the "live logs" part of the pipeline was somewhat overloaded and gradually slowed everything down until the delay became noticeable. It has been restarted and the pipeline is now working on the delayed logs waiting in queue.

14:21 UTC: The load came back up soon after the restart; we are working on bringing it down. We may have to shut the component down temporarily to scale it up (quick note: we are working on a new pipeline which can be scaled at will without any downtime).

14:25 UTC: We are temporarily shutting down the Logs API to make things easier.

14:34 UTC: Logs API is back and delay is back to <5 seconds; we are still watching the situation closely.

14:58 UTC: Everything is indeed back to normal.

MongoDB shared cluster unavailable for 15 minutes 6 years ago

Fixed · MongoDB shared cluster · Global

The MongoDB shared cluster lost its primary node at 12:45 UTC. The primary role should have been transferred to another node, but depending on your client configuration, you might not have automatically reconnected. The node's system was unresponsive and we had to forcefully reboot it.

Update 13:01: Everything is back to normal. Applications may have to be redeployed.

Access Logs dashboard experiencing issues 6 years ago

Fixed · Services Logs · Global

BETA Access Logs dashboard is experiencing issues. We identified the issues and are working to fix them.

EDIT 13:55 UTC: fixed.

Metrics ingestion delayed + read impossible 6 years ago

Fixed · Access Logs · Global

Metrics ingestion is delayed while we investigate an issue with the storage backend. This issue is caused by the addition of capacity to the storage backend.

Metrics cannot be read either. This includes access logs, so the overview of your organizations is not available.

16:20 UTC: The issue is fixed, ingestion is working again. Overview is still not loading for now (because recent data is not there).

16:34 UTC: There was another issue with the reading part, which is now fixed. Everything is now working as normal, though there may be some hiccups with the ingestion in the coming minutes.

16:43 UTC: This issue is resolved. Sorry about the inconvenience.

December 2019

Metrics / Access-logs: Upgrade of the backend storage 6 years ago

Fixed · Global

This upgrade is a follow-up to https://status.clever-cloud.com/incident/238 (which is itself a follow-up to https://status.clever-cloud.com/incident/237).

For approximately 1 hour starting December 26th at 14:00 UTC+1, Metrics and access logs (dot maps / request counts in the Console) will be unavailable for both reading and writing.

All data will be kept and ingested at the end of the maintenance.

EDIT 13:00 UTC: Maintenance is starting

EDIT 13:23 UTC: Initial steps are done. Writes were delayed by up to 8 minutes and some reads may have failed. The second phase of the maintenance will begin shortly.

EDIT 14:32 UTC: Second phase is over. There were two ingestion delays, peaking at 4 minutes each. The maintenance is not over yet, but it should not impact ingestion or reads.

EDIT 14:58 UTC: It should not have had any impact but it still did. Ingestion is delayed, reads are impossible; we are investigating.

EDIT 15:21 UTC: The issue is solved; reads are back and ingestion is working.

EDIT 15:31 UTC: Ingestion delay is back to normal.

EDIT 16:00 UTC: Maintenance is over.