Clever Cloud Status

Incidents

Full history of incidents.


May 2024

Fixed · Access Logs · Global

The pipeline that computes telemetry from access logs had stopped performing its computations since the end of last week. We have fixed the issue, but the missing computations could not be recovered.

Fixed · Global

We identified an increase in errors on read queries; writes are not impacted.

Fixed · Reverse Proxies · Global

Maintenance Window: 2024-05-22T09:00:00Z - 2024-05-24T20:00:00Z (UTC)

Scope:

  • We will roll out software updates to all application load balancers in every Clever Cloud region

Expected Impact:

  • Brief disconnections or connection drops during the upgrade process.
  • Potential minor performance fluctuations.

Additional Information:

  • Please report any issues with a method for reproducing the problem (e.g., a curl command for application load balancer issues; see the example below).
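
For reference, a minimal reproduction for a load balancer issue could be a verbose curl request against the affected application's domain, so that the exact request, response headers and failure are visible. This is only a sketch; example.cleverapps.io is a hypothetical placeholder for your own domain:

  # Hypothetical example: replace example.cleverapps.io with the affected application's domain.
  # -v shows the TLS handshake and response headers; -o /dev/null discards the response body.
  curl -v -o /dev/null https://example.cleverapps.io/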

EDIT 13:06 UTC : We are beginning the rolling update of the WSW region.

EDIT 15:00 UTC : We have updated the WSW region. We are beginning the rolling update of the SGP, MTL and SYD regions.

EDIT 16:25 UTC : We have updated the MTL region.

EDIT 17:00 UTC : We have updated the SYD and SGP regions. The next regions will be updated tomorrow.

EDIT D+1 08:10 UTC : We will begin the update of MEA and GRA-HDS.

EDIT D+1 09:20 UTC : We have finished the update of GRA-HDS; we are beginning the update of RBX and RBX-HDS. The rolling update of MEA is still running.

EDIT D+1 11:15 UTC : We have finished the update of MEA, RBX and RBX-HDS.

EDIT D+1 15:00 UTC : We are beginning the update of SCW and dedicated regions.

EDIT D+2 13:00 UTC : We have finished the update of SCW and the dedicated regions. We will perform the update of the PAR regions and dedicated application load balancers in a new status entry here: https://www.clevercloudstatus.com/incident/855

Fixed · Reverse Proxies · Global

An add-on reverse proxy was unreachable from 14:23 UTC+2 to 14:27 UTC+2. During that time, connections to add-on services might have timed out or failed with various errors.

The issue has been resolved.

Fixed · Metrics · Global

We identified an increase in errors on read queries; writes are not impacted.

UTC 08:53 : Read queries have been disabled in order to solve the issue.

UTC 09:23 : Read queries are now back to normal.

Fixed · Cellar · Global

The following domains seem to have been reported as unsafe to Microsoft Defender SmartScreen:

  • cellar-c2.services.clever-cloud.com
  • cellar-fr-north-hds-c1.services.clever-cloud.com
  • cellar-fr-north-c1.services.clever-cloud.com

This means that Microsoft Edge users might have issues downloading files stored on our Cellar services.

We are currently in discussion with them to get the domains unblocked. In the meantime, do not hesitate to click the "This domain is safe" link in the Defender screen.

EDIT 2024-05-07

We have set up a workaround. To prevent this workaround from being abused by malicious users, we won't disclose it here.

Please contact us if you need to use it.

EDIT 2024-06-28

The domains were un-flagged a few days ago. The incident is now over.

Fixed · Reverse Proxies · Global

(Times are in UTC)

At 2024-05-04 23:31 a database load balancer lost its network routes. The alert about that was set as low priority and did not wake up the on-call agent. At 01:23, another service failed because of that load-balancer issue. This time, the failure triggered a high priority alert.

The on-call agent investigated the issue and saw that the load-balancer was responsible for the other service's failure. They fixed the network issue. Every impacted service got back online around 01:45.

Clever Cloud's PAR region has 8 of those load balancers. Only the services that were trying to connect to this one experienced downtime. Some customers' applications redeployed themselves and connected to another one, quickly fixing the issue.

On 2024-05-06, we made the first alert a high-priority one. It should already have been high priority. We also made sure that every other "load balancer is unreachable" alert is high priority.

Fixed · API · Global

We are observing timeouts and errors on api.clever-cloud.com. We are investigating the issue.

EDIT 16:00 UTC : We have found an issue; we are patching it and redeploying the API.

EDIT 16:10 UTC : We have deployed a new version of the API.

EDIT 16:20 UTC : The issue seems to be solved; we are keeping an eye on it.

EDIT 16:30 UTC : The issue is solved; we did not observe any further errors or timeouts.

April 2024

Fixed · Heptapod Cloud · Global

Some emails issued by the Heptapod service weren't correctly delivered to their recipients over the last few days. The underlying issue has been fixed and the mail backlog is currently being processed. Additional monitoring of the email queue will be put in place.

We will update this incident once the backlog is fully processed.

EDIT 2024-04-25 16:00 UTC: The backlog has been fully ingested. The incident is now over.

Fixed · Metrics · Global

An operation is pending on the metrics cluster to make it more resilient to spikes and load. It shouldn't impact metrics read queries, but it can generate lag in the write path.

EDIT UTC 18:29 : The operation is done; services weren't disturbed.

Fixed · Access Logs · Global

Beginning at 05:00 UTC, we saw a drop in the rate of access log consumption, which seems to be caused by a difficulty in producing them. We are investigating the issue. You may see delays when retrieving your access logs.

EDIT 10:30 UTC : We are performing a rolling restart of the underlying Pulsar brokers; you may see disconnections.

EDIT 16:00 UTC : The rolling restart has been performed. We still have ingestion issues and will keep investigating.

EDIT D+1 08:50 UTC : We still have ingestion issues on a few partitions, which may be related to an underlying problem; we are digging into it.

EDIT D+2 14:00 UTC : We have found the underlying issue and solved it; we are now consuming the remaining lag.

EDIT D+3 13:00 UTC : We are still consuming the remaining lag; the current ETA for full recovery is tomorrow night.

EDIT D+4 06:00 UTC : We have finished consuming the remaining lag.

Fixed · Mails · Global

We are currently experiencing a disruption in our email services due to an unforeseen issue; emails will be delayed until it is resolved. Our team is actively working to restore service as quickly as possible. We will keep you updated on our progress and notify you as soon as services are fully operational again.

EDIT 20:04 UTC+2: We are still working on the issue.

EDIT 2024-04-19 12:17 UTC+2: The issue has been fixed; we continue to monitor the situation.

Fixed · Metrics · Global

Metrics query results are lagging a bit; we have identified the underlying issue and issued a preliminary fix. We are monitoring the result. Grafana dashboards or results obtained from the metrics API might be missing some recent values until this is resolved.

EDIT 2024-04-18 10:53 UTC+2: The issue is still present. We've been forced to sample incoming data until we figure out the underlying issue.

EDIT 2024-04-18 12:07 UTC+2: Our storage layer has been stabilized; we still apply sampling to incoming data. Queries should be working properly.

EDIT 2024-04-18 21:13 UTC+2: The situation has improved, sampling on incoming data has been disabled. We continue to monitor the system but queries should now return the correct data without lag.

EDIT 2024-04-18 23:37 UTC+2: This incident is now over.

Fixed · Infrastructure · Global

A hypervisor in the Paris region was unreachable and rebooted. We are looking into it and making sure it restarts all of its services.

EDIT 15:40 UTC+2: All services are up again since ~15:30 UTC+2. We continue to monitor the situation. If you still have issues, please contact our support.

Fixed · Cellar · Global

Ceph (the software we are running Cellar on) is rebalancing some shards due to a change in its storage capacity. Some requests might fail while the rebalance is in progress.
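
As background, a Ceph rebalance like this is visible from the operator side in the cluster status output, which reports the share of misplaced or degraded objects still being moved. This is only an illustrative sketch of what our operators watch; it is not something exposed to Cellar users:

  # Show cluster health and the progress of recovery/backfill (misplaced and degraded objects).
  ceph -s
  # Follow the cluster log as the rebalance progresses.
  ceph -w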

edit: after a few alerts, everything has been running smoothly.

Fixed · Metrics · Global

Metrics queries are currently unavailable as some indexing shards are offline. We are working to bring them back up as quickly as possible. There is no impact on the ingestion pipeline or the storage layer.

UTC 11:00: Queries are available

Fixed · Metrics · Global

Metrics queries are currently unavailable as some indexing shards are offline. We are working to bring them back up as quickly as possible. There is no impact on the ingestion pipeline or the storage layer.

UTC 14:35: Services are back to normal.

Fixed · Access Logs · Global

We have some lag on the ingestion of access logs. We are working on it.

EDIT 12:00 UTC : we have consumed the lag for 75% of the access logs; we are working on the remaining ones.

EDIT 14:45 UTC : we are in sync for 80% of the access logs; we are consuming the remaining ones (ETA 22h to consume the lag).

EDIT 16:41 UTC : we have consumed all remaining access logs.

Fixed · cleverapps.io domains · Global

The *.cleverapps.io wildcard certificate failed to renew. We are currently renewing it.

Edit 15:05 UTC: we have renewed the certificate. It should appear on the load balancers in a few minutes.

Edit 15:11 UTC: the certificate has been deployed on all cleverapps.io load balancers. The incident is over.
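
If you want to verify which certificate a load balancer is currently serving for your domain, one way is to inspect it with openssl. This is just a sketch; myapp.cleverapps.io is a hypothetical placeholder for your own cleverapps.io domain:

  # Print the validity dates of the certificate currently served for the domain (SNI selected via -servername).
  openssl s_client -connect myapp.cleverapps.io:443 -servername myapp.cleverapps.io </dev/null 2>/dev/null | openssl x509 -noout -dates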

Fixed · Cellar · Global

We identified an issue where certain requests to the Cellar service on the North region might have timed out or responded more slowly than usual. The issue has been fixed, but we are still looking for the underlying cause and continue to monitor the situation.

EDIT 11:56 UTC+2: A storage node was unexpectedly unresponsive and caused timeouts in various parts of the storage cluster. The issue started around 10:31 UTC+2 and went unnoticed until 11:26 UTC+2. Additional monitoring will be put in place to better handle this situation.