Clever Cloud Status

Incidents

Full history of incidents.


August 2024

Fixed · Pulsar · Global

Connection issues (producers/consumers) during cluster upgrade.
This can cause application redeployments to fail.

Fixed · PostgreSQL · Global

Ordering of DEV add-ons is currently locked. There is no impact on existing add-ons.

We are investigating

[EDIT 12:00 CEST]: we have identified and fixed the lock

Fixed · Cellar · Global

Because of the hardware issue described in https://www.clevercloudstatus.com/incident/874, we need to rebalance data on Cellar North. Customers may experience higher latency than usual.

Fixed · Infrastructure · Global

A hypervisor in the GRA-HDS region is unreachable. We are working on it.

EDIT Thu Aug 01 09:13:09 2024 UTC: the hypervisor has been rebooted. A hardware issue was detected. All applications have been redeployed, and there were no customer databases on the hypervisor.

July 2024

Fixed · Deployments · Global

Application deployments are taking an unbounded amount of time to complete.
We are investigating the issue.

09:30 UTC: We noticed deployment disruptions. 10:02 UTC: We found a cause of the disruptions and fixed it.

14:30 UTC: We noticed further issues with the deployment system. 16:30 UTC: After further investigation, we found the cause of the disruptions and applied a temporary fix. Deployments are back on track!

We are working on a stronger fix for the deployments.

Fixed · Metrics · Global

We identified a bottleneck on our FoundationDB cluster for warp10-c2. Writes are impacted, which may cause a lag when reading metrics. We have enabled sampling on the data.

6:32 UTC: We identified unusual usage that was harming the system

6:35 UTC: Unusual usage stopped, the storage layer is starting to recover

6:58 UTC: Storage layer fully recovered; we are still investigating and monitoring the system

7:45 UTC: System is back to normal

Fixed · Infrastructure · Global

At 2024-07-12 23:35 UTC, we received an alert about WSW hosts not responding. We checked and could not ping any of our servers.

At 23:43, we pinged again. An SSH connection to the hypervisors showed that the servers had an uptime of one minute. We checked that all services running on the servers had restarted correctly and fixed those that had not. Applications were redeployed by the monitoring system. At 23:55, everything seemed to be back to normal.

We don’t know yet why the servers were rebooted.

Fixed · Pulsar · Global

Some Pulsar brokers are having issues connecting to the underlying ZooKeeper. We are investigating the reason.

There was an issue with ZooKeeper sessions. It is now fixed.

Fixed · Global

We've updated load balancer IP addresses for applications and websites hosted on Clever Cloud. The new IP addresses now in use are:

91.208.207.214
91.208.207.215
91.208.207.216
91.208.207.217
91.208.207.218
91.208.207.220
91.208.207.221
91.208.207.222
91.208.207.223

Important:

We are going to remove 4 IPs; you must stop using them before August 23rd, 2024:

46.252.181.103
46.252.181.104
185.42.117.108
185.42.117.109

After this date, your applications and websites will no longer be able to use these IP addresses.

We still recommend using CNAME DNS records where possible. To ensure there is no disruption to your applications and websites, please make sure your apex domain names point to the new IP addresses. You can update your apex domain names by editing the DNS records for your domain.

Impact:

There should be no downtime for your applications or websites as a result of this change. However, if you do not update your apex domain names before August 23rd, your applications and websites may be unavailable.

What you need to do:

Review your apex domain names and ensure that they are pointing to the new IP addresses. If you are unsure how to update your apex domain names, please contact your domain registrar or Clever Cloud support.
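Before the removal date, you can check whether a domain's A records still reference the deprecated addresses. A minimal sketch in Python, using the IP lists above (obtaining the current A records is left to your own tooling, e.g. the output of `dig +short example.com A`):

```python
# New load balancer IPs announced above.
NEW_IPS = {
    "91.208.207.214", "91.208.207.215", "91.208.207.216",
    "91.208.207.217", "91.208.207.218", "91.208.207.220",
    "91.208.207.221", "91.208.207.222", "91.208.207.223",
}

# IPs scheduled for removal on August 23rd, 2024.
DEPRECATED_IPS = {
    "46.252.181.103", "46.252.181.104",
    "185.42.117.108", "185.42.117.109",
}

def check_records(a_records):
    """Classify a domain's A records.

    `a_records` is the list of IPs your DNS query returned.
    Returns (stale, unknown): `stale` are deprecated IPs that must be
    replaced before August 23rd; `unknown` are IPs in neither list.
    """
    records = set(a_records)
    stale = records & DEPRECATED_IPS
    unknown = records - DEPRECATED_IPS - NEW_IPS
    return stale, unknown
```

For example, a domain whose A records are `["91.208.207.214", "46.252.181.103"]` would report `46.252.181.103` as stale and should have that record replaced.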

For more information:

Please refer to the Clever Cloud documentation for more information about load balancers and DNS records: https://developers.clever-cloud.com/doc/administrate/domain-names/#using-personal-domain-names
You can take a look at the changelog entry about this change: https://developers.clever-cloud.com/changelog/2024-06-28-new-ip-list-paris
You can also contact Clever Cloud support if you have any questions.

Fixed · Global

Time slot: 09/07/24 from 07:00 AM UTC to 09:00 AM UTC

Our infrastructure provider will perform hardware maintenance impacting our whole Warsaw region. As there is a risk of electrical outage for the servers, we will follow their advice and shut down the region during the maintenance, which may last up to 2 hours (from 07:00 AM UTC to 09:00 AM UTC). If your services cannot tolerate such unavailability, we advise you to migrate them to another region, such as Clever Cloud Paris, before the maintenance.

You can request assistance by reaching out to our support.

EDIT 2024-07-09 10:05 UTC: The maintenance is complete. All services are up and running on the WSW region.

Fixed · Global

Some logs drains are not being delivered correctly to their targets. The issue has been identified and is being resolved.

EDIT 17:00 UTC+2: Drains should be available again. The incident is now over.

Fixed · Services Logs · Global

Add-on logs are not available; there is an outage on the Elasticsearch cluster.

08:23: The Elasticsearch sink has been paused to restore logs drains.

The system has been restored.

June 2024

Fixed · Services Logs · Global

Some logs drains are not being delivered correctly to their targets. The issue seems to have started at 2024-06-21 10:30 UTC+2. We have identified the issue and are working on a fix.

EDIT 16:53 UTC+2: Logs drains have been fully functional since 14:13 UTC+2 and stable since then. If you still have missing logs from your drains, please open a ticket and we will investigate further.

Fixed · Infrastructure · Global

Our monitoring has detected a cut in network traffic for the Paris datacenters; we are investigating the issue.

EDIT 06:05 UTC: Network traffic is back to its usual rate. We observed a cut outside the Clever Cloud network and are investigating why.

EDIT 06:15 UTC: We observed a second cut and identified that a network provider is performing a maintenance operation, which seems to be the cause.

EDIT 07:21 UTC : We have seen a third cut.

EDIT 07:40 UTC: We contacted our network provider and confirmed that the cuts are caused by the maintenance. So far, we are aware of 5 cuts.

EDIT 08:10 UTC : We are not expecting more network cuts as the maintenance window is over, but we are watching.

Fixed · Infrastructure · Global

At 20:30 UTC, our monitoring registered a wave of network reconnections and downtime of IPSec tunnels for a few minutes.

We checked all the tunnels and restarted those that did not restart automatically. We checked the load balancers and did not see anything unusual except the spike in reconnections.

After investigating, our probes revealed a very low rate of packets from the internet for 5 minutes.

Fixed · Services Logs · Global

We are encountering delivery issues with the Logs Drains platform. We are currently investigating the issue. Logs drains delivery may be delayed until this issue is resolved.

EDIT 09:30 UTC: We may have found the origin of the issue and have implemented a fix, which we are monitoring. Currently, logs drains are delivered without delay.

EDIT 14:09 UTC: The situation is now stable. The incident is closed.

Fixed · API · Global

Our main API was unavailable for a few minutes, between 13:20 UTC and 13:23 UTC. We are looking into it. Deployments started during that period may be impacted.

EDIT 2024-06-04 13:32 UTC: The root cause has been found and fixed. Deployments that were started during that period may have failed. You should be able to retry them. Please contact our support team if you still face any issues.

Fixed · Access Logs · Global

We are seeing lag on the ingestion pipeline and are investigating the issue.

EDIT 20:00 UTC : We are still investigating the issue.

EDIT 22:00 UTC : We are still seeing lag on the ingestion pipeline. We found a bottleneck in the offload process of Pulsar's tiered storage, which we have fixed, but we now need to wait for Pulsar to finish its offload process.

EDIT 2024-06-07 15:00 UTC : The Pulsar cluster has finished its offload tasks, and we have been recovering the lag since yesterday around 16:00 UTC.

EDIT 2024-06-10 08:00 UTC : The access logs lag has been fully recovered since Saturday 14:00 UTC.

May 2024

Fixed · Infrastructure · Global

We are experiencing network issues due to a network misconfiguration at our infrastructure provider. We are currently observing high latency and packet loss when trying to reach machines in Singapore. A ticket is being created.

EDIT 10:30 AM UTC : A ticket has been opened with our infrastructure provider

EDIT 12:45 PM UTC : The network seems more stable since 12:20 PM UTC

EDIT 5:00 PM UTC : The ticket with our infrastructure provider has been closed

Fixed · Global

Maintenance Window: 2024-05-27T09:00:00Z - 2024-05-29T20:00:00Z (UTC)

Scope:

  • We will roll out software updates on the PAR region and dedicated load balancers

Expected Impact:

  • Brief disconnections or connection drops during the upgrade process.
  • Potential minor performance fluctuations.

Additional Information:

  • Please report any issues with a method for reproducing the problem (e.g., curl command for application load balancer issues).

EDIT 2024-05-28 13:15 UTC : We are beginning the maintenance. We are starting with cleverapps.io.

EDIT 14:10 UTC : We have updated the cleverapps.io load balancers; we are now updating the PAR region ones.

EDIT 15:30 UTC : We have updated three of the nine PAR region load balancers; the updates are still running

EDIT 16:30 UTC : We have updated six of the nine PAR region load balancers; the updates are still running for the last ones.

EDIT 17:00 UTC : We have finished updating the PAR load balancers; we will update the dedicated ones starting tomorrow

EDIT 2024-05-29 08:00 UTC : We have seen an increase in TLS errors on the PAR region. We have rolled back 8 of the 9 instances of this load balancer to the previous version, which is not affected. We are keeping one instance with the issue in order to dig in and find the root cause.

EDIT 12:30 UTC : We have found the issue and written a patch; we are releasing it and will then deploy the new version. The issue was limited to services under the *.services.clever-cloud.com certificate only.

EDIT 13:30 UTC : We have deployed the new release on the PAR region; we will start very soon with the other regions and cleverapps.io

EDIT 14:50 UTC: We have deployed the new release on every region and on cleverapps.io. We will begin updating the dedicated load balancers very soon.

EDIT 17:20 UTC : We are finishing the last dedicated load balancers for today and will complete the remaining ones tomorrow.