Incidents
Full history of incidents.
November 2023
We detected an issue affecting both access to and ingestion of Clever Cloud application and add-on metrics.
We are investigating.
Edit 27 Nov 2023 11:02:23: Querying is now functional. We are also observing an issue with metrics from add-ons. We are on it.
Edit 27 Nov 2023 06:00 PM: A regression in token regeneration has been fixed, and all tokens have been updated.
During the deployment of a maintenance update to solve https://www.clevercloudstatus.com/incident/767, we applied a patch that put Cellar into an unreachable state. We are currently rolling back the update.
EDIT 17:00 UTC: Cellar is available again.
Over the past few days, our platform encountered several issues in connection handling, with some of our customers experiencing slowdowns in some services. Here are the results of our investigations and the actions taken by our teams:
Update 2023-12-01 18:00 UTC
Cellar
After running more tests, we discovered performance issues on long-distance connections, possibly caused by HTTP/2, which we activated on Cellar a few weeks ago. Our analyses confirmed that uploading data to Cellar over HTTP/2 in such conditions could heavily limit throughput, whereas HTTP/1.1 gave us better and more consistent results. The improvements seen by customers affected by the identified problems far outweigh the benefits of HTTP/2 seen in a few cases. So we are disabling HTTP/2 and monitoring throughput to confirm this at a larger scale.
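If you want to check the effect on your own workloads, here is a minimal, illustrative sketch of an upload throughput comparison between the two protocol versions, using Python's httpx client. The URL is a placeholder; real Cellar uploads need S3-style authentication, e.g. a presigned PUT URL.

```python
# Minimal sketch: compare upload throughput over HTTP/1.1 vs HTTP/2.
# Requires `pip install "httpx[http2]"`. The URL is a placeholder; real
# Cellar uploads need S3-style auth, e.g. a presigned PUT URL.
import time

import httpx

URL = "https://example-bucket.cellar-c2.services.clever-cloud.com/probe"
PAYLOAD = b"x" * (32 * 1024 * 1024)  # 32 MiB test object

def upload_throughput(http2: bool) -> float:
    """Upload PAYLOAD once and return the achieved throughput in MB/s."""
    with httpx.Client(http2=http2) as client:
        start = time.monotonic()
        resp = client.put(URL, content=PAYLOAD)
        resp.raise_for_status()
        elapsed = time.monotonic() - start
    return len(PAYLOAD) / elapsed / 1e6

for use_http2 in (False, True):
    label = "HTTP/2" if use_http2 else "HTTP/1.1"
    print(f"{label}: {upload_throughput(use_http2):.1f} MB/s")
```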
Update 2023-11-28 13:30 UTC
Load balancers 🥁
We will begin including the new load balancer instances deployed yesterday in the load balancer pool starting at 14:00 UTC. The new load balancer IP addresses that will be added alongside the current ones are:
- 91.208.207.214
- 91.208.207.215
- 91.208.207.216
- 91.208.207.217
- 91.208.207.218
EDIT 15:30 UTC: Our monitoring saw an increasing number of 404 response status codes. We rolled back the modification and investigated the issue: internal IP addresses overlapped with those of the Cellar load balancer, which is now fixed.
EDIT 15:45 UTC: After further investigation, we were able to resume the maintenance.
EDIT 18:05 UTC: We have finished deploying the new instances.
Update 2023-11-27 17:20 UTC
Load balancers 🥁
We have installed new load balancers. We will review and test them tonight and add them to the load balancer pool tomorrow morning (2023-11-28).
UPDATE 2023-11-27 15:30 UTC
Load balancers 👀
We are still seeing a few random SSL errors here and there. We are investigating. The culprit may be a lack of allocated resources. We are following this lead.
… we have fine-tuned the load balancers, which temporarily caused more SSL errors for a minute. Traffic now seems better.
UPDATE 2023-11-27 14:00 UTC
Load balancers ❌
We are experiencing new errors on the load balancers: customers report PR_END_OF_FILE_ERROR errors in their browsers while connecting to their apps and SSL_ERROR_SYSCALL from curl.
We are able to reproduce these errors. They look like the incident from the morning of Friday the 24th. We are looking for the configuration mishap that may have escaped our review.
✅ It's fixed. We had started writing a monitoring script for this kind of configuration error; we will speed up its development and its deployment to production.
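As an illustration of what such a probe can look like (a sketch, not our actual production script), the following uses only Python's standard library to flag handshakes that fail or are closed abruptly; the hostname is an example:

```python
# Sketch of a TLS handshake probe: opens a TLS connection to each frontend
# and reports handshakes that fail or are closed abruptly, the client-side
# symptoms being PR_END_OF_FILE_ERROR (Firefox) or SSL_ERROR_SYSCALL (curl).
import socket
import ssl

HOSTS = ["example-app.cleverapps.io"]  # example hostnames to watch
TIMEOUT = 5.0

def probe(host: str, port: int = 443) -> str:
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return f"OK ({tls.version()}, {tls.cipher()[0]})"
    except ssl.SSLError as exc:
        return f"TLS ERROR: {exc}"         # misconfiguration, protocol mismatch…
    except OSError as exc:
        return f"CONNECTION ERROR: {exc}"  # abrupt close or timeout mid-handshake

for host in HOSTS:
    print(host, "->", probe(host))
```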
UPDATE 2023-11-27 08:30 UTC
Load balancers ✅
We've been monitoring the load balancers all weekend: the only desync was observed (and fixed right away by the on-call team) on old Sōzu versions (0.13), which still process 10% of Paris' public traffic. We plan to remove these old load balancers this week.
We consider the desynchronization issue resolved.
Cellar 👀
Last Friday, we configured Cellar's front proxies to lower their reload rate. We haven't seen any slowness since, but it was already hard to reproduce on our side. No slowness on Cellar was reported during the weekend, but we are still on the lookout.
UPDATE 2023-11-24 15:00 UTC
Load balancers 🥁
After more (successful) load tests, the new version of Sōzu (0.15.17) is being installed on all impacted public and private load balancers. Upgrades should be over within the next two hours.
Cellar 👀
The team continues to investigate the random slowness issues still encountered by some customers, which we are trying to reproduce in a consistent way.
UPDATE 2023-11-24 10:45 UTC
Load balancers 🥁
We've tested our new Sōzu release (0.15.17) all night with extra monitoring, and no lag or crash was detected. The only remaining issues were on the non-updated (0.13.6) instances; they were detected by our monitoring and restarted by the on-call team.
We are pretty confident that this new release solves our load balancers issues. We plan to switch all private and public Sōzu load balancers to 0.15.17 today and monitor them over the coming days.
Temporary incident:
While updating our configuration to grow the traffic share of the new (0.15.17) load balancers, a human mistake (and not a newly discovered bug) broke part of the configuration, causing many SSL version errors on 15% of requests between 09:25 and 09:50 UTC.
UPDATE 2023-11-23 18:43 UTC
Certificates ✅
As planned earlier, the renewal of all certificates in RSA 2048 has been completed, except for a few wildcards (mostly ours) which require manual intervention. This will be dealt with shortly.
Load balancers 🛥️
We were able to identify the root cause of our desync/lag in Sōzu: a specific request, triggering a ‘double bug’, was causing worker crashes. We developed fixes and are confident they will solve our problems. We’ll test them and monitor the situation before deploying them fully to production.
Cellar 👀
We’ve upgraded our load balancer infrastructure and monitoring tools to check whether this improves the various types of problems reported to us.
Original Status (2023-11-23 14:12 UTC)
1. Key management in Sōzu and security standards
Background: Two months ago, we migrated our automatic Let’s Encrypt certificate generation from RSA 2048 keys to RSA 4096 keys. Following a major certificate renewal in early November, this led to timeouts when processing requests, and then to 504 errors.
Actions:
- On Monday November 13, we rolled back key generation to RSA 2048 for all new certificates.
- On Monday November 20, we launched a complete key regeneration in RSA 2048, which requires an increase in our Let's Encrypt quotas (in progress).
Back to normal: within the day, once regeneration is finished.
Next steps: We have also explored a migration to the ECDSA standard, which according to our initial tests will enable us to improve both the performance and security levels of our platform. Such a migration will be planned in the coming months, after a deeper impact analysis.
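For context: a TLS server computes one private-key signature per full handshake, and RSA 4096 signatures are several times more expensive than RSA 2048, while ECDSA (e.g. P-256) is cheaper still. Here is a rough benchmark sketch using the Python cryptography package; absolute numbers depend on the machine, the relative gap is the point:

```python
# Rough benchmark of per-handshake signing cost for the key types discussed
# above. Requires `pip install cryptography`.
import time

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec, padding, rsa

MSG = b"\x00" * 32  # stand-in for the handshake transcript hash
N = 200             # signatures per measurement

def bench(name, sign):
    start = time.monotonic()
    for _ in range(N):
        sign(MSG)
    print(f"{name}: {(time.monotonic() - start) / N * 1000:.2f} ms/signature")

for bits in (2048, 4096):
    key = rsa.generate_private_key(public_exponent=65537, key_size=bits)
    bench(f"RSA {bits}",
          lambda m, k=key: k.sign(m, padding.PKCS1v15(), hashes.SHA256()))

ec_key = ec.generate_private_key(ec.SECP256R1())
bench("ECDSA P-256",
      lambda m: ec_key.sign(m, ec.ECDSA(hashes.SHA256())))
```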
2. HTTPS performance issues
Background: We noted a significant drop in HTTPS request processing performance, with capacity reduced from 8,000 to 4,000 requests per second, due in particular to an excessive number of syscalls via rustls.
Actions: We developed a Sōzu update and pushed it on November 16.
Back to normal: The problem is now resolved.
3. Load balancers desync/lag
Background: Load balancers sometimes get out of sync: Sōzu gets stuck in TLS handshakes or requests, and the workers no longer apply config updates, causing the proxy manager to freeze. The load balancers then miss all new config updates until we restart them.
Actions: We have improved our tooling to detect the root cause of the problem at a deeper level. We have been able to confirm that this concerns both Sōzu versions 0.13.x and 0.15.x.
Next steps: We'll be tracing the problem in greater depth within the day, to decide what short-term actions to take to mitigate it.
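One generic, external way to detect such a desync (an illustrative sketch, not necessarily the tooling we use) is to push a canary config change, e.g. a new test hostname, then query every load balancer instance directly and flag any that still serves the old state:

```python
# Sketch of an external desync check (illustrative only): after pushing a
# canary config change (e.g. a new test hostname), ask every load balancer
# instance directly and flag any that still serves the old state.
import http.client
import ssl

CANARY_HOST = "desync-canary.example.com"            # hypothetical canary hostname
LB_ADDRESSES = ["91.208.207.214", "91.208.207.215"]  # example instances to check

def serves_canary(lb_ip: str) -> bool:
    ctx = ssl.create_default_context()
    ctx.check_hostname = False   # we connect by IP on purpose
    ctx.verify_mode = ssl.CERT_NONE
    conn = http.client.HTTPSConnection(lb_ip, 443, timeout=5, context=ctx)
    try:
        conn.request("GET", "/", headers={"Host": CANARY_HOST})
        return conn.getresponse().status != 404  # 404 => config not applied yet
    except (ssl.SSLError, OSError):
        return False                             # handshake stuck => likely desynced
    finally:
        conn.close()

for ip in LB_ADDRESSES:
    state = "in sync" if serves_canary(ip) else "DESYNCED?"
    print(f"{ip}: {state}")
```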
4. Random slowness on Cellar
Background: Customers are reporting slowness or timeouts on Cellar, which we are now able to identify and qualify. While the cause has not been fully pinpointed, we have several ways of mitigating the problem.
Actions: Add capacity to the front-end infrastructure and enhance the network configuration.
Since the incident https://www.clevercloudstatus.com/incident/746, we have had TLS handshake issues that appear as timeouts. We are deploying a series of patches to solve these issues, and the situation should improve as they roll out.
Maintenance is planned for our Cellar add-on API; some calls will probably not be available for a couple of minutes. This will not impact any deployed service.
The maintenance will start today November 20, 2023 at 12:00 UTC+1.
EDIT 2023-11-20 12:10 UTC+1: Maintenance is starting
EDIT 2023-11-20 13:00 UTC+1: Maintenance is now over; the Cellar add-on API is fully available.
We are facing instabilities with our add-on load balancers in Paris; some connections may be randomly interrupted. We have identified the issue and are currently fixing it.
EDIT 18:20: All add-on load balancers have been fixed; we are actively monitoring their state.
EDIT 2023-11-20 18:49 UTC: The fix has not been as effective as we had hoped. We are issuing another fix. During the next few minutes, you might encounter connection refused errors when connecting to some add-ons.
EDIT 2023-11-20 18:59 UTC: The operations are done. We are now monitoring the situation.
Our monitoring has triggered alerts about the Clever Cloud API. We are investigating.
EDIT 15:00 UTC: The number of errors is decreasing; we are still investigating.
EDIT 15:30 UTC: We have identified the issue; we are deploying a patch.
EDIT 15:45 UTC: The patch has been applied successfully.
EDIT 16:00 UTC: The situation is back to normal; we are keeping watch.
We are currently scaling up our Kafka cluster for metrics usage. You may not see the latest datapoints.
EDIT 15 November 09:09 AM UTC: The cluster has been scaled up and partitions have been distributed among the new brokers.
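To illustrate the verification step: the confluent-kafka Python client can list where every partition replica lives, so you can confirm the new brokers carry their share after the reassignment (the bootstrap address is a placeholder):

```python
# Sketch: verify partition replicas are spread across brokers after a
# scale-up. Requires `pip install confluent-kafka`.
from collections import Counter

from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka.internal.example:9092"})
meta = admin.list_topics(timeout=10)

print("broker ids:", sorted(meta.brokers))

load = Counter()  # replica count per broker id
for topic in meta.topics.values():
    for partition in topic.partitions.values():
        for broker_id in partition.replicas:
            load[broker_id] += 1

for broker_id, count in sorted(load.items()):
    print(f"broker {broker_id}: {count} partition replicas")
```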
Our main API is responding slowly. We are investigating to find out why.
(times in CET)
- EDIT 11:54 - The API has been redeployed, which solved the issue. We are still investigating the root cause.
- EDIT 12:09 - The API is down; we are working on it.
- EDIT 12:23 - The API is recovering; our load balancers in the Paris and Scaleway regions are down.
- EDIT 12:30 - Load balancers in the Scaleway region are now operational.
- EDIT 12:33 - Load balancers in the PAR region are now operational. The API is now fully operational.
- EDIT 14:17 - We found and fixed what may be the root cause of the API problems. The API seems to be running fine; we are keeping an eye on it.
Maintenance is planned for our API; some API calls will probably not be available for a couple of minutes. This will not impact any deployed service.
The maintenance will start today November 14, 2023 at 12:00 UTC+1.
EDIT 2023-11-14 12:09 UTC+1: The API is down; we are postponing this maintenance.
EDIT 2023-11-14 15:20 UTC+1: Maintenance is starting
EDIT 2023-11-14 17:10 UTC+1: Maintenance has been completed successfully.
We are detecting performance issues on our new metrics storage layer. You may not see fresh datapoints. We are working on it.
EDIT Mon Nov 13 18:51:01 2023 UTC: Config tuning has been applied; the cluster is now fully recovered. The lag will be resolved within minutes.
We are detecting a high number of errors on our reverse proxies for cleverapps domains; we are investigating.
EDIT: fixed
One of our load balancers randomly closes connections during the TLS handshake.
EDIT 16:35: This issue has been fixed.
Maintenance is planned on our API; it will only impact viewing add-ons and creating new ones. All already-deployed add-ons will remain available.
This concerns only the Jenkins, Elasticsearch, MySQL, PostgreSQL, MongoDB, and Redis APIs.
For each kind of add-on, expect a downtime of 20 to 30 minutes.
The maintenance will start tonight November 9, 2023 at 21:00 UTC.
EDIT 2023-11-09 22:00 UTC+1: Maintenance is starting.
EDIT 2023-11-10 01:00 UTC+1: Maintenance is now completed.
We are seeing issues with our APIs. We are investigating this.
Edit 2023-11-09: We are keeping this incident open, as the performance issues seem to have diminished but not vanished. There seems to be a seasonality to these issues; we are still investigating why we get these surges in load.
Customers have complained about slow request handling on Cellar for a few days. On random requests, Cellar might take up to a few seconds to send the first byte.
We are investigating this issue.
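Time-to-first-byte is straightforward to measure from the client side, which is how such outliers can be spotted. A minimal sketch using only the standard library (the URL is a placeholder for a publicly readable object):

```python
# Sketch: measure time-to-first-byte (TTFB) on repeated Cellar requests to
# spot the random multi-second outliers described above. The measurement
# covers response headers plus the first body byte.
import time
import urllib.request

URL = "https://example-bucket.cellar-c2.services.clever-cloud.com/some-object"
SAMPLES = 20

for i in range(SAMPLES):
    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=30) as resp:
        resp.read(1)  # wait for the first byte of the body
        ttfb = time.monotonic() - start
    print(f"request {i}: TTFB {ttfb * 1000:.0f} ms")
```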
We are having difficulties reaching our main API (api.clever-cloud.com). We are investigating.
EDIT 16:30 UTC: The main API is reachable; we are investigating the root cause, which seems related to the database.
EDIT 16:40 UTC: We detected that the database was short on capacity; we have increased its capacity, rebooted the database, and are redeploying the API.
EDIT 18:00 UTC: The main API is reachable.
One hypervisor went down in the rbx region. We are trying to reboot it.
All the applications on that HV are being redeployed. A few add-ons that are on it are unavailable.
The hypervisor would not reboot from our OVHcloud interface. We asked support, and they brought it back up.
12:28 UTC: The HV is running; we are starting the cleaning procedure and making sure all the add-ons have restarted correctly.
Our main API is encountering issues when logging into our database. This may make some requests fail randomly. We are investigating.
The access logs APIs are having issues. We are working on it.
All access logs are still stored, but the API will not return the most recent ones (up to two weeks).
Edit 2023-11-14: We are still working on making the access logs available from the API.