Clever Cloud Status

Incidents

Full history of incidents.

October 2021

Fixed · Cellar · Global

There is an issue with the certificate associated with the Cellar service of our HDS zone. We are investigating.

12:58 UTC: The issue is resolved.

Here is what we know so far:

The revocation server of the Certificate Authority that issued this certificate reports that it was revoked on 2021-06-23, yet it was still being accepted just fine a few hours ago.

We have asked for a reissue of the certificate (this is an automatic operation). The reissued certificate has been installed and is working fine. Meanwhile, we have asked the CA why this certificate was revoked without any warning or notice and are waiting for an answer.
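
For reference, a revocation check of this kind can be reproduced with a short script. The sketch below, assuming the Python "cryptography" package and placeholder file names, asks the CA's OCSP responder for a certificate's status; it is an illustration, not our actual tooling.

    import urllib.request

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.x509 import ocsp
    from cryptography.x509.oid import AuthorityInformationAccessOID, ExtensionOID

    # Placeholder file names: the certificate to check and its issuer.
    with open("server.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    with open("issuer.pem", "rb") as f:
        issuer = x509.load_pem_x509_certificate(f.read())

    # The OCSP responder URL is published inside the certificate itself,
    # in its Authority Information Access extension.
    aia = cert.extensions.get_extension_for_oid(
        ExtensionOID.AUTHORITY_INFORMATION_ACCESS
    ).value
    ocsp_url = next(
        desc.access_location.value
        for desc in aia
        if desc.access_method == AuthorityInformationAccessOID.OCSP
    )

    # Build a DER-encoded OCSP request. SHA1 is used for the CertID
    # because many responders still expect it.
    ocsp_request = (
        ocsp.OCSPRequestBuilder()
        .add_certificate(cert, issuer, hashes.SHA1())
        .build()
    )
    http_request = urllib.request.Request(
        ocsp_url,
        data=ocsp_request.public_bytes(serialization.Encoding.DER),
        headers={"Content-Type": "application/ocsp-request"},
    )
    with urllib.request.urlopen(http_request) as http_response:
        ocsp_response = ocsp.load_der_ocsp_response(http_response.read())

    # A REVOKED status carries the revocation time; in our case it was
    # months in the past for a certificate that was still being accepted.
    print(ocsp_response.certificate_status)
    print(ocsp_response.revocation_time)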

Fixed · Git repositories · Global

We are currently seeing git push errors, at least when using the HTTP protocol: the connection gets refused. This mostly impacts pushes from our CLI.

We are investigating the issue.

EDIT 10:00 UTC: The issue has been fixed; pushes using the HTTP protocol should now be working as intended. Pushes and clones using the SSH protocol were not impacted. We will investigate the issue further.

Fixed · Deployments · Global

Some applications have the wrong CC_JAVA_VERSION environment variable value, which may lead to unexpected deployment or runtime errors if the application redeploys. We are looking into it.

EDIT 18:45 UTC: CC_JAVA_VERSION should now be set to the right value. Impacted applications are redeploying to make sure they use the right version.

EDIT 18:58 UTC: If you changed the value of CC_JAVA_VERSION between 09:30 UTC and 18:45 UTC, the value might have been reverted to its previous version. Make sure you set it back to the right value if needed. Sorry for the inconvenience.
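
If you want to double-check whether your application is affected, a sanity check along the lines of the sketch below compares CC_JAVA_VERSION with the JVM actually on the PATH. The banner parsing is an illustrative assumption about the usual "java -version" output, not an official Clever Cloud check.

    import os
    import re
    import subprocess

    expected = os.environ.get("CC_JAVA_VERSION", "")

    # "java -version" historically prints its banner on stderr.
    banner = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    ).stderr

    # Banners look like: openjdk version "11.0.12" (or "1.8.0_302" for 8).
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    major = match.group(1) if match else "?"
    if major == "1" and match and match.group(2):
        major = match.group(2)  # the "1.8" style means Java 8

    if expected and major != expected:
        print(f"warning: CC_JAVA_VERSION={expected} but the JVM reports {major}")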

Fixed · Global

Our documentation is currently broken; we are looking into it. If you have any technical questions, feel free to contact our support team in the meantime.

EDIT 14:57 UTC: The problem has been fixed; the documentation should now be fully accessible at https://www.clever-cloud.com/doc/

Fixed · Infrastructure · Global

A Let's Encrypt root certificate expired yesterday. This can lead to various TLS errors for old clients, the most common being "Certificate date is invalid".

Our Let's Encrypt certificates already provide the up-to-date Let's Encrypt chain, but some older clients might not be able to trust that new chain because they don't have the new root Certificate Authority in their truststore. If you are in this situation with clients you cannot update, we can sell certificates that will be trusted by those older clients. You can contact our support team with the domains you need to protect.

You can also find more information about this expiration on the Let's Encrypt website: https://letsencrypt.org/docs/dst-root-ca-x3-expiration-september-2021/
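
As a minimal sketch, assuming only Python's standard library (the host name and bundle path are placeholders), here is one way to test whether a given truststore still validates a domain's chain:

    import socket
    import ssl

    def chain_is_trusted(host: str, cafile: str, port: int = 443) -> bool:
        # Trust ONLY the given bundle, mimicking an old client whose
        # truststore lacks the new ISRG Root X1 certificate.
        context = ssl.create_default_context(cafile=cafile)
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                with context.wrap_socket(sock, server_hostname=host):
                    return True
        except ssl.SSLCertVerificationError:
            # The failure mode old clients hit after the root expired.
            return False

    print(chain_is_trusted("example.com", "old-truststore.pem"))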

September 2021

Fixed · API · Global

Some API endpoints are returning high rates of HTTP 500 Internal Server Error responses. We are investigating.

EDIT 14:57 UTC: A fix has been pushed and the errors should be resolved. We are continuing to monitor the situation.

EDIT 15:19 UTC: No more Internal Server Errors are occurring; this incident is now closed.

Fixed · Access Logs · Global

The metrics and access logs are currently unavailable. We are looking into it.

EDIT 06:43 UTC: Queries should be back to normal; the ingestion lag should take a few minutes to clear.

EDIT 11:12 UTC: Everything is back to normal.

Fixed · FS Buckets · Global

FS Buckets are not mounting properly on new deployments. We are investigating.

EDIT: New hypervisors had been added, but they did not yet support FS Buckets.

Fixed · Access Logs · Global

Metrics and access logs are currently partially unavailable to query. We are investigating.

EDIT 14:35 UTC: The root cause has been identified. The ingestion lag currently sits at around 2 hours, so metrics queries will be out of sync for the time being. Access logs are not being ingested and are currently kept in a separate queue. We expect the lag to start decreasing later tonight. This incident is a follow-up to yesterday's urgent maintenance, which mainly aimed to better stabilize the cluster.

EDIT 23:34 UTC: Metrics have been fully ingested; access logs are still delayed but are currently being written. Queries might still be slow; this is expected.

EDIT 06:30 UTC: The situation is back to normal.

Fixed · Access Logs · Global

Following maintenance, access logs and metrics may be unavailable for some queries, and ingestion may lag. Request overviews for applications and organizations may also be impacted.

EDIT 16:17 UTC: The maintenance is still ongoing. Reads and writes have been disabled since 15:42; this is expected.

EDIT 21:50 UTC: The maintenance is finished. Ingestion is catching up.

Fixed · Services Logs · Global

Logs & log drains are experiencing issues.

EDIT 15:14 UTC: Fixed; the related drains are currently catching up.

August 2021

Fixed · Reverse Proxies · Global

Between 15:48 UTC+2 and 15:50 UTC+2, one of our reverse proxies unexpectedly timed out during a maintenance upgrade on our Paris zone.

The timeouts lasted for about 2 minutes before the proxy was taken out of the pool.

Some requests might have failed during the first minute, and then all requests handled by this proxy failed during the remaining minute. Additional investigation will be performed to analyze what happened.

Fixed · Deployments · Global

We are observing slow downloads and errors when using yarn or npm; this may slow down deployments, especially for Node applications with many packages. This seems to be an issue on npm's side and does not seem to be restricted to Clever Cloud.

Update: The npm registry posted on their status page confirming the incident, and they are working on a fix: https://status.npmjs.org/incidents/bydjtj102gsn

Update: The issue has now been fixed; Node deployments are back to normal.

Fixed · Global

Elasticsearch add-ons on versions 7.10 and above are affected by security vulnerabilities. Those add-ons will be updated to Elasticsearch 7.14.0 on August 16, 2021, starting at 21:00 UTC+2. Add-ons that need to be upgraded will be unavailable for about 10 minutes.

Affected customers have been e-mailed about this and can contact our support team for any additional questions.

EDIT 21:05 UTC+2: Update is beginning.

EDIT 22:00 UTC+2: Updates are over and were successful for most of the add-ons. Owners of add-ons that couldn't be updated will be contacted. If you encounter any issue following this update, please reach out to our support team.

Fixed · Global

This maintenance operation is a follow-up to this incident: https://www.clevercloudstatus.com/incident/378.

We will be switching back to the original server (which has been fixed by the manufacturer). The server should be down for 10 minutes if our provider does not encounter any issues (it may last up to an hour otherwise).

Affected customers have been e-mailed about this and can migrate their add-ons automatically beforehand.

2021-08-25 19:02 UTC: Server is going down.

19:17 UTC: This is taking longer than expected. Server management software decided to reapply firmware settings; this takes a few minutes.

19:24 UTC: Server is up. Add-ons are starting up.

19:26 UTC: Everything is up. Incident is over.

E-mails delayed
Fixed · Global

Our e-mail provider is investigating an issue with their API. Until their incident is resolved, our e-mails will be delayed by an unknown amount of time.

You can follow their incident here: https://status.mailgun.com/incidents/jj6fx7nqwn9t

21:19 UTC: Incident is resolved.

July 2021

Fixed · MongoDB shared cluster · Global

09:27 UTC - The monitoring agent on one of the nodes stopped responding, then started responding again a few seconds later. We put it down to a random network error.

09:39 UTC - Some users start to have issues using their free add-on.

09:40 UTC - We investigate. It turns out that the mongodb process was no longer really listening for incoming connections.

09:42 UTC - We try to restart the service.

09:46 UTC - We actually reboot the whole VM.

09:47 UTC - The VM is up and running; the mongodb process is cleaning itself up.

09:50 UTC - The mongodb process finishes its cleanup and starts accepting connections again.
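
The gap between 09:27 and 09:40 is why a simple host-level check was not enough: the VM was reachable while the mongodb process had stopped accepting connections. A liveness probe along the lines of the sketch below, which assumes pymongo and a placeholder connection string (this is not our actual monitoring code), forces a real round-trip to the server:

    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    client = MongoClient(
        "mongodb://mongodb-node.example:27017",  # placeholder address
        serverSelectionTimeoutMS=2000,           # fail fast instead of hanging
    )
    try:
        # "ping" requires an actual answer from the server, unlike merely
        # creating the client, which is lazy and always succeeds.
        client.admin.command("ping")
        print("mongodb is accepting connections")
    except PyMongoError:
        print("mongodb is up but not accepting connections (or unreachable)")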

Fixed · Infrastructure · Global

Starting at 5:54 UTC and up until 6:12 UTC, some hypervisors experienced higher CPU load. This higher load may have slowed down applications and add-ons that were hosted on those hypervisors.

The root cause has been identified. Unfortunately, the higher load also triggered a lot of redeployments with the Monitoring/Unreachable cause. Most of them were cancelled in time but some of them went through. Some of the deployments that started did not finish correctly and ended up in a blocked state. Those deployments are currently being cancelled and all cancellations should be over in a few minutes.

We have developed a fix that will prevent those events from happening again; it will be deployed in the next few hours.

Fixed · Infrastructure · Global

Starting at 18:13 UTC and up until 18:43 UTC, some hypervisors experienced higher CPU load. This higher load may have slowed down applications and add-ons that were hosted on those hypervisors.

The root cause has been identified and the issue has been fixed. Unfortunately, the higher load also triggered a lot of redeployments with the Monitoring/Unreachable cause. Most of them were cancelled in time but some of them went through. Some of the deployments that started did not finish correctly and ended up in a blocked state. Those deployments are currently being cancelled and all cancellations should be over in a few minutes.

We will investigate this increased CPU load in more depth and see how we can better prevent it.

Fixed · Infrastructure · Global

Communication between our two Paris datacenters was unavailable for less than a minute, between 19:03 UTC and 19:04 UTC.

We do not have any more information at the moment (though it is most likely a routing issue). Everything is working fine now except for Metrics (and access logs), which will come back in a few minutes.

EDIT 19:33 UTC: It happened again at 19:29 UTC. We are awaiting more information from our network provider.

EDIT 19:42 UTC: It happened again at 19:41 UTC.

--

This was due to maintenance on one of the fiber optic links between our two Paris datacenters. Our network provider had not been made aware of this maintenance, which caused the connection to switch back and forth between links as the affected link went up and down.