Clever Cloud Status

Incidents

Full history of incidents.


June 2019

Fixed · Cellar · Global

A node from the old Cellar cluster restarted at 21:30 UTC. While it recovered at first, thanks to the restart of a few nodes a few days ago, it then started emitting HTTP 500 errors or timeouts, as before. Service should be back online in a few hours, once the cluster has stabilized again.

The new Cellar cluster is not impacted by those issues.

EDIT 23:40 UTC: The cluster now seems to be in good shape again.

Fixed · Deployments · Global

Some deployments seem to keep building (even if the build succeeds, another build starts). We are looking into it.

EDIT 12:33 UTC: We may have identified the root cause. It may be due to a change that happened this morning. We will revert it.

EDIT 12:43 UTC: The change has been reverted and we confirm that it resolves the issue. Sorry for the inconvenience.

Fixed · Cellar · Global

One node of our old Cellar cluster is restarting; some requests are failing (timeouts or 500 errors). This will be resolved once the node has fully restarted. We may need to restart more nodes right after.

EDIT 23:30 UTC: Other nodes need to be restarted. We saw <1% of requests failing; expect a similar amount for the remaining restarts.

EDIT 02:00 UTC: Nodes have been restarted. The rate of failing requests keeps decreasing and is still under 1%.
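During a rolling restart like this one, well under 1% of requests fail, and a failed request usually succeeds when retried against the recovered cluster. A minimal client-side retry sketch (not Clever Cloud tooling; the wrapper and its defaults are illustrative):

```python
import random
import time

def with_retries(fn, attempts=5, base_delay=0.1):
    """Call fn(); on exception, retry with exponential backoff and jitter.

    Useful for masking transient 500s/timeouts while a storage cluster
    restarts, since only a small fraction of requests fail at any moment.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted all attempts, surface the error
            # back off exponentially, with jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.05))
```

Any S3-style GET or PUT against the cluster can be wrapped in `with_retries(lambda: ...)` during the maintenance window.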

Fixed · API · Global

Our main API is currently having trouble responding to requests in a timely manner. We are investigating.

EDIT 15:32 UTC: The issue has been identified, we are currently re-deploying the API. Console is still unavailable.

EDIT 15:34 UTC: The API successfully redeployed and is now available. Console is now available too. The incident is over.

Fixed · Global

We will restart part of our logs system; it will take about 2 minutes. Logs produced during the restart will still be available, but their ordering will be lost. This restart is part of our new logs system development.

EDIT 14:27 UTC: Finished.
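Since the logs themselves survive and only their ordering is lost, ordering can be restored on the client side by sorting on the timestamps the lines carry. A small sketch, assuming lines start with an ISO-like timestamp (the format string is an assumption, adjust it to your actual log layout):

```python
from datetime import datetime

def sort_log_lines(lines, ts_format="%Y-%m-%dT%H:%M:%S%z"):
    """Re-order log lines whose first whitespace-separated token
    is a timestamp matching ts_format."""
    return sorted(
        lines,
        key=lambda line: datetime.strptime(line.split(" ", 1)[0], ts_format),
    )
```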

Fixed · Console · Global

Two-Factor Authentication is currently down, preventing affected users from logging in.

May 2019

Fixed · Cellar · Global

Cellar c1 may be having issues: some nodes are not restarting correctly.

Fixed · Services Logs · Global

Log ingestion is currently having issues. We are investigating.

EDIT 15:31 UTC: The issue has been fixed. Some logs were lost, but not all of them: you should have the last ~15 minutes. The buffer wasn't large enough to keep them all; we will increase it next week.

Fixed · MySQL shared cluster · Global

One of our MySQL shared clusters is under high load. We are investigating.

EDIT 16:20 UTC: Problematic queries have been killed and the cluster load is going down. We continue to monitor the situation but it should go back to normal. We also have a newer MySQL shared cluster on MySQL version 8. You can migrate your database to it using the "Migrate" tool.

EDIT 16:45 UTC: The performance issue is back; we are trying to narrow it down.

EDIT 17:00 UTC: Performance is back to normal again. We will keep an eye on it. Meanwhile, do not hesitate to migrate to our new cluster to avoid this issue.

EDIT 10/05/19 08:10 UTC: The issue has come back.

EDIT 10/05/19 12:00 UTC: Owners of the potentially abusive queries have been notified. Cluster performance is back to normal. As usual, we will keep an eye on it.
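Finding the abusive queries on a shared MySQL cluster typically starts from a snapshot of `information_schema.PROCESSLIST` (real MySQL columns: `ID`, `USER`, `COMMAND`, `TIME`, `INFO`). A sketch of filtering such a snapshot for long-running queries; this is an illustrative helper, not the tooling Clever Cloud actually used:

```python
def long_running_queries(processlist, threshold_seconds=60):
    """Filter a PROCESSLIST snapshot (rows as dicts, e.g. fetched with
    `SELECT * FROM information_schema.PROCESSLIST`) down to active
    queries running longer than the threshold."""
    return [
        row for row in processlist
        if row.get("COMMAND") == "Query"          # skip idle connections
        and row.get("TIME", 0) >= threshold_seconds
    ]
```

Idle (`Sleep`) connections are excluded even when old, since only active queries load the cluster.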

Fixed · Services Logs · Global

We have an issue in the logs ingestion pipeline. We are working on it.

EDIT: Issue resolved at 15:48:20 UTC

April 2019

Fixed · Cellar · Global

We are recording significant error rates on Cellar (HTTP 500 and 503 errors).

EDIT 09:41 UTC: 503 errors are now gone but have been replaced by 500 errors triggered after a few seconds. We are checking the cluster's state.

EDIT 10:10 UTC: The error rate is decreasing but remains significant. Deployments are also impacted by this issue if you are using the build cache.

EDIT 10:27 UTC: The error rate is still at ~20% and continues to decrease.

EDIT 11:52 UTC: We have not received any errors since 11:40 UTC; the cluster is now in good shape and everything should be back to normal.

This Cellar cluster will soon be deprecated (new Cellar add-ons are already created on an up-to-date cluster) in favor of a better, maintained version.

Fixed · API · Global

Clever Cloud API is very slow, we are investigating.

EDIT 16:11 UTC: fixed.

Fixed · Reverse Proxies · Global

We are experiencing issues with access to applications and add-ons. We are investigating and will come back with more information.

EDIT 17:56 UTC: systems are getting back to normal. It was a DNS resolver problem.

EDIT 17:58 UTC: fixed.

Fixed · API · Global

There is a problem that prevents add-on migrations (if you start or have a running migration, the process will fall back to the previous state without problems).

Fixed · Global

The cleverapps.io domain has been marked as dangerous. As far as we know, one or more subdomains have been reported as dangerous and the complete domain has been added to the list.

This means that your browser may show you a security alert when visiting a cleverapps.io site.

We are looking into reporting the mistake to the relevant lists and services.

Meanwhile, we remind our users that they should never use a cleverapps.io domain for production; cleverapps.io domains should only be used for development and tests.

March 2019

Fixed · Cellar · Global

A cellar-c1 node crashed in a way that put a very high load on all other nodes; this is causing general slowness and an elevated error rate.

It should gradually go back to normal, within an hour at most.

EDIT 17:00 UTC: Error rate and performance are back to normal.

Cellar issues
Fixed · Cellar · Global

We are experiencing issues on our Cellar features.

EDIT 7:25 UTC: multipart uploads are down; fixes are ongoing.

EDIT 15:38 UTC: the cluster has been fixed; everything is back to normal.

Fixed · Infrastructure · Global

We are experiencing a network issue on the older infrastructure in Paris. We are investigating.

EDIT 06:12 UTC: The network issue is over. This was an issue with our provider which affected all our servers but not all at the same time. Nothing was actually fully unreachable at any point in time but there was a lot of packet loss.

Fixed · Global

We are getting reports from some SFR network users who cannot access the Clever Cloud Console. It seems to impact only some SFR customers.

EDIT 9:19 UTC: This only affects the older SFR network, not the SFR-Numericable network. This specifically affects all SFR peering going through TH2.

EDIT 9:50 UTC: This was resolved at 9:36:30 UTC; if you are still experiencing issues, please tell us.

Network outage
Fixed · Infrastructure · Global

A network issue is happening. Applications may be unreachable.

The Console is partly down, and some APIs are down.

EDIT 18:20 UTC: Here is the history and context of the network issue:

At 17:25, a maintenance on a component of a redundant network link caused one of the underlying links to fail. For reasons unknown at this time, the failing link was elected and about 30% of packets were lost until 17:29.

At 17:30, the network engineer decided to revert the change; this caused additional loss for about 30 seconds. Network was back to normal at 17:31.