Clever Cloud Status

Incidents

Full history of incidents.

December 2019

Fixed · Reverse Proxies · Global
  • 2019-12-17 11:00, public reverse proxies (using haproxy) start experiencing performance issues. Requests take a long time to process.
  • 14:00, we add two (sozu) reverse proxies to the pool.
  • 14:15, sozu proxies are actually experiencing issues too. We remove them from the pool and start cleaning them.
  • 15:12, sozu proxies have been cleaned, updated and restarted. We add them to the proxy pool again.
  • 16:25, things seem to run smoothly. We consider this issue fixed.
Fixed · Access Logs · Global

Metrics ingestion is delayed due to a very high load on the storage platform, caused by a maintenance operation linked to the previous incident.

17:45 UTC: Ingestion is back to normal performance; the delay will be back to normal in 15 minutes.

Fixed · Access Logs · Global

Metrics collection is currently having issues and up-to-date metrics have been unavailable for about 30 minutes. All metrics are still being stored but cannot be retrieved for now. We are looking into it.

18:42 UTC: We are still working on it. This is a never-before-seen, massive issue, so we are unable to give any ETA at this time.

22:35 UTC: The issue has been narrowed down and is now being resolved. We will wait until tomorrow morning to continue restoring this service. All metrics gathered before this incident are still accessible; only new metrics are not. Those are currently stored and will be processed once the Metrics cluster goes back to normal. More news tomorrow morning.

12:00 UTC: We have been back working on this since 7:30 UTC. Things are looking good; there are still at least a few hours to go.

13:55 UTC: The issue with the storage platform is now finally fixed. Ingestion is running at full speed and catching up; it is processing the 22 hours of data that accumulated during the incident.

15:25 UTC: We are about halfway there.

16:50 UTC: We are 4/5 of the way there. It should be resolved in under an hour.

17:30 UTC: You should already see recent points in your applications' metrics. The delay will be back to normal in less than 30 minutes. Closing this off.

November 2019

Fixed · Services Logs · Global

We are experiencing issues with the logs collection system in public zones.

EDIT 19:44 UTC: fixed; logs collection is catching up on its lag.

EDIT 19:49 UTC: back to normal state.

Fixed · Cellar · Global

Some requests are hanging when talking to our Cellar cluster. We are investigating the issue.

EDIT 11:03 UTC: The cluster is now back up. A node was shut down for maintenance, as has already happened several times these past weeks. Somehow the data it hosted was unavailable even though replicated data is available on other nodes. We will investigate this incident further.

Fixed · Cellar · Global

Cellar-c2 is having trouble with TLS connections; we are working on it.

08:16 UTC: The issue is resolved.

Fixed · Infrastructure · Global

We are currently having network issues across the platform. We are investigating the issue.

EDIT 16:32 UTC: The network issue seems to be resolving. Only one of our datacenters had the issue, but it may have impacted applications and add-ons that weren't in this datacenter.

EDIT 16:36 UTC: The Console is unstable because of Clever Cloud API issues caused by the datacenter network problems.

EDIT 16:40 UTC: Our network provider is already aware of the issue and is looking into it.

EDIT 17:00 UTC: Our datacenters still have issues; we are working on it with our provider.

EDIT 17:17 UTC: The network issue on our datacenters is over, but it caused additional issues. The API is currently having issues and our Console is unreachable at the moment.

EDIT 17:34 UTC: The Console and API are up again, and we are making sure that all services are up and running.

EDIT 19:26 UTC: The incident is over; nothing has come up since 17:34 UTC.

We are still waiting for more information from our network provider, which we will add here as soon as we get it.

The network perturbation lasted from 16:18 UTC to 16:30 UTC. One of our datacenters experienced high packet loss due to routing issues. Those issues only impacted external traffic (communication between our 2 datacenters was not affected). Applications and add-ons were UP but, because of those routing issues, you may have experienced difficulties reaching your applications.

Those issues also impacted some of our systems and made our API and Console unavailable for 1 hour, during which deployments were also not working.

Fixed · API · Global

SSL certificate generation is experiencing issues due to problems on Let's Encrypt's side: https://letsencrypt.status.io/.

Fixed · Infrastructure · Global

A network issue is occurring: a node does not route correctly to one of our DCs. We are working around it in the meantime. IPs in the 46.252.181.0/24 range are unreachable.

EDIT 05:30 UTC: Routes have been updated to avoid the faulty router. Traffic is back to normal.

Fixed · SSH Gateway · Global

We are currently experiencing problems with the SSH Gateway; we are investigating the issue in order to fix it.

Git push issue
Fixed · Deployments · Global

From 13:41 to 14:06 UTC, git push no longer triggered a deployment. If you did a git push during that window, please use the "restart last pushed commit" button to actually deploy the latest commit.

October 2019

Fixed · Access Logs · Global

Metrics are unavailable because multiple nodes of the indexing system went down simultaneously. They are reloading their indexes in memory.

Service should be back in 15 minutes.

Meanwhile, ingestion is still working fine.

15:01 UTC: Incident is over.

Fixed · Access Logs · Global

We are experiencing an issue on the Metrics service which is due to an error while adding capacity to the storage cluster. We are working on it.

10:26 UTC: The ingestion issue is fixed; the system is now catching up.

10:33 UTC: The ingestion delay is almost back to normal.

10:36 UTC: There is still a bit of a lag but it should come back to normal in a few minutes. Read performance is still a bit hit or miss but coming back to normal as well. We will reopen the incident if it does not.

11:06 UTC: The ingestion lag is increasing. We are investigating. This may take a while.

11:30 UTC: The cause has been identified and partially fixed.

11:37 UTC: Lag is now <5s; we are currently working on fixing the issue in a more permanent way.

11:45 UTC: The issue is now fixed.

Fixed · Services Logs · Global

We have an issue with logs collection in the Paris zone. We are working on it.

13:20 UTC: The issue has been identified and at least partially fixed. Logs are coming through but we are still making sure that everything is indeed fine.

13:25 UTC: The issue is indeed fixed. Some older logs are still being collected.

13:33 UTC: Incident is over.

September 2019

Fixed · MongoDB shared cluster · Global

A free shared MongoDB cluster has too many open connections, which prevents new connections from working. We are looking into which user(s) are opening too many connections, and we will start a new cluster to alleviate the issue. We have no immediate solution; sorry for the inconvenience.

08:40: The problem has been alleviated by allowing more connections. This will slow down the service, but you can at least connect to your databases and migrate to paid add-ons if you were using this service for production. We will start a new cluster very soon to improve performance.

Fixed · Deployments · Global

False positives in monitoring are triggering a lot of deployments, making legitimate deployments harder to process.

17:21 UTC: Incident is over. A monitoring component was still complaining about a few applications in a loop; there was no actual issue, just a very overzealous alerter process. Deployment performance has been back to normal since 16:43, however.

Fixed · Deployments · Global

Deployments are delayed because of an unusually high number of deployments to process.

12:44 UTC: The delay is now back to normal. Some deployments may still be stuck, though; please contact us if you are experiencing such an issue.

Fixed · Global

There seems to be a problem with downloads and uploads of build cache archives, causing them to hang. It is resolved for now, but we are watching closely to see if the problem reappears.

Fixed · Services Logs · Global

Our logging infrastructure (including live logs) is experiencing issues.

EDIT 20:51 UTC: fixed.

EDIT 23:19 UTC: the logging infrastructure is experiencing issues. We are working on a fix.

EDIT 23:25 UTC: fixed.

August 2019

Fixed · MySQL shared cluster · Global

The cluster is currently down. If it can't be brought back up, a failover will be issued.

EDIT 00:00 UTC: The cluster is now available again; no failover happened.