Incidents

Full history of incidents.

July 2022

Tickets center maintenance in the Clever Cloud Console 3 years ago

Fixed · Global

At 19:00 UTC+2 on Tuesday 26th July 2022, our tickets center the Clever Cloud Console will be unavailable for a few minutes.

Once the maintenance is over, you will have to refresh your Clever Cloud Console to be able to access your tickets or contact our team.

During this maintenance, you will still be able to reach our support team using our email address: support@clever-cloud.com

EDIT 2022-07-26 18:59 UTC+2: The maintenance is about to start.

EDIT 2022-07-26 19:10 UTC+2: The maintenance is now over. You will need to refresh your Clever Cloud Console to access the ticket center.

[RETROACTIVE][PAR] Shared RabbitMQ cluster publishers unavailability 3 years ago

Fixed · RabbitMQ shared cluster · Global

The cluster refused publishers messages since 12:32:34 UTC due to a system wide alert that stopped all nodes from accepting publishers messages. This means that RabbitMQ clients would keep trying to publish their messages until the cluster accepted them. The node has been restarted at 17:23 UTC, fixing the issue.

Investigations will be carried out to understand how this happened and why our monitoring did not raise an alert.

The cluster should now be fully operational.

One FSBucket node is down 3 years ago

Fixed · FS Buckets · Global

One of our FSBucket node is currently down. We are working to resolve the issue.

EDIT 10:40 UTC - fixed.

Deployments may be blocked 3 years ago

Fixed · Deployments · Global

In rare occasions, an inappropriate behavior of our scheduling infrastructure can lead to deployment being stuck. We've identified the root cause and we're qualifying a fix. If it happens, don't hesitate to reach our support team.

Connectivity issue between PAR and NYC 3 years ago

Fixed · Deployments · Global

We are experiencing connectivity issue between NYC and PAR. These connectivity issue are impacting deployments on the NYC zone. We are working on it.

EDIT 16:03 pm: connectivity has been resolved

Metrics maintenance 3 years ago

Fixed · Access Logs · Global

In our efforts to stabilize the Metrics infrastructure, we will perform a maintenance on 13 of July. Once it is started, some lag can be expected for a few hours.

Maintenance will start at 07:30 am UTC

EDIT 07:30 am UTC: Starting maintenance

EDIT 08:16 am UTC: Maintenance is over, we are catching up with the lag

EDIT 08:30 am UTC: Queries are currently disabled to speed up recovery

EDIT 09:17 am UTC: our maintenance triggered a major compaction on our storage layer. To speed up recovery, query are still disabled

EDIT 16:20 pm UTC: major compaction is over. We are struggling to handle both read and write operations at the same time. We are working on it.

EDIT 20:23 pm UTC: queries are still disabled. We are testing new configurations to resolve the issue

EDIT 14 of July 9:22 am UTC: it's a brand new day, we are still working on it.

EDIT 14 of July 18:26 pm UTC: We are struggling to handle both read and write operations at the same time. We are working on it. Happy french national day.

EDIT 16 of July 17:35 pm UTC: We found a performance issue triggered when the dotmap on the Console is accessed. We disabled some macros used to retrieve data to allow other users to access metrics. Metrics and access logs are now accessible.

[RETROACTIVE][PAR] Cellar unavailability 3 years ago

Fixed · Cellar · Global

The service was having troubles handling most of the requests between 11:24 and 11:28 UTC. We will investigate further the issue. The Cellar service is currently operational.

Roubaix: intermittent network failures 3 years ago

Fixed · Infrastructure · Global

Starting 09:31 UTC, we saw intermittent network failures on the Roubaix (RBX) zone hosted on OVH. Failures are both from the external and internal networks. Timeouts reaching your applications or add-ons might have happened.

Some applications are being redeployed for Monitoring/Unreachable because the monitoring couldn't see them anymore.

Things seem to be working fine again since 09:37 UTC. We continue to monitor the situation and will try to get more information from OVH.

EDIT 11:12 UTC: The issue has not occurred again. We will wait for any input from OVH and will add it here if we get any useful information.

Roubaix: an hypervisor has been lost 3 years ago

Fixed · Infrastructure · Global

An hypervisor has been lost on the OVH Roubaix zone. We are investigating. Impacted services are FSBuckets and add-ons.

EDIT 15:32:00 UTC: The server is back online. We are making sure services are correctly restarted. Additional services were impacted: One application reverse proxy and one add-on reverse proxy were unavailable.

EDIT 15:48:00 UTC: We are still investigating the cause of the reboot. We opened a ticket on OVH services to know if they had any un-planned intervention for that machine.

EDIT 16:03:00 UTC: The machine is unreachable again. We are investigating.

EDIT 16:11:00 UTC: The machine is up again. We are starting to suspect a hardware issue.

EDIT 16:30:00 UTC: We will drop all services from the machine to avoid any other issues until we know more about the underlying issue. FSBuckets server will be moved out around 19:00 UTC.

EDIT 19:59:00 UTC: Unfortunately, FSBuckets are going to require more time to move to another server. So far the server is working fine but OVH suspects an issue with the power supply.

EDIT 23:58:00 UTC: The FSBuckets migration is starting. FSBuckets will be set into read-only and applications will be redeployed to use the new server.

EDIT 2022-07-09 00:28:00 UTC: Buckets are fully migrated. The server is now empty and will be investigated further by OVH. This incident is now over.

[PAR] Network maintenance 3 years ago

Fixed · Global

A network maintenance has been scheduled by our network provider for Wednesday 06/07/22 at 22:30 UTC. The maintenance should not have any visible impacts other than a few seconds of network delay while the network links switch to the backup links.

EDIT 22:30 UTC. The maintenance is starting.

EDIT 22:55 UTC: Maintenance is over, no visible impact happened, links failed over in less than 100ms each time.

Ingestion queue issue 3 years ago

Fixed · Access Logs · Global

One of the server queue storage reach its disk max storage capacity

One of the partition is corrupted, fixing

EDIT 17:10 UTC: The underlying issue has been fixed. The queue is currently being processed. Some events might have been lost during the cluster rebalance. Data points will take a few more hours to be up-to-date in the various dashboards.

EDIT: Queue is in sync

Main API is currently experiencing timeouts 3 years ago

Fixed · API · Global

We are currently looking into it. Console and CLI are not working correctly.

A batch was sent by an employee. The throttle interval was set two small and the batch made a huge amount of queries to the database, making it unresponsive. We stopped the batch and will restart it with a higher throttle interval.

Removal of TLS 1.0 and 1.1 from our load balancers 3 years ago

Fixed · Global

When you access a website or an online application, you most often do so in a “secure” way. This is for example the well-known green padlock that symbolizes HTTPS connections in your browser, which has become a standard these years thanks to initiatives like Let’s Encrypt.

This means that the data transferred to the server is encrypted, and that even if they are intercepted, they cannot be read by a third party. This protection has been provided by the TLS (Transport Layer Security) protocol for almost 20 years, whether it’s a personal site, an online shop or an access to your bank’s services.

Over time, this critical technical brick on the Internet has evolved to strengthen the level of security it offers. In August 2018, its version 1.3 (the latest) was released. Meanwhile, versions 1.0 and 1.1 were considered to no longer offer a sufficient level of protection. They have been deprecated by the IETF (Internet Engineering Task Force) since March 2021 and have therefore been gradually removed from recent browsers such as Firefox, Chrome and its derivatives or Safari.

At Clever Cloud, we have seen our customers adopt TLS 1.2 and 1.3 gradually. On our load balancers, based on our in-house and open source reverse proxy Sōzu, the latest version accounts for over 90% of the requests processed each day. TLS 1.2 for just under 9%. TLS 1.0 and 1.1 for only a few tens of thousands of requests per day, less than 0.1% of our traffic.

While we have maintained these versions for compatibility reasons, this will no longer be the case as of June 30. We will of course inform the customers affected by this choice, and encourage them to switch to more recent versions, which will have advantages for them in terms of security, performance and SEO.

Several reminders will be sent between now and the final shutdown of TLS 1.0 and 1.1. If you have any questions on this subject, please contact our support team through the Console.

EDIT 2:00 PM UTC: every public load balancers has been updated with new configuration

June 2022

Network issue and partial reverse proxies outage on RBX 3 years ago

Fixed · Reverse Proxies · Global

17:12 UTC, there was a unreported network issue. It caused two of our reverse proxies to fail. 17:13 UTC, two alerts get sent through the on-call system. The on-call person ACK both of them, handles the first one and mistake the second one for a redundant alert of the first one. 18:30 UTC, some customers complain about issues between APIs. We start investigating. 19:45 UTC, the culprit is found: a reverse proxy was down. It is restarted and everything goes back to normal. 19:50 UTC, we find the unattended alert and understand the mistake that was made. (reading the two alerts as one issue.)

Storage issue on Warp10 3 years ago

Fixed · Global

A component stop consuming this queue

Metrics / Access logs: ingestion and query issues 3 years ago

Fixed · Access Logs · Global

We are experiencing issues network connectivity issues

EDIT 14:37 UTC: Network connectivity has been resolved. Database is starting.

Metrics / Access logs: query issues 3 years ago

Fixed · Access Logs · Global

One of our indexes is reloading which can lead to performance issues on queries.

Ticket Center is unavailable 3 years ago

Fixed · Global

Due to a massive Cloudflare outage (https://www.cloudflarestatus.com) the support is not available in the ticket center. You can still contact the support via email to support@clever-cloud.com

Edit 07:13 UTC : the ticket center is back online.

Metrics / Access logs: query issues 3 years ago

Fixed · Access Logs · Global

One of our indexes is reloading which can lead to performance issues on queries.

EDIT 13:02 UTC: The index has reloaded

[PAR] An hypervisor has been lost 3 years ago

Fixed · Infrastructure · Global

An hypervisor has been lost on the Paris zone. We are investigating.

EDIT 06:04 UTC: The server experienced a hardware failure. It may not be able to come back. Applications on it were redeployed elsewhere. Custom services and add-ons are currently impacted.

EDIT 06:23 UTC: A public reverse proxy serving requests for domain.par.clever-cloud.com (185.42.117.109) was on this hypervisor. This IP was moved to another server. Between 05:23 and 05:35, it was unreachable.

EDIT 06:52 UTC: ETA for server to come back is 08:00

EDIT 07:46 UTC: Hardware has been changed, server will be rebooted.

EDIT 07:57 UTC: Server is back online, we are making sure all services are up.

EDIT 09:10 UTC: Everything is now back to normal, the incident is over. We will investigate further on the reason of the hardware failure.