Clever Cloud Status

Incidents

Full history of incidents.


February 2024

Fixed · Global

Maintenance Window: 2024-03-07T13:00:00Z - 2024-03-07T17:00:00Z (UTC)

Scope:

  • Database Load Balancer (software and hardware upgrade)
  • Application Load Balancer (software and hardware upgrade)

Expected Impact:

  • Brief disconnections or connection drops during the upgrade process.
  • Potential minor performance fluctuations.

Additional Information:

  • The software upgrade has already been rolled out on cleverapps.io (https://www.clevercloudstatus.com/incident/803) and on the Paris load balancers (https://www.clevercloudstatus.com/incident/807 and https://www.clevercloudstatus.com/incident/805).
  • Please report any issues with a method for reproducing the problem (e.g., a curl command for application load balancer issues and/or psql / redis / mysql queries for database load balancer issues).
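As a sketch, a reproduction report for each load balancer type could look like the following (the hostnames and credentials are placeholders, not real endpoints):

```shell
# Hypothetical endpoints -- replace with your own application / add-on hosts.
APP_HOST="example-app.cleverapps.io"
DB_URI="postgresql://user:password@example-db.services.clever-cloud.com:5432/dbname"

# Application load balancer: capture the HTTP status and timing details
# so support can compare them against the load balancer logs.
curl -sS -o /dev/null \
  -w 'status=%{http_code} connect=%{time_connect}s total=%{time_total}s\n' \
  "https://$APP_HOST/"

# Database load balancer: a trivial query that exercises the full
# connection path through the load balancer.
psql "$DB_URI" -c 'SELECT 1;'
```

If a request fails, including the full `curl -v` output in the ticket also makes the TLS handshake and response headers visible to support.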

EDIT: The maintenance is postponed to next week, on 2024-03-07.

EDIT 17:30 UTC: We have finished deploying the new load balancers alongside the current ones. We will now switch traffic from the old instances to the new ones.

EDIT 17:32 UTC: We have switched traffic from the old to the new database load balancer. An unexpected behavior occurred and has since been fixed; some connections may have been refused during that time.

EDIT 17:35 UTC: We will begin switching the application load balancer shortly.

EDIT 17:50 UTC: We have switched the first application load balancer instance.

EDIT 18:00 UTC: We have finished rolling out the application load balancers. We are monitoring.

Fixed · Global

Maintenance Window: 2024-02-23T13:00:00Z - 2024-02-23T17:00:00Z (UTC)

Scope:

  • Database Load Balancer (software and hardware upgrade)
  • Application Load Balancer (software and hardware upgrade)

Expected Impact:

  • Brief disconnections or connection drops during the upgrade process.
  • Potential minor performance fluctuations.

Additional Information:

  • The software upgrade has already been rolled out on cleverapps.io (https://www.clevercloudstatus.com/incident/803) and on the Paris load balancers (https://www.clevercloudstatus.com/incident/807 and https://www.clevercloudstatus.com/incident/805).
  • Please report any issues with a method for reproducing the problem (e.g., a curl command for application load balancer issues and/or psql / redis / mysql queries for database load balancer issues).

EDIT: We have moved the maintenance to 2024-02-23 instead of 2024-02-22.

EDIT 15:00 UTC: We are beginning the maintenance, deploying the new installation alongside the current one. We will perform the failover next week.

EDIT 2024-02-26 16:30 UTC: We have added two new IP addresses to the domain.mtl.clever-cloud.com DNS records.

EDIT 2024-02-26 17:00 UTC: We have removed the two old IP addresses from the domain.mtl.clever-cloud.com DNS records.

EDIT 2024-02-26 17:15 UTC: We will update the DNS records for the database load balancers.

EDIT 2024-02-26 17:30 UTC: We have updated the DNS records for the database load balancers; we are monitoring.
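Since this failover is performed through DNS, the switch can be verified from the client side. For example, using `dig` (available in most distributions' dnsutils / bind-tools packages), with the record name taken from the updates above:

```shell
# Show the A records the domain currently resolves to.
dig +short A domain.mtl.clever-cloud.com

# Show the full answer section, including the remaining TTL, to estimate
# how long resolver caches may keep serving the old IP addresses.
dig +noall +answer A domain.mtl.clever-cloud.com
```

Note that clients holding long-lived connections to the old addresses will keep using them until they reconnect, regardless of what DNS now returns.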

Fixed · Global

Maintenance Window: 2024-02-20T13:00:00Z - 2024-02-20T17:00:00Z (UTC)

Scope:

  • Database Load Balancer (software and hardware upgrade)
  • Application Load Balancer (software and hardware upgrade)

Expected Impact:

  • Brief disconnections or connection drops during the upgrade process.
  • Potential minor performance fluctuations.

Additional Information:

  • The software upgrade has already been rolled out on cleverapps.io (https://www.clevercloudstatus.com/incident/803) and on the Paris load balancers (https://www.clevercloudstatus.com/incident/807 and https://www.clevercloudstatus.com/incident/805).
  • Please report any issues with a method for reproducing the problem (e.g., a curl command for application load balancer issues and/or psql / redis / mysql queries for database load balancer issues).

EDIT 13:30 UTC: We are preparing the hardware and software upgrade alongside the current stack.

Fixed · Global

Maintenance Window: 2024-02-15T13:00:00Z - 2024-02-15T17:00:00Z (UTC)

Scope:

  • Database Load Balancer (software and hardware upgrade)
  • Application Load Balancer (software and hardware upgrade)

Expected Impact:

  • Brief disconnections or connection drops during the upgrade process.
  • Potential minor performance fluctuations.

Additional Information:

  • The software upgrade has already been rolled out on cleverapps.io (https://www.clevercloudstatus.com/incident/803) and on the Paris load balancers (https://www.clevercloudstatus.com/incident/807 and https://www.clevercloudstatus.com/incident/805).
  • Please report any issues with a method for reproducing the problem (e.g., a curl command for application load balancer issues and/or psql / redis / mysql queries for database load balancer issues).

EDIT 13:30 UTC: We are starting preparations for the upgrades.

EDIT 13:50 UTC: Preparations are complete; we are beginning the rollout of the application load balancers.

EDIT 14:20 UTC: We have finished rolling out the application load balancers.

EDIT 14:30 UTC: We are starting the rollout of the database load balancers.

EDIT 14:35 UTC: We have finished rolling out the database load balancers.

Fixed · Global

Maintenance Window: 2024-02-13T13:00:00Z - 2024-02-13T17:00:00Z (UTC)

Scope:

  • Database Load Balancer (software and hardware upgrade)
  • Application Load Balancer (software and hardware upgrade)

Expected Impact:

  • Brief disconnections or connection drops during the upgrade process.
  • Potential minor performance fluctuations.

Additional Information:

  • The software upgrade has already been rolled out on cleverapps.io (https://www.clevercloudstatus.com/incident/803) and on the Paris load balancers (https://www.clevercloudstatus.com/incident/807 and https://www.clevercloudstatus.com/incident/805).
  • Please report any issues with a method for reproducing the problem (e.g., a curl command for application load balancer issues and/or psql / redis / mysql queries for database load balancer issues).

EDIT 13:15 UTC: We are beginning the hardware upgrade alongside the current hardware.

EDIT 14:20 UTC: We have finished the hardware upgrade; we will start the rollout with the application load balancers.

EDIT 14:35 UTC: We have rolled the first application load balancer; we are beginning the second one.

EDIT 15:00 UTC: We have finished rolling out the application load balancers; we are beginning the database load balancers.

EDIT 15:15 UTC: We have rolled the first database load balancer; we are monitoring.

EDIT 15:25 UTC: We have rolled the second database load balancer.

EDIT 15:25 UTC: All load balancers have been rolled out. We are keeping an eye on them, but the maintenance is over.

Fixed · Global

We are planning an upgrade of our Heptapod Cloud offer heptapod.host on Wednesday 2024-02-07 at 14:00 UTC. Heptapod will be updated to version 1.0.

Expected downtime is 30 minutes. During that time, Git and Mercurial operations might fail, and the UI might not load.

EDIT 16:45 UTC: The update is over.

Fixed · Reverse Proxies · Global

We will proceed with a software upgrade of the database load balancer, which should be transparent. You may observe a few connection drops during the operation. If you encounter an issue during this maintenance, please contact support with a way to reproduce it (a curl command or a psql / mysql / redis example would be great).

EDIT 10:15 UTC: We have started the maintenance.

EDIT 11:15 UTC: We have finished the maintenance.

Fixed · Infrastructure · Global

After receiving many alerts, it appears that a hypervisor is down. We are working on getting it back up.

EDIT 21:30 UTC: The hypervisor is now responding after a hard reboot. We are currently verifying that every virtual machine is in a healthy state, and investigating the root cause of the crash.

EDIT 22:00 UTC: Every VM on the hypervisor is running as expected. The root cause was a kernel panic; the kernel has been moved to a more stable version.

Fixed · Reverse Proxies · Global

We will proceed with a software upgrade of the load balancers, which should be transparent. You may observe a few connection drops during the operation. If you encounter an issue during this maintenance, please contact support with a way to reproduce it (a curl command would be great). This software upgrade has been running successfully on cleverapps.io for a week (https://www.clevercloudstatus.com/incident/803).

EDIT 15:00 UTC: The software upgrade is still in progress.

EDIT 15:30 UTC: The first server that hosts a load balancer instance has been updated.

EDIT 15:45 UTC: We are proceeding with the other load balancers.

EDIT 16:30 UTC: We have updated 2/3 of the load balancers.

EDIT 17:00 UTC: We have updated all load balancers.

January 2024

Fixed · cleverapps.io domains · Global

We will proceed with a software upgrade of the load balancer, which should be transparent. You may observe a few connection drops during the operation. If you encounter an issue during this maintenance, please contact support with a way to reproduce it (a curl command would be great).

EDIT 10:30 UTC: We have begun the maintenance procedure for one of the two instances.

EDIT 11:10 UTC: We have finished the upgrade; we will restart the instance this afternoon around 14:00 UTC.

EDIT 15:00 UTC: We have restarted one of the two load balancer instances; we are watching the metrics to compare the two versions.

EDIT 9:30 UTC D+1: Since yesterday, the telemetry we have observed shows improvements; we will begin the update of the second instance.

EDIT 11:00 UTC D+1: The update completed without issues.

Fixed · Global

An update of our Heptapod Cloud service will be performed today at 15:00 UTC+1 to apply the latest GitLab security patches related to https://about.gitlab.com/releases/2024/01/25/critical-security-release-gitlab-16-8-1-released/. Expected downtime should be less than 1 minute.

EDIT 15:34 UTC+1: Patches were applied and services were restarted. The maintenance is now over.
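Assuming heptapod.host exposes the standard GitLab REST API (Heptapod is GitLab-based), the running version can be checked after such a patch release; the token below is a placeholder, not a real credential:

```shell
# GitLab's /version endpoint requires authentication; replace the
# placeholder with a personal access token that has read access.
curl -sS -H "PRIVATE-TOKEN: <your-access-token>" \
  "https://heptapod.host/api/v4/version"
# The response is JSON of the form {"version":"...","revision":"..."}
```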

Fixed · Metrics · Global

We have enabled a new parameter designed to improve the reliability of the cluster. Some queries may not work. We are monitoring it.

Fixed · Metrics · Global

We are currently observing request timeouts on the Metrics cluster. The issue has been identified and we are working towards a resolution. No data loss is expected. Various graphs (Grafana, console, ...) might not load or render properly, with various errors.

Edit Tue Jan 23 17:59:56 2024 UTC: A faulty configuration was applied to a node to investigate a memory leak. It backfired on the whole cluster, making it unhealthy. The configuration has been rolled back. The storage layer is currently in healing mode. To speed up the recovery, queries have been disabled.

Edit Tue Jan 23 19:51:21 2024 UTC: The cluster is now healthy and recovering from the lag, which should take a few hours. Queries will be re-enabled once the lag is absorbed.

Edit Wed Jan 24 00:04:59 2024 UTC: The data lag is now resolved. We are still reloading the metrics metadata, so queries are still unavailable. They should be back up in a few hours.

Edit Wed Jan 24 01:54:22 2024 UTC: The metadata lag is now resolved; queries are back online.

Fixed · Global

We are encountering problems with the delivery of accesslogs. We are investigating.

EDIT Thu Jan 25 11:00:00 2024 UTC: The platform is now OK; we are ingesting the lag.

EDIT Thu Jan 25 16:54:00 2024 UTC: The lag has been ingested. Some applications may still have unreachable access logs.

Fixed · Reverse Proxies · Global

We are detecting a higher number of errors than usual on the load balancers serving the Scaleway zone. We are investigating.

Fixed · Access Logs · Global

We are detecting a high number of errors on our storage layer. As a result, the cluster is rate-limiting queries. You may experience trouble retrieving datapoints. We are monitoring.

Update Tue Jan 16 17:11:02 2024 UTC: The cluster is no longer applying the rate limit.

Fixed · PostgreSQL · Global

We are having trouble deploying new PostgreSQL add-ons in version 10. This version is temporarily disabled for migrations and new add-on orders.

Fixed · Global

We are planning to perform various updates in one of our datacenters in the Paris region starting at 10:35 UTC. They will last for a few hours. No issue is expected during this maintenance.

We will update this status accordingly.

EDIT 2024-01-10 20:00 UTC: Maintenance is over, no impact during the operations.

Fixed · Deployments · Global

We are seeing an elevated rate of failed deployments. We are investigating the issue.

EDIT 15:58 UTC: The issue has been identified and deployments should be back to normal since 15:40 UTC.

Fixed · Access Logs · Global

We are seeing an elevated error rate for metrics read queries due to the underlying storage system. The problem has been identified and we are working towards its resolution. This can impact some of the Grafana dashboards or API queries. Write performance is not impacted.

Update Thu Jan 04 14:48:00 2024 UTC: We have triggered some data balancing. Some queries may take longer than expected. This can impact some of the Grafana dashboards or API queries. Write performance may be impacted.

Update Thu Jan 04 20:44:01 2024 UTC: The data balancing is more aggressive than expected, overloading some components. Queries may be unavailable during that time.

Update Fri Jan 05 02:26:05 2024 UTC: Some components are still overloaded. We are currently catching up on the lag, but queries are disabled for now.

Update Fri Jan 05 08:01:45 2024 UTC: Our write path is still overloaded. We are searching for the bottleneck.

Update Fri Jan 05 16:03:48 2024 UTC: A cleanup subroutine has been triggered to rebalance and remove slack space from our internal B-tree storage. Queries are still disabled to speed up the process.

Update Sat Jan 06 11:25:28 2024 UTC: The lag has been absorbed. Queries are back up; the cleanup subroutine is still in progress. You may notice latency spikes when querying.

Update Mon Jan 08 14:36:57 2024 UTC: The cleanup subroutine is still in progress, and some workloads overloaded some components. Queries are disabled to speed up recovery.

Update Mon Jan 08 16:36:18 2024 UTC: Queries are re-enabled.

Update Tue Jan 09 14:38:34 2024 UTC: Some StorageServers are lagging behind, meaning that a very small portion of the data is not available for queries. We are currently catching up on the lag.

Update Tue Jan 16 14:56:55 2024 UTC: Closing the ticket.