Clever Cloud Status

Incidents

Full history of incidents.

November 2022

Fixed · API · Global

We are currently experiencing increased response times on our main API (api.clever-cloud.com). Services relying on it are impacted and might take extra time to answer requests.

We are investigating the issue.

EDIT 14:26 UTC: The underlying issue has been identified and fixed. Services, including the Console and CLI, should now be loading as usual. Sorry for the inconvenience.

Fixed · Access Logs · Global

We are currently experiencing ingestion issues on the metrics and access logs services. Data for the last 20 minutes is missing. We are investigating.

EDIT 11:23 UTC: Ingestion lag is now resolved; metrics and access logs should now be up to date.

October 2022

Fixed · Infrastructure · Global

Our monitoring has raised network issues; we are investigating.

Status:

  • Ping does not go through between PAR (on BSO Network) and SGP/SYD (OVH Network)
  • Ping does go through between PAR and other OVH zones (RBX, MTL, WSW…)
  • Ping does go through between RBX and SGP/SYD (OVH to OVH)
  • Applications on both SGP and SYD are still UP and reachable from other networks. Deployments on these zones are still unavailable.

UPDATE 20:13 UTC: Network is partially coming back up, but we still see 80% to 90% packet loss.

UPDATE 21:50 UTC: Still heavy loss (90%) on the PAR -> SGP/SYD route, much less (30%) on the SGP/SYD -> PAR route.

UPDATE 2022-11-01 08:12 UTC: >90% loss on the PAR -> SGP/SYD route.

UPDATE 2022-11-01 18:12 UTC: Network seems fine.

Fixed · Reverse Proxies · Global

PAR reverse proxies were unavailable for a short period of time.

Fixed · Reverse Proxies · Global

Network issues at our provider OVH can lead to desynchronized reverse proxy configurations.

Fixed · Access Logs · Global

We are experiencing performance issues on our metrics/access logs infrastructure. We are on it.

Update 10:36 UTC: Performance has been fixed.

Fixed · Infrastructure · Global

Some deployments may be abnormally delayed. Applications may experience slowness.

EDIT 11:55 UTC: We have found the root cause and mitigated the issue. We are deploying the solution.

Fixed · Access Logs · Global

The data store behind metrics and access logs has lost a node. Some lag may be observed when querying metrics and access logs.

Fixed · Infrastructure · Global

At 11:26 UTC, the monitoring started alerting about unreachable proxies on RBX, RBX-HDS, MTL2, WSW. All these zones are hosted on the OVH network.

We are investigating and watching the situation.

At 11:53 UTC, the monitoring sees everything up again. We are performing a few checks on some services.

Fixed · Deployments · Global

We are observing slow deployment times and investigating why.

EDIT 18:10 UTC: The issue has been identified and actions to solve it have been performed.

Fixed · Deployments · Global

Due to the Pulsar incident, some deployments may fail from time to time.

Some hypervisors are behaving strangely. We are watching and fixing them.

EDIT 10:20:00 UTC: Deployments are currently unavailable while we work around the issue.

EDIT 11:31:00 UTC: Deployment issues are fixed. We continue to monitor the situation. If you have trouble redeploying an application, please contact our support.

POSTMORTEM: The Pulsar outage that started around 04:30 UTC (see https://www.clevercloudstatus.com/incident/574) got in the way of:

  • the deployment process, breaking some notifications at 09:30 UTC.
  • the uptime of some persistent VMs (like databases) (See https://www.clevercloudstatus.com/incident/576), making the monitoring trigger deployments.

The Pulsar notification system is being gradually deployed on our infrastructure, having passed the tests on our preproduction zone. We do have a fallback method for notifications. However, the issue was unusual enough that the Pulsar notifications were not failing cleanly; instead, they timed out after a long time, preventing the fallback from triggering. We stopped all deployments at 10:20 UTC. We worked on quickly adding an emergency flag to prevent the hypervisors from using Pulsar for notifications. This way, we can bypass it and go straight to the fallback method.
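
For illustration only, here is a minimal sketch of the pattern involved (the function names are hypothetical and this is not our production code): the primary send is bounded by a short timeout so that a hang, and not only a clean failure, triggers the fallback, and an emergency flag can bypass the primary channel entirely.

```python
import concurrent.futures

# Small worker pool reused across calls (hypothetical sketch, not production code).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def notify(event, send_via_pulsar, send_via_fallback, timeout_s=5.0, bypass_pulsar=False):
    """Send a notification, falling back when the primary channel hangs or fails."""
    if bypass_pulsar:
        # Emergency flag: skip the primary channel and go straight to the fallback.
        return send_via_fallback(event)
    future = _pool.submit(send_via_pulsar, event)
    try:
        # Bound the primary call: a long hang counts as a failure.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; the worker thread may still be stuck
        return send_via_fallback(event)
    except Exception:
        # A clean failure from the primary also triggers the fallback.
        return send_via_fallback(event)
```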

To avoid this issue, we are working on the following:

  • monitor the Pulsar logs so we can react before an issue impacts the rest of the production.
  • try to mitigate the long timeout issue on the notification actors, allowing for a quicker fallback.

Fixed · Pulsar · Global

The Pulsar cluster hosting the Pulsar add-ons is experiencing issues. We are investigating.

POSTMORTEM (all times are UTC):

Around 04:30: Timeouts in inter-node connections started to show up in the logs. They did not lead to alerts in the monitoring.

Around 05:00: We started getting issues in our infrastructure from software using that cluster.

11:30: We disabled the brokers to analyze the issue.

14:42: The incident is now resolved. If you still encounter any problems, please contact our support.

Fixed · Infrastructure · Global

At 04:30 UTC: a Pulsar cluster started to behave strangely (see https://www.clevercloudstatus.com/incident/574).

At 05:30 UTC: on PAR, notification services on the hypervisors tried to send messages in a loop, filling the system with stuck processes.

At 07:00 UTC: the OS of these hypervisors started to kill processes to make room. This impacted some applications and databases. We started working on shutting down the stuck processes and restarting the broken instances.

At 10:00 UTC: we finished restarting all the broken instances.

Fixed · Reverse Proxies · Global

Between 10:25 UTC and 20:25 UTC, some applications hosted on the OVH RBXHDS zone may have experienced random 503 response errors due to faulty reverse proxies. The issue has been found and is now resolved.

Additional investigations will be conducted to understand why our monitoring system did not report the issue earlier. Apologies for the inconvenience.

Event API issue
Fixed · API · Global

Events are not being acknowledged (ACKed).

Fixed · API · Global

The add-on API's database cluster disk is nearly full. We are migrating it to a bigger disk.

The operation will take 10 minutes, during which the add-on API will be unreachable.

Fixed · Deployments · Global

(Times are UTC)

04:45 - Deployments are broken because of a Pulsar issue. We are investigating.

05:45 - To prevent issues on the infrastructure, we disabled all deployments.

05:55 - We detect that some VMs are DOWN. It seems that the Pulsar connection issues have overwhelmed the hypervisors' processes.

06:05 - We shut down the processes that were filling up the hypervisors. It seems to fix the issue.

06:20 - The deployments seem to be back on track. We continue investigating the Pulsar issue before putting it back into the deployment processes.

09:09 - We are still experiencing deployment issues. We are investigating.

12:28 - Deployments have been fixed.

Fixed · Reverse Proxies · Global

We are observing high latency on our reverse-proxies on PAR.

It looks like we are under a DDoS. We are monitoring it and blocking the IPs that are making the most requests.

EDIT 15:08 UTC: we have found the application that was receiving 50% of all platform traffic. We blocked all the IPs trying to reach that application. Traffic is now back to normal.
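
As a side note, identifying those noisy clients mostly boils down to counting requests per source IP in the access logs. Here is a rough, hypothetical sketch of that first step (the log format and the first-field-is-the-client-address assumption are illustrative, not our actual tooling):

```python
from collections import Counter
import sys

# Count requests per client IP, assuming the first whitespace-separated
# field of each access log line is the client address (hypothetical format).
counts = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1

# Print the 20 noisiest IPs, the candidates for blocking at the reverse proxies.
for ip, hits in counts.most_common(20):
    print(f"{hits:>8}  {ip}")
```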

September 2022

Fixed · Access Logs · Global

Our distributed database responsible for metrics and access logs storage is not ingesting fast enough. As a result, you may experience some lag during queries. We are investigating.

EDIT 16:06 UTC: Ingestion lag is now resolved.

APIs are slow
Fixed · API · Global

Our "main" API is very slow. We are investigating to find out why.