Clever Cloud Status

Incidents

Full history of incidents.


July 2023

Fixed · cleverapps.io domains · Global

Some apps are not available

21h11: Only apps with redirect_https enabled are impacted

21h56: We rolled back to the old cleverapps load balancers

Fixed · Infrastructure · Global

20h15: We have lost our hypervisors in the SYD region

20h30: Our infrastructure provider in SYD lost its connectivity

21h00: Hypervisors are back online

Fixed · Reverse Proxies · Global

We are currently experiencing issues with the reverse proxies in the JED region. We are investigating.

EDIT 16:44 UTC: The root cause has been identified and a fix has been applied. We are monitoring the results.

EDIT 16:50 UTC: The service is now operational.

Fixed · FS Buckets · Global

An FSBucket server is currently being investigated for connection timeouts when mounting buckets. The problem has been partially identified and a first fix has been applied. Additional steps will be taken shortly to make sure everything is working as intended.

EDIT 10:49 UTC: The underlying issue has been fixed. Some applications may have had trouble mounting FSBuckets, or reading and writing files stored on that server, between 08:50 UTC and 10:25 UTC. Impacted applications are currently being redeployed out of caution (most of them successfully reconnected to the server after the fix was issued).

June 2023

Fixed · Access Logs · Global

We are detecting errors on the storage layer responsible for storing metrics and access logs data. Queries are unavailable.

Edit 14:10 UTC: Queries have been re-opened.

We continue to investigate.

Fixed · Infrastructure · Global

The monitoring system has detected that a hypervisor is unreachable. We are investigating.

EDIT 16:27 UTC: The hypervisor took some time to reboot but it is now up and running. We are making sure services are working fine following this incident.

EDIT 17:10 UTC: The incident is now over. The underlying problem has been identified; the hypervisor has been placed in the upgrade queue.

Fixed · Global

A hypervisor in the Paris region needs to be rebooted due to a kernel issue. The reboot will take place tonight (June 21, 2023) at 18:00 UTC. Services on that hypervisor have already been migrated, apart from a few of them. Impacted customers will shortly receive an email with more details.

EDIT 18:14 UTC: The maintenance is starting

EDIT 22:00 UTC: The maintenance is now over

Fixed · Global

A hypervisor in the Paris region needs to be rebooted due to a kernel issue. The reboot will take place tonight (June 21, 2023) at 20:00 UTC. Services on that hypervisor will be migrated starting at 18:00 UTC. Impacted users will shortly receive an email with more details.

EDIT 18:13 UTC: The maintenance is starting

EDIT 23:11 UTC: The maintenance is now over

Fixed · Deployments · Global

A deployment issue has been identified; we are working on a fix.

EDIT 20:43 UTC: Fixed.

Fixed · Access Logs · Global

We are detecting some errors on our storage layer responsible for storing metrics and access logs data. We are investigating.

Edit 04:58 PM UTC: A storage node had a hardware issue; it has been rebooted.

Fixed · Access Logs · Global

We will start a maintenance this Tuesday designed to improve performance on our storage layer for metrics and access logs. During the maintenance, you may not see the latest datapoints and access logs.

Maintenance will start on June 20 at 03:30 PM UTC.

Edit 03:45 PM UTC: maintenance is starting.

Edit 04:58 PM UTC: maintenance is over.

Fixed · API · Global

The MySQL add-on API started to time out when creating add-ons. Existing add-ons still work, though.

We are investigating the issue.

EDIT 09:00 PM UTC: the root cause has been corrected.

Fixed · Access Logs · Global

We will start a maintenance this Sunday designed to improve performance on our storage layer for metrics and access logs. During the maintenance, you may not see the latest datapoints and access logs.

Maintenance will start on June 18 at 02:30 PM UTC.

EDIT 02:36 PM UTC: maintenance is starting.

Edit 08:21 PM UTC: Maintenance is still ongoing; the storage layer is a few minutes behind on average.

EDIT 08:51 PM UTC: Maintenance is over; we are catching up on the ingestion lag.

EDIT 08:00 PM UTC: An error while catching up on the lag has put the storage layer into an inconsistent state. Queries are disabled for now.

EDIT 11:00 PM UTC: The storage layer is still inconsistent.

EDIT 00:47 PM UTC D+1: The storage layer is (finally?) consistent. We are catching up on the lag.

EDIT 04:30 PM UTC D+1: We have caught up on the lag.

EDIT 07:29 AM UTC D+1: The storage layer developed inconsistencies again. We are investigating why.

EDIT 08:10 AM UTC D+1: The storage layer is up and running. We are consuming the lag. Queries are disabled during this phase.

EDIT 08:45 AM UTC D+1: We have consumed the lag. Queries are available.

Fixed · Infrastructure · Global

The monitoring system is having difficulty reaching some services. We are investigating...

EDIT 00:50 UTC: Monitoring no longer sees any network issues.

EDIT 01:00 UTC: Monitoring has detected connectivity issues; we are working on a fix.

EDIT 01:30 UTC: Monitoring has detected new connectivity issues; we are on it.

Fixed · Infrastructure · Global

We are impacted by an incident at our infrastructure provider; you can get more details on their incident page: https://network.status-ovhcloud.com/incidents/9vzvvwrm69ps

Fixed · SSH Gateway · Global

SSH connections may fail with the message 'Error: This application has no instances you can ssh to' or may ask you for a password during connection initialization. We are currently investigating this issue.

08:10 UTC: We have found the component causing this issue and restarted it. We are still investigating the root cause.

21/06: The problem was most likely caused by the network instability observed at the time. We haven't detected any problems since.

Fixed · Infrastructure · Global

One hypervisor only responds to ping. It no longer accepts new VMs and does not delete VMs that should be deleted.

19:57 UTC: We are going to reboot it. Some databases (that run on this hypervisor) will become unresponsive for a few minutes.

20:18 UTC: Hypervisor has been rebooted. All services hosted on it have been checked: everything is up and running.

Logs show a kernel panic.

Fixed · Services Logs · Global

The live logs storage layer has fallen into read-only mode. We are investigating the issue.

EDIT 09:30 UTC: Following the incident https://www.clevercloudstatus.com/incident/669, the storage layer did not perform scheduled tasks.

EDIT 09:45 UTC: The storage layer is accepting writes. The logging system is operating normally.

Fixed · Infrastructure · Global

We are investigating a network connectivity issue towards our Paris region.

EDIT 00:27 UTC: The issue has been identified and fixed around 00:11 UTC. We are continuing to assess the impact on customer and internal services.

EDIT 01:00 UTC: We have identified the services impacted by the incident and have started recovering from the network issue. Metrics and access logs are taking time to recover; other services should be working normally.

EDIT 02:30 UTC: Metrics and access logs are recovering from the network issue.

EDIT 04:00 UTC: Metrics and access logs are still recovering from the network issue. To follow this incident, see https://www.clevercloudstatus.com/incident/669

Fixed · Access Logs · Global

Following the incident https://www.clevercloudstatus.com/incident/669, we are recovering from the network connectivity issue.

EDIT 06:05 UTC: The storage layer is now up and healthy. We are now consuming the ingestion lag, it should take a few hours to fully resolve. Queries are now available but will show outdated data. We will update this status accordingly.

EDIT 10:00 UTC: Ingestion has been slower than initially anticipated, so queries are still returning out-of-date data. We have made some adjustments and saw an increase in ingestion over the last hour. We will still need a few hours to fully consume the lag.

EDIT 15:00 UTC: The lag has been consumed, the metrics and access logs stack is operating normally.